Michigan State University

This is to certify that the dissertation entitled

ALGORITHM IMPLEMENTATION AND DESIGN OF RECONFIGURABLE MIXED SYSTOLIC ARRAYS

presented by Anwar Khurshid has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

Major professor: P. David Fisher
Date: June 4, 1985

ALGORITHM IMPLEMENTATION AND DESIGN OF RECONFIGURABLE MIXED SYSTOLIC ARRAYS

By

Anwar Khurshid

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1985

ABSTRACT

ALGORITHM IMPLEMENTATION AND DESIGN OF RECONFIGURABLE MIXED SYSTOLIC ARRAYS

By Anwar Khurshid

One drawback of systolic architectures is their fixed-flow structure for data streams, which limits the type of algorithms or applications that can effectively be supported by such architectures. This thesis presents a methodology for algorithm implementation and design of a class of reconfigurable multiprocessor architectures called the mixed systolic array (MSA). In the MSA architecture,
switching cells are mixed with computation cells to achieve flexibility in the data-flow patterns. This architecture broadens the scope of systolic arrays by achieving reconfigurability, algorithmic flexibility and fault tolerance. The level sensitive scan design (LSSD) technique is employed to load and implement the distributed control structure required to establish a desired interconnection pattern on the MSA. The control structure for a particular configuration is loaded into the array as a binary vector in a bit-serial fashion. This approach enhances testability and incorporates fault tolerance in the MSA structures.

Efficient implementation of algorithms on VLSI structures requires exploitation of parallelism in the algorithm and mapping of the algorithm communication structure into the processor interconnection structure. This thesis presents a general mathematical model for formally representing reconfigurable MSA architectures and a step-by-step procedure for implementing a given algorithm into the MSA structure by generating the control code required to reconfigure the array. The mapping procedure is based on time and space transformations of the data dependence vectors of the algorithm. These transformations provide a description of the data-flow and timing, and dictate the interconnection structure required to implement the algorithm on the array. The procedure presented in this work will provide a useful tool in the design automation of reconfigurable MSAs.

To illustrate the methodology and explain the reconfiguration procedure, two sample algorithms, the finite impulse response (FIR) filtering algorithm and the priority queue algorithm, are mapped into a linear reconfigurable systolic array. A computer-aided design (CAD) facility is also presented for modeling and simulating mixed systolic arrays. This CAD facility serves as a high-level design tool which supports the design of MSA architectures, and provides the designer of MSAs with a facility to interactively develop an MSA structure for a given set of user-specified attributes and simulate the execution of algorithms on MSA processors at the register-transfer level.

To MOTHER AND FATHER

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisor, Dr. P.D. Fisher, for his continuous moral and professional support and guidance throughout my doctoral program. He was a rich source of inspiration, and I gratefully acknowledge the encouragement and motivation provided by him which made possible the successful conclusion of my Ph.D. work. I also would like to thank Dr. Michael Shanblatt for his valuable suggestions and evaluation of this research. I am very thankful to Dr. Lionel Ni for his professional advice and review of this work. The interest and encouragement provided by Dr. S.R. Crouch is gratefully acknowledged. Appreciation is extended to Dr. M.A.P. Jayasumana for his help during this research. I am thankful to Mr. B. MacArthur for his skillful drawings. The support provided by the National Science Foundation (Grant No. MCS 79-09216) for this research is gratefully acknowledged. I shall always be grateful to my parents for their unending support and encouragement through the years. Finally, I am especially thankful to my wife, Nadira, for her patience and understanding in awaiting the completion of this dissertation.

TABLE OF CONTENTS

1. INTRODUCTION
2. BACKGROUND
   2.1 Systolic Arrays
       2.1.1 Systolic Array Architectures and Algorithms
       2.1.2 Systolic and Semisystolic Design
   2.2 Reconfigurable Mixed Systolic Arrays
   2.3 Algorithm Transformation and Mapping into VLSI Structures
   2.4 Fault Tolerance in Systolic Arrays
   2.5 Level Sensitive Scan Design (LSSD) Techniques
3. A LINEAR RECONFIGURABLE SYSTOLIC ARRAY (LRSA) ARCHITECTURE
   3.1 Structure of the Linear Reconfigurable Systolic Array
       3.1.1 The Computation Cells Array
       3.1.2 Function-Select Array
       3.1.3 Data Flow Control Array
   3.2 Systolic Algorithms Implemented on the LRSA
       3.2.1 Systolic Filtering Array
       3.2.2 Systolic FIR Filtering Array
       3.2.3 Pattern Matching Systolic Array
       3.2.4 Discrete Fourier Transform (DFT) Systolic Array
   3.3 Summary and Discussion
4. DESIGN OF A MULTIPURPOSE RECONFIGURABLE ARRAY PROCESSOR (MRAP)
   4.1 Structure of the Multipurpose Reconfigurable Array Processor
       4.1.1 The Computation Cell
       4.1.2 The Switching Cell
       4.1.3 Data-Flow and Timing
       4.1.4 Programming Data-Flow Control
   4.2 Implementation of Algorithms on the MRAP
       4.2.1 Dense-Matrix/Dense-Matrix Multiplication
       4.2.2 Band-Matrix/Dense-Matrix Multiplication
       4.2.3 Band-Matrix/Band-Matrix Multiplication
       4.2.4 Recursive Filtering
   4.3 Performance Analysis and Comparison
       4.3.1 Dense-Matrix/Dense-Matrix Multiplication
       4.3.2 Band-Matrix/Dense-Matrix Multiplication
       4.3.3 Band-Matrix/Band-Matrix Multiplication
   4.4 Fault Tolerance
   4.5 Summary and Discussion
5. ALGORITHM IMPLEMENTATION ON MIXED SYSTOLIC ARRAYS
   5.1 A Model for Reconfigurable Mixed Systolic Arrays
   5.2 A Model for Algorithms
   5.3 Mapping Algorithms into Mixed Systolic Arrays
       5.3.1 Procedure for Mapping Algorithms into Mixed Systolic Arrays
   5.4 Examples
       5.4.1 Finite Impulse Response (FIR) Filtering Algorithm
       5.4.2 Priority Queue
   5.5 Summary and Discussion
6. SUMMARY AND CONCLUSIONS
APPENDIX A: A COMPUTER-AIDED DESIGN FACILITY FOR MIXED SYSTOLIC ARRAYS
   A.1 Description of the System
   A.2 Discussion
BIBLIOGRAPHY

LIST OF FIGURES

2.1  Linearly connected systolic array for computing the product of band-matrix A and vector X.
2.2  Two geometries for the inner-product step processor.
2.3  Hex-connected systolic array for computing the product matrix C, of matrices A and B [44].
2.4  Some common interconnection schemes for systolic arrays: (a) two-dimensional systolic arrays, (b) degenerate two-dimensional systolic array, (c) one-dimensional systolic array.
2.5  A semisystolic system can implement broadcasting by Mealy machines [41].
2.6  A mixed systolic array constructed from a diamond-like basis.
2.7  Block diagram representation of a shift register latch.
2.8  Interconnections of SRLs on an integrated circuit [57].
2.9  General structure of an LSSD subsystem with two system clocks.
3.1  General system diagram of the Linear Reconfigurable Systolic Array.
3.2  Block diagram representation of a Computation Cell.
3.3  SRLs are used in the function-select array.
3.4  Four interconnection states of an interchange box.
3.5  Logic diagram of a two-function interchange box.
3.6  Structure of the Linear Reconfigurable Systolic Array.
3.7  One-dimensional systolic array for filtering: (a) basic cells used in the array, (b) the systolic filtering array.
3.8  One-dimensional systolic array for FIR filtering: (a) basic cells used in the array, (b) the systolic FIR filtering array.
3.9  One-dimensional systolic array for pattern matching: (a) basic cells used in the array, (b) systolic pattern matching array.
3.10 Linear systolic array for DFT algorithm: (a) basic cells used in the array, (b) the systolic DFT array.
3.11 LSSD scan path within a computation cell incorporating testing and register initialization.
4.1  The structure of the Multipurpose Reconfigurable Array Processor (MRAP).
4.2  Scan path for loading the control vectors.
4.3  General system diagram of the MRAP.
4.4  The computation cell.
4.5  A gate-level logic implementation of the switching cell.
4.6  A timing diagram for inter-processor communication.
4.7  A computation cell CC_{i,j} and its neighbors.
4.8  (a) Dense-matrix/dense-matrix multiplication. (b) Band-matrix/dense-matrix multiplication. (c) Band-matrix/band-matrix multiplication. (d) One-dimensional arrays for filtering.
4.9  (a) The MRAP with one faulty computation cell, (b) spare cells are used for fault masking.
4.10 Fault tolerance in the MRAP for 2-dimensional algorithms.
5.1  The Linear Reconfigurable Systolic Array (LRSA) architecture: (a) structure of the LRSA, (b) a switching cell (SC), (c) a computation cell (CC).
5.2  Structure of the Multipurpose Reconfigurable Array Processor.
5.3  A computation cell with communication links.
5.4  LSSD scan path sequence in an LRSA.
5.5  A linear systolic array for FIR filtering.
5.6  A systolic priority queue.
A.1  The system diagram.
A.2  Hexagonal array bases with (a) pb = 1/7, (b) pb = 3/7, and (c) pb = 3/7.
A.3  An arc indicating a connection between two neighboring cells.

CHAPTER 1

INTRODUCTION

Rapid advancements in VLSI technology demand innovative computational algorithms and hardware structures which fully exploit the technology to achieve high throughput rates and efficient resource utilization. Efficient implementation of algorithms on VLSI structures requires exploitation of parallelism in the algorithm and mapping of the algorithm communication structure into pipeline, vector or array processor interconnection structures. The main issues in VLSI design are those of modularity, simplicity of communication and control, and extensibility. A modular design with a large number of identical modules organized in a simple regular fashion is an ideal structure for VLSI.

In recent years, systolic arrays have been at the focus of attention of many researchers as pipelined multiprocessor structures suitable for solving a variety of computation-intensive and real-time problems requiring high throughput [2,4,31,32,41]. In the systolic concept, VLSI devices consist of arrays of interconnected primitive processors with distributed control and a high degree of modularity. Each processor operates on a string of data that flows regularly and rhythmically through the array. Systolic arrays feature the important properties of modularity, regularity, locality of interconnection and highly pipelined multiprocessing. However, one drawback of systolic architectures is their fixed-flow structure for data streams, which limits the type of algorithms or applications that can effectively be supported by such architectures.
Therefore, it is desirable to have reconfigurability in the data-flow structure and flexibility in the algorithm implementation, to make more general purpose arrays. The mixed systolic array (MSA) is a class of reconfigurable multiprocessor architectures introduced by Chang and Fisher [6-8]. In this architecture, control elements are mixed with computing elements according to a certain mixing profile, and the data-flow patterns are determined by the distributed control structure stored in the control elements. Classes of algorithms with similar data requirements may be executed on the same array by merely presetting the control elements at load time. MSA architectures broaden the scope of systolic arrays and at the same time preserve VLSI design attributes such as locality of communication, modularity, extensibility and simplicity of control.

While Chang and Fisher developed, characterized, and evaluated the basic computing model for MSAs [6], they did not present procedures for loading and implementing the distributed control structure required to establish a desired interconnection pattern on the MSA. Also, no formal methodology was presented for implementing a given algorithm into the MSA structure by systematically generating the control code required for its reconfiguration.

The goal of this research is to investigate structured methodologies for mapping parallel algorithms into reconfigurable MSA architectures, and implementing the distributed control structure required for the algorithm implementation. Another goal is to exploit the emerging VLSI and Wafer-Scale Integration (WSI) technologies by designing computer architectures which employ modularity in structure, simplicity and regularity in communication and control paths, and extensibility in design. From a more general standpoint, this research broadens the scope and enhances the applicability of special-purpose VLSI array processors, and at the same time contributes to the understanding of the problems and nature of a parallel processing approach to computation. The specific tasks are outlined as follows:

1. Investigate the procedures for loading and implementing the distributed control structure required to establish a desired interconnection pattern on the MSA structure.

2. Relate the MSA's architectural model with the parallel algorithms implemented on the array, and investigate the procedures for implementing algorithms into MSA structures by systematically generating the required control code for reconfiguration.

This research investigates structured methodologies for designing and implementing mixed systolic arrays. The Level Sensitive Scan Design (LSSD) technique [12,25,57] is employed to load and implement the distributed control structure required to establish a desired interconnection pattern on the MSA. The control structure for a particular configuration is loaded into the array as a binary vector in a bit-serial fashion. This approach enhances testability and incorporates fault tolerance in the MSA structure. This thesis presents a general mathematical model for formally representing reconfigurable MSA architectures and a step-by-step procedure for implementing a given algorithm into the MSA structure by generating the control code required to reconfigure the array. The mapping procedure is based on time and space transformations of the data dependence vectors of the algorithm.
These transformations provide a description of the data-flow and timing, and dictate the interconnection structure required to implement an algorithm on the array. A computer-aided design facility for modeling and simulating MSAs is designed and partially implemented. This facility helps the designer of MSAs to interactively develop an MSA structure for a given set of user-specified attributes, such as mixing density and array geometry, and to simulate the execution of algorithms on MSA processors at the register-transfer level.

The procedures presented in this thesis, for mapping algorithms onto MSAs and for implementing the control structure necessary for reconfiguration, provide a useful tool in the automated design of reconfigurable MSAs. For instance, in order to design an MSA to implement a set of algorithms, one can start with a high-level language description of the algorithms and use the approach presented in this thesis to find a suitable MSA which can implement them. Usually, many valid time and space transformations are generated in the procedure, providing the designer flexibility to choose the ones which map easily on the array. The procedures are especially useful during the application of an MSA processor. Whenever a new configuration of the MSA is desired for a new algorithm, the host computer can use the procedure to generate the reconfiguration control vector for the algorithm to be implemented. This control vector can be loaded into the MSA processor to reconfigure the array on-the-fly. The procedure can also be used to determine whether or not an algorithm can be implemented on a given mixed systolic array.

Throughout, we use the symbols N and Z to denote the set of all natural numbers and the set of all integers, respectively. A^n denotes the nth Cartesian power of a given set A, i.e., the set of all possible n-tuples of elements of A.

Chapter 2 reviews some background information and related research work regarding systolic architectures, parallel algorithms, and LSSD techniques. In Chapter 3, the design of a one-dimensional MSA architecture, called the Linear Reconfigurable Systolic Array (LRSA), is presented, which employs LSSD techniques to achieve reconfigurability and multifunctionality. This approach is extended to two-dimensional arrays in Chapter 4, which describes a Multipurpose Reconfigurable Array Processor (MRAP) architecture for implementing various systolic and semisystolic algorithms. Also, the performance of the MRAP is analyzed in Chapter 4 for various matrix-multiplication algorithms based on their Space-Time-Bandwidth complexity. Chapter 5 presents a mathematical formalism for modeling reconfigurable MSA architectures and a methodology for reconfiguring the array by mapping the communication structure of an algorithm into the interconnection structure of the array. Two sample algorithms, the finite impulse response (FIR) filtering algorithm and the priority queue algorithm, are mapped into a linear reconfigurable systolic array in order to illustrate the reconfiguration procedure. Chapter 6 contains a summary of this research work and some thoughts for future research possibilities. Finally, a computer-aided design facility for modeling and simulating MSAs is described in Appendix A.
CHAPTER 2

BACKGROUND

2.1 SYSTOLIC ARRAYS

Several types of VLSI architectures have been proposed in recent years, such as the systolic arrays [31-35], the Wavefront Array Processor [39], mixed systolic arrays [6-8] and the Configurable Highly Parallel (CHiP) computer [53,54]. Most of these parallel architectures attempt to match the underlying hardware to specific algorithms for fast and efficient execution. Systolic array architectures are particularly attractive for VLSI implementation because of their regular, short and simple communication geometry [44,45]. This section discusses the characteristics of systolic array architectures, systolic algorithms, and semisystolic design.

2.1.1 SYSTOLIC ARRAY ARCHITECTURES AND ALGORITHMS

Systolic array architectures are multiprocessing systems in which data is pipelined among processors by using next-neighbor communication. Systolic algorithms are defined as mathematical algorithms which are implementable with systolic architectures. A number of special purpose systolic arrays, suitable for VLSI and wafer-scale integration (WSI), have been proposed for solving various computation-intensive problems [4,18,35,38]. Their applications range from numerical problems, such as signal and image processing and matrix arithmetic, to non-numerical tasks, such as searching and sorting, graph algorithms and relational databases.

A systolic array comprises a network of interconnected cells, where each cell is capable of performing a small set of operations and has some local memory and control logic. Strictly next-neighbor connections constitute the interprocessor communication structure, and data moves through the architecture in a synchronous pipelined manner. Communication with the outside world takes place only at the array boundary, possibly through some special I/O processors. The following design criteria for systolic arrays have been suggested: First, the design should use only a small number of different types of simple cells. Second, these cells should be interconnected by a network with short, regular connections. Third, multiple use of each input data item should be made in order to achieve a high computation rate and throughput without requiring high memory-to-array bandwidth. And, finally, computational algorithms should be employed which exploit both data pipelining and parallel execution.

The principles of systolic array design are illustrated here by considering two examples of systolic arrays. A linearly connected systolic array, shown in Figure 2.1 [31], uses W processing elements to multiply an NxN band-matrix with bandwidth W by a vector of N elements. The basic processing element is called the inner-product step processor (see Figure 2.2), and consists of three internal registers RA, RB and RC. This processing element performs the inner-product operation C <-- C + A * B, where A, B and C are the contents of the registers RA, RB and RC, respectively. The time to compute the entire multiplication is 2N+W time units. The input and output vector elements march in opposite directions, so that each input vector element meets all the output vector elements before it leaves the array. The same inner-product step processor is used in a hexagonal systolic array to implement band-matrix multiplication. Figure 2.3 illustrates the multiplication algorithm for two NxN band-matrices with bandwidths W1 and W2, respectively [31]. This algorithm requires W1 x W2 inner-product step processors and takes 3N + min(W1, W2) units of time for the computation.

Figure 2.1. Linearly connected systolic array for computing the product of band-matrix A and vector X.

Figure 2.2. Two geometries for the inner-product step processor.

Figure 2.3. Hex-connected systolic array for computing the product matrix C, of matrices A and B [44].
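The behavior of the linear array can be checked with a short simulation. In the sketch below (illustrative only; the cell/time indexing conventions and the NumPy harness are mine, not taken from [31]), each multiply-accumulate y_i <-- y_i + a_ij * x_j fires at time t = i + j in cell c = j - i - lo, which reproduces the opposite-direction marching of the x and y streams and the every-other-cycle cell activity described above:

```python
import numpy as np

def systolic_band_matvec(A, x, lo, hi):
    """Cycle-level sketch of the band-matrix/vector array of Figure 2.1.

    A is an N x N band matrix whose nonzero diagonals lie in [lo, hi]
    (lo <= 0 <= hi); cell c handles the stream for diagonal k = lo + c."""
    N = len(x)
    W = hi - lo + 1                      # number of inner-product cells
    y = np.zeros(N)
    for t in range(2 * N - 1):           # global clock
        for c in range(W):
            if (t + c + lo) % 2:         # this cell idles on odd parity
                continue
            j = (t + c + lo) // 2        # x_j is passing through cell c now
            i = t - j                    # ... and meets partial result y_i
            if 0 <= i < N and 0 <= j < N:
                y[i] += A[i, j] * x[j]   # C <-- C + A * B inside the cell
    return y

# Band matrix with diagonals -2..+1 (so W = 4), like the one in Figure 2.1.
N, lo, hi = 6, -2, 1
A = np.zeros((N, N))
for i in range(N):
    for j in range(max(0, i + lo), min(N, i + hi + 1)):
        A[i, j] = i + j + 1
x = np.arange(1.0, N + 1)
assert np.allclose(systolic_band_matvec(A, x, lo, hi), A @ x)
```

Note that each cell performs at most one multiply-accumulate per clock and is busy only every other cycle, which is exactly the half-utilization property discussed later in this chapter.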
Several other computation-intensive algorithms can be solved on systolic arrays, such as LU decomposition, triangular linear systems, convolution, filtering, and discrete Fourier transforms. Many implementation alternatives exist for systolic array processors, providing different interconnection topologies and degrees of flexibility. According to their degree of flexibility, systolic array processors can be classified as follows [34]:

1. Single-purpose systolic arrays [2,4,38]. In this approach, a systolic array is built to implement only one algorithm, and a different array needs to be designed for each new algorithm. This approach is reasonable if the performance of the processor is of ultimate importance and the processor is to be used in large quantities despite the fact that it is single-purpose.

2. Multi-purpose systolic arrays [58]. A systolic array processor of this type can implement a predefined set of algorithms. The control overhead for providing the necessary flexibility should be kept small, and the VLSI attributes, such as modularity and locality of communication, should be preserved in the design.

3. Non-programmable building blocks [29]. Building-block processors are constructed which can execute a few predefined commonly used functions. These blocks are connected to form a variety of systolic array processors of different sizes and shapes.

4. Programmable building blocks [14]. The building block is a programmable processor which can be programmed to implement a large family of systolic cells. This approach is not very efficient because of the overhead for supporting the programmability. However, some systolic algorithms involving complicated data dependencies, such as greatest common divisor computation [2], can be effectively implemented on this type of array.

5. Programmable systolic arrays [3,7]. In this approach, programmable processing elements are mixed with other control units in a certain manner. These arrays are more flexible than the multi-purpose systolic arrays in the sense that the processing elements are programmable and their interconnections can be configured by software control before a computation starts.

Various interconnection topologies for systolic arrays can be defined according to the number of computations performed for each input/output operation. First, for the two-dimensional systolic arrays shown in Figure 2.4(a), O(n) processing elements perform computations in each cycle, whereas O(sqrt(n)) boundary cells perform input/output operations. Thus the computation over input/output ratio is O(sqrt(n)). Systolic arrays for matrix arithmetic algorithms are included in this class [37]. The second interconnection topology can be defined as the degenerate two-dimensional systolic arrays shown in Figure 2.4(b), in which O(n) processing elements perform computations and O(n) elements perform input/output operations. So the computation over input/output ratio is O(1) in this case. Systolic arrays for the solution of triangular linear systems and orthogonal transformations are examples of degenerate two-dimensional arrays [19,37].
Finally, linear or one-dimensional systolic arrays, shown in Figure 2.4(c), perform the input/output via the two processing elements at the ends of the array, and thus the computation over input/output ratio is O(n) for an array of size n. Systolic arrays for filtering or pattern matching come under this category [32,37]. Linear arrays may be preferred over two-dimensional arrays in situations where the input/output bandwidth between the host system and the systolic array is a major limiting factor for achieving high performance.

Figure 2.4. Some common interconnection schemes for systolic arrays: (a) two-dimensional systolic arrays, (b) degenerate two-dimensional systolic array, (c) one-dimensional systolic array.

2.1.2 SYSTOLIC AND SEMISYSTOLIC DESIGN

Systolic array architectures do not allow any global data communication, and all communication between processing elements is clocked through a register. Semisystolic array architectures, on the other hand, allow global data communication, and a data item may be broadcast to many processing elements simultaneously. Mathematically, the structure of a systolic system S(n) is given by a machine graph G = (V,E) of n interconnected Moore machines, where the vertices in V represent the machines and the directed edges in E represent interconnections between the machines [41]. The machines operate synchronously by means of a common clock, and time in the system is measured as the number of clock cycles. A semisystolic system is similar to a systolic system except that some of the machines may be Mealy machines, with the condition that the output edges from Mealy machines may not form a cycle in the machine graph [41]. Mealy machines can implement data broadcasting, whereas Moore machines can not. Figure 2.5 shows an example of a semisystolic array which can implement data broadcasting. In this example, the combinational logic for the Mealy machines is a simple wire from input to output. The exclusion of Mealy machines in systolic systems makes the clock period independent of the system size.

Figure 2.5. A semisystolic system can implement broadcasting by Mealy machines [41].

Semisystolic systems do not meet the design criterion of extensibility as systolic systems do. That is, many semisystolic arrays can not be cascaded together to form an arbitrarily large array, because the clock cycle time, which depends upon the delay due to broadcasting or rippling of logic, may asymptotically become arbitrarily large. In general, there are two main disadvantages associated with data broadcasting. First, large broadcasts can not be implemented in a single communication cycle, because the broadcast delay can dominate the execution time of an algorithm. Second, large drivers are required for broadcasting in order to drive the combined load of all the cells connected to the broadcasting bus. Two techniques, retiming and slowdown, are described in [41] and [42] for converting a semisystolic design into a systolic one by eliminating broadcasting.

For many systolic algorithms, some fraction of the processors are always idle at a given time. For example, in FIR filtering, half of the processors are idle at each clock tick, and in band-matrix multiplication, two thirds of the processors are idle at each clock tick [44]. Leiserson proposed coalescing and interlacing techniques for improving processor utilization [41].
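The difference between the two communication disciplines can be caricatured in a few lines of Python. This is a toy model, not taken from [41]: the Moore-style chain needs one clock per hop to move a host value down n cells, while the Mealy-style wire delivers it everywhere in a single cycle, at the cost of a clock period that must absorb the broadcast fan-out:

```python
def ripple_value(x, n):
    """Moore-style (systolic) chain: every hop goes through a register,
    so a value injected by the host needs one clock cycle per cell."""
    cells, cycles = [None] * n, 0
    while any(v is None for v in cells):
        cells = [x] + cells[:-1]     # each cell latches its left neighbor
        cycles += 1
    return cycles

def broadcast_value(x, n):
    """Mealy-style (semisystolic) bus: the combinational path is a simple
    wire, so all n cells see x in one cycle -- but the clock period must
    now absorb the delay of driving all n loads at once."""
    cells = [x] * n                  # one cycle, unbounded fan-out
    return 1

assert ripple_value(7, 8) == 8
assert broadcast_value(7, 8) == 1
```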
For specific algorithms, introduction of the data broadcast concept results in more efficient parallel algorithms and better processor utilization [9,10,21]. Huang and Abraham [21] have compared the efficiencies of systolic and semisystolic arrays for matrix multiplication algorithms. They concluded that, for specific computations, semisystolic arrays perform better than systolic arrays according to the Space-Time-Bandwidth complexity criterion.

2.2 RECONFIGURABLE MIXED SYSTOLIC ARRAYS

The mixed systolic array (MSA) is a class of reconfigurable multiprocessor architectures, first introduced by Chang and Fisher [6-8]. In this architecture, control elements are mixed with computing elements according to a certain mixing profile, and the data-flow patterns are determined by the distributed control structure stored in the control elements. The computing elements are programmable multifunctional arithmetic and logic processors, whereas the control elements are programmable interconnection networks which establish the communication structure of the array. Classes of algorithms with similar data requirements may be executed on the same array by merely presetting the control elements at load time. In this thesis, we refer to the computing element and the control element as the computation cell and the switching cell, respectively.

The mixed systolic array provides a programmable interconnection structure by way of mixing switching cells with computation cells in a systolic fashion. Its main objective is to broaden the scope of systolic arrays by achieving reconfigurability, algorithmic flexibility and fault tolerance. Another objective is to preserve the design attributes of modularity, uniformity, locality of communication and simplicity of control, in order to exploit the very large scale integration (VLSI) and wafer scale integration (WSI) technologies.

The computation cells in the MSA are programmable multifunctional arithmetic and logic processors, which process the incoming data according to the control codes stored in their control registers. The switching cells are programmable interconnection networks which establish the interprocessor communication structure in the array to meet the communication requirements of the algorithms implemented. A switching cell directs the data-flow among its neighboring computation cells according to the interconnection configuration defined by its control code register. An MSA executes a specific algorithm according to the control codes stored in individual cells. The control code registers of computation cells determine the computational structure of the MSA, and their contents are determined by the basic computational requirements of the algorithm. The control code registers of switching cells establish the communication structure of the MSA, and their contents are determined by the communication requirements of the algorithm. In order to configure an MSA to implement a specific algorithm, the control code corresponding to that particular algorithm is loaded into the array, and the array performs the execution in a synchronous manner. Whenever a new application of the same MSA is needed, a new control code can be loaded into the array for its reconfiguration.

Basically, the structure of an MSA is determined by its mixing profile, which establishes the possible frames of data-flow patterns within the array. An MSA structure is called a regular structure when there is a basis or a subarray from which the MSA can be constructed.
Figure 2.6 shows an MSA constructed from a diamond-like basis. Other possible MSA structures, with irregular and partially regular mixing, are discussed in [6]. In this thesis, we are concerned only with regular MSA structures, mainly because programming and mapping algorithms into irregular arrays are too complex, and most of the existing synchronous parallel algorithms map into regular array structures [4,35].

Figure 2.6. A mixed systolic array constructed from a diamond-like basis [6].

The mixing profile of an MSA is determined by its mixing density and boundary conditions. The mixing density, p, is defined as

    p = Nsc / (Nsc + Ncc)        (2.1)

where Nsc and Ncc are the number of switching cells and the number of computation cells, respectively. An MSA with high mixing density is suitable for the implementation of algorithms which require more complex data routing and less computing. On the other hand, an array with low mixing density can be applied to algorithms which have simple data routing but require a large amount of computing. When p = 0, the MSA reduces to a fixed-structure single-purpose systolic array, and when p = 1, the MSA reduces to a reconfigurable interconnection network with no computing power. The boundary conditions in an MSA are represented by a boundary condition function, sigma (2.2), defined in terms of Nsc', the number of control buffers placed on the MSA boundary.

Various other general purpose reconfigurable VLSI architectures have been proposed, such as the Configurable Highly Parallel (CHiP) computer by Snyder [53] and the Programmable Systolic Chip by Fisher and Kung [14]. In the CHiP architecture, a lattice structure of programmable switches is incorporated, into which processing elements are placed at regular intervals. This architecture can implement various special purpose parallel architectures, such as mesh structures and tree structures, under the supervision of a master controller. The controller broadcasts commands to all the switches to invoke a particular configuration setting. This master controller may be undesirable for failure-critical applications. In the MSA architecture, neither a broadcasting of command signals nor a master controller is required; instead, the control structure is distributed among the switching cells and the computation cells of the array.

While Chang and Fisher developed, characterized, and evaluated the basic computing model for MSAs [6], they did not present procedures for loading and implementing the distributed control structure required to establish a desired interconnection pattern on the MSA structure. Also, no formal methodology was presented for implementing a given algorithm into the MSA structure by systematically generating the control code required to reconfigure the array. The main motivation for this research work came from the above mentioned issues regarding algorithm design and implementation of MSAs.

2.3 ALGORITHM TRANSFORMATION AND MAPPING INTO VLSI STRUCTURES

Efficient implementation of algorithms on VLSI structures requires exploitation of parallelism in the algorithm and mapping of the algorithm communication structure into the array processor interconnection structure. An approach to design VLSI algorithms based on recurrences was first suggested by Cohen in [11] and later expanded by Johnsson and Cohen in [22], and Weiser and Davis in [56].
This approach is adopted from z-transforms in signal processing and uses delay operators (Z-operators) for specifying and representing sets of data as wavefront entities in the mathematical expressions. Mathematical formulas are manipulated to obtain different expressions that correspond to different computational networks. The drawback with this approach is that the notation gets difficult to manage for complex computations. Leiserson and Saxe presented a general theory for optimizing a synchronous circuit by adjusting the number of register delays in the data paths [43]. This theory justifies some transformations used to eliminate broadcasting from semisystolic designs, but does not offer a methodology for systematically designing systolic arrays starting from a high-level language description of an algorithm.

Kuhn introduced the idea of exploiting parallelism in loops with multiple levels of nesting by reindexing the loop computations [30]. He used the concept of program dependence for the detection and exploitation of parallelism in programs. He defined a dependence as an arc in the dependence graph directed from the source occurrence of a variable to the destination occurrence of the same variable. Both the source and destination nodes are labeled by the values of the loop indices at which the generation and use of a variable occur, and the dependence arc is labeled by the difference vector of the source and destination labels. He mapped several specific algorithms into SIMD computers with single-stage interconnection networks and illustrated how the design of certain VLSI systolic arrays could be done automatically by reindexing algorithms. Moldovan [45-47] and Fortes [15] extended the approach of Kuhn and formalized the procedure for mapping algorithms into VLSI architectures by transforming the dependencies of the original algorithm by a reindexing transformation. Algorithm transformations comprise a time transformation, which dictates the order of execution of computations, and a space transformation, which determines the data movement and the array size and geometry. Necessary and sufficient conditions for the existence of a certain type of transformation are given in [45]. This approach is best suited to algorithms described by programs with loops or by recurrence equations.

Some other related work on formalizing the design process of algorithmically specialized devices has been reported by Cappello and Steiglitz [5], Quinton [49], and Lam and Mostow [40]. Cappello's methodology is based on geometric transformations for mapping nested-loop algorithms into systolic arrays. The computation is modeled as a lattice in which nodes represent operations and edges represent data dependencies. Different systolic designs can be derived by applying geometric transformations to the lattice. Quinton's approach finds a uniform recurrent system of equations that is equivalent to the problem to be solved and maps this system of equations into a finite architecture. The methods given in [5] and [49], regarding the formalizing of systolic array design, are suitable for algorithms described by recurrence equations. Lam and Mostow described a design model in which software transformations are first applied to put the algorithm to be implemented into a regular form conducive to systolic implementation. The algorithm is then mapped into a systolic design described by a structure and a driver.
The structure describes the hardware cells and the driver defines data streams in terms of the original variables in the algorithm. This approach can process algorithms with simple FOR-loops and BEGIN-END blocks, but cannot deal with conditional execution, computed iteration bounds and array indices. A survey of systematic approaches to the design of algorithmically specified systolic arrays can be found in [16].
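The dependence-vector approach of Moldovan and Fortes, on which Chapter 5 builds, reduces at its core to a simple linear-algebra check. The following sketch assumes the unit dependence vectors of a three-level loop nest such as matrix multiplication; the particular time transformation pi and space transformation S are example choices of mine, not values from the thesis:

```python
import numpy as np

# One dependence vector per column; these are the unit vectors of a
# three-level loop nest such as matrix multiplication.
D = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
pi = np.array([1, 1, 1])            # candidate time transformation
S  = np.array([[0, 1, 0],           # candidate space transformation:
               [0, 0, 1]])          # projects the index space onto a 2-D array

# Causality: every computation must run strictly after the computations it
# depends on, i.e. pi . d >= 1 for every dependence vector d.
assert all(pi @ D[:, m] >= 1 for m in range(D.shape[1]))

for m in range(D.shape[1]):
    print("d =", D[:, m],
          "delay =", int(pi @ D[:, m]),   # register delays on the link
          "link =", S @ D[:, m])          # neighbor direction carrying d
```

The time component pi @ d gives the number of clock cycles a datum spends on a link, and the space component S @ d gives the interconnection primitive that must exist in the array for the dependence to be satisfied.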
2.4 FAULT TOLERANCE IN SYSTOLIC ARRAYS

Fault-tolerant systems are capable of performing correctly even in the presence of one or more faulty components. Fault-tolerant systems require some form of redundancy incorporated in their design. This redundancy could be either physical or temporal, or a combination of the two [51]. Physical redundancy is provided by replicating resources and may involve the use of extra gates, memory cells or functional modules. A taxonomy of fault-tolerance techniques and the various stages of response to a system failure in a fault-tolerant system are described in [51].

Fault tolerance in a pipeline architecture is very critical because a single fault in any segment would cause a total failure of the pipeline. Physical redundancy can be incorporated in a pipeline architecture either at the pipeline level or at the segment (module) level. Reconfigurable parallel pipelines can be utilized to implement fault tolerance and achieve better performance in terms of average throughput, mean time to failure (MTTF), and mean computation before failure (MCBF) [24]. Initially the system utilizes all the pipelines in parallel, but as soon as one segment in a pipeline fails, the pipeline containing the failed segment ceases operation, resulting in a degradation in performance. The other segments in the ceased pipeline become available as spares to mask subsequent faults in adjacent pipelines. In this way, whenever the system fails, it does so with graceful degradation.

Although systolic array processing is a very efficient method of gaining increased system performance, this architecture is highly susceptible to faults. As a systolic array consists of many parallel pipelines, a single fault in any processing cell will propagate down the pipeline, causing the system to fail. In two-dimensional systolic arrays, a single processing cell is shared by more than one pipeline, so the fault can propagate in multiple directions, which makes the situation even worse. The above makes clear that the application of fault tolerance to a systolic array-based architecture is very critical. A fault-tolerant design of systolic arrays should enable an array to withstand one or more faults without total failure. This extends the life of the system and increases the mean time to failure (MTTF) and mean computation before failure (MCBF) for the system.

Fault tolerance, of course, will require that redundancy be incorporated in the design of systolic arrays [20]. Temporal redundancy is not suitable because it adversely affects the system speed and throughput, which is not desirable if high performance is required. Physical redundancy can be introduced in systolic array designs at the processing cell level or at the pipeline level. In response to a fault, the faulty cell or pipeline is discarded and replaced by another working one. One method of designing fault-tolerant systolic arrays uses modular redundancy at the processing cell level. Triple modular redundancy (TMR) and N-modular redundancy will mask out the faulty outputs from a bad processing cell, but these are very costly techniques and their application is limited to only critical short-term uses [50]. Another fault-tolerant scheme, proposed in [36], enhances the yield of wafer-scale integration implementation of systolic arrays by replacing defective cells with clocked delays. This allows data to flow through the array with faulty cells at the original clock speed. Reconfigurable parallel pipelines for fault tolerance can be utilized in systolic arrays in order to achieve the graceful degradation property. In the case of no fault, all the pipelines are utilized and contribute to increased system performance. In the case of faults, however, some pipes will shut down, degrading the system performance. This makes many processing cells available to be used to mask out any further faults.

2.5 LEVEL SENSITIVE SCAN DESIGN (LSSD) TECHNIQUES

LSSD is IBM's discipline for structural design for testability [12]. In this concept, the memory elements or latches in an IC can be threaded together to form a serial-in, serial-out shift register. This provides an efficient means for "controlling" and "observing" the internal states of a machine with only three or four additional pinouts. A key element in this design is the "shift register latch" (SRL) as shown in Figure 2.7 [57]. Since IBM has used the LSSD technique extensively, considerable attention has been given to the efficient implementation of LSSD latches such that the overhead due to the complexity of SRLs is substantially reduced [13]. The lines D and C form the normal mode memory function, while lines I, A, B and L2 comprise the circuitry for the shift register function. The shift registers are threaded by connecting I to L2 and operated by clock lines A and B in a two-phase fashion. Figure 2.8 illustrates how four SRLs can be threaded together for shift register action, and Figure 2.9 shows the general structure of an LSSD subsystem with two clocks. Specific design rules and constraints concerning gating of clocks, etc., are given in [12].

Figure 2.7. Block diagram representation of a shift register latch.

Figure 2.8. Interconnections of SRLs on an integrated circuit [57].

LSSD techniques are employed in reconfigurable systolic architectures for loading and implementing the control structure required for reconfiguration [25]. The main advantage of using the LSSD technique is that the overhead due to additional pinouts does not exceed three or four pins regardless of the size of the array. Shift register latches (SRLs) hold the control information. All SRLs are threaded together in a chain of shift registers in a manner such that a control vector entered in a bit-serial fashion sets up the array configuration for a particular algorithm.

Figure 2.9. General structure of an LSSD subsystem with two system clocks [57].
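As a behavioral summary of this section, the sketch below models a scan chain at the bit level. The SRL's L1/L2 latch pair and the two-phase A/B clocking are collapsed into a single shift per bit; this is an abstraction of mine rather than a gate-accurate model of Figure 2.9:

```python
class ScanChain:
    """All SRLs threaded into one serial-in/serial-out shift register.
    One call to shift() stands for a full two-phase A/B clock pulse."""

    def __init__(self, length):
        self.srl = [0] * length                 # abstracted L1/L2 latch pairs

    def shift(self, scan_in):
        scan_out = self.srl[-1]                 # last L2 drives the Scan Out pin
        self.srl = [scan_in] + self.srl[:-1]    # each SRL takes its predecessor
        return scan_out

    def load(self, control_vector):
        # The first bit shifted in ends up deepest in the chain, so the
        # vector is fed in reverse to land in positional order.
        for bit in reversed(control_vector):
            self.shift(bit)

chain = ScanChain(8)
chain.load([1, 0, 1, 1, 0, 0, 1, 0])
assert chain.srl == [1, 0, 1, 1, 0, 0, 1, 0]
```

Whatever the length of the chain, loading uses only the Scan In pin, the Scan Out pin, and the two shift clocks, which is the pin-count argument made above.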
CHAPTER 3

A LINEAR RECONFIGURABLE SYSTOLIC ARRAY (LRSA) ARCHITECTURE

The Level Sensitive Scan Design (LSSD) technique is employed in mixed systolic arrays to load and implement the distributed control structure required to configure the array for a desired algorithm implementation. To illustrate the concepts, this chapter presents the design of a one-dimensional linear reconfigurable systolic array (LRSA). This array structure is statically reconfigurable to realize any of the following: a filtering array, an FIR filtering array, a pattern matching array, and a Discrete Fourier Transform (DFT) array. What's more, the array structure is partitionable and can be divided into two or more independent subarrays, each capable of executing a preprogrammed algorithm. Shift register latches (SRLs) hold the control information for setting up the interconnection configuration and selecting the functional mapping of computation cells. Both the data flow through the array and the functions executed in the computation cells are established by inputting a control vector in a bit-serial fashion, using a two-phase clock. The next section describes the structure of the LRSA, and Section 3.2 characterizes four systolic algorithms which can be implemented on this structure.

3.1 STRUCTURE OF THE LINEAR RECONFIGURABLE SYSTOLIC ARRAY

The linear reconfigurable systolic array can be configured to implement one or more of the following types of systolic arrays: 1) a general filtering array, 2) an FIR filtering array, which has better resource utilization than the general filtering array, 3) a pattern matching array, and 4) a Discrete Fourier Transform (DFT) array. The systolic algorithms for the above mentioned arrays are described in Section 3.2. The LRSA architecture is configured, by simply loading a binary control vector through the LSSD scan input line, such that the computational and the data-flow structures of the algorithm are implemented on the array architecture. Once the array is configured, the architecture matches the algorithm structure exactly. The structure of the LRSA can be divided into three parts: the computation cells array, the function-select array and the data flow control array. Figure 3.1 illustrates a system block diagram of this structure, and each part is described in the remainder of this section.

Figure 3.1. General system diagram of the Linear Reconfigurable Systolic Array.

3.1.1 THE COMPUTATION CELLS ARRAY

All computation cells are identical, so modularity is preserved in the design. This, of course, is an important factor in the efficient design and implementation of systolic arrays [22]. Each computation cell has three input data ports and three output data ports. These are all local ports except for the cells at the extreme ends of the array, where global I/O ports are used. Due to the local I/O ports, the interconnections are regular, simple and short. Three control bits select the functionality of the computation cell. Figure 3.2 shows a computation cell at the K'th position in the array. The control input lines will be referred to as function-select lines.

Figure 3.2. Block diagram representation of a Computation Cell.
For a computation cell CC(K), the input ports are referred to as I(1,K), I(2,K) and I(3,K), whereas the output ports are O(1,K), O(2,K) and O(3,K). Each cell contains two operand register buffers, A(K) and B(K), which hold the filter coefficients loaded by the host prior to the start of computation. Let the set of function-select lines for cell CC(K) be called S(K). Then the functions of the computation cell can be described as follows:

    Begin
      if S(K) = 001 then
        O(1,K) <-- I(1,K)
        O(2,K) <-- I(2,K)
        O(3,K) <-- A(K)*I(2,K) + B(K)*I(1,K) + I(3,K)
      elseif S(K) = 010 then
        O(1,K) <-- B(K)
        O(2,K) <-- A(K)*I(1,K) + I(2,K)
        B(K)   <-- I(1,K)
      elseif S(K) = 011 then
      elseif S(K) = 100 then
        O(1,K) <-- I(1,K)
        O(3,K) <-- I(3,K)
        R(K)   <-- R(K) .AND. (I(1,K) .EQ. I(3,K))
      elseif S(K) = 101 then
        O(2,K) <-- I(1,K)*I(2,K) + A(K)
      end if
    end

R(K) is a flip-flop in each computation cell which is used in the pattern matching algorithm. A(K) and B(K) are two registers in each CC(K) which hold the coefficients loaded from the host or main memory.

3.1.2 FUNCTION-SELECT ARRAY

The function-select array contains a three-bit element for each computation cell; the contents of this array select the function performed by the computation cell. Each computation cell can be programmed to perform a selected primitive arithmetic or logic operation by setting its function-select array element. All elements of the function-select array consist of SRLs, which are all threaded together to form a shift register. The control information can be entered in a bit-serial fashion, using a two-phase clock, via the Scan In input line, as shown in Figure 3.3.

Figure 3.3. SRLs are used in the function-select array.

3.1.3 DATA FLOW CONTROL ARRAY

The data flow control array contains the control information for the interconnection network configuration. Data flow control is achieved by employing interchange boxes, which are devices with two inputs and two outputs. Figure 3.4 shows the four legitimate states of an interchange box. Two-bit control is required for each box, and the control bits for each box are stored in an SRL. All SRLs are threaded together as in the case of the function-select array. Figure 3.5 shows a hardware implementation of an interchange box. Figure 3.6 illustrates how two adjacent cells in the LRSA are connected together with interchange boxes and function-select elements.

Figure 3.4. Four interconnection states of an interchange box.

Figure 3.5. Logic diagram of a two-function interchange box.

Figure 3.6. Structure of the Linear Reconfigurable Systolic Array.

Although the computation cells are multifunctional, the various functions are computationally not much different from each other. By employing some data selection logic in an inner-product step processor, the computation cells for the LRSA could be obtained. So the overhead due to the added complexity of making the computation cell multifunctional should be small. All the control information is entered through one pin in a bit-serial fashion, so the overhead due to additional pinouts is minimal.
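A behavioral sketch of the interchange box ties this section together. The four states follow Figure 3.4; the two-bit encoding of the states and the upper/lower labeling of the ports are my assumptions, since the text fixes only the four behaviors:

```python
STRAIGHT, EXCHANGE, LOWER_BROADCAST, UPPER_BROADCAST = range(4)

def interchange_box(state, a0, a1):
    """Map inputs (a0, a1) to outputs (b0, b1); a0/b0 are taken to be the
    upper port and a1/b1 the lower port of the box."""
    if state == STRAIGHT:
        return a0, a1
    if state == EXCHANGE:
        return a1, a0
    if state == LOWER_BROADCAST:
        return a1, a1          # lower input driven onto both outputs
    if state == UPPER_BROADCAST:
        return a0, a0          # upper input driven onto both outputs
    raise ValueError("illegal two-bit control code")

assert interchange_box(EXCHANGE, "x", "y") == ("y", "x")
assert interchange_box(UPPER_BROADCAST, "x", "y") == ("x", "x")
```

Setting every box to STRAIGHT yields the plain next-neighbor pipeline; the EXCHANGE and broadcast states are what make partitioning and cell bypassing possible, as discussed in Section 3.2.1.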
3.2 SYSTOLIC ALGORITHMS IMPLEMENTED ON THE LRSA

This section describes four systolic algorithms implemented on the LRSA. These algorithms include general filtering, Finite Impulse Response (FIR) filtering, pattern matching and the Discrete Fourier Transform (DFT).

3.2.1 SYSTOLIC FILTERING ARRAY

The general filtering problem is defined as follows [49]: Given the weighting coefficients {w_0, w_1, ..., w_h} and {r_1, r_2, ..., r_k}, the initial values {y_{-k}, y_{-k+1}, ..., y_{-1}}, and the input sequence {x_{-h}, x_{-h+1}, ..., x_0, x_1, ..., x_n}, compute the output sequence {y_0, y_1, ..., y_n} defined by

    y_i = \sum_{j=0}^{h} w_j x_{i-j} + \sum_{j=1}^{k} r_j y_{i-j}        (3.1)

A systolic array to implement the above filtering problem is shown in Figure 3.7 [49]. This array computes a new output y_i every two cycles, where a cycle is the time to perform two multiplications and two additions. The weighting coefficients w_j and r_j are preloaded into the array. The filtering computation starts by loading the x_i's from the host to the systolic array. When the array has received all the x_i's for -h <= i < 0, it starts outputting the computed y_i's at the rate of one every two cycles. The two types of basic cells used are shown in Figure 3.7(a). The systolic array for filtering is a linear array and consists of m type-1 cells and one type-2 cell, where m = max(h+1, k). Each y_i is initialized as zero as it enters the array from the right-most cell. It accumulates terms as it travels along the array towards the left and eventually achieves its final value y_i when reaching the left-most cell. The output y_i is fed back into the array for use in other computations.

Figure 3.7. One-dimensional systolic array for filtering: (a) basic cells used in the array, (b) the systolic filtering array.

In order to implement the above filtering algorithm on an LRSA consisting of (m+1) computation cells, the function-select elements should be set as follows:

    S(K) = 001, 1 <= K <= m;    S(K) = 011, K = m+1.

The switching cells in the array should be used in the straight configuration. It can be easily seen from Figure 3.6 that the exchange and broadcast configurations of the switching cell can be used to partition the LRSA into subarrays, or to bypass faulty computation cells.
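For comparison with the array's output, a direct sequential evaluation of recurrence (3.1) can serve as a reference. This is a plain software baseline, not a model of the cells of Figure 3.7:

```python
def filter_reference(w, r, y_init, x):
    """y_i = sum_{j=0..h} w_j x_{i-j} + sum_{j=1..k} r_j y_{i-j}.

    w = (w_0..w_h), r = (r_1..r_k), y_init = (y_{-k}..y_{-1}), and x is
    indexed from -h, matching the problem statement above."""
    h, k = len(w) - 1, len(r)
    xs = dict(enumerate(x, start=-h))          # x_{-h} .. x_n
    ys = dict(enumerate(y_init, start=-k))     # y_{-k} .. y_{-1}
    n = len(x) - h - 1
    for i in range(n + 1):
        ys[i] = (sum(w[j] * xs[i - j] for j in range(h + 1)) +
                 sum(r[j - 1] * ys[i - j] for j in range(1, k + 1)))
    return [ys[i] for i in range(n + 1)]

# h = 1, k = 1: y_i = 0.5 x_i + 0.25 x_{i-1} + 0.1 y_{i-1}
print(filter_reference([0.5, 0.25], [0.1], [0.0], [1.0, 2.0, 3.0, 4.0]))
```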
3.2.2 SYSTOLIC FIR FILTERING ARRAY

The general filtering problem described in the previous subsection realizes systems having an Infinite-duration Impulse Response (IIR). Such systems involve a recursive computational algorithm. In the case of Finite-duration Impulse Response (FIR) systems [49], the realization generally takes the form of a nonrecursive computational algorithm. An FIR filtering problem is a special case of the general filtering problem of Subsection 3.2.1, with r_i = 0 for all 1 <= i <= k.

The dedicated throughput of the systolic array for filtering described in the previous subsection is one half, i.e., only one half of the cells in the array are active at any given time. Figure 3.8 shows a systolic array for the FIR filtering algorithm [24]. Data streams move in the same direction at two different speeds in this array, and all the cells are used all the time. The basic cell used in the design is shown in Figure 3.8(a). The w_i's are preloaded into the array during the initialization phase. Both the x_i's and the y_i's travel towards the right, but the y_i's travel twice as fast as the x_i's. Each y_i accumulates terms as it travels towards the right and achieves its final value as it leaves the right-most cell. The FIR filtering problem is mathematically identical to the convolution problem, so the systolic array described here also applies to convolution computations.

[Figure 3.8. One-dimensional systolic array for FIR filtering: (a) basic cells used in the array, (b) the systolic FIR filtering array.]

In order to implement the FIR filtering algorithm on the LRSA, the function-select elements for all the computation cells in the array should contain 010, i.e.,

    S(K) = 010, 1 <= K <= h .

3.2.3 PATTERN MATCHING SYSTOLIC ARRAY

A systolic array design for the pattern matching problem is shown in Figure 3.9 [24]. The pattern matching problem can be defined as follows: Given a text string (s_1, s_2, ..., s_n) and a pattern string (p_1, p_2, ..., p_m), with n much larger than m, compute the positions of all occurrences of the pattern within the text string. For example, if the text string is DBABBFBABABB and the pattern is BAB, then the result is 010000101000, where each '1' indicates the beginning position of an occurrence of the pattern inside the text string. Let the resulting Boolean string be {r_1, r_2, ..., r_{n-m+1}}, such that r_i = 1 if and only if (s_i, s_{i+1}, ..., s_{i+m-1}) = (p_1, p_2, ..., p_m). Then the systolic array shown in Figure 3.9, using the basic cell as shown, can compute the r_i's by comparing the characters of the pattern with the characters of the text string. Each r_i is initialized to '1'. A pattern character p_k and a text string character s_j are compared at the cell where they meet, and the cell updates the value of r_i such that

    r_i <-- r_i AND (p_k = s_j) .

In case of a mismatch, the value of r_i is reset to '0'. The pattern string travels towards the right and the text string travels towards the left, so that each pattern character meets each text character. The value of r_i at a cell becomes final when the last character in the pattern string passes through that cell. This systolic array with m cells can solve the pattern matching problem in time n, whereas a sequential software solution takes time proportional to m*n.

[Figure 3.9. One-dimensional systolic array for pattern matching: (a) basic cells used in the array, (b) the systolic pattern matching array.]

In order to implement the pattern matching algorithm on an LRSA consisting of m computation cells, the function-select elements for all the computation cells in the array should contain 100, i.e.,

    S(K) = 100, 1 <= K <= m .
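The cell update rule above can be checked against a straightforward sequential model. The following Python sketch computes the r_i flags exactly as the cells do, by AND-accumulating character comparisons; it reproduces the result of the example above, but says nothing about the array's two-way data movement or timing.

    def match_flags(text, pattern):
        """r_i = 1 iff the pattern occurs at (1-based) position i of the text."""
        n, m = len(text), len(pattern)
        flags = []
        for i in range(n - m + 1):
            r = True                                     # each r_i is initialized to '1'
            for k in range(m):
                r = r and (pattern[k] == text[i + k])    # r_i <- r_i AND (p_k = s_j)
            flags.append(int(r))
        return flags

    # Example from the text: occurrences of BAB in DBABBFBABABB begin at positions 2, 7 and 9.
    assert match_flags("DBABBFBABABB", "BAB") == [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]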
3.2.4 DISCRETE FOURIER TRANSFORM (DFT) SYSTOLIC ARRAY

An n-point Discrete Fourier Transform (DFT) is defined as follows: Given {a_0, a_1, ..., a_{n-1}}, n samples of a time function, compute {y_0, y_1, ..., y_{n-1}} defined by

    y_i = \sum_{j=0}^{n-1} a_j \omega^{ij} ,    (3.2)

where \omega = e^{2\pi k/n} and k = \sqrt{-1}.

The straightforward method for the computation of an n-point DFT requires O(n^2) operations, and the Fast Fourier Transform (FFT) algorithm requires O(n log n) operations for the same computation. The linear systolic array shown in Figure 3.10 [48], with (n-1) basic cells, can compute an n-point DFT in O(n) time. However, the communication scheme of the systolic array is much simpler than the complicated data communication requirements of the FFT algorithm. The basic cell used in the DFT systolic array is essentially a multiplier-accumulator cell and is shown in Figure 3.10(a). The array consists of (n-1) basic cells, and the input samples a_{n-2} down to a_0 are preloaded into the array cells. The inputs y_in and x_in to the left-most cell are a_{n-1} and some power of \omega, respectively. The output x_out from the right-most cell is always ignored. Each y_i, initialized as a_{n-1}, accumulates its terms while travelling towards the right, and reaches its final value as it leaves the right-most cell.

[Figure 3.10. Linear systolic array for the DFT algorithm: (a) basic cells used in the array, (b) the systolic DFT array.]

In order to implement an n-point discrete Fourier transform algorithm on the LRSA, the function-select elements for all the computation cells in the array should contain 101, i.e.,

    S(K) = 101, 1 <= K <= n-1 .
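The array evaluates Equation (3.2) as a polynomial in \omega^i by Horner's rule: each cell applies the S(K) = 101 step y <-- y*x + a to a y value moving right. The Python sketch below imitates that per-cell step for a single output y_i; the function name and argument order are conventions of this sketch.

    import cmath

    def dft_point(a, i):
        """Compute y_i of Eq. (3.2) the way the array does: Horner's rule in w**i."""
        n = len(a)
        w_i = cmath.exp(2j * cmath.pi * i / n)    # x_in of the left-most cell: a power of w
        y = a[n - 1]                              # y_in of the left-most cell is a_{n-1}
        for k in range(n - 2, -1, -1):            # cells hold a_{n-2} down to a_0
            y = y * w_i + a[k]                    # the S(K) = 101 cell step: y <- y*x + a
        return y

    # Sanity check against the direct sum of Eq. (3.2):
    n, a = 4, [1, 2, 3, 4]
    direct = sum(a[j] * cmath.exp(2j * cmath.pi * 1 * j / n) for j in range(n))
    assert abs(dft_point(a, 1) - direct) < 1e-9

Computing all n outputs this way costs O(n^2) operations in sequence; the array reaches O(n) time by pipelining the n outputs through its n-1 cells.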
3.3 SUMMARY AND DISCUSSION

This chapter presented the architecture and design of a one-dimensional Linear Reconfigurable Systolic Array. The design approach illustrates how the level-sensitive scan design technique can be employed in mixed systolic arrays to load and implement the distributed control structure required to configure the array for a desired algorithm implementation. The control structure for a particular configuration is loaded into the array through the LSSD scan path as a binary vector, in a bit-serial fashion. This approach has the advantage that the overhead in terms of additional pinouts is limited to only three or four pins. Also, serial loading of the control structure reduces the overhead in terms of additional on-chip interconnections due to the control hardware.

The major incentive for employing the LSSD technique in the design of the LRSA is to load the configuration control code into the array. This approach, in addition, enhances testability, incorporates fault tolerance, and provides the capability to initialize the internal data registers of individual cells. In order to achieve testability and data-register initialization capability, the LSSD scan path must pass through the data registers and the input/output ports of each cell. Figure 3.11 shows one possible arrangement of a scan path within a computation cell. Registers X, Y and Z are the local output ports; S is the control register which selects the function performed by the computation cell; A and B are the data registers holding the preloaded coefficients; and R is a flip-flop used to save the result of a Boolean operation in the pattern matching algorithm.

[Figure 3.11. LSSD scan path within a computation cell, incorporating testing and register initialization.]

This arrangement provides the capability for testing the functionality of the individual cells, initializing the registers and preloading the coefficients. During test mode, a desired bit-vector is loaded into the array through the LSSD scan path. This vector contains the control information required for each cell, as well as the test data for the data registers. After the vector has been loaded, one system clock is applied so that each cell performs its required function. The results in the local output ports are then scanned out using the LSSD shift clock. This output vector contains the results of the operations performed by all the cells; comparing it with the expected results helps identify faulty cells, which may be bypassed by reconfiguring the chip. Fault tolerance in mixed systolic arrays employing LSSD techniques is discussed in more detail in Chapter 4, where this approach is extended to two-dimensional mixed systolic arrays.

CHAPTER 4
DESIGN OF A MULTIPURPOSE RECONFIGURABLE ARRAY PROCESSOR (MRAP)

Chapter 3 presented a design approach for mixed systolic arrays which employs the level-sensitive scan design technique for configuring the array architecture to match the structure of the algorithm to be implemented. This chapter extends the same approach to two-dimensional mixed systolic array architectures, and discusses the issues of fault tolerance, performance analysis and algorithm implementation. We present the design of an MSA-based processor called the multipurpose reconfigurable array processor (MRAP). This architecture implements both systolic and semisystolic algorithms, and incorporates fault tolerance in its design. The performance of the MRAP, taking into account the total computation time and the data-transfer bandwidth, is analyzed for specific algorithms. It is also demonstrated, by way of examples, how the MRAP can efficiently implement different systolic and semisystolic algorithms.

The structure of the MRAP is a two-dimensional array, which can be configured to implement a number of systolic and semisystolic algorithms involving two-dimensional linear recurrences, such as matrix manipulations. This two-dimensional array can also be partitioned into many independent linear arrays to implement independent algorithms involving one-dimensional linear recurrences, such as finite impulse response (FIR) filtering.

The next section describes the structure of the MRAP, the design of the basic functional cells, and the data flow in the array at the register-transfer level. Section 4.2 illustrates, by way of examples, how the MRAP can be configured to implement different parallel algorithms. In Section 4.3, the performance of the MRAP is analyzed and compared with other existing systolic structures. Section 4.4 investigates the fault tolerance capabilities inherent in the MRAP design. Finally, a summary and discussion are presented in Section 4.5.

4.1 STRUCTURE OF THE MULTIPURPOSE RECONFIGURABLE ARRAY PROCESSOR

The multipurpose reconfigurable array processor (MRAP) employs two types of basic elements, computation cells and switching cells, as shown in Figure 4.1. In general, the computation cells and the switching cells are mixed according to a certain mixing profile, which determines the possible frames of data-flow patterns within the array. The mixing profile is specified by the mixing density and the basis of the array [6]. The MRAP designed here is constructed from a diamond-like basis with a mixing density of 1/2, i.e., switching cells and computation cells are equally represented in the basis.
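As an illustration of a mixing profile, the sketch below lays out a small MRAP-like grid with mixing density 1/2. The tiling rule used here (cell type chosen by the parity of q1 + q2) is only one plausible reading of Figure 4.1, so treat the function as a stand-in for the real diamond-like basis, not its definition.

    def cell_type(q1, q2):
        """Map an index point to 'CC' or 'SC'; checkerboard stand-in for the basis."""
        return "CC" if (q1 + q2) % 2 == 0 else "SC"

    def draw_array(rows, cols):
        """Print the mixing pattern; half the cells are CCs, half SCs (density 1/2)."""
        for q2 in range(rows):
            print(" ".join(cell_type(q1, q2) for q1 in range(cols)))

    draw_array(4, 8)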
The control structure for communication and computation is embedded in both types of cells, in the form of control registers. We employ the LSSD technique to realize this control structure. All the control registers are threaded together through a Scan-In line, as shown in Figure 4.2. For a particular configuration, the control structure is realized by loading a binary vector through the Scan-In line in a bit-serial fashion. In the next section, we provide some examples of how the control vectors can be generated for implementing different algorithms.

[Figure 4.1. The structure of the Multipurpose Reconfigurable Array Processor (MRAP), built on a diamond-like basis. CC: computation cell; SC: switching cell.]
[Figure 4.2. Scan path for loading the control vectors.]

Figure 4.3 shows the system block diagram of the array; each part of the structure is described later in this section. The array can be configured in various topologies, depending upon the application, so that a variety of algorithms can be efficiently implemented. Various configurations obtained from this structure are discussed in Section 4.2. In the remainder of this section, we describe the design of the computation cell and the switching cell. A register-transfer-level data-flow description and a timing diagram for the computation cell are also given.

[Figure 4.3. General system diagram of the MRAP: function-select array, array of computation cells, and reconfigurable interconnection network, with control vector input and Scan-In/Scan-Out lines.]

4.1.1 THE COMPUTATION CELL

The computation cell is basically an inner product step processor with some extra control logic added to it. All computation cells are identical, so modularity is preserved in the design. This, of course, is an important factor in the efficient design and implementation of structures for VLSI computation. Each computation cell has six data ports, all of which are local ports except for the cells at the boundary of the array, where global I/O ports are used [6,8]. Because of the local I/O ports, the interconnections are simple, regular and short. Three control bits are used to select the functionality of the computation cell. There are three dedicated data registers, DRA, DRB and DRR, which can be used to hold the preloaded coefficients or the results of a local computation. There is also a three-bit control register, comprising shift-register latches and threaded into the Scan-In line. We shall refer to this control register as RC, and to its three latches as RC0, RC1 and RC2. The input-output functional mapping of a cell is determined by the contents of RC. Figure 4.4 shows a computation cell and describes some of its functions for different control vectors. A register-transfer-level description of the various operations is given later in this section.
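Loading a configuration is simply a long shift operation through the threaded control registers. The sketch below strings per-cell register widths into one scan chain and recovers each cell's bits from a serially loaded vector; the register order and the bit convention are assumptions of this sketch.

    def scan_load(widths, vector):
        """Split a serially loaded bit string among registers of the given widths.

        widths: control-register width per cell, in scan-chain order;
        vector: the binary control vector as a string of '0'/'1' characters."""
        assert len(vector) == sum(widths), "vector must fill the chain exactly"
        fields, pos = [], 0
        for w in widths:
            fields.append(vector[pos:pos + w])    # contiguous bits land in one register
            pos += w
        return fields

    # Example: four cells, each with a 3-bit RC register, all set to code 001.
    print(scan_load([3, 3, 3, 3], "001001001001"))    # ['001', '001', '001', '001']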
4.1.2 THE SWITCHING CELL

The switching cell used in this array is an extension of the two-by-two interchange box [52] and is shown in Figure 4.5. When the R/W control line is low, the switching cell acts as a single-stage, nonblocking SIMD interconnection network, which requires a three-bit control to realize all possible connections from the input lines A1, A2 to the output lines B0, B1 and B2 without resulting in a conflict. When the R/W line is high, A2 acts as an output line, which is connected to the input A0. R/W is a local I/O control provided by the neighboring computation cell at line A2. The control bits are held in the three-bit control register RS associated with each switching cell. Figure 4.5 shows a logic implementation of the switching cell. A function table for the switching cell is given in Table 4.1, where RS0, RS1 and RS2 are the bits of the control register RS.

4.1.3 DATA-FLOW AND TIMING

This section gives a register-transfer-level description of the data-flow in the array. Figure 4.6 shows a timing diagram for the instruction cycle of the computation cell when a two-phase system clock is used. There are three phases in the instruction cycle, as shown in Figure 4.6. Phases t0 and t1 are used to latch input data from the neighboring cells.

[Figure 4.4. The computation cell: arithmetic unit, data registers DRA, DRB and DRR, and the control register.]
[Figure 4.8(a). Dense-matrix/dense-matrix multiplication on the MRAP.]

4.2.1 DENSE-MATRIX/DENSE-MATRIX MULTIPLICATION

The switching cells are used in a broadcast type of configuration and should have the following interconnections:

    B1 = A1 ;  B2 = A1 .    (4.2)

This implies, from Table 4.1, that the control bits for the switching cells can be 000 or 001; we arbitrarily choose 001 here. The computation cells should perform the following functions:

    DRA <-- G ;  E <-- DRA ;  DRR <-- DRR + DRA*B .

The control bits for the above functions are 001, from Figure 4.4. So the control vector for the MRAP, in order to implement a broadcast two-dimensional array for efficient dense-matrix multiplication, is as follows:

    Control Vector (in HEX digits):  249249249249249249249249249

The above vector is loaded through the Scan-In line using a two-phase clock, with the right-most digit entering first. After all 108 bits have been shifted into the array in a bit-serial manner, each control register in the array contains the appropriate control bits.

4.2.2 BAND-MATRIX/DENSE-MATRIX MULTIPLICATION

Let A be an NxN band-matrix with a bandwidth W (W < N), and let B be an NxW dense-matrix. The matrix C = A*B can be computed by using the array configuration shown in Figure 4.8(b). This algorithm is similar to that described in [9], except that broadcasting is used in only one direction in our case, while in [9] broadcasting is used in both directions.

[Figure 4.8(b). Band-matrix/dense-matrix multiplication on the MRAP.]

The array configuration described in Subsection 4.2.1 for dense-matrix/dense-matrix multiplication would require NxN processors to solve the above problem, whereas the array described here needs only WxW processors to solve the same problem.
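The uniform HEX vector above follows directly from the packing rule: every control register in the scan chain receives the 3-bit code 001, and the concatenated bits are read out as hexadecimal digits. The Python sketch below reproduces the vector, assuming a chain of 36 three-bit registers (108 bits, 27 HEX digits).

    def pack_vector(codes):
        """Concatenate per-register bit codes and render them as HEX digits."""
        bits = "".join(codes)
        return format(int(bits, 2), "0{}x".format(len(bits) // 4))

    # 108-bit chain in which every register holds 001 (dense-matrix configuration):
    print(pack_vector(["001"] * 36).upper())    # 249249249249249249249249249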
The control vector for realizing the array shown in Figure 4.8(b) is determined as follows. The switching cells are still used in the broadcast configuration, connecting A1 to both B1 and B2; from Table 4.1, the control bits 001 are chosen. The computation cells perform the following functions:

    DRA <-- G ;  E <-- DRA ;  C <-- F + DRA*B .

The control bits for realizing the above functions are 010, as seen from Figure 4.4. So, in order to realize the configuration for efficient band-matrix/dense-matrix multiplication, the control vector for the MRAP is as follows:

    Control Vector (in HEX digits):  28A28A25145144A28A289451451

4.2.3 BAND-MATRIX/BAND-MATRIX MULTIPLICATION

Let A and B be two band matrices with bandwidths W1 and W2, whose product matrix C = A*B is to be computed. The array configuration for this computation is shown in Figure 4.8(c), for W1 = W2 = 4. This algorithm is the same as that proposed by Kung in [31] for efficient band-matrix multiplication. Note that the data-flow is in three directions in this case. The array can be realized by setting the switches to support diagonal data-flow.

[Figure 4.8(c). Band-matrix/band-matrix multiplication on the MRAP.]

The control bits for the switching cells are chosen to be 011 (from Table 4.1), and the control code for the computation cells is 011 (from Figure 4.4). The computation cells realize the following functions:

    DRA <-- G ;  E <-- DRA ;  G <-- D + DRA*F ;  C <-- F .

The following control vector configures the MRAP to realize Kung's hexagonal systolic array for narrow band-matrix multiplication:

    Control Vector (in HEX digits):  6DB6DB6DB6DB6DB6DB6DB6DB6DB

4.2.4 RECURSIVE FILTERING

An MRAP containing nxn computation cells can be partitioned into n independent, one-dimensional, linear systolic arrays. A one-dimensional linear array consisting of n processors can be used to solve an n-th order recurrence problem. Recursive digital filtering in signal processing is an example where such a recurrence equation is used. An n-th order recurrence problem is defined as follows: Given x_0, x_{-1}, ..., x_{-n+1}, compute x_1, x_2, ..., defined by

    x_i = F_i(x_{i-1}, ..., x_{i-n}) ,  for i > 0 ;    (4.3)

where F_i is a given recurrence function. For a large class of recurrence functions, an n-th order recurrence problem can be solved in real time on n linearly connected processors, such that a new x_i can be obtained at the output at regular time intervals [31]. A linearly connected broadcast array with n processors and one buffer can solve the n-th order recurrence problem [21]. Figure 4.8(d) shows the array configuration for implementing the recursive filtering algorithm. This algorithm is given in detail in [21], where it is shown to be optimal in the limit. The coefficients of the recursive equation are preloaded into the computation cells. The control vector for this configuration is as follows:

    Control Vector (in HEX digits):  30C30C26186184C30C309861861

[Figure 4.8(d). One-dimensional arrays for filtering.]
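For the recurrences of Equation (4.3), a scalar reference model is a few lines of code. The sketch below handles the common linear case F_i(x_{i-1}, ..., x_{i-n}) = \sum_j c_j x_{i-j}, which is the recursive-filter form whose coefficients are preloaded into the cells; the coefficient list c is an assumption standing in for whatever F_i the application defines.

    def recurrence(c, x_init, steps):
        """Evaluate x_i = sum_{j=1..n} c[j-1]*x_{i-j} per Eq. (4.3), linear case.

        c: coefficients c_1..c_n;  x_init: [x_{-n+1}, ..., x_{-1}, x_0]."""
        n = len(c)
        x = list(x_init)                     # history window, oldest value first
        out = []
        for _ in range(steps):
            x_new = sum(c[j] * x[-1 - j] for j in range(n))   # uses x_{i-1}..x_{i-n}
            out.append(x_new)
            x.append(x_new)
        return out

    # Fibonacci as a 2nd-order recurrence: c = [1, 1], history [x_{-1}, x_0] = [0, 1].
    print(recurrence([1, 1], [0, 1], 5))     # [1, 2, 3, 5, 8]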
4.3 PERFORMANCE ANALYSIS AND COMPARISON

The performance of an array processor can be measured by the efficiency of the parallel algorithms executed on it. Huang and Abraham have developed a criterion for measuring the efficiency of parallel algorithms, based on the Space-Time-Bandwidth complexity [21]. Bandwidth here refers to the data-transfer bandwidth, and is defined as the maximum number of words which have to be transferred through the I/O ports of the boundary cells in one time unit (a time unit being the period in which a processing element performs one operation). Let

    P = number of required processing elements;
    T = turnaround time of the computation;
    B = data-transfer bandwidth;
    C = number of operations in the computation task;
    I = number of input and output operands transferred.

It can be seen from the above that, for any computation task,

    T >= C/P ,    (4.4)
and
    T >= I/B .    (4.5)
Thus,
    P B T^2 >= C I .    (4.6)

The product of P, B and T^2 is the Space-Time-Bandwidth complexity of an algorithm executed on a processor array, and the product CI is the lower bound of this complexity. A measure of the efficiency of an algorithm is then defined as

    R = P B T^2 / (C I) .    (4.7)

A lower value of R means better performance, and R = 1 means an optimally implemented algorithm.

In the performance analysis of any semisystolic system, we must consider the time delays associated with the rippling of logic when deriving performance parameters such as throughput and response time. This is important because semisystolic systems do not meet the design criterion of extensibility, as systolic systems do. We cannot cascade many semisystolic arrays together to form an arbitrarily large array, because the clock cycle time, which depends upon the delay due to rippling, may asymptotically become arbitrarily large. In the MRAP structure, data ripples through the switching cells (SCs) whenever global computations are required, as in the case of semisystolic algorithms. We take the time delay due to this rippling into account in our analysis by defining the clock cycle time of the MRAP as

    T = T_c + m*T_r ,    (4.8)

where

    T   = clock cycle time for the MRAP,
    T_c = computing time of a computation cell (CC),
    T_r = propagation delay of a switching cell (SC),
    m   = maximum number of SCs in a ripple path.

In pure systolic systems, throughput and response time are derived in terms of the computing time of a computation cell (T_c). For most of the matrix arithmetic algorithms the computation cell is a multiplier-accumulator, so T_c approximately equals the time required for one multiplication and one addition. The data-transfer bandwidth of the array is defined as the maximum number of words transferred through the I/O ports of the array boundary in one time unit T_c. We can write the clock cycle time of the MRAP in T_c time units as

    T = 1 + m*k ,  where k = T_r/T_c .    (4.9)

In the remainder of this section, we analyze the performance of the MRAP in implementing various matrix multiplication algorithms, and compare this performance with other existing systolic computing structures. This analysis also gives us upper bounds on the size of specific semisystolic arrays, beyond which they do not remain efficient because of excessively long clock periods. The Space-Time-Bandwidth complexity of an algorithm is used as the criterion for evaluating the performance of an array processor.

4.3.1 DENSE-MATRIX/DENSE-MATRIX MULTIPLICATION

For dense-matrix multiplication, as described in Subsection 4.2.1, the total number of operations in the computation task is N^3 and the number of operands is 3N^2. So the lower bound on the complexity of this algorithm is

    C I = 3 N^5 .    (4.10)

The complexity of the algorithm when implemented on the MRAP, as shown in Figure 4.8(a), is derived as follows:

    P = N^2 ,  B = 2N ,  T = 3N*(1 + kN) ,
and
    P B T^2 = 18 N^5 * (1 + kN)^2 .    (4.11)

Therefore,

    R = 6*(1 + kN)^2 .    (4.12)
The same computation, when executed on the systolic array of [31], gives the following complexity and performance:

    P = 3N^2 ,  and  P B T^2 = 150 N^5 .    (4.13)

Therefore,

    R = 50 .    (4.14)

Comparing Equations (4.12) and (4.14), we get

    6*(1 + kN)^2 < 50 ,
or
    kN < 1.9 .    (4.15)

For a value of k = 0.05, the MRAP algorithm performs better for N < 38.

4.3.2 BAND-MATRIX/DENSE-MATRIX MULTIPLICATION

For band-matrix/dense-matrix multiplication, as described in Subsection 4.2.2, the number of computations C approaches N*W^2 and the number of input and output operands I approaches 3*W*N. So the lower bound for this algorithm is

    C I = 3 N^2 W^3 .    (4.16)

The algorithm, as implemented on the MRAP, is shown in Figure 4.8(b). The Space-Time-Bandwidth (PBT^2) complexity of this algorithm can be derived as follows:

    P = W^2 ,  B = 3W ,  T = (N + 2W)*(1 + kW) ,
and
    P B T^2 = 3 W^3 * (N + 2W)^2 * (1 + kW)^2 .    (4.17)

Therefore, from Equations (4.7), (4.16) and (4.17),

    R = (N + 2W)^2 * (1 + kW)^2 / N^2 ,
and
    R = (1 + kW)^2 ,  when N >> W .    (4.18)

The performance of the systolic algorithm given in [31] for solving the same problem can be derived as follows:

    P = (N + W)*W ≈ N*W  when N >> W ,
    B = (2/3)*(N + 2W) ≈ 2N/3  for large N ,
and
    P B T^2 = 32 N^4 W / 3 .    (4.19)

Therefore,

    R = 32 N^2 / (9 W^2) .    (4.20)

The performance of the MRAP can be compared with Kung's systolic array by examining Equations (4.18) and (4.20). Observe that the performance of the MRAP is not affected by the size N of the matrix, whereas the performance of the systolic array degrades with the size N of the input matrix. The MRAP performs better as long as the result of Equation (4.18) is less than that of Equation (4.20). Comparing Equations (4.18) and (4.20), we get the following condition under which the MRAP performs better:

    (1 + kW)^2 < (32/9)*N^2/W^2 ,
and
    kW < 1.9*N/W ,  when N >> W .    (4.21)

Since typically k << 1 and W << N, the above condition is easily met. For an array employing 16-bit integer multiplier-accumulator (MAC) cells, if a 16 x 16 multiplier array and an accumulator are used in the MAC cell, the multiply-accumulate time is approximately equal to 60 gate-delays (assuming that a full-adder has a propagation time equal to three gate-delays) [55]. The propagation delay in the switching cell is equal to three gate-delays (see Figure 4.5). So the value of k is approximately 0.05, which means that the condition under which the MRAP performs better is W^2 < 38*N.

4.3.3 BAND-MATRIX/BAND-MATRIX MULTIPLICATION

For band-matrix multiplication, as described in Subsection 4.2.3, the MRAP is configured as a pure systolic array, and the algorithm given in [31] is implemented on it. The performance of the MRAP for this computation is analyzed as follows:

    P = W1*W2 ,  B = 2*(W1 + W2)/3 ,  T = 3N*(1 + k) .

So,

    P B T^2 = 6*(W1 + W2)*W1*W2*N^2*(1 + k)^2 ,    (4.22)
and
    C I = 2*(W1 + W2)*W1*W2*N^2 .    (4.23)

Therefore,

    R = 3*(1 + k)^2 .    (4.24)

For the systolic array of [31], the value of R is 3. So the performance of the MRAP nearly equals that of Kung's systolic array, which is very efficient for narrow band-matrix multiplication. The MRAP performance is degraded by a factor of (1 + k)^2, due to the propagation delay in the switching cells. Since k << 1, this degradation in performance is insignificant.
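The crossover conditions above are easy to tabulate. The following sketch evaluates the two efficiency measures of Subsection 4.3.1, R = 6(1 + kN)^2 for the MRAP versus R = 50 for the pure systolic array, and reports the largest N for which the MRAP wins; it simply re-evaluates Equations (4.12) and (4.14), nothing more.

    def r_mrap_dense(n, k):
        """Eq. (4.12): efficiency of dense-matrix multiplication on the MRAP."""
        return 6 * (1 + k * n) ** 2

    R_SYSTOLIC_DENSE = 50           # Eq. (4.14): the fixed systolic-array figure

    k = 0.05                        # Tr/Tc for the 16-bit MAC example in the text
    n = 1
    while r_mrap_dense(n + 1, k) < R_SYSTOLIC_DENSE:
        n += 1
    print(n)                        # 37, i.e. the MRAP performs better for N < 38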
4.4 FAULT TOLERANCE

The MRAP architecture incorporates fault tolerance in its design, owing to its properties of reconfigurability and programmability at the individual-cell level. In the MRAP, switch settings are used for data routing among the computation cells. An MRAP architecture can be viewed as many reconfigurable pipelines working together. If a cell in the array fails, the array may be reconfigured to bypass the faulty cell, with data then flowing through the properly functioning cells. The performance of the array may be degraded, but the entire system does not fail. It is assumed here that fault detection is done by the host and that faulty cells can be identified. This is possible in the MRAP structure because of the presence of the LSSD latches and the Scan-In and Scan-Out lines, which provide both observability and controllability in the system. Inclusion of the data registers in the scan path provides testability for the functional cells and also facilitates the initialization of the data registers within the cells at the time of preprocessing the chip.

The examples below demonstrate the fault tolerance capability of the MRAP. Figure 4.9 shows the computation cells of the MRAP, configured to implement linear arrays solving recursive filtering problems of fourth order. 'F' indicates a faulty cell, 'S' a spare cell and 'U' a used cell. If a fault occurs in a computation cell, the column containing that computation cell shuts down and the other working cells in this column become spares, as shown in Figure 4.9(a). Figure 4.9(b) shows how spare cells can be used to mask out two other faulty cells, A and B, such that the array has the same performance as with only one fault. More faults may cause more columns to be shut down, but that will provide even more spare cells to mask out further faults. So the system may eventually fail, but with graceful degradation.

[Figure 4.9. (a) The MRAP with one faulty computation cell; (b) spare cells used for fault masking. U: used cell; S: spare cell; F: faulty cell.]

When two-dimensional algorithms are implemented on the MRAP, fault tolerance becomes more difficult and degradation occurs at a higher rate. This is because data flows in multiple directions, so the output of a faulty cell propagates into all the pipelines which share that computation cell. In this case, a single faulty cell causes the shutdown of all the pipelines sharing that cell. All the working cells in the shut-down pipelines become available as spares. Figure 4.10 illustrates how the data-flow can be reconfigured to mask out the faults due to failed cells A and B.

[Figure 4.10. Fault tolerance in the MRAP for two-dimensional algorithms.]
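The column-shutdown policy for the linear-array case can be phrased as a few lines of bookkeeping. The sketch below marks every cell in a faulty cell's column as spare and reassigns each row's pipeline to the surviving columns; the grid representation and the policy details are illustrative assumptions, not the MRAP's actual reconfiguration logic.

    def remap_rows(rows, cols, faulty):
        """Return, per row, the columns its linear pipeline uses after masking faults.

        faulty: set of (row, col) positions of failed computation cells."""
        dead_cols = {c for (_, c) in faulty}          # a fault shuts its whole column down
        usable = [c for c in range(cols) if c not in dead_cols]
        return {r: usable for r in range(rows)}       # every row routes around dead columns

    # A 4x6 MRAP with faults at (1, 2) and (3, 4): all rows keep 4 working columns.
    print(remap_rows(4, 6, {(1, 2), (3, 4)}))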
4.5 SUMMARY AND DISCUSSION

In this chapter, we presented the design, and analyzed the performance, of a VLSI-compatible two-dimensional mixed systolic array architecture called the multipurpose reconfigurable array processor (MRAP). This array processor can efficiently implement various systolic and semisystolic algorithms. Level-sensitive scan design is employed to achieve reconfigurability and multifunctionality, and this design approach provides fault tolerance capabilities in the mixed systolic array architecture. In general, the MRAP performs better than systolic arrays for specific semisystolic algorithms. We developed expressions for the performance of the MRAP when executing various algorithms. These expressions determine the conditions under which a certain algorithm can be efficiently implemented on the MRAP, according to a specific performance criterion.

The performance of the MRAP degrades with the array size for semisystolic algorithms, because the propagation delay due to the switching cells becomes more significant in large arrays and tends to dominate the execution time. However, when systolic algorithms are implemented on the MRAP, the array performance is not affected by the size. It can be seen from the expression developed for the performance of the band-matrix multiplication algorithm in Section 4.3 that the degradation occurs by a factor of (1 + k)^2. This degradation is quite insignificant, because the degradation factor is a constant and k << 1. This is a relatively small penalty on performance, in view of the reconfigurability and algorithmic flexibility achieved with this architecture.

Since amplitude levels are regenerated in passing through the switching cells, no broadcasting bus or large bus drivers are required in the implementation of the MRAP. In general, there are two main disadvantages associated with data broadcasting. First, large broadcasts cannot be implemented in a single communication cycle, because the broadcast delay can dominate the execution time of an algorithm. Second, large drivers are required for broadcasting, in order to drive the combined load of all the cells connected to the broadcasting bus. In the MRAP, broadcasting is implemented through switching cells, and each switching cell output has to drive a load of only one gate input. As the switching cells are used for reconfigurability purposes anyway, the elimination of the broadcasting bus and its drivers comes as an additional benefit of the MRAP design.

In the next chapter, we present a mathematical model for reconfigurable mixed systolic array processors, and a step-by-step procedure to automatically generate the control vectors required for configuring the array to implement a specific algorithm. The procedure is illustrated by mapping the finite impulse response filtering algorithm and the priority queue algorithm into the linear reconfigurable systolic array.

CHAPTER 5
ALGORITHM IMPLEMENTATION ON MIXED SYSTOLIC ARRAYS

In the previous chapters, we presented a methodology for loading and implementing the configuration control structure in mixed systolic array processors, to establish the interconnection pattern required for algorithm implementation. This methodology employs LSSD techniques, and the control structure for a particular configuration is loaded into the array as a binary control vector in a bit-serial fashion. Efficient implementation of algorithms on MSA processors requires exploitation of the parallelism in the algorithm and mapping of the algorithm's communication structure into the MSA processor's interconnection structure. This chapter deals with the issues of relating the MSA's architectural model to parallel algorithms, and investigates procedures for generating the control code required for implementing algorithms on MSA architectures.

In this chapter, we present a mathematical formalism for modeling reconfigurable mixed systolic array architectures, and a methodology for reconfiguring the array by mapping the communication structure of an algorithm into the interconnection structure of the array. A step-by-step procedure is presented to map a given algorithm into the mixed systolic array architecture and then generate the control code required to implement the corresponding interconnection structure. The mapping procedure is based on time and space transformations of the data dependence vectors of the algorithm.
These transformations describe the data-flow and timing of the algorithm, and dictate the interconnection structure required for its implementation.

Sections 5.1 and 5.2 present mathematical models for reconfigurable MSA processors and for parallel algorithms, respectively. A procedure for mapping algorithms onto MSA processors is given in Section 5.3. This procedure is then used in Section 5.4 to map two sample algorithms, the finite impulse response (FIR) filtering algorithm and the priority queue algorithm, into the linear reconfigurable systolic array.

5.1 A MODEL FOR RECONFIGURABLE MIXED SYSTOLIC ARRAYS

A reconfigurable mixed systolic array processor is a 5-tuple (Q^n, D, F_c, F_s, R), where

    Q^n ⊂ Z^n is the index set of the array processor;
    D is a transformation from Q^n to B = {0, 1}, i.e., D: Q^n --> B describes whether an index corresponds to a switching cell or a computation cell;
    F_c is the set of functions a computation cell can perform;
    F_s is the set of interconnection configurations of a switching cell; and
    R ∈ Z^{n x r}, r ∈ I, is a matrix of cell-to-cell interconnection primitives.

This mathematical model is general enough to represent conventional systolic arrays with fixed structures, as well as multipurpose reconfigurable arrays containing programmable interconnection cells and multifunctional arithmetic and logic units.

The index set Q^n refers to an n-dimensional array structure. Most practical arrays are either linear (n = 1) or two-dimensional (n = 2). Each cell in the array is represented by its index q ∈ Z^n, such that for all q ∈ Q^n,

    q = [q_1, q_2, ..., q_n] ,  q_i ∈ Z ,  1 <= i <= n .

D is a mapping from the index set Q^n to the set of binary digits {0, 1}, such that D specifies every cell in the array to be either a computation cell (CC) or a switching cell (SC). So, for all q ∈ Q^n,

    D(q) = 0 if q corresponds to an SC,
           1 if q corresponds to a CC.

F_c is the set of computations (arithmetic or logical) performed by the computation cells. F_s is the set of all interconnection configurations of a switching cell. For VLSI structures, we can assume that all CCs and all SCs are identical, in order to preserve modularity and extensibility. In a case where the cells are not identical, a surjective mapping f_c can be defined from Q^n to F_c, f_c: Q^n --> F_c, where f_c associates a subset f_c(q) ⊆ F_c with each index q ∈ Q^n.

R is a matrix of cell-to-cell interconnection primitives, R = [r_1 r_2 r_3 ... r_p], where each r_j, 1 <= j <= p, is a column vector indicating a directed communication link. If r_j ∈ R, then any q ∈ Q^n is connected to q' = q + r_j if q' ∈ Q^n, and is connected to an I/O port if q' ∉ Q^n.

VLSI arrays with fixed structures, such as those in [4,32,35], can be represented by the above model with D(q) = 1 for all q ∈ Q^n and with F_s an empty set. In the remainder of this section, we present two examples to illustrate how the above model represents reconfigurable array structures.

EXAMPLE 5.1.1. The Linear Reconfigurable Systolic Array (LRSA), described in Chapter 3 and shown in Figure 5.1, is represented as (Q^2, D, F_c, F_s, R), where

    Q^2 = { q = (q_1, q_2)^T : 0 <= q_1 <= N-1, 0 <= q_2 <= 2 } ;

    F_c = { (O_1 <-- I_1, O_2 <-- I_2, O_3 <-- A*I_2 + B*I_1 + I_3),
            (O_1 <-- B, O_2 <-- A*I_1 + I_2, B <-- I_1),
            (O_1 <-- I_3, O_3 <-- I_3),
            (O_1 <-- I_1, O_3 <-- I_3, R <-- R AND (I_1 EQ I_3)),
            (O_1 <-- I_1, O_2 <-- I_1*I_2 + A) } ;    (5.3)

    F_s = { (S_3 = S_1, S_4 = S_2), (S_3 = S_2, S_4 = S_1) } ;    (5.4)

    R  =  [ 1 -1  0  0 ]
          [ 0  0  1 -1 ] ,    (5.5)

whose columns are the nearest-neighbor interconnection primitives. For an LRSA with N = 4, Q^2 and D are as follows:

    Q^2 = {00, 10, 20, 30, 01, 11, 21, 31, 02, 12, 22, 32} ;
    D(q) = 1 for q_2 = 1 (the computation cells), and D(q) = 0 for q_2 = 0 or q_2 = 2 (the switching cells).

[Figure 5.1. The Linear Reconfigurable Systolic Array (LRSA) architecture: (a) structure of the LRSA, (b) a switching cell (SC), (c) a computation cell (CC).]
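The 5-tuple can be carried around directly as a data structure when experimenting with mappings. The sketch below encodes Example 5.1.1 in Python; the field names, the encoding of F_c and F_s as opaque labels, and the choice of primitives in R are conveniences and assumptions of this sketch, not part of the formal model.

    from dataclasses import dataclass

    @dataclass
    class MSAModel:
        """The 5-tuple (Q^n, D, Fc, Fs, R) of Section 5.1."""
        Q: set            # index set Q^n
        D: dict           # index -> 1 (CC) or 0 (SC)
        Fc: list          # labels of computation-cell functions
        Fs: list          # labels of switching-cell configurations
        R: list           # interconnection primitives as tuples

    N = 4
    Q = {(q1, q2) for q1 in range(N) for q2 in range(3)}
    lrsa = MSAModel(
        Q=Q,
        D={q: 1 if q[1] == 1 else 0 for q in Q},          # middle row holds the CCs
        Fc=["001", "010", "011", "100", "101"],           # function-select codes of Ch. 3
        Fs=["straight", "exchange"],
        R=[(1, 0), (-1, 0), (0, 1), (0, -1)],             # nearest-neighbor links (assumed)
    )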
In Step 2 of the mapping procedure, a time transformation T is chosen subject to

    T * g_i > 0 ,  for all g_i ∈ G .    (5.21)

T is a (1xn) vector which maps the index set of the algorithm into the one-dimensional time space. The total execution time of the algorithm is given by

    t = max{ T*(q_a - q_b) : q_a, q_b ∈ L^n } / min{ T*g_j : g_j ∈ G } + 1 ,    (5.22)

where L^n is the index set of the algorithm. An optimal transformation is one which minimizes t in the above equation. The optimal transformation is tried in the first pass; if it cannot be mapped, other suboptimal transformations are tried.

Step 3 is probably the most computationally intensive step, since a system of diophantine equations has to be solved. The existence of a valid transformation S indicates that the algorithm can be mapped onto the mixed systolic array model (Q^n, D, F_c, F_s, R). S contains the information about the communication structure of the algorithm. The dimensions of the utilization matrix K are r x q, where r is the number of communication links from one CC to its neighboring CCs in different directions, and q is the number of dependence vectors in the matrix G. The directions of data-flow correspond to those unit vectors in H for which the corresponding row of K has at least one nonzero element. Rows of K whose elements are all zero correspond to directions in which communication does not take place. The matrix K can be generated by combining patterns satisfying Constraints (5.19) and (5.20) as its columns.

Step 4 gives the information about the timing and data-flow of the algorithm. The product T.G gives the speeds at which the different variables travel, and the product S.G describes their directions of travel. Step 5 determines the configuration of each switching cell and the computation performed by each computation cell; this gives the control code to be stored in each cell. The sets M and N are used in conjunction with F_s and F_c to determine the appropriate control codes. Step 6 generates the final reconfiguration control vector by concatenating the control codes of the cells in such a sequence that the binary vector loaded through the LSSD scan path puts the appropriate code in every cell.

5.4 EXAMPLES

In this section, we illustrate how the above procedure is used to generate configuration control vectors for implementing two algorithms, finite impulse response filtering and the priority queue, on the linear reconfigurable systolic array (LRSA) described in Chapter 3. The structure of the LRSA is shown in Figure 5.1, and the mathematical model of this array is given in Example 5.1.1. We assume that the computation cell of the LRSA can perform a compare-and-exchange operation, in addition to the functions described in Equation (5.3). Compare-and-exchange is the basic operation needed in a priority queue algorithm. This operation, f_c1, is described as follows, with reference to Figure 5.1(c):

    f_c1 = ( O_1 <-- Maximum(I_1, I_2, I_3),
             O_2 <-- Medium(I_1, I_2, I_3),
             O_3 <-- Minimum(I_1, I_2, I_3) ) .    (5.23)

The set of control codes for the computation cells, N, consists of six elements, N = {1,2,3,4,5,6}, where codes 1 through 5 imply the same computations as described in Chapter 3, and code 6 implies the compare-and-exchange operation described above. The set of control codes for the switching cells, M, consists of only two elements, since we assume that only two configurations of a switching cell are implementable, straight and exchange: M = {0, 1}, where 0 implies the straight configuration (S_3 = S_1, S_4 = S_2) and 1 implies the exchange configuration (S_3 = S_2, S_4 = S_1) (see Figure 5.1(b)).
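Step 3 can be prototyped by brute force for small arrays: enumerate candidate space transformations S over a small integer range and, for each dependence vector, look for a utilization column satisfying Constraints (5.19) and (5.20). The sketch below does exactly that for a linear array with H = [1, 0, -1]; the search ranges are arbitrary assumptions for illustration, and in general several mirror-image mappings are valid.

    from itertools import product

    def space_transforms(G, T, H=(1, 0, -1), span=1):
        """Enumerate S (Step 3): for every dependence g there must be a k with
        k_j >= 0 (5.19), sum_j k_j <= T.g (5.20) and S.g = H.k (5.18)."""
        def feasible(S, g):
            sg = sum(s * gi for s, gi in zip(S, g))
            tg = sum(t * gi for t, gi in zip(T, g))
            return any(sum(h * k for h, k in zip(H, ks)) == sg
                       for ks in product(range(tg + 1), repeat=len(H))
                       if sum(ks) <= tg)
        n = len(G[0])
        return [S for S in product(range(-span, span + 1), repeat=n)
                if any(S) and all(feasible(S, g) for g in G)]

    # FIR example of Subsection 5.4.1: G columns (y, a, x), T = [1 1].
    sols = space_transforms([(0, 1), (1, 0), (1, 1)], (1, 1))
    print(sols)                      # several valid mappings; the text chooses S = [0 1]
    assert (0, 1) in sols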
The control codes for all the cells are eventually loaded, in their binary form, into the corresponding registers. Each switching cell has a one-bit control code register, and each computation cell has a three-bit control code register. For the LSSD scan path shown in Figure 5.4, the sequence W can be written as

    W = [30, 31, 32, 22, 21, 20, 10, 11, 12, 02, 01, 00] .    (5.24)

[Figure 5.4. LSSD scan path sequence in an LRSA.]

5.4.1 FINITE IMPULSE RESPONSE (FIR) FILTERING ALGORITHM

We now demonstrate how an FIR filtering algorithm is mapped onto the Linear Reconfigurable Systolic Array shown in Figure 5.1. The high-level language description of the FIR filtering algorithm, and the derivation of its dependency matrix G, are given in Example 5.2.2. We follow the above procedure step by step to generate the configuration control vector required to implement the FIR filtering algorithm on the LRSA.

Step 1: From Section 5.2, the dependency matrix for the FIR filtering algorithm is

              y  a  x
    G  =  [   0  1  1  ]    (5.12)
          [   1  0  1  ] ,

with one column for each of the variables y, a and x.

Step 2: Many time transformations T exist which give a valid execution ordering. We choose T = [1 1], which satisfies the condition T*g_i > 0 for every g_i ∈ G and minimizes the execution time of the algorithm.

Step 3: Solve the set of diophantine equations

    S . G = H . K .    (5.18)

H is the set of direction vectors indicating the CC-to-CC communication links. For the case of a linear array like the LRSA,

    H = [ 1  0  -1 ] .    (5.25)

A 1 in H implies that data flows towards the right, a -1 implies the opposite direction of data-flow, and a 0 implies that data stays at its location for later use. The utilization matrix chosen to solve the above equations is

    K  =  [ 1  0  1 ]
          [ 0  1  0 ]    (5.26)
          [ 0  0  0 ] .

K is selected such that

    k_ji >= 0 ,    (5.19)
and
    \sum_j k_ji <= T * g_i .    (5.20)

The only solution for S, with the matrix K as in Equation (5.26), is [0 1]. A row of K consisting of all zero elements indicates that a communication link in the corresponding direction is not required. In the matrix K above, the last row is a zero row; this row corresponds to the column of H containing the direction vector -1, so no data-flow occurs in that direction.

Step 4: The transformation matrix is

    A  =  [ T ]  =  [ 1  1 ]    (5.27)
          [ S ]     [ 0  1 ] .

As A is a nonsingular matrix, we have a valid transformation. Now the data-flow and timing of the algorithm can be determined by examining the product A.G. The first row of A.G corresponds to the timing, and the other rows correspond to the directions in which the different variables travel:

            [ T.G ]     [ 1  1  2 ]
    A.G  =  [     ]  =  [         ]    (5.28)
            [ S.G ]     [ 1  0  1 ] .

The first row indicates that variable x travels half as fast as variable y, and the second row shows that x and y travel in the same direction. The element of the second row corresponding to variable a is 0, indicating that variable a does not travel, but stays at the same location. Since x travels half as fast as y, a delay equal to one time step is introduced in the path of x at each computation cell. The algorithm implementation for this transformation is shown in Figure 5.5. This is the same algorithm as that proposed by Kung in [32].

[Figure 5.5. A linear systolic array for FIR filtering.]
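These matrix products are small enough to verify mechanically. The following lines recompute Equations (5.27) and (5.28) and check the constraints, so a typo in T, S or G is caught immediately; nothing here goes beyond the numbers already derived.

    T, S = [1, 1], [0, 1]
    G = [[0, 1, 1],          # dependence matrix of Eq. (5.12), columns y, a, x
         [1, 0, 1]]
    H = [1, 0, -1]
    K = [[1, 0, 1],          # utilization matrix of Eq. (5.26)
         [0, 1, 0],
         [0, 0, 0]]

    matvec = lambda v, M: [sum(vi * M[i][j] for i, vi in enumerate(v))
                           for j in range(len(M[0]))]

    assert matvec(T, G) == [1, 1, 2]        # timing row of Eq. (5.28): x is twice as slow
    assert matvec(S, G) == [1, 0, 1]        # direction row: y and x move right; a stays
    assert matvec(S, G) == matvec(H, K)     # the diophantine system S.G = H.K, Eq. (5.18)
    assert T[0] * S[1] != T[1] * S[0]       # det A != 0: the transformation is valid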
Steps 5 and 6: The function performed by each computation cell is (see Figure 5.1(c))

    ( O_1 <-- B, O_2 <-- A*I_1 + I_2, B <-- I_1 ) ,

and the switching cells are in the straight configuration. The control code for each computation cell is (2)_10 = (010)_2, and for each switching cell the control code is 0. Using the LSSD sequence W of Equation (5.24), the configuration control vector V is

    V = [0, 010, 0, 0, 010, 0, 0, 010, 0, 0, 010, 0]
      = (00100001000010000100)_2
      = (21084)_16 .    (5.29)

V is loaded serially, right-most bit first, through the LSSD scan-in line shown in Figure 5.4.

5.4.2 PRIORITY QUEUE

A priority queue is an abstract data type based on the set model, with the operations INSERT and DELETEMIN, as well as an operation for initializing the data structures [1]. The elements of the set A have a priority function defined on them; for each element a, the priority of a, p(a), is a real number from some linearly ordered set. The operation INSERT(a,A) replaces the set A with the set A ∪ {a}, and the operation DELETEMIN returns some element of smallest priority and deletes it from the set.

Let A be the sorted list of elements in the queue, A = {a_1, a_2, ..., a_n}, such that a_1 contains the smallest-priority element and a_n contains the highest-priority element. Another array, B = {b_1, b_2, ..., b_n}, contains the elements which are being sorted. We define two constants, 'max' and 'min', such that 'max' is higher than the highest possible priority and 'min' is lower than the least priority that can exist. Variables a_0 and b_0 are input variables. The sequential algorithm can be described by the following procedure, in which the three assignments of the inner loop take effect simultaneously:

    Procedure INSERT(x)
    begin
      for j = 0 to n
        for i = 0 to n
          a_{i+1} := med(b_i, a_i, a_{i+1})
          a_i     := min(b_i, a_i, a_{i+1})
          b_{i+1} := max(b_i, a_i, a_{i+1})
        end i
      end j
    end {INSERT}

Step 1: There are nine data dependence vectors between the generated and the used variables. The data dependence matrix is formed by combining these vectors as follows, with the columns grouped by generated variable, three dependence vectors each:

    G  =  [ 1  1  1    1  1  1    1  1  1 ]
          [ 1  0  1    1  0  1    0 -1  0 ] .    (5.30)

Step 2: Find a time transformation T satisfying Constraint (5.21). One T which satisfies this constraint is T = (1 0). The execution time for this value of T, found from Equation (5.22), is N+1 time steps for inserting a number into a queue of size N. This indicates that the inserted element moves at a rate of one position per time step. This transformation also minimizes the execution time, and hence is an optimal transformation. This can be verified by solving for the minima of Equation (5.22) with respect to the variables t_1 and t_2, where T = (t_1 t_2).

Step 3: Solve the diophantine equations S.G = H.K to find a suitable S. The matrix H is the same as in the previous example, i.e., H = [1 0 -1]. The size of the utilization matrix K is 3x9, because there are nine data dependence vectors in the matrix G and three possible directions of data flow. S = [0 1] satisfies the equations for the following utilization matrix:

    K  =  [ 1  0  1  1  0  1  0  0  0 ]
          [ 0  1  0  0  1  0  1  0  1 ]    (5.31)
          [ 0  0  0  0  0  0  0  1  0 ] .

So the space transformation is S = [0 1].

Step 4: The transformation matrix is

    A  =  [ T ]  =  [ 1  0 ] .
          [ S ]     [ 0  1 ]

A is a nonsingular matrix, so the transformation is valid.

            [ T.G ]     [ 1  1  1  1  1  1  1  1  1 ]
    A.G  =  [     ]  =  [                           ]    (5.32)
            [ S.G ]     [ 1  0  1  1  0  1  0 -1  0 ] .

The row corresponding to S.G implies that variable a_i should travel in the direction of the unit vector (-1), i.e., towards the left, and that the variables a_{i+1} and b_{i+1} should travel towards the right. The systolic array formed by this transformation is shown in Figure 5.6. This array is similar to the systolic priority queue given in [41].
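The compare-and-exchange cell of Equation (5.23) can be exercised in software to observe the queue behavior. The sketch below pushes each new value through a chain of min/med/max steps until the kept list is sorted; the sweep loop stands in for the systolic timing, so it shows the effect of the data movement, not its cycle-accurate schedule, and the sentinel handling is a convention of this sketch.

    def insert(queue, x, big=float("inf")):
        """Insert x into the sorted queue using repeated 3-way compare-exchange."""
        a = list(queue) + [big, big]         # 'max' sentinels bound the queue, as in the text
        b = x
        for i in range(len(a) - 1):          # b sweeps right; small values settle into a
            trio = sorted([b, a[i], a[i + 1]])
            a[i], a[i + 1], b = trio[0], trio[1], trio[2]   # min, med stay; max moves on
        return a[:-2]

    q = []
    for v in [5, 1, 4, 2]:
        q = insert(q, v)
    print(q)                                 # [1, 2, 4, 5]; DELETEMIN would take q[0]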
Steps 5 and 6: The control code for each computation cell, to perform the compare-and-exchange operation, is (6)_10 = (110)_2. The control code for each switching cell is 0. From the LSSD sequence W, we can write the configuration control vector V as

    V = [0, 110, 0, 0, 110, 0, 0, 110, 0, 0, 110, 0]
      = (01100011000110001100)_2
      = (6318C)_16 .    (5.33)

[Figure 5.6. A systolic priority queue. X' = min(X,Y,Z); Y' = med(X,Y,Z); Z' = max(X,Y,Z).]

5.5 SUMMARY AND DISCUSSION

We presented a mathematical formalism for modeling reconfigurable mixed systolic arrays, and a step-by-step procedure for generating the control code required for implementing algorithms on mixed systolic array structures. The procedure presented in this chapter provides a facility for automating the reconfiguration process and the design of mixed systolic array architectures. Starting from a high-level language description of an algorithm, this procedure generates the final configuration control vectors required to be loaded into the array through the LSSD scan path in a bit-serial manner. One might argue that the data dependency matrix cannot be easily derived, but research in optimizing compilers has shown that this can be done automatically for a large class of algorithms [30].

The basic idea of this chapter is to relate the mixed systolic array's architectural model to the parallel algorithms implemented on the array, in order to facilitate their implementation. The data dependencies in an algorithm determine the communication structure required for its implementation. Linear time and space transformations of the algorithm are selected from the data dependency matrix of the algorithm and the interconnection structure of the mixed systolic array. For a given mixed systolic array structure, the optimal time transformation is chosen first, in order to minimize the total execution time. If no valid space transformation exists to implement the optimal time schedule, a suboptimal time transformation is chosen for which a valid space transformation exists. Algorithms with constant data dependence vectors are easier to map onto array structures and result in simple data-flow patterns. For some algorithms, such as the fast Fourier transform (FFT) and the bitonic sort, a remapping transformation is required in order to increase the efficiency of the algorithms [30]. Transformations for algorithms with tree communication geometry, such as searching algorithms, are not known.

The concepts presented in this chapter are not restricted to mixed systolic arrays; they can also be used for formalizing the reconfiguration processes of other polymorphic multiprocessor architectures. Several research issues emanate from this work and need further investigation. Procedures should be developed which allow the user to obtain sets of valid transformations for other classes of algorithms, with arbitrary index sets and nonconstant data dependencies. Second, this approach should be extended to the silicon compilation of reconfigurable arrays, to generate masks for an optimal or near-optimal array which can implement the algorithms in a given set. Finally, further research is needed to define a unifying performance index for reconfigurable arrays, measuring the overall array performance while taking into consideration the speed, the complexity of the reconfiguration process, the cell complexity and the applicability of the array.
CHAPTER 6
CONCLUSIONS

6.1 SUMMARY

This research investigated structured methodologies for designing, and mapping parallel algorithms into, a class of reconfigurable multiprocessor architectures called the mixed systolic array (MSA). The primary objective of this research was to investigate procedures for loading and implementing the configuration control structure in MSAs, and to relate the architectural model of the MSA to the parallel algorithms implemented on these architectures. The following specific tasks were addressed in this work:

1. Investigate procedures for loading and implementing the distributed control structure required to establish a desired interconnection pattern on the MSA structure.

2. Relate the MSA's architectural model to the parallel algorithms implemented on the array, and investigate procedures for implementing algorithms on MSA structures by systematically generating the control code required for reconfiguration.

The work completed under each of the above tasks is summarized below.

In Chapter 3, we presented a serial loading procedure for implementing the distributed control structure required for reconfiguration in MSAs. This procedure employs LSSD techniques and loads the control structure for a particular configuration into the array through the LSSD scan path, as a binary vector, in a bit-serial fashion. The main advantage of this approach is that the overhead due to additional pinouts is limited to only three or four pins. Also, serial loading of the control structure reduces the overhead in terms of additional on-chip interconnections due to the control hardware. Chapter 3 presented the design of a linear reconfigurable systolic array architecture which is statically reconfigurable to realize any of the following: a filtering array, an FIR filtering array, a pattern matching array, and a Discrete Fourier Transform (DFT) array. Shift-register latches (SRLs) are used to hold the control information for setting up the interconnection configuration and for selecting the functional mapping of the computation cells. Both the data flow through the array and the functions executed in the computation cells are established by the serially loaded configuration control vector.

In Chapter 4, we extended the approach of Chapter 3 for loading the MSA configuration control structure to two-dimensional MSA structures. The design of an MSA-based processor, called the multipurpose reconfigurable array processor, was presented. This architecture implements both systolic and semisystolic algorithms, and incorporates fault tolerance in its design. The performance of the MRAP, taking into account the total computation time and the data-transfer bandwidth, was analyzed for specific algorithms, and it was demonstrated, by way of examples, how the MRAP can efficiently implement different systolic and semisystolic algorithms. This two-dimensional array can also be configured as many independent linear arrays to implement independent algorithms involving one-dimensional linear recurrences, such as finite impulse response (FIR) filtering.

The properties of reconfigurability and programmability at the individual-cell level lead to fault tolerance capabilities in MSA architectures, as discussed in Section 4.4. A two-dimensional MSA architecture can be viewed as many reconfigurable pipelines working together.
If a cell in the array fails, the array may be reconfigured to bypass the faulty cell, with data then flowing through the properly functioning cells. The performance of the array may be degraded, but the entire system does not fail, and the system achieves a graceful-degradation property.

Efficient implementation of algorithms on VLSI structures requires exploitation of the parallelism in the algorithm and mapping of the algorithm's communication structure into the processor interconnection structure. Chapter 5 presented a step-by-step procedure for implementing a given algorithm on an MSA structure by generating the control code required to reconfigure the array. The mapping procedure is based on time and space transformations of the data dependence vectors of the algorithm. These transformations provide a description of the data-flow and timing, and dictate the interconnection structure required to implement the algorithm on the array. Starting from a high-level language description of an algorithm, this procedure generates the final configuration control vectors required to be loaded into the array through the LSSD scan path in a bit-serial manner.

The basic idea of Chapter 5 is to relate the mixed systolic array's architectural model to the parallel algorithms implemented on the array, in order to facilitate their implementation. The data dependencies in an algorithm determine the communication structure required for its implementation. Linear time and space transformations of the algorithm are selected from the data dependency matrix of the algorithm and the interconnection structure of the mixed systolic array. For a given mixed systolic array structure, the optimal time transformation is chosen first, in order to minimize the total execution time. If no valid space transformation exists to implement the optimal time schedule, a suboptimal time transformation is chosen for which a valid space transformation exists. Algorithms with constant data dependence vectors are easier to map onto array structures and result in simple data-flow patterns. To illustrate the methodology and explain the reconfiguration procedure, two sample algorithms, the finite impulse response (FIR) filtering algorithm and the priority queue algorithm, were mapped into a linear reconfigurable systolic array.

The design methodologies investigated in this research broaden the scope of systolic architectures by achieving reconfigurability, algorithmic flexibility, partitionability and fault tolerance in MSAs. In addition, these methodologies preserve VLSI design attributes such as locality of communication, in order to avoid long and irregular interconnections; modularity, in order to reduce design time and cost; extensibility, in order to enhance computing power; and simplicity of control. The LSSD technique is employed for loading and implementing the control structure required for the reconfiguration of MSAs. This scheme offers low overhead in terms of the additional pinouts and extra hardware needed to implement the MSA control structure. The procedure presented in this research for mapping algorithms onto MSA architectures provides a formalism for algorithm implementation and the design of reconfigurable MSAs. This procedure can be used for designing MSA processors that implement a pre-defined set of algorithms, and it can also be used for implementing a new algorithm on a given MSA processor by generating the control vector required for its reconfiguration.
6.2 FUTURE RESEARCH

The concepts presented in this thesis are not restricted to mixed systolic arrays; they can also be used to formalize the reconfiguration processes of other polymorphic multiprocessor architectures. Several research issues emanate from this work and need further investigation. First, procedures should be developed which allow the user to obtain sets of valid transformations for other classes of algorithms, with arbitrary index sets and nonconstant data dependencies. Second, the approach presented in Chapter 5 should be extended to silicon compilation of reconfigurable arrays, to generate masks for an optimal or near-optimal array which can implement the algorithms in a given set. Third, further research is needed to define a unifying performance index for reconfigurable arrays, one that measures overall array performance taking into consideration the speed, the complexity of the reconfiguration process, the complexity of the cells and the applicability of the array. Finally, research should be done to develop CAD tools for MSAs at the architectural description level, and to integrate these tools with other CAD systems supporting the circuit diagram and physical layout levels of design, to create an integrated system design environment in which masks can be generated for optimal or near-optimal MSA processors.

APPENDIX A

A COMPUTER-AIDED DESIGN FACILITY FOR MIXED SYSTOLIC ARRAYS

The rapidly growing complexity of digital systems and the ever-increasing scale of integration on silicon chips challenge the developers of computer-aided design (CAD) systems to create an integrated system design environment with high-level CAD tools. As systems become more complex, CAD tools become ever more necessary if the design task is to be done cost-effectively. In a top-down hierarchical design approach, it is common practice to structure the design process of a computer system into the following steps [17]:

1. Design and description of the architecture,
2. General circuit diagram,
3. Design of boards and integrated circuits,
4. Technology process and production.

Although commercially available computer-aided engineering workstations provide a sound environment with integrated tools for the logical and physical design levels, they are not suited to support the high-level design of hardware and functional system architectures. High-level CAD tools are therefore needed to support the system designer at higher levels in the design hierarchy, such as the design of a functional system architecture from system behavior specifications.

The main motivation behind this work is to provide the designer of MSAs with a facility to interactively develop an MSA structure for a given set of user-specified attributes, such as mixing density and array geometry, and to simulate the execution of algorithms on MSA processors at the register-transfer level. This appendix describes a computer-aided design facility for modeling and simulating mixed systolic arrays. The CAD facility serves as a high-level design tool which supports the design of MSA architectures. The facility develops the structural model of an MSA by interactively interrogating the user about various attributes of the MSA structure, such as mixing density and array geometry. In accordance with the user-specified attributes, the system generates a model for the MSA structure and displays it on the user's graphics terminal, showing the relative positions of all functional cells in the array. The desired communication structure is then defined interactively. The system also performs register-transfer level MSA simulation of algorithms, through which the user can observe the data-flow patterns in the array and evaluate the throughput of the system.
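The structure-generation step just described can be sketched in a few lines. The function name, the rectangular grid standing in for the array geometry, and the rule used to place computation cells are all illustrative assumptions; the actual facility supports hexagonal geometry (see Section A.1) and derives the placement from the basis mixing density.

```python
def generate_msa(rows, cols, mixing_density):
    """Map each index point of a rows x cols array to a cell type:
    'C' for a computation cell, 'S' for a switching cell.  The density
    (num, den) fixes the fraction of computation cells; the placement
    rule used here is a simple illustrative pattern, not the facility's."""
    num, den = mixing_density                      # e.g. (1, 7)
    return [['C' if (r * cols + c) % den < num else 'S'
             for c in range(cols)] for r in range(rows)]

# Display the relative positions of the cells, one row per line.
for row in generate_msa(3, 7, (3, 7)):
    print(' '.join(row))
```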
The CAD facility described here is a useful tool for designing and programming MSA architectures. It provides a design environment that facilitates the mapping of parallel algorithms into the reconfigurable MSA architecture and the automatic generation of the reconfiguration control code required for algorithm implementation. The register-transfer level simulation is used to validate the MSA architecture and the reconfiguration control structure during the design phase. This work is a step forward in the direction of the automated design of reconfigurable multiprocessor architectures involving multiple pipelines. The next section presents a system description of the CAD facility for modeling and simulating MSA architectures.

A.1 DESCRIPTION OF THE SYSTEM

In this section, we describe a CAD facility that supports the design of MSA architectures through interactive modeling and simulation. This facility serves as a high-level CAD tool which develops a structural model of the MSA from user-defined architectural attributes of the array, such as mixing density and array geometry. In addition, the facility performs register-transfer level MSA simulation of algorithms, observes the data-flow patterns and helps evaluate the throughput of the system. The system provides the following facilities to the designer of the MSA architecture:

1. Generation and graphic display of the structural model of a regular MSA when the array attributes, such as mixing density and array geometry, are user-specified.

2. Storage of interconnection specifications, in the form of a list of records, which are interactively specified by the user on the graphics terminal.

3. Options to interactively modify or delete selected array cells.

4. Reconfiguration of the MSA by entering reconfiguration control information.

5. Register-transfer level simulation of algorithms.

Figure A.1 shows a diagrammatic representation of the basic structure of this facility. As can be seen, the system consists of four sets of data files and seven programs. A brief description of each system block is given below.

STRUCTURE GENERATOR. This program generates the structural model of the MSA for a specified mixing density and array size, and maps the index points of the array's index set into the set of available functional cells. The output of the structure generator program is used by the plotting program and the simulator.

STRUCTURE EDITOR. This program is used to edit the MSA structure displayed at the user's graphics terminal. Two routines, PRUNE and MODIFY, are used for this purpose. PRUNE eliminates any extra boundary cells not required in the MSA implementation; the user moves the cursor to select the cell to be eliminated. MODIFY selects a cell position within the array and changes the cell type at that location. This program is robust in the sense that it can detect an improper selection made by the user, i.e., a location where no cell is present, and prompt the user about the mistake. This program provides a facility for designing irregular architectures, as well as for including special I/O processors at the array boundary.

Figure A.1. The system diagram: the user's terminal interface, the structure generator, the structure editor (PRUNE and MODIFY), the plotting program, the interconnection editor, the reconfiguration control vectors, and the simulator.
PLOTTING PROGRAM. This program plots the representation of the MSA structure as defined by the user-specified attributes and developed by the structure generator program. The plot is produced on a graphics terminal, where the user can interactively edit the MSA structure or specify the interconnections. A hard copy of the plot can be obtained at any stage in the design.

INTCON. This set of files contains lists of records of the interconnections specified for specific MSA structures. The simulator uses these specifications to determine the data flow for the array. One record specifies one interconnection link and contains information about the input and output cell coordinates and their respective local ports. A detailed discussion of the storage scheme for interconnection links is given later in this section.

INTERCONNECTION EDITOR. This program specifies and modifies the interconnections among the array cells via a graphics terminal. It decodes the screen coordinates of the graphics terminal into the array's index set points and generates a record for each specified interconnection, to be stored in INTCON. This program can be extended to check the legality of an interconnection link specified by the user.

CELLS' GRAPHIC DESCRIPTION. This set of files contains the graphic representation of all the cells of which a specific MSA is composed. Each file contains a sequence of graphics commands which define the shape and size of a cell. Different symbols are used to represent different types of cells.

CELLS' FUNCTIONAL DESCRIPTION. This set of files contains the functional description of the computation cells and switching cells. A file pertaining to a specific type of computation cell contains the data-transfer statements for processing incoming data at the local ports of the cell. For a switching cell, the file contains the one-to-one correspondence between the set of control codes for switching cells and the various interconnection configurations. The simulator program uses the functional description while executing an algorithm.

RECONFIGURATION CONTROL VECTORS. This set of files contains various control vectors, each capable of reconfiguring an MSA structure to implement a specific algorithm. The configurations of a specific MSA, and the algorithms implemented on it, are determined by its reconfiguration control vectors. The user can add new control vectors to this set of files; when loaded into the array, a control vector establishes the computation and communication structure of the array.

SIMULATOR. This program evaluates the execution behavior of the MSA architecture by determining the state of the machine after each time step when simulating the execution of algorithms. Once the MSA architecture and the communication structure have been defined using the structure generator and the interconnection editor, a particular MSA configuration is achieved by loading the corresponding control vector. The simulator merely calls two routines to simulate the processor operations and the interprocessor communication at each time step. The first routine, CMPCEL, simulates the input-to-output functional mapping for all ports of each computation cell, as defined by the control code bits in each cell. The second routine, SWTCEL, transfers the data available at the output ports of the computation cells to the input ports of the appropriate cells, as determined by the control code stored in the switching cells.
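A self-contained sketch of this two-phase time step follows. The cell model here, one input and one output port per cell and a fixed left-to-right routing in place of the control-code-driven switching, is far simpler than the facility's, and the data structures and function bodies are assumptions; only the division of each time step into a CMPCEL phase and a SWTCEL phase follows the description above.

```python
def CMPCEL(cells):
    """Phase 1: each computation cell maps its input port to its output port."""
    for cell in cells:
        if cell['type'] == 'CC':
            cell['out'] = cell['fn'](cell['in'])
        else:                          # switching cells just present their input
            cell['out'] = cell['in']

def SWTCEL(cells, x_in):
    """Phase 2: move each output to the next cell's input port (a fixed
    left-to-right routing stands in for control-code-driven routing)."""
    for k in reversed(range(1, len(cells))):
        cells[k]['in'] = cells[k - 1]['out']
    cells[0]['in'] = x_in

cells = [{'type': 'CC', 'fn': lambda v: v + 1, 'in': 0, 'out': 0},
         {'type': 'SC', 'fn': None,            'in': 0, 'out': 0},
         {'type': 'CC', 'fn': lambda v: 2 * v, 'in': 0, 'out': 0}]

for t, x in enumerate([5, 7, 9]):
    CMPCEL(cells)
    SWTCEL(cells, x)
    print(t, [c['out'] for c in cells])   # status of every cell at time t
```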
So, at each time step, the two routines CMPCEL and SWTCEL are called and executed to simulate the MSA behavior.

In the remainder of this section, we describe how the MSA structure generator generates and displays MSA structures with regular hexagonal geometry when the basis mixing density is specified. Although a large number of mixing profiles yield a regular array structure, only a few of them are practical for real algorithm applications. The hexagonal array geometry, first used for the band-matrix multiplication algorithm [31], is an important structure because of its use of planar communication and its high packing density. The smallest hexagonal array comprises seven cells and serves as a basis for hexagonal MSAs, since larger array structures can be grown from this basis. The mixing density of the basis, pb, is called the basis mixing density and can be one of the following:

pb = i/7,  i = 1, 2, . . ., 7.

Figure A.2 shows the bases corresponding to three different mixing profiles. MSAs grown from these bases are intended for computational use. The density of a regular hexagonal MSA is a function of the basis mixing density; these relations are given in [6].

Figure A.2. Hexagonal array bases for three different basis mixing densities.

The structure generator program determines the index set of the MSA and the transformation matrix D, which is a mapping from the index set to the set {0, 1}, where a 0 indicates an SC and a 1 indicates a CC. In order to obtain a hexagonal geometric pattern on the display, a screen coordinate transformation is performed which is, of course, transparent to the user. Each cell (CC or SC) in the array is labelled by an ordered pair (x, y), where x and y are the values of the X and Y coordinates of that cell position. To convert the rectangular coordinate system, in which the X-axis and Y-axis are at an angle of 90 degrees, into a hexagonal coordinate system, in which the axes are at an angle of 120 degrees, the following transformation is applied:

\[
\begin{bmatrix} X_{\mathrm{new}} \\ Y_{\mathrm{new}} \end{bmatrix}
=
\begin{bmatrix} 1.732 & -1.732 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
\]

Xnew and Ynew are the new screen coordinates for the hexagonally patterned display. (The columns of the matrix, (1.732, 1) and (-1.732, 1), are the screen images of the two unit axis vectors; each has length 2 and the angle between them is 120 degrees.)

The interprocessor communication structure is defined by specifying all interconnections among the desired MSA cells. The problem of specifying these interconnections is equivalent to the problem of defining and representing a directed graph (or digraph). One approach which could be considered here is to represent the digraph by an adjacency matrix: for a graph with p points, the adjacency matrix A = [aij] is defined such that aij = 1 if there is an arc from vertex vi to vertex vj, and aij = 0 otherwise. This approach, although quite comprehensive, was not chosen for our system, mainly for the following reasons. In certain implementations not all of the available cells are used, so the indegrees and outdegrees of some vertices in the digraph will be zero. This results in rows and columns containing all 0's, which slows down the simulation because the whole adjacency matrix must be exhaustively scanned after each clock cycle to simulate the communication structure. Another reason is that MSA structures employ a short and regular communication geometry, so a vertex in the digraph is connected to only a few of its neighbors, whereas the adjacency matrix takes into account all possible arcs from any vertex to every other vertex in the vertex set. Hence the adjacency matrix approach does not match well with the class of digraphs we are considering here. Other digraph representations, such as the reachability matrix and the distance matrix, are likewise unsuitable for representing the MSA communication graph and implementing the simulator.
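The cost difference motivating this choice is easy to quantify. The arithmetic below compares the per-cycle work of the two representations for a hypothetical 49-cell hexagonal MSA in which each cell has at most three outgoing arcs; the figures are illustrative orders of magnitude, not measurements from the facility.

```python
# Per simulated clock cycle, an adjacency matrix must be scanned in full,
# while a list of arcs is only as long as the number of interconnections.
p = 49     # vertices: a hypothetical 49-cell hexagonal MSA
d = 3      # at most three outgoing arcs per cell (regular geometry)
print("adjacency-matrix entries scanned per cycle:", p * p)   # 2401
print("arc records traversed per cycle (at most): ", d * p)   # 147
```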
The scheme we have employed to represent the digraph, and hence the communication structure of the array, is to store in a data file a list of all the arcs (or edges) in the graph. Each arc is specified in the list as a record which contains the originating cell's coordinates and its output port, as well as the terminating cell's coordinates and its input port. For example, in the case of a hexagonal array, the arc shown in Figure A.3 will be stored as the record (3 m n 6 k l), where the first and fourth entries refer to the input and output ports, respectively, such that AIN = 1, BIN = 2, CIN = 3, AOT = 4, BOT = 5 and COT = 6. The remaining four entries in the record refer to the coordinates of the two incident vertices of the arc.

The desired communication structure is defined interactively using a CAD graphics terminal. Two modes are available to the user at this stage. In the 'Cursor' mode, a cell is selected by moving the cursor anywhere inside the cell and pressing any key. In this fashion the user enters the desired interconnections on the terminal; these are converted into the record format and saved in the data file containing the list of arcs. In the 'Text' mode, the user types in the coordinates of the incident cells and specifies the input and output ports of the cells. In the future, provision can be made in the program for checking the legality of the specified connections and prompting the user in case of an illegal specification.

Figure A.3. An arc indicating a connection between two neighboring cells.

A.2 DISCUSSION

We have described a CAD tool which supports the design of MSA processors at the architectural design and description level. The register-transfer level simulator is used to validate the MSA architecture and the reconfiguration control structure during the design phase. The system was implemented on a PRIME-750 computer for hexagonal array geometry, and a simulation was performed for the band-matrix multiplication algorithm, as given in [31], on a hexagonal MSA with a basis mixing density of 1/7. A more general implementation of the system could include other array geometries, such as rectangular or triangular, in the array structure generator program.

In order to implement the band-matrix multiplication algorithm, the switching cells and selected computation cells must be in the straight and bypass configurations, respectively. That is, for these cells,

AOT = AIN,  BOT = BIN,  COT = CIN.

The active computation cells in the MSA perform the inner product step function, given by

AOT = AIN,  BOT = BIN,  COT = AIN · BIN + CIN.

The data communication with the outside world takes place at the MSA boundary only. The input and output are controlled by the simulator program. Since the timing for data input and output depends heavily on the algorithm being simulated, a separate I/O routine is required for each algorithm. For the band-matrix multiplication algorithm considered here, each cell performs active computation at every third time step. The simulator output gives the timing for the data input and output at the MSA boundary cells. The simulation results provide the status of each cell after each time step, which helps the user observe the data flow and determine the throughput of the MSA.
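These two cell configurations are small enough to state directly in code. In the sketch below a control flag stands in for the stored configuration control code and selects between the bypass mapping and the inner-product-step mapping; the function name and signature are illustrative, not the facility's actual cell description format.

```python
def cell_step(a_in, b_in, c_in, active):
    """One computation-cell time step: return (AOT, BOT, COT).

    active=False gives the straight/bypass configuration;
    active=True gives the inner product step performed by the
    active band-matrix multiplication cells."""
    if active:
        return a_in, b_in, a_in * b_in + c_in   # COT = AIN * BIN + CIN
    return a_in, b_in, c_in                     # COT = CIN

print(cell_step(2.0, 3.0, 4.0, active=True))    # -> (2.0, 3.0, 10.0)
print(cell_step(2.0, 3.0, 4.0, active=False))   # -> (2.0, 3.0, 4.0)
```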
The work presented here is a step forward in the direction of the automated design of reconfigurable multiprocessor architectures involving multiple pipelines. As further research, this high-level CAD tool can be integrated with other CAD systems supporting the circuit diagram and physical layout levels of design, to create an integrated system design environment for generating masks for MSA processors.

BIBLIOGRAPHY

1. Aho A.V., Hopcroft J.E. and Ullman J.D., Data Structures and Algorithms, Addison-Wesley, Reading, Massachusetts, 1983.

2. Brent R.P. and Kung H.T., "Systolic VLSI Arrays for Linear-Time GCD Computation," VLSI 83, F. Anceau and E.J. Aas, eds., North Holland, New York, August 1983, pp. 145-154.

3. Bromley K., Symanski J.J., Speiser J.M. and Whitehouse H.J., "Systolic Array Processor Developments," VLSI Systems and Computations, H.T. Kung et al., eds., Computer Science Press, October 1981, pp. 273-284.

4. Cappello P.R. and Steiglitz K., "Digital Signal Processing Applications of Systolic Algorithms," VLSI Systems and Computations, H.T. Kung et al., eds., Computer Science Press, October 1981, pp. 245-254.

5. Cappello P.R. and Steiglitz K., "Unifying VLSI Array Designs by Geometric Transformations," Proc. 1983 Int'l. Conf. on Parallel Processing, August 1983, pp. 448-457.

6. Chang T.L. and Fisher P.D., Mixed Systolic Arrays: A Reconfigurable Multiprocessor Architecture, Division of Engineering Research Technical Report No. MSU-ENGR-82-002, Department of Electrical Engineering and Systems Science, Michigan State University, East Lansing, Michigan 48824, 1982.

7. Chang T.L. and Fisher P.D., "Programmable Systolic Arrays," Digest of Papers, IEEE COMPCON, Spring 1982, pp. 48-53.

8. Chang T.L. and Fisher P.D., "Master-Slave Mixed Arrays for Data-Flow Computations," Digest of Papers, IEEE COMPCON, Spring 1983, pp. 468-472.

9. Chern M.Y. and Murata T., "Efficient Matrix Multiplications on a Concurrent Data-Loading Array Processor," Proc. 1983 Int'l. Conf. on Parallel Processing, August 1983, pp. 90-94.

10. Chern M.Y. and Murata T., "A Fast Algorithm for Concurrent LU Decomposition and Matrix Inversion," Proc. 1983 Int'l. Conf. on Parallel Processing, August 1983, pp. 79-86.

11. Cohen D., "Mathematical Approach to Iterative Computational Networks," Proc. Fourth Symp. on Computer Arithmetic, October 1978, pp. 226-238.

12. Eichelberger E.B. and Williams T.W., "A Logic Design Structure for LSI Testability," Proc. 14th Design Automation Conf., New Orleans, June 1977, pp. 462-468.

13. Eichelberger E.B., "Latch Design Using Level Sensitive Scan Design," Digest of Papers, COMPCON 1983, February 28 - March 3, pp. 380-383.

14. Fisher A.L., Kung H.T., Monier L.M. and Dohi Y., "Architecture of the PSC: A Programmable Systolic Chip," Proc. 10th Annual Symposium on Computer Architecture, Vol. 11, No. 3, June 1983, pp. 48-53.

15. Fortes J.A.B., Algorithm Transformations for Parallel Processing and VLSI Architecture Design, Ph.D. Thesis, Dept. of Electrical Engineering Systems, University of Southern California, Los Angeles, December 1983.

16. Fortes J.A.B., Fu K.S. and Wah B.W., "Systematic Approach to the Design of Algorithmically Specified Systolic Arrays," Int'l. Conf. on Acoustics, Speech and Signal Processing, March 1985, pp. 300-303.

17. Gonauser M. and Sauer A.M., "Needs for High-Level Design Tools," Proc. 1983 Int'l. Conf. on Computer Design: VLSI in Computers, October 1983, pp. 415-418.
18. Guibas L.J., Kung H.T. and Thompson C.D., "Direct VLSI Implementation of Combinatorial Algorithms," Proc. Conf. on VLSI Architecture, Design, and Fabrication, Calif. Inst. of Tech., January 1979, pp. 509-525.

19. Heller D.E. and Ipsen I.C.F., "Systolic Networks for Orthogonal Equivalence Transformations and Their Applications," Proc. Conf. on Advanced Research in VLSI, M.I.T., January 1982, pp. 113-122.

20. Hopkins A.L., Jr., "Fault-Tolerant System Design: Broad Brush and Fine Print," Computer, March 1980, pp. 39-45.

21. Huang K.H. and Abraham J.A., "Efficient Parallel Algorithms for Processor Arrays," Proc. 1982 Int'l. Conf. on Parallel Processing, pp. 271-279.

22. Johnson L. and Cohen D., "A Mathematical Approach to Modelling the Flow of Data and Control in Computational Networks," VLSI Systems and Computations, H.T. Kung et al., eds., Computer Science Press, October 1981, pp. 213-225.

23. Johnson L., Weiser U., Cohen D. and Davis A.L., "Towards a Formal Treatment of VLSI Arrays," Proc. Caltech Conf. on VLSI, January 1981.

24. Kessler A.J. and Patel J.H., "Reconfigurable Parallel Pipelines for Fault Tolerance," IEEE Int'l. Conf. on Circuits and Computers, September 1982, pp. 118-121.

25. Khurshid A. and Fisher P.D., "Design of a Reconfigurable Systolic Array Using LSSD Techniques," Proc. 1984 Int'l. Conf. on Computer Design: VLSI in Computers, October 1984, pp. 171-175.

26. Khurshid A. and Fisher P.D., "Algorithm Implementation on Reconfigurable Mixed Systolic Arrays," Proc. 1985 Int'l. Conf. on Parallel Processing, August 1985.

27. Khurshid A. and Fisher P.D., "A Computer-Aided Design Facility for Mixed Systolic Arrays," Sixteenth Annual Pittsburgh Modeling and Simulation Conference, April 25-26, 1985.

28. Khurshid A. and Fisher P.D., "A Multipurpose Reconfigurable Array Processor for Systolic and Semisystolic Algorithms," submitted for publication.

29. Kuekes P. and Shen J., "One-Gigaflop VLSI Systolic Processor," Proc. SPIE Symp., Vol. 431, Real-Time Signal Processing VI, August 1981, pp. 10-18.

30. Kuhn R.H., Optimization and Interconnection Complexity for: Parallel Processors, Single Stage Networks, and Decision Trees, Ph.D. Thesis, Dept. of Computer Science, Rpt. 80-1009, University of Illinois, Urbana-Champaign, Illinois, 1980.

31. Kung H.T., "Let's Design Algorithms for VLSI Systems," Proc. Caltech Conf. on Very Large Scale Integration, January 1979, pp. 55-90.

32. Kung H.T., "Why Systolic Architectures," Computer, Vol. 15, No. 1, January 1982, pp. 37-46.

33. Kung H.T., "Notes on VLSI Computation," CREST Parallel Processing Systems Course, Loughborough, England, 1980.

34. Kung H.T., "On the Implementation and Use of Systolic Array Processors," Proc. 1983 Int'l. Conf. on Computer Design: VLSI in Computers, pp. 370-373.

35. Kung H.T., "Special-Purpose Devices for Signal and Image Processing: An Opportunity in Very Large Scale Integration (VLSI)," Proc. SPIE, Vol. 241, Real-Time Signal Processing III, July 1980, pp. 76-84.

36. Kung H.T. and Lam M.S., "Wafer-Scale Integration and Two-Level Pipelined Implementations of Systolic Arrays," Journal of Parallel and Distributed Computing, Vol. 1, 1984, pp. 32-63.

37. Kung H.T. and Leiserson C.E., "Systolic Arrays (for VLSI)," Sparse Matrix Proceedings 1978, Society for Industrial and Applied Mathematics, 1979, pp. 256-282.

38. Kung H.T., Ruane L.M. and Yen D.W.L., "A Two-Level Pipelined Systolic Array for Convolutions," VLSI Systems and Computations, H.T. Kung et al., eds., Computer Science Press, October 1981, pp. 255-264.
39. Kung S.Y., Arun K.S., Gal-Ezer R.J. and Rao D.V.B., "Wavefront Array Processor: Language, Architecture, and Applications," IEEE Trans. on Computers, Vol. C-31, No. 11, November 1982, pp. 1054-1066.

40. Lam M.S. and Mostow J., "A Transformational Model for VLSI Systolic Design," Computer, Vol. 18, No. 2, February 1985, pp. 42-52.

41. Leiserson C.E., Area Efficient VLSI Computation, Ph.D. dissertation, Department of Computer Science, Carnegie-Mellon University, October 1981. Published in book form as part of the ACM Doctoral Dissertation Award Series by the MIT Press, Cambridge, Massachusetts, 1983.

42. Leiserson C.E., "Systolic and Semisystolic Design," Proc. 1983 Int'l. Conf. on Computer Design: VLSI in Computers, pp. 627-632.

43. Leiserson C.E. and Saxe J.B., "Optimizing Synchronous Systems," Journal of VLSI and Computer Systems, Vol. 1, No. 1, 1983, pp. 41-68.

44. Mead C. and Conway L., Introduction to VLSI Systems, Addison-Wesley, 1980.

45. Moldovan D.I., "On the Design of Algorithms for VLSI Systolic Arrays," Proceedings of the IEEE, Vol. 71, No. 1, January 1983, pp. 113-120.

46. Moldovan D.I., "ADVIS: A Software Package for the Design of Systolic Arrays," 1984 Int'l. Conf. on Computer Design: VLSI in Computers, October 1984, pp. 158-164.

47. Moldovan D.I., Wu C.I. and Fortes J.A.B., "Mapping an Arbitrarily Large QR Algorithm into a Fixed Size VLSI Array," Proc. 1984 Int'l. Conf. on Parallel Processing, pp. 365-373.

48. Oppenheim A.V. and Schafer R.W., Digital Signal Processing, Prentice-Hall, 1975.

49. Quinton P., "Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations," Proc. 11th Int'l. Symposium on Computer Architecture, 1984, pp. 208-214.

50. Serlin O., "Fault-Tolerant Systems in Commercial Applications," IEEE Computer, Vol. 17, No. 8, August 1984, pp. 19-30.

51. Siewiorek D.P., "Architecture of Fault-Tolerant Computers," IEEE Computer, Vol. 17, No. 8, August 1984, pp. 9-18.

52. Smith S.D. and Siegel H.J., "Recirculating, Pipelined, and Multistage SIMD Interconnection Networks," Proc. 1978 Int'l. Conf. on Parallel Processing, August 1978, pp. 206-214.

53. Snyder L., "Introduction to the Configurable, Highly Parallel Computer," Computer, Vol. 15, No. 1, January 1982, pp. 47-56.

54. Snyder L., "Supercomputers and VLSI: The Effect of Large-Scale Integration on Computer Architecture," Advances in Computers, Vol. 23, Academic Press, 1984, pp. 1-33.

55. TRW LSI Products, "LSI Multiplier-Accumulators," TRW Inc., La Jolla, California, 1983.

56. Weiser U. and Davis A., "A Wavefront Notation Tool for VLSI Array Design," VLSI Systems and Computations, H.T. Kung et al., eds., Computer Science Press, October 1981, pp. 226-234.

57. Williams T.W. and Parker K.P., "Design for Testability - A Survey," Proc. IEEE, Vol. 71, No. 1, January 1983, pp. 98-112.

58. Yen D.W.L. and Kulkarni A.V., "Systolic Processing and an Implementation for Signal and Image Processing," IEEE Transactions on Computers, Vol. C-31, No. 10, October 1982, pp. 1000-1009.