A NEW METHODOLOGY FOR SYSTEMATIC CONSTRUCTION OF SYSTOLIC ARRAYS

By

Nikrouz Faroughi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering

1987

ABSTRACT

A NEW METHODOLOGY FOR SYSTEMATIC CONSTRUCTION OF SYSTOLIC ARRAYS

By Nikrouz Faroughi

This work involves the development of a systematic method for transforming an algorithm, represented by mathematical expressions with uniform and bounded index spaces, into a systolic array architecture. Systolic arrays are highly structured architectures tailored to a specific application. They have specific architectural properties such as simple processing elements (cells), simple and regular data and control communication, and local cell interconnections. The new design method is based on an understanding of the relationship between two highly structured representations of the algorithms: the mathematical expressions and their systolic solutions. The method consists of three major steps: algorithm representation, algorithm model, and architecture specification. The algorithm representation involves the translation of mathematical expressions into a set of equivalent simple computations, which are grouped into subsets based on the required set of operations and operands of the same type. A set of information, referred to as features, is extracted from the simple computations. The features include the operands of each simple computation, the lexicographical index differences of each pair of operands of a simple computation, and the lexicographical index differences of the respective operands of each pair of simple computations.
In the algorithm model, the properties of systolic arrays are represented in terms of feature interrelationships. A sub-systolic array is designed separately for each subset of the simple computations. The final array is constructed by joining the sub-systolic arrays. This process provides a method by which non-homogeneous systolic arrays are designed by joining several homogeneous systolic arrays (i.e., sub-systolic arrays). A set of information, referred to as valid projection vectors, is systematically determined from the feature interrelationships. Each subset of the simple computations is represented geometrically, where the valid geometric transformations are determined from the valid projection vectors. The projection of the new geometric representations produces the individual cell locations and the simple computations of each cell. Other architecture specifications, such as data movement and cell count ratio, are determined early in a design process and thus can be used to select systolic solutions which require the fewest cells and the lowest I/O bandwidth. This method produces all the systolic solutions for a given set of simple computations.

DEDICATION

This work is dedicated to my parents, Majid and Tahereh. They have always supported and encouraged my pursuit of learning.

ACKNOWLEDGMENTS

I wish to express my gratitude and appreciation to my advisor and chairman of my dissertation committee, Dr. Michael Shanblatt, for his positive and continued encouragement. I am also grateful to the other members of my committee, Dr. Stanley Crouch, Dr. Erick Goodman, Dr. Lionel Ni, Dr. Harriett Rigas, and Dr. Chin-Long Wey. Through this research I learned how to decompose algorithms into simple computations; Dr. Wey provided me with valuable ideas as to how to decompose my dissertation into papers. I appreciated Dr. Goodman's sincere interest in the subject. His valuable comments helped me to tie up some loose ends. Finally, to the many good friends, such as Dr.
Charles Given, my supervisor and my friend, for his understanding and total support, and Touran Cheraghi, for her understanding during my moments of frustration, I offer my heartfelt thanks.

Table of Contents

List of Tables .......... viii
List of Figures .......... ix
CHAPTER 1: INTRODUCTION .......... 1
1.1 Problem Statement .......... 3
1.2 Approach .......... 4
1.3 Organization of The Dissertation .......... 5
1.4 Chapter Summary .......... 7
CHAPTER 2: BACKGROUND .......... 8
2.1 Systolic Arrays and their Properties .......... 8
2.2 Automatic Systolic Design .......... 12
2.2.1 Z-Operator Approach .......... 12
2.2.2 Index Transformations .......... 15
2.2.3 Data Flow Transformation .......... 22
2.2.4 Dependence Graph Transformation .......... 24
2.2.5 High Level Description Transformation .......... 25
2.2.6 Retiming Transformation .......... 27
2.3 Summary of the Literature Review .......... 31
CHAPTER 3: ALGORITHM DESCRIPTION AND REPRESENTATION .......... 33
3.1 Algorithm Description .......... 34
3.2 Algorithm Representation .......... 37
3.2.1 Initial Partial Results .......... 40
3.2.2 Algorithm Initial and Final Representations .......... 43
3.3 Chapter Summary .......... 49
CHAPTER 4: ALGORITHM MODEL .......... 50
4.1 Mathematical Development .......... 50
4.1.1 Feature Interrelationships .......... 63
4.2 Valid Projection Vectors .......... 90
4.3 Chapter Summary .......... 98
CHAPTER 5: ARCHITECTURE SPECIFICATION .......... 100
5.1 Individual Cell Locations .......... 101
5.1.1 1D and 2D Systolic Arrays .......... 103
5.1.2 m-Dimensional Systolic Arrays .......... 111
5.1.3 Design Dimension .......... 114
5.1.4 Cell Count Ratio .......... 115
5.2 Data Movement .......... 118
5.3 Cell Interconnections .......... 120
5.4 Data Flow Timing .......... 124
5.5 Input/Output Data Patterns .......... 130
5.6 Cell Functions .......... 132
5.7 Chapter Summary .......... 132
CHAPTER 6: DESIGN TOOLS AND RESULTS .......... 134
6.1 Design Facility .......... 134
6.2 Results .......... 138
6.2.1 Matrix Multiplication .......... 139
6.2.2 LU Decomposition .......... 174
6.2.3 Solution of Lower Triangular Systems of Equations .......... 182
6.2.4 Finite Impulse Response (FIR) Filter .......... 187
6.2.5 2D Convolution .......... 192
6.3 Chapter Summary .......... 206
CHAPTER 7: DISCUSSION AND CONCLUSIONS .......... 207
LIST OF REFERENCES .......... 216

List of Tables

3.1. Final representation of algorithm described by Eq. 3.28 .......... 48
4.1. Interactive feature values of Eq. 4.1-1 .......... 62
4.2. Valid local projection vectors for C1 and C2 of Eq. 4.2-2 .......... 95
5.1. Lookup Table for Transformation Matrix Construction .......... 112
6.1. The valid global projection vectors for matrix multiplication .......... 140
6.2. CCR for each Vx of matrix multiplication .......... 142
6.3. Data input directions for matrix multiplication systolic solutions .......... 143
6.4. Comparison of Number of Cells, DFT, and Data Movement for matrix multiplication systolic solutions .......... 144
6.5. Simulation results for square (5x5) matrix multiplication .......... 145
6.6. The global valid projection vectors of LU decomposition .......... 179
6.7. Data input directions for LU decomposition .......... 179
6.8. Global valid projection vectors for the FIR problem .......... 188
6.9. Data input directions for the FIR systolic solutions .......... 189
6.10. Valid global projection vectors for 2D convolution .......... 194

List of Figures

2.1. A systolic array for 3 x 3 matrix multiplication [6]
2.2. The Z-graph representation of the FIR filter y_i = w_1 * x_i + w_2 * x_{i+1}
2.3. A cell function node [21]
3.1. Possible structures of simple cells
4.1. The geometric representation of the output operands of Eq. 4.1-1
4.2. The geometric representation of the simple computations of Eq. 4.1-1
4.3. A non-homogeneous systolic array
4.4. Three sub-systolic arrays of a non-homogeneous systolic array
4.5. A homogeneous systolic array
4.6. The violated condition 1 of Theorem 4.5
4.7. The violated condition 1 of Theorem 4.6
4.8. The violated condition 1 of Theorem 4.7
4.9. The violated condition of Lemma 4.8
4.10. A test tree
4.11. The test tree of (c4, c7) of Eq. 4.1-33
5.1. A 2-dimensional geometric representation with two simple computations
5.2. An original geometric representation
5.3. A transformation of the geometric representation of Figure 5.2
5.4. The cell locations and simple computations of sub-systolic arrays of subsets C1 (Eq. 5.1-25) and C2 (Eq. 5.1-26)
5.5. The primary inputs and outputs of sub-systolic solutions of C1 and C2
5.6. A systolic array for matrix multiplication
5.7. The systolic solution for solving an upper triangular system of linear equations
6.1. The system diagram .......... 136
6.2. A projection plane for matrix multiplication; Vx = (0, 0, 1)^t .......... 148
6.3. A systolic solution for matrix multiplication; Vx = (0, 0, 1)^t .......... 149
6.4. A projection plane for matrix multiplication; Vx = (0, 1, 0)^t .......... 150
6.5. A systolic solution for matrix multiplication; Vx = (0, 1, 0)^t .......... 151
6.6. A projection plane for matrix multiplication; Vx = (1, 0, 0)^t .......... 152
6.7. A systolic solution for matrix multiplication; Vx = (1, 0, 0)^t .......... 153
6.8. A projection plane for matrix multiplication; Vx = (0, 1, 1)^t .......... 154
6.9. A systolic solution for matrix multiplication; Vx = (0, 1, 1)^t .......... 155
6.10. A projection plane for matrix multiplication; Vx = (1, 1, 0)^t .......... 156
6.11. A systolic solution for matrix multiplication; Vx = (1, 1, 0)^t .......... 157
6.12. A projection plane for matrix multiplication; Vx = (1, 0, 1)^t .......... 158
6.13. A systolic solution for matrix multiplication; Vx = (1, 0, 1)^t .......... 159
6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)^t .......... 160
6.15. A systolic solution for matrix multiplication; Vx = (0, 1, -1)^t .......... 161
6.16. A projection plane for matrix multiplication; Vx = (1, -1, 0)^t .......... 162
6.17. A systolic solution for matrix multiplication; Vx = (1, -1, 0)^t .......... 163
6.18. A projection plane for matrix multiplication; Vx = (1, 0, -1)^t .......... 164
6.19. A systolic solution for matrix multiplication; Vx = (1, 0, -1)^t .......... 165
6.20. A projection plane for matrix multiplication; Vx = (1, 1, 1)^t .......... 166
6.21. A systolic solution for matrix multiplication; Vx = (1, 1, 1)^t .......... 167
6.22. A projection plane for matrix multiplication; Vx = (1, 1, -1)^t .......... 168
6.23. A systolic solution for matrix multiplication; Vx = (1, 1, -1)^t .......... 169
6.24. A projection plane for matrix multiplication; Vx = (1, -1, 1)^t .......... 170
6.25. A systolic solution for matrix multiplication; Vx = (1, -1, 1)^t .......... 171
6.26. A projection plane for matrix multiplication; Vx = (1, -1, -1)^t .......... 172
6.27. A systolic solution for matrix multiplication; Vx = (1, -1, -1)^t .......... 173
6.28. The projection planes of LU decomposition .......... 180
6.29. The systolic solution for LU decomposition .......... 181
6.30. The geometric representation of Lx = b
6.31. The systolic solution of Lx = b
6.32. The projection line of the FIR filter for Vx = (0, 1)^t
6.33. The systolic solution of the FIR filter for Vx = (0, 1)^t
6.34. The projection line of the FIR filter for Vx = (1, 0)^t
6.35. The systolic solution of the FIR filter for Vx = (1, 0)^t
6.36. An equivalent systolic solution of 6.33
6.37. The projection plane of 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t
6.38. The systolic solution for 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t
6.39. The systolic solution of 2D convolution (Eq. 6.2-26) constructed by overlapping the cells in the solution of 6.38
6.40. The systolic solution of 6.39 with nine weights
6.41. The projection plane of 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t
6.42. The systolic solution for 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t
6.43. The systolic solution of 2D convolution (Eq. 6.2-28) constructed by overlapping the cells in the solution of 6.42
6.44. The systolic solution of 6.43 with nine weights
7.1. Y-chart representation of the new systematic systolic array construction method

CHAPTER 1

INTRODUCTION

The revolution in VLSI technology allows the implementation of powerful computational systems on a VLSI chip using multiple and regularly connected processing elements. These VLSI systems that combine pipelining and multiprocessing have been referred to as systolic arrays [1]. Many designs using the systolic array concept exist for computational algorithms in pattern recognition, image and signal processing, and matrix manipulation [1-10, 43-44]. The systematic transformation of an algorithm into a systolic array architecture is a design automation problem of current interest. The development of a systematic methodology for transforming algorithms into systolic arrays has been addressed [11-21, 23, 25-34].
The power as well as the limitations of these methodologies have been recorded [22]. One general limitation that exists in all methodologies is the lack of prior knowledge about the resulting systolic designs. In many of the existing methodologies the final systolic array is the result of several ad hoc modifications to the original design. The others use linear transformations and often choose the transformation in an ad hoc manner. In general, systematic construction of systolic arrays can be grouped into three major steps [22].

1. Algorithm Representation: An algorithm expressed by some high-level construct, e.g. mathematical expressions, is represented in a suitable format.
2. Algorithm Model: The algorithm representation from step 1 is modeled for systematic systolic design.
3. Architectural Specification: From the algorithm model the systolic array specifications, such as the number of cells, design dimension, cell interconnections, operations performed in each cell, data movement, and data timing, are extracted.

The above three steps may or may not all be contained in a given methodology. Furthermore, each major step may consist of several sub-steps. This research describes a new methodology for systematic construction of systolic arrays and includes all three major steps. For step 1, the algorithm is represented by primitive computations. These computations are generated by applying an index expansion rule to the algorithm index spaces. In step 2, the algorithm model is formed as a "knowledge representation". The inherent algorithm information that affects systolic array architectural characteristics is represented by index difference vectors. These characteristics include the use of a few different types of processing elements (cells), simple and regular data and control communications, and local cell interconnections. From this model, the simple computations are grouped among a set of cells such that the systolic array properties are preserved.
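The index expansion idea in step 1 can be made concrete with matrix multiplication. The sketch below is an illustration only, not the dissertation's notation; names such as `simple_computations` and `evaluate` are invented here. It expands c[i][j] = sum_k a[i][k] * b[k][j] into one simple computation (a single multiply-add) per index point (i, j, k):

```python
# Illustrative sketch (assumed names, not the dissertation's notation):
# expand n x n matrix multiplication into one "simple computation" per
# index point (i, j, k) of the expanded index space.

def simple_computations(n):
    """Enumerate the index points of the expanded algorithm."""
    return [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]

def evaluate(a, b):
    """Run every simple computation; each one is a single multiply-add."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for (i, j, k) in simple_computations(n):
        c[i][j] += a[i][k] * b[k][j]   # one simple computation
    return c
```

Each such index point later becomes a candidate for assignment to a cell, which is what makes the mapping problem tractable.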
This is done by representing the simple computations geometrically and mapping them onto a set of cells via geometric projections. Not all geometric projections will produce systolic arrays. Hence, geometric transformations are performed. The transformation matrices are systematically produced from the algorithm's inherent information. Step 3, the architecture specification, consists of two parts. Part 1 is incorporated with the algorithm model so that designs with different "cell count ratios", design dimensions, and data movement will be known earlier in the design process. Part 2 of step 3 identifies the individual cell locations, cell interconnections, data flow timing, and data input patterns. In general, algorithms suitable for systolic implementation are highly structured. It is assumed that the number of different systolic solutions for any given problem is independent of the problem size. Therefore, this systematic systolic array design approach is first applied to smaller problem sizes. The design information, such as the index transformation matrices and the cell interconnections, is produced. This design information is then easily extrapolated to larger problem sizes to produce correspondingly larger systolic arrays. In the discussion that follows, the terms algorithm and problem are used interchangeably to describe a set of computations which follow a set of specific rules.

1.1 PROBLEM STATEMENT

Many approaches exist for constructing systolic arrays from algorithms or from initial non-systolic designs. For these methods, in general, a systolic solution is the result of both systematic and ad hoc procedures. There are approaches which provide a systematic methodology for constructing systolic arrays [20,36]. However, these approaches may not satisfy the simple cell property of systolic arrays. The existing methods do not directly incorporate the properties of systolic arrays in the design process.
It seems appropriate that knowledge of legal communication topologies (such as local connectivity), modularity (such as regular communication), and operation complexities (such as simple arithmetic operations) should be used in designing systems with such properties. A "good" approach for constructing systolic arrays should use certain algorithm characteristics to control cell complexity, data communication, and cell connectivity. Furthermore, for a given algorithm there may exist many different systolic solutions, each with different characteristic parameters such as the number of cells, throughput, total execution time, cell interconnection patterns, and data movement. All the systolic solutions for a given algorithm need to be explored in order to select the best solution for the specific application. This work presents a new approach for constructing systolic arrays that incorporates the properties of systolic arrays in the design process and produces all the existing solutions for a given algorithm description.

1.2 APPROACH

An algorithm suitable for a systolic array implementation is first transformed into its equivalent set of simple (primitive) computations. Each simple computation consists of one resultant variable, a few input variables, and a set of simple operations applied to the input variables. The input variables are either constants or partial results (i.e., the resultant variables of other simple computations). Each simple computation therefore has the desired form for computation in a cell. These computations are then assigned to a set of simple cells; those assigned to the same cell are computed sequentially in that cell. Therefore, the mapping of an algorithm into a systolic array becomes the problem of mapping simple computations to a set of cells.
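As a purely illustrative data structure, a simple computation of the form just described can be recorded as one resultant variable, its input operands, and a simple operation. The field names below are assumptions for this sketch, not the dissertation's notation:

```python
# Hedged sketch: a record for one simple computation. The operand encoding
# (variable name plus index tuple) is an assumption for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleComputation:
    result: tuple   # the resultant variable, e.g. ('c', (0, 1, 2))
    inputs: tuple   # operands: constants or partial results of other computations
    op: str         # a simple operation suitable for one cell

# one accumulation step of matrix multiplication at index point (0, 1, 2)
sc = SimpleComputation(result=('c', (0, 1, 2)),
                       inputs=(('c', (0, 1, 1)), ('a', (0, 2)), ('b', (2, 1))),
                       op='multiply-add')
```

A cell then simply evaluates, in sequence, the simple computations assigned to it.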
The algorithm is viewed as a set of input data, a set of output data, and a finite set of rules that define how each output datum is computed from the input and possibly from the previous output data. A set of simple computations is said to be equivalent to the algorithm if it preserves the algorithm's correctness. The mapping of the simple computations into a systolic array is the same as assigning the computations to a set of cells such that the properties of systolic arrays are preserved. That is, these computations are assigned to different cells such that data communications among the cells are simple and regular, cell interconnections are local, and cell operations are simple. A set of information, referred to as features, is systematically extracted from the simple computations. The array properties are then represented in the form of feature interrelationships. A subset of simple computations can be assigned to a cell if and only if all the feature interrelationships are satisfied. Moreover, a systolic solution exists if and only if all the simple computations can be assigned to a set of cells without violating the array properties. As the size of the problem increases, so does the number of equivalent simple computations. Hence, operating directly on these computations is no longer realistic. A methodology that uses geometric transformation and projection is therefore used. The indices of the simple computations form an index space such that the projection of this space onto a lower-dimensional space maps several simple computations onto the same point. Each point is then considered to be a cell. Not all such projections, however, will preserve the properties of systolic arrays. The valid projection directions, referred to as projection vectors, are systematically derived from the feature interrelationships. These valid projection vectors are generated from small problem sizes and later used to design larger arrays for larger problem sizes.
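A minimal sketch of these two ideas, using matrix multiplication as the running example (all names are illustrative assumptions): the index-difference "features" between two simple computations are constant across the index space, and projecting the index space along a projection vector collapses several simple computations onto one cell. For simplicity the projection here handles only axis-aligned unit vectors; a general projection vector would need a full linear transformation.

```python
def index_difference(p, q):
    """Feature: lexicographical index difference of two index points."""
    return tuple(a - b for a, b in zip(p, q))

# operand indices of the simple computation c[i][j] += a[i][k] * b[k][j]
def operand_indices(i, j, k):
    return {'c': (i, j, k), 'a': (i, k), 'b': (k, j)}

u, v = operand_indices(0, 1, 1), operand_indices(0, 1, 2)
diffs = {name: index_difference(v[name], u[name]) for name in u}
# diffs is the same for any (i, j): a regularity the algorithm model exploits

def project(points, vx=(0, 0, 1)):
    """Collapse index points onto cells along an axis-aligned projection vector."""
    axis = vx.index(1)                     # simplification: unit vectors only
    cells = {}
    for p in points:
        cells.setdefault(p[:axis] + p[axis + 1:], []).append(p)
    return cells

points = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
cells = project(points)   # 27 simple computations collapse onto 9 cells
```

Each key of `cells` is a candidate cell location, and its value lists the simple computations that cell would execute sequentially.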
1.3 ORGANIZATION OF THE DISSERTATION

Chapter 2 presents the background to this research. In particular, Section 2.1 describes the systolic array architecture and its advantages. In Section 2.2 the literature on the systematic construction of systolic arrays is reviewed. Section 2.3 presents a summary of the existing methods for constructing systolic arrays. Chapter 3 presents the algorithm representation. The class of algorithms considered in this study is discussed in Section 3.1. The algorithm representation, which consists of the algorithm initial representation and the algorithm final representation, is presented in Section 3.2. The algorithm initial representation transforms the given algorithm into a standard form (a set of simple computations). The algorithm final representation contains a subset of the algorithm's inherent information; this information, together with the algorithm initial representation, forms the algorithm final representation. Chapter 4 focuses on the algorithm model. Section 4.1 presents the mathematical development of the algorithm model, where additional inherent information is extracted. The simple computations of the algorithm are represented geometrically. The projection of the geometric representation maps several primitive computations onto the same point on the projection plane. The systolic array architectural characteristics are then represented by certain relationships derived from the inherent information. These relationships, presented in Section 4.2, are used to determine those geometric projections that preserve the systolic array architectural properties. Section 4.3 presents the methodology developed from these relationships to extract the valid projection directions. Chapter 5 presents the architecture specification, according to which systolic arrays are constructed using the design information extracted from the algorithm model.
This information is translated into linear integer index transformation matrices and is discussed in Section 5.1. Each different transformation matrix corresponds to a different geometric transformation and geometric projection. The projections provide the individual cell locations and the list of primitive computations that are computed in each cell. The cell interconnections are derived using the nearest-cell information.

1.4 CHAPTER SUMMARY

Chapter 1 presented the three major tasks of systematic construction of systolic arrays from algorithms. The tasks are: Algorithm Representation, Algorithm Model, and Architectural Specification. There exist many approaches for constructing systolic arrays either from an algorithm description or from an initial non-systolic design. Only a few of the existing methods use the architectural properties of systolic arrays directly in their design process. This research develops a new approach for systematic construction of systolic arrays in which the properties of systolic arrays are incorporated into the design process. An algorithm described by mathematical expressions is decomposed into a set of simple computations. A set of features is extracted from the simple computations and used to control the cell complexity, data and control communication, and cell connectivity. The new approach generates all the systolic solutions for the given set of simple computations.

CHAPTER 2

BACKGROUND

Section 2.1 of this chapter describes the properties and advantages of the systolic array architecture. Section 2.2 presents a brief review of the existing methods for systematic construction of systolic arrays, either from an algorithm description or from an initial non-systolic design. Section 2.3 summarizes the literature review and points out certain drawbacks of the existing methods.
2.1 SYSTOLIC ARRAYS AND THEIR PROPERTIES

Special purpose computer systems are typically used to meet specific performance requirements or to off-load computations that are too time-consuming for a general-purpose computer. Since these systems have limited applicability, their cost-effectiveness has always been the deciding factor in guiding their development. With evolving VLSI circuit technology, the design cost of a special-purpose system can be held down only if appropriate architectures are used. Kung and Leiserson have characterized one such architecture as the systolic design [1]. Systolic designs consist of:

1. a few different types of simple processing elements (cells);
2. simple and regular control and data communications;
3. local cell interconnections;
4. extensive pipelining and multiprocessing [5,8].

The use of only a few different types of cells implies the design and testing of only a few different macros. These cells in general perform simple operations involving a small number of operands, hence providing simple communication among the neighboring cells. The complete array is constructed by repeating the macros over the chip area. Regular communication implies that the design can be made modular; larger systems can be designed by combining proven smaller ones. The local cell interconnections provide simple and regular wire routing, making the design of these special-purpose devices cost effective. The use of pipelining and multiprocessing enables one to reach the performance requirement of a special-purpose system. In a systolic design, data flows in a pipelined fashion between the cells, and communication with the outside world is performed only at the boundary cells. All or portions of the boundary cells have input/output ports as shown in Figure 2.1. Kung has defined systolic designs as semi-systolic if global data communication is used, and pure-systolic if no global communication is used [5].
In global communication, data is broadcast to or collected from some or all of the cells. With no global communication, data is passed from cell to cell, and each datum is used in only one cell at any given time. When the number of cells increases, using global communication results in long connection wires; expanding these non-local communication paths becomes difficult without slowing down the system clock. In general, pure-systolic is preferred, but the correct choice of semi- or pure-systolic design depends on the chip I/O limitations, the specific application, and the environment in which the chip is used. The cells in a systolic design operate synchronously. The need to distribute the clock signal to every cell in the design without clock skew is one problem of the systolic architecture. Another architecture similar to the systolic architecture is the Wavefront Array Processor [24,42]. Unlike the systolic array, a wavefront array does not operate synchronously. Here the operations of each cell are controlled locally and depend on the availability of the input data and on the delivery of the previous output data to the neighboring cells. Due to this local control of operations, skewing of the input data, generally needed in systolic designs, may be avoided in wavefront designs. Although a systolic array is usually a special purpose device customized to a specific algorithm, it can be designed with programmable cell functions and programmable cell interconnection patterns to be used for a small subset of different algorithms [37,38]. In this case, different functions may be computed in the same array by programming (reconfiguring) the array. The number of different systolic arrays that can be programmed into a reconfigurable array is limited by the number of different cell functions and the number of different cell interconnections. Reconfigurable systolic arrays are usually pure-systolic.
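The contrast between global broadcast (semi-systolic) and cell-to-cell pipelining (pure-systolic) can be illustrated with a small sketch. This is not from the dissertation; it is a minimal simulation of a hypothetical 3-tap FIR-style linear array, where the broadcast version reads any input directly and the pipelined version lets each input datum shift one register per clock so that it lives in exactly one cell at a time.

```python
# Sketch (illustrative): broadcast (semi-systolic) vs. cell-to-cell
# pipelining (pure-systolic) for a 3-tap FIR filter y_n = sum_j w_j x_{n-j}.

def fir_broadcast(x, w):
    # Semi-systolic style: every cell can read any input directly,
    # a globally communicated access pattern.
    n, k = len(x), len(w)
    return [sum(w[j] * x[i - j] for j in range(k) if 0 <= i - j < n)
            for i in range(n + k - 1)]

def fir_pipelined(x, w):
    # Pure-systolic style: inputs enter at the boundary cell and shift
    # one register per clock; each datum is in one cell at any time.
    k = len(w)
    regs = [0] * k                 # one input register per cell
    out = []
    for t in range(len(x) + k - 1):
        xin = x[t] if t < len(x) else 0
        regs = [xin] + regs[:-1]   # shift the pipeline by one cell
        out.append(sum(w[j] * regs[j] for j in range(k)))
    return out

x = [1, 2, 3, 4]
w = [1, 0, -1]
assert fir_broadcast(x, w) == fir_pipelined(x, w)
```

Both versions compute the same outputs; the difference is purely in the communication pattern, which is exactly the distinction drawn above between semi- and pure-systolic designs.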
In general, a systolic architecture is used to implement a regular and compute-bound problem. A problem is compute-bound if the total number of operations is larger than the total number of inputs and outputs. A problem is regular when repetitive computations are performed on a large set of data [5]. Since the introduction of the systolic architectures, numerous systolic designs for various known problems have been proposed [1-10, 43-44]. Automatic construction of systolic designs has only recently been addressed [11-21, 23, 25-34, 36, 39]. The quality of a systolic design is evaluated by its correctness, delay, data flow timing, number of cells (chip area), modularity, and many other factors such as accuracy, flexibility, power requirement, fault tolerance, and cost.

Figure 2.1. A systolic array for 3 x 3 matrix multiplication [6]. (The figure shows the skewed a, b, and c data streams entering the array through the boundary cells; each cell computes c = c + a * b; t stands for time.)

2.2 AUTOMATIC SYSTOLIC DESIGN

In this section existing methodologies for designing systolic arrays are reviewed. These methodologies are grouped under separate titles based on related techniques.

2.2.1 Z-OPERATOR APPROACH

The Z-operator, which is also referred to as the delay operator, is defined as

Z^j x_i = x_{i-j}    (2.2-1)

or

Z^j x_i = x_{i+j},    (2.2-2)

where the indices indicate displacement in time or space. When the Z-operator is applied to a mathematical expression with indexed variables, the index differences are indicated by the powers of Z. For example, applying the Z-operator notation to the Finite Impulse Response (FIR) filter defined by

y_n = Σ_j a_j x_{n-j}    (2.2-3)

yields

y_n = Σ_j a_j Z^j x_n.    (2.2-4)

The new expression is then manipulated symbolically using the mathematical properties of the Z-operator. Implementations with different properties can be derived [7,23,25]. By direct hardware interpretation of the expression, correctness of the implementation is assured.
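The delay-operator notation above can be made concrete with a short sketch. This is an illustrative reading of Eqs. 2.2-1 through 2.2-4, not code from [7,23,25]: Z^j delays a finite sequence by j positions, and the FIR filter is then literally a weighted sum of delayed copies of the input.

```python
# Sketch (illustrative): the delay operator Z acting on a finite sequence,
# and the FIR filter written as y = sum_j a_j (Z^j x).

def Z(j, x):
    # (Z^j x)_i = x_{i-j}: delay the sequence by j positions (Eq. 2.2-1).
    return [x[i - j] if i - j >= 0 else 0 for i in range(len(x))]

def fir(a, x):
    # y_n = sum_j a_j x_{n-j}  ==  y = sum_j a_j (Z^j x)   (Eqs. 2.2-3, 2.2-4)
    y = [0] * len(x)
    for j, aj in enumerate(a):
        xd = Z(j, x)
        y = [yi + aj * xi for yi, xi in zip(y, xd)]
    return y

assert Z(1, [5, 6, 7]) == [0, 5, 6]
assert fir([1, 2], [1, 1, 1]) == [1, 3, 3]
```

Manipulating the powers of Z symbolically, as the surveyed methods do, corresponds here to rearranging which delayed copies are summed and where.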
However, in general, the direct hardware interpretation does not have the important characteristics of the systolic architecture. Further ad hoc modifications to the designs are needed.

Kung and Lin use two representations to specify a design [27]. The two representations are equivalent and either representation can be derived from the other. The Z-graph representation is used to show the design specifications such as the cell positions, the interconnections, and the delays. The algebraic representation is used to perform algebraic transformations on a design to produce better equivalent designs. With this algebra, a designer is able to eliminate global communication. For example, consider Figure 2.2,

Figure 2.2. The Z-graph representation of the FIR filter y_i = w1 * x_i + w2 * x_{i+1}. (The figure shows two cell nodes connected by Z^0 and Z^-1 delay edges, with input x and output y.)

in which Z^-k implies a delay of k units. The algebraic representation of Figure 2.2 can be written as

v ← [0  Z^-1; 0  0] v + [Z^0; Z^0] x    (2.2-5)

and

y = [1  0] v,    (2.2-6)

or

v ← A v + b x,    (2.2-7)
y = c^t v,

where x is the input, y is the output, and v represents a vector corresponding to the cells. Furthermore, the NxN matrix A, the Nx1 vector b, and the Nx1 vector c represent the delay cycles between the cells and the availability of the data. Two types of transformations are applied: the retiming transformation and the "k-slow" transformation. Let D be a full rank, diagonal, NxN matrix whose diagonal elements represent delays in the Z notation. Let

u = D v;    (2.2-8)

then

v = D^-1 u    (2.2-9)

and

u ← (D A D^-1) u + (D b) x,    (2.2-10)
y = (c^t D^-1) u,

where the vector u represents the new cells, and (D A D^-1), (D b) and (c^t D^-1) represent the delay cycles in the equivalent new design. For the "k-slow" transformation, the elements of A, b and c are indicated in terms of k; for example, Z^-2 is replaced by Z^-2k. Then the Z exponents of the matrix (D A D^-1) will be of the form d_i + k a_ij - d_j, where D = (Z^{d_i}) and A = (Z^{a_ij}). For no global communication, d_i + k a_ij - d_j ≥ 1 (d_i + k a_ij - d_j = 0 implies zero delay).
Solving this inequality for the minimum k will result in a design of maximum throughput. For regular communication, all nonzero elements of any column of (D A D^-1) and (D b) must be distinct. This assures that the value of a cell at any time is never sent to more than one cell at a time, thus eliminating broadcasting. The initial Z-graph representation of the algorithm is found by methods similar to those discussed in [7,23,25].

2.2.2 INDEX TRANSFORMATIONS

Index transformation has been used in the past to map algorithms to parallel processors which are connected via limited interconnection networks. It was shown that there is a certain amount of flexibility when an algorithm is mapped to an inflexible interconnection network [17]. This flexibility was shown by the existence of several index transformations which restructure the algorithm to use an inflexible interconnection network. Index transformations have been used to allow an algorithm to construct its own interconnections to form systolic designs [17]. This method is best suited for algorithms with arbitrary nest depth and uniform loop body execution time. The indices of all the variables in the algorithm are expanded in an ad hoc fashion to eliminate the broadcasting; hence, each datum is uniquely defined. A linear transformation is then applied to the indices. Suppose the index set consists of the indices i, j and k; the projection of the new indices onto the i-j plane then gives the size and the dimension of the systolic array. The k coordinates indicate the timing; hence, they must all be non-negative integers. The direction of the data through the network is also determined by analyzing the data flow of the algorithm. The selection of the proper transformations is again done in an ad hoc manner. The optimality of the resulting systolic design depends on the chosen linear transformations.

Cappello and Steiglitz use linear transformations on recurrence equations [14].
Starting with recurrence equations describing the algorithm, a canonical representation is obtained by adjoining a time index to the definition of the recurrence. For example,

y_{i,j} ← y_{i,j-1} + w_j x_{i-j}    (2.2-11)

is the canonical representation of the FIR filter. The coordinates of the left side of the canonical form indicate the location where the computation on the right takes place. This gives a geometric interpretation to the recurrence relations. A geometric representation, together with the ordering rule, indicates what data moves, where it moves to, and when it moves. With various linear transformations of the geometric representation the original computational locations change, thus producing a different geometric shape. Then, by factoring out the time dimension (geometric projection), different designs are produced. The transformations are selected in an ad hoc fashion, and in [14] it was also shown that certain transformations lead to an AP^2 optimal design, where A is the area and P represents the period.

Another method that uses the index transformation is reported in [19,28-31,36]. (The basic concepts are reported in [30,31] and further studies of the method are reported in [19,28,29].) The organization and the operations of a systolic design are described by a set of 3-tuples (G,F,T). G represents the network geometry, the position of each cell identified by its Cartesian coordinates. The function F associated with each cell represents the operations performed by that cell. T represents the network timing; it indicates for each cell the time when the operations F occur and when the data communication takes place. Given an algorithm in the form of a high-level language, the loop bodies are ordered in a lexicographical ordering to eliminate the broadcasts. Each computation is assigned an index rather than associating the body of loops with the corresponding loop indices as reported in [17]. One or more data-dependent vectors correspond to each variable.
Let x(I1) and x(I2) be two variables generated at a statement S using indices I1 and I2. Variable x(I2) is said to be dependent on variable x(I1) if I1 < I2, in the lexicographical sense, and x(I1) is an input variable in the statement S. Then the vector d = I2 - I1 is called a data-dependent vector for variable x. Let d1, d2, . . . , dm be the set of data-dependent vectors calculated for the algorithm. Let D = [d1, d2, . . . , dm] be the data-dependent matrix of the algorithm. A linear transformation T, when applied to D, will generate the data-dependent matrix of the new structure, that is, TD = D'. Matrix D' = [d'1, d'2, . . . , d'm] represents the modified data dependences in the new index space. Entries of D' can be selected a priori, provided that a valid transformation T exists. T is said to be a valid transformation if it is a bijection, monotonic and nonsingular. Sufficient and necessary conditions for such a T to exist are reported in [30,31]. These transformation matrices can also be determined systematically from a set of predetermined cell interconnections. Each transformation matrix, satisfying certain specified constraints, maps the algorithm into an array whose cell connectivity is a subset of those predetermined interconnections [36]. This approach, which is the most complete to date, has the potential of not satisfying a property of systolic arrays, namely simple cell structures. The data dependency vectors, used as a set of algorithm characteristics, do not contain any information about the cell complexity. The operations required in the body of the inner loop determine the cell complexity. Those algorithms with complex loop bodies will produce complex cell structures. Matrix T is divided into two parts as

T = [π; s],

where π is related to time such that π d_i > 0 for all i, and s is related to the geometric properties of the algorithm. In general π is chosen with rank 1 and s with rank n-1, where n is the rank of T.
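The validity conditions on T can be sketched in a few lines. The numbers below are hypothetical (a plausible dependence matrix for three-index matrix multiplication and one candidate T), and the check covers only the two conditions stated above that are mechanical: the time row π must map every dependence vector forward in time, and T must be nonsingular.

```python
# Sketch (hypothetical numbers): checking a Moldovan-style transformation
# T = [pi; s] against a dependence matrix D. pi is the time row
# (pi . d_i > 0 for every d_i) and T must be nonsingular.

def det3(m):
    # Determinant of a 3x3 matrix given as three row tuples.
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_valid(T, D):
    pi = T[0]                                   # time row of T
    causal = all(sum(p * d for p, d in zip(pi, col)) > 0 for col in D)
    return causal and det3(T) != 0              # causality + nonsingularity

# Assumed dependence vectors (columns of D) for a 3-index matrix product.
D = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
T = [(1, 1, 1),      # pi: sends every dependence forward in time
     (0, 1, 0),      # s: two space rows define the projection
     (0, 0, 1)]
assert is_valid(T, D)
assert not is_valid([(1, 1, -1), (0, 1, 0), (0, 0, 1)], D)
```

Enumerating candidate T matrices and filtering with a check of this kind is one mechanical reading of "entries of D' can be selected a priori, provided that a valid transformation T exists."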
Fortes and Moldovan have used this method to model the design as a broadcast-free or reduced-broadcast systolic design [28]. Although this method is introduced here as a systematic approach to systolic designs, Moldovan, Wu, and Fortes have applied this method to map arbitrarily large algorithms into a fixed-size systolic array [29]. In particular, the approach was applied to an arbitrarily large QR algorithm. Further investigation has indicated how to model the design problem as a linear programming problem that minimizes the execution time [19].

Li and Wah's approach is to model an algorithm, represented as recurrence processes, into a non-linear integer programming problem [12,33]. A systolic design is characterized by three parameters: the velocity of the data flow, the spatial distribution of data, and the period of computations. In general, linear recurrence equations are expressed as

z^k(i,j) = f[z^{k+γ}(i,j), x(i,k), y(k,j)],    (2.2-13)

where γ = -1 if the recurrence is forward and γ = 1 if the recurrence is backward. These two recurrence types are identical in that each type can be easily converted to the other. For systolic designs only the evaluation order of the terms is important. In the above recurrence equation the superscript k represents time, and k in x(i,k) and y(k,j) represents space. Also, the above recurrence equation is of the simplest form, since x and y may themselves be functions of i, j and k. Other forms of recurrence equations are presented in [12]. Suppose the recurrence equation

z^k(i,j) = f[z^{k-1}(i,j), x(i,k), y(k,j)]    (2.2-14)

is implemented in a systolic design. It was shown that the following equations must hold:

t_kx * x_d + x_ks = t_kx * z_d,    (2.2-15)
t_ky * y_d + y_ks = t_ky * z_d,
t_i * x_d + x_is = t_i * y_d,
t_i * z_d + z_is = t_i * y_d,
t_j * y_d + y_js = t_j * x_d,
t_j * z_d + z_js = t_j * x_d,
|t_kx| = |t_ky| = |t_k|,
|t_i| ≥ 1, |t_j| ≥ 1, |t_k| ≥ 1,

where the x's, y's, and z's are vectors and the t's are scalars [12,33].
x_d represents the velocity of datum x, measured in terms of the directional distance passed during a clock cycle. It is assumed that the distance between any two neighboring cells is one. If there are m-1 delay registers between two neighboring cells in the pipeline direction of x, then |x_d| = 1/m ≤ 1. y_d and z_d are measured similarly to x_d. x_is, x_js, x_ks, y_is, and so on, represent data distribution vectors and they are non-zero integers. For example, x_is is the directional distance between x(i,j) and x(i+1,j); x_js is the directional distance between x(i,j) and x(i,j+1). The t's are computed as follows:

t_i = τ[z^k(i+1,j)] - τ[z^k(i,j)],    (2.2-16)
t_j = τ[z^k(i,j+1)] - τ[z^k(i,j)],
t_k = τ[z^{k+1}(i,j)] - τ[z^k(i,j)],

where τ(.) is the computation time. t_kx and t_ky are the time differences between the accesses of two consecutive items of variables x and y, respectively, in increasing order of the indices. For the backward recurrences, t_kx and t_ky are positive; they are negative for the forward recurrences. It was shown that these equations represent the fundamental space-time relationship in systolic designs [12,33]. Therefore, in order to produce an optimal systolic array, the design problem can be formulated as an optimization problem using these equations as constraints. The objective function is formulated based on the design criteria, depending on the above parameters and the problem size. For example, taking the minimum computation time criterion, the objective function may take the form g(t_i, t_j, t_k, n1, n2), where n1 and n2 represent the problem size. For example, the objective function for multiplying two N x N matrices is

N * |t_k| + (N - 1) * |t_i| + (N - 1) * |t_j|.    (2.2-17a)

The optimal systolic design then can be formulated as:

minimize g(t_i, t_j, t_k, n1, n2)    (2.2-17b)

subject to the constraints of Eq.
2.2-15 and

|t_k| * |z_d| = k1,    (2.2-17c)
|t_i| * |y_d| = k2,
|t_j| * |x_d| = k3,

where k1, k2, and k3 are integers which represent the distance traversed for the next computation. The solution to the non-linear programming problem involves finding integer or rational numbers for all the parameters that describe the space-time relationships of the systolic design.

Miranker and Winkler's method [18] is similar to that described in [30,31]. Given an algorithm in a high-level language with nested do-loops, loop index expansion is performed in an ad hoc fashion to eliminate the broadcasts. The normal order of the operations corresponds to a set of indices which are referred to as the nodes of the computations. The dependence structure Γ is represented by a graph. For example,

B(2,4) = B(1,4)    (2.2-18)

and

C(2,4) = C(2,3) + A(2,4) * B(2,4)    (2.2-19)

indicate that these statements require computations at nodes (1,4), (2,3), and (2,4). These nodes are called the domain of dependence for node (2,4). In the structure representing Γ, dependence vectors (edges) are formed based at node (2,4) which point back to the nodes in its domain of dependence. For this example the dependence vectors are (-1,0), (0,-1), and (0,0), and are denoted by Γ. For systolic design, Γ is homogeneous; that is, the dependence structure at one node is the same as that at any other node. Furthermore, two structures are isomorphic as graphs if there is a 1-to-1 correspondence f between them such that, if p-q is a dependence vector, then f(p)-f(q) must also be a dependence vector, where p and q are nodes. Different structures result by mapping p-q onto f(p)-f(q). These mappings are the affine mappings and they are given by the linear maps. An mxn integer matrix A is selected as a linear mapping. Let d1, d2, ..., dm be the dependence vectors of a structure; then A d_i = d'_i for i = 1, ..., m gives the dependence vectors of the new structure. Conditions for the existence of A are given in [18].
The first element in every d'_i corresponds to time and the n-1 other elements indicate the space. For spatial representation of the systolic design, a universal planar systolic array (in 2-D) is constructed where each connection corresponds to a dependence vector in space. A universal planar systolic array contains all of the possible dependence vectors. The last n-1 elements of every d'_i represent the dependence vector in space and correspond to a portion of the universal systolic array.

2.2.3 DATA FLOW TRANSFORMATION

Gannon has defined a set of vector operators to derive a function specification from an algorithm where parallelism is explicitly represented [21]. The representation is described from one domain to another, for instance f: D1 → D2. For example, f := y - (x + z) is represented as f: R^3 → R, where (x,y,z) ∈ R^3 and f ∈ R. Graphically, f may be represented as in Figure 2.3.

Figure 2.3. A cell function node [21].

The product operator (Π) represents the vectored application of a function, which is the simplest form of parallelism. For example, Π_{i=1}^{n} f_i represents the concurrent operation of the n functions f_i. Gannon has defined other operators such as the permutation and data movement operators, the chain operator, and the systolic iteration operator [21]. The permutation and data movement operators are performed on the data in order to define complex data permutation and movement operations. For example, the rotation operator performs a right circular shift data movement. The chain operator is used to chain a sequence of copies of f where the output of one is connected to the input of the next one. The systolic iteration operator describes operations where basic functions are used multiple times. With these operators the data flow graph of an algorithm is determined. Based on the properties of the algorithm and the operators used, the data flow graph can be mapped into a systolic design.
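The flavor of these operators can be conveyed with a small sketch. This is an illustrative rendering, not Gannon's notation: the product operator becomes a vectored application, the rotation operator a right circular shift, and the chain operator a composition of copies of f.

```python
# Sketch (illustrative of Gannon-style operators): vectored application,
# a rotation data-movement operator, and a chain of copies of f.

def product(f, xs):
    # Product operator: f applied concurrently to each element.
    return [f(x) for x in xs]

def rotate(xs):
    # Rotation operator: right circular shift data movement.
    return [xs[-1]] + xs[:-1]

def chain(f, n, x):
    # Chain operator: the output of one copy of f feeds the next.
    for _ in range(n):
        x = f(x)
    return x

inc = lambda v: v + 1
assert product(inc, [1, 2, 3]) == [2, 3, 4]
assert rotate([1, 2, 3]) == [3, 1, 2]
assert chain(inc, 4, 0) == 4
```

Composing such operators yields the data flow graph from which, as noted above, a systolic design can be derived.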
Ramakrishnan’s approach is to map a homogeneous program graph onto a linear array of processors [16]. A program graph is a directed acyclic graph which represents computations. The nodes represent the computation of a function, and the function arguments at each node are the values represented by the incoming edges at that node. A program graph is called homogeneous if it is suitable for execution on a linear array. A homogeneous program graph is a labeled graph G = (V, E, L). V = V1 ∪ S1 ∪ S2, where S1 represents the sink nodes, S2 represents the source nodes, and V1 represents the remaining nodes. E is the edge set, where each edge is labeled based on its direction. L is a finite set and consists of k labels. Each node in V1 has k incoming edges and k outgoing edges, where each incoming and outgoing edge is assigned a unique label from L. The homogeneous graphs are limited subsets of general data flow graphs, and several problems such as sorting, convolution, and vector multiplication can be represented as homogeneous graphs. The mapping of a homogeneous program graph G is a process that partitions G into sets of nodes where each set is mapped onto the same processor (cell). A precise formulation of a syntactically correct mapping and correct execution of G on a linear array are given in [16]. Furthermore, it was shown that if there exists a syntactically correct mapping of G, then G belongs to a class of homogeneous graphs where there exists a connected subgraph SG in G such that V_G ⊆ V_SG and SG is a mesh graph. The partitioning of G is performed via a technique called “diagonalization of the mesh graph SG” [16]. The representation of the algorithms in homogeneous program graphs is performed in an ad hoc fashion.
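The labeling condition that makes a program graph homogeneous is purely structural and easy to check mechanically. The sketch below uses a hypothetical edge encoding (triples of source, destination, label) and verifies only the condition stated above: every interior node has k incoming and k outgoing edges, each carrying a distinct label from L.

```python
# Sketch (hypothetical encoding): structural check of the homogeneity
# condition on a labeled program graph G = (V, E, L).

def is_homogeneous(edges, interior, L):
    # edges: list of (src, dst, label); interior: the nodes of V1.
    for v in interior:
        ins = [lab for (s, d, lab) in edges if d == v]
        outs = [lab for (s, d, lab) in edges if s == v]
        if sorted(ins) != sorted(L) or sorted(outs) != sorted(L):
            return False
    return True

# Tiny two-label example: sources s1, s2; interior nodes a, b; sinks t1, t2.
L = ["left", "down"]
edges = [("s1", "a", "left"), ("s2", "a", "down"),
         ("a", "b", "left"), ("a", "t1", "down"),
         ("s1", "b", "down"), ("b", "t1", "left"), ("b", "t2", "down")]
assert is_homogeneous(edges, ["a", "b"], L)
```

A graph failing this check cannot be partitioned onto a linear array by the diagonalization technique, since the per-node label structure is what the mapping exploits.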
2.2.4 DEPENDENCE GRAPH TRANSFORMATION

Quinton's method consists of three steps: 1) defining the Uniform Recurrent Equations (URE) of the algorithm; 2) finding a timing function that is compatible with the dependencies of the URE; and 3) finding an allocation function that transfers the URE onto a systolic design [20]. Consider the following URE equations:

U1(z) = f[U1(z - v1), U2(z - v2), ..., Up(z - vp)],    (2.2-21)
Ui(z) = Ui(z - vi) for i = 2, ..., p,

where z ∈ D ⊂ Z^n, a convex set of Cartesian coordinates, f is a given p-variable function, and the v_i are the dependence vectors, with v_i ∈ Z^n for all i. D is the domain of the URE, and together with the dependence vectors it defines a dependency graph of the URE. The dependency graph is denoted as the pair (D, V), where v_i ∈ V for all i. Before finding a timing function, a convex space of feasible timing functions is determined. A timing function t for a dependence graph (D, V) is feasible if t maps Z^n to Z such that t is nonnegative over D and, for all x, y ∈ D, t(x) > t(y) if x depends on y. The feasible timing functions are restricted to functions called Quasi-Affine Timing Functions (QATF) defined as

t(x) = ⌊a^t x - b⌋,  a ∈ Q^n and b ∈ Q,    (2.2-22)

where Q denotes the set of rational numbers, and a and b are calculated under certain constraints [20]. Lastly, the URE equations are mapped onto a finite set of processors (cells) using an allocation function g. Function g maps Z^n to Z^{n-1} such that g(D) is finite and, for all x, y ∈ D, g(x) = g(y) implies t(x) ≠ t(y). The feasible allocation functions are restricted to the Quasi-Linear Allocation Functions (QLAF), which have good geometric properties. Let g(x) = (g1(x), g2(x), ..., g_{n-1}(x))^t. g is said to be a QLAF if the g_i, for all i, are linear functions from Z^n to Z, and g_i(x) is of the form c(x) mod q, where c is a linear function from Z^n to Z and q is a positive integer. One way of finding a QLAF is to project D along a conveniently chosen direction.
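The two feasibility conditions, on the timing function and on the allocation function, can be checked directly on a small domain. The numbers below are hypothetical: a 4 x 4 domain, dependence vectors (1,0) and (0,1), the quasi-affine timing function t(i,j) = ⌊i + j⌋, and an allocation obtained by projecting along (1,0) so that g(i,j) = j.

```python
# Sketch (hypothetical numbers): checking a quasi-affine timing function
# t(z) = floor(a.z - b) against the dependence vectors of a URE, then
# allocating cells by projecting the domain along a chosen direction.

from math import floor
from fractions import Fraction as F

def timing(a, b, z):
    return floor(sum(ai * zi for ai, zi in zip(a, z)) - b)

def is_compatible(a, b, deps, domain):
    # t must strictly increase along every dependence inside the domain.
    return all(timing(a, b, z) >
               timing(a, b, tuple(zi - vi for zi, vi in zip(z, v)))
               for z in domain for v in deps
               if tuple(zi - vi for zi, vi in zip(z, v)) in domain)

domain = {(i, j) for i in range(4) for j in range(4)}
deps = [(1, 0), (0, 1)]                  # v1, v2 of the URE
a, b = (F(1), F(1)), F(0)                # t(i, j) = floor(i + j)
assert is_compatible(a, b, deps, domain)

# Allocation by projection along (1, 0): g(i, j) = j. Points mapped to
# the same cell must all carry distinct time steps.
g = lambda z: z[1]
cells = {}
for z in domain:
    cells.setdefault(g(z), set()).add(timing(a, b, z))
assert all(len(ts) == 4 for ts in cells.values())
```

The final assertion is exactly the condition that g(x) = g(y) implies t(x) ≠ t(y) on this domain: each of the four cells receives four computations at four distinct times.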
Within the feasible ranges, the timing functions and the allocation functions are selected arbitrarily.

2.2.5 HIGH LEVEL DESCRIPTION TRANSFORMATION

Lam and Mostow apply a series of transformations to the high level description of the algorithm [26]. The method is implemented in a program called SYS that accepts the algorithm described by for-loops, begin-end blocks, simple unnested function calls, and scalar and array variables. A series of transformations are applied to the abstract algorithm until a function level design, expressed in a predefined language, clearly describes the systolic design. In the high level description of the algorithm, the user also indicates explicitly whether to implement each sequential statement, such as a for-loop, in parallel or in place. A loop can be implemented either by performing each iteration on the same cell or by performing each iteration on a different cell. For example,

FOR i FROM 1 TO n IN PARALLEL DO
  FOR j FROM 1 TO m IN PLACE DO
    p_i := p_i * x_i + a_j

will result in a systolic design where the x's and p's stay in place and the a's are broadcast. Additionally,

FOR i FROM 1 TO n IN PLACE DO
  FOR j FROM 1 TO m IN PARALLEL DO
    p_i := p_i * x_i + a_j

produces a systolic design where the a's stay in place and the x's and p's move in the same direction. A systolic design is described as consisting of two parts, the structure and the driver. The structure part describes the hardware cells and their interconnections. The driver part describes the format and the timing of the data streams. In order to correctly perform the computations, two scheduling constraints are applied: 1) the physical constraint, implying no data collision, and 2) the logical constraint, implying data dependency. The effect of these two constraints appears in the structure of the design. These constraints serve as a guide in choosing to implement, for example, a loop in a sequential scheme, a parallel scheme, or a pipelined scheme.
Finally, a series of optimization steps are applied to the resulting systolic design. For example, if the resulting design requires repeated values as input, then an optimization step would be to change the design so that repeated values are preloaded. Or, if the cells use a common input datum, then an optimization step would be to change the design so that the common input datum propagates through the cells. SYS produces a simple-minded representation of the systolic design. Systematic and user-determined transformations are then applied to generate better designs.

Ibarra, Kim, and Palis' method involves: 1) the representation of the algorithm computations by a sequential machine, such as a Turing machine, and 2) the transformation of the sequential machine to a systolic design [39]. They have shown that a 1-dimensional systolic array with time complexity T(n), where n is the number of inputs, can be represented by an equivalent 1-dimensional sequential machine with "sweep" complexity T(n). The computations of a 1-dimensional systolic array can be "unrolled" in space and time to produce a 2-dimensional array exhibiting combinational circuits. (This array indicates the flow of data and the communication topology.) It was shown that the computations of the equivalent sequential machine can be simulated by the combinational circuits [39]. Thus, the problem of designing systolic arrays is translated to that of designing sequential machines. Since in the design of a sequential machine one does not deal with the problems of concurrency and synchronization, the design of systolic arrays is simplified. Once a sequential machine is designed for an algorithm, the computations of the sequential machine are mapped onto an array exhibiting combinational circuits (similar to those derived from unrolling systolic arrays in space and time). "Rolling up" the computation of this array produces the systolic array.
The construction of the sequential machine is done in an ad hoc manner.

2.2.6 RETIMING TRANSFORMATION

Leiserson's approach is not the transformation of an algorithm to a systolic design but rather the application of a retiming transformation to an existing synchronous system in order to improve the performance [13,32,34]. A synchronous system is viewed abstractly as a network consisting of functional elements (cells) and globally clocked registers (delays). The network is then modeled as a finite, rooted, vertex-weighted, edge-weighted, directed multigraph G = (V, E, v_h, d, w). V is the set of vertices (cells); E is the set of interconnection edges between the cells; v_h represents the host computer or the outside world; d is the propagation delay function, where d(v) ≥ 0 for v ∈ V represents the propagation delay of cell v; and w is the delay function indicated by the number of registers on each edge, where w(e) ≥ 0 for e ∈ E represents the delay on edge e. Furthermore, it is assumed that G has no directed cycles of zero weight. A retiming of a system represented as G = (V, E, v_h, d, w) is a transformation of G in which registers are added and deleted in order to change G into a new graph G_r = (V, E, v_h, d, w_r). This transformation is a mapping r: V → Z with r(v_h) = 0, where r is an integer-valued vertex-labeling function and w_r(e) = w(e) + r(v) - r(u), where e ∈ E and e is the directed edge from u to v. Similarly, w_r(p) = w(p) + r(v) - r(u), where p is any path from u to v. Let Φ(G) represent the clock period of G; then

Φ(G) = max{d(p) : w(p) = 0},    (2.2-23)
W(u, v) = min{w(p)},    (2.2-24)
D(u, v) = max{d(p) : w(p) = W(u, v)},

where p is a path from u to v. It was shown in [13] that Φ(G) ≤ c is equivalent to: if D(u, v) > c then W(u, v) ≥ 1, where c is an integer constant. The retiming transformation is modeled as a linear programming problem to find a retiming function r such that Φ(G_r) < Φ(G). r is said to be a legal retiming function of G such that Φ(G_r) ≤ c < Φ(G) if and only if:

1. r(v_h) = 0,
2. r(u) - r(v) ≤ w(e),
3. r(u) - r(v) ≤ W(u, v) - 1 such that D(u, v) > c,

for all e ∈ E and u, v ∈ V, where edge e is from u to v. The problem can also be modeled as a search problem to find the minimum c [13]. The work presented in [13,32] applies to all synchronous systems. Here only those synchronous systems that can be transformed to a systolic design, where rippling broadcasts are eliminated, are presented. Systems suitable for systolic design are assumed to have d(v) = 1 for all v ∈ V. Let G = (V, E, v_h, 1, w) and let G - 1 = (V, E, v_h, 1, w'), where w'(e) = w(e) - 1 for all e ∈ E. G can be transformed to a systolic design if and only if G - 1 has no cycles of negative weight [13,32]. In other words, there exists a retiming function r of G such that Φ(G_r) ≤ c if and only if G - 1 contains no cycles having negative edge weight. In addition to improving the system clock Φ by the retiming transformation, the system clock may be further improved by the k-slow transformation of G. The k-slow transformation of G implies multiplying all the edge weights of G by an integer k > 0. That is, G_k = (V, E, v_h, d, w_k) is the k-slow transformation of G = (V, E, v_h, d, w) if w_k(e) = k * w(e) for all e ∈ E. If G - 1 has cycles of negative weight, then G may be k-slow transformed to G_k such that G_k - 1 has no cycles of negative weight. G_k is then transformed to a systolic design by applying the retiming transformation.

Kung's retiming transformation is applied to the time-invariant Signal Flow Graph (SFG) representation of algorithms [15]. In the SFG, in general, a node represents an arithmetic or logic operation performed with zero delay, and an edge represents a function or a delay. Let D and D' represent delay elements and lower case letters represent constants. An SFG will have on its edges either a delay element or a constant indicating a multiplication by that constant. There are two classes of SFGs, those with local interconnections and those with global interconnections.
Since SFGs with global interconnections, such as the FFT, require spatially global interconnections, they are not suitable for systolic designs. Here only the SFGs with local interconnections, such as the FIR, are considered. A general systematic procedure for transforming SFGs into systolic designs is based on two simple rules.

Rule 1 (Time-Scaling): all delays D may be scaled, that is, D → nD'. Similarly, the corresponding input and output rates have to be scaled with respect to the new time unit D' by a factor n.

Let a cut-set of an SFG be defined as a minimal set of edges which partitions the SFG into two parts. The edges of the cut-set are grouped as inbound and outbound edges depending on the directions assigned to the edges.

Rule 2 (Delay-Transfer): advancing k time units on all the outbound edges and delaying k time units on all the inbound edges, and vice versa, will not affect the general system operation. The effects of lags and advances cancel each other in the overall timing.

Based on these two rules it was shown that any zero-delay edge that is non-temporally local can be transformed into a nonzero-delay edge. This is done by selecting a "good" cut-set that includes the target edge (the zero-delay edge) but does not include zero-delay edges in the opposite direction to the target edge. The cut-set could include zero-delay edges going in the same direction as the target edge. According to Rule 2, the nonzero-delay edges in the opposite direction can give one or more spare delays to the target edge(s). If there are no spare delays, Rule 1 is applied to increase the number of delays in all the edges. It was shown that there always exist "good" cut-sets for SFGs with local interconnections [13,32].

2.3 SUMMARY OF THE LITERATURE REVIEW

In the previous sections several existing methodologies for systematically constructing systolic designs were discussed. In general these methods can be grouped into two types.
Type 1 methodologies start with a problem represented in a high-level language, e.g. a data flow graph or a mathematical expression, and via systematic and/or ad hoc procedures construct systolic designs. Alternatively, type 2 methodologies start with a signal flow graph or a simply ("poorly") constructed systolic design and via systematic and/or ad hoc procedures construct systolic designs exhibiting improved performance. Although the existing methods differ in approach, their common goal is the formal construction of systolic arrays. They take into account important design properties such as functional correctness, area minimization, propagation time, communication topology, and modularity. These methods are also similar in not directly incorporating the systolic array characteristics in the design process. That is, they do not explicitly use array characteristics (i.e. simple cells, simple and regular communications, and local connectivity) as a guide in determining the systolic solutions. The approaches presented by Moldovan [36] and Li and Wah [12,33] are the only approaches which use certain systolic array architectural properties in the design processes. Moldovan's approach, which applies to algorithms described by nested do-loops, uses predetermined cell interconnection patterns in the design process. The approach uses algorithm data dependency vectors which are determined from the inner loop bodies. New dependency vectors, which favor systolic designs, are determined from a set of predetermined cell interconnections. The approach, however, does not use any information to control the cell complexities. Those algorithms with complex loop bodies will produce complex cell structures. Li and Wah's approach represents the systolic array's communication topology by certain constraint equations in terms of data dependency information derived from the algorithm.
Systolic solutions are produced by solving an integer linear programming problem satisfying a certain objective function subject to the constraint equations. The objective function represents certain optimization criteria, such as minimum computation time. The drawback of this method is that it requires solving an integer linear programming problem. Furthermore, the design information is in terms of the velocity of data flow, relative operand distances, and period of computations. These constitute operational information rather than design information. The model does not produce the individual cell locations and the input/output operands of each boundary cell. The transformation of the velocity of data flow, relative operand distances, and period of computations to a systolic array involves ad hoc steps. Each of the existing approaches for systematic construction of systolic arrays was briefly presented in Section 2.2. Fortes, Fu and Wah have compared many of these methodologies using the Y-chart [22]. Among the type 1 methodologies discussed, the index transformation, presented by Moldovan, has the fewest ad hoc steps.

CHAPTER 3

ALGORITHM DESCRIPTION AND REPRESENTATION

The new systolic design approach presented herein differs from existing approaches in that the array characteristics (i.e. simple cells, simple and regular control and communications, and local cell interconnections) are incorporated in the design process. In this chapter the properties of simple cells are used to decompose the algorithms. That is, the complexity of the processing elements themselves contributes to the decomposition of the algorithm's computations into a set of equivalent recurrence equations.

Definition 3.1 — A recurrence equation is called a simple computation if it consists of one, two, or three variables used in the resultant's calculation. It consists of one output operand and one, two, or three input operands and zero, one, or two operators.
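As a rough illustration of Definition 3.1, a simple computation can be modeled as a record with one output operand, one to three input operands, and zero to two operators. The class and field names below are my own invention, not the dissertation's notation.

```python
# Illustrative only: a minimal container enforcing Definition 3.1 --
# one output operand, 1..3 input operands, 0..2 operators.

class SimpleComputation:
    def __init__(self, output, inputs, operators):
        if not (1 <= len(inputs) <= 3 and 0 <= len(operators) <= 2):
            raise ValueError("not a simple computation under Definition 3.1")
        self.output, self.inputs, self.operators = output, inputs, operators

# e.g. the back-substitution computation x22 = x23 - u23 * x31
c4 = SimpleComputation("x22", ["x23", "u23", "x31"], ["-", "*"])
print(len(c4.inputs), len(c4.operators))   # 3 2

# A four-input, three-operator equation is rejected and must first be
# decomposed into simpler expressions.
try:
    SimpleComputation("y", ["a", "b", "c", "d"], ["+", "+", "+"])
except ValueError:
    print("rejected")
```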
(Some of the input operands may be made available at the output to be used by another cell). For systolic array implementation, a simple computation defines a cell structure with constant area and constant latency. Only the simple computations with one, two, or three input operands and zero, one, or two operations are considered. These correspond to the simple cells shown in Figure 3.1. In Figure 3.1b the partial results remain stationary in the cell.

Figure 3.1. Possible structures of simple cells.

A primary input may be preloaded into the cell's internal input register, which also remains stationary during the computations. In this case, the primary input is a part of every simple computation performed in that cell. These cell models correspond to most of the existing systolic solutions. The method described here can also be extended to more complex cell models. However, complex cell models may require complex cell communications, which are not recommended for systolic designs.

3.1 ALGORITHM DESCRIPTION

In order to satisfy the simple cell characteristics of systolic arrays it is important that the mathematical expressions and the body of the nested loops have simple structures. That is, if an expression is complex and involves many variables and many operations, it should be decomposed into two or more simpler expressions with fewer variables and fewer different operations. For example, an Infinite Impulse Response (IIR) filter with seven weights may be described by

    yi = Σ(k=1 to 3) wk * y(i-k) + Σ(l=1 to 4) vl * x(i-l+1)  for i > 0.    3.1-1

Eq. 3.1-1 can be expressed by two simple and equivalent expressions as follows:

    si = Σ(l=1 to 4) vl * x(i-l+1)  for i > 0;    3.1-2
    yj = sj + Σ(k=1 to 3) wk * y(j-k)  for j > 0;

where si is introduced as an intermediate variable. It will be noted later that Eq. 3.1-2 corresponds to cells consisting of three input operands and one output operand.
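The decomposition of Eq. 3.1-1 into Eq. 3.1-2 can be checked numerically. The sketch below is mine; since the exact index conventions in the scanned equations are ambiguous, it assumes a conventional causal IIR with three feedback weights w and four feedforward weights v, and the sample weights and input signal are invented.

```python
# Sketch (assumed index conventions): y[i] = sum_k w[k]*y[i-k] + sum_l v[l]*x[i-l+1].
# The intermediate s isolates the feedforward part so each step needs at most
# three inputs and two operations, matching the simple-cell constraint.

w = [0.5, -0.25, 0.125]          # three feedback weights (hypothetical)
v = [1.0, 2.0, 1.0, 0.5]         # four feedforward weights (hypothetical)

def iir_direct(x, n):
    """Single-expression form (Eq. 3.1-1)."""
    y = []
    for i in range(n):
        acc = sum(w[k] * y[i - 1 - k] for k in range(3) if i - 1 - k >= 0)
        acc += sum(v[l] * x[i - l] for l in range(4) if i - l >= 0)
        y.append(acc)
    return y

def iir_decomposed(x, n):
    """Two-expression form (Eq. 3.1-2): feedforward sums s, then feedback."""
    s = [sum(v[l] * x[i - l] for l in range(4) if i - l >= 0) for i in range(n)]
    y = []
    for j in range(n):
        y.append(s[j] + sum(w[k] * y[j - 1 - k] for k in range(3) if j - 1 - k >= 0))
    return y

x = [1.0, 0.0, 2.0, -1.0, 3.0, 0.5]
print(iir_direct(x, 6) == iir_decomposed(x, 6))   # True
```

The two forms compute identical values; the decomposition only reassociates the feedforward partial sum into the intermediate variable s.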
Each cell performs multiplication and addition operations. If an expression cannot be decomposed into simpler expressions then the cell structure must be changed to accommodate the more complex computations. But, as mentioned, simple cell structure is a fundamental criterion of systolic array design and thus non-decomposable expressions may indicate that the algorithm is not suitable for systolic array implementation. The algorithms under consideration are grouped into space-indexed and time-indexed algorithms.

Space-Indexed Algorithms — These algorithms have uniform and bounded index spaces. The indices of these algorithms can be represented in geometrical space. The total number of values to be computed is finite and is computed from the size of the index space. For example, matrix manipulations are space-indexed algorithms.

Time-Indexed Algorithms — These algorithms are described by mathematical expressions with at least one unbounded index space. They may require an infinite number of values to be computed. However, there often exists a Computation Window Index Space (CWIS) for each of the variables. The sizes of these windows are determined from the bounded index space. (If an expression does not have a bounded index space, then it is a series. Therefore, the problem of computation no longer exists; the problem becomes whether the series converges or diverges). For each expression, the window size is equal to the number of elements in the bounded index space. These windows, by definition, are bounded. For example, in Eq. 3.1-2 the CWIS for si is 4 and for yj it is 3. That is, Eq. 3.1-2 can be modified as follows:

    si = Σ(l=1 to 4) vl * x(i-l+1)  for 1 ≤ i ≤ 4;    3.1-3
    yj = sj + Σ(k=1 to 3) wk * y(j-k)  for 1 ≤ j ≤ 3.

Eq. 3.1-3 is then used to construct systolic solutions for the IIR filter. Let A be the set of all algorithms considered in this study. The following assumptions are made for an algorithm A ∈ A: 1.
A ∈ A is highly structured and can be represented in a closed form such as a mathematical expression. 2. A ∈ A can be represented by simple computations. Each simple computation consists of one output operand and one, two, or three input operands and zero, one, or two operations. An output operand is either a primary output or a partial result. An input operand is either a primary input or a previously computed output. 3. A ∈ A has a uniform and bounded operand index space. Or, A can be represented with a uniform and bounded operand index space, such as in the case of some time-indexed problems.

3.2 ALGORITHM REPRESENTATION

This section describes the method by which an algorithm is transformed into its equivalent set of simple computations. The method is illustrated with examples.

Example 3.1 — Consider the square matrix-matrix multiplication algorithm in A. Let AB = C where A, B, and C are all N×N matrices. This algorithm can be described by the following mathematical expression:

    cij = Σ(k=1 to N) aik * bkj  for j = 1, 2,..., N and i = 1, 2,..., N.    3.2-1

Let Z be the set of integers and let Z^m = {(z1, z2, ..., zm) | zi ∈ Z}. The matrix-matrix multiplication algorithm has uniform and bounded index spaces which are defined as follows:

    Da = {I = (i,k)^t | 1 ≤ i ≤ N, 1 ≤ k ≤ N};    3.2-2
    Db = {I = (j,k)^t | 1 ≤ j ≤ N, 1 ≤ k ≤ N};
    Dc = {I = (i,j)^t | 1 ≤ i ≤ N, 1 ≤ j ≤ N};

where Da, Db, Dc ⊆ Z^2 are the index spaces of the matrices A, B, and C, respectively. To illustrate the calculations more explicitly, for N = 3, i = 1, and j = 1, 2, Eq. 3.2-1 can be expressed as follows:

    c11 = a11 * b11 + a12 * b21 + a13 * b31;    3.2-3
    c12 = a11 * b12 + a12 * b22 + a13 * b32.

Eq. 3.2-3 indicates that the resultants c11 and c12 are determined from the summation of certain specified products. The order in which the products are computed and summed is not important. This provides considerable freedom as to how these calculations can be performed. The natural order of performing these products is in the order of data availability.
For the matrix multiplication algorithm, the elements of matrices A and B are primary inputs and hence available to form the products in almost any desired order. In the following representation the products are computed in the lexicographical order of indices.

    ci,j,k ≡ 0 for k = 0;    3.2-4
    ci,j,k = ci,j,k-1 + aik * bkj for k = 1,..., N;
    cij ≡ ci,j,N.

The initial partial results are set to zero and each cij is computed through a series of recursive calculations. From Eq. 3.2-4, the index space Dc is redefined as:

    Dc = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, 0 ≤ k ≤ 3},    3.2-5

where Dc ⊆ Z^3. It will be shown that the index differences in the lexicographical sense can be used as design information and will be automatically extracted from the simple computations. The lexicographical (lex) difference of two indices I and J in Z^m is calculated as:

    lex(I, J) = |I - J|.    3.2-6

For consistency, the dimensions of the index sets Da and Db are artificially extended to match the dimension of Dc. That is,

    Da = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, k = 0};    3.2-7
    Db = {I = (i,j,k)^t | 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, k = 0}.

Each of the index spaces Da, Db, and Dc is now in Z^3. Eq. 3.2-4 is rewritten as follows:

    ci,j,k ≡ 0 for k = 0;    3.2-8
    ci,j,k = ci,j,k-1 + ai,k,0 * bk,j,0 for k = 1,..., N;
    cij ≡ ci,j,N.

For example, if Ic = (2,2,1) ∈ Dc and Ia = (1,1,0) ∈ Da, then lex(Ic, Ia) = (1,1,1). (End of Example 3.1.)

In Eq. 3.2-8, index k is introduced to assign a precedence order to the basic computations of the algorithm. The expression ci,j,k = ci,j,k-1 + ai,k,0 * bk,j,0 for k = 1,..., N is a forward recursive function. However, such index expansion may result in forward or backward recursive functions depending on the algorithm type. The algorithms in A are therefore classified into two types.

Type 1 — The primary outputs are computed in increasing order of indices. For example, the solution of the linear equations Lx = y, where L is a lower triangular matrix and x = (x1, x2, ..., xn-1, xn)^t and y = (y1, y2, ..., yn-1, yn)^t
are vectors, is a type 1 algorithm. Note that x1 is computed before x2.

Type 2 — The primary outputs are computed in decreasing order of indices. For example, the solution of the linear equations Ux = y, where U is an upper triangular matrix and x = (x1, x2, ..., xn-1, xn)^t and y = (y1, y2, ..., yn-1, yn)^t are vectors, is a type 2 algorithm; xn is computed before xn-1. Note that hybrid algorithms are also possible.

Index-Expansion Rule — If A ∈ A is a type 1 algorithm, then the recursive functions formed by the simple computations must be forward recursive functions. The functions must be backward recursive functions if A is of type 2. For example, the recursive function ci,j = ci,j-1 + ... is a forward recursive function and ci,j = ci,j+1 + ... is a backward recursive function. If A is both a type 1 and type 2 algorithm (hybrid), then the simple computations will form both forward and backward recursive functions. The order of computations in both cases corresponds to the order of the computed primary outputs. In Example 3.1, the matrix-matrix multiplication algorithm is a type 1 algorithm. An example of a type 2 algorithm is given in the next section.

3.2.1 INITIAL PARTIAL RESULTS

Each simple computation involves a set of operations performed on the partial result and other input operands. Naturally, at the start of the operation the initial partial results must be defined. These initial partial results are determined either from the algorithm, from the type of operations performed, or both. In Example 3.1, the initial partial results, as indicated by Eq. 3.2-4, are determined from the operations. Each primary output cij of Eq. 3.2-4 is the sum of several products. Therefore, the initial partial results are assumed to be zero. If the partial results are primary inputs, then for each such primary input, a simple computation (with only one input operand) is used. The purpose of this is to easily identify the primary inputs in the final design process.
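The forward-recursive formulation of Eq. 3.2-4 can be executed directly. The code below is my own rendering, not the dissertation's: each cij starts from a zero initial partial result and is accumulated in increasing order of k; the sample matrices are invented.

```python
# Direct rendering of Eq. 3.2-4: c[i,j,0] = 0;
# c[i,j,k] = c[i,j,k-1] + a_ik * b_kj;  c_ij = c[i,j,N].

N = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # hypothetical 3x3 inputs
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

c = {}
for i in range(1, N + 1):
    for j in range(1, N + 1):
        c[i, j, 0] = 0                      # initial partial result
        for k in range(1, N + 1):           # forward recursive function
            c[i, j, k] = c[i, j, k - 1] + a[i - 1][k - 1] * b[k - 1][j - 1]

result = [[c[i, j, N] for j in range(1, N + 1)] for i in range(1, N + 1)]
print(result)   # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Each entry of `result` agrees with the ordinary matrix product AB, confirming that the recursive simple computations are equivalent to Eq. 3.2-1.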
The following example consists of primary input operands and a different set of operations.

Example 3.2 — Consider the algorithm for the solution of the upper triangular linear system Ux = b in A, where U is an N×N upper triangular matrix and x and b are N×1 vectors. This is a type 2 algorithm and is described as follows:

    xi = (bi + Σ(j>i) -uij * xj) / uii  for i = N,..., 2, 1.    3.2-9

Eq. 3.2-9 is not the only way this algorithm can be described. For instance, the result inside the parentheses can be multiplied by uii^-1 instead of being divided by uii. In the former description, the algorithm consists of multiplication, subtraction, and division operations whereas in the latter, it consists of only multiplication and subtraction operations (provided that the uii^-1's are predetermined). For this algorithm, the calculations for N = 3 are explicitly expressed as follows:

    x3 = (b3) / u33;    3.2-10
    x2 = (b2 - u23 * x3) / u22;
    x1 = (b1 - u12 * x2 - u13 * x3) / u11.

Eq. 3.2-10 indicates that the resultant xi's are determined from certain specified products which are subtracted from the bi's and then divided by the uii's. The order of calculating the products is not specified and provides some design freedom. The natural order for performing the basic computations (e.g. the products) is in the order of data availability. In this algorithm, the elements of vector x are calculated by back substitution; that is, x3 is calculated before x2 and so on. The natural order of performing these calculations is in the order of availability of the elements of vector x. The following backward recurrence equations assume this natural order:

    xi,k ≡ bi for k = N-i+3,    3.2-11
    xi,k = xi,k+1 for k = N-i+2,
    xi,k = xi,k+1 - ui,i+k-1 * xi+k-1,1 for k = 2,..., N-i+1,
    xi,k = xi,k+1 / uii for k = 1,
    xi ≡ xi,1, for i = N,..., 2, 1.

For N = 3, the simple computations of Eq.
3.2-11 are expanded as follows:

    x3 ≡ x31;    3.2-12
    x31 = x32 / u33;
    x32 = x33;
    x33 ≡ b3;
    x2 ≡ x21;
    x21 = x22 / u22;
    x22 = x23 - u23 * x31;
    x23 = x24;
    x24 ≡ b2;
    x1 ≡ x11;
    x11 = x12 / u11;
    x12 = x13 - u12 * x21;
    x13 = x14 - u13 * x31;
    x14 = x15;
    x15 ≡ b1.

b3, b2, and b1 are primary inputs and correspond to the initial partial results x33, x24, and x15, respectively. At the final stage of the systolic array design, x33, x24, and x15 will take the values b3, b2, and b1, respectively. (End of Example 3.2.)

3.2.2 ALGORITHM INITIAL AND FINAL REPRESENTATIONS

In this new approach for constructing systolic arrays from algorithms, certain inherent information that already exists in the algorithm is utilized. This information, referred to as features, is extracted from the algorithm's equivalent set of simple computations. In order to systematically extract these features, it is necessary that the simple computations have a unique form. This unique form is referred to as the algorithm initial representation. This initial representation allows the systematic extraction of the features. There exist two types of features: algorithm features and interaction features. The algorithm features are extracted from each simple computation and the interaction features (which are defined in Chapter 4) are extracted from each pair of simple computations. The set of algorithm features that are needed in the new systolic design approach is as follows:

1. the input and output operands of each simple computation;

2. the set of operations used to compute each operand;

3. the lexicographical differences between the indices of the input operands of each simple computation;

4. the lexicographical differences between the indices of the input operands and the index of the output operand of each simple computation.

Associated with each feature there exists a set of different feature values. Each simple computation provides a different value for each of the features.
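The backward recurrence of Eq. 3.2-11 can be executed directly, and the lexicographical differences of Eq. 3.2-6 computed mechanically from the operand indices. The code below is mine, the sample upper triangular system is invented, and the orientation of the lex difference (first nonzero component positive) is inferred from the worked examples in the text.

```python
# Direct rendering of the backward recurrence Eq. 3.2-11 (type 2:
# x[i] is computed in decreasing order of i).

N = 3
u = {(1, 1): 2.0, (1, 2): 1.0, (1, 3): -1.0,   # hypothetical upper
     (2, 2): 4.0, (2, 3): 2.0,                 # triangular system
     (3, 3): 5.0}
b = {1: 1.0, 2: 8.0, 3: 10.0}

xs = {}                                   # xs[i, k]: partial results x_ik
for i in range(N, 0, -1):
    xs[i, N - i + 3] = b[i]               # initial partial result
    xs[i, N - i + 2] = xs[i, N - i + 3]
    for k in range(N - i + 1, 1, -1):     # x_ik = x_i,k+1 - u_i,i+k-1 * x_i+k-1,1
        xs[i, k] = xs[i, k + 1] - u[i, i + k - 1] * xs[i + k - 1, 1]
    xs[i, 1] = xs[i, 2] / u[i, i]         # the single division step (k = 1)

x = [xs[i, 1] for i in range(1, N + 1)]
print(x)   # [1.0, 1.0, 2.0] -- solves Ux = b by back substitution

def lex(i, j):
    """Lexicographical index difference (Eq. 3.2-6), oriented so the
    first nonzero component is positive (inferred convention)."""
    d = [a - b for a, b in zip(i, j)]
    for c in d:
        if c != 0:
            return tuple(d) if c > 0 else tuple(-v for v in d)
    return tuple(d)

# relative distances between operands of x22 = x23 - u23 * x31
print(lex((2, 3), (3, 1)))   # (1, -2): input operand pair (x23, x31)
print(lex((2, 2), (2, 3)))   # (0, 1):  output/input pair (x22, x23)
```

The two printed differences match the values extracted in the feature discussion that follows.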
To illustrate, consider the following simple computation from Eq. 3.2-12:

    x22 = x23 - u23 * x31.    3.2-13

The following features and feature values can be extracted from Eq. 3.2-13:

1. The input and output operands of Eq. 3.2-13 form a subset of features where x23, u23, x31, and x22 are the feature values.

2. The set of operations used to compute an operand constitutes a feature. Eq. 3.2-13 has four operands, hence four different features. Multiplication and subtraction operations (*-) are used to compute x22 and x23; no operation ( ) is used for primary input u23; and division (/) is used to calculate x31. The values of these features are (*-), (*-), ( ), and (/).

3. The lexicographical differences between the indices of the input operands of Eq. 3.2-13 form a subset of features. The lexicographical difference between the indices of x23 and x31 is (3,1) - (2,3) = (1,-2). Index differences (0,0), (1,-2), and (1,-2), as calculated from the indices of operand pairs (x23, u23), (x23, x31), and (u23, x31), are the feature values. Later it will be shown that only the index differences between the same type of operands are needed. For example, x23 and u23 are not the same type of operand; one is a primary input with variable name u and the other is a partial result with variable name x. (Regular communication requires that variables with two distinct names enter or leave the array from two distinct lines). Similarly, x23 and x31 are not the same type of operand since each requires a different set of operations; x23 requires multiplication and subtraction operations and x31 requires a division operation.

4. Index differences (0,1), (0,1), and (1,-1), which are calculated respectively from the indices of operand pairs (x22, x23), (x22, u23), and (x22, x31), also form feature values.

The extraction of feature values in item 2 can be simplified by including the set of operations required by each operand in the simple computation. That is, Eq.
3.2-13 can be expressed as follows:

    x22(*-) = x23(*-) - u23( ) * x31(/).    3.2-14

Similarly, Eq. 3.2-11 is expressed as follows:

    xi,k ≡ bi for k = N-i+3,    3.2-15
    xi,k(*-) = xi,k+1(*-) for k = N-i+2,
    xi,k(*-) = xi,k+1(*-) - ui,i+k-1( ) * xi+k-1,1(/) for k = 2,..., N-i+1,
    xi,k(/) = xi,k+1(*-) / uii( ) for k = 1,
    xi ≡ xi,1, for i = N,..., 2, 1.

For example, in Eq. 3.2-15, (*-) is used to compute xi,k for i = N,..., 2, 1 and k = 2,..., N-i+1. The lexicographical index differences represent the relative distances between the operands of each simple computation. These index differences contain information related to the simple and regular communication and local connectivity properties of systolic arrays. To extract the algorithm features formally, a simple computation is represented generically as follows:

    x(Ix)(Ox) = y(Iy)(Oy) o1 z(Iz)(Oz) o2 w(Iw)(Ow)    3.2-16

where the different operands are indicated by the letters x, y, z, and w. The operand x(Ix) represents the output operand with generic variable name x and index Ix. The operands y(Iy), z(Iz), and w(Iw) represent up to three input operands, and Ox, Oy, Oz, and Ow are each a set of operations used in the computation of the respective operands. Ox, the set of operations used in the computation of x(Ix), is equal to the set {o1, o2}. For example, substituting Eq. 3.2-14 in the generic form defines the following:

    x(Ix) ≡ x22 implies Ix = (2,2) and x = x;    3.2-17
    y(Iy) ≡ x23 implies Iy = (2,3) and y = x;
    z(Iz) ≡ u23 implies Iz = (2,3) and z = u;
    w(Iw) ≡ x31 implies Iw = (3,1) and w = x;
    Ox ≡ *-; Oy ≡ *-; Oz ≡ null; Ow ≡ /;
    o1 ≡ -; o2 ≡ *.

Definition 3.2 — For an operand, such as x(Ix)(Ox), the variable name, x, and the set of operations, Ox, indicate the operand type. For example, operands x22(*-) and x23(*-) have the same operand type; both have the same variable name, x, and set of operations, (*-). Operands x22(*-) and x31(/) do not have the same operand type. The generic representation of Eq.
3.2-16 defines the structure of the simple computations that can be computed in the cell structures of Figure 3.1. Different generic representations are needed if cell structures other than those shown in Figure 3.1 are used. The method described here can easily be extended to different cell structures. Using the generic form of the simple computations as defined by Eq. 3.2-16, the algorithm features can now be formally defined.

Algorithm Features

1. The operands x(Ix), y(Iy), z(Iz), and w(Iw) are operand features.

2. The sets of operators Ox, Oy, Oz, and Ow are operator features.

3. The following lexicographical (lex) index differences, which indicate the relative distances between the operands of a simple computation, form the relative-distance features:

    Vxy = lex(Ix, Iy);    3.2-18
    Vxz = lex(Ix, Iz);
    Vxw = lex(Ix, Iw);
    Vyz = lex(Iy, Iz);
    Vyw = lex(Iy, Iw);
    Vzw = lex(Iz, Iw).

The relative-distance features are computed for the algorithm of Eq. 3.2-9 for N = 3 and the results are shown in Table 3.1. In Table 3.1, the simple computations are numbered c1 through c9 for easy reference. Each row in Table 3.1 indicates the set of relative-distance feature values for each simple computation. For example, Vxy1 = (0,1) is a value of feature Vxy of simple computation c1; Vxw4 = (1,-1) is a value of feature Vxw of simple computation c4.

Table 3.1. Final representation of the algorithm described by Eq. 3.2-9.
    Simple Computation (ci)                    Vxy    Vxz    Vxw     Vyz    Vyw     Vzw
    c1: x31(/)  = x32(*-) / u33( )            (0,1)  (0,2)   -      (0,1)   -       -
    c2: x32(*-) = x33(*-)                     (0,1)   -      -       -      -       -
    c3: x21(/)  = x22(*-) / u22( )            (0,1)  (0,1)   -      (0,0)   -       -
    c4: x22(*-) = x23(*-) - u23( ) * x31(/)   (0,1)  (0,1)  (1,-1)  (0,0)  (1,-2)  (1,-2)
    c5: x23(*-) = x24(*-)                     (0,1)   -      -       -      -       -
    c6: x11(/)  = x12(*-) / u11( )            (0,1)  (0,0)   -      (0,1)   -       -
    c7: x12(*-) = x13(*-) - u12( ) * x21(/)   (0,1)  (0,0)  (1,-1)  (0,1)  (1,-2)  (1,-1)
    c8: x13(*-) = x14(*-) - u13( ) * x31(/)   (0,1)  (0,0)  (2,-2)  (0,1)  (2,-3)  (2,-2)
    c9: x14(*-) = x15(*-)                     (0,1)   -      -       -      -       -

Definition 3.3 — Feature value Vxyi of simple computation ci is referred to as the dependency vector of ci. The simple computations of an algorithm are constructed such that all the dependency vectors are the same. That is, the dependency vector of ci is the same as that of cj. This means that the simple computations have one dependency vector of the form (0,1), (0,0,1), (0,0,0,1), etc. The vector size is determined from the index space of the simple computations. For example, in Table 3.1, all the values of feature Vxy are equal to (0,1). Furthermore, input operand y(Iy)(Oy) always represents a partial result. The algorithm equivalent set of simple computations and the associated algorithm features of each simple computation collectively are referred to as the algorithm final representation.

3.3 CHAPTER SUMMARY

An algorithm with uniform and bounded index space can be represented by mathematical expressions. The algorithm is then systematically transformed into a set of simple computations. The complexity of these simple computations defines the complexity of the simple cells and vice versa. Only simple computations which involve three or fewer variables and two or fewer operations are considered. This corresponds to simple cells with one, two, or three input operands and zero, one, or two operations. The method is easily extrapolated to give cells with more complex structures.
However, complex cell structures require complex cell communications, which are not recommended for systolic arrays. The simple computations of an algorithm have a unique form and define the algorithm initial representation. From these simple computations, a subset of features referred to as the algorithm features is extracted. Additional features referred to as the interaction features will be extracted in Chapter 4. The simple computations and the corresponding algorithm features define the algorithm final representation. The algorithm feature values are used repeatedly during the extraction of the design information. These feature values are determined prior to analysis (i.e. in the algorithm final representation) in order to avoid their repeated computation. Note that the algorithm feature values, computed from each simple computation, contain information about the relative distances between the operands and the set of operations needed to compute each operand. The relative operand distances will be shown to relate to the simple and regular communication and the local connectivity properties of systolic arrays. The set of operations needed to compute each operand will be shown to relate to the complexity of the cells. In the next chapter, the algorithm final representation is modeled to determine the design information necessary for constructing systolic arrays.

CHAPTER 4

ALGORITHM MODEL

This chapter presents the algorithm model from which design information is extracted and used for constructing systolic arrays. The model consists of certain local and global properties that must be satisfied if there exists a systolic solution for the algorithm. Local properties are the cell intra-connections, as indicated in Figure 3.1, and the cell operations. The legal cell operations depend on the cell complexity. Global properties include simple and regular communications and cell-to-cell connectivity.
The local and global properties are represented by feature interrelationships. A systolic solution exists if and only if all the feature interrelationships are satisfied.

4.1 MATHEMATICAL DEVELOPMENT

The simple computations of an algorithm are represented geometrically by letting the indices of the output operands indicate the location in space where they are computed. This is similar to the geometric representation considered in [14,17]. For example, consider the simple computations of Eq. 3.2-15 for N = 3, which are repeated here for convenience.

    c1: x31(/) = x32(*-) / u33( );    4.1-1
    c2: x32(*-) = x33(*-);
    c3: x21(/) = x22(*-) / u22( );
    c4: x22(*-) = x23(*-) - u23( ) * x31(/);
    c5: x23(*-) = x24(*-);
    c6: x11(/) = x12(*-) / u11( );
    c7: x12(*-) = x13(*-) - u12( ) * x21(/);
    c8: x13(*-) = x14(*-) - u13( ) * x31(/);
    c9: x14(*-) = x15(*-).

The geometric representation of Eq. 4.1-1 is shown in Figure 4.1.

Figure 4.1. The geometric representation of the output operands of Eq. 4.1-1.

Each point in Figure 4.1 indicates a simple cell and its coordinates indicate the output operand that it calculates. The cell at location (2,2), for example, computes x22 where x23, u23, and x31 are the input operands. The functions of each cell and their respective input and output operands can also be included in the geometric representation by replacing each point in Figure 4.1 by the corresponding cell. The result is shown in Figure 4.2 where the MS cells perform multiplication and subtraction operations and the D cells perform division. Assigning each simple computation its own cell is obviously not a good choice as it implies that most of the cells will be idle most of the time and many cells will be needed. (However, there are algorithms such as the solution of lower triangular linear systems that require as many cells as there are indices in the algorithm's index space).
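The geometric placement described above can be sketched as a simple lookup table: each simple computation of Eq. 4.1-1 sits at the index of its output operand, and cells are typed by the operations of that operand. The code and data-structure layout below are mine, not the dissertation's.

```python
# Geometric representation of Eq. 4.1-1: each computation is placed at the
# coordinates of its output operand (as in Figure 4.1).

computations = {
    "x31": ((3, 1), ["x32", "u33"]),
    "x32": ((3, 2), ["x33"]),
    "x21": ((2, 1), ["x22", "u22"]),
    "x22": ((2, 2), ["x23", "u23", "x31"]),
    "x23": ((2, 3), ["x24"]),
    "x11": ((1, 1), ["x12", "u11"]),
    "x12": ((1, 2), ["x13", "u12", "x21"]),
    "x13": ((1, 3), ["x14", "u13", "x31"]),
    "x14": ((1, 4), ["x15"]),
}

# a cell sits at the coordinates of the output operand it computes
cells = {coord: (out, ins) for out, (coord, ins) in computations.items()}
print(cells[(2, 2)])   # ('x22', ['x23', 'u23', 'x31'])

# cell type from the operations of the output operand: / cells are D cells,
# *- cells are MS cells (as in Figure 4.2)
ops = {"x31": "/", "x21": "/", "x11": "/",
       "x32": "*-", "x22": "*-", "x23": "*-",
       "x12": "*-", "x13": "*-", "x14": "*-"}
ms_cells = sorted(coord for coord, (out, _) in cells.items() if ops[out] == "*-")
print(ms_cells)   # [(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (3, 2)]
```

The six MS points and three D points recovered here are the two cell populations that the projection step then overlaps separately.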
To reduce the number of cells, the geometric representation is projected onto a line or a plane in order to overlap several cells onto one cell. The geometric representation of Figure 4.2 is projected onto a line, thereby assigning several simple computations to one cell. The result is a linear array with fewer cells and a higher cell utilization. The geometric representation of Figure 4.2 consists of two different cell types. It is obvious that only cells of the same type may be overlapped. Consider the MS cells of Figure 4.2, which form an array in the geometric space. The projection of the MS cells onto the i axis is the same as overlapping all the rows. This projection is said to be along vector (0,1)^t, or for short, projection vector (0,1)^t. This vector is indicated in the space by a line between two points with coordinates (0,0) and (0,1). The direction assigned to this vector is an angular direction and is indicated by the counterclockwise angle between this vector and the i axis. The angular direction of projection vector (0,1)^t is π/2 and the projection of the MS cells along this vector, as expected, is row overlapped. The column overlapping is performed by projection vector (1,0)^t. For diagonal and reverse diagonal overlapping the projection vectors are (1,-1)^t and (1,1)^t, respectively. The MS cells can also be projected onto a line along projection vectors (1,-2)^t or (2,-1)^t. However, these projections (if possible) will produce systolic solutions consisting of too many cells. Only the projection vectors with element values 0, -1, or 1 are considered. The D cells can be projected onto a line using projection vectors (0,1)^t or (1,0)^t.

Definition 4.1 — A portion of a systolic array responsible for a subset of simple computations is referred to as a sub-systolic array. Sub-systolic arrays have the same properties as systolic arrays.
In addition they must also satisfy necessary boundary cell communication linkages with other sub-systolic arrays.

Definition 4.2 — A systolic array is said to be homogeneous if it is not separable into sub-systolic arrays. A non-homogeneous systolic array is separable into two or more sub-systolic arrays. Each sub-systolic array is responsible for a subset of the simple computations.

The following criteria are used to separate a non-homogeneous systolic array into sub-systolic arrays.

1. Different cell types — The simple cell requirement dictates that cells with non-matching sets of operations should not be overlapped.

2. Different variable names — The regular communication requirement dictates that different sets of variables should enter and leave the array from distinct input/output lines. This implies that cells with non-matching input/output variable names should not be overlapped.

Those cells which perform the same set of operations and have the same input/output variable names form a sub-systolic array. Figure 4.3 illustrates a non-homogeneous systolic array which is constructed from three sub-systolic arrays.

Figure 4.2. The geometric representation of the simple computations of Eq. 4.1-1.

These sub-systolic arrays are shown in Figure 4.4. Figure 4.5 illustrates a homogeneous systolic array. Note that this systolic array is not separable into sub-systolic arrays based on the above criteria. To construct a systolic array, the simple computations are grouped into subsets such that a sub-systolic array is constructed from each subset. This is done by geometric representation and projection of each subset of simple computations. The simple computations of Eq. 4.1-1 are grouped into two subsets C1 and C2 as follows:

    C1 = {c2, c4, c5, c7, c8, c9};    4.1-2
C2 = {c1, c3, c6}.   4.1-2

The input and output operands of the simple computations of C1 require the same type of operations, and the corresponding input and output variable names are the same. Simple computations c2, c5, and c9 of C1 each consist of one input operand. The other input operands are assumed to be null. This type of simple computation, as discussed in Chapter 3, is used to identify the initial partial results which are primary inputs. Similarly, the simple computations of C2 require the same type of operations and the corresponding input and output variable names are the same. A geometric representation is constructed from the indices of the output operands for each subset of simple computations. Each representation is separately projected onto a lower dimensional space. In order to determine which projections will produce valid systolic solutions, the valid projection vectors must be determined. The valid projection vectors must satisfy three major properties. (The third property does not apply to homogeneous systolic arrays).

[Figure 4.3. A non-homogeneous systolic array.]

1. They must preserve the properties of simple and regular communications.

2. They must preserve the local cell connectivity.

3. They must satisfy the boundary cell communications of the sub-systolic arrays.

To determine the valid projection vectors, a methodology is developed based on feature interrelationships. That is, the feature values are used to assign simple computations to a set of cells such that the above properties are satisfied. Definition 4.3 - A pair of simple computations is said to interact if they are computed in the same cell.
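Interaction is determined by the projection: two simple computations are assigned to one cell exactly when the difference of their points in the geometric representation is a rational multiple of the projection vector. The following is a minimal sketch of this grouping (the point-list encoding is an illustrative assumption, not taken from the dissertation), applied to the 3-by-3 arrangement of MS cells of Figure 4.2:

```python
from fractions import Fraction

def project(points, v):
    """Group points that differ by a rational multiple of projection vector v."""
    cells = {}
    for p in points:
        # Choose t so that p - t*v has a zero component at v's first nonzero
        # coordinate; points on the same projection line share this key.
        t = Fraction(p[0], v[0]) if v[0] != 0 else Fraction(p[1], v[1])
        key = (p[0] - t * v[0], p[1] - t * v[1])
        cells.setdefault(key, []).append(p)
    return list(cells.values())

# The 3-by-3 arrangement of MS cells of Figure 4.2.
ms_points = [(i, j) for i in range(1, 4) for j in range(1, 4)]

rows = project(ms_points, (0, 1))    # projection vector (0,1)': row overlap
diags = project(ms_points, (1, -1))  # projection vector (1,-1)': diagonal overlap
print(len(rows), len(diags))
```

With projection vector (0,1)' the nine points collapse onto three cells, one per row, matching the row overlapping described above; projection vector (1,-1)' yields five diagonal cells, illustrating why vectors with larger elements, such as (2,-1)', produce solutions with too many cells.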
An interacting pair is called a valid interactive pair if it does not violate the properties of systolic arrays. Consider simple computations c4 and c7 of Eq. 4.1-1:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-3
c7: x12(*-) = x13(*-) - u12( ) * x21(/).

Simple computation pair (c4, c7) is a valid interactive pair if the following local and global properties are satisfied. (Pair (c4, c7) satisfies these properties).

1. The variable names and the sets of operators of x22(*-) and x12(*-) must match. That is, operands x22(*-) and x12(*-) must be of the same operand type. This will ensure that operands with different variable names or with a different set of operators are not computed in the same cell.

2. There must exist a one-to-one match of variable names and operators for the input operands of c4 and c7. That is, operands x23(*-) and x13(*-) must be of the same operand type. Similarly, operands u23( ) and u12( ) and operands x31(/) and x21(/) must be of the same operand types. This means that, when x22 and x12 are computed in the same cell, then operands x23 and x13 must be computed in one cell and x31 and x21 must be computed in one cell. This is required so as not to receive x23 and x13 from two separate input lines during the calculation of x22 and x12 in the same cell. Similarly, x31 and x21 must not be received from two separate input lines.

[Figure 4.5. A homogeneous systolic array.]

3. Simple computations c4 and c7 must have unique input operands unless the common input operand is a primary input. (Primary inputs may be preloaded and remain stationary). This requirement is needed to satisfy the cell intra-connections as illustrated in Figure 3.1.

4.
The output operand of c4 must not be used as an input operand in c7, or vice versa, unless the output operand of c4 is the partial result used in c7, or vice versa. This is required to satisfy the cell intra-connections.

5. If partial results x22 and x12 are computed in the same cell, then they must not be sent as input operands to more than one cell. That is, if x22 and x12 are used in other simple computations, then these simple computations must be computed in the same cell. Otherwise, the output line of the cell which computes x22 and x12 will be connected to more than one other cell.

Properties 1 through 5 indicate a set of conditions that must be satisfied before c4 and c7 can be computed in the same cell without violating the properties of systolic arrays. Although these properties appear simple, their implementation involves certain complex feature interrelationships. Furthermore, computing two simple computations in the same cell must not violate the local and global properties in another cell in the final array. In Chapter 3, a set of features (referred to as the algorithm features) was extracted from each simple computation. Additional features, called interactive features, are extracted from each pair of simple computations and relate to the geometric projection vectors. The algorithm features and interactive features then form the set of all features extracted from a set of simple computations. They are used to represent the local and global properties in terms of the feature interrelationships. Let (ci, cj) be a pair of simple computations represented in generic form as follows:

ci: xi(Ixi)(Oxi) = yi(Iyi)(Oyi) O1i zi(Izi)(Ozi) O2i wi(Iwi)(Owi);   4.1-4
cj: xj(Ixj)(Oxj) = yj(Iyj)(Oyj) O1j zj(Izj)(Ozj) O2j wj(Iwj)(Owj).

In Eq. 4.1-4, the indices i and j are used to identify the variable names, operands' indices, and the sets of operations of different simple computations. For example, in substituting simple computation c1 of Eq.
4.1-1 into the generic form, the items are defined as follows:

x1(Ix1)(Ox1) ≡ x31(/) implies that x1 = x, Ix1 = 31, and Ox1 = /;   4.1-4a
y1(Iy1)(Oy1) ≡ x32(*-) implies that y1 = x, Iy1 = 32, and Oy1 = *-;
z1(Iz1)(Oz1) ≡ u33( ) implies that z1 = u, Iz1 = 33, and Oz1 = null;
w1(Iw1)(Ow1) ≡ null implies missing.

Similarly, substituting c4 of Eq. 4.1-1 into the generic form yields the following:

x4(Ix4)(Ox4) ≡ x22(*-) implies that x4 = x, Ix4 = 22, and Ox4 = *-;   4.1-4b
y4(Iy4)(Oy4) ≡ x23(*-) implies that y4 = x, Iy4 = 23, and Oy4 = *-;
z4(Iz4)(Oz4) ≡ u23( ) implies that z4 = u, Iz4 = 23, and Oz4 = null;
w4(Iw4)(Ow4) ≡ x31(/) implies that w4 = x, Iw4 = 31, and Ow4 = /.

Using the simple computation pair (ci, cj), interactive features are now defined.

Interactive Features

The following lexicographical (lex) index differences, which indicate the relative distances between the operands of simple computations ci and cj, are the interactive feature values of (ci, cj):

Vxxij = lex(Ixi, Ixj);   4.1-5
Vyyij = lex(Iyi, Iyj);
Vzzij = lex(Izi, Izj);
Vwwij = lex(Iwi, Iwj).

Vxx, Vyy, Vzz, and Vww denote interactive features, and Vxxij, Vyyij, Vzzij, and Vwwij denote unique values for the interactive features, respectively. The values for the interactive features of the simple computations of Eq. 4.1-1 are listed in Table 4.1.

Table 4.1. Interactive feature values of Eq. 4.1-1.

Pair (ci, cj)   Vxx       Vyy       Vzz       Vww
(c1, c3)        (1, 0)    (1, 0)    (1, 1)    -
(c1, c6)        (2, 0)    (2, 0)    (2, 2)    -
(c3, c6)        (1, 0)    (1, 0)    (1, 1)    -
(c2, c4)        (1, 0)    (1, 0)    -         -
(c2, c5)        (1, -1)   (1, -1)   -         -
(c2, c7)        (2, 0)    (2, 0)    -         -
(c2, c8)        (2, -1)   (2, -1)   -         -
(c2, c9)        (2, -2)   (2, -2)   -         -
(c4, c5)        (0, 1)    (0, 1)    -         -
(c4, c7)        (1, 0)    (1, 0)    (1, 1)    (1, 0)
(c4, c8)        (1, -1)   (1, -1)   (1, 0)    (0, 0)
(c4, c9)        (1, -2)   (1, -2)   -         -
(c5, c7)        (1, 1)    (1, 1)    -         -
(c5, c8)        (1, 0)    (1, 0)    -         -
(c5, c9)        (1, -1)   (1, -1)   -         -
(c7, c8)        (0, 1)    (0, 1)    (0, 1)    (1, 0)
(c8, c9)        (0, 1)    (0, 1)    -         -

Definition 4.4 - Two simple computations ci and cj are mapped to the same cell location if feature value Vxxij = lex(Ixi, Ixj) is linearly dependent to a projection vector, Vx. That is, ci and cj form an interactive pair if Vx = t * Vxxij where t ∈ ZR, and ZR is the set of all integer and rational numbers. (This relationship between the values of feature Vxx and projection vectors is needed to satisfy the regular communication property). For example, projection vector Vx = (1, 0)', linearly dependent to feature value Vxx47 = (1, 0), maps c4 and c7 onto one cell. Similarly, Vx = (0, 1)', which is linearly dependent to (0, 2), maps simple computations c7 and c9 onto the same cell location, where Vxx79 = (0, 2).

4.1.1 FEATURE INTERRELATIONSHIPS

In this section, the conditions for valid interactive pairs are formally discussed. These conditions test for violation of the local and global properties. Local properties consist of cell intra-connections and cell operations, while global properties consist of simple and regular communications and cell interconnections. The conditions for valid interactive pairs are expressed in the form of theorems. Consider two simple computations ci and cj in generic form:

ci: xi(Ixi)(Oxi) = yi(Iyi)(Oyi) O1i zi(Izi)(Ozi) O2i wi(Iwi)(Owi);   4.1-6
cj: xj(Ixj)(Oxj) = yj(Iyj)(Oyj) O1j zj(Izj)(Ozj) O2j wj(Iwj)(Owj).

Theorem 4.1 - Simple computation pair (ci, cj) is a valid interactive pair if the following hold: 1. Oxi = Oxj; 2. Oyi = Oyj; 3. Ozi = Ozj; 4. Owi = Owj.

Proof: The proof of condition 1 is simple; only those simple computations which require the same set of operations are allowed to be computed in the same cell. If Oxi ≠ Oxj, then the interaction of ci and cj will require the cell to perform two sets of operations, Oxi and Oxj. This violates the simple cell property. Condition 2 relates to the simple communication property. Suppose ci and cj interact in the same cell, q, and condition 2 does not hold (i.e. Oyi ≠ Oyj).
Since Oyi ≠ Oyj, operands yi(Iyi) and yj(Iyj) will be computed in two different cells. However, partial results yi(Iyi) and yj(Iyj) are used to compute ci and cj in cell q. This implies that cell q will have to receive yi(Iyi) and yj(Iyj) from two different cells. In other words, an input line of cell q must be connected to two different cells. This violates the simple communication property. The proofs of conditions 3 and 4 are similar to that of condition 2. □

The regular communication property implies that the array input/output information must have a regular format. For example, two different primary input variables must be received from two distinct input lines at the same speed. Regular communication allows modular design, so larger arrays can be made by combining proven smaller ones. The conditions that test for violation of this property are stated as Theorem 4.2.

Theorem 4.2 - Simple computation pair (ci, cj) is a valid interactive pair if the following hold: 1. The variable names of the output operands must be the same. That is, xi = xj. 2. If zi(Izi)(Ozi) ≠ null and zj(Izj)(Ozj) ≠ null, then zi = zj. 3. If wi(Iwi)(Owi) ≠ null and wj(Iwj)(Owj) ≠ null, then wi = wj.

Proof: The regular communication property requires that the operands entering and exiting each cell must have the same variable names. □

Theorems 4.1 and 4.2 are used to group the simple computations into subsets. The simple computations that belong to the same subset satisfy the conditions of Theorems 4.1 and 4.2. This initial grouping of the simple computations avoids repeated testing of Theorems 4.1 and 4.2 for each pair of simple computations of each subset. For example, consider the following simple computations of Eq. 4.1-1, repeated here for convenience:

c1: x31(/) = x32(*-) / u33( );   4.1-7
c2: x32(*-) = x33(*-);
c4: x22(*-) = x23(*-) - u23( ) * x31(/).

Simple computation pair (c2, c4) satisfies Theorems 4.1 and 4.2, but pair (c1, c2) does not.
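This initial screening by Theorems 4.1 and 4.2 can be sketched as a pairwise test. (The operand encoding below is an assumption made for illustration, not the dissertation's representation; `None` stands for a null operand, which imposes no constraint.)

```python
# Sketch of the Theorem 4.1/4.2 screening: operator sets must match
# position by position (Thm 4.1), and variable names must match wherever
# both operands are non-null (Thm 4.2).
def operand(var=None, ops=None):
    return {"var": var, "ops": ops}   # var=None models a null operand

def compatible(a, b):
    if a["var"] is None or b["var"] is None:
        return True                   # a null operand matches anything
    return a["ops"] == b["ops"] and a["var"] == b["var"]

def valid_pair(ci, cj):
    # ci, cj: [x, y, z, w] operand lists in the generic form of Eq. 4.1-4.
    x_ok = ci[0]["ops"] == cj[0]["ops"] and ci[0]["var"] == cj[0]["var"]
    return x_ok and all(compatible(a, b) for a, b in zip(ci[1:], cj[1:]))

# Eq. 4.1-7:  c1: x31(/) = x32(*-) / u33( );   c2: x32(*-) = x33(*-);
#             c4: x22(*-) = x23(*-) - u23( ) * x31(/).
c1 = [operand("x", "/"),  operand("x", "*-"), operand("u", ""), operand()]
c2 = [operand("x", "*-"), operand("x", "*-"), operand(),        operand()]
c4 = [operand("x", "*-"), operand("x", "*-"), operand("u", ""), operand("x", "/")]

print(valid_pair(c2, c4))  # True  -- (c2, c4) may be grouped together
print(valid_pair(c1, c2))  # False -- output operator sets / and *- differ
```

Pair (c2, c4) passes because the null operands of c2 impose no constraint, while (c1, c2) fails already at the output operands, in agreement with the grouping of Eq. 4.1-2.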
Note that c2, c4 ∈ C1 and c1 ∈ C2 in Eq. 4.1-2. For convenience in notation, let C be a subset of simple computations. Let (ci, cj) ∈ C indicate a simple computation pair formed by ci ∈ C and cj ∈ C.

Theorem 4.3 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If zi(Izi) and zj(Izj) are previously computed outputs, then zi(Izi) ≠ zj(Izj). 2. If zi(Izi) and wj(Iwj) are previously computed outputs, then zi(Izi) ≠ wj(Iwj). 3. If wi(Iwi) and zj(Izj) are previously computed outputs, then wi(Iwi) ≠ zj(Izj). 4. If wi(Iwi) and wj(Iwj) are previously computed outputs, then wi(Iwi) ≠ wj(Iwj).

Proof: Suppose condition 1 does not hold. That is, zi(Izi) and zj(Izj) are both previously computed outputs and zi(Izi) = zj(Izj). Suppose computations ci and cj are computed in a cell, q. In this case operand zi(Izi), which is the same as zj(Izj), is used twice in cell q, once during the computation of ci and again during the computation of cj. This implies that a feedback loop is required at that input line of cell q. But this contradicts the structure of the simple cells as described in Figure 3.1. (Only partial results are allowed to loop back). The proofs of conditions 2, 3, and 4 are the same. □

In a simple computation such as ci, the input operands zi(Izi) or wi(Iwi) or both are either previously computed outputs or are primary inputs. Theorem 4.3 applies only for non-primary inputs, since a primary input can be preloaded into a cell and used several times. To illustrate the violated condition 4 of Theorem 4.3, consider simple computations c4 and c8 of Eq. 4.1-1, repeated here for convenience:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-8
c8: x13(*-) = x14(*-) - u13( ) * x31(/).

Simple computations c4 and c8 cannot be computed in the same cell since x31 is used by both. In systolic arrays, data items move from one cell to the next in a pipelined fashion and are not looped back.
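The Theorem 4.3 test on a pair such as (c4, c8) amounts to a set intersection. (The operand encoding below is an illustrative assumption, not the author's implementation.)

```python
# Sketch of the Theorem 4.3 test: a pair may not share a cell if it shares
# a non-primary input operand, since reusing a previously computed output
# inside one cell would require a feedback loop (cf. Figure 3.1).
def shares_computed_input(ci_inputs, cj_inputs, computed_outputs):
    """Inputs are sets of (variable, index) operands; computed_outputs are
    operands produced elsewhere in the algorithm, i.e. not primary inputs."""
    shared = ci_inputs & cj_inputs
    return any(op in computed_outputs for op in shared)

# Eq. 4.1-8: c4 and c8 both use x31, which is the output of c1.
c4_inputs = {("x", (2, 3)), ("u", (2, 3)), ("x", (3, 1))}
c8_inputs = {("x", (1, 4)), ("u", (1, 3)), ("x", (3, 1))}
outputs   = {("x", (3, 1))}   # x31 is a previously computed output

print(shares_computed_input(c4_inputs, c8_inputs, outputs))  # True -> invalid pair
```

If x31 were instead a primary input (e.g. a preloaded u operand), the shared operand would be allowed, which is why the test consults the set of previously computed outputs.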
The computation of c4 and c8 in the same cell implies that x31, which is a previously computed output, must loop back to be used in the computation of c8.

Theorem 4.4 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the output operands of ci and cj are not used as inputs to the same cell unless they are partial results. That is, 1. xi(Ixi) ≠ zj(Izj); 2. xi(Ixi) ≠ wj(Iwj); 3. xj(Ixj) ≠ zi(Izi); 4. xj(Ixj) ≠ wi(Iwi).

Proof: Suppose condition 1 does not hold. That is, xi(Ixi) = zj(Izj). Suppose ci and cj are computed in cell q. Since xi(Ixi) = zj(Izj), the output operand of ci is used as an input in the computation of cj. This implies that there must exist a feedback loop from the output line of cell q to one of its input lines other than the input line used for the partial results. This contradicts the simple cell structures of Figure 3.1. The proofs of conditions 2, 3, and 4 are the same. □

To illustrate the violated condition 1 of Theorem 4.4, consider the following simple computations:

ca: t21(*+) = ...;   4.1-9
cb: t32(*+) = ... + t21(*+) * ...

The computation of ca and cb in the same cell implies that t21 must be fed back to be used in the calculation of cb. This is a violation of cell intra-connections as illustrated in Figure 3.1.

Theorem 4.5 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If xi = zi and Oxi = Ozi, then Vxzi ≠ t * Vxxij where t ∈ ZR. (Set ZR is the set of all integer and rational numbers). 2. If xi = wi and Oxi = Owi, then Vxwi ≠ t * Vxxij where t ∈ ZR. 3. If xj = zj and Oxj = Ozj, then Vxzj ≠ t * Vxxij where t ∈ ZR. 4. If xj = wj and Oxj = Owj, then Vxwj ≠ t * Vxxij where t ∈ ZR. Note that in the above conditions, such as condition 1, zi(Izi) is a previously computed output if xi = zi.

Proof: Suppose condition 1 does not hold. That is, xi = zi, Oxi = Ozi, and Vxzi = t * Vxxij where t ∈ ZR.
The properties xi = zi and Oxi = Ozi imply that xi(Ixi) and zi(Izi) have the same operand type. Hence, they are computed in the same sub-systolic array. Let ck be a simple computation such that

zi(Izi)(Ozi) = xk(Ixk)(Oxk).   4.1-10

(Note that ck exists because zi(Izi) is a previously computed output). This implies that operand zi(Izi) is the output operand of ck. Consider the following feature values computed from simple computations ci and ck:

Vxzi = lex(Ixi, Izi);   4.1-11
Vxxik = lex(Ixi, Ixk).

Note that Izi = Ixk since xk(Ixk) = zi(Izi). This implies Vxxik = Vxzi. The following exists:

Vxzi = t * Vxxij where t ∈ ZR;   4.1-12a
Vxzi = Vxxik.

Hence,

Vxxij = t' * Vxxik where t' ∈ ZR.   4.1-12b

The values of feature Vxx are linearly dependent to the projection vectors. Let Vx be a projection vector linearly dependent to feature value Vxxij. From Eq. 4.1-12b, Vx is also linearly dependent to feature value Vxxik. Using the projection vector Vx, the geometric projection will map computations ci, cj, and ck onto one cell. Figure 4.6 illustrates the interaction of ci, cj, and ck. However, the interaction of ci and ck violates condition 1 of Theorem 4.4. Conditions 2, 3, and 4 can be similarly proven. □

The violated conditions of Theorem 4.5 do not exist in the simple computations of Eq. 4.1-1. However, consider the following three simple computations:

ca: l21(+*) = ...;   4.1-13
cb: l32(+*) = ... + l21(+*) * ...;
cc: l54(+*) = ...

The interaction of simple computations cb and cc violates condition 1 of Theorem 4.5. Note that Vxzb = lex[(3, 2), (2, 1)] = (1, 1) is linearly dependent to feature value Vxxbc = lex[(3, 2), (5, 4)] = (2, 2). Feature value Vxzb is calculated from the indices of operands xb(Ixb) ≡ l32 and zb(Izb) ≡ l21 of cb, and Vxxbc is calculated from the indices of operands xb(Ixb) ≡ l32 and xc(Ixc) ≡ l54 of cb and cc. Note that Vxzb = Vxxab.
Projection vector Vx = (1, 1)', which is linearly dependent to feature values Vxxbc and Vxzb, will map simple computations ca, cb, and cc onto the same cell. This means (ca, cb), (ca, cc), and (cb, cc) are interactive pairs. However, the interaction of ca and cb violates condition 1 of Theorem 4.4.

Theorem 4.6 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If yi = zi and Oyi = Ozi, then Vyzi ≠ t * Vyyij where t ∈ ZR. 2. If yi = wi and Oyi = Owi, then Vywi ≠ t * Vyyij where t ∈ ZR. 3. If yj = zj and Oyj = Ozj, then Vyzj ≠ t * Vyyij where t ∈ ZR. 4. If yj = wj and Oyj = Owj, then Vywj ≠ t * Vyyij where t ∈ ZR.

Proof: Suppose condition 1 does not hold. That is,

ci and cj interact;   4.1-14
yi = zi;
Oyi = Ozi;
Vyzi = t * Vyyij, where t ∈ ZR.

Let ck, cl, and cm be three simple computations such that

yi(Iyi)(Oyi) = xk(Ixk)(Oxk);   4.1-15
yj(Iyj)(Oyj) = xl(Ixl)(Oxl);
zi(Izi)(Ozi) = xm(Ixm)(Oxm).

Operands xk(Ixk), xl(Ixl), and xm(Ixm) are the output operands of ck, cl, and cm, respectively. Since yi = zi and Oyi = Ozi, operands yi(Iyi) and zi(Izi) are of the same operand type, which implies xk(Ixk) and xm(Ixm) are of the same operand type. Also, since ci and cj interact, operands yi(Iyi) and yj(Iyj) are of the same operand type, which implies xk(Ixk) and xl(Ixl) are of the same operand type. Hence, simple computations ck, cl, and cm will be computed in the same sub-systolic array.

[Figure 4.6. The violated condition 1 of Theorem 4.5.]

Let Vx = t * Vxxkl, where t ∈ ZR, be a projection vector. Projection vector Vx is used to map ck and cl onto the same cell location on the projection plane. Consider the following feature values:

Vxxkl = lex(Ixk, Ixl).   4.1-16

Equations 4.1-15 and 4.1-16 imply that Vxxkl = Vyyij.   4.1-17
Furthermore, equations 4.1-14 and 4.1-17 imply that Vyzi is linearly dependent to Vxxkl and hence linearly dependent to projection vector Vx. That is, Vyzi = t * Vx where t ∈ ZR. But note that

Vyzi = lex(Iyi, Izi);   4.1-18
Vxxkm = lex(Ixk, Ixm).

Finally, equations 4.1-15 and 4.1-18 imply that Vxxkm is the same as Vyzi and hence linearly dependent to Vx. This implies that projection vector Vx will also map ck and cm onto a cell location. However, the interaction of ck and cm violates the simple communication property. Figure 4.7 illustrates this violation. In Figure 4.7, the two input lines of cell p must be connected to the single output line of cell q. This will require a DMUX and hence a complex control mechanism at the output line of cell q. Conditions 2, 3, and 4 can similarly be proven. □

To illustrate the violated condition 1 of Theorem 4.6, consider the following simple computations:

ca: l222(+*) = ...;   4.1-19
cb: l322(+*) = ...;
cc: l323(+*) = l322(+*) + l222(+*) * ...;
cd: l523(+*) = l522(+*) + ...

[Figure 4.7. The violated condition 1 of Theorem 4.6.]

Feature values Vyzc and Vyycd are linearly dependent and are computed as follows:

Vyzc = (3, 2, 2) - (2, 2, 2) = (1, 0, 0);   4.1-20
Vyycd = (5, 2, 2) - (3, 2, 2) = (2, 0, 0).

If projection vector Vx = (1, 0, 0)' is used to project the geometric representation, simple computations cc and cd will interact. Note that simple computations ca, cb, cc, and cd belong to the same geometric representation. However, the interaction of ca and cb violates the simple communication property at the cell where simple computation cc is computed.

Theorem 4.7 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. Let ck and cl be two simple computations such that xk(Ixk)(Oxk) = zi(Izi)(Ozi) and xl(Ixl)(Oxl) = wi(Iwi)(Owi).
If zi(Izi) and wi(Iwi) are previously computed outputs and zi = wi, Ozi = Owi, and projection vector Vx1 = t * Vxxij where t ∈ ZR, then a projection vector such as Vx2, which maps ck and cl onto the same cell location, must not be linearly dependent to feature value Vzwi. 2. Let ck and cl be two simple computations such that xk(Ixk)(Oxk) = zj(Izj)(Ozj) and xl(Ixl)(Oxl) = wj(Iwj)(Owj). If zj(Izj) and wj(Iwj) are previously computed outputs and zj = wj, Ozj = Owj, and projection vector Vx1 = t * Vxxij where t ∈ ZR, then a projection vector such as Vx2, which maps ck and cl onto the same cell location, must not be linearly dependent to feature value Vzwj.

Proof: Suppose condition 1 does not hold. That is,

zi = wi;   4.1-21a
Ozi = Owi;
Vx1 = t1 * Vxxij, t1 ∈ ZR;
Vx2 = t2 * Vzwi, t2 ∈ ZR;

and zi(Izi) and wi(Iwi) are previously computed outputs. Projection vector Vx1 maps simple computations ci and cj onto the same cell. Let ci and cj be computed in cell q. The following exists:

Vxxkl = Vzwi.   4.1-21b

From Eq. 4.1-21b, Vx2 is linearly dependent to Vxxkl. Therefore, projection vector Vx2 maps simple computations ck and cl onto the same cell. Let ck and cl be computed in cell p. Figure 4.8 shows cells p and q. The third and the fourth input operands of ci are computed in cell p. This would require a DMUX at the output operand line of cell p, hence violating the simple communication property. Condition 2 can similarly be proven. □

To illustrate the violated condition 1 of Theorem 4.7, consider the following simple computations:

ca: t32(*) = ... + t32(/) * t23(/);   4.1-22
cb: t23(/) = ...;
cc: t32(/) = ...

The projection of the geometric representation using projection vector Vx2 = (1, -1)', where cb and cc are in the representation, maps simple computations cb and cc onto the same cell. However, this mapping violates the simple communication property at the cell where ca is computed.
Note that Vx2 is linearly dependent to feature value Vzwa = lex[(3, 2), (2, 3)] = (1, -1).

Until now, the geometric representations were defined in terms of the indices of the output operands. The projection of these geometric representations maps (assigns) the output operands, and hence the simple computations, onto certain cell locations. Now consider the geometric representations in terms of the indices of the input operands. The projection of these representations assigns the input operands to certain cell locations. A sub-systolic array is constructed by mapping the output operands and the respective input operands onto a set of cells. To do this, a valid projection vector is needed to map the output operands, and at most three valid projection vectors (one for each existing input line) are needed to map all the input operands. Note that these valid projection vectors must map the right operands to the right cells in order to preserve the algorithm correctness. In general, however, the input operands are not unique; an input operand may be used by more than one simple computation. Therefore, the projection of a geometric representation constructed from the indices of the input operands does not define the cell locations. It only indicates those input operands that are assigned to the same cell. On the other hand, the output operands are unique. Therefore, only the projections of the geometric representations which are constructed from the indices of the output operands are used to define the individual cell locations. In order to map the right input and output operands to the right cell locations, the valid projection vectors for the output operands and the input operands are determined. To distinguish the different projection vectors from one another, the operands of a simple computation are named. Consider the generic form for simple computations, repeated here for convenience:

x(Ix)(Ox) = y(Iy)(Oy) O1 z(Iz)(Oz) O2 w(Iw)(Ow).   4.1-23
The following discussion refers to x(Ix)(Ox) as the output operand, y(Iy)(Oy) as the input-1 operand, z(Iz)(Oz) as the input-2 operand, and w(Iw)(Ow) as the input-3 operand. Based on these references, a sub-systolic array is constructed by determining the valid projection vectors which assign the correct output operands, input-1 operands, input-2 operands, and input-3 operands to a set of cell locations such that the algorithm correctness and the properties of systolic arrays are preserved.

[Figure 4.8. The violated condition 1 of Theorem 4.7.]

The projection vectors of the geometric representations constructed from the indices of the output operands are linearly dependent to the values of feature Vxx (Definition 4.4). Similarly, the projection vectors for the geometric representations constructed from the indices of the input-1, input-2, and input-3 operands are defined to be linearly dependent to the values of features Vyy, Vzz, and Vww, respectively. Furthermore, the following vectors are used to indicate these projection vectors: Vy indicates the projection vector of input-1 operands; Vz indicates the projection vector of input-2 operands; Vw indicates the projection vector of input-3 operands. Vx, which was defined earlier, indicates the projection vector of output operands. In order to determine the valid projection vectors, certain projection vector interrelationships must be satisfied.

Lemma 4.8 - Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = yi(Iyi)(Oyi);   4.1-24
xl(Ixl)(Oxl) = yj(Iyj)(Oyj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold. When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vyyij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.
Proof: Suppose the contrary, that projection vector Vx1 is linearly dependent to feature value Vxxij and projection vector Vx2 is not linearly dependent to feature value Vyyij. Then the following hold for simple computations ci and cj:

xi = xj;   4.1-25
yi = yj;
Oxi = Oxj;
Oyi = Oyj.

Let G1 be the geometric representation which includes ci and cj. Let G2 (a geometric representation) include ck and cl. From Eq. 4.1-24, Vxxkl = Vyyij, and hence Vx2 is linearly independent to Vxxkl. The projection of G1 and G2 using projection vectors Vx1 and Vx2, respectively, will map ci and cj onto the same cell and ck and cl onto two different cells. Figure 4.9 illustrates this condition. In Figure 4.9, the input-1 operands of cell q are received from two different cells, q1 and q2. This will require a MUX at the input-1 operand line of cell q, violating the simple communication property. □

To illustrate the condition of Lemma 4.8, consider simple computations c3, c6, c4, and c7 of Eq. 4.1-1, repeated here for convenience:

c3: x21(/) = x22(*-) / u22( );   4.1-26
c6: x11(/) = x12(*-) / u11( );
c4: x22(*-) = x23(*-) - u23( ) * x31(/);
c7: x12(*-) = x13(*-) - u12( ) * x21(/).

Lemma 4.8 implies that if Vx1 = (1, 0)', which is linearly dependent to feature value Vxx36, is used to map simple computations c3 and c6 onto one cell, then projection vector Vx2 = (1, 0)' must be used to map c4 and c7 onto one cell. Note that Vx2 is linearly dependent to feature value Vyy36.

[Figure 4.9. The violated condition of Lemma 4.8.]

Lemma 4.9 - Let zi(Izi) and zj(Izj) be previously computed outputs. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = zi(Izi)(Ozi);   4.1-27
xl(Ixl)(Oxl) = zj(Izj)(Ozj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold.
When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vzzij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.

Proof: The proof is analogous to that of Lemma 4.8; letter z is substituted for letter y. □

Lemma 4.10 - Let wi(Iwi) and wj(Iwj) be previously computed outputs. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = wi(Iwi)(Owi);   4.1-28
xl(Ixl)(Oxl) = wj(Iwj)(Owj).

For the simple computation pair (ci, cj) ∈ C to be a valid interactive pair the following must hold. When projection vector Vx1 = t1 * Vxxij, where t1 ∈ ZR, is used to map simple computations ci and cj onto the same cell location, then projection vector Vx2 = t2 * Vwwij, where t2 ∈ ZR, must be used to map ck and cl onto the same cell location.

Proof: The proof is similar to that of Lemma 4.8. Here letter w is substituted for letter y. □

Theorem 4.11 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if:

Vx = t1 * Vxxij;   4.1-29
Vy = t2 * Vyyij;
Vz = t3 * Vzzij;
Vw = t4 * Vwwij;

where t1, t2, t3, and t4 ∈ ZR.

Proof: The proof follows from Lemmas 4.8, 4.9, and 4.10. These lemmas indicate that if projection vector Vx is linearly dependent to Vxxij, then projection vectors Vy, Vz, and Vw must be linearly dependent to Vyyij, Vzzij, and Vwwij, respectively. □

Theorem 4.12 - Simple computation pair (ci, cj) ∈ C is a valid interactive pair if the following hold: 1. If xi = yi, xj = yj, Oxi = Oyi, and Oxj = Oyj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vyyij. 2. If xi = zi, xj = zj, Oxi = Ozi, and Oxj = Ozj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vzzij. 3. If xi = wi, xj = wj, Oxi = Owi, and Oxj = Owj, then projection vector Vx must be linearly dependent to feature values Vxxij and Vwwij.
(In the first condition of Theorem 4.12, it is not necessary to check whether xi = yi and xj = yj. In a simple computation, the variable name of the output operand is assumed to always be the same as that of the input-1 operand. The input-1 operands are always partial results. This condition is stated in this form in order to be consistent with the second and third conditions).

Proof: Consider condition 1, which is a special case of Lemma 4.8. Let ck and cl be two simple computations such that

xk(Ixk)(Oxk) = yi(Iyi)(Oyi);   4.1-30
xl(Ixl)(Oxl) = yj(Iyj)(Oyj).

Since the variable names and the sets of operators of the output operands and the input-1 operands of ci and cj are the same, simple computations ci, cj, ck, and cl belong to the same geometric representation. Therefore, projection vector Vx must be linearly dependent to both feature values Vxxij and Vxxkl. However, from Eq. 4.1-30, Ixk = Iyi and Ixl = Iyj. Hence, Vx must also be linearly dependent to Vyyij. Proofs of conditions 2 and 3 are similar to that of condition 1. □

To illustrate condition 1 of Theorem 4.12, consider the following simple computations of Eq. 4.1-1, repeated here for convenience:

c4: x22(*-) = x23(*-) - u23( ) * x31(/);   4.1-31
c7: x12(*-) = x13(*-) - u12( ) * x21(/);
c5: x23(*-) = x24(*-);
c8: x13(*-) = x14(*-) - u13( ) * x31(/).

Simple computations c4 and c7 satisfy condition 1. Note that x22(*-), x12(*-), x23(*-), and x13(*-) are of the same operand type. This implies that simple computations c4, c7, c5, and c8 belong to the same geometric representation. Therefore, a projection vector, Vx, that maps (c4, c7) onto the same cell must also map (c5, c8) onto the same cell. This implies that Vx must be linearly dependent to both Vxx47 and Vxx58. But Vxx58 is the same as Vyy47. Occasionally, in a systolic array, a primary output is computed in two or more sub-systolic arrays.
That is, a final partial result of one sub-systolic array is sent as a partial result to another sub-systolic array for further calculations. For example, consider simple computations c_6 and c_7 of Eq. 4.1-1, repeated here for convenience.

c_6: x_11(/) = x_12(*-) / u_11( );    4.1-32
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/).

Partial result x_12 is computed in a sub-systolic array where each cell performs multiplication and subtraction operations. Operand x_12 is then sent to another sub-systolic array, where each cell performs division, leading to the computation of x_11. It is obvious that the cell model of Figure 3.1b should not be used to implement these sub-systolic arrays. The following feature interrelationship theorem is used to eliminate those sub-systolic solutions which require the cell model of Figure 3.1b and also send results to other sub-systolic arrays.

Theorem 4.13 - Let C_1 and C_2 be two subsets of simple computations such that each subset is computed in a separate sub-systolic array and the variable names of the output operands are the same in both sets. (This means that a resultant variable is calculated by the cooperation of both sub-systolic arrays.) Let G_1 and G_2 be the geometric representations of C_1 and C_2, respectively. Projection vector Vx_1 of G_1 and projection vector Vx_2 of G_2 are valid projection vectors if Vx_1 and Vx_2 are both linearly independent to the dependency vector (i.e., values of feature Vxy as defined in Chapter 3).

Proof: Suppose, on the contrary, that Vx_1 is linearly dependent to the dependency vector and Vx_2 is not. Let S_1 and S_2 represent two sub-systolic arrays constructed from the projections of G_1 and G_2, respectively. Since Vx_1 is linearly dependent to the dependency vector, all the simple computations mapped onto the same cell location in S_1 will require the cell model of Figure 3.1b.
That is, the successive partial results will stay in the same cell, where each is used to calculate the next partial result. From the statement of the theorem, the simple computations associated with a resultant variable are members of both subsets C_1 and C_2. Therefore, the final partial results of each cell in S_1 must be sent as inputs to sub-systolic array S_2. However, since S_1 requires cells with stationary partial results (Figure 3.1b), a special controller is needed to handle such communications between sub-systolic arrays S_1 and S_2. This contradicts the simple communication property. □

To illustrate the condition of Theorem 4.13, consider the simple computations of subsets C_1 and C_2 of Eq. 4.1-2, which are repeated here for convenience.

c_2: x_32(*-) = x_33(*-);    4.1-33
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_5: x_23(*-) = x_24(*-);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/);
c_8: x_13(*-) = x_14(*-) - u_13( ) * x_31(/);
c_9: x_14(*-) = x_15(*-).

c_1: x_31(/) = x_32(*-) / u_33( );    4.1-34
c_3: x_21(/) = x_22(*-) / u_22( );
c_6: x_11(/) = x_12(*-) / u_11( ).

The simple computations of C_1 (Eq. 4.1-33) require cells performing multiplication and subtraction operations and those of C_2 (Eq. 4.1-34) require cells performing division. If Vx_1 = (0, 1)^t is used to project the geometric representation of C_1, then simple computation c_2 is mapped onto one cell location, c_4 and c_5 are mapped onto another cell location, and c_7, c_8, and c_9 are mapped onto a third cell location. This implies that, for example, x_23 of c_5 and x_22 of c_4 are computed in the same cell, say q. Operand x_23, which is calculated in cell q, is used as the partial result in cell q in the calculation of x_22. Hence, cell q has the input/output intra-connection of Figure 3.1b. Let simple computation c_3 of C_2 be computed in cell p. Since the simple computations of Eq. 4.1-33 are backward recursive functions, c_4 is the last computation of cell q.
Therefore, the output operand of c_4, which is x_22 (and which is also the last output operand of cell q), must be sent to cell p as an input. This requires additional control communication, in violation of the simple communication property. □

Theorems 4.1 - 4.7 and Theorems 4.11 - 4.13 define the feature interrelationships. These interrelationships form a set of necessary conditions that must be satisfied before a pair of simple computations can be computed in the same cell without violating the properties of systolic arrays. The sufficient condition is stated below in Theorem 4.14.

Definition 4.5 - Simple computation pair (c_k, c_l) is called a precedence pair of simple computation pair (c_i, c_j) if the output operands of c_k and c_l form either the input-1, input-2, or input-3 operands of c_i and c_j. As an example, consider simple computations c_1, c_3, c_4, and c_7 of Eq. 4.1-1, which are repeated here for convenience.

c_1: x_31(/) = x_32(*-) / u_33( );    4.1-35
c_3: x_21(/) = x_22(*-) / u_22( );
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/).

The output operands of simple computations c_1 and c_3 are used as the input-3 operands in the calculation of c_4 and c_7, respectively. In this case, simple computation pair (c_1, c_3) is called a precedence pair of simple computation pair (c_4, c_7).

Theorem 4.14 - Simple computation pair (c_i, c_j) ∈ C is a valid interactive pair if all of its precedence pairs are valid interactive pairs. Proof: Let (c_k, c_l) be a precedence pair of (c_i, c_j). From Lemmas 4.8, 4.9, or 4.10, when (c_i, c_j) is computed in the same cell, then (c_k, c_l) must also be computed in the same cell. That is, (c_k, c_l) must also be a valid interactive pair. Similarly, all precedence pairs of (c_k, c_l) must be valid interactive pairs. □ Theorem 4.14 has a global effect. It ensures that data flow in the array does not violate the simple communication property.
To determine whether a simple computation pair is a valid interactive pair, one must determine whether every precedence pair is a valid interactive pair. This is done by forming a test tree for each simple computation pair. Pair (c_i, c_j) is considered the root of the test tree, which has at most three sons, one for each of the existing precedence pairs. Similarly, each son in the test tree may contain at most three sons of its own, and so on. Figure 4.10 shows such a test tree. In Figure 4.10 the simple computation pairs at nodes C1 and C2 are the precedence pairs of the simple computation pair at node B1. Similarly, the pair at node B1 is a precedence pair of the simple computation pair at node A1. A1 is the root-node and B1, B2, and B3 are the son-nodes of A1. It is easy to see that the leaves of the test tree contain those simple computation pairs which have no precedence pairs. Note that a precedence pair is formed from two simple computations and not from two input operands. For example, consider simple computations c_5 and c_8 of Eq. 4.1-33. Although simple computation c_8 requires the output operands of simple computations c_9 and c_1, no precedence pairs can be formed from the respective input operands of c_5 and c_8. This is because the only input operand of c_5 is an initial partial result (a primary input).

Figure 4.10. A test tree.

A simple computation pair can now be marked VIP (valid interactive pair) or NVIP (not a valid interactive pair) based on the results obtained from applying the feature interrelationships. A pair is marked NVIP as soon as one of the feature interrelationships is not satisfied. A pair is marked VIP if all the feature interrelationships are satisfied. However, a VIP may be changed to NVIP later if Theorem 4.14 is not satisfied for that simple computation pair. For each simple computation pair, the test starts at the root of the test tree.
If the simple computation pair at the root-node is unmarked, then the feature interrelationships are applied to the pair at the root-node and the test result is marked. If the pair at the root-node is marked VIP, then the process continues at the son-nodes, each son-node, in turn, becoming a new root-node. When a pair at a node is marked NVIP, then all the simple computation pairs at the preceding father-nodes are also marked NVIP. For example, using the test tree of Figure 4.10, suppose the test has been completed for the simple computation pairs at nodes A1 and B1 and both are marked VIP. Next, the test continues by taking node C1 as the new root-node. The feature interrelationships are then applied to the pair at node C1. Suppose one of the feature interrelationships fails. Then the simple computation pair at node C1 is marked NVIP. At this time, the mark of the simple computation pair at node B1, which is the father-node of C1, is changed from VIP to NVIP, and in turn the mark of the simple computation pair at node A1 (the father-node of B1) is also changed to NVIP. Once a simple computation pair is marked NVIP, its mark remains unchanged throughout the testing process. The test tree for simple computation pair (c_4, c_7) of Eq. 4.1-33 is illustrated in Figure 4.11.

Figure 4.11. The test tree of (c_4, c_7) of Eq. 4.1-33.

Next to each node is the corresponding simple computation pair. Simple computation pair (c_2, c_4) is a precedence pair of pair (c_1, c_3); pairs (c_1, c_3) and (c_5, c_8) are the precedence pairs of (c_4, c_7). In order for (c_4, c_7) to be a valid interactive pair, simple computation pairs (c_2, c_4), (c_1, c_3), and (c_5, c_8) must all be valid interactive pairs. Constructing a test tree for each pair of simple computations results in a test forest. The testing process for valid interactions is applied only once for each of the simple computation pairs and the test result is marked.
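The marking process can be sketched as a recursive traversal of the test tree. The helpers `precedence_of` and `passes_interrelationships` are hypothetical stand-ins for the precedence-pair lookup and the feature-interrelationship tests; they are not interfaces defined in the dissertation.

```python
def mark_pair(pair, precedence_of, passes_interrelationships, marks=None):
    """Mark `pair` VIP or NVIP. A pair is VIP only if it satisfies the
    feature interrelationships and every precedence pair in its test
    tree is VIP; an NVIP below propagates up to the father-nodes.
    Each pair is tested only once: the shared `marks` dictionary plays
    the role of the test forest."""
    if marks is None:
        marks = {}
    if pair in marks:                  # already tested elsewhere
        return marks[pair]
    if not passes_interrelationships(pair):
        marks[pair] = "NVIP"
        return "NVIP"
    marks[pair] = "VIP"                # provisional; may change later
    for son in precedence_of(pair):
        if mark_pair(son, precedence_of,
                     passes_interrelationships, marks) == "NVIP":
            marks[pair] = "NVIP"       # father inherits the NVIP mark
            break
    return marks[pair]
```

With the tree of Figure 4.11, `precedence_of` would map ("c4", "c7") to [("c1", "c3"), ("c5", "c8")] and ("c1", "c3") to [("c2", "c4")]; a failure at (c_2, c_4) then propagates NVIP up through (c_1, c_3) to the root (c_4, c_7).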
If a simple computation pair is a precedence pair of more than one other simple computation pair, the repeated application of the testing process to the common precedence pair is eliminated. The previous theorems indicate the necessary and sufficient conditions that must be satisfied before a pair of simple computations can be computed in the same cell without violating the properties of systolic arrays. In the next section, the values of features Vxx, Vyy, Vzz, and Vww of the valid interactive pairs are used to generate the valid projection vectors. These vectors are then translated into geometric transformation matrices. Each geometric representation is transformed into a new representation. The projection of the new representation onto a lower dimensional space produces a sub-systolic array. The final systolic array is constructed by combining the sub-systolic arrays.

4.2 VALID PROJECTION VECTORS

In this section, the values of interactive features Vxx, Vyy, Vzz, and Vww of the valid interactive pairs are used to determine the valid projection vectors for each subset of simple computations. A set of 4-tuples (Vx, Vy, Vz, Vw) of valid projection vectors is generated from each subset of simple computations. The projection vectors of each 4-tuple correspond to a sub-systolic solution. Projection vector Vx is used to project the output operands, and hence the corresponding simple computations, onto a set of cell locations. This process will also identify the total number of cells and their individual locations on the projection space. Projection vectors Vy, Vz, and Vw define the relative operand distances for each of the cell input lines. In non-homogeneous systolic arrays, these vectors also define the necessary communication linkages between one or more adjacent sub-systolic arrays.

Theorem 4.15 - Let C be a subset of simple computations.
Projection vectors Vx, Vy, Vz, and Vw are valid projection vectors for the geometric representations of C if each simple computation pair, (c_i, c_j) ∈ C, is a valid interactive pair when

Vxx_ij = t_1 * Vx,    4.2-1
Vyy_ij = t_2 * Vy,
Vzz_ij = t_3 * Vz,
Vww_ij = t_4 * Vw,

where t_1, t_2, t_3, t_4 ∈ ZR. (This is an extension to Theorem 4.11.)

Proof: Assuming the other feature interrelationships hold, from Theorem 4.11 a simple computation pair, (c_i, c_j), is a valid interactive pair if projection vectors Vx, Vy, Vz, and Vw are linearly dependent to interactive feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij, respectively. This is true for each valid interactive pair. Therefore, in order to construct a sub-systolic array from C using projection vectors Vx, Vy, Vz, and Vw, Eq. 4.2-1 must hold for each pair in C which is mapped to the same cell in the sub-systolic array. □

Theorem 4.15 provides a mechanism for generating the valid projection vectors from each subset of simple computations. One way is to construct a table of all possible projection vectors from the values of features Vxx, Vyy, Vzz, and Vww. Feature value Vxx_ij, calculated from simple computation pair (c_i, c_j), is compared with the values already stored in the table. If Vxx_ij is linearly dependent to any one of the corresponding existing values, then a match has been made. (Note, Vxx_ij is linearly dependent to a projection vector, Vx.) Otherwise, feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij are added as a record to the existing list in the table. Each record is a candidate which indicates certain values for projection vectors Vx, Vy, Vz, and Vw. In case of a match, feature values Vyy_ij, Vzz_ij, and Vww_ij are compared to the corresponding stored values. If any one of these feature values is not linearly dependent to the corresponding stored value, then that record is no longer a candidate record.
The record is also not a candidate record if simple computation pair (c_i, c_j) does not satisfy all the feature interrelationships. That is, if (c_i, c_j) violates any one of the feature interrelationships, then the corresponding record (which is identified by feature value Vxx_ij) is marked as a non-candidate record. The process continues until all the simple computation pairs are tested. The smallest vectors which are linearly dependent to the values of the candidate records form the set of valid projection vectors. Specifically, the following procedure is used to generate the valid projection vectors for each subset of simple computations. The valid projection vectors extracted from each subset of simple computations are referred to as the valid local projection vectors.

Procedure A: Generating Valid Local Projection Vectors

Let C be a subset of simple computations such that Theorems 4.1 and 4.2 are satisfied for each simple computation pair, (c_i, c_j), in C. (The simple computations of C are computed in a sub-systolic array.) A table, T, of records, where each record consists of values linearly dependent to feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij, is constructed. Table T will contain only one copy of each different record, and the records are identified by the values of feature Vxx.

Step 1. For simple computation pair (c_i, c_j) ∈ C, compare Vxx_ij with each record in T. If Vxx_ij is not linearly dependent to the corresponding value in any of the records in T, go to Step 2. If Vxx_ij is linearly dependent to the corresponding value in a record, r, in T and r is not marked as "bad", go to Step 3. Otherwise, go to Step 5.

Step 2. Add feature values Vxx_ij, Vyy_ij, Vzz_ij, and Vww_ij as a record, r, to the existing list in T and go to Step 4.

Step 3. Compare feature values Vyy_ij, Vzz_ij, and Vww_ij to the corresponding values in record r (Theorem 4.15).
If either Vyy_ij, Vzz_ij, or Vww_ij is not linearly dependent to the corresponding value in r, then mark record r as "bad" and go to Step 5. Otherwise, go to Step 4.

Step 4. Apply the conditions of Theorems 4.3 - 4.7 and 4.12 - 4.14 to simple computation pair (c_i, c_j). If any one of these conditions is not satisfied, then mark record r as "bad" and go to Step 5.

Step 5. If all the simple computation pairs in C are exhausted, print all the records in T which are not marked "bad" and stop. Otherwise, choose a different pair, (c_i, c_j), in C and go to Step 1.

Each "good" record in T indicates valid projection vectors Vx, Vy, Vz, and Vw. If there exist no "good" records in T, then there exist no valid projection vectors. This means that either there exist no sub-systolic solutions to implement the simple computations of subset C or each simple computation of C must be computed in a separate cell. In the latter case, each point in the geometric representation indicates a cell location and no geometric projections are performed.

The simple computations of Eq. 4.1-1 are grouped into two subsets C_1 and C_2 as indicated in Eq. 4.1-2 and repeated here for convenience.

C_1 = {c_2, c_4, c_5, c_7, c_8, c_9},    4.2-2
C_2 = {c_1, c_3, c_6},

where

c_2: x_32(*-) = x_33(*-);    4.2-3
c_4: x_22(*-) = x_23(*-) - u_23( ) * x_31(/);
c_5: x_23(*-) = x_24(*-);
c_7: x_12(*-) = x_13(*-) - u_12( ) * x_21(/);
c_8: x_13(*-) = x_14(*-) - u_13( ) * x_31(/);
c_9: x_14(*-) = x_15(*-);

and

c_1: x_31(/) = x_32(*-) / u_33( );
c_3: x_21(/) = x_22(*-) / u_22( );
c_6: x_11(/) = x_12(*-) / u_11( ).

Procedure A is applied to the simple computations of subsets C_1 and C_2 of Eq. 4.2-2, and the results are shown in Table 4.2. Let G_1 and G_2 represent the geometric representations constructed from the output operands' indices of the simple computations in C_1 and C_2, respectively. Let S_1 and S_2 represent the sub-systolic arrays constructed from the projections of G_1 and G_2 onto a line, respectively. From Table 4.2, projection vector Vx = (1, 0)^t of C_1 is used to construct S_1 from G_1.
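The steps of Procedure A can be sketched as follows. Here `features` and `passes_other_tests` are hypothetical callbacks, not interfaces from the dissertation: the first supplies the interactive feature 4-tuple of a pair, and the second applies the remaining theorem checks (4.3 - 4.7 and 4.12 - 4.14).

```python
def linearly_dependent(u, v):
    """True if integer vectors u and v are scalar multiples."""
    return all(u[i] * v[j] - u[j] * v[i] == 0
               for i in range(len(u)) for j in range(i + 1, len(u)))

def procedure_a(pairs, features, passes_other_tests):
    """Return the 'good' records of table T: candidate 4-tuples
    (Vxx, Vyy, Vzz, Vww) of valid local projection vectors."""
    table = []  # records: {"vecs": (Vxx, Vyy, Vzz, Vww), "bad": bool}
    for pair in pairs:
        Vxx, Vyy, Vzz, Vww = features(pair)
        # Step 1: records are identified by their Vxx component.
        r = next((rec for rec in table
                  if linearly_dependent(Vxx, rec["vecs"][0])), None)
        if r is None:
            # Step 2: add a new candidate record.
            r = {"vecs": (Vxx, Vyy, Vzz, Vww), "bad": False}
            table.append(r)
        elif r["bad"]:
            continue                      # Step 5: take the next pair
        elif not all(linearly_dependent(v, w) for v, w in
                     zip((Vyy, Vzz, Vww), r["vecs"][1:])):
            r["bad"] = True               # Step 3 fails (Theorem 4.15)
            continue
        if not passes_other_tests(pair):
            r["bad"] = True               # Step 4 fails
    return [rec["vecs"] for rec in table if not rec["bad"]]
```

The feature values and pair names in any concrete run would come from the algorithm representation of Chapter 3; the control flow above only mirrors the five steps of the procedure.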
Projection vectors Vy = (1, 0)^t, Vz = (1, 1)^t, and Vw = (1, 0)^t of C_1 indicate the relative input-1, input-2, and input-3 operand distances of each cell in S_1, respectively.

Table 4.2. Valid local projection vectors for C_1 and C_2 of Eq. 4.2-2.

Set of Simple      Valid Local Projection Vectors
Computations       Vx         Vy         Vz         Vw
C_1                (1,0)^t    (1,0)^t    (1,1)^t    (1,0)^t
C_2                (1,0)^t    (1,0)^t    (1,1)^t    -

The simple computations of C_2 contain only the output, input-1, and input-2 operands, so Vw of C_2 is not required. Projection vector Vx = (1, 0)^t of C_2 is used to construct S_2 from G_2, while Vy = (1, 0)^t and Vz = (1, 1)^t of C_2 define the relative distances of the input-1 and input-2 operands of each cell in S_2. The projection vectors of Table 4.2 satisfy the necessary boundary communication linkages between sub-systolic arrays S_1 and S_2. Projection vector Vw = (1, 0)^t of C_1 defines the relative distances of input-3 operands x_31(/), x_21(/), and x_11(/). This implies that x_31(/), x_21(/), and x_11(/), which are calculated in sub-systolic array S_2, are supplied to S_1 with relative distances equal to Vw = (1, 0)^t. This means that projection vector Vx of C_2, as shown in Table 4.2, must be equal to Vw of C_1. Similarly, the communication linkages from S_1 to S_2 must also be satisfied. The operands x_32(*-), x_22(*-), and x_12(*-) are calculated in S_1 and supplied to S_2 with relative distances equal to Vy of C_2. This implies that Vx of C_1 must be equal to Vy of C_2, as shown in Table 4.2.

The above discussion illustrates that certain relationships must hold between valid local projection vectors in order to satisfy the necessary boundary linkages between adjacent sub-systolic arrays. Furthermore, it also indicates that certain valid local projection vectors may not satisfy the necessary boundary linkages of sub-systolic arrays. Therefore, a list of valid global projection vectors is determined from the list of all valid local projection vectors.
A procedure to determine the valid global projection vectors from the given valid local projection vectors of each subset of simple computations is described below. Let C_1, C_2, ..., C_n be the n subsets of simple computations of an algorithm. (The simple computations of each C_i satisfy Theorems 4.1 and 4.2.) Let T_1, T_2, ..., T_n be the n tables consisting of records of valid local projection vectors of C_1, C_2, ..., C_n, respectively. Each record, r ∈ T_i, consists of at most four vectors. That is, r = (Vx, Vy, Vz, Vw). For example, let T_1 and T_2 represent two tables of records consisting of the valid local projection vectors of C_1 and C_2 of Eq. 4.2-2, respectively. Tables T_1 and T_2 each contain only one record, as shown in Table 4.2. That is,

T_1 = { [(1,0)^t, (1,0)^t, (1,1)^t, (1,0)^t] },    4.2-4
T_2 = { [(1,0)^t, (1,0)^t, (1,1)^t] }.

Let a ∈ T_1 and b ∈ T_2. That is,

a = [(1,0)^t, (1,0)^t, (1,1)^t, (1,0)^t];    4.2-5
b = [(1,0)^t, (1,0)^t, (1,1)^t].

Notation 4.1 - Let notation Vx|r∈T indicate projection vector Vx of record r in table T. Similarly, Vy|r∈T, Vz|r∈T, and Vw|r∈T indicate projection vectors Vy, Vz, and Vw in record r of T, respectively. For example, from Eq. 4.2-5, Vx|a∈T1 = (1, 0)^t and Vz|b∈T2 = (1, 1)^t.

Procedure B: Generating Valid Global Projection Vectors

Step 1. If the output operands of a subset, C_i, are used either as the input-1, input-2, or input-3 operands in another subset, C_j, then Vx|a∈Ti must be linearly dependent to either Vy|b∈Tj, Vz|b∈Tj, or Vw|b∈Tj, respectively, where a and b are two records. Repeat Step 1 for all a ∈ T_i and all such C_j. Remove those records from T_i where Vx|a∈Ti is not linearly dependent to either Vy|b∈Tj, Vz|b∈Tj, or Vw|b∈Tj, depending on whether the output operands of C_i are used either as input-1, input-2, or input-3 operands in C_j, respectively.

Step 2. If all C_i's are exhausted, then print the valid global projection vectors remaining as records in tables T_1 through T_n and stop. Otherwise, choose a different C_i and go to Step 1.
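Procedure B can be sketched the same way. In this illustrative (not canonical) interface, `tables[i]` holds the records (Vx, Vy, Vz, Vw) of T_i, with None standing for a missing Vw, and `uses[(i, j)]` names the operand slot (1 for input-1, 2 for input-2, 3 for input-3) in which the outputs of C_i appear in C_j:

```python
def linearly_dependent(u, v):
    """True if integer vectors u and v are scalar multiples."""
    return all(u[a] * v[b] - u[b] * v[a] == 0
               for a in range(len(u)) for b in range(a + 1, len(u)))

def procedure_b(tables, uses):
    """Step 1 of Procedure B: keep only records whose Vx is linearly
    dependent to the Vy, Vz, or Vw (selected by the operand slot) of
    some record in the consuming subset's table. The survivors are the
    valid global projection vectors."""
    for (i, j), slot in uses.items():
        # slot 1/2/3 selects index 1/2/3 of a record (Vy, Vz, Vw).
        tables[i] = [a for a in tables[i]
                     if any(b[slot] is not None and
                            linearly_dependent(a[0], b[slot])
                            for b in tables[j])]
    return tables
```

With the data of Table 4.2, where the outputs of C_2 feed the input-3 lines of C_1 and the outputs of C_1 feed the input-1 lines of C_2, both records survive, matching the text's observation that they are valid globally as well as locally.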
Incidentally, the records of Table 4.2 are both valid local and global projection vectors of subsets C_1 and C_2 of Eq. 4.2-2. Valid global projection vector Vx of each subset of simple computations indicates a geometric projection direction that supports a sub-systolic solution. Valid global projection vectors Vy, Vz, and Vw of each subset of simple computations indicate the relative input-1, input-2, and input-3 operand distances in the sub-systolic array, respectively. In the next chapter, each Vx is translated into an integer, non-singular transformation matrix. This transformation matrix is used to transform an original geometric representation, which is constructed from the indices of the output operands, into a new geometric representation. The projection of the new representation in a known direction will provide the individual cell locations and the simple computations of each cell. The individual cell locations and the simple computations of each cell form a subset of the design information required to construct each of the sub-systolic arrays. Other design information, such as data flow timing, cell count ratio, and design dimension, is also determined from the valid global projection vectors. The local cell interconnections are performed from the nearest cell input/output information. The valid global projection vectors guarantee both local cell connectivity and simple and regular communications, the architectural properties of systolic arrays.

4.3 CHAPTER SUMMARY

In this chapter additional features, referred to as the interactive features, are extracted from each pair of simple computations. These features and the algorithm features (extracted in Chapter 3) form the set of information extracted from the simple computations. The architectural properties of systolic arrays (i.e., simple processing elements, simple and regular data and control communications, and local cell connectivity) are expressed in terms of the feature interrelationships.
A pair of simple computations is called a valid interactive pair if it can be computed in the same cell without violating the properties of systolic arrays. A pair of simple computations is a valid interactive pair if all the feature interrelationships are satisfied and if all its precedence pairs are also valid interactive pairs. A test tree is constructed from each simple computation pair and all its successive precedence pairs. Each node in the test tree contains a simple computation pair. The simple computation pair at the root-node is a valid interactive pair if the simple computation pairs at all nodes of the tree are valid interactive pairs. The feature interrelationships are applied to each test tree constructed from each pair of simple computations. The process (Procedure A) produces as output a set of valid local projection vectors for each subset of simple computations. These projection vectors will be used to construct sub-systolic solutions to implement the simple computations of each subset. The final systolic array is constructed by joining the sub-systolic arrays. The valid local projection vectors indicate valid geometric projections used to construct sub-systolic solutions. They, however, may not satisfy the necessary boundary linkages between sub-systolic arrays. Procedure B, which operates on the valid local projection vectors, produces as output a set of valid global projection vectors. These projection vectors, if they exist, correspond to those sub-systolic solutions that, when joined, will produce a systolic array. In the next chapter, the valid global projection vectors are translated into integer, non-singular geometric transformation matrices. These matrices are used to transform an original geometric representation into a new representation such that the projection of the new representation provides the cell locations and simple computations of each cell.
The valid global projection vectors are also used to determine data movement information and cell count ratio.

CHAPTER 5

ARCHITECTURAL SPECIFICATION

This chapter describes the method for the determination of systolic array architectural specifications from the valid projection vectors. The architectural specifications needed to construct systolic arrays are:

1. individual cell locations;
2. simple computations of each cell;
3. data movement;
4. data timing;
5. design dimension;
6. data input/output patterns;
7. cell interconnection patterns;
8. cell functions.

Not all of these specifications are needed to complete the design. The information of individual cell locations, data timing, and cell interconnection patterns is by far the most important subset. Certain other specifications, such as the cell interconnection patterns, can be extracted from other information; in this case, from the simple computations of each cell. The architectural specifications directly extracted from the model include the individual cell locations, simple computations of each cell, data movement, cell functions, and design dimension. The cell interconnection patterns, data timing, and data input/output patterns are extracted from the nearest neighbor cell information and data movement information. The individual cell locations are determined via geometric projections. A geometric representation constructed from the indices of the output operands is projected onto a lower dimensional space (e.g., a plane). Each point on the projection plane represents a cell location. The design dimension is determined from the dimension of the geometric representation. The data movement information is determined from the valid global projection vectors. It identifies the stationary and non-stationary operands in the array. The stationary operands are either primary inputs or outputs. An index, referred to as the cell count ratio, is also calculated from the valid global projection vectors.
This index relates to the number of cells required. The cell count ratio and data movement information are determined early in the design process, hence providing a priori information about the different systolic solutions.

5.1 INDIVIDUAL CELL LOCATIONS

The individual cell locations are determined by projecting a geometric representation, constructed from the indices of the output operands, onto a lower dimensional space. The valid global projection vectors, described in Chapter 4, indicate the projection directions which produce systolic solutions. Since the output operands are unique and each uniquely defines a simple computation, only the valid global projection vector Vx is used to define the individual cell locations. Each different Vx indicates a different geometric projection. In order to assign a unique projection direction to all the geometric projections, Vx is translated into a geometric transformation matrix, H. Let G be the geometric representation constructed from the indices of the output operands of a subset of simple computations. Matrix H transforms an index Ix ∈ G into another index Ix' ∈ G', where H * Ix = Ix' and G' is the new geometric representation. This is denoted as H: G -> G'. Matrix H is a mapping from Z^m to Z^m, where m is the dimension of G and Z is the set of all integer numbers. This means that the entries of H are integer numbers that produce integer coordinates for each point in G'. Furthermore, H must be a non-singular matrix to provide a one-to-one mapping from G onto G'. A geometric transformation is equivalent to axes rotation. The axes of the initial geometric representation are rotated by certain angles which are calculated from the valid global projection vector Vx. For a linear (1-dimensional) sub-systolic array, geometric representation G is a 2-dimensional space. For example, in Figure 5.1, two simple computations, c_k and c_l, are shown by points at locations Ix_k and Ix_l.
Let simple computation pair (c_k, c_l) be a valid interactive pair. Let Vx be the valid global projection vector which maps c_k and c_l onto the same cell location. From the feature interrelationships discussed in Chapter 4, feature value Vxx_kl = (Ix_k - Ix_l) is linearly dependent to Vx. Select c_k and c_l such that Vxx_kl = Vx, and let Vx = (v_1, v_2)^t. The rotation of the i-j plane by φ radians counterclockwise (as shown in Figure 5.1) results in a new representation, G', indicated by the i'-j' plane. Note that the projection of G' onto the j' axis maps c_k and c_l onto the same location. This axes rotation can be represented by a transformation matrix. The following discussion illustrates the determination of transformation matrices from a valid global projection vector Vx for 2-dimensional and 3-dimensional geometric representations. The projections of these representations produce 1-dimensional (1D) and 2-dimensional (2D) systolic arrays, respectively.

Figure 5.1. A 2-dimensional geometric representation with two simple computations.

5.1.1 1D AND 2D SYSTOLIC ARRAYS

This section discusses the generation of the geometric transformation matrices for 2-dimensional and 3-dimensional geometric representations. A transformation matrix in this case is an integer, non-singular matrix which transforms each point with integer coordinates in the original geometric representation into a unique point with integer coordinates in the new representation. These matrices are generated from the valid global projection vectors, Vx's. In Figure 5.1, the angle φ is calculated as follows:

φ = - tan^-1 (v_2 / v_1)    5.1-1

where Vx = (v_1, v_2)^t. The rotation of the i-j plane by φ radians counterclockwise can be indicated by the following matrix:

R = [ cos φ   - sin φ ]    5.1-2
    [ sin φ     cos φ ]

The multiplication of matrix R by Ix_k produces Ix'_k, which indicates the new location of c_k. Let Ix_k = (x_1, x_2), where x_1 and x_2 are integer numbers, and let Ix'_k = (x'_1, x'_2).
That is,

[ cos φ   - sin φ ] [ x_1 ]   [ x'_1 ]    5.1-3
[ sin φ     cos φ ] [ x_2 ] = [ x'_2 ]

which can be rewritten as

x'_1 = cos φ (x_1) - sin φ (x_2);    5.1-4
x'_2 = sin φ (x_1) + cos φ (x_2).

However, note that x'_1 and x'_2 may not be integers, since cos φ and sin φ take values in [-1, 1]. To resolve this problem, an integer transformation matrix, H, is derived from matrix R. To do this, two cases must be examined.

Case 1: v_1 ≠ 0 (this corresponds to the values of φ which are not integer multiples of π/2)

For values of φ which are not integer multiples of π/2, divide the equations of 5.1-4 by cos φ. That is,

x'_1 / cos φ = x_1 - tan φ (x_2);    5.1-5
x'_2 / cos φ = tan φ (x_1) + x_2.    5.1-6

Eq. 5.1-1 indicates that tan φ = - v_2 / v_1. For v_1 ≠ 0 we have,

x'_1 / cos φ = x_1 + (v_2 / v_1)(x_2);    5.1-7
x'_2 / cos φ = - (v_2 / v_1)(x_1) + x_2.

Multiplying both sides of Eq. 5.1-7 by v_1 yields

v_1 (x'_1) / cos φ = v_1 (x_1) + v_2 (x_2);    5.1-8
v_1 (x'_2) / cos φ = - v_2 (x_1) + v_1 (x_2).

The right hand sides of Eq. 5.1-8 are both integers. Let

x^_1 = v_1 (x'_1) / cos φ;    5.1-9
x^_2 = v_1 (x'_2) / cos φ.

x^_1 and x^_2, which are both integer numbers, can be considered as the new coordinates of the location of c_k, where the original axes (the i-j plane) are rotated by φ radians in the counterclockwise direction. Replace the left hand sides of Eq. 5.1-8 by those of Eq. 5.1-9. The results can be represented in matrix form as follows:

[ x^_1 ]   [  v_1   v_2 ] [ x_1 ]    5.1-10
[ x^_2 ] = [ -v_2   v_1 ] [ x_2 ]

where

[  v_1   v_2 ]    5.1-11
[ -v_2   v_1 ]

is the transformation matrix, H.

Case 2: v_1 = 0

In this case, φ = - tan^-1 (v_2 / 0), which implies that

φ = - tan^-1 (∞) if v_2 > 0;    5.1-12
φ =   tan^-1 (∞) if v_2 ≤ 0;

and therefore

φ = - π/2 if v_2 > 0;    5.1-13
φ =   π/2 if v_2 ≤ 0.

Hence, v_1 = 0 corresponds to values of φ which are integer multiples of π/2. In this case H is calculated from matrix R (Eq. 5.1-2):

H = [ cos(-π/2)   - sin(-π/2) ]  if v_2 > 0;
    [ sin(-π/2)     cos(-π/2) ]

or

H = [ cos(π/2)   - sin(π/2) ]  if v_2 ≤ 0.
    [ sin(π/2)     cos(π/2) ]

Thus,
    H = [  0   1 ]   if v2 > 0;                                     (5.1-14)
        [ -1   0 ]

    H = [ 0   -1 ]   if v2 <= 0.
        [ 1    0 ]

Therefore, in general, for a 2-dimensional geometric representation, the transformation matrix, H, is calculated from the projection vector Vx = (v1, v2)^t as follows:

    H = [  v1   v2 ]   if v1 != 0;                                  (5.1-15)
        [ -v2   v1 ]

    H = [  0   1 ]     if v2 > 0 and v1 = 0;
        [ -1   0 ]

    H = [ 0   -1 ]     if v2 <= 0 and v1 = 0.
        [ 1    0 ]

The new geometric representation is G^, indicated by the i^-j^ plane. Transformation matrix H is calculated such that the projection of G^ is always performed in the direction of the i^ axis onto the j^ axis. For example, consider the geometric representation of Figure 5.2 with points at locations p = (1, 1) and q = (2, 2). Using projection vector Vx = (1, 1)^t, the corresponding transformation matrix is calculated from Eq. 5.1-15 as

    H = [  1   1 ]                                                  (5.1-16)
        [ -1   1 ]

Multiplying H by the coordinates of points p and q in the i-j plane will produce the coordinates of p and q in the new geometric representation:

    H p = p^ = (2, 0)^t;    H q = q^ = (4, 0)^t.                    (5.1-17)

This is illustrated by points p^ = (2, 0) and q^ = (4, 0) in the i^-j^ plane in Figure 5.3. Note that the projection of the new geometric representation (Figure 5.3) onto the j^ axis, as expected, maps p^ and q^ onto the same cell location.

Figure 5.2. An original geometric representation.

Figure 5.3. A transformation of the geometric representation of Figure 5.2.

The results of Eq. 5.1-15 can be extended to 3-dimensional geometric representations. The valid global projection vectors for 3-dimensional geometric representations are vectors of size three. Let Vx = (v1, v2, v3)^t be the projection vector of a 3-dimensional geometric representation, G. Let G be indicated by the i-j-k space, where each simple computation is represented by a point in G. The rotation of the i-j plane by phi = -tan^{-1}(v2 / v1) radians counterclockwise is given by matrix R1:

    R1 = [ cos phi   -sin phi   0 ]                                 (5.1-18)
         [ sin phi    cos phi   0 ]
         [    0          0      1 ]

Similar to the method discussed for 2-dimensional geometric representations, an
integer, non-singular transformation matrix, H1, is constructed from R1 as follows:

    H1 = [  v1   v2   0 ]   if v1 != 0;                             (5.1-19)
         [ -v2   v1   0 ]
         [   0    0   1 ]

    H1 = [  0   1   0 ]     if v1 = 0 and v2 > 0;
         [ -1   0   0 ]
         [  0   0   1 ]

or

    H1 = [ 0   -1   0 ]     if v1 = 0 and v2 <= 0.
         [ 1    0   0 ]
         [ 0    0   1 ]

The transformation, H1: G -> G', transforms each point in G to a point in G' with integer coordinates. Let G' be indicated by the i'-j'-k' space. Recall that projection vector Vx = (v1, v2, v3)^t, which indicates a projection direction, is a line connecting points (0, 0, 0) and (v1, v2, v3). This implies that projection vector Vx of G corresponds to a projection vector, Vx', of G', where Vx' = H1 * Vx. That is,

    H1 [ v1 ]   [ v'1 ]                                             (5.1-20)
       [ v2 ] = [  0  ]
       [ v3 ]   [ v'3 ]

The rotation of the i'-k' plane by theta = -tan^{-1}(v'3 / v'1) radians counterclockwise is given by matrix R2:

    R2 = [ cos theta   0   -sin theta ]                             (5.1-21)
         [     0       1       0      ]
         [ sin theta   0    cos theta ]

The corresponding transformation matrix H2 is

    H2 = [  v'1   0   v'3 ]   if v'1 != 0;                          (5.1-22)
         [   0    1    0  ]
         [ -v'3   0   v'1 ]

    H2 = [  0   0   1 ]       if v'1 = 0 and v'3 > 0;
         [  0   1   0 ]
         [ -1   0   0 ]

or

    H2 = [ 0   0   -1 ]       if v'1 = 0 and v'3 <= 0.
         [ 0   1    0 ]
         [ 1   0    0 ]

The transformation, H2: G' -> G'', transforms each point in G' to a point in G'' with integer coordinates. Let G'' be indicated by the i''-j''-k'' space. Projection vector Vx' = (v'1, 0, v'3)^t of G' corresponds to a projection vector, Vx'', of G'', where Vx'' = H2 * Vx'. That is,

    H2 [ v'1 ]   [ v''1 ]                                           (5.1-23)
       [  0  ] = [  0   ]
       [ v'3 ]   [  0   ]

Projection vector Vx'' is a line from (0, 0, 0) to (v''1, 0, 0) in the i''-j''-k'' space, which defines the i'' axis as the projection direction and the j''-k'' plane as the projection plane. Transformation matrices H1 and H2, which are both integer and non-singular matrices, are combined to produce a single integer and non-singular transformation matrix, H, by

    H = H2 * H1.                                                    (5.1-24)

5.1.2 m-DIMENSIONAL SYSTOLIC ARRAYS

Equations 5.1-15 and 5.1-24 illustrate the translation of the valid global projection vector Vx into an integer, non-singular transformation matrix for 2-dimensional and 3-dimensional geometric representations, respectively.
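Both constructions can be exercised directly. The following Python sketch (the function names are ours, not part of the dissertation's design facility) builds H from Vx for the 2-dimensional case of Eq. 5.1-15 and the 3-dimensional case of Eq. 5.1-24, and checks that H maps Vx onto the projection axis:

```python
def h_2d(v1, v2):
    # Eq. 5.1-15: integer, non-singular H for Vx = (v1, v2)^t.
    if v1 != 0:
        return [[v1, v2], [-v2, v1]]
    return [[0, 1], [-1, 0]] if v2 > 0 else [[0, -1], [1, 0]]

def matmul3(a, b):
    # 3 x 3 integer matrix product.
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    return tuple(sum(a[i][k] * v[k] for k in range(3)) for i in range(3))

def h_3d(vx):
    # Eq. 5.1-24: H = H2 * H1, with H1 and H2 from Eqs. 5.1-19 and 5.1-22.
    v1, v2, _ = vx
    if v1 != 0:
        h1 = [[v1, v2, 0], [-v2, v1, 0], [0, 0, 1]]
    elif v2 > 0:
        h1 = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
    else:
        h1 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    w1, _, w3 = matvec3(h1, vx)          # Vx' = H1 * Vx = (v'1, 0, v'3)^t
    if w1 != 0:
        h2 = [[w1, 0, w3], [0, 1, 0], [-w3, 0, w1]]
    elif w3 > 0:
        h2 = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]
    else:
        h2 = [[0, 0, -1], [0, 1, 0], [1, 0, 0]]
    return matmul3(h2, h1)

# 2D: the Figure 5.2 example with Vx = (1, 1)^t.
assert h_2d(1, 1) == [[1, 1], [-1, 1]]
# 3D: H * Vx points along the i'' axis, e.g. (1, 1, 1) -> (5, 0, 0).
assert matvec3(h_3d((1, 1, 1)), (1, 1, 1)) == (5, 0, 0)
```

Because H * Vx has zeros in all but the first coordinate, dropping that coordinate projects every line parallel to Vx onto a single cell location, which is exactly what the design method requires.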
The projections of the 2-dimensional and 3-dimensional geometric representations produce 1-dimensional and 2-dimensional sub-systolic arrays, respectively. However, the method of translating projection vector Vx to a transformation matrix can be extended to higher (i.e., >3) dimensional geometric representations.

Procedure C: m-Dimensional Transformation Matrix Construction

1. Let Vx(0) = (v1(0), v2(0), ..., vm(0))^t be the initial valid global projection vector. Let H(1), ..., H(m-1) be m x m diagonal matrices where the diagonal elements are one and the non-diagonal elements are zero. Let k = 1.

2. Modify H(k) = (hij(k)) as follows:
   i. let a = v1(k-1) and b = v_{k+1}(k-1);
   ii. determine the new values of a and b from Table 5.1 (in Table 5.1, gcd stands for the greatest common divisor);
   iii. let
        h_{1,1}(k) = a;
        h_{1,k+1}(k) = b;
        h_{k+1,1}(k) = -b;
        h_{k+1,k+1}(k) = a.

3. Let Vx(k) = H(k) * Vx(k-1).

4. Let k = k + 1; if k <= m - 1 then go to 2, otherwise go to 5.

5. Let H = H(m-1) * H(m-2) * ... * H(1).

Table 5.1. Lookup Table for Transformation Matrix Construction.

              a = 0              a != 0
  b = 0       a = 1, b = 0       a = a/gcd(a, b), b = b/gcd(a, b)
  b > 0       a = 0, b = 1       a = a/gcd(a, b), b = b/gcd(a, b)
  b < 0       a = 0, b = -1      a = a/gcd(a, b), b = b/gcd(a, b)

Figure 5.7. The systolic solution for solving an upper triangular system of linear equations.

5.6 CELL FUNCTIONS

The functions of each cell are determined early in the design process. In general, before generating the simple computations (Chapter 3), the functions of each different cell type are defined. Then, the algorithm is decomposed into its equivalent set of simple computations such that each is computable in a cell. Only those cells with at most three inputs and three outputs are considered here. Each such cell can perform at most two operations during a clock cycle. However, the method can be extended to more complex cell structures performing many operations during a clock cycle.
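Procedure C (Section 5.1.2) can be prototyped in a few lines of Python. The if/else chain below encodes our reading of Table 5.1 (a != 0: divide a and b by gcd(a, b); a = 0: replace (a, b) by (1, 0), (0, 1), or (0, -1) according to the sign of b); treat this as a sketch under those assumptions, not the design facility's implementation:

```python
from math import gcd

def identity(m):
    return [[int(i == j) for j in range(m)] for i in range(m)]

def procedure_c(vx):
    # Build the m x m integer, non-singular transformation matrix H
    # from the valid global projection vector Vx (Procedure C).
    m = len(vx)
    v = list(vx)
    h = identity(m)
    for k in range(1, m):                  # steps 2-4, k = 1 .. m-1
        a, b = v[0], v[k]
        if a != 0:                         # Table 5.1, column a != 0
            g = gcd(a, b)                  # note: gcd(a, 0) = |a|
            a, b = a // g, b // g
        else:                              # Table 5.1, column a = 0
            a, b = (1, 0) if b == 0 else ((0, 1) if b > 0 else (0, -1))
        hk = identity(m)
        hk[0][0], hk[0][k] = a, b          # step 2.iii
        hk[k][0], hk[k][k] = -b, a
        v = [sum(hk[i][j] * v[j] for j in range(m)) for i in range(m)]
        h = [[sum(hk[i][t] * h[t][j] for t in range(m)) for j in range(m)]
             for i in range(m)]            # step 5, accumulated on the fly
    return h

# A 4-dimensional example: H * Vx points along the first axis.
H = procedure_c((1, 2, 0, -1))
```

Each pass zeroes one more component of Vx, so after m - 1 passes the transformed projection vector lies along the first axis and projection amounts to dropping the first coordinate.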
5.7 CHAPTER SUMMARY

This chapter presented the extraction of systolic array architectural characteristics such as the individual cell locations, data movement, the simple computations of each cell, data flow timing, design dimension, input/output data patterns, and cell count ratio. The individual cell locations and the simple computations of each cell are determined from the projections of the geometric representations. Data movement and cell count ratio are determined from the valid global projection vectors. From the nearest-cell information and the data movement information a macro cell is constructed. A sub-systolic array is constructed by repeating the macro cell at each cell location. A systolic array is constructed by joining all the component sub-systolic arrays. The data flow timing of each sub-systolic array is determined from the communication lines of the macro cell, which indicate the rate at which actual data enters and exits the sub-systolic array. The data flow timing of a systolic array is equal to the largest data flow timing value of its sub-systolic arrays. This rate and the data dependency information from the adjacent cells indicate the timing of each datum and the input/output patterns. Using the data movement information and the cell count ratio, systolic solutions with different numbers of cells and I/O limitations can be selected early in the design process.

CHAPTER 6

DESIGN TOOLS AND RESULTS

A design facility based on this method has been developed for designing and simulating systolic arrays from algorithms. Section 6.1 describes the design facility. The systolic solutions for matrix multiplication, LU decomposition, the solution of lower triangular systems of equations, a finite impulse response filter, and 2D convolution are given in Section 6.2.

6.1 DESIGN FACILITY

The block diagram of the design facility is illustrated in Figure 6.1. The routines were developed only for the purpose of research and at present they are loosely connected.
Furthermore, only those procedures which were essential for this research are implemented. In Figure 6.1, the procedures of the Near Cell Information Analyzer and the Structure Editor are not implemented. In this section the purpose of each routine is discussed.

Algorithm Decomposition — This step translates the original algorithm, described by mathematical expressions, into equivalent simple computations. This is referred to as the algorithm initial representation. The generalization of this step, however, requires the writing of a universal routine that translates a given mathematical expression into its equivalent simple computations. Mathematical expressions in the categories considered in this work are easily translated into simple computations.

Feature Extraction — This routine operates on the algorithm initial representation; it computes, for each pair of simple computations, those feature values which involve index differences. Hence, the multiple evaluation of these feature values during the generation of the valid projection vectors is eliminated.

Projection Vector Generator — This routine is the implementation of Procedures A and B of Chapter 4. At its termination, the routine lists all the valid global projection vectors for each subset of simple computations. If, for a subset of simple computations, no valid global projection vector exists, then either there exists no sub-systolic solution or one cell per simple computation is needed.

Transformation Matrix Generator — For a given projection vector, Vx, this routine computes the geometric transformation matrix. These matrices are stored for later use in the construction of larger systolic arrays. The systolic design method is applied to small problems. The geometric transformation matrices are then used to transform the geometric representations of larger problem sizes.

Geometric Transformation and Projection — This routine performs the geometric transformation and projection.
Specifically, it determines the new geometric representation, G'. The projection of G' onto the lower-dimensional space (e.g., the j'-k' plane) provides the individual cell locations and the simple computations of each cell.

Near Cell Information — From the simple computations of each cell and the data movement information, this routine determines the macro cell for each sub-systolic solution. Repeating the macro cell at each cell location produces the sub-systolic array.

Figure 6.1. The system diagram.

Structure Generator — This routine is used to construct the systolic array for simulation. Each cell is considered as a node, where the cell interconnections are performed through linked lists. Each node is a record consisting of input registers, output registers, cell functions, node type (a cell or a delay register), cell type (an inner cell or a boundary cell), and linkages. Given the individual cell locations and the macro cell (for homogeneous arrays), the sub-systolic array is constructed. For non-homogeneous arrays, the data directions may not be uniform and some structural editing may be needed.

Structure Editing — This routine is used to modify the generated array in case the systematic steps do not complete the design. Structural editing may be needed in the following two cases.

1. The systematic steps generate only those cell locations which are required to do the necessary calculations. However, in some designs (e.g., Figure 6.21), for the purpose of design regularity, additional cells are added to the design. These cells perform no useful calculations and serve only as delay registers.

2.
For non-homogeneous systolic arrays, the design process produces the cell locations and cell interconnections for each of the sub-systolic arrays. Each sub-systolic array is a homogeneous array and therefore has uniform cell interconnections. The final systolic array is then constructed by joining all the sub-systolic arrays. Often, when sub-systolic arrays are joined together to form the final systolic array, the cell interconnections between the boundary cells of the adjacent sub-systolic arrays are different from those in either of the sub-systolic arrays. In this case, structural editing is needed to modify the boundary cell interconnections.

Simulator — This routine performs the final simulation for a specific number of clock cycles. At each clock cycle the input data in the desired input patterns is supplied to the simulator. Similarly, at each clock cycle the routine lists the output data from the respective boundary cells.

The system is used to construct and simulate systolic solutions for several known problems. This system has reproduced known systolic solutions for these algorithms and in some cases has produced several new solutions.

6.2 RESULTS

This new method for constructing systolic arrays is applied to several problems and the results are reported. Matrix multiplication, LU decomposition, and the solution of lower triangular systems of equations are selected as space-indexed algorithms. The finite impulse response (FIR) filter and 2D convolution are chosen as time-indexed algorithms. The transformation of each algorithm into its equivalent set of simple computations is given. For each algorithm, the procedures are applied to a small problem size. The selected problem size is the smallest size that represents the algorithm. That is, the simple computations generated from two different problem sizes should differ only in their number and not in the number of subsets of simple computations.
For each algorithm the valid global projection vectors, transformation matrices, data flow timing, and data movement are also reported. The cell count ratio for each different systolic solution is also reported.

6.2.1 MATRIX MULTIPLICATION

Consider matrix multiplication AB = C, where A(a_ij) is N x M, B(b_ij) is M x Q, and C(c_ij) is N x Q. This algorithm is described by

    c_ij = sum_{k=1}^{M} a_ik * b_kj   for i = 1, ..., N and j = 1, ..., Q.   (6.2-1)

Each c_ij is the sum of certain specified products. The order in which the products are computed is not important. This provides considerable freedom as to how these calculations are performed. For example, Eq. 6.2-1 does not specify the order in which the products are summed. The following simple computations perform the summations in the lexicographical order of indices:

    c_{ij,0} == 0;                                                            (6.2-2)
    c_{ij,k}(*+) = c_{ij,k-1}(*+) + a_{i,k,0}( ) * b_{k,j,0}( )   for k = 1, ..., M;
    c_ij == c_{ij,M}   for i = 1, ..., N and j = 1, ..., Q.

The index spaces of the a's and b's are artificially expanded (by adding a zero to each index) to match the dimension of the index space of the c's. As indicated in Eq. 6.2-2, all the indices are now members of Z^3 (refer to Chapter 3). For M = 3, c_11 is calculated as follows:

    c_{110} == 0;                                                             (6.2-3)
    c_{111}(*+) = c_{110}(*+) + a_{110}( ) * b_{110}( );
    c_{112}(*+) = c_{111}(*+) + a_{120}( ) * b_{210}( );
    c_{113}(*+) = c_{112}(*+) + a_{130}( ) * b_{310}( );
    c_11 == c_{113}.

Each simple computation of Eq. 6.2-3 is computed in a cell with multiplication and addition operations. For matrix multiplication, there exists only one type of simple computation. Hence, the systolic solutions are homogeneous arrays. The procedures were applied to the matrix multiplication algorithm for N = 3, M = 3, and Q = 3. Table 6.1 shows the valid global projection vectors generated from 3 x 3 square matrix multiplication.

Table 6.1. The valid global projection vectors for matrix multiplication.

         Vx             Vy             Vz             Vw
  1.  (0, 0, 1)^t    (0, 0, 1)^t    (0, 1, 0)^t    (1, 0, 0)^t
  2.
      (0, 1, 0)^t    (0, 1, 0)^t    (0, 0, 0)^t    (0, 1, 0)^t
  3.  (1, 0, 0)^t    (1, 0, 0)^t    (1, 0, 0)^t    (0, 0, 0)^t
  4.  (0, 1, 1)^t    (0, 1, 1)^t    (0, 1, 0)^t    (1, 1, 0)^t
  5.  (1, 1, 0)^t    (1, 1, 0)^t    (1, 0, 0)^t    (0, 1, 0)^t
  6.  (1, 0, 1)^t    (1, 0, 1)^t    (1, 1, 0)^t    (1, 0, 0)^t
  7.  (0, 1, -1)^t   (0, 1, -1)^t   (0, 1, 0)^t    (1, -1, 0)^t
  8.  (1, -1, 0)^t   (1, -1, 0)^t   (1, 0, 0)^t    (0, 1, 0)^t
  9.  (1, 0, -1)^t   (1, 0, -1)^t   (1, -1, 0)^t   (1, 0, 0)^t
 10.  (1, 1, 1)^t    (1, 1, 1)^t    (1, 1, 0)^t    (1, 1, 0)^t
 11.  (1, 1, -1)^t   (1, 1, -1)^t   (1, -1, 0)^t   (1, -1, 0)^t
 12.  (1, -1, 1)^t   (1, -1, 1)^t   (1, 1, 0)^t    (1, -1, 0)^t
 13.  (1, -1, -1)^t  (1, -1, -1)^t  (1, -1, 0)^t   (1, 1, 0)^t

Each valid global projection vector, Vx, corresponds to a systolic solution. Each Vx is systematically translated into a geometric transformation matrix H. Procedure C, for m = 3, is applied to each projection vector, and the transformation matrices are listed in Eq. 6.2-4 (rows separated by semicolons):

    H = [ 0  0  1;  0  1  0; -1  0  0 ]   for Vx = (0, 0, 1)^t;              (6.2-4)
    H = [ 0  1  0; -1  0  0;  0  0  1 ]   for Vx = (0, 1, 0)^t;
    H = [ 1  0  0;  0  1  0;  0  0  1 ]   for Vx = (1, 0, 0)^t;
    H = [ 0  1  1; -1  0  0;  0 -1  1 ]   for Vx = (0, 1, 1)^t;
    H = [ 1  1  0; -1  1  0;  0  0  1 ]   for Vx = (1, 1, 0)^t;
    H = [ 1  0  1;  0  1  0; -1  0  1 ]   for Vx = (1, 0, 1)^t;
    H = [ 0  1 -1; -1  0  0;  0  1  1 ]   for Vx = (0, 1, -1)^t;
    H = [ 1 -1  0;  1  1  0;  0  0  1 ]   for Vx = (1, -1, 0)^t;
    H = [ 1  0 -1;  0  1  0;  1  0  1 ]   for Vx = (1, 0, -1)^t;
    H = [ 2  2  1; -1  1  0; -1 -1  2 ]   for Vx = (1, 1, 1)^t;
    H = [ 2  2 -1; -1  1  0;  1  1  2 ]   for Vx = (1, 1, -1)^t;
    H = [ 2 -2  1;  1  1  0; -1  1  2 ]   for Vx = (1, -1, 1)^t;
    H = [ 2 -2 -1;  1  1  0;  1 -1  2 ]   for Vx = (1, -1, -1)^t.

The cell count ratio for each Vx is calculated for 3 x 3 square matrix multiplication and the results are shown in Table 6.2.

Table 6.2. CCR for each Vx of matrix multiplication.

         Vx             CCR
  1.  (0, 0, 1)^t     1.5
  2.  (0, 1, 0)^t     1.5
  3.  (1, 0, 0)^t     1.5
  4.  (0, 1, 1)^t     2.25
  5.  (1, 1, 0)^t     2.25
  6.  (1, 0, 1)^t     2.25
  7.  (0, 1, -1)^t    2.25
  8.  (1, -1, 0)^t    2.25
  9.  (1, 0, -1)^t    2.25
 10.  (1, 1, 1)^t     3.375
 11.  (1, 1, -1)^t    3.375
 12.  (1, -1, 1)^t    3.375
 13.  (1, -1, -1)^t   3.375

The total number of simple computations generated for 3 x 3 square matrix multiplication is 27. CCRs are always greater than one. The closer the CCR is to one, the fewer cells are needed. This implies that those projection vectors which correspond to CCR = 1.5 require the fewest number of cells.
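The 27 simple computations of Eq. 6.2-2 and the cell counts behind Table 6.2 can be checked with a short Python sketch. Two index points fall onto the same cell exactly when their difference is parallel to Vx, so counting the distinct lines parallel to Vx through the index space gives the number of computing cells (the helper names and the line-invariant encoding are ours):

```python
def simple_computation_points(N, M, Q):
    # One index point per simple computation c_{ij,k} of Eq. 6.2-2.
    return [(i, j, k)
            for i in range(1, N + 1)
            for j in range(1, Q + 1)
            for k in range(1, M + 1)]

def matmul_by_simple_computations(A, B):
    # Evaluate C = AB via c_{ij,0} = 0; c_{ij,k} = c_{ij,k-1} + a_ik * b_kj.
    N, M, Q = len(A), len(B), len(B[0])
    C = [[0] * Q for _ in range(N)]
    for i, j, k in simple_computation_points(N, M, Q):
        C[i - 1][j - 1] += A[i - 1][k - 1] * B[k - 1][j - 1]
    return C

def cell_count(vx, N=3, M=3, Q=3):
    # Points p and q share a cell iff p - q is parallel to Vx; the tuple
    # (p[t]*vx[m] - p[m]*vx[t]) is constant exactly on such lines.
    m = next(t for t, c in enumerate(vx) if c != 0)
    cells = {tuple(p[t] * vx[m] - p[m] * vx[t] for t in range(3))
             for p in simple_computation_points(N, M, Q)}
    return len(cells)

assert len(simple_computation_points(3, 3, 3)) == 27
assert cell_count((0, 0, 1)) == 9       # Q*N cells, CCR = 1.5
assert cell_count((0, 1, 1)) == 15      # N(M+Q-1) cells, CCR = 2.25
```

For the diagonal vectors such as (1, 1, 1)^t this count gives only the cells that perform computations; the (N+M-1)(M+Q-1) layout of Table 6.4 additionally pads the array with delay-register cells for design regularity, as noted in the Structure Editing discussion of Section 6.1.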
The projection planes and the systolic solutions corresponding to the thirteen projection vectors are shown in Figures 6.2 through 6.27. The directions of the data communication lines, which are determined from the near cell information, are shown in Table 6.3.

Table 6.3. Data input directions for matrix multiplication systolic solutions.

                      Input Line Directions
         Vx            ia          ib          ic
  1.  (0, 0, 1)^t    (0, -1)     (-1, 0)     --
  2.  (0, 1, 0)^t    --          (1, 0)      (0, -1)
  3.  (1, 0, 0)^t    (-1, 0)     --          (0, -1)
  4.  (0, 1, 1)^t    (0, 1)      (1, 0)      (0, -1)
  5.  (1, 1, 0)^t    (0, 1)      (0, -1)     (-1, 0)
  6.  (1, 0, 1)^t    (-1, 0)     (0, 1)      (0, -1)
  7.  (0, 1, -1)^t   (0, 1)      (1, 0)      (0, -1)
  8.  (1, -1, 0)^t   (-1, 0)     (1, 0)      (0, -1)
  9.  (1, 0, -1)^t   (-1, 0)     (1, 0)      (0, -1)
 10.  (1, 1, 1)^t    (-1, 1)     (1, 1)      (0, -2)
 11.  (1, 1, -1)^t   (-1, -1)    (1, -1)     (0, -2)
 12.  (1, -1, 1)^t   (-1, -1)    (1, -1)     (0, -2)
 13.  (1, -1, -1)^t  (-1, 1)     (1, 1)      (0, -2)

The actual number of cells, data flow timing, and data movement information are shown in Table 6.4. In this table, data movement indicates not only the stationary and non-stationary data, but also the relative directions. In Table 6.4, the word "stay" implies stationary data. For example, in systolic solution 4, a's move south (S), b's move west (W), and c's move north (N). These directions are relative directions. In solution 4, the data directions indicate that a's and c's move in opposite directions in the array. Some systolic solutions, such as 6 and 9, are similar in the required number of cells, DFT, and data movement. The difference between these solutions is in the order of the resultant's calculation; that is, the elements of matrix C appear as output in a different order in each of these systolic arrays.

Table 6.4. Comparison of Number of Cells, DFT, and Data Movement for matrix multiplication systolic solutions.
      Vx             Number of Cells    DFT    Data Movement
                                               a's    b's    c's
  1   (0, 0, 1)^t    Q*N                1      N      E      stay
  2   (0, 1, 0)^t    M*N                1      stay   W      N
  3   (1, 0, 0)^t    M*Q                1      E      stay   N
  4   (0, 1, 1)^t    N(M+Q-1)           2      S      W      N
  5   (1, 1, 0)^t    M(N+Q-1)           2      S      N      E
  6   (1, 0, 1)^t    Q(N+M-1)           2      E      S      N
  7   (0, 1, -1)^t   N(M+Q-1)           2      S      W      N
  8   (1, -1, 0)^t   M(N+Q-1)           2      E      W      N
  9   (1, 0, -1)^t   Q(N+M-1)           2      E      S      N
  10  (1, 1, 1)^t    (N+M-1)(M+Q-1)     3      SE     SW     N
  11  (1, 1, -1)^t   (N+M-1)(M+Q-1)     1      NE     NW     N
  12  (1, -1, 1)^t   (N+M-1)(M+Q-1)     1      NE     NW     N
  13  (1, -1, -1)^t  (N+M-1)(M+Q-1)     3      SE     SW     N

The best systolic solution for matrix multiplication depends on the specific application. In order to compare these systolic solutions under similar conditions, 5 x 5 square matrix multiplication was simulated using each of the solutions. Table 6.5 shows the results of the simulations. In Table 6.5, dA is the area of a macro cell, dT is the duration of the clock cycle, and tau is the loading/unloading time.

Table 6.5. Simulation results for square (5 x 5) matrix multiplication.

      Vx             Total Area    Total Time
  1   (0, 0, 1)^t    25 dA         14 dT + tau
  2   (0, 1, 0)^t    25 dA         14 dT + tau
  3   (1, 0, 0)^t    25 dA         14 dT + tau
  4   (0, 1, 1)^t    45 dA         22 dT
  5   (1, 1, 0)^t    45 dA         18 dT
  6   (1, 0, 1)^t    45 dA         22 dT
  7   (0, 1, -1)^t   45 dA         22 dT
  8   (1, -1, 0)^t   45 dA         18 dT
  9   (1, 0, -1)^t   45 dA         22 dT
  10  (1, 1, 1)^t    81 dA         22 dT
  11  (1, 1, -1)^t   81 dA         14 dT
  12  (1, -1, 1)^t   81 dA         14 dT
  13  (1, -1, -1)^t  81 dA         22 dT

The results indicate that for minimum chip area the best solutions are 1-3. For minimum execution time, the best solutions are 11 and 12. Solutions 11 and 12 are also the best solutions in terms of throughput (DFT indicates the throughput of each pipeline in the array). Optimal solutions based on AT or AT^2 optimization can also be selected. The total required execution time for solutions 1-3 depends on the implementation of the loading/unloading process. The AT and AT^2 optimal solutions among solutions 4-13 are 5 and 8.

A solution based on total area, total time, or throughput is not necessarily the optimal solution for a specific application.
For example, consider solution 4, where a's and c's move in opposite directions (Figure 6.9). In this case, a11, a12, a13, etc. and c11, c12, c13, etc. enter and exit the same pipeline in the array. The input and output patterns in the other pipelines are also the same. b's move perpendicular to the directions of a's and c's. Solution 4 can therefore be selected as the best solution for calculating powers of a matrix. For example, to compute A^4, where A is a square matrix, the following steps using systolic solution 4 are taken:

    A * A = A^2;   A^2 * A = A^3;   A^3 * A = A^4.                            (6.2-5)

A multiplexer at each output boundary cell is required. To compute A^2, the multiplexers select the array inputs from the entries of A. To compute A^3 and A^4, the entries of A^2 and A^3 are provided directly from the array output cells, respectively. No additional delays or buffers are needed. A reconfigurable systolic array that supports all these cell interconnections will provide a rich class of systolic arrays for matrix multiplication.

Systolic solution 10 corresponds to the hexagonal systolic array given by Kung [40] (the array shown in [40] is for banded matrix multiplication). Solution 5 corresponds to the rectangular solution given by Hwang [6].

A systolic array for any size matrix multiplication described by Eq. 6.2-2 can be designed using the information obtained from 3 x 3 square matrix multiplication. The following steps are used to design a larger systolic array for matrix multiplication:

1. Select the values of N, M, and Q in Eq. 6.2-2 to define the size of the matrices.

2. Collect the indices of the output operands (i.e., (i, j, k) in Eq. 6.2-2), which indicate the cell locations in the geometric representation, G.

3. Select the desired transformation matrix, H, listed in Eq. 6.2-4. Associated with each transformation matrix, H, there exists a projection vector, Vx. Associated with each Vx, there exist the CCR, DFT, data movement, and total number of cells as indicated in Table 6.2 and Table 6.4.

4.
Transform the original geometric representation (G) into a new geometric representation (G') by multiplying H by the coordinates of each cell location in G.

5. Project G' onto the predefined plane (e.g., the j'-k' plane). Each point on the projection plane now indicates an actual cell location.

6. Associated with each H there exists a macro cell. Repeat the macro cell at each cell location on the projection plane to construct the systolic array.

The DFT, data movement, and input/output data patterns of the larger array are the same as those of the smaller array.

Figure 6.2. A projection plane for matrix multiplication; Vx = (0, 0, 1)^t.

Figure 6.3. A systolic solution for matrix multiplication; Vx = (0, 0, 1)^t.
Figure 6.4. A projection plane for matrix multiplication; Vx = (0, 1, 0)^t.

Figure 6.5. A systolic solution for matrix multiplication; Vx = (0, 1, 0)^t.

Figure 6.6. A projection plane for matrix multiplication; Vx = (1, 0, 0)^t.

Figure 6.7. A systolic solution for matrix multiplication; Vx = (1, 0, 0)^t.

Figure 6.8. A projection plane for matrix multiplication; Vx = (0, 1, 1)^t.
Figure 6.9. A systolic solution for matrix multiplication; Vx = (0, 1, 1)^t.

Figure 6.10. A projection plane for matrix multiplication; Vx = (1, 1, 0)^t.

Figure 6.11. A systolic solution for matrix multiplication; Vx = (1, 1, 0)^t.

Figure 6.12. A projection plane for matrix multiplication; Vx = (1, 0, 1)^t.

Figure 6.13. A systolic solution for matrix multiplication; Vx = (1, 0, 1)^t.

Figure 6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)^t.

Figure 6.15. A systolic solution for matrix multiplication; Vx = (0, 1, -1)^t.

Figure 6.16. A projection plane for matrix multiplication; Vx = (1, -1, 0)^t.

Figure 6.17. A systolic solution for matrix multiplication; Vx = (1, -1, 0)^t.
03 022 0221.022 0121,01022 0312, a .031 0212. 02 .031 0112. 01, 031 4 .,. ® ('3 0331 0231 c131 0322 0222 0122 0313 021301113 0320, 031, 012 0220. 021. 012 0120, 011, 0122 0311.032. 021 0211, 022. 021 0111, 012.17 3.1.. ® 0321 0221 c121 0312 0212 0112 0310. 031. 011 0210. 021. 011 0110. 011. 011 2 ... ® ® ® 0311 0211 0111 L L l \ r I r 7 -3 -2 -1 Figure 6.14. A projection plane for matrix multiplication; Vx = (0, 1, -1)‘. lNqu-UQV ifl N 8°89 L. O L§o&o&ol 161 O a a a u. ‘ONO‘H 012 011 O 013 I 00-04-000 t-3 4 5 0 7 0 023 0 032 013 0 022 0 031 0 012 O 021 00011 Figure 6.15. A systolic solution for matrix multiplication; Vx = (0, l, -1)‘. 162 0332 0330. 031. 013 0331 022 g 023 0331. 032. 023 0230. 021. 013 0321 0231 83:22 ”3’ 033 d.” 022 0321. 0322 0232 0320. 031. 012 .0”. mmvmmo 0131. 012. 023 0231 0311 m 0131 033. $23. 0223 0133 0311. 032. 021 0310131. 011 8020. 021. 812 0130. 011. 013 011 012 $5322} 6) .5 o o 0111 8:133: 3032 0132. .9“ 2. 22. 0211 0121 0120. 011 0112. 013. 031 E“) 0113 0111. 012. 021 © 0112 0110. 011. 011 i Figure 6.16. A projection plane for matrix multiplication; Vx = (l, -1, 0)‘. ”00.00% A 163 0 0 013 012 0 061 032 033 0 c - c + 0 a 0 021 0 033 0 032 0 031 0 0 ' 0 0 109 8 7t-6 00011 O i Q o 012 0 e21 l , . ... 013 0 022 0 031 1 O l 0 00230032 011 0 0 021 0 012 0 031 0 022 0 Rotated by 90 Degreee 013 O 032 033 0 Figure 6.17. A systolic solution for matrix multiplication; Vx = (l, -l, O)‘. 164 ”r 0312, 033. 031 0322. 033. 032 0332. 033, 033 6 r ('9 0313 c323 0333 0311, 832; 821 0321. 032. 822 0331, 83 023 0212. 023, 031 0222. 023, 032 0232, 02 . 033 5 ..1. ® 0312 c322 0332 c213 0223 0233 0310, 031, 011 0320, 031. 012 0330. 031. 013 0211, 022. 021 0221. 022. 022 0231, 022. 023 0112. 013, 031 0122. 013. 032 0132. 013, 033 4. 1* ® ® 0311 c321 0331 0212 0222 0232 0113 0123 0133 0210, 021. 011 0220. 021. 012 0230. 021, 013 0111, 012. 021 0121, 012, 022 0131, 012. 023 o 0211 0221 0231 0112 0122 0132 0110. 011, 011 0120, 011, 012 0130. 011. 
013 o o 0111 0121 0131 L I 4 A I I T I 1 2 3 Figure 6.18. A projection plane for matrix multiplication; Vx = (1, 0, —l)‘. 0-0+000 1.5 b 7 0 0 10 0 11 12 00 033 0 0 023 0 032 0 013 0 022 0 031 012 0 021 0 01100 165 031 033 0 032 032 0 021 031 0 0 033 023 0 0 022 022 0 0 011 021 0 0 023 013 0 0 012 012 0 0 0 011 0 0 013 0 n ...NGQCION Figure 6.19. A systolic solution for matrix multiplication; Vx = (1, 0, -1)‘. 166 k. A 0112. 013. 031 ‘ r ED 0212. 823. 831 “‘3 8122, 813. 032 o o 0222. 023. 032 0312, 033. 031 0213 0111. 012. 021 0123 0132. 013. 033 o o 0 an 033. 832 8232. 023. 033 “‘3 8211. 022. 021 $5" 0121. 012. 022 “33 ® a: a a C2 . 0 0 e311. 832. 821 312‘; 0110. 011. 811 0122 0131. 812. 823 °" (‘9 ‘ ('5 C9 0321. 032. 022 0231. 022. 023 ‘3‘: 0210, 021. 811 g 0120. 011. 812 “‘32 “I .11.. ® 0‘“ ® 0331. 032. 023 8310. 031. 811 M0211 0220. 021. 812 :3? 0130. 011. 013 4+ (‘9 ® (”9 ‘3" 0320. 031. 012 323} case. 021. 813 “3‘ o o “32‘ 0330. 031. 013 a“ -4 .1). ® 0331 L l l 1L 1 X F F f r r I -2 -1 O 1 2 Figure 6.20. A projection plane for matrix multiplication; Vx = (l, 1, 1)‘. 167 9. i 0 7 021 0 012 0 031 0 0 0 013 9 0 022 0 10 032 0 023 11 0 12 033 0130000 0000031 0230001200 0002100032 0330002200011 0110002200033 03200021 001200023 ”’1 00013 Figure 6.21. A systolic solution for matrix multiplication; Vx = (l, 1, 1)‘. 168 W I 12.1 ® W “33 W 1151.. 61-) ® 8331-1132-82- 8312—1133—1131 “‘3 .033 8132-1113-1130 o e o 6321—1132-1122 «231-822-823 8313 .332 a m 8133 .011- 6 “-013 ® 8311-1132-1121 8112-1113-1131 0131-01“” 8 «013 ® 8123 «320-831-812 «230-821-813 O ‘3" «211-402-821 3 0121-01sz “a 7d- 6 “‘3 ® . mar-812 8321 «2.31 arm-811 8212 8111-1112-4121 8122 01H1-013 o - o o ‘9" arm-811 331‘ 0120-011-012 ‘1" 3-1- 6 ‘2" 0110-011-011 “‘31 ‘db 6 0111 r J 1 1L L _\ f I IV I 1 r 7 -2 -1 o 1 2 Figure 6.22. A projection plane for matrix multiplication; Vx = (l, l, —1)‘. 170 yr 8132. C11; 033 e... 4122. 013 032 “’3 W 7.... ,2, cm 023. 032 ® «12.113831 ° 0131. 812.823 “”3 menu: 0 .. 
1 S2 «212023.831 not... 69 0121. 012. 022 g «231. m. 82.3 ‘3” w 69 3,108,, .4, e 0111. 012. 821 8.1.3 «130631. 813 g 8331. .32. 023 4.1. 6 «011. «22. 821 «021. 832. 822 “"3 8120 011. 812 3%: cm 021. 813 “m 3.... ‘ 0131 ® 0110. 811. 811 «121 8220. 021. 812 m, can. 831. 813 2.1. 6 6 °‘" 0210-1121-1011 3} 8.120. 0'51. 012 ‘3" 1..- 6’) ® ‘3" 8310. 031. 811 “3‘ odi- 8311 ‘1 L L l L x f I I I— I 1 I 2 3 4 3 0 Figure 6.24. A projection plane for matrix multiplication; Vx = (1, -l, 1)‘. 171 t-B 011 021 031 032 033' 7 012 022 023 8 013 \ o '\ o o 033 '\ 032 031 823 822 021 813 012 Figure 6.25. A systolic solution for matrix multiplication; Vx = (l, -1, l)‘. 172 1: T 8312. on 031 e. ® 0212. «23 831 ‘3“ 8322. 833. 832 7 -. 0213 ‘3‘" '3’: ”2‘ Q 8112. 013. 831 0222. 023. 832 0332. 833. 833 0.3,. 8011 022. 821 6 8321 on. 822 6 «113 1 8312 1 8333 an. «131 832 8223 «232. as. 033 “-0 ® 8310. 031. 811 ® 8212 «021. 022. 822 .322 .111. .12. 821 .123 0132.6 833 82.33 8331.6 823 4 .1.- ea10. 021. 011 0320. 031. 812 “"2 0121. 012. 822 :2; e231. «22. 823 “a 3.... 8133 S2 0220. «21. 812 Q 0110. 011. 011 0122 «131.615. 823 0232 03306031. 813 2.... “"1 8120. 011. 812 5‘}; an. «21. 813 “’1 ‘dj- 6 “‘2' 0130. 011. 813 “3' e. j. «131 .L i i J r 3 I 2 3 0 Figure 6.26. A projection plane for matrix multiplication; Vx = (1, -1, —l)‘. 173 ”'9 021 ‘3‘ 032 g 011 8 022 0 033 10 012 8 033 11 12 013 0330000 0000031 0230003200 0002100032 0130002200031 0110002200033 812008010 081200823 «1100?? Venue e-c+e-0 0 0 Figure 6.27. A systolic solution for matrix multiplication; Vx = (l, -—l, —l)‘. 
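The projection planes and solutions above all come from projecting the three-dimensional index points of the matrix-multiplication simple computations along a projection vector Vx. As a minimal sketch of that grouping step (the function name and the invariant used as a grouping key are mine, not the dissertation's notation), two index points are computed in the same cell exactly when they differ by an integer multiple of Vx:

```python
from itertools import product

def project_cells(points, v):
    """Group index points that differ by an integer multiple of the
    projection vector v; each group becomes one cell of the array."""
    m = next(i for i, c in enumerate(v) if c != 0)   # a pivot axis of v
    cells = {}
    for p in points:
        # p[i]*v[m] - p[m]*v[i] is unchanged when p is replaced by p + t*v,
        # so these values identify the line through p with direction v
        key = tuple(p[i] * v[m] - p[m] * v[i] for i in range(len(v)) if i != m)
        cells.setdefault(key, []).append(p)
    return cells

# index points (i, j, k) of the 3 x 3 matrix-multiplication simple computations
pts = list(product(range(1, 4), repeat=3))
cells = project_cells(pts, (0, 1, -1))   # Vx = (0, 1, -1)^t
```

For Vx = (0, 1, -1)^t this maps the 27 simple-computation points onto 15 cell locations, the kind of many-to-one mapping the projection planes above depict.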
6.2.2 LU DECOMPOSITION

An N x N LU decomposition algorithm, A = LU, where A = (a_ij) is an N x N nonsingular matrix, L = (l_ij) is an N x N lower triangular matrix with a unit diagonal, and U = (u_ij) is an N x N upper triangular matrix, can be described as follows:

l_ij = (a_ij - Σ_{k=1}^{j-1} l_ik * u_kj) / u_jj   for i = 2, ..., N and j = 1, ..., i-1;   6.2-6

u_ij = a_ij - Σ_{k=1}^{i-1} l_ik * u_kj   for i = 1, ..., N and j = i, ..., N.

[Figure 6.28. The projection planes of LU decomposition.]
[Figure 6.29. The systolic solution for LU decomposition.]

6.2.3 SOLUTION OF LOWER TRIANGULAR SYSTEM OF EQUATIONS

Let L be an N x N lower triangular matrix with a unit diagonal. Let x and b be two N x 1 vectors where the unknown vector, x, is the solution of the linear system Lx = b. The elements of vector x are calculated as follows:

x_i = b_i - Σ_{j=1}^{i-1} l_ij * x_j   for i = 1, 2, ..., N.   6.2-12

[Figure 6.31. The systolic solution of Lx = b.]

6.2.4 FINITE IMPULSE RESPONSE (FIR) FILTER

An FIR filter with K weights can be described by the following expression:

y_i = Σ_{k=1}^{K} w_k * x_{i+k-1}   6.2-18

for i = 1, 2, ... . The computational window index space (CWIS) for this problem is K, which is the size of the window (refer to Chapter 3).
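The recurrences above can be checked with a direct sequential evaluation. The sketch below (plain Python; the function name is mine, and the systolic solutions pipeline this same dependence structure across cells) performs the LU factorization of Eq. 6.2-6 with a unit-diagonal L:

```python
def lu_decompose(A):
    """Doolittle LU factorization with unit-diagonal L, following Eq. 6.2-6:
    l_ij = (a_ij - sum_{k<j} l_ik * u_kj) / u_jj, together with the
    companion row recurrence for U."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                 # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for r in range(i + 1, n):             # column i of L (Eq. 6.2-6)
            L[r][i] = (A[r][i] - sum(L[r][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

L, U = lu_decompose([[2.0, 1.0], [4.0, 5.0]])
```

Multiplying the two factors back together reproduces A, which is the correctness condition the systolic schedule must also preserve.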
This means that in order to design systolic solutions for the FIR filter, the following problem description is used:

y_i = Σ_{k=1}^{K} w_k * x_{i+k-1}   6.2-19

for i = 1, 2, ..., K. Eq. 6.2-19 is decomposed into its simple computations as follows:

y_{i,0} ≡ 0;   6.2-20
y_{i,k} (*+) = y_{i,k-1} (*+) + w_{k,0} ( ) * x_{i+k-1,0} ( );
y_i ≡ y_{i,K}.

The index spaces of the w's and x's are artificially expanded (by adding a 0 to each index) to match the dimension of the index space of the y's. Each index in Eq. 6.2-20 is now a member of Z^2. For K = 3, the equivalent simple computations are as follows:

y10 ≡ 0;   6.2-21
y11 (*+) = y10 (*+) + w10 ( ) * x10 ( );
y12 (*+) = y11 (*+) + w20 ( ) * x20 ( );
y13 (*+) = y12 (*+) + w30 ( ) * x30 ( );
y1 ≡ y13;
y20 ≡ 0;
y21 (*+) = y20 (*+) + w10 ( ) * x20 ( );
y22 (*+) = y21 (*+) + w20 ( ) * x30 ( );
y23 (*+) = y22 (*+) + w30 ( ) * x40 ( );
y2 ≡ y23;
y30 ≡ 0;
y31 (*+) = y30 (*+) + w10 ( ) * x30 ( );
y32 (*+) = y31 (*+) + w20 ( ) * x40 ( );
y33 (*+) = y32 (*+) + w30 ( ) * x50 ( ).

The simple computations of Eq. 6.2-20 are all of the same type and form only one set of simple computations. Each simple computation is computed in a cell performing multiplication and addition operations. The systolic design procedures were applied to the simple computations of Eq. 6.2-21 and the valid global projection vectors are shown in Table 6.8.

Table 6.8. Global valid projection vectors for the FIR problem.
     Vx        Vy        Vz        Vw
1.  (0, 1)^t  (0, 1)^t  (1, 0)^t  (1, 0)^t
2.  (1, 0)^t  (1, 0)^t  (0, 0)^t  (1, 0)^t

The corresponding geometric transformation matrices are:

[ 0  1 ]
[-1  0 ]   for Vx = (0, 1)^t;   6.2-22

[ 1  0 ]
[ 0  1 ]   for Vx = (1, 0)^t.

The projection planes and the systolic solutions are shown in Figures 6.32 - 6.35. The directions of the data communication lines are shown in Table 6.9.

[Table 6.9. Data input directions for the FIR systolic solutions.]
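Evaluated sequentially, the simple computations of Eq. 6.2-20 reduce to the following sketch (0-based indices and the function name are mine); a systolic solution distributes the same accumulation steps across cells and clock cycles:

```python
def fir_simple_computations(w, x, n_out):
    """Evaluate Eq. 6.2-20 directly: y_{i,0} = 0;
    y_{i,k} = y_{i,k-1} + w_k * x_{i+k-1}; y_i = y_{i,K}
    (indices shifted to 0-based here)."""
    K = len(w)
    ys = []
    for i in range(n_out):
        acc = 0                      # y_{i,0}
        for k in range(K):
            acc += w[k] * x[i + k]   # one simple computation per step
        ys.append(acc)
    return ys

w = [1, 2, 3]
x = [1, 1, 2, 3, 5]
ys = fir_simple_computations(w, x, 3)
```

Each pass of the inner loop corresponds to one line of the K = 3 expansion in Eq. 6.2-21.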
Although DFI = 1 for Vx = (0, 1)^t, due to the delay registers one output is produced every other clock cycle even though the actual data is supplied at every cycle. In Figure 6.33, the elements of the window are supplied repeatedly. To reduce the I/O bandwidth, the elements of the window can be preloaded. This result is shown in Figure 6.36; however, a long communication line is needed in this case. These solutions are also given by Kung [5]. As before, the geometric transformation matrices can be used to design larger arrays for larger window sizes. However, in this case, the design is a linear array and larger designs can be produced simply by adding additional cells.

[Figure 6.32. The projection line of the FIR filter for Vx = (0, 1)^t.]
[Figure 6.33. The systolic solution of the FIR filter for Vx = (0, 1)^t.]
[Figure 6.34. The projection line of the FIR filter for Vx = (1, 0)^t.]
[Figure 6.35. The systolic solution of the FIR filter for Vx = (1, 0)^t.]
[Figure 6.36. An equivalent systolic solution of Figure 6.33.]

6.2.5 2D CONVOLUTION

A K x L 2D convolution can be described by the following expression:

y_ij = Σ_{k=1}^{K} Σ_{l=1}^{L} w_kl * x_{i+k-1, j+l-1}   6.2-23

for i = 1, 2, ... and j = 1, 2, ... . The computational window index space (CWIS) for 2D convolution is equal to the window size.
The systolic design procedures are applied to the problem described by

y_ij = Σ_{k=1}^{K} Σ_{l=1}^{L} w_kl * x_{i+k-1, j+l-1}   6.2-24

for i = 1, 2, ..., K and j = 1, 2, ..., L, which has bounded index spaces. Eq. 6.2-24 is decomposed into equivalent simple computations as follows:

y_{ij,0} ≡ 0;   6.2-25
y_{ij,t} (*+) = y_{ij,t-1} (*+) + w_{kl,0} ( ) * x_{i+k-1,j+l-1,0} ( );
y_ij ≡ y_{ij,K*L};

for k = 1, 2, ..., K, l = 1, 2, ..., L, t = 1, 2, ..., K*L, i = 1, 2, ..., K, and j = 1, 2, ..., L. Again the index spaces of the w's and x's are artificially extended (by adding a 0 to each index) to match the dimension of the index space of the y's. Each index in Eq. 6.2-25 is now a member of Z^3. For K = 2 and L = 2, which defines a 2 x 2 window, the simple computations are as follows:

y110 ≡ 0;   6.2-26
y111 (*+) = y110 (*+) + w110 ( ) * x110 ( );
y112 (*+) = y111 (*+) + w120 ( ) * x120 ( );
y113 (*+) = y112 (*+) + w210 ( ) * x210 ( );
y114 (*+) = y113 (*+) + w220 ( ) * x220 ( );
y11 ≡ y114;
y120 ≡ 0;
y121 (*+) = y120 (*+) + w110 ( ) * x120 ( );
y122 (*+) = y121 (*+) + w120 ( ) * x130 ( );
y123 (*+) = y122 (*+) + w210 ( ) * x220 ( );
y124 (*+) = y123 (*+) + w220 ( ) * x230 ( );
y12 ≡ y124;
y210 ≡ 0;
y211 (*+) = y210 (*+) + w110 ( ) * x210 ( );
y212 (*+) = y211 (*+) + w120 ( ) * x220 ( );
y213 (*+) = y212 (*+) + w210 ( ) * x310 ( );
y214 (*+) = y213 (*+) + w220 ( ) * x320 ( );
y21 ≡ y214;
y220 ≡ 0;
y221 (*+) = y220 (*+) + w110 ( ) * x220 ( );
y222 (*+) = y221 (*+) + w120 ( ) * x230 ( );
y223 (*+) = y222 (*+) + w210 ( ) * x320 ( );
y224 (*+) = y223 (*+) + w220 ( ) * x330 ( );
y22 ≡ y224.

The systolic design procedures were applied to the simple computations of Eq. 6.2-26 and the valid global projection vectors are shown in Table 6.10. As shown in Table 6.10, the projection vector of the w's is Vz = (0, 0, 0)^t, indicating stationary operands. This implies that the elements of the window are preloaded into the internal cell registers prior to the start of the operations.
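A direct sequential evaluation of Eq. 6.2-25 looks as follows (0-based indices and names are mine); each addition in the inner loops corresponds to one simple computation of Eq. 6.2-26 when K = L = 2:

```python
def conv2d(w, x, out_rows, out_cols):
    """Evaluate Eq. 6.2-25 directly: y_{ij,0} = 0, then one product
    w_kl * x_{i+k-1, j+l-1} is accumulated per step t (0-based here)."""
    K, L = len(w), len(w[0])
    y = [[0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            acc = 0                               # y_{ij,0}
            for k in range(K):
                for l in range(L):
                    acc += w[k][l] * x[i + k][j + l]
            y[i][j] = acc                          # y_ij = y_{ij,K*L}
    return y

w = [[1, 0],
     [0, 1]]
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
y = conv2d(w, x, 2, 2)   # 2 x 2 window over a 3 x 3 image
```

The stationary w operands of Table 6.10 correspond to w[k][l] never moving in this loop nest; only x and the partial sums of y flow.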
Table 6.10. Valid global projection vectors for 2D convolution.
     Vx            Vy            Vz            Vw
1.  (0, 1, 0)^t   (0, 1, 0)^t   (0, 0, 0)^t   (0, 1, 0)^t
2.  (1, 0, 0)^t   (1, 0, 0)^t   (0, 0, 0)^t   (1, 0, 0)^t
3.  (1, 1, 0)^t   (1, 1, 0)^t   (0, 0, 0)^t   (1, 1, 0)^t
4.  (1, -1, 0)^t  (1, -1, 0)^t  (0, 0, 0)^t   (1, -1, 0)^t

The corresponding geometric transformation matrices are:

[ 0  1  0 ]
[-1  0  0 ]   for Vx = (0, 1, 0)^t;   6.2-27
[ 0  0  1 ]

[ 1  0  0 ]
[ 0  1  0 ]   for Vx = (1, 0, 0)^t;
[ 0  1  1 ]

[ 1  0  0 ]
[-1  1  0 ]   for Vx = (1, 1, 0)^t;
[ 0  0  1 ]

[ 1  0  0 ]
[ 1  1  0 ]   for Vx = (1, -1, 0)^t.
[ 0  0  1 ]

For Vx = (0, 1, 0)^t, the projection plane and its corresponding systolic solution are shown in Figures 6.37 and 6.38. As shown in Figure 6.38, the solution contains one non-uniform cell interconnection. Although the simple computations are all of the same type and correspond to a homogeneous systolic solution, the solution of Figure 6.38 should be designed as the combination of two sub-systolic arrays. The left four cells and the right four cells form the two sub-systolic arrays. This systolic array requires a high I/O bandwidth. Another systolic solution, derived from the solution of Figure 6.38, is shown in Figure 6.39. This systolic array is constructed by combining the two cell rows of Figure 6.38 into one cell row. Additionally, the systolic solution of Figure 6.39 is operated in a non-standard manner as follows: 1. xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 4 consecutive cycles before switching to the other stream and continues alternating in this manner. The other valid global projection vectors, similar to Vx = (0, 1, 0)^t, result in systolic solutions that should be decomposed into sub-systolic arrays. By overlapping the cells in each of these solutions, the systolic solution shown in Figure 6.39 is also constructed. Figure 6.40 illustrates the systolic solution with nine weights (3 x 3 window). This array is operated as follows: 1.
xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 6 consecutive cycles before switching to the other stream and continues alternating in this manner.

[Figure 6.37. The projection plane of 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t.]

In Eq. 6.2-25, the products are computed in the order of left to right and top to bottom as the window moves from top to bottom and left to right over the image array. For example, w11 * x11 is computed before w12 * x12, and the computation of y12 cannot be started until both products w11 * x11 and w12 * x12 are known. However, the computations of the y's can be accelerated by computing the products in the order of right to left and bottom to top as the window moves from top to bottom and left to right over the image array. This means that w12 * x12 is computed before w11 * x11, and consequently the computation of y12 can start immediately after the computation of w12 * x12. This results in the following decomposition of Eq. 6.2-24:

y_{ij,0} ≡ 0;   6.2-28
y_{ij,t} (*+) = y_{ij,t-1} (*+) + w_{kl,0} ( ) * x_{i+k-1,j+l-1,0} ( );
y_ij ≡ y_{ij,K*L};

for k = K, K-1, ..., 1, l = L, L-1, ..., 1, t = 1, 2, ..., K*L, i = 1, 2, ..., K, and j = 1, 2, ..., L.
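The reordering behind Eq. 6.2-28 changes only the schedule of the partial sums, not their final values. A minimal check (0-based indices and names are mine):

```python
def conv2d_window(w, x, i, j, reverse=False):
    """One output y_ij, accumulated in the forward product order of
    Eq. 6.2-25 or, with reverse=True, in the reversed window order
    (k = K..1, l = L..1) of Eq. 6.2-28."""
    K, L = len(w), len(w[0])
    steps = [(k, l) for k in range(K) for l in range(L)]
    if reverse:
        steps.reverse()
    acc = 0                                  # y_{ij,0}
    for k, l in steps:                       # one product per step t
        acc += w[k][l] * x[i + k][j + l]
    return acc

w = [[1, 2], [3, 4]]
x = [[1, 0, 2], [0, 1, 0], [2, 0, 1]]
# both orders accumulate the same y_ij; only the schedule differs
same = conv2d_window(w, x, 0, 0) == conv2d_window(w, x, 0, 0, reverse=True)
```

What the reversed order buys is pipelining: the first product needed for y_{i,j+1} becomes available earlier, as described above.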
For a 2 x 2 window, the simple computations are as follows:

y110 ≡ 0;   6.2-29
y111 (*+) = y110 (*+) + w220 ( ) * x220 ( );
y112 (*+) = y111 (*+) + w210 ( ) * x210 ( );
y113 (*+) = y112 (*+) + w120 ( ) * x120 ( );
y114 (*+) = y113 (*+) + w110 ( ) * x110 ( );
y11 ≡ y114;
y120 ≡ 0;
y121 (*+) = y120 (*+) + w220 ( ) * x230 ( );
y122 (*+) = y121 (*+) + w210 ( ) * x220 ( );
y123 (*+) = y122 (*+) + w120 ( ) * x130 ( );
y124 (*+) = y123 (*+) + w110 ( ) * x120 ( );
y12 ≡ y124;
y210 ≡ 0;
y211 (*+) = y210 (*+) + w220 ( ) * x320 ( );
y212 (*+) = y211 (*+) + w210 ( ) * x310 ( );
y213 (*+) = y212 (*+) + w120 ( ) * x220 ( );
y214 (*+) = y213 (*+) + w110 ( ) * x210 ( );
y21 ≡ y214;
y220 ≡ 0;
y221 (*+) = y220 (*+) + w220 ( ) * x330 ( );
y222 (*+) = y221 (*+) + w210 ( ) * x320 ( );
y223 (*+) = y222 (*+) + w120 ( ) * x230 ( );
y224 (*+) = y223 (*+) + w110 ( ) * x220 ( );
y22 ≡ y224.

[Figure 6.38. The systolic solution for 2D convolution (Eq. 6.2-26) for Vx = (0, 1, 0)^t.]

The systolic design procedures were applied to the simple computations of Eq. 6.2-28 and the valid global projection vectors are the same as those shown in Table 6.10. For Vx = (0, 1, 0)^t, the projection plane and its corresponding systolic solution are shown in Figures 6.41 and 6.42. Again, the systolic solution is made of two sub-systolic arrays. The throughput has been improved compared to that of the systolic solution of Figure 6.38. However, additional delay registers are needed. This solution can be similarly improved by combining the two cell rows into one cell row. The result is shown in Figure 6.43. The operations of the systolic solution of Figure 6.43 are as follows: 1. xij's with odd i enter the array at the top x-data stream; 2.
xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 2 consecutive cycles before switching to the other stream and continues alternating in this manner. Similarly, the systolic solutions corresponding to the other global valid projection vectors are made of sub-systolic solutions. Again, overlapping of the cells in these solutions results in the systolic solution of Figure 6.43. Figure 6.44 illustrates the systolic array for 3 x 3 2D convolution. This array is operated as follows: 1. xij's with odd i enter the array at the top x-data stream; 2. xij's with even i enter the array at the bottom x-data stream; 3. each cell uses an x-data stream for 3 consecutive cycles before switching to the other stream and continues alternating in this manner.

[Figure 6.39. The systolic solution of 2D convolution (Eq. 6.2-26) constructed by overlapping the cells in the solution of Figure 6.38.]

A systolic solution similar to the one shown in Figure 6.44 is given by Kung et al. [41]. In their solution the products are computed in the order of bottom to top and right to left. There are no other significant differences.

[Figure 6.40. The systolic solution of Figure 6.39 with nine weights.]
[Figure 6.41. The projection plane of 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t.]
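The transformation matrices of Eq. 6.2-27 are integer and non-singular, so each maps the integer index points one-to-one onto new integer points before projection. A small check using the matrix given there for Vx = (0, 1, 0)^t (the helper name is mine):

```python
def transform(T, p):
    """Apply an integer index-transformation matrix T to an index point p."""
    return tuple(sum(T[r][c] * p[c] for c in range(len(p)))
                 for r in range(len(T)))

# the matrix of Eq. 6.2-27 for Vx = (0, 1, 0)^t
T = [[0, 1, 0],
     [-1, 0, 0],
     [0, 0, 1]]

# integer and non-singular: distinct index points keep distinct images
pts = [(i, j, k) for i in range(3) for j in range(3) for k in range(3)]
images = {transform(T, p) for p in pts}
```

Because the map is one-to-one, no two simple computations collide during the transformation itself; the many-to-one collapse happens only in the subsequent projection.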
[Figure 6.42. The systolic solution for 2D convolution (Eq. 6.2-28) for Vx = (0, 1, 0)^t.]
[Figure 6.43. The systolic solution of 2D convolution (Eq. 6.2-28) constructed by overlapping the cells in the solution of Figure 6.42.]
[Figure 6.44. The systolic solution of Figure 6.43 with nine weights.]

6.3 CHAPTER SUMMARY

This chapter presented the results of applying the systolic array design method to several known problems. The known systolic solutions were reproduced. Furthermore, additional systolic solutions were produced for matrix multiplication and 2D convolution. The method was also applied to the solution of a lower triangular linear system of equations, and no valid projection vectors were found. This means either that no systolic solution exists or that one cell per simple computation is needed. In this case, a systolic solution does exist where each point in the geometric representation indicates an actual cell location (no geometric transformation and projection are performed). The systolic solutions for LU decomposition and the FIR filter were also reported.

CHAPTER 7

DISCUSSION AND CONCLUSION

Systolic arrays are known for their regular structures and simple processing elements (cells). These arrays operate synchronously, performing dependent computations through the pipelines and independent computations in parallel. The use of extensive pipelining and multiprocessing makes them suitable for the implementation of many computationally intensive problems.
Although they have many advantages in their architectural properties, their design involves complex data scheduling; the right data has to reach the right cells at the right time in order to preserve the algorithm's correctness. This involves determination of the spatial and timing distributions of data. Several attempts have been made to systematize the design of systolic arrays either directly from an algorithm or from an intuitive non-systolic design. Presented here is an introduction of a new systematic systolic array design method aimed at solving some of the existing design problems. The new method belongs to that class of design methods which map an algorithm into an architecture. The design method, in general, follows three major steps: algorithm representation, algorithm model, and architecture specification.

The algorithm representation step involves the decomposition of the algorithm computations into a set of basic, or what has been called in this work a set of simple, computations. Each resultant variable is the result of one or more dependent basic computations. Each cell in the systolic array is responsible for a subset of these basic computations. In general, it is desirable to decompose the algorithm into the simplest possible forms of basic computations, thereby allowing maximum parallelism. Some existing mapping methods start with the description of the algorithm in a high-level language with nested do-loops. They then decompose the algorithm by extending the loop indices to every variable in the loop body. With this process a dependency vector is generated for each variable. Often, in loop bodies with complex index structures, this process is ad hoc. In the new method, a space-indexed or time-indexed algorithm with uniform index spaces is described by mathematical expressions. Each resultant variable is a function of certain input variables.
These functions are decomposed into a set of simple computations where each simple computation has the desired form for computation in a simple cell. In this step a designer has complete control in defining the cell structures and hence the complexity of the simple computations. Here, only the simple cell structures with zero, one, or two operations were considered. This implies that the algorithm is decomposed into simple computations with one resultant operand and one, two, or three input operands. An index expansion is performed for the purpose of scheduling the simple computations needed to produce a resultant variable. This scheduling is done by the lexicographical ordering of the indices of the simple computations. Index expansion is performed on the primary inputs for the purpose of equalizing the dimension of all the index spaces. No special ordering is assigned to the primary inputs, which results in complete freedom to calculate the independent resultant variables in any order.
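For matrix multiplication, this representation step can be sketched as follows (the function name and tuple encoding are mine; the decomposition itself is the c_{ij,k} accumulation form used throughout Chapter 6, with the a and b indices expanded by a trailing 0 to equalize dimensions):

```python
from itertools import product

def matmul_simple_computations(N):
    """Simple computations for C = A * B in the form
    c_{ij,k} (*+) = c_{ij,k-1} (*+) + a_{ik,0} ( ) * b_{kj,0} ( ):
    one resultant operand and three input operands per computation."""
    comps = []
    for i, j, k in product(range(1, N + 1), repeat=3):
        result = ("c", (i, j, k))
        inputs = (("c", (i, j, k - 1)),   # dependent partial sum
                  ("a", (i, k, 0)),       # primary input, index expanded
                  ("b", (k, j, 0)))       # primary input, index expanded
        comps.append((result, inputs))
    return comps

comps = matmul_simple_computations(2)
# one extracted feature: the index difference between each result and its
# accumulation input is the same, (0, 0, 1), for every simple computation
diffs = {tuple(r - s for r, s in zip(res[1], ins[0][1]))
         for res, ins in comps}
```

The constancy of such index differences across all the simple computations is exactly the regularity the design rules exploit.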
Certain information, referred to as features, is systematically extracted from the simple computations. This information includes the operands of each simple computation, the set of operations required for the computation of each operand, the lexicographical index differences of each pair of operands of a simple computation, and the lexicographical index differences of the respective operands of each pair of simple computations. The simple computations are then grouped into subsets of simple computations based on the required set of operations and the same type operands. A systolic array, referred to as a sub-systolic array, is designed separately for each subset of the simple computations. The final array is constructed by joining the sub-systolic arrays. This process, therefore, provides a method by which non- homogeneous systolic arrays are designed by joining several homogeneous systolic arrays (i.e. sub-systolic arrays). The grouping of the simple computations based on the required set of operations incorporates the property of operation complexity in the design process. This grouping of the simple computations prevents two simple computations with different sets of operations from being assigned to the same cell. 210 The grouping of the simple computations based on the operand types is a requirement to satisfy the property of modularity. The prOperties of simple and regular data and control communication and local cell interconnections are represented in terms of a set of rules based on the feature interrelationships. Each feature interrelationship is based on the feature values of a pair of simple computations. Each interrelationship checks for violation of one or more architectural properties. A violation can be either local or global. A local violation is when cell intra-connections, as described by the cell models, are violated. 
A global violation occurs when a pair of simple computations computed in a cell violates an architectural property in some other cell in the design. A pair of simple computations can be computed in the same cell if and only if all the feature interrelationships are satisfied for that pair and all its dependent pairs. The feature interrelationships are translated into a procedure that accepts as input a set of simple computations and produces as output a set of information in terms of the relative distances of the operands. This information, referred to as valid projection vectors, is produced for each subset of the simple computations. These vectors are used to design each of the sub-systolic arrays. For each different sub-systolic solution up to four valid projection vectors (one for each of the operands of simple cells) are produced. Each valid projection vector defines the relative distance between the consecutive pairs of operands in all the cells in a sub-systolic array. In non-homogeneous systolic solutions, the valid projection vectors also define the necessary boundary communication linkages required by the adjacent sub-systolic arrays. In addition, a valid projection vector associated with the output operands provides information used to determine the spatial distribution of the output operands; this in turn defines the spatial distribution of the input operands. 211 The architecture specification step involves the determination of design information needed to construct systolic arrays. This information includes individual cell locations, simple computations of each cell, data movement , data timing, design dimension, cell interconnection patterns, data input/output patterns, and cell functions. Not all of this information is needed to complete the design of a systolic array. The information of individual cell locations, data timing, and cell interconnection patterns are by far the most important subset of the architecture specification. 
Certain architecture information, such as the cell interconnection patterns, can be extracted from other architectural specifications, such as the simple computations of each cell. The individual cell locations are determined by geometric representation and projection of the simple computations. For each subset of the simple computations, the index of an output operand, indicated by a point in space, represents the location of a cell where the corresponding simple. computation is computed. A geometric representation is constructed for each subset of the simple computations. The projection of a geometric representation onto a projection space maps several cell locations onto one location. Each location on the projection space now represents an actual cell location in the design. Furthermore, all the simple computations mapped to the same cell location are computed in that cell in sequence. In general, however, not all such geometric projections will correctly map simple computations to a set of cell locations. Geometric projections are determined to be valid if data communication between the cells is local, simple, and regular. Geometric representations and the valid geometric projections are determined by selecting the valid geometric transformations. An index transformation matrix is selected to transform an original representation into a new representation. The projection of the new geometric representation produces the individual cell locations for a systolic solution. Prior to this work there was no reported systematic method for constructing such matrices. (In one case, however, these matrices are constructed 212 from a set of predetermined cell interconnection patterns and others have reported certain properties of such matrices.) The new systolic array design method provides a method for systematically constructing the transformation matrices from the valid projection vectors. 
The matrices constructed here have different properties than those selected by others. These matrices are integer and non-singular, corresponding to integer and one-to—one transformations between the original and the new geometric representations. Furthermore, the tinting of the computations is determined not from these matrices, but from the order of the computations and the data flow timing which indicates when the actual primary inputs are supplied to the array input cells. The cell interconnections are performed by comparing the operands of the adjacent cells. The valid projection vectors guarantee local communication between the cells. Bach sub-systolic array, by definition, is a homogeneous array; hence the same cell interconnection patterns are used for all the cells in the array. These interconnection patterns are represented by vectors indicating the direction of the input lines. Macro cells are repeated at each cell location to produce the cell interconnections. The line directions are also used to determine the data flow timing. The various steps involved in the design method are illustrated by a Y-chart in Figure 7.1. The axes are Algorithm Representation, Algorithm Model, and Architecture Specification. This Y-chart has been used before to compare several existing systolic array design methods [22]. As indicated in the Y-chart, some of the features are extracted during the algorithm representation. With this step, the repeated extraction of these features during the determination of the valid projection vectors is eliminated. The projection vectors are extracted from the model and if no valid projection vector exists for a set of simple computations then either no systolic solution exists to implement the simple computations or one cell must be reserved for each of the simple computations. This method, therefore, can also be used to verify whether a 213 systolic solution exists for a given set of set of simple computations. 
In addition, it could also be used to verify the correctness of a given systolic solution. This method also produces all the projection vectors for a set of simple computations. This is important as there may exist many different systolic solutions with different performance parameters. In fact, an index proportional to the total number of required cells, referred to as cell count ratio, is calculated from the projection vectors. This index can be used early on in a design process to select systolic solutions which require fewest cells. Data movement information is also determined earlier in the design process to identify stationary and non-stationary operands in the array. Systolic solutions with non-stationary operands usually require higher I/O bandwidth. This method was applied to several known problems and the known systolic solutions were reproduced. For some problems, such as matrix multiplication and 2D convolution, it produced new, previously undiscovered solutions. Moreover, some are found to have better performance parameters than previously known solutions. In this method, the design rules are described in terms of the features extracted from the simple computations. As the problem size increases so does the number of the simple computations. Systolic arrays are highly structured and apply to highly .structured algorithms. Therefore, the design information, such as the geometric transformation matrices, determined from a small problem size are used to design larger arrays. The exploration of this new approach for systematically designing systolic arrays started with the goal of discovering the relationships between two different representations of a class of highly structured algorithms; mathematical expressions with bounded and uniform index spaces and their systolic solutions. It is not wholly inappropriate to view systolic arrays as a form of algorithm representation because 1) they are highly structured; and 2) they are unique for each algorithm. 
(A systolic array is unique in the sense that it is tailored to a specific application.) The hypothesis that there must exist certain inherent relationships between these two representations has been validated as a result of this research.

[Figure 7.1: Y-chart representation of the new systolic array design method, with axes Algorithm Representation, Algorithm Model, and Architecture Specification.]

LIST OF REFERENCES

[1] H. T. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” Sparse Matrix Proceedings 1978, Society for Industrial and Applied Mathematics, pp 256-282.

[2] P. R. Cappello and K. Steiglitz, “Digital Signal Processing Applications of Systolic Algorithms,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 243-254.

[3] R. W. Priester, H. J. Whitehouse, K. Bromley and J. B. Clary, “Signal Processing with Systolic Arrays,” IEEE 1981, pp 207-215.

[4] H. T. Kung, “Let's Design Algorithms for VLSI Systems,” Caltech Conference on VLSI, Jan. 1979, pp 75-90.

[5] H. T. Kung, “Why Systolic Architectures?,” Computer Magazine 15(1), 1982, pp 37-46.

[6] K. Hwang and Y-K Cheng, “VLSI Computing Structures for Solving Large-Scale Linear System of Equations,” Conf. on Parallel Processing, 1980, pp 217-227.

[7] L. Johnsson and D. Cohen, “Computational Arrays for the Discrete Fourier Transform,” COMPCON Spring 1981, San Francisco CA, Feb. 1981, pp 236-244.

[8] M. J. Foster and H. T. Kung, “The Design of Special-Purpose VLSI Chips,” Computer, Jan. 1980, pp 26-40.

[9] H. T. Kung, L. M. Ruane and D. W. L.
Yen, “A Two-Level Pipelined Systolic Array for Convolutions,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 255-264.

[10] J. L. Bentley and H. T. Kung, “An Introduction to Systolic Algorithms and Architectures,” Nav. Res. Rev. (USA), vol. 35, no. 2, 1983, pp 3-16.

[11] L. Johnsson and D. Cohen, “A Mathematical Approach to Modeling the Flow of Data and Control in Computational Networks,” in VLSI Systems and Computations, H. T. Kung et al. eds., Computer Science Press, Oct. 1981, pp 213-225.

[12] G. J. Li and B. W. Wah, “The Design of Optimal Systolic Algorithms,” Proc. of Computer Software and Applications Conf., 1983, pp 310-319.

[13] C. E. Leiserson, F. M. Rose and J. B. Saxe, “Optimizing Synchronous Circuitry by Retiming,” Proc. Third Caltech Conf. on VLSI, ed. R. Bryant, Comp. Sci. Press, 1983, pp 87-116.

[14] P. R. Cappello and K. Steiglitz, “Unifying VLSI Array Designs with Geometric Transformations,” 1983 Conf. on Parallel Processing, 1983, pp 448-457.

[15] S. Y. Kung, “On Super Computing with Systolic/Wavefront Array Processors,” Proc. IEEE, Vol. 72, No. 7, 1984, pp 868-884.

[16] I. V. Ramakrishnan, D. S. Fussell and A. Silberschatz, “On Mapping Homogeneous Graphs on a Linear Array-Processor Model,” Proc. of 1983 Int'l Conf. on Parallel Processing, 1983, pp 440-447.

[17] R. H. Kuhn, “Transforming Algorithms for Single-Stage and VLSI Architecture,” Workshop on Interconnection Networks for Parallel and Distributed Processing, 1980, pp 11-19.

[18] W. L. Miranker and A. Winkler, “Space-time Representations of Computational Structures,” Computing, Vol. 32, 1984, pp 93-114.

[19] J. A. B. Fortes and F. P. Presicce, “Optimal Linear Schedules for the Parallel Execution of Algorithms,” 1984 Conf. on Parallel Processing, Aug. 1984, pp 322-328.

[20] P. Quinton, “Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations,” Proc. 11th Symp. Comp. Architecture, 1984, pp 208-214.

[21] D.
Gannon, “Pipelining Array Computations for MIMD Parallelism: A Function Specification,” 1982 Conf. on Parallel Processing, 1982, pp 284-286.

[22] J. A. B. Fortes, K. S. Fu and B. W. Wah, “Systematic Approaches to the Design of Algorithmically Specified Systolic Arrays,” 1985 Int'l Conf. on Acoustics, Speech, and Signal Processing, Tampa, Florida, Mar. 1985, pp 300-303.

[23] L. Johnsson, U. Weiser, D. Cohen, and A. L. Davis, “Towards a Formal Treatment of VLSI Arrays,” Caltech Conf. on VLSI, Jan. 1981, pp 375-398.

[24] D. S. Broomhead, J. G. Harp, J. G. McWhirter, K. J. Palmer and J. G. B. Roberts, “A Practical Comparison of the Systolic and Wavefront Array Processing Approach,” 1985 ICASSP, March 1985, pp 297-299.

[25] D. Cohen, “Mathematical Approach to Computational Networks,” USC Technical Report, ISI/RR-78-73, Nov. 1978.

[26] M. Lam and J. Mostow, “A Transformational Model of VLSI Systolic Design,” IFIP 6th Symp. Computer Hardware Descriptive Languages and Applications, 1983, pp 65-77.

[27] H. T. Kung and W. T. Lin, “An Algebra for VLSI Algorithm Design,” Conf. on Elliptic Problem Solvers, April 1983, pp 1-26.

[28] J. A. B. Fortes and D. I. Moldovan, “Data Broadcasting in Linearly Scheduled Array Processors,” 11th Symp. on Computer Architecture, 1984, pp 224-231.

[29] D. I. Moldovan, C. I. Wu and J. A. B. Fortes, “Mapping an Arbitrarily Large QR Algorithm into a Fixed Size VLSI Array,” Proc. of 1984 Int'l Conf. on Parallel Processing, Aug. 1984, pp 365-373.

[30] D. I. Moldovan, “On the Analysis of VLSI Algorithms,” IEEE Trans. on Computers, Vol. C-31, No. 11, Nov. 1982, pp 1121-1126.

[31] D. I. Moldovan, “On the Design of Algorithms for VLSI Systolic Arrays,” Proc. of the IEEE, Vol. 71, No. 1, Jan. 1983, pp 113-120.

[32] C. E. Leiserson and J. B. Saxe, “Optimizing Synchronous Systems,” 22nd Symp. on Foundation of Computer Science, 1981, pp 23-36.

[33] G. J. Li and B. W. Wah, “Optimal Design of Systolic Arrays for Image Processing,” Proc.
of Workshop on Computer Architecture for Pattern Analysis and Image Database Management, Oct. 1983, pp 134-141.

[34] C. E. Leiserson, “Systolic and Semi-systolic Design,” IEEE Int'l Conf. on Computer Design: VLSI in Computers, 1983, pp 627-632.

[35] D. D. Gajski and R. H. Kuhn, “Guest Editor's Introduction: New VLSI Tools,” IEEE Computer, No. 12, Dec. 1983, pp 11-14.

[36] D. I. Moldovan, “ADVIS: A Software Package for the Design of Systolic Arrays,” IEEE Trans. on Computer-Aided Design, Vol. CAD-6, No. 1, Jan. 1987, pp 33-40.

[37] A. Khurshid and P. D. Fisher, “Design of a Reconfigurable Systolic Array Using LSSD Techniques,” Int'l Conf. on Computer Design: VLSI in Computers, Oct. 1984, pp 171-175.

[38] L. Snyder, “Introduction to the Configurable Highly Parallel Computers,” IEEE Computer, Jan. 1982, pp 47-56.

[39] O. H. Ibarra, S. M. Kim, and M. A. Palis, “Designing Systolic Algorithms Using Sequential Machines,” IEEE Trans. on Computers, Vol. C-35, No. 6, June 1986, pp 531-542.

[40] C. Mead and L. Conway, “Introduction to VLSI Systems,” Addison-Wesley, Section 8.3, 1980.

[41] H. T. Kung and R. L. Picard, “Hardware Pipelines for Multi-Dimensional Convolution and Resampling,” IEEE 1981 Computer Architecture for Pattern Analysis and Image Database Management, pp 273-278.

[42] S. Y. Kung, S. C. Lo, S. N. Jean, and J. N. Hwang, “Wavefront Array Processors: Concept to Implementation,” Computer, July 1987, pp 18-33.

[43] B. L. Drake, F. T. Luk, J. M. Speiser, and J. J. Symanski, “SLAPP: A Systolic Linear Algebra Parallel Processor,” Computer, July 1987, pp 45-49.

[44] J. V. McCanny and J. G. McWhirter, “Some Systolic Array Developments in the United Kingdom,” Computer, July 1987, pp 51-63.