This is to certify that the thesis entitled

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

presented by Abdullah Celik Erdal has been accepted towards fulfillment of the requirements for the M.S. degree in Electrical Engineering.

Major professor

Date: February 25, 1983

Michigan State University

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

By

Abdullah Celik Erdal

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical Engineering and Systems Science

1983

ABSTRACT

BINARY TREE STRUCTURES FOR MATRIX MULTIPLICATION

By Abdullah Celik Erdal

Multiple arithmetic processors are interconnected to form a binary tree structure and then utilized as a high-performance special-purpose co-processor. Such co-processors may be incorporated into a digital signal or image processing system to perform such tasks as matrix multiplication, which is routinely required to compute convolutions or discrete Fourier transforms. Specifically, this research shows that any nxn dense matrix multiplication can be performed in (2n + log(n)) time steps by using n(2n - 1) processing elements, where each processing element is designed to be either a multiplier or an adder with some additional registers.

To demonstrate the utility of this binary tree structure, implementations of the convolution and the discrete Fourier transform operations are presented. This binary tree structure may also be applied to other digital image and signal processing algorithms, thus establishing the flexibility, generality, and cost effectiveness of this structure.
To My Parents

ACKNOWLEDGMENTS

I wish to express my sincere thanks and appreciation to Dr. P.D. Fisher for his constant guidance, continuous encouragement, and valuable suggestions throughout this study. Gratitude is also expressed to the members of my thesis committee, Dr. M.A. Shanblatt and Dr. R.G. Reynolds, for their comments and suggestions.

Above all, I especially wish to thank my wife, Nil, for her endless encouragement, patience, and understanding throughout the course of this study.

This research was supported in part by NSF under Grant Number MCS79-09216.

TABLE OF CONTENTS

Chapter
I. INTRODUCTION
II. BACKGROUND
   2.1 Systolic Arrays
   2.2 VLSI Cellular Arithmetic Arrays
   2.3 Discussion
III. AN ALGORITHM FOR MATRIX MULTIPLICATION
   3.1 Functional and Structural Description
   3.2 An Algorithm for Dense Matrix Multiplication on Trees
   3.3 Discussion and Performance Analysis
IV. APPLICATIONS OF THE TREE STRUCTURES
   4.1 Convolution
   4.2 Discrete Fourier Transform (DFT)
   4.3 The Design of a Special-Purpose Device for 1-D Convolution and DFT
V. CONCLUSION
   5.1 Summary
   5.2 Further Research
REFERENCES

LIST OF FIGURES

2.1 Geometries for the inner product step processor [2]
2.2 Mesh-connected systolic arrays [2]
2.3 Multiplication of a vector by a band matrix with p = 2 and q = 3 [2]
2.4 The linearly connected systolic array for the vector multiplication problem in Figure 2.3 [2]
2.5 Band matrix multiplication [2]
2.6 The hex-connected systolic array for the matrix multiplication problem in Figure 2.5 [2]
2.7 The additive multiplication cell [5]
2.8 3x3 square matrix multiplication
2.9 VLSI array for pipelined multiplication of two 3x3 dense matrices [5]
3.1 A processing element (PE)
3.2 The tree structure
3.3 The tree structure with PE's and data flow shown
3.4 Matrices A, B, and C where AxB = C
3.5a Individual elements of the matrix C
3.5b Redefined version of the first column of matrix C
3.6a Time step 4
3.6b Time step 5
3.6c Time step 6
3.6d Time step 7
3.6e Time step 8
3.6f Time step 9
3.6g Time step 10
3.7 Computation table
3.8 Number of required processing elements vs. N
3.9 Total computation time of an NxN dense matrix
3.10a Number of required processing elements vs. N
3.10b Number of required processing elements vs. N
3.11a Total computation time of an NxN dense matrix
3.11b Total computation time of an NxN dense matrix
3.11c Total computation time of an NxN dense matrix
3.12a Speedup vs. N
3.12b Speedup vs. N
3.13a Percent efficiency vs. N
3.13b Percent efficiency vs. N
4.1 Noncyclic convolution for N=4
4.2a 1-D convolution tree
4.2b Realization of 1-D convolution
4.3 DFT for N=4
4.4 One-dimensional DFT trees (real or imaginary part of the overall 1-D DFT design)
4.5 Two-dimensional DFT tree for N=2
4.6 Block diagram of MPE and APE
4.7 Block diagram of a possible single-chip implementation of the tree structure ("Tree-IC")
4.8 Tree-structured 1-D convolution for N=16, and 1-D DFT system for N=8

CHAPTER I
INTRODUCTION

In many areas of computer application, such as image and signal processing, the quality of the answer the computer returns is proportional to the amount of computation performed [1]. General-purpose computers have commonly been used for these applications; however, the high computational throughput and data rates demanded by these applications make conventional computers inefficient and thus impractical for many contemporary applications.

Advances in the design and fabrication of Very Large Scale Integrated (VLSI) circuits will soon make it feasible to implement computers consisting of tens or even hundreds of thousands of computing elements. For this reason, the tendency is to design cost-effective, high-performance, special-purpose devices that meet specific application requirements by employing many processing elements. Many special-purpose devices for applications that require highly parallel computing have been proposed and built. Systolic arrays, introduced by Kung [2-4], and the VLSI computing structures, introduced by Hwang [5], are such devices. Both consist of a set of interconnected cells, each capable of performing some simple operation, where each cell rhythmically computes and passes data through the system. They have simple and regular communication paths and are highly concurrent modular systems. Both authors emphasize procedures for solving linear systems of equations and some matrix computations, such as L-U decomposition and matrix multiplication. These computations play a vital role in numerous computer applications which require high-speed computer performance.
For example, matrix multiplication is an important image and signal processing operation, and it is the subject of this research. This research develops and investigates a high-performance special-purpose device for matrix multiplication, which works as a co-processor attached to a host computer. It capitalizes on the properties of VLSI to achieve high throughput rates and high efficiency. Although the algorithm and the structure conform to the restrictions of VLSI technology, they can be easily implemented with pre-VLSI technology without significant performance degradation. A binary tree structure is used to obtain a simple, regular, and short communication geometry, which is considered one of the most desired attributes of a VLSI implementation. One of the main advantages of using the binary tree structure over the hexagonally connected and mesh-connected structures proposed by Kung [2-4] and Hwang [5], respectively, is that a tree structure uses far fewer connections between the processing elements; also, the longest distance any data must travel is considerably less than in any of the other schemes, leading to higher speeds. Furthermore, only two different types of processing elements, namely a multiplier and an adder, are used, so that a simple, regular, and modular design yields cost-effective special-purpose systems.

There are many special- and general-purpose machines that employ binary tree-structured interconnections between the processing elements. For example, the "Inner Product Computer" discussed in a report by Swartzlander [6] is a special-purpose computational unit intended to be used as a co-processor to compute the inner product of complex vectors by using binary tree structures, which are very similar to the tree-structured arrays investigated in this research. The "X-Tree," a University of California, Berkeley, project [7], is a tree-structured general-purpose multi-processor computer architecture.
It has the hierarchical structure of a binary tree, but extra links are employed between the nodes to enhance fault tolerance and distribute message traffic uniformly [7]. The California Institute of Technology "Tree Machine," based on Browning's doctoral dissertation [8], is also a binary tree-structured multiprocessor system for general-purpose applications. However, this machine does not have the extra links between its processors as in the case of "X-Tree." Both of these machines are general-purpose computers.

Background information is presented in Chapter II in order to review some alternative highly parallel computing structures for matrix multiplication. Some examples are also given. The third chapter of this research introduces the proposed matrix multiplication algorithm and the structure of the binary tree, where each node of the tree represents a processing element. The chapter closes with an example matrix multiplication and compares the proposed algorithm with the two algorithms introduced in Chapter II on the basis of speedup and efficiency. The fourth chapter gives application examples for the proposed architecture and describes the processing elements, chips, and the overall system architecture. Lastly, in the concluding chapter, some of the advantages and drawbacks of this architecture are delineated. Possible extensions of this work for future research are also included.

CHAPTER II
BACKGROUND

The purpose of this chapter is to review the matrix multiplication algorithms of the two multiple special-purpose functional units introduced by Kung [2-4] and Hwang [5]. These units employ many tightly interconnected elements and are designed to be used in a highly parallel computing environment for applications such as signal and image processing. Systolic arrays and cellular arithmetic arrays are introduced in Section 2.1 and Section 2.2, respectively. Sample matrix multiplication algorithms, which may be implemented on these arrays, are also included.
The chapter concludes with a discussion of some of the negative aspects of these algorithms.

2.1 Systolic Arrays

In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many tightly connected processing elements in a co-processor before returning to memory. Each processing element, called an inner product step processor, performs a single operation, namely the inner product step, C ← C + A × B. Figure 2.1 shows two types of geometries for this processing element (PE).

Each PE has three registers R_A, R_B, and R_C, and each register has two connections, one for input and one for output. In each time step, where one step is defined as the interval of time for any one of the PE's to complete its computation, the PE shifts the data on its input lines, denoted by A, B, and C, into R_A, R_B, and R_C, respectively, computes

R_C ← R_C + R_A × R_B,

and makes the input values of R_A and R_B, together with the new value of R_C, available as outputs on the output lines A, B, and C, respectively [2]. Each processing element is connected to its neighboring PE's as shown in Figure 2.2. Using these different systolic arrays, it is possible to design systolic systems or systolic co-processors for such computations as band matrix multiplication, L-U decomposition, and the solution of triangular linear systems [2-4].

For example, consider the problem of multiplying a matrix A = (a_ij) by a vector x = (x_1, ..., x_n) (see Figure 2.3). The product y = (y_1, ..., y_n) can be computed by the following recurrences [2]:

y_i^(1) = 0,
y_i^(k+1) = y_i^(k) + a_ik x_k,
y_i = y_i^(n+1).

If A is an nxn band matrix with bandwidth w = p + q - 1 (see Figure 2.4 for the case p = 2 and q = 3), then the above recurrences can be evaluated by pipelining the x_i and y_i through a linearly connected systolic array consisting of w inner product step PE's in (2n + w) time steps [2].
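The recurrences above can be checked with a direct sequential simulation. This is a sketch of the arithmetic only, not of the systolic pipelining itself, and the function name is illustrative:

```python
# Direct evaluation of the recurrences y_i^(1) = 0,
# y_i^(k+1) = y_i^(k) + a_ik * x_k, y_i = y_i^(n+1); each addition of an
# a_ik * x_k term corresponds to one inner product step in the array.
def matvec_by_recurrence(A, x):
    n = len(x)
    y = [0] * n
    for i in range(n):
        for k in range(n):
            y[i] = y[i] + A[i][k] * x[k]   # one inner product step
    return y

print(matvec_by_recurrence([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```

In the systolic version, the w PE's evaluate these same steps concurrently as the x_i and y_i stream through the array.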
Figure 2-2: Mesh-connected systolic arrays [2].

Figure 2-3: Multiplication of a vector by a band matrix with p = 2 and q = 3 [2].

Figure 2-4: The linearly connected systolic array for the vector multiplication problem in Figure 2-3 [2].

Another example is the multiplication of two nxn band matrices with bandwidths w_1 and w_2. The matrix product C = (c_ij) of A = (a_ij) and B = (b_ij) can be computed by the following recurrences [2]:

c_ij^(1) = 0,
c_ij^(k+1) = c_ij^(k) + a_ik b_kj,
c_ij = c_ij^(n+1).

Band matrices A, B, and C, shown in Figure 2.5, are pipelined through a systolic array of (w_1 × w_2) hex-connected PE's. The interconnection network and data flow are shown in Figure 2.6. Each c_ij is able to accumulate all of its terms before it passes through the upper boundaries. If A and B are nxn band matrices with bandwidths w_1 and w_2, respectively, then a systolic array of (w_1 × w_2) hex-connected PE's can compute the result C in [3n + min(w_1, w_2)] time steps. If A and B are nxn dense matrices, then (3n^2 - 3n + 1) hex-connected PE's can compute the result C in 5(n - 1) time steps [4].

As described, the systolic array architectures provide the capability for realizing a number of important matrix operations. In addition to achieving a high computational rate by means of pipelining and concurrent computation, these arrays are characterized by a simple, regular, and short communication geometry.

2.2 VLSI Cellular Arithmetic Arrays

Processing elements, as in the case of systolic arrays, compute the inner product step, C ← C + A × B.

Figure 2-5: Band matrix multiplication [2].
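The PE and time-step counts quoted above for the hex-connected arrays can be collected into two small helpers; the function names are mine, not from [2-4]:

```python
# Resource counts for Kung's hex-connected systolic arrays [2-4]:
# band matrices of bandwidths w1, w2 use w1 * w2 PE's in 3n + min(w1, w2)
# time steps; dense nxn matrices use 3n^2 - 3n + 1 PE's in 5(n - 1) steps.
def hex_band_resources(n, w1, w2):
    return w1 * w2, 3 * n + min(w1, w2)

def hex_dense_resources(n):
    return 3 * n * n - 3 * n + 1, 5 * (n - 1)

print(hex_dense_resources(4))  # (37, 15)
```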
Figure 2-6: The hex-connected systolic array for the matrix multiplication problem in Figure 2-5 [2].

Figure 2-7: The additive multiplication cell [5].

Each cell has three inputs and three outputs, as shown in Figure 2.7. Although it is not specified in [5], presumably each processing element has a set of three registers associated with the input and output lines A, B, and C. Each processing element is connected to its neighboring PE's to form a short communication path.

Consider the problem of multiplying two nxn dense matrices A = (a_ij) and B = (b_ij) (see Figure 2.8 for n = 3), where the product coefficients are

c_ij = Σ_{k=1}^{n} a_ik b_kj, for all i and j [5].

The elements of matrices A and B are fed from the lower and upper input lines in a pipelined fashion, one skewed row or one skewed column at a time, as in Figure 2.9 [5]. Some dummy zero inputs are interspersed with the matrix elements to ensure correct results. Multiplying two nxn dense matrices requires n(2n - 1) processing elements and takes (4n - 2) time steps to produce the product matrix C = (c_ij).

Figure 2-8: 3x3 square matrix multiplication.

2.3 Discussion

All the PE's for both algorithms are kept busy all the time, either by performing the inner product computation or by simply passing data to their neighbors. Other computations, such as matrix inversion and L-U decomposition, can also be implemented by using one or two different PE's and different array architectures. Both of these VLSI arrays are expandable to allow modular growth. However, they both have some drawbacks. For example, they need data skewing, and control of every input and output datum, in order to accomplish their correct timing.
Figure 2-9: VLSI array for pipelined multiplication of two 3x3 dense matrices [5].

This increases the circuit complexity and introduces additional delays. One other negative aspect of these arrays is that, to accomplish the multiple use of every element of matrices A and B, the result of each processing element, as well as the elements of A and B, must be pipelined through the array. This requires more interconnection lines between the processing elements, and hence more chip area; the number of required registers also increases for the same reason.

CHAPTER III
AN ALGORITHM FOR MATRIX MULTIPLICATION

This chapter provides the requisite groundwork and the algorithm for the development of a highly parallel computing machine. Basically, this machine is a tree-structured array and may be classified as a multiple special-purpose functional unit. First, the processing elements are defined and the basic tree structure is outlined in Section 3.1. In the next section, a new matrix multiplication algorithm is introduced, and a sample matrix multiplication is provided. Section 3.3 concludes with a discussion of the space-time complexity of this tree structure and compares its speedup and efficiency with those of the structures reviewed in Chapter II.

3.1 Functional and Structural Description

* Processing Elements (PE)--As mentioned earlier, there are two types of processing elements, namely, a multiplier and an adder. Each of these PE's has three connection lines A, B, and C, as shown in Figure 3.1, where lines A and B are the input lines and line C is the output line. Each PE performs only a single operation, either C ← A × B or C ← A + B.

Figure 3-1: A processing element (PE).
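The two PE types can be modeled behaviorally in a few lines; the class and attribute names are illustrative, and only the latched-output behavior matters:

```python
# Behavioral model of a tree-structure PE: a multiplier computes
# C <- A x B, an adder computes C <- A + B, and the result is latched
# into the output register R_C at the end of the time step.
class PE:
    def __init__(self, op):
        self.op = op      # 'mul' for a multiplier PE, 'add' for an adder PE
        self.rc = 0       # output register R_C, driving output line C

    def step(self, a, b):
        self.rc = a * b if self.op == 'mul' else a + b
        return self.rc

leaf = PE('mul')
root = PE('add')
print(root.step(leaf.step(2, 3), 4))  # 10
```

Because each result is latched into R_C, chaining PE's as in the figure forms a clocked pipeline rather than a combinational path.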
In each time step, a processing element takes the data on its input lines A and B, performs its operation, and latches its result into register R_C, where R_C is directly connected to the output line C. All outputs are latched and the logic is clocked, so that when one PE is connected to another, the changing output of one during a time step will not interfere with the input of the other during that time step.

* Tree Structure--The tree structure, shown in Figure 3.2, has seven nodes and three levels. The first node is called the root node, and it constitutes level one of the binary tree. Nodes 4, 5, 6, and 7 are called leaf nodes or input nodes. This tree structure is redrawn in Figure 3.3 to show the actual data flow. As shown in this figure, data is pipelined downward from the leaf nodes to the root node, where the output of the root node is the final result of the computation. Only the input nodes (leaf nodes) are multipliers; the rest of the nodes are all adders.

3.2 An Algorithm for Dense Matrix Multiplication on Trees

Let A = (a_ij), B = (b_ij), and C = (c_ij) be nxn dense matrices, where C is the product, which can be computed in (1 + log n) time steps by using n^2(2n - 1) PE's and by employing the type of tree structures shown in Figure 3.3. (Note: the logarithmic function will be used in base two throughout this paper.) All the rows of matrix A and all the columns of matrix B are multiplied together in one time step to produce all the partial products of the resultant matrix C. Then the partial products are added in (log n) time steps to obtain the final result. Consequently, a very high speedup is obtained over the other algorithms. However, this scheme requires too many PE's and a very large I/O bandwidth even for very small matrix multiplications.

Figure 3-2: The tree structure.

Figure 3-3: The tree structure with PE's and data flow shown.
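The fully parallel scheme can be sketched as follows: each c_ij has its own (2n - 1)-PE tree, so one multiply step at the leaves plus log n addition levels yields the (1 + log n)-step figure. The function names are illustrative:

```python
import math

# One c_ij on its own (2n - 1)-PE tree: the n leaf multipliers fire in a
# single time step, then the n - 1 adders merge the partial products in
# log2(n) levels, one level per time step.
def c_ij_by_tree(row, col):
    partial = [a * b for a, b in zip(row, col)]        # step 1: multiply
    while len(partial) > 1:                            # one step per level
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]

def fully_parallel_steps(n):
    return 1 + int(math.log2(n))

print(c_ij_by_tree([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
print(fully_parallel_steps(4))                   # 3
```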
Since this would be an unrealistic assumption, a different approach will be taken to obtain high speedup with many fewer PE's. In the remainder of this paper, it is assumed that A, B, and C are all nxn dense matrices, where n = 2^k for k = 1, 2, 3, ..., although this is not an essential assumption. With this assumption, we can define some terms that will be used throughout this paper as follows:

p = total number of PE's that will be used for a given nxn matrix multiplication.
t = total amount of time it takes for a given nxn matrix multiplication to be computed.

It will be shown in this section that

p = n(2n - 1); t = (2n + log n); where n = 2^k for k = 1, 2, 3, ...

for nxn dense matrix multiplication.

The first modification needed is to reduce the number of input lines for the input nodes (multipliers) from two to one. This will not only reduce the I/O bandwidth requirements, but also reduce the number of pins for a chip implementation. However, the price we pay is a reduction in overall speed. The second modification is to include one extra register, R_B, in each multiplier to use as a buffer holding an element of the matrix B. This register will be implemented inside the multiplier PE. Finally, it will be assumed that n-way memory interleaving is incorporated in the host computer. With this memory model, one word can be fetched from each of the n memory units at once in one clock period, so the effective memory bandwidth is n [9].

With all the given modifications and assumptions, it is now possible to load all the elements of the matrix B into the R_B registers of the n^2 multiplier PE's in n time steps, one column at a time, where each column has n elements. Note that if the I/O bandwidth is larger, more data can be accessed in one time step.
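The claimed resource figures follow directly from this construction and can be tabulated in a few lines; the function name is mine:

```python
import math

# For n = 2^k, the modified scheme uses p = n(2n - 1) PE's and finishes in
# t = 2n + log2(n) time steps: n steps to load B, n steps to pipeline the
# rows of A, and log2(n) steps to drain the adder trees.
def tree_resources(n):
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    return n * (2 * n - 1), 2 * n + int(math.log2(n))

print(tree_resources(4))   # (28, 10)
print(tree_resources(32))  # (2016, 69)
```

For the 4x4 example developed below, this gives 28 PE's and 10 time steps, matching the last snapshot in Figure 3.6 (time step 10).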
Also, it is assumed that memory access time is equal to one time step, although it may be possible to access the memory more than once per time step, which would speed up the overall computation. Furthermore, for simplicity, it will be assumed that all PE's, multiplier or adder, have equal computation times.

Consider a 4x4 matrix multiplication example to illustrate this algorithm, where matrices A, B, and C are shown in Figure 3.4. Individual elements of the matrix C are shown in Figure 3.5a. Each column vector will be computed separately. For this reason, n Column Vector Computation Units (abbreviated CVCU) will be used, where each CVCU contains a total of (2n - 1) PE's, bringing the total number of PE's to n(2n - 1). These (2n - 1) PE's in each CVCU consist of n multiplier PE's and (n - 1) adder PE's. Specifically, for this example, 4 CVCU's have been used. Each CVCU contains 7 PE's, of which 4 are multiplier PE's and 3 are adder PE's. First consider CVCU-I, which produces column vector I of the resultant matrix. As shown in Figure 3.5a, the equations for c_11, c_21, c_31, and c_41 have four elements in common, namely b_11, b_21, b_31, and b_41, so these common elements are first loaded into the R_B registers of CVCU-I, as shown in Figure 3.6a.

Figure 3-4: Matrices A, B, and C where A × B = C.

Figure 3-5a: Individual elements of the matrix C.

Column I:
c_11 = a_11 b_11 + a_12 b_21 + a_13 b_31 + a_14 b_41
c_21 = a_21 b_11 + a_22 b_21 + a_23 b_31 + a_24 b_41
c_31 = a_31 b_11 + a_32 b_21 + a_33 b_31 + a_34 b_41
c_41 = a_41 b_11 + a_42 b_21 + a_43 b_31 + a_44 b_41

Column II:
c_12 = a_11 b_12 + a_12 b_22 + a_13 b_32 + a_14 b_42
c_22 = a_21 b_12 + a_22 b_22 + a_23 b_32 + a_24 b_42
c_32 = a_31 b_12 + a_32 b_22 + a_33 b_32 + a_34 b_42
c_42 = a_41 b_12 + a_42 b_22 + a_43 b_32 + a_44 b_42

Column III:
c_13 = a_11 b_13 + a_12 b_23 + a_13 b_33 + a_14 b_43
c_23 = a_21 b_13 + a_22 b_23 + a_23 b_33 + a_24 b_43
c_33 = a_31 b_13 + a_32 b_23 + a_33 b_33 + a_34 b_43
c_43 = a_41 b_13 + a_42 b_23 + a_43 b_33 + a_44 b_43

Column IV:
c_14 = a_11 b_14 + a_12 b_24 + a_13 b_34 + a_14 b_44
c_24 = a_21 b_14 + a_22 b_24 + a_23 b_34 + a_24 b_44
c_34 = a_31 b_14 + a_32 b_24 + a_33 b_34 + a_34 b_44
c_44 = a_41 b_14 + a_42 b_24 + a_43 b_34 + a_44 b_44
Figure 3-5b: Redefined version of the first column of matrix C, where x_ik denotes the partial product a_ik b_k1:

c_11 = x_11 + x_12 + x_13 + x_14
c_21 = x_21 + x_22 + x_23 + x_24
c_31 = x_31 + x_32 + x_33 + x_34
c_41 = x_41 + x_42 + x_43 + x_44

Also, for simplicity, c_11, c_21, c_31, and c_41 are redefined as shown in Figure 3.5b. At time steps 5, 6, 7, and 8, matrix A is pipelined in one row at a time, where each row has n elements, and for an nxn dense matrix there are n rows. So it takes n time steps to pipeline all the rows of an nxn dense matrix. After the pipelining of the last row, it takes (log n) time steps to merge the partial results of each PE. At each time step, each PE takes its operands, operates on them, and stores the result into R_C. Like a systolic array, each PE takes data in and pumps data out at every time step. Only CVCU-I is shown in Figure 3.6; CVCU's II, III, and IV perform the same computations at time steps 1 through 10. Note that the same row elements are needed for the computations in all the CVCU's. For example, at time step 5, the first row of matrix A is input to CVCU-I, -II, -III, and -IV simultaneously. At time steps 6, 7, and 8, rows II, III, and IV of matrix A are pipelined through every CVCU, respectively. From this it is concluded that only n elements are needed at each time step to carry out all the computations, as shown in Figure 3.7.

Note that band matrix multiplications and matrix-vector multiplications can easily be implemented on these tree structures with the same algorithm. For example, an nxn matrix and n-element vector multiplication can be computed using n(2n - 1) PE's in (n + 1 + log n) time steps.

Figure 3-6a: Time step 4.
Figure 3-6b: Time step 5.
Figure 3-6c: Time step 6.
Figure 3-6d: Time step 7.
Figure 3-6e: Time step 8.
Figure 3-6f: Time step 9.
Figure 3-6g: Time step 10.
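The behavior of one CVCU over these time steps can be simulated sequentially. This is a behavioral sketch only; the function name is mine, and each pass of the inner `while` loop stands for one level of the adder tree:

```python
# One CVCU computing column j of C = A x B: column j of B is preloaded
# into the R_B buffers of the n leaf multipliers, then the rows of A are
# pipelined in, one row per time step, and the adder tree merges the
# partial products x_ik = a_ik * b_kj.
def cvcu_column(A, B, j):
    n = len(A)
    rb = [B[k][j] for k in range(n)]          # preloaded R_B buffers
    column = []
    for row in A:                             # one row per time step
        partial = [row[k] * rb[k] for k in range(n)]
        while len(partial) > 1:               # adder-tree merge levels
            partial = [partial[m] + partial[m + 1]
                       for m in range(0, len(partial), 2)]
        column.append(partial[0])
    return column

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print([cvcu_column(A, B, j) for j in range(2)])  # [[19, 43], [22, 50]]
```

Running the n CVCU's side by side, all on the same stream of rows of A, reproduces the full product, which is why only n matrix elements need to be fetched per time step.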
Figure 3-7: Computation table.

3.3 Discussion and Performance Analysis

As shown in Section 3.2, an nxn dense matrix multiplication can be computed using n(2n - 1) PE's in (2n + log n) time steps. Note that, for the algorithm proposed in this paper, there is no need for data skewing or a data alignment network, because the algorithm requires only row access to the matrices A and B for the matrix multiplication AxB. The number of processing elements and the total time steps required for a given nxn dense matrix are shown in Figures 3.8 and 3.9, respectively. Figures 3.10a and 3.10b compare this algorithm with Kung's [2-4] and Hwang's [5] algorithms in terms of the number of required PE's for the range n = 2 to n = 2048. Figures 3.11a, b, and c depict the comparison of these algorithms in terms of the time steps necessary for any given nxn dense matrix over the same range. As shown in Figures 3.10 and 3.11, the algorithm proposed in this paper obtains higher speeds with fewer processing elements than the other two algorithms. Note also that it is possible to use fewer PE's to compute a given nxn dense matrix; however, this causes an increase in total time steps. For example, 1 CVCU can be employed instead of n to compute a given nxn matrix using (2n - 1) PE's in (n^2 + n + log n) time steps, which is still faster than a sequential processor. Combinations of the number of PE's and the total time steps can easily be arranged so that the computation rate of the overall system matches the I/O bandwidth of the host computer.

Speedup and efficiency provide more insight into the performance of these algorithms, where the speedup S is defined as the ratio of the serial computation time to the parallel computation time.
Figure 3-8: Number of required processing elements vs. N.

Figure 3-9: Total computation time of an NxN dense matrix.

Figure 3-10a: Number of required processing elements vs. N.

Figure 3-10b: Number of required processing elements vs. N.

Figure 3-11a: Total computation time of an NxN dense matrix.

Figure 3-11b: Total computation time of an NxN dense matrix.

Figure 3-11c: Total computation time of an NxN dense matrix.

The efficiency E is regarded as the ratio of the actual speedup to the maximum possible speedup using p processing elements, as shown below [10,11]:

S = t_s / t_p ≥ 1
E = S / p ≤ 1

where t_s and t_p are the serial and parallel computation times, respectively. Figures 3.12a and 3.12b illustrate the speedup of each algorithm for the range n = 2 to n = 2048. The percent efficiency of these algorithms for the range n = 2 to n = 1024 is depicted in Figures 3.13a and 3.13b. As shown in Figures 3.12 and 3.13, higher speedup and higher efficiency are obtained by the algorithm introduced in this paper.
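Under the assumption that a sequential processor needs t_s = n^3 multiply-add steps for an nxn dense product (an assumption of this sketch, not a figure quoted from the thesis), S and E for the three schemes follow directly from the step counts given above. The function and variable names are mine:

```python
import math

# S = t_s / t_p and E = S / p, with t_s = n^3 assumed for the serial
# machine and (t_p, p) taken from the text: tree (2n + log2 n, n(2n - 1)),
# Kung (5(n - 1), 3n^2 - 3n + 1), Hwang (4n - 2, n(2n - 1)).
def perf(n, p, t_par):
    s = n ** 3 / t_par
    return s, s / p

n = 32
s_tree, e_tree = perf(n, n * (2 * n - 1), 2 * n + int(math.log2(n)))
s_kung, e_kung = perf(n, 3 * n * n - 3 * n + 1, 5 * (n - 1))
s_hwang, e_hwang = perf(n, n * (2 * n - 1), 4 * n - 2)
print(round(s_tree / s_kung, 2), round(s_tree / s_hwang, 2))   # 2.25 1.83
print(round(e_tree / e_kung, 2), round(e_tree / e_hwang, 2))   # 3.32 1.83
```

These ratios reproduce the factors of roughly 9/4 and 9/5 in speedup, and 10/3 and 9/5 in efficiency, cited for n = 32.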
For example, for n = 32, approximately (9/4) and (9/5) times more speedup, and (10/3) and (9/5) times more efficiency, are obtained over Kung's and Hwang's algorithms, respectively. Note that the efficiency stays almost constant for n ≥ 512 for all algorithms; however, much higher speedups can still be obtained for larger values of n over the other two algorithms. One of the reasons the tree structure performs better than the other structures is that the longest path any data element has to travel is O(log n) instead of O(n). Another important reason for the higher performance is that the tree structure makes use of all of its PE's all the time, whereas the other structures use only half or less of their total PE's for the actual calculation, with the rest of the PE's, most of the time, simply passing data. More detailed performance analysis of these algorithms can be accomplished by including other performance criteria, such as utilization and redundancy as defined in [10,11].

Figure 3-12a: Speedup vs. N.

Figure 3-12b: Speedup vs. N.

Figure 3-13a: Percent efficiency vs. N.

Figure 3-13b: Percent efficiency vs. N.

CHAPTER V
CONCLUSION

5.1 Summary

The purpose of this research was to develop and investigate a high-performance special-purpose device for matrix multiplication which works as a co-processor attached to a host computer.
This device may be employed to perform discrete Fourier transform and convolution operations for digital signal and image processing applications. The design was constrained by the requirement that basic VLSI rules be followed; i.e., the overall co-processor has a simple, regular structure with short, nearest-neighbor communication paths. Systolic arrays, introduced by Kung [2-4], as well as other VLSI computing structures, introduced by Hwang [5], were reviewed in Chapter II. Examples of matrix multiplication algorithms were shown, and the corresponding space-time complexity of each of these algorithms was indicated. Specifically, it was indicated that an nxn dense matrix multiplication can be computed in 5(n - 1) and (4n - 2) time steps by using (3n^2 - 3n + 1) and n(2n - 1) processing elements by Kung's [2-4] and Hwang's [5] algorithms, respectively. A new binary tree structure matrix multiplication algorithm was introduced in Chapter III. The algorithm capitalizes on the properties of VLSI to obtain high throughput rates and high efficiency. The algorithm first forms the products of the matrix elements in the leaf nodes, and then sums the partial results by merging them as they are pipelined toward the root node of the binary tree. In this way, massive parallelism can be achieved to introduce high degrees of pipelining and multiprocessing. Processing elements, which are the nodes of the binary tree, and the basic tree structures were also defined in this chapter. It was shown that the binary tree structures can be implemented with only a few different types of simple processing elements, and that they have a simple, regular, and short communication geometry, which is considered a necessary attribute of an efficient VLSI implementation. This in turn yields cost-effective special-purpose devices.
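The leaf-multiply, tree-add scheme summarized above can be modeled in a few lines. The following Python sketch is an illustrative software model of one column processing unit, not the thesis hardware: the n leaf multipliers form the products, and a binary adder tree merges the partial sums level by level toward the root.

```python
def tree_inner_product(row, col):
    """Model of one column unit: leaf multipliers feed a binary adder
    tree that merges partial sums toward the root node."""
    level = [a * b for a, b in zip(row, col)]       # leaf multipliers
    while len(level) > 1:                           # one adder level per pass
        if len(level) % 2:                          # odd count: pad with identity
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]                                 # root holds the result

def tree_matmul(A, B):
    n = len(A)
    cols = [[B[k][j] for k in range(n)] for j in range(n)]
    return [[tree_inner_product(A[i], cols[j]) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tree_matmul(A, B))   # [[19, 22], [43, 50]]
```

Each pass through the while loop corresponds to one adder level of the tree, so there are log2(n) addition levels for n a power of two; the model runs sequentially, whereas the hardware overlaps successive rows of A in a pipeline.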
Chapter III concluded with a discussion of the space-time complexity of this binary tree structure and a comparison of its speedup and efficiency with those of the structures reviewed in Chapter II. It was shown that, for any nxn dense matrix multiplication, the binary tree structures algorithm provides higher speedup and efficiency than the other algorithms. Specifically, it was shown that any nxn dense matrix multiplication can be performed in [(2n + log(n))] time steps by using n(2n - 1) processing elements. From this, we can conclude that the binary tree structures use fewer processing elements than Kung's [2-4] algorithm and the same number of processing elements as Hwang's [5] algorithm, and that the binary tree structure provides a faster solution for any nxn dense matrix multiplication than either of the algorithms mentioned above. Note that a processing element used in the binary tree structures is either a multiplier or an adder with some additional registers, whereas a processing element for the other structures contains both a multiplier and an adder, with some additional registers. Furthermore, each PE in the tree structure has only one register, as compared to three registers for the others. The reason for this is that each PE requires only three input/output ports in the binary tree structure, as compared to six input/output ports. All of these advantages yield a smaller chip area and fewer input/output connection lines per PE for the binary tree structures. Some other important advantages of tree structures were also indicated. For example, in binary tree structures there is no need for data skewing for matrix multiplication, whereas for the other structures it is necessary to skew the data before input, which introduces additional computation time.
For matrix multiplication, once the matrix B is loaded into the RB registers, matrix A is broadcast to all column vector processing units at the same time, one row at a time, which simplifies the data accessing problem. As the matrix A is pipelined through the binary tree, all the processing elements are kept busy and utilized to perform the actual calculations, instead of simply passing data from one PE to another as in the case of Kung's [2-4] and Hwang's [5] algorithms. The algorithm introduced in this thesis utilizes the inherent characteristics of matrix multiplication, and optimally matches this algorithm to the binary tree structure. In Chapter IV, special-purpose devices that can be used in digital signal and image processing to perform convolution and discrete Fourier transform were proposed and discussed. Specifically, the one-dimensional convolution operation and its binary tree structure chip-level and system-level architectures were illustrated. The applicability of the binary tree structure to many other digital signal and image processing applications, which can be represented as matrix-matrix multiplication, is evident. The use of these binary tree structures can substantially reduce the amount of computation required for the solution of many problems in digital signal and image processing. This binary tree structure is intended to be used in conjunction with a conventional computer. It can be used to implement a complete digital signal or image processing device, or can be used as part of another device. And it can easily be extended to any size because of its modularity. Advances in the design and fabrication of VLSI circuits will soon make it feasible to construct special-purpose devices, such as the binary tree structures, which are excellent candidates for VLSI technology, with their simple, regular, and short communication geometry.
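The convolution devices of Chapter IV rest on expressing 1-D convolution as a matrix-vector product, which the tree multiplier then evaluates. As a small illustration (my own sketch; the thesis' exact matrix layout may differ), the Python below builds the Toeplitz matrix H from a kernel h so that H applied to x equals the full linear convolution of h and x.

```python
def convolution_matrix(h, n):
    """Toeplitz matrix H of shape (len(h)+n-1, n) with H[i][j] = h[i-j],
    so that H @ x equals the full 1-D linear convolution of h and x."""
    m = len(h) + n - 1
    return [[h[i - j] if 0 <= i - j < len(h) else 0 for j in range(n)]
            for i in range(m)]

def matvec(H, x):
    """Each row is one inner product -- the job of one tree column unit."""
    return [sum(r * v for r, v in zip(row, x)) for row in H]

h = [1, 2, 3]          # filter kernel
x = [4, 5, 6]          # input signal
H = convolution_matrix(h, len(x))
print(matvec(H, x))    # [4, 13, 28, 27, 18]
```

Each output sample is one inner product, so the same leaf-multiply, tree-add structure used for matrix multiplication computes it directly.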
They can be used to obtain very high throughput rates for problems which are not amenable to solution even with state-of-the-art conventional computers. Their geometry, flexibility, and cost effectiveness will make them very valuable tools for many important tasks.

5.2 Further Research

One area that should be investigated in the future concerns the application of the binary tree structures introduced here to other signal and image processing algorithms. Using these structures, it may be possible to construct complete digital signal and image processing devices with very high speeds. A second investigation that could be pursued is the development of single chips with multiple PE's, which could result in a substantial reduction in the time-space complexity of the overall system. And, finally, a further investigation that could be undertaken is the development of fast and large interleaved memories for the tree structures, and of the scheduling between them, because the communication line between the host computer and the attached special-purpose device can present a bottleneck.

REFERENCES

1. Haynes, L.S., "Highly Parallel Computing," IEEE Computer, Vol. 15, No. 1, Jan. 1982, pp. 7-80.

2. Kung, H.T. and Leiserson, C.E., Systolic Arrays (for VLSI), Technical Report, Carnegie-Mellon University, Department of Computer Science, Dec. 1978.

3. Kung, H.T., "Let's Design Algorithms for VLSI Systems," Proc. of Caltech Conf. on VLSI, Jan. 1979, pp. 65-90.

4. Kung, H.T., "The Structure of Parallel Algorithms," in Advances in Computers (Yovits, M.C., Editor), Academic Press, New York, 1979.

5. Hwang, K. and Cheng, Y.H., "VLSI Computing Structures for Solving Large Scale Linear Systems of Equations," Proc. of Int'l Conf. on Parallel Processing, Aug. 1980, pp. 217-230.

6. Swartzlander, E.E. et al., "Inner Product Computers," IEEE Trans. on Computers, Vol. 27, No. 1, Jan. 1978, pp. 21-31.

7. Despain, A.M.
and Patterson, D.A., "X-Tree: A Tree Structured Multiprocessor Computer Architecture," Proc. of Fifth Int'l Symp. on Computer Architecture, Apr. 1978, pp. 144-151.

8. Browning, S.A., The Tree Machine: A Highly Concurrent Computing Environment, Technical Report (Ph.D. Thesis), Computer Science, California Institute of Technology, Jan. 1980.

9. Budnik, P. and Kuck, D.J., "The Organization and Use of Parallel Memories," IEEE Trans. on Computers, Dec. 1971, pp. 1566-1569.

10. Kuck, D.J., The Structure of Computers and Computations, John Wiley & Sons, 1978.

11. Lee, R.B-L., "Empirical Results on the Speed, Efficiency, Redundancy, and Quality of Parallel Computations," Proc. of Int'l Conf. on Parallel Processing, 1980, pp. 91-100.

12. Agarwal, R.C. and Cooley, J.W., "New Algorithms for Digital Convolution," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 25, No. 5, Oct. 1977, pp. 392-410.

13. Cochran, W.T. et al., "What is the Fast Fourier Transform?" IEEE Trans. on Audio and Electroacoustics, Vol. 15, 1967, pp. 45-55.

14. Booth, A.D., "A Signed Binary Multiplication Algorithm," Quart. J. Mech. Appl. Math., Vol. 4, 1951, pp. 236-240.

15. Baugh, C.R. and Wooley, B.A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. on Computers, Vol. 22, Dec. 1973, pp. 1045-1047.

16. Hwang, K., Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.

17. Erdal, A.C. and Fisher, P.D., "Relaxed Pinout Constraints Yield Improved Parallel Adders for VLSI-Based Systems," IEEE Int'l Conf. on Circuits and Computers, Sept. 1982, pp. 142-146.