w 2‘ r d? H‘ 0' “”13. "i 'rln, k‘\ ‘97:)? . 1:? ”'31:, ' .3» my“; ‘3 * ‘3‘“: {It} a; ' » "I' [\‘I'J W!” '” '1‘?!” in 5'" " {1* {.1 'd w .g. “'0: "jl' ‘ i .IiNIII LL" l " .- ' "'1 ha. I, ‘ I'N- ‘ - 9\5 l mayhem" ‘ Mlefiaigaa Mate ‘ University l \ °V:'i';‘ , I —. II- This is to certify that the dissertation entitled A ’iirrt yd RT ,L m“ Pt'i'l~t+ if) TntFSSCK fit/K USE 1"- C‘ VLSI 3/l/C (Zyr'i'x) presented by ‘M J has been accepted towards fulfillment of the requirements for is r ‘ degreein Firclwfil b. 3%.qumj Ht: j? r a/V‘t/M/é/M 44% 8101' professor Date /Z"2“83 MSU is an Affirmative Action/Equal Opportunity Institution 0-12771 MSU LlBRARlES ——. V RETURNING MATERIALS: Place in book drop to remove this checkout from your record. FINES will be charged if book is returned after the date stamped below. A FLOATING-POINT INNER PRODUCT STEP PROCESSOR FOR USE IN A VLSI SYSTOLIC ARRAY By Yeong-Jeng Tseng A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Electrical Engineering and Systems Science 1983 ABSTRACT A FLOATING-POINT INNER PRODUCT STEP PROCESSOR FOR USE IN A VLSI SYSTOLIC ARRAY By Yeong-Jeng Tseng This thesis presents the design of a floating-point inner product step processor, suitable for VLSI implementation. The operands and the resultant are expressed using the IEEE-75A standard. A simulation of the floating-point inner product step processor is developed to obtain information on parameters of chip area and propagation delay time as 'functions of a minimum lithographic linewidth. An additional simulation is developed for band matrix triangulation,' utilizing concurrent Gaussian elimination, in which the processor is used as a basic building block in a VLSI systolic array. The systolic array simulation uses parameters of matrix bandwidth and lithographic linewidth to provide results indicating total area geometry and delay time requirements. To my parents ACKNOWLEDGEMENTS The author wishes to express his sincere appreciation to his major adviser, Dr. Michael A. Shanblatt. for his guidance and encouragement in the course of this research. He also wishes to thank Dr. D. K. Reihard and the committee members, Dr. P. D. Fisher and Dr. C. L. Hey, for giving valuable suggestions and comments in this work. Finally, the author owes a special thanks to Miss Echo Chang for .her emotional encouragement-and support. TABLE OF CONTENTS LIST OF TABLES ..................... .... ........ ... ........ .... LIST OF FIGURES . ................... ...... ....... .. ........ ..... I. INTRODUCTION . ..... . ....... . ....... . ............. . ....... l.l Problem Statement ..... ............................. l.2 Approach ........ ........... ........................ II BACKGROUND .......... .................. ......... ......... 2.I VLSI Systolic Array for Matrix Triangulation ..................... .......... 2.2 lEEE-75h Floating-Point Standard ................... 2.3 Floating-Point Arithmetic Operation ..... ........... WWUWW DESCRIPTION OF A FLOATING-POINT INNER PRODUCT STEP PROCESSOR .... ....... ..... ..... . ...... 3. Introduction ................ ..... ............ ..... . .2 Multiplication and Alignment ....................... .3 First Shifter ........................ ..... ......... .h Addition ........................................... .5 Second Shifter ..... ........... ..................... .6 Roundoff ............................ ..... .......... DESCRIPTION OF A FLOATING-POINT DIVIDER ...... ....... .... A.l Introduction ................................. ...... h.2 Divider ............................................ A.3 Shifter ....... ... ...... . ........ . ..... ............. h.h Roundoff ....... ....... . .......... .... .............. SIMULATION DEVELOPMENT ............... ................... 5.l Chip Area Computation ....... ............ ...... ..... 5.2 Propagation Delay Computation . ............. . ....... 5.3 Significant Module Data ........... ...... . .......... ll 12 I8 I8 20 2h 30 35 Al Al Al A3 55 1+7 AB '09 51 VI. RESULTS AND CONCLUSIONS ...... i ...... . ..................... 6.l Circuit Simulation Results ......................... 6.2 Conclusions ........................................ REFERENCES 'Page 53 53 59 6l Table 5.] 6.l 6.2 LIST OF TABLES Port-coefficient time table. ... ....... . ..... Significant module data. ............ Simulation area and time result for FLPMAC design with different options. Comparison of area and time results of different arrays. ..... vii 55 58 LIST OF FIGURE Figure ' Page 2.l The hex-connected processor array for pipelining the L-U decomposition of a band matrix. ................ ....... ......... ..... . 5 2.2 The L-U decomposition of a band matrix. ........ . ....... 6 2.3a Augmented matrix {Alb}. ................................ 7 2.3b Augmented upper triangular matrix {Uid}. ............... 7 2.A Computing array structure for matrix of arbitrary dimension with 8-3. . ......... ...... ....... 8 2.5 Basic-single format. ..... ...... i. ....................... II 2.6 Preliminary register. .................................. l2 2.7 Flow chart of the execution sequence for normalized FLP addition/subtraction. ............. .. l5 2.8 Flow chart of the execution sequence for normalized FLP mutliplication. ... .................. l6 2.5 Flow chart of the execution sequence for normalized FLP division. ........... ........ .. ...... l7 3.l Block diagram of the FLPMAC. ......................... .. 19 3.2 Block diagram of the MA. ........... ..... ............... 21 3.3 A 5-by-5 sign-magnitude Braun array multiplier. ........ 23 3.h A AB-bit serial shifter. ..... ..... ... ..... . ....... ..... 25 3.5 A h-by-h parallel shifter. .......... ................... 27 viii 3.6a 3.6:. '3.7 3.3 3.9 3.10 3.II 3.I2 3.13 3.lA 3.15 m. h.2 h.3 A.A 6.l 6.2 A NOR form Z-to-A decoder. ........... A NAND form 2-to-A decoder. Block diagram of An 8-bit leading A AB-bit leading Block diagram of A right shifter. Block diagram of Block diagram of An 8-bit increment by one circuit (IBO). .......... Block diagram of the FSe 0.000.000.000 zero detector. .......... .............. zero detector. ........................ the AD. .000... ......... 0 ........ the $5. ........................... the rounding circuit. the overflow circuit. Block diagram of the FLP divider. ...................... An n-by-n convergence division algorithm based divider for mantissa division. Block diagram of the Divider. ........... ......... Block diagram of the Shifter. ........ 000......0.... Entire chip edge size versus matrix bandwidth for 0.8. 0.5 and 0.2 micron linewidth. .......... Entire chip propagation delay versus matrix bandwidth for 0.8, 0.5 and 0.2 micron linewidth. 28 28 29 3] 32 33 35 36 38 39 A0 A2 Ah “5 A6 56 57 CHAPTER I INTRODUCTION VLSI (Mery Large Scale integration) systolic computing array processors have provided a new frontier of research for improving the performance of systems requiring rapid matrix manipulations. Specifically, in the triangulation of the linear equations, A'g-b, systolic arrays with their modularity and regularity allow for straight forward circuit design, testing and implementation. Systolic array processors utilize both pipeline and parallel processing concepts and can execute the many inner product step operations, the kernel of matrix triangulation. much faster than conventional serial architectures. The benefit of VLSI technology is that these computing structures can be fabricated on a single chip, or perhaps in a modular fashion on a few chips, and then be attached to a host computer as a peripheral device capable of rapidly producing solution vectors of the linear equation system. Previous design simulations of inner product step processing elements (PE's) have been constrained to a fixed-point (FXP) number system [l,2,3]. But. in order for these designs to have more universal applicability, floating-point (FLP) PE's must be considered. The use of FLP PE's has been delayed due to its complexity, in both time and space when compared to FXP designs. Additionally, circuit designers have been 2 reluctant to adopt a nonstandardized FLP format. In 1979. the IEEE proposed a FLP standard [A,5] for microcomputer and minicomputer architectures which is gaining wide industry acceptance [6,7]. l.l Problem Statement The purpose of this thesis is to design and simulate a FLP inner product step PE using the IEEE standard as a basis. This will lead to an assessment of the time and space complexity of such an element. A simulation of the design will be developed to quantify the PE's required chip area and its modurar delay time. Then, the developed FLP inner product algorithm, with additional required circuit elements, will be used in a simulation of an overall systolic structure for band matrix triangulation. The overall structure will then be quantified with respect to maximum matrix bandwidth per chip and problem throughput. These results will provide further understanding into the potential. promise and applicability of the VLSI systolic array for matrix triangulation. l.2 Approach To begin the design of a FLP PE using the IEEE standard, some initial design problems must be considered. l. The precision problem: A tradeoff .between hardware cost and 3 computational precision must be examined with respect to the required precision of the application. 2. Overflow and underflow: In the IEEE standard, both overflow and underflow have unique forms. Our design must sense these singularities and correctly modify the result so that computation can continue with a minimum loss of accuracy. There are several different approaches to the design of a FLP PE, but only a few are suitable for VLSI implementation. These must incorporate the necessary ingredients for good VLSI design, namely, maximum parallelism and pipelining, design regularity and the use of only local communications. The research approach here will include three steps. The first step is to design the optimal processor by examining specific design tradeoffs. This includes comparing such items as shift register candidates (serial .versus parallel) and methods for rounding of the computed results (rounding to nearest versus truncation). The second step is to quantify, via.simulation, possible choices so as to compare delay times and geometric areas. The third step is to incorporate the optimal inner product step PE design in a previously developed systolic algorithm and simulate the array with various parameters of matrix bandwidth and lithographic linewidth. The results of this final circuit simulation will provide information of the maximum matrix bandwidth that can be solved on a single chip and its associated solution time. CHAPTER II BACKGROUND 2.l VLSI Systolic Arrays for Matrix Triangulation A VLSI algorithm implementing Gaussian elimination for L-U decomposition was proposed by Mead and Conway [8]. The algorithm, illustrated in Figure 2.1, can triangulate an NxN band matrix A, with bandwidth 8, into upper triangular form 9 and lower triangular form L (Figure 2.2). The bandwidth of a matrix is defined as B-(p+q-2)/2, where p is the column number of the first row's rightmost nonzero element, and q is the row number of the first colomn's bottommost nonzero element. To triangulate the matrix will take 3xN + min(p,q) time units and pxq processors. Reducing time or processors can be done either by minimizing the bandwidth of the matrix using the methods described in [9,10,1l], or by revising the algorithm. An early version of a revised algorithm was proposed by Kung and Leiserson [12]. Their algorithm, composed of simple inner-product step and division function processors, is used to carry out L-U decomposition on a full matrix {A}. A further revised version of that algorithm was proposed by Hwang and Cheng [13]. This version differs from the early one in interconnection structure, latch operation and I/O requirements. A \Wlmfl I.II.I vaiii b4 The hex-connected processor array for pipelining the L-U decomposition of a band matrix [8]. Figure 2.1 ‘11 ‘12 ‘13 an 0 1 “11 “12 “13 “1a . ° ‘21 ‘22 ‘23 ‘21» ‘25 121 1 ° “22 “23 “2a “25 ‘31 I‘32 ‘33 ‘3“ ‘35 = 131 132 1 ' “33 “3a “35 ‘M ‘uz ‘na . 1&1 1&2 1A3 1 ° a52 ‘53 152 153 '.. O O L. .3 I. - I— " A L D Figure 2.2 The L-U decomposition of a band matrix. Moreover, it is designed to perform L-U decomposition on an entire (augmented) linear system of equations, {Alb}. Neither of these two versions considered any design specifics such as I/O interface details or actual processor layout details. ' A more recent improvement to these original versions was developed in [3]. By using an isolated row of PE's, the improved algorithm can triangulate an arbitrarily large augmented band matrix {A'b} (Figure 2.3a) to an upper triangular form {ggg} (Figure 2.3b). Also, in [3], circuits for I/O interface and processor layout were presented and analyzed with respect to minimum delay time and chip area requirements. In this version, Figure 2.A, inner-product cells, called MAC's (Multiply and Add Cells) perform the functions w-xy+z, x-x and y-y. The II a12 a13 31h a22 a23 32h a25 a32 a33 33:. a35 3A2 ”A3 ‘uu a‘52 a53 aN(N-3) aN(N-2) aN(N-I) a Figure 2.3a Augmented matrix {AID}. ”12 ”13 “1h u22 u23 ”2A u25 ”33 “3:. "35 ”an “as “55 Figure 2.3b Augmented upper triangular matrix {DID}. =6 f E f DC DC DC g 01 AMA C IWAC 02 NH“: MEI IWAC °3 DMAC INA Nun: on *——0 0 d r ----- ----°---- ------ 3; -—D section Figure 2.h Computing array structure for matrix of arbitrary dimension with 8-3 [3]. 9 complementation circle shown on the topmost row of MAC's signifies a two's complement operation. The division function is implemented by the DC (inision gell), where g-e/f and f-f. The thick black lines between each row of cells represent latch arrays. These latches provide the synchronization for operands being pumped between rows of cells. Matrix elements of A enter the processor via input ports '1-|7 shown in Figure 2.A. The number of input ports is given by the full breadth of the matrix, 28+]. Matrix elements of g are pumped out of the processor via output ports 01-0“ shown in Figure 2.A. The number of output ports is given by the bandwidth plus one for the diagonal element, B+l. The A section, shown on the bottom of Figure 2.A, resolves the right hand side vector A into vector g. Elements of Q enter into the 9 section via Id and are pumped out via 0d. An l/O port - coefficient timing table for triangulating the matrix {Alp}, for B-3, is given in Table 2.1. In Table 2.1 it is seen that the time required to obtain {gig} is 2N+3, or in general 2N+B. Thus, this algorithm is classified as an O(n) algorithm. Previous versions, also O(n) algorithms, have required more PE's and thus more chip area. In all previously published versions of this array, PE's were designed to handle only intergers, or, at best, the mantissas of FLP numbers with identically adjusted exponents. This limited the function and applicability of the algorithm. To alleviate this limitation, we need a PE capable of handling general FLP operands. t2N+6 31h 25 836 Table 2.1 12 I3 I“ all 1 a12 a22 1 a'13 “23 a33 1 32h 33h ahh 1 a35 ans a55 1 ans IO Port-coefficient time table [3]. 13 ”2A 35 “1h 25 II 2.2 IEEE-75A Floating-Point Standard The IEEE-75A floating-point standard has been proposed for mini and microcomputer use [A,5]. This standard defines four FLP formats in two groups, basic and extended, each having two possible operand widths, single and double. In this thesis, only the basic-single format is chosen and used throughout the entire design. The basic-single format for a binary FLP number X is shown in Figure 2.5. Isl E l M 01 89 31 Figure 2.5 Basic-single format [5]. Using an 8-bit biased exponent, a 23-bit implicit mantissa and a l-bit sign digit, the format can express values v of x as follows: I. If e-255 and ffO then v-NaN (Act g Aumber). 2. If e-255 and f-O then v-(-l)s. 3. If O e2 '(°2'°1) m3 -I m‘ x 2 i m2 for e1 < e2 [ml 1 m2 for e1 - e2 e1 for e1 > e2 e3 - e2 for e1 < e2 0 for e . e (Z-A) IA Since we add or subtract only two numbers at a time, the resultant's mantissa is always bounded in the range 0 5 :m: < A (2'5) which violates the mantissa definition in the IEEE-75h standard and has to be normalized. When 25lm}