EVALUATION OF A SINGLE VLSI CHIP ALGORITHM FOR
TRIANGULATING LARGE BAND FORM MATRICES

By

Wen Chang Hsu

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical Engineering and Systems Science

1982

ABSTRACT

EVALUATION OF A SINGLE VLSI CHIP ALGORITHM FOR
TRIANGULATING LARGE BAND FORM MATRICES

By

Wen Chang Hsu

One of the efficient procedures for solving a large, sparse system of linear equations, A · x = b, as encountered in a variety of common engineering problems, is fast, band matrix triangulation utilizing concurrent Gaussian elimination. This algorithm may be implemented using systolic computing cells configured into a simple, regularly connected processing array. With the most recent advances in microelectronics technology, it will soon be possible to fabricate this computing array on a single chip.

A computing structure, utilizing both parallel and pipeline concepts to triangulate an augmented band form coefficient matrix {A : b} in O(N) time steps, where N is the order of the matrix, is presented. This new design utilizes only O(B²) processing cells, where B is the matrix bandwidth. This compares favorably to previous algorithms, which require O(N²) cells. Hardware arithmetic algorithms of the MAC and DC, the required processing cells of the computing structure, are designed and evaluated.

Three I/O circuits are presented and compared with respect to area-time trade-offs. An optimal input strategy, the data controlled algorithm, and an optimal output strategy, the SCS algorithm, are chosen.

A transistor level circuit simulation, based on parameters of word size, matrix bandwidth and minimum lithographic linewidth, provided area geometry and delay time estimates for the I/O and computing structures. The ultimate goal of this simulation was to provide two important comparisons: total propagation delay versus matrix bandwidth and chip size versus matrix bandwidth. For a given matrix bandwidth, the optimal chip size and throughput speed based on current and projected lithographic linewidths were evaluated. Bottlenecking of the I/O operands was not observed for word lengths ≤ 16 bits and small matrix bandwidths. These results, in turn, lead to an assessment of the feasibility and advantages of this class of special purpose VLSI computing structures.

To my parents
Mr. and Mrs. Chiung-Yung Hsu

ACKNOWLEDGEMENTS

The author wishes to express his sincere appreciation to his major advisor, Dr. Michael A. Shanblatt, for his guidance and encouragement in the course of this research. He also wishes to thank the committee members, Dr. P. D. Fisher, Dr. R. G. Reynolds, Dr. E. D. Goodman and Dr. S. R. Crouch, for their valuable suggestions and comments on this research. Finally, the author owes a special thank you to his wife, Huey-Fen, for her confidence and emotional support that only a wife can give.

Work reported here was supported in part by NSF under Grant ECS-8106675.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
I. INTRODUCTION
   1.1 Statement of Problem
   1.2 Approach
   1.3 Contributions
II. BACKGROUND
   2.1 Array Processors and the Vector Nature of Gaussian Elimination
   2.2 VLSI Architectures and Systolic Algorithms
   2.3 Design Methodology of VLSI Systems
III. COMPUTING STRUCTURE DESIGN
   3.1 Band Form Algorithm
   3.2 Array Structure and Timing
   3.3 Cell, Port and Time Slot Counts
   3.4 Floating-Point Considerations
   3.5 Function Block Design
IV. INPUT/OUTPUT CIRCUIT DESIGNS
   4.1 SCS Input/Output Circuit
   4.2 Binary Tree Structure
   4.3 Data Controlled Input/Output Circuit
   4.4 Discussion
V. SIMULATION DEVELOPMENT
   5.1 Chip Area Computation
   5.2 Propagation Delay Computation
   5.3 Significant Module Data
VI. SIMULATION RESULTS AND EVALUATION
   6.1 Comparison of I/O Strategies
      6.1.1 INPORT Strategy Selection
      6.1.2 OUTPORT Strategy Selection
   6.2 Entire Chip Simulation Results
VII. CONCLUSIONS
   7.1 Summary
   7.2 Future Research
BIBLIOGRAPHY

LIST OF TABLES

3.1  Port-coefficient timing table.
5.1  Significant module data.
6.1a Total chip area and time results of algorithm #3 (INPORT) and #1 (OUTPORT) for 8 bits per word at λ = 0.8 μm.
6.1b Total chip area and time results of algorithm #3 (INPORT) and #2 (OUTPORT) for 8 bits per word at λ = 0.8 μm.
6.2a Total chip area and time results of algorithm #3 (INPORT) and #1 (OUTPORT) for 16 bits per word at λ = 0.8 μm.
6.2b Total chip area and time results of algorithm #3 (INPORT) and #2 (OUTPORT) for 16 bits per word at λ = 0.8 μm.
6.3  Simulation time results for 8 and 16 bit words at λ = 0.8 microns.

LIST OF FIGURES

2.1  Vector nature of a full matrix row-order elimination step.
3.1a Augmented matrix {A : b}.
3.1b Upper triangular elements of matrix {A : b}.
3.2  Large band form matrix.
3.3  Computing array structure for matrix of arbitrary dimension with B = 3.
3.4  Logic circuit diagram of a 5-by-5 Baugh-Wooley based MAC.
3.5  Logic gate diagram of a one-bit full adder.
3.6  A 16-by-16 convergence division algorithm based DC.
3.7  D-type flip-flop.
3.8  Dynamic latch controlled by two-phase nonoverlapping clock.
4.1  SCS input circuit diagram.
4.2  SCS output circuit diagram.
4.3  Input and output circuit control signal timing diagram.
4.4  Binary tree input structure.
4.5  Binary tree output structure.
4.6  Data controlled input circuit diagram.
4.7  Data controlled output circuit diagram.
6.1  INPORT edge size versus matrix bandwidth for 16 bits per word at λ = 0.8 μm.
6.2  INPORT delay time versus matrix bandwidth for 16 bits per word at λ = 0.8 μm.
6.3  OUTPORT edge size versus matrix bandwidth for 16 bits per word at λ = 0.8 μm.
6.4  OUTPORT delay time versus matrix bandwidth for 16 bits per word at λ = 0.8 μm.
6.5  Entire chip edge size versus matrix bandwidth for 8, 16 and 24 bits per word at λ = 0.8 μm.
6.6  Entire chip propagation delay versus matrix bandwidth for 8, 16 and 24 bits per word at λ = 0.8 μm.
6.7  Entire chip edge size versus matrix bandwidth for 8, 16 and 24 bits per word at λ = 0.5 μm.
6.8  Entire chip propagation delay versus matrix bandwidth for 8, 16 and 24 bits per word at λ = 0.5 μm.

CHAPTER I

INTRODUCTION

High speed VLSI (Very Large Scale Integration) computing arrays, based on concurrent Gaussian elimination, for triangulating the linear equation system A · x = b have become a topic of recent interest. By combining parallel and pipeline concepts, these "systolic arrays" [1] can rapidly execute the numerous inner-product step operations which serial computers must tediously operate upon sequentially. It is projected that these high performance algorithms can be implemented directly using low cost, high speed VLSI circuit technology, either on a single chip or perhaps in a modular fashion on a few chips. These devices can then be attached to either dedicated or general purpose host computers as plug-in components capable of producing fast solution vectors for large scale systems of equations. This concept is similar to the currently available attached peripheral array processors and to those devices which perform dedicated fast Fourier transforms exclusively in hardware.

The work described herein advances this topic by the design of a computing algorithm to triangulate systems of arbitrary size which have been permuted, a priori, to minimum band form. The algorithm is applicable to any structurally symmetric, diagonally dominant coefficient matrix. This permits use of the algorithm on many naturally occurring band form matrices and, more importantly, on the highly sparse, diagonally dominant matrices arising from a multitude of engineering and scientific problem formulations. The salient feature of the algorithm is that the key restriction limiting network size is on the reduced matrix bandwidth, not on the original network dimension. Moreover, several good heuristic bandwidth reduction algorithms, based on graph theoretic concepts, already exist and are quite easy to implement [2, 3]. It is assumed that the permutation of A to minimum band form involves full row and column pivoting so as to ensure the retention of the necessary property of diagonal dominance.

The processing elements of the computing structure were originally designed to operate with fixed-point binary numbers. But the nature of this structure also allows these PE's to work with the mantissas of floating-point input operands, provided that all exponent values have been equalized and stored in the host processor. More details of this idea will be presented in Chapter III.

In a single chip system, operand input and output may present a bottleneck due to some very practical packaging considerations. Therefore, high speed serial-to-parallel input and parallel-to-serial output circuits must be included in the design so as to minimize any potential bottleneck and thus retain optimal system throughput. Three different I/O strategies are presented in Chapter IV. Comparisons are made in terms of area and propagation delay requirements.

The final quantifications of area geometry and propagation delay for the total chip, composed of the computing structure with the optimal input/output strategies, are examined via a chip model simulation in terms of matrix bandwidth, word size and lithographic linewidth.
These simulation results are intended to provide a realistic estimation of the feasibility, with respect to problem size and throughput, of this type of systolic array numerical algorithm fabricated on a single VLSI chip. This, in turn, will shed light on how these ideas may benefit the numerous scientific and engineering problems for which enhanced throughput in the rapid solution of large scale simultaneous equations is greatly desired.

1.1 Statement of Problem

VLSI based dedicated computing structures offer great potential for optimizing problem throughput and hardware investment. Yet many aspects of the design and implementation of these ideas still remain incomplete or underdeveloped. Such problems include the inner details of various processing elements, packaging constraints and input/output considerations.

The goal of this work is to develop and assess aspects of a special purpose VLSI computing architecture to triangulate large scale, band form, linear equation systems. Specific areas to be examined are as follows:

1. The design and evaluation of a computing array structure for this problem type is presented. The global dimensions of this structure are based on a function of coefficient matrix bandwidth and on the assumption of an implementation of the row-ordered permutation of Gaussian elimination resulting in a strict upper triangulation of the coefficient matrix. Coefficient input and output patterns are presented.

2. Individual processing elements are designed and evaluated. This includes exploration of various fixed-point multiplier, adder and divider algorithms. Selection of the best algorithms is based on comparisons of speed, area, modularity and local communication properties.

3. Possible input demultiplexer and output multiplexer circuit designs are evaluated. Parameters which are examined include area geometry, propagation delay and a consideration of the effects of the fundamental limitation of pinouts per chip. I/O designs are also evaluated with respect to potential bottlenecking of operands, which can severely degrade the overall system throughput.

4. Realistic estimation of the total propagation delay versus matrix bandwidth and chip size versus matrix bandwidth of the entire chip is made by incorporating the optimal computing structure and I/O circuit designs.

The results of this work, particularly tasks 1 and 4, enable a more comprehensive assessment of VLSI's potential for reducing the solution time of large linear equation system problems.

1.2 Approach

In the past few years the advantages of systolic, hex-connected array structures, realizable with VLSI technology, were introduced [12, 13]. These structures have been shown to be applicable to many matrix computation problems including Gaussian elimination [16, 17].

The first step of this work is the development of a computing array to perform a complete factorization (not merely L-U decomposition) of an N x (N + 1) matrix (A augmented by b) in minimum unit time. In other words, the objective is to create a set of possible input patterns from the nonzero entries of a matrix, most suitably in band form, and implement a computing structure in a simple hexagonal array. The plan is that these two structures, the input pattern and the computing array, may be manipulated back and forth until their subsequent output pattern fits the required upper triangular pattern. The output pattern should be that which is most conducive to rapid back-substitution.
The second step entails study of the input/output problems for the main body of the chip, the developed computing array. The number of pinouts per chip of current and projected VLSI technology gives primary guidance on how control signals, fan-in demultiplexer and fan-out multiplexer circuits must be developed. In order to avoid a potentially serious I/O bottleneck problem, the major point of these input and output circuit designs is to provide high speed channels for operand routing. Also, design regularity and low power requirements, necessary for good VLSI implementation, are considered.

The third step develops some of the inner details of the processing element structures. In order to fully assess the area geometry and propagation delay of these processing elements, schematic logic circuit diagrams of various arithmetic algorithms are studied. For example, the method for designing a MAC (Multiply-Add Cell) encompasses the following:

1. Review existing LSI multiplication algorithms and evaluate the optimal hardware implementation in terms of design complexity, local connection properties, modularity, speed and area. Also, study a possible attached high-speed full adder using identical procedures.

2. Develop a network realization of the chosen optimal fixed-point arithmetic algorithm for a complete MAC. This step will lead to quantification of the necessary performance measurements of area geometry and propagation delay.

3. Develop a hand layout of the fundamental building block modules, such as a full-adder, in terms of the scalable minimum lithographic linewidth, λ. Using these design layouts, quantify Ac (area per cell) in terms of λ, and Tc (propagation delay per cell). Tc will be determined in terms of the typical basic inverter discharge time, also a function of λ, and any non-negligible communication path delays.

The fourth step is the development and evaluation of the required building block modules for several possible I/O circuit strategies, followed by quantification of the area geometries and propagation delays of these modules using the methodology of step 3.

The fifth and final step is a Fortran-coded simulation to provide two important comparisons, total propagation delay versus matrix bandwidth and chip size versus matrix bandwidth. These comparisons are determined by varying parameters of minimum lithographic linewidth, word size and matrix bandwidth. The results, combined with predictions of future trends in VLSI technology, will help evaluate, and hopefully promote, the feasibility and advantages of special purpose VLSI computing structures as linear equation solvers.

1.3 Contributions

This section provides readers with an additional understanding of the original research which has been accomplished in this dissertation work.

The successful design of a new VLSI systolic array algorithm based on concurrent Gaussian elimination is presented. This algorithm is unique in that it triangulates a band form matrix A augmented by vector b (in other words, {A : b}). The array requires O(N) unit time and O(B²) processing elements (where N and B are the dimension and half bandwidth of A, respectively). The new algorithm, requiring O(B²) processing elements, represents a dramatic improvement over previously published algorithms of this form, which have required O(N²) PE's. This is a tremendous saving, particularly when B << N.

The inner details of the processing element designs are developed and assessed.
This entails evaluating and configuring the required MAC's (Multiply and Add Cells) and DC's (Divider Cells). The best designs were obtained subject to criteria of design regularity, design complexity, chip area and propagation delay requirements.

The best input strategy, the data controlled algorithm, and the best output strategy, the shift-register control sequence (SCS) algorithm, were developed and evaluated. These strategies helped tremendously in minimizing a potential I/O bottleneck problem which was anticipated in systolic array structures of this type.

Determinations of the entire chip area geometry and total propagation delay were obtained using a comprehensive chip simulation. Results were based on various word size, minimum lithographic linewidth and matrix bandwidth parameters. The simulation results have indicated that this type of special purpose VLSI chip indeed provides rapid triangulation of large scale, band form linear equation systems.

CHAPTER II

BACKGROUND

2.1 Array Processors and the Vector Nature of Gaussian Elimination

Array processors are generally realized as one (vector) or two (matrix) dimensions of processing elements which allow data operands to be processed in parallel. In the literature [37], array processors have been defined as those structures for which the information unit (data string) is an array of one or two dimensions. This would define processors in which both serial and parallel execution would be implemented in a pipelined or even a parallel pipelined fashion. As used here, array processors are those systems where the processing elements have both parallel and pipelined structures. There are two basic classifications of array processors: first, the stand alone type, such as the CRAY-1 or the CDC STAR-100, and second, the scientific processors like the AP-120B/FPS-164 family or the MAP-300. The latter classification has been referred to as a special algorithm processor [37]. Array architectures are most advantageous for problem types which are vector or array oriented with highly repetitive arithmetic operations. The typical operations include vector addition and subtraction, matrix multiplication and convolution, and many advanced functions like the fast Fourier transform. Applications of array processors have been proven to be cost-effective in areas such as satellite image processing [4], power system network computation [5], reservoir simulation [6], and pattern recognition [7].

For a class of large scale, linear equation systems, A · x = b, the method of reducing solution time via array processors is to apply a vectorized form of Gaussian elimination, optimizing the use of these processors' parallel and pipelined structures in the triangulation procedure of the coefficient matrix. For example, consider an N x N matrix. A possible FORTRAN coded program segment to perform row-order Gaussian elimination is

      DO 1 K = 1, N
      DO 1 J = 1, K
      DO 1 I = J + 1, N
      IF (K.EQ.J) GO TO 2
      A(K,I) = ...

CHAPTER III

COMPUTING STRUCTURE DESIGN

3.1 Band Form Algorithm

A matrix A in band form can be described as the set

    A = {aij : aij = 0, ∀ |i - j| > B}                         (3-1)

where B, the bandwidth, is given by

    B = max{|i - j| : aij ≠ 0}                                 (3-2)

for i, j = 1, 2, ..., N. Note that structural symmetry only (not absolute symmetry) is implicit in this definition. Furthermore, B, as defined here, is sometimes referred to as the half-bandwidth, spanning from the matrix diagonal to the rightmost nonzero matrix element.
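For illustration, the bandwidth measurement of equation (3-2) can be stated as a short Fortran function. The routine below is only a sketch; the name IBAND, the dense array representation and the exact-zero test are assumptions made here for clarity and are not part of the hardware design.

      INTEGER FUNCTION IBAND(A, N)
C     HALF-BANDWIDTH OF EQUATION (3-2): THE LARGEST |I - J|
C     FOR WHICH A(I,J) IS NONZERO.
      INTEGER N, I, J, K
      REAL A(N,N)
      IBAND = 0
      DO 10 I = 1, N
      DO 10 J = 1, N
      K = IABS(I - J)
      IF (A(I,J) .NE. 0.0 .AND. K .GT. IBAND) IBAND = K
   10 CONTINUE
      RETURN
      END

For the 9 x 9 system of Figure 3.1a, such a scan returns B = 3.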
The entire equation system, A · x = b, can then be described by the pair (A, b), where A = {aij} is an N x N matrix in structurally symmetric band form of bandwidth B and b = (b1, b2, ..., bN)^T is a column vector of size N. Conducive to Gaussian elimination, the system can be described as an N x (N + 1) augmented matrix {A : b}. The elimination procedure will reduce {A : b} to upper triangular elements U = {uij} for i = (1, 2, ..., N), j = (i, i + 1, ..., i + B), augmented by a revised column vector d. This can be easily illustrated by considering a 9 x 10 augmented system with a bandwidth of 3, as depicted in Figure 3.1a. After the upper triangulation has been performed, the system appears as illustrated in Figure 3.1b.

Figure 3.1a Augmented matrix {A : b}.

Figure 3.1b Upper triangular elements of matrix {A : b}.

For larger systems, some more interesting aspects can be exposed. Consider a large A with B << N. An examination of the elimination of row k from such a system, as shown in Figure 3.2, reveals some useful information. The shaded areas of this figure represent the total number of matrix elements required to perform the elimination of row k. As depicted, the length, in matrix elements, of row k is 2B + 1. This will be called the working row. Neglecting the zero elements, then, the width of an augmented working row is merely 2B + 2 nonzero elements. Examining the required upper triangular elements needed to eliminate the augmented working row, a simple geometric analysis reveals that B² + B elements of A are required. In addition, again neglecting null entries, B elements of the d vector (resolved b vector elements) are required. Thus the total number of active nonzero elements involved in the elimination of augmented row k is given by

    E = B² + 4B + 2.                                           (3-3)

Figure 3.2 Large band form matrix.

These elements, i.e., the shaded areas of Figure 3.2, will be referred to as the "window" of elements required to eliminate any given row k. Now, focusing on the development of an ideal systolic computing array for such a band form system, a reasonable objective is to have the size of the array, in processing elements, related to this window of elimination. The window can then be viewed as "sliding down" the main diagonal, eliminating rows as it passes. This is analogous to the function of the systolic array; row k is eliminated while the new augmented working row k + 1 is brought in; simultaneously, the resolved elements just above the window are placed on the output lines. The advantage of such a scheme is simple and of great consequence: the size of the computing array, as will be shown, is limited by B, the bandwidth, in both width and depth.
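As a check of equation (3-3), the three contributions identified above may be summed directly:

    E = (B² + B) + (2B + 2) + B = B² + 4B + 2.

For the B = 3 system of Figure 3.1a this gives E = 12 + 8 + 3 = 23 active elements, a count which is independent of the matrix dimension N.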
3.2 Array Structure and Timing

Creation of the actual systolic array structure required manipulating a set of possible coefficient input patterns and arithmetic processing cells in order to produce properly resolved upper triangular elements at the output ports. Restrictions were based on structural simplicity, regular connectivity and an ideal upper bound of O(B²) processing cells. Actual input alignment, as will be seen, required the injection of dummy zero and one elements, used as "spacers" in the operand string. The output string, also regularly interlaced with spacers, contains the necessary resolved upper triangular factors and the resolved column vector d for subsequent back-substitution.

The systolic array operates synchronously, with latch arrays controlling operand flows between processing cells. There are no storage elements, save the latches, in the array. The most difficult part of the design was to correctly align the b vector element stream through the array. Many of the possible input patterns and cell permutations presented data interference and conflict problems. This was eventually overcome by isolating those processing cells which perform the actual operations of resolving b into d; this portion of the hardware is called the D section. It is, however, constructed of precisely the same cells as the rest of the systolic array, differing only in interconnection pathways.

The array structure for a matrix of arbitrary dimension, N, and of bandwidth B = 3 is presented in Figure 3.3. Two types of processing cells are used, corresponding to the two operations of the Gaussian elimination procedure. The first and majority cell is labeled MAC (Multiply and Add Cell). It has previously been called an inner product step processing element, and is a 3-input, 3-output cell calculating w = xy + z and performing the transfers x = x and y = y. The second cell performs the diagonal divide operation and is labeled DC (Division Cell). It is a 2-input, 2-output cell. The DC performs the division g = e/f and the transfer f = f. Both cell types are illustrated in Figure 3.3. The complementation circle shown on the input of the topmost row of MAC's refers to a two's complement operation. Inter-row registration, providing operand synchronization, is depicted by the thick black lines.

Figure 3.3 Computing array structure for matrix of arbitrary dimension with B = 3.

The array structure supports separate upward and downward traffic flows. Input coefficients are synchronously pumped into ports I1 through I7. The number of required input ports is given by the full breadth of the matrix band, or 2B + 1 (refer to Figure 3.2). Columns of A enter the ports every two cycles, interspersed with the proper spacer elements. Table 3.1, a timing table, illustrates the operand strings for the example B = 3 corresponding to Figure 3.3.

Table 3.1 Port-coefficient timing table.

Once this initial upward stream reaches the DC level, it proceeds downward toward the D section. At this point in time, t3 in the example, b vector operands begin to enter port ID of the D section. At the end of t3, b1 reaches the w output of the rightmost MAC in the D section and latches appropriately. During t5, b2 pumps in from ID and b1 is transferred to the bottom latch of the D section, waiting to enter input z of the second rightmost MAC. Once the first column of lower triangular factors reaches the y inputs of the D section, the bottommost latches are filled with the required bi elements so that the generation of the new right hand side vector elements, di, i = 2, 3, ..., N, can begin. Note that during the previous time step, d1 (which equals b1) appeared at port Od of the D section. Then, D section outputs appear at port Od every two time slots.
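The arithmetic performed by the two cell types can be stated compactly in software. The Fortran routines below are behavioral sketches written here for illustration; the cells themselves are combinational hardware, and the routine names are assumed.

      SUBROUTINE MAC(X, Y, Z, W)
C     INNER PRODUCT STEP CELL: W = X*Y + Z.  THE X AND Y
C     OPERANDS PASS THROUGH THE CELL UNCHANGED.
      REAL X, Y, Z, W
      W = X*Y + Z
      RETURN
      END

      SUBROUTINE DC(E, F, G)
C     DIVISION CELL: G = E/F.  THE DIVISOR F PASSES THROUGH.
      REAL E, F, G
      G = E/F
      RETURN
      END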
3.3 Cell, Port and Time Slot Counts

The salient property of this new algorithm is that the crucial counts of input-output ports and processing cells are found to be functionally related to the matrix bandwidth and not the full matrix dimension. This fact is quite important, as many systems of large dimension yield tightly banded matrices. Furthermore, many sparse coefficient matrices can be reordered into band form, which for some problem types has been shown to yield B = 0.10 N [8].

It has been determined that the numbers of input and output ports are 2B + 2 and B + 2, respectively. It must be remembered that each port is m bits wide, m being the number of bits per word. This introduces a very critical pin limitation problem which will be addressed in the input/output circuit design chapter.

The algorithm requires far fewer processing cells than previously published schemes, which could be classified as requiring O(N²) cells, N being the full matrix dimension [16]. Specifically, B(B + 1) MAC's and B DC's are required, advantageously classifying the algorithm as needing just O(B²) cells. Of course, this can only be applied to those sparse coefficient matrices for which bandwidth reduction is effective.

As far as time requirements are concerned, the new algorithm requires 2N + 2B time slots, thus classifying it as an O(N) algorithm. This is on the same order as other systolic array algorithms, which require many more processing cells for this type of problem [16].
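For illustration, these counts can be tabulated with a few lines of Fortran; the program below simply evaluates the stated formulas for an assumed example of B = 3 and N = 100.

      PROGRAM COUNTS
C     PORT, CELL AND TIME SLOT COUNTS OF SECTION 3.3
C     FOR BANDWIDTH B AND MATRIX DIMENSION N.
      INTEGER B, N, NIN, NOUT, NMAC, NDC, NSLOT
      B = 3
      N = 100
      NIN = 2*B + 2
      NOUT = B + 2
      NMAC = B*(B + 1)
      NDC = B
      NSLOT = 2*N + 2*B
      PRINT *, 'INPUT PORTS =', NIN, '  OUTPUT PORTS =', NOUT
      PRINT *, 'MAC CELLS   =', NMAC, '  DC CELLS     =', NDC
      PRINT *, 'TIME SLOTS  =', NSLOT
      END

For B = 3 and N = 100 this reports 8 input ports, 5 output ports, 12 MAC's, 3 DC's and 206 time slots.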
3.4 Floating-Point Considerations

The computing structure presented in Section 3.2 is basically constrained to a fixed-point binary number system due to the nature of its processing element designs. It is possible, however, to consider processing fixed-point numbers which are the pre-adjusted mantissas of previous floating-point coefficients. This possibility, of course, is a crude approach to a true floating-point solution, but, due to current limitations of chip size and lithographic linewidth, it deserves further investigation for, at best, a near term application. It is realized, and must not be understated, that this approach is extremely vulnerable to the dynamic range of the input matrix entries and will work only for those problems which possess well tempered coefficients of tight dynamic range. A possible candidate for this dedicated computing structure is the class of matrices with all positive and strictly diagonally dominant coefficients, such as the admittance matrix of a power system load flow problem. For a well conditioned matrix with these properties, no unpredictable overflow problem will occur during the elimination procedure if the following proposed fixed-point scheme is considered.

In fixed-point number systems, the two most commonly selected positions for the radix point are either at the left extreme or at the right extreme of the magnitude position of the number [20]. To allow the PE's of the computing array to operate in both number systems, the left extreme is a more logical choice. Here, the radix point lies between the sign bit and the most significant bit. Moreover, this dictates that all fixed-point numbers be strictly less than one. To adjust floating-point numbers to this fixed-point scheme, the host processor (or some intermediate processor, which will not be considered here) must first sort out the maximum operand. For the matrices under consideration, this element lies on the matrix diagonal. The mantissa value of this maximum operand is normalized and the exponent value is stored. The other operands in the matrix are then pre-adjusted in floating-point to the same exponent value as the maximum operand.

As it is assumed that the latches of the computing structure are reset before any operands are pumped in, it is obvious that the first significant arithmetic operation which could possibly offset the exponent is at the top row of processing elements, the DC's. From the example of Figure 3.1a it can be seen that during t4, the operands working in the leftmost DC are dividend a21 and divisor a11. If we express the pre-adjusted floating-point operands as

    a11 = M1 × 2^e                                             (3-4)

    a21 = M2 × 2^e                                             (3-5)

where e is the exponent value of the maximum operand, the quotient

    a21 / a11 = (M2 × 2^e) / (M1 × 2^e) = M2 / M1              (3-6)

has an exponent value of zero. This quotient becomes the multiplier for the next downstream MAC. The MAC algorithm, w = xy + z, restores the exponent value to e. Moreover, zero exponent quotients continue to be pumped out of the DC's every time slot and these factors propagate vertically down through the array. Consequently, the exponent values of the resolved upper triangular elements at the output ports are all equal to the exponent value of the original input elements. A computing structure with this potential can operate with only the adjusted mantissas of the input operands. The fixed exponent value can be retained by the host processor and used for post-adjustment upon return of the upper triangular elements prior to back-substitution.
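A minimal sketch of the host-side pre-adjustment, assuming the coefficients are held as (mantissa, exponent) pairs and that EMAX is the exponent of the maximum (diagonal) operand, is given below; the routine name and data layout are illustrative assumptions, not the dissertation's host software.

      SUBROUTINE PREADJ(XM, XE, N, EMAX)
C     PRE-ADJUST N OPERANDS TO THE COMMON EXPONENT EMAX.
C     EACH MANTISSA XM(I) IS SCALED DOWN BY ITS EXPONENT
C     DEFICIT SO THAT XM(I) * 2**EMAX PRESERVES ITS VALUE.
      INTEGER N, EMAX, XE(N), I
      REAL XM(N)
      DO 10 I = 1, N
      XM(I) = XM(I) / 2.0**(EMAX - XE(I))
      XE(I) = EMAX
   10 CONTINUE
      RETURN
      END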
3.5 Function Block Design

Examination of the computing array depicted in Figure 3.3 reveals that only three schematic logic circuit diagrams of function blocks (MAC, DC and latch) are needed to estimate the area geometries and propagation delays. Again, the highly regular, localized wire routing among these building blocks will reduce the design complexity, delay time and layout area of the computing structure.

Many addition, multiplication and division algorithms can be called computationally efficient, but, for exclusive hardware implementation, especially in this class of data oriented VLSI chip, the final algorithm selection criteria must also be judged in consideration of the design task. It is most desirable to choose MAC and DC algorithms based on existing arithmetic algorithms that can be implemented in simple and regularly structured combinational circuits. For circuit and logic simulation, design verification and test validation, combinational circuitry is preferable to sequential circuitry [19]. Another important criterion is that MAC and DC algorithms which are best realized in locally connected array networks of unique, simple and regular functional cells will considerably ease the design of the building blocks themselves. This type of network has several other advantages. First, it can provide increased throughput due to the parallel and pipelined processing structure of the algorithm. Second, function cells which only communicate with nearest neighbors are very advantageous in that they minimize interconnection requirements. With these criteria in mind, the MAC and DC algorithms were designed as follows.

A. MAC Algorithm - The high speed cellular array multipliers such as the Wallace tree [21], the Pezaris array [22], and the Baugh-Wooley array [23] were evaluated based on the design criteria discussed above. The Baugh-Wooley algorithm, with its fast multiplication speed (only 2n full-adder delay times for an n x n multiplication, where n is the word size) and regular full adder structure [20], was determined to be the best among these three. Furthermore, this algorithm allowed for straightforward extension of the design to the desired MAC function of multiplication followed by addition, by placing an extra row of full-adders at the bottom edge of the Baugh-Wooley multiplier. Thus the MAC function is obtained in only one more full-adder delay time. The MAC array is shown in Figure 3.4. Note that the portion within the dotted line is the original Baugh-Wooley multiplier.

Figure 3.4 Logic circuit diagram of a 5-by-5 Baugh-Wooley based MAC.

For design layout, consider the following equations of the one-bit full-adder function for a standard gate-level implementation:

    Si = Ai ⊕ Bi ⊕ Ci                                          (3-7)

    Ci+1 = AiBi + BiCi + AiCi                                  (3-8)

where Ai and Bi are inputs to the current stage and Ci is the carry from the previous stage. The wire-logic gate implementation of this full-adder is shown in Figure 3.5. This design, with only a two gate level delay, provides a minimal propagation delay.

Figure 3.5 Logic gate diagram of a one-bit full adder.
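The full-adder equations (3-7) and (3-8) are easily modeled for verification. The following Fortran sketch is an illustrative software model of the combinational cell, not the circuit itself:

      SUBROUTINE FADD(A, B, C, S, CO)
C     ONE-BIT FULL ADDER OF EQUATIONS (3-7) AND (3-8):
C     SUM   S  = A XOR B XOR C
C     CARRY CO = AB + BC + AC
      LOGICAL A, B, C, S, CO
      S = A .NEQV. B .NEQV. C
      CO = (A .AND. B) .OR. (B .AND. C) .OR. (A .AND. C)
      RETURN
      END

Chaining rows of such cells, as in Figure 3.4, realizes the complete w = xy + z MAC function in 2n + 1 full-adder delay times.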
B. DC Algorithm - Several existing high speed division algorithms are considered feasible for VLSI implementation. These include the Cappa-Hamacher array divider [20, 24, 25] and two types of multiplicative division schemes, the convergence division algorithm and the divisor reciprocation division algorithm [20]. In a pipelined structure, the total propagation delay of the system is a function of the maximum delay of the longest segment. However, the Cappa-Hamacher divider algorithm's computational time, in terms of word size, is at least two times that of the Baugh-Wooley algorithm [20]. Consistency of computational speeds between the MAC and DC algorithms is crucial. Furthermore, it is ideal to strive to make the MAC propagation delay the limiting factor of the clock rate; it will most likely be the fastest element. Limiting the clock rate by a much slower element would degrade the computational efficiency of the entire chip. Therefore, it is desirable to set the clock rate as the reciprocal of the sum of the MAC time plus a latch time and find a DC algorithm within this bound. An alternative algorithm which partitions the complex division function into several subtasks, serving as segments of a pipelined structure, is the prime candidate.

Both the convergence division and divisor reciprocation division algorithms can be implemented in segmented pipeline structures. They both approach the final result of quotient or divisor reciprocal quadratically. Through detailed study of these two algorithms, the hardware requirements are found to be approximately equivalent. But there is a key advantage to the convergence division algorithm; that is, the number of iteration steps can be determined a priori in terms of word size. This factor is very important in an exclusive hardware implementation because no software convergence checking is required.

The convergence division algorithm operates by multiplying both the denominator and numerator by the same sequence of convergence factors until the denominator approaches unity. To illustrate this, consider the division operation

    Q = N / D.                                                 (3-9)

The right hand side can be expressed as

    Q = (N × R0 × R1 × ··· × Rn) / (D × R0 × R1 × ··· × Rn)    (3-10)

where the Ri's are a set of convergence coefficient factors whose goal is to force the denominator to unity and thus produce the quotient. If the resolved denominator comes out to be one, that is,

    D × R0 × R1 × ··· × Rn → 1,                                (3-11)

then the quotient can be expressed as

    Q = N × R0 × R1 × ··· × Rn.                                (3-12)

Assuming that D is positive and fixed-point arithmetic is selected, then

    1 > D > 0.                                                 (3-13)

In order to converge faster, D should be normalized first, that is,

    1 > D' ≥ 1/2                                               (3-14)

where D' is the divisor after normalization. Then N must be shifted as many bits as D, thus

    N' / D' = N / D.                                           (3-15)

Now, a number δ can be defined which makes

    D' = 1 - δ                                                 (3-16)

and then

    0 < δ ≤ 1/2.                                               (3-17)

We can choose the first multiplying factor as

    R0 = 1 + δ                                                 (3-18)

so that

    D0' = D' × R0 = (1 - δ)(1 + δ) = 1 - δ².                   (3-19)

Now choosing

    R1 = 1 + δ²                                                (3-20)

then

    D1' = D0' × R1 = (1 - δ²)(1 + δ²) = 1 - δ⁴.                (3-21)

Therefore, in general,

    Di' = D' × R0 × R1 × ··· × Ri = 1 - δ^(2^(i+1)).           (3-22)

In the binary case, δ ≤ 1/2, therefore

    δ^(2^i) ≤ 2^(-2^i).                                        (3-23)

Now, consider a word size of 16 bits, implying 15 significant digits. From equation 3-22, in order to make Di' = 1 to within the word length requires

    δ^(2^(i+1)) < 2^(-15).                                     (3-24)

Equation 3-24 then gives i = 3. In other words, multiplying factors R0, R1, R2 and R3 are required to make D3' converge to 1, and

    Q = N' × R0 × R1 × R2 × R3.                                (3-25)

For the same reason, if the word size is 24 bits, 5 multiplying factors are required, and so on. Finally, the quotient value should be shifted right by the same number of bits as the divisor, D, was shifted left, to normalize the result.

The convergence division algorithm was chosen for use in the DC's due to the fact that no software convergence checking is required. The hardware implementation of a DC working with 16 bits per word is illustrated in Figure 3.6.

Figure 3.6 A 16-by-16 convergence division algorithm based DC.
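The iteration of equations (3-16) through (3-25) can be modeled in a few lines of Fortran. The sketch below assumes a divisor already normalized to 1/2 ≤ D < 1 and omits the pre- and post-shifts of Figure 3.6; the function name and argument layout are illustrative.

      REAL FUNCTION CDIV(XN, D, NFAC)
C     CONVERGENCE DIVISION: Q = XN/D FOR A NORMALIZED
C     DIVISOR 1/2 .LE. D .LT. 1, USING NFAC FACTORS
C     R = 1 + DELTA**(2**K), WHERE DELTA = 1 - D
C     (EQUATIONS 3-18 AND 3-20).
      INTEGER NFAC, I
      REAL XN, D, R, QN, QD
      QN = XN
      QD = D
      DO 10 I = 1, NFAC
C     AT THIS POINT QD = 1 - DELTA**(2**(I-1)), SO THE
C     FACTOR 2 - QD EQUALS 1 + DELTA**(2**(I-1)).
      R = 2.0 - QD
      QN = QN*R
      QD = QD*R
   10 CONTINUE
      CDIV = QN
      RETURN
      END

Note that each new factor is generated as R = 2 - QD, the two's complement of the current denominator, which is the operation performed ahead of each multiplier stage in Figure 3.6. With NFAC = 4 the quotient is accurate to about 15 bits, in agreement with equation (3-24).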
C. Latch Design - There are two basic types of latches: static and dynamic. A traditional static latch is a D-type flip-flop, as shown in Figure 3.7. The disadvantage of this latch is that it requires four 2-input NAND gates and one inverter, and also four units of gate delay. These area and time requirements are more than those required by a dynamic latch. Dynamic latches are known to be ideal in high speed, synchronous processor chip designs, especially using VLSI [9, 19]. The dynamic latch, shown in Figure 3.8, consists of two pass transistors and two inverters. It uses a two-phase nonoverlapping clock to load and refresh the input data. The utilization of pass transistors has the advantages of a simple topology of interconnections and elimination of VDD and GND connections, thus reducing wire routing geometry and power requirements.

Figure 3.7 D-type flip-flop.

Figure 3.8 Dynamic latch controlled by two-phase nonoverlapping clock [9].

CHAPTER IV

INPUT/OUTPUT CIRCUIT DESIGNS

Operand I/O for VLSI structures inherently suffers from a basic pinout limitation problem due to practical integrated circuit packaging constraints. This necessitates the incorporation of serial-to-parallel demultiplexer and parallel-to-serial multiplexer circuits at the front and rear ends of the computing array, respectively. Furthermore, very high speed DMUX/MUX designs are required to provide sufficient operand broadcasting to the row of input cells, which, in the case considered in this thesis, requires 2B + 2 operands simultaneously, where B is the matrix bandwidth. Additionally, any I/O circuit design for efficient VLSI implementation must also be optimized with respect to minimum chip real estate requirements and practical pinout counts. Currently, 64 to 100 pins per chip are considered practical [26]. This limitation also affects the operand word size and consequently the precision of the internal computations.

The purpose of this chapter is to present three possible input/output routing strategies for the VLSI computing array presented in the last chapter. Logic circuit diagrams of these I/O designs are developed and are used in the simulation described in the next chapter. The simulation provides comparative information on propagation delays and chip area requirements.

The design methodology used here is identical to that used in designing the computing array network and processing cells. In the logic circuit diagrams of the DMUX/MUX structures, pass transistors and inverters are used for building switch logic and dynamic latches in place of traditional logic gate structures. The new structures have simpler interconnection topologies and smaller power requirements. Moreover, gains in area geometry and speed are also anticipated. Ideally, the I/O circuit structure should be fast enough to avoid any I/O bottleneck. This must be accomplished with each input and output bus being only one word wide, due to pin limitations. The three I/O strategies are presented in the following sections.

4.1 SCS Input/Output Circuit

The first I/O circuit candidates are illustrated in Figures 4.1 and 4.2, respectively. In the input circuit (Figure 4.1), the SCS (Shift-Register Control Sequence) is stored in a one-bit wide control queue. This sequence is synchronously pumped into the one-bit shift register chain and is used to select the proper operand input channel. The first bit of this control queue is logic "1"; the others are logic "0". At the end of the first cycle, the output of the first shift register is "1" and the load pass transistors of the top input channel are "on". Consequently, the first data operand can propagate through the top channel and is refreshed by the φ1 clock signal. At the next clock cycle, the logic "1" of the control queue moves onto the output of the second stage of the shift register. Then the second channel is selected as the input channel. After N data operands are inside the input structure, refreshing at every φ1, a control signal called MAC IN·φ1 (see Figure 4.3) enables those N operands into the corresponding latches. These latches are used to control the data flows and to stabilize the operands.

Figure 4.1 SCS input circuit diagram.

Figure 4.2 SCS output circuit diagram.

The output circuit of this strategy is even simpler. Identical to the input circuit, the pass transistors on the output channels are controlled by an identical SCS. After N data operands are pumped into the latches (by the MAC OUT·φ1 signal; see Figure 4.3), they are refreshed by φ2. Sequential output channels are then selected from top to bottom until the operand of the Nth MAC flows out. Quite conveniently, the channel pass transistors here also serve as the tri-state output gates!

Figure 4.3 Input and output circuit control signal timing diagram.
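The channel-selection action of the SCS can be summarized by a behavioral model. The Fortran sketch below is written for illustration only (the channel count N = 8 and the array names are assumed); it shifts the single control token down the chain and steers successive words to successive channels:

      PROGRAM SCSDMX
C     BEHAVIORAL MODEL OF THE SCS INPUT SELECTOR: A ONE-HOT
C     TOKEN SHIFTS DOWN AN N-STAGE REGISTER CHAIN, STEERING
C     EACH ARRIVING WORD TO THE NEXT CHANNEL.
      INTEGER N, T, K
      PARAMETER (N = 8)
      REAL WORD(N), CHAN(N)
      LOGICAL SCS(N)
      DO 10 K = 1, N
      SCS(K) = K .EQ. 1
      WORD(K) = REAL(K)
   10 CONTINUE
      DO 30 T = 1, N
      DO 20 K = 1, N
      IF (SCS(K)) CHAN(K) = WORD(T)
   20 CONTINUE
C     SHIFT THE CONTROL TOKEN ONE STAGE DOWN THE CHAIN.
      DO 25 K = N, 2, -1
      SCS(K) = SCS(K-1)
   25 CONTINUE
      SCS(1) = .FALSE.
   30 CONTINUE
      PRINT *, CHAN
      END

After N cycles each channel latch CHAN(K) holds the Kth word of the serial stream, which is the distribution performed by the input circuit of Figure 4.1.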
4.2 Binary Tree Structure

The input DMUX circuit, which is illustrated in Figure 4.4, features three important concepts. First, it uses the abstract structure of a binary tree, a basic interconnection network architecture. Second, it is a revised version of the traditional block DMUX tree. Third, and most importantly, it utilizes pass transistor and shift register concepts in the routing circuitry so as to decrease power requirements and area geometry. Moreover, this will facilitate better testability for tracking data flow.

Figure 4.4 Binary tree input structure.

This structure uses control signals to select on or off for each pass transistor. When clock signal φ1 rises, a stream of "on" pass transistors generates an input channel for a set (a word) of data which is assumed to be ready at the input port. The operand propagates along, and its bits are momentarily held by the input capacitances of the first bank of inverters. Then, when clock φ2 rises, one of the two down-stream pass transistors allows the operand to flow through a branch, and it is again held via the input capacitances of the second bank of inverters. The data is then stabilized by the SR flip-flops, providing temporary storage. After all data operands are pumped in, they are simultaneously clocked into the latches and refreshed by φ2. The control signals and structure of the latches are identical to those described in Section 4.1. The operands remain in the latches until the next data stream replaces them.

The output MUX circuit, which is shown in Figure 4.5, is basically a mirror image of the input DMUX circuit. After the rear edge of MAC's completes its computation, a set of latches is loaded with the resolved operands and is thus isolated from interference with the MAC's further computation. Subsequently, each data operand is pumped out of the chip through an output channel selected by appropriate control signals.

Figure 4.5 Binary tree output structure.

Among the disadvantages of this strategy are the cumbersome control signal requirements, which alone may severely increase the pinout count. Additionally, the nature of the regular binary tree requires that the number of destination nodes be a direct power of 2. This implies that much of the tree is wasted if the number of MAC's at the input level of the array does not correspond to a power of 2. For example, consider the case where there are 17 first row MAC's. Then a 1-to-32 input tree is implied, resulting in a waste of valuable chip area. Of course, irregular binary tree designs are possible, but as will be shown in Chapter VI this binary tree structure is far from optimal even when a power of 2 match is obtained.
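The waste caused by the power-of-2 constraint is easily quantified. The short sketch below, assuming one tree leaf per first-row MAC, doubles the leaf count until it covers NMAC destinations:

      PROGRAM TREESZ
C     LEAF COUNT AND WASTE OF A REGULAR BINARY DMUX TREE
C     FEEDING NMAC FIRST-ROW MAC'S.
      INTEGER NMAC, NLEAF, NWASTE
      NMAC = 17
      NLEAF = 1
   10 IF (NLEAF .LT. NMAC) THEN
         NLEAF = 2*NLEAF
         GO TO 10
      END IF
      NWASTE = NLEAF - NMAC
      PRINT *, 'LEAVES =', NLEAF, '  WASTED =', NWASTE
      END

For the 17-MAC example above it reports 32 leaves, 15 of them wasted.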
4.3 Data Controlled Input/Output Circuit

The major difference of this routing strategy over the others is that it pumps the data operands into the destination processing elements without using any channel-selecting control signal. It utilizes the FIFO (First-In-First-Out) stack concept, employing m sets (where m is the number of bits per word) of N-stage shift register chains (where N is the number of first level MAC's). This concept is illustrated in Figure 4.6.

Figure 4.6 Data controlled input circuit diagram.

The full set of N operands is pumped into the shift register chain during the first N clock cycles. This provides an entire set of operands available to the first row of MAC's at the (N + 1)th clock cycle.
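For illustration, one bit-slice of this FIFO scheme can be modeled as follows; the stage count N = 8 and the sample bit stream are assumptions made here:

      PROGRAM FIFODM
C     ONE BIT-SLICE OF THE DATA CONTROLLED INPUT SCHEME:
C     N OPERAND BITS ARE SHIFTED IN SERIALLY, AND AFTER N
C     CYCLES THE WHOLE CHAIN IS AVAILABLE IN PARALLEL.
      INTEGER N, T, K
      PARAMETER (N = 8)
      INTEGER STAGE(N), STREAM(N)
      DO 10 T = 1, N
      STREAM(T) = MOD(T, 2)
      STAGE(T) = 0
   10 CONTINUE
      DO 30 T = 1, N
C     EACH CLOCK CYCLE THE CHAIN SHIFTS DOWN ONE STAGE AND
C     THE NEXT SERIAL BIT ENTERS STAGE 1.
      DO 20 K = N, 2, -1
      STAGE(K) = STAGE(K-1)
   20 CONTINUE
      STAGE(1) = STREAM(T)
   30 CONTINUE
C     CYCLE N+1: PARALLEL-OUT TO THE FIRST ROW OF MAC'S.
      PRINT *, STAGE
      END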
Shape Recognition," Electronic Design, Vol. 26, No. 18 (1978), p. 146. Shanblatt, M. A., "Vectorized Elimination and Ordering Strategy for Load Flows on an Array Processor," Ph.D. Thesis, University of Pittsburgh (April 1980). Mead, C. and Conway, L., Introduction to VLSI Systems, Addison-Wesley Pub. Co., Reading, Massachusetts (1980). Rem, M. and Mead, C. A., "Cost and Performance of VLSI Computing Structures," 1888 J. of Solid State Circuits, Vol. Sc-14 (April 1979). pp. 455-462. Foster, M. J. and Kung, H. T., ”The Design of Special- Purpose VLSI Chips," Computer (January 1980), pp. 26-400 92 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 93 Kung, H. T., ”Let's Design Algorithms for VLSI Systems," Proc. Caltech. Conf. Very Large Scale Inte- gration (January 1979): pp. 65-90. Kung, H. T., "The Structure of Parallel Algorithms," in Advances in Computers, Vol. 19, Academic Press (1980). Kung, H. T. and Leiserson, C. 8., "Algorithms for VLSI Processor Array," Symposium on Sparse Matrix Compu- tations, Knoxville (1978). Preparata, F. P. and Vuillemin, J., ”Optimal Inte- grated-Circuit Implementation of Triangular Matrix Inversion,” Proc. of Int‘l Cppf. Parallel Processing, (August 1980) pp. 211-216. Hwang, PL. and Cheng, Y-H, "VLSI Computing Structures for Solving Large-Scale Linear Systems (ME Equations," Proc. 1980 Int'l Conf. on Parallel Processing (August 1980), pp. 217-230. Hwang, BL. and Cheng, Y-H, ”Partitioned Algorithms and VLSI Structures for Large-Scale Matrix Computations," Proc. 5th Symposium on Computer Arithmetic (May 1981), pp. 222-232. Oakes, M. F., "The Complete VLSI Design System," Proc. 16th Design Automation Conference (June 1979), pp. 452-460. Eichelberger, E. B., "A Logic Design Structure for LSI Testability,” Proc. 14th Design Automation Conference (June 1977), pp. 462-468. Hwang, K., Computer Arithmetic, John Wiley and Sons, Inc. (1979). Wallace, C. S., ”A Suggestion for Parallel Multi- pliers,” IEEE Trans. Electronic Computers, Vol. EC-13 (February 1964), pp. 14-17. Pezaris, S. D., ”A 40nS 17-bit-by-bit .Array’ Multi- plier," IEEE Trans. Computers, Vol. C-20, No. 4 (April 1971), pp. 442-447. Baugh, C. R. and Wooley, B. A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE‘, Trans. Computers, Vol. C-22, No. 1-2 (December 1973), pp. 1045-1047. Cappa, M., "Cellular Iterative Arrays for Multipli- cation and Division," M. S. Thesis, Dept. of Electrical Engineering, University of Toronto, Canada (October 1971). 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 94 Cappa, M. and Hamacher, V. C., "An Augmented Iterative Array for High-Speed Binary Division," IEEE Trans. on Computers, Vol. C-22 (February 1973). pp. 172-175. Lyman, J., "Tape Automated Bonding Meets VLSI Chal- lenge," Electronics (December 18, 1980), pp. 100-105. Fairbairn, D. G., "VLSI: A New Frontier for System Designers," Computer (January 1982), pp. 87-96. Taub, H. and Schilling, D., Digital Integrated Elec— tronics, McGraw-Hill Inc. (1977). Personal communication with Donnie K. Reinhard, Depart- ment of Electrical Engineering and Systems Science, Michigan State University (March 1982). LaBrecque, M., "Faster Switches, Smaller Wires, Larger Chips," NSF MOSAIC (January/February 1982). Keyes, R. W., "Physical Limits in Semiconductor Electronics," Science, Vol. 195 (March 1977): PP. 1230-1235. Eidson, J. (3., "Fast Electron-Beam Lithography," IEEE Spectrum (July 1981), pp. 24-28. Chang, T. L., "Mixed Systolic Arrays: A Reconfigurable Multiprocessor Architecture," Ph.D. 
Thesis, Michigan State University (February 1982). Markowitz, H. M., "The Elimination Form of the Inverse and its Application to Linear Programming," Management Science, Vol. 3 (1957); PP. 255-259. Calahan, D. A., "Multi-Level Vectorized Sparse Solution of LSI Circuits," Proc. IEEE ICCC 80, Port Chester, New York (1980). Dembart, B. and Neves, K. W., "Spare Triangular Factor- ization (N1 Vector Computers," EPRI Special Report E1-566-SR, pp. 57-102. Siewiorek, D. P., Bell, C. G., and Newell A., Computer Structures: Principles and Examples, McGraw-Hill Inc. (1982).