£9133 WNWNH“MINIMUMIHIWIJIHWIIIHIIHUIWI HESlB L"?- fiA 3'? Y Michigan Mate ’ L Universfity 'w J: This is to certify that the thesis entitled VLSI SYSTOLIC ARRAY FOR MATRIX TRIANGULATION IN LOAD FLOW ANALYSIS presented by Yu-Ying Jackson Leung has been accepted towards fulfillment of the requirements for M. S. degree in Electrical Engr. DAjor professor 2/67/67 ’ I 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution MSU LIBRARIES “ RETURNING MATERIALS: Place in book drop to remove this checkout from your record. FINES will be charged if book is returned after the date stamped below. VLSI SYSTOLIC ARRAY FOR MATRIX TRIANGULATION IN LOAD FLOW ANALYSIS By Yu-Ying Jackson Leung A THESIS Submitted to Michigan State University in partial fulfillment of the requirments for the degree of MASTER OF SCIENCE Department of Electrical Engineering and Systems Science 1933 ABSTRACT VLSI SYSTOLIC ARRAY FOR MATRIX TRIANGULATION IN LOAD FLOW ANALYSIS BY Yu-Ying Jackson Leung The computational bottleneck incurred in power system load flow analysis is due to the cumbersome solution of a large, highly sparse set of linear equations. A VLSI systolic array structure for band matrix triangulation, utilizing concurrent Gaussian elimination, is applied to this problem to decrease the required triangulation time. Included is a design of an interface system connecting the systolic array structure to a host computer. A circuit simulation of the complete structure indicates a time savings of two orders of magnitude over traditional serial computer methods. These results lead to an assessment of the advantages and significance of this class of special purpose VLSI computing structures for throughput enhancement in load flow analysis. To my mother and eldest brother Mdm. Me Chit and Mr. Wai-Ying Tommy Leung ACKNOWLEDGEMENTS The author wishes to express his sincere appreciation to his major advisor, Dr. Michael A. Shanblatt, for his guidance and encouragement in the course of this research. He also wishes to thank Dr. W. C. Hsu and the committee members, Dr. P. D. Fisher and Dr. R. A. Schlueter, for giving the valuable suggestions and comments in this work. Finally, the author owes a special thanks to Miss Jasmine Lam for her emotional encouragement and support. Work reported here was supported in part by NSF under Grant ECS-8106675. TABLE OF CONTENTS Page LIST OF TABLES ....... ......... ........... ....... ............. vi LIST OF FIGURES ........................ . ..................... vii I. INTRODUCTION . ...... ....... ............................ . l l.l Statement of Problem .............................. 3 1.2 Approach ..... ........ . ..... . ............ . ......... A l.3 Contributions ..................................... 6 II. BACKGROUND ............. ......................... . ...... 8 2.l Load Flow Analysis ........................ ........ 8 2.2 Traditional Load Flow Program Structure ................................. l0 2. 2. I Program Structure ......... . ... ........ .... ll 2. L 2 Jacobian Matrix Triangulation by Gaussian Elimination ..... ............... 12 2.3 Banded Matrix .................. ................... 16 2.A VLSI Systolic Algorithm for Matrix Triangulation ....... ..... . .......... ....... 2l III. SYSTOLIC ARRAY FOR TRIANGULATION OE LOAD FLOW MATRIX FORM ................ ..... .......... 27 3.l VLSI Array Structure Modification ................. 27 3.2 Computing Structure and Timing .................... 33 3.2.1 Multiplexer ................................ 3A 3.2.2 Latch and Stack ........... ...... ........... 35 3.2.3 MAC and DC ................................. 37 3.2.A Timing ..................................... A0 3.2.A.l Internal DC Timing .. .............. A2 3.2.A.2 External Computing Array Timing ................ ...... A2 3.3 Fixed-Point Consideration for Load Flow Convergence ................... ...... AA IV. VI. A PROPOSED HOST PROCESSOR/PROCESSING ARRAY INTERFACE ............... . ........................ A.l Operand Adjustment by Host Processor ...}.... ............................ A.2 Operand Adjustment by Intermediate Processor ............................ SIMULATION DEVELOPMENT ..... ............................ 5.] Chip Area Computation ....................... . ..... 5.2 Throughput Computation ... ................... . ..... 5.3 Significant Module Data ...... ..... ................ RESULTS AND CONCLUSIONS .............. ........ . .......... 6.l Circuit Simulation Results ..... ................... 6.2 Comparsion with Previous Results .... .............. 6.3 Conclusions ............... ...... .. ................ 6.A Discussion ................................... ..... BIBLIOGRAPHY ........................................... Table 3.] 5.] 6.] 6.2 6.3 6.A 6.5 6.6 LIST OF TABLES Port-coefficient time table. ... ....................... Revised port-coefficient time table. .................. Significant module data. .. .......... . ................. Covergence results in number of iterations of MSU load flow program version. ..... ................ Simulation time results of I/O ports, MAC and DC. ...... ........ .. ........... ...... Simulation time results showing an I/D timing bottleneck. .... ......................... Simulation time and chip edge size results for load flow study. ...... .............. . ..... Comparison of triangulation time per iteration of different processors. ... ..... ............ Comparison of time results of different processors in ratio. ........................ vi 25 32 63 65 70 72 7A 75 76 Figure . 2.2 2.3 2.A 2.5 2.63 2.6b 2.7 3.] 3.2 3.3 3.1. 3-5 LIST OF FIGURES Structural flow of POWERFLO program. ...... .... ..... ... Row-ordered elimination pattern in the POWERFLO program. .............................. Bandwidth and full breadth of a band form matrix. ........ .................. . ..... . Single variable elimination from a band form matrix. ....... .......... ............. Full row elimination from a band form matrix. ........ ................. ..... ..... Augmented matrix {Aib}. ......... ..... ................. Augmented upper triangular matrix ”Id 0 .........OOOO’OOOOOOOOOO......OOOOOOOOO Computing array structure for matrix of arbitrary dimension with 8-3. ...................... Augmented strict upper triangular matrix {U'Id' with unit diagonal. ,................... Revised computing array structure for matrix of arbitrary dimension with B-3. ..................... 2:1 multiplexer. ...................................... An n-by-n convergence division algorithm based DC. ................................... Latch and multiplexer control Signals timing diagram. ..........OOOOOOOO0.00.00.00.00 vii IA I7 I8 20 23 23 2A 29 31 3A 39 A3 Figure A.2 I..3 A.A 6.l 6.2 Program flowchart for operand pre-adjustment. ....... ........... ..... ........ Program flowchart for operand post-adjustment. .......... . ................... Function block structure of an intermediate processor for pre- and post-adjustment. . ............ .. A sample internal block structure of an adjuster. ........................ ..... Entire chip edge size versus matrix bandwidth for 0.8, 0.5 and 0.2 micron linewidths. ....... ........ Entire chip propagation delay versus matrix band- width for 0.8, 0.5 and 0.2 micron linewidths. . ........ viii 50 5] 53 55 67 68 CHAPTER I INTRODUCTION Electrical energy generation and delivery systems, generally known as power systems, are and will continue to be of fundamental importance to the technological and engineering community. For this reason, power systems, one of the biggest "systems” in the world, have been explored and studied since the late nineteenth century [I]. In particular, the load flow problem is the most basic and essential study because it provides information for the continuous evaluation of the current performance of the system and for analyzing the effectiveness of alternative plans for system expansion to meet increased load demand. The load flow problem is the evaluation of power flows and voltages of a network for specified bus or terminal conditions. It provides the solution for the static operating conditions of the power transmission system. It is one of the most important of many current engineering problems that require a very rapid routine solution before system failures occur. The Newton-Raphson method has been traditionally applied to solve the load flow problem by iteratively solving a set of non-linear equations expressing the specified real and reactive power in terms of bus voltages and phase angles. The limiting factor in problem throughput, from the standpoint of a traditional serial computer, is the time required to solve the large (often n > l000) set of non-linear 2 equations via triangular factorization, a form of Gaussian elimination. This limitation is due to the fact that the number of computations is directly proportional to the square of the matrix dimension. Also, it is due to the cumbersome data transfers between the arithmetic-logic unit and memory (the Von Neumann bottleneck) as well as the constant packing and unpacking of operands during the elimination. Recently, VLSI (fiery Large Scale integration) technology has emerged. An enormous number of computer algorithms and architectures have been proposed which show great promise in overcoming this computational bottleneck [2, 3, A]. Utilizing this new VLSI technology, the purpose of this thesis is to study aspects of the design of a high-speed dedicated processor architecture, realizable in VLSI, tailored towards the load flow problem. The dedicated processor structure is mainly composed of arrays of processing elements which can be implemented directly using low cost, high speed VLSI circuit technology either on a single chip, or perhaps in a modular fashion on a few chips. This VLSI computing array, based on concurrent Gaussian elimination for triangulating band form systems and composed of fixed-point processing elements, is to be simulated by a Fortran-coded program. The simulation will be tied into a proposed interfacing design between the VLSI processor and a host computer. Special topics to be discussed include details of the interfacing system, the simulation and results pertaining to load flow throughput enhancement. Furthermore, a basic constraint of the number of fixed-point bits (word size) inside the computing array must be determined. This is because the word size of the operands has an 3 essential effect on load flow convergence. Too few bits will cause undue rounding error and lead to divergence of the overall Newton-Raphson formulation. Another detail to be discussed is how to interconnect the processing elements within the array such that it can efficiently carry out a strict upper triangular factorization of the coefficient matrix. In the past, designs of this type have carried out a form of factorization which is not immediately conducive to back-substitution [2, 3, A]. The goal of the design to be presented here is an array in which the resolved upper triangle has a unit diagonal therefore prepared for immediate back-substitution. In addition, methods for adjusting the complex Jacobian coefficient matrix, of the type encounted in load flow analysis, to the systolic array must be reviewed. In this case, single matrix elements are actually sets of 2x2 complex sub-entries relating real and reactive power to changes in nodal voltages and phase angles. The requirement of a band form system for triangulation, as mentioned before, is still applicable to this complex system. It is assumed that the system has been permuted, a priori, to a minimum band form. l.l Statement of Problem It is well known that the bottleneck operation encountered in large scale power load flow programs is the solution of the system of complex simultaneous equations derived from a Newton-Raphson formulation [5]. A huge amount of research effort has been spent investigating new generation computer architectures in order to obtain a faster and more efficient solution [2, 3, A]. Likewise, the aim of this research is to determine if a VLSI systolic array structure can be applied to an existing load flow program, via simulation, for effectively reducing the solution time. l.2 Approach . The primary goal of this thesis is to obtain an accurate quantification of processing time (throughput) for a VLSI systolic processing array imbedded, via simulation, in an industry standard load flow program to promote its efficiency in improving load flow computation time. In addition, the same simulation will provide an estimation of the chip edge size of the VLSI array obtained as a function of word size, matrix bandwidth, matrix dimension and minimum lithographic linewidth parameters. Details of the approach towards this goal will be described in this section. Foremost, the structure of an industry standard, large-scale (l500 bus), serial computer load flow program is explained. The program is modified to run on the CDC CYBER-750 computer at Michigan State University and verified against benchmark results. For the purpose of benchmarking, the program is executed with several different sets of data independently to produce various records of solution time, mainly the elimination time (or the triangulation time) and iteration counts. The time records are known to be high when 5 compared to those available for a scientific-attached array processor and are expected to be tremendously high when compared to the VLSI systolic array processor. This is because the data processing inside either array processor is in parallel and pipeline fashion and thus inherently faster than Von Neumann type architectures. As a second step a subroutine is designed to normalize all numbers that are involved in the calculation of the resolved Jacobian elements. This subroutine is to be annexed to the MSU load flow program version. Whenever a number is transferred to this subroutine, it is first changed to a fixed-point binary number and then truncated to a predefined number of bits. Finally, this truncated binary number is converted back to the original representation and returned back to the calling program thus simulating the effect of fixed-point round off. The whole program (the full load flow program plus the rounding subroutine) is then executed to ascertain the necessary number of fixed-point bits required for load flow convergence. The third step involves restructuring the version of a VLSI systolic array for triangulating large augmented band form coefficient matrices based on a recently proposed algorithm [A]. This restructuring includes the addition and revision of some hardware in the suggested array structure such that it can do {53513; upper triangulation (i.e. resolve the diagonal to unity). Fourthly, the development of a hardware interface system between the host computer and the processing array will be discussed. This interface is responsible for the adjustment of operands transferring back and forth between the host computer and the peripheral array 6 processor. The interface is necessary since the number format of these two computing structures is different. The overall block structure of the interface will be presented and explained. Finally, the proposed VLSI circuit model is simulated with variuos parameters of word size, matrix bandwidth, matrix dimension and minimum lithographic linewidth. The word size, which determines the width in bits of the systolic array processing elements, is found from the second step of the approach. The matrix bandwidth is found by reducing the original bandwidth as much as possible. The matrix dimension is ~ dependent on the size of the electrical utility network. The minimum lithographic linewidth, X, is varied to project the trend of I. C. technology. Lastly, in order to determine if the goal of load flow throughput enhancement is achieved, the obtained time results are to be compared with the results from the previous works [7, 8]. l.3 Contributions The major contributions of this thesis are summarized as follows: I. An existing VLSI systolic array algorithm based on concurrent Gaussian elimination is successfully improved such that it can effectively triangulate any augmented band form matrix to a strict upper triangle and a unit diagonal. 2. The VLSI algorithm is successfully applied to a representative 7 load flow program and the accuracy of the load flow convergence results are within tolerance. A suggested interface is presented for the proposed special purpose processor attached to a host computer. The significance of using a VLSI algorithm to relax the computational bottleneck that occurs in many engineering problems, specifically the load flow problem, is reinforced by the data collected. CHAPTER II BACKGROUND 2.l Load Flow Analysis Load flow analysis involves calculating power flows and voltages for a specified utility system subject to the regulating capability of electric generators, condensers, and tap changing under load transformers. This procedure is used to determine a set of complex bus voltages and line power flows representing the static operating condition of the power system network. The analysis involves solving a set of non-linear algebraic equations which is typically large and sparce ( < 3 X non-zero elements). The set of equations, obtained from nodal voltages and the network admittance, is solved for bus voltage magnitudes and phase angles. Then, the voltages and phase angles are used to calculate the power flows. Although a power system is operated with three-phase generation and loads, a single-phase representation is adequate because the system is considered balanced. For each bus there are four possible parameters, namely the voltage magnitude, the phase angle, and the real and reactive power. Two of these four quantities are specified at each bus representing one of three bus types: a voltage controlled bus, a load bus or a swing bus. A swing bus, also known as a slack bus, must be designated to supply the additional real and reactive power for accommodating transmission losses inasmuch as these quantities are 9 unknown prior to the final solution. Although there are many numerical methods that can be used to solve the load flow equations. the most economical and effective one, from the standpoint of computer memory storage and solution time, is an approach employing the bus admittance matrix and the Newton-Raphson method [5]. This approach involves the derivation of a set of non-linear equations which is used to express the specified real and reactive power in terms of bus voltages and phase angles. A Jacobian matrix equation in polar notation is derived which is in the form of Ax=b as follows: aPS aP . 38 {MEI IEI A8 AP ..... ......... . (2-l) a9 : 89 . AIEI as gain ”5' IEI AQ where AP - real power mismatch, AQ 8 reactive power mismatch, A6 - correction to the nodal voltage phase angle, AIEI correction to the nodal voltage magnitude. To initiate the procedure all bus voltages and phase angles are set to the swing bus value. Next, real and reactive bus power and current are evaluated. Then, the Jacobian matrix equation is formed and the linear system is solved for the nodal voltage and phase angle corrections. Now, a new set of bus voltages is derived using the calculated corrections and the real and reactive bus power are reevaluated. The process continues with convergence checked by examining a tolerable power mismatch. If the tolerence is not lO satisfied, the process continues iteratively until convergence results. The computational bottleneck in the overall load flow method lies in the solution of the Jacobian matrix equation at each iteration of the Newton-Raphson approach. Many procedures have been developed to ease this hindrance as much as possible [6, 7, 8]. One of these methods is first to reorder the system of equations by a strategy which attempts to locally minimize the number of new non-zero elements which are introduced into the matrix during the elimination process. Then the network variables are packed into a sparse data structure and a specific form of Gaussian elimination, operating on one row at a time, is used to acquire the upper triangular factorization [6]. Another technique makes clever use of a one-dimensional array processor to achieve fast elimination [7, 8]. In this method, peripheral attached array processors, such as the AP-ISOL, are utilized in the solution of the highly vectorizable elimination procedure. With proper restructuring of the computer algorithm. savings of up to 6A.5 2 of the overall solution time, compared to the traditional serial computer approach, have been achieved [8]. However, as the power networks grow much larger, a bigger array processor with larger internal memories have to be designed for solving the problem at the same speed. 2.2 Traditional Load Flow Program Structure As an initial step in this research, a traditional and representative load flow computer program was brought up on the Michigan State University CYBER-750 computer. Serving as a basic reference ll program, a version of Philadelphia Electric Company's POWERFLO program was obtained [9]. This program was originally modified at the University of Pittsburgh to faciliate studies of this type*. This program utilizes the Newton-Raphson approach with the solution of the resulting linear equations by row-ordered Gaussian elimination and subsequent back-substitution of the resulting strict upper triangular matrix. 2.2.l Program Structure Although there are many procedures inside a standard load flow package, only those steps involving the generation of the Jacobian matrix and subsequent solution by Gaussian elimination are of interest in this study. A brief explanation of these two steps follows. I. The MAIN program calls a subroutine MATGEN which oversees the matrix generation and factorization. 2. Inside MATGEN, the program starts looping over all the nodes in . the system. A "working row” of the Jacobian is generated for each node. *We gratefully acknowledge the help of Professors M. H. Mickle and W. G. Vogt, Department of 'Electrical Engineering, University of Pittsburgh, for supplying us with the working version load flow package. l2 3. For each ”working row” the subroutine ELIM is called to eliminate all the elements of that row up to the diagonal. The eliminated coefficients (factors) of the upper triangle are then packed into a condensed and indexed form. A. After all rows have been eliminated, subroutine BAKSUB is called for back-substitution and thus the nodal voltage corrections are obtained. 5. Finally, subroutine ADJUST is called to perform inter-iteration adjustments and convergence is checked before the next Newton-Raphson iteration is initiated. 6. If convergence is not obtained, the loop is either terminated (divergent case) or steps I through 5 are repeated until a satisfactory solution is obtained. The program flowchart is illustrated in Figure 2.l. 2.2.2 Jacobian Matrix Triangulation by Gaussian Elimination Since the main concern of this research is in reducing the time required in the factorization aspect of a load flow program, the exact details of the elimination process will be explained in this section. The solution scheme used in the standard load flow program processes and stores only non-zero elements. This is done by storing a CM?“ 3 NO SUBROUTINE MATGEN I SUBROUTINE ADJUST YES (END) 13 SUBROUTINE MATGEN GENERATE WORKING ROW I SUBROUTINE ELIM LAST NODE YES SUBROUTINE BAKSUB cm 3 Figure 2.l Structural flow of POWERFLO program. lA compacted table of factors (upper triangle) and a compacted working row. In addition. a set of pointers and counters for tracking the process are required. The elimination process is actually a triangular decomposition of the Jacobian matrix by Gaussian elimination which is described in many books on matrix analysis [l0, ll]. However, in here, only non-zero elements of each row up to the diagonal are eliminated and stored according to the current pointers and indexed counters before proceeding to the next row. This is shown pictorially in Figure 2.2. The figure depicts the matrix form just prior to the elimination of row i. Since only the non-zero elements of the upper triangle are stored, it is not necessary to store all the resultant elements depicted in the figure. The form of each element shown in Figure 2.2 is actually a set of 2x2 complex sub-entries, expressing the partial derivatives of real and reactive power with respect to nodal voltage and phase angle (See Equation 2-l). For easy reference, Equation 2-l can be rewritten in the form of Equation 2-2. quuuuu' quuu quu .—— xxxxxx xxxxxx _xxxxxx_ Figure 2.2 Row-ordered elimination pattern in the POWERFLO program. IS where AP, AQ and A8 are defined as before and AV AIEI/IEI, H 3P/35. 2 II (aP/aIEI)-IEI, c. II aQ/aa. r I (aQ/aIEI)-IEI. Therefore, the Jacobian can be expressed in the form matrix derived from a system of After the elimination is complete, the matrix is upper triangular form of n+l the (2-2) nodes strict I Nll Hln Nln cccccccccc . . ...-0.0.0.0....- 0 l J ' L ' ln ln 0 . Q o o o O O o 03 o I N ' . nn (...... g . g ............... 05 O 0 l and thus ready for immediate back-substitution. 2.3 Banded Matrix In this section, the description and advantages of reordering a matrix to band form prior to performing Gaussian elimination in an array processing environment are presented. Any matrix, A, is said to be banded, or in band form, if all the non-zero elements are clustered about the main diagonal. The bandwidth, B, a measure of this clustering, is defined by a . max {Ii-jl : a(i,j) # o}. (2-3) For a structurally symmetric matrix, as in the case of Jacobian matrix derived from the Newton-Raphson approach to the load flow problem, the above bandwidth definition implies that the maximum width, in number of elements, between the first and last non-zero entry of any row is 28+l. This maximum width will be referred to as the "full breadth" of the matrix. The bandwidth and full breadth are illustrated in Figure 2.3. 28+] Figure 2.3 Bandwidth and full breadth of a band form matrix. There are several heuristic algorithms that have been developed to reorder the matrix A to a reduced band form [l2, l3, IA]. The two most commonly used are the Cuthill and Mckee algorithm [l3] and an improved version described by Gibbs, et al [IA]. These two algorithms were basically developed with the same fundamental graph theoretic concepts. It has been shown that matrix banding is theoretically the optimal network ordering for systems to be processed on an attached array processor architecture [7]. Furthermore, in one particular study the banded form ordering on the array processor yields the fastest solution time for networks less than about I300 busses [7, 8]. This is reasonable because for any sparse matrix, regardless of the non-zero pattern, the total computing time required to resolve the upper triangle by using vector row-ordered elimination on an array processor, is I8 bounded by a function of B and in the limit by N. This can be demonstrated by considering an N-dimensional matrix A of minimum bandwidth B. The time required to eliminate one variable from any given row is T - T + B-T , (2-A) s op where Ts and Top are the setup and unit operation times of the array processor function providing scalar-vector multiply followed by vector-vector subtract. This corresponds to the operation illustrated in Figure 2.A. +6.. J+B I g VECTOR 1 1 Tl: VECTOR 2 ‘l . "I H l :— B—z'..'“—B :‘I Figure 2.A Single variable elimination from a band form matrix. 19 In this figure, only the shaded element and vectors l and 2 are required to complete a given inner loop step. This is analogous to the loop on index I of a FORTRAN coded procedure for the same row-ordered elimination as follows: DO )0 K = I, N DO IO J = I, K DO I0 I 8 J+I, J+B lO_ A(K.I) - A(K.I)-A(K,J)*A(J,I). Building on this, the time required to eliminate all the variables of any row K is given by 2 T B°Ts + B .Top (2 5) corresponding to those elements shown shaded in Figure 2.5. The resolved elements to the right of the main diagonal must now be divided by the diagonal entry. This adds one setup time and B more unit operations and Equation 2-5 becomes 3 o o o A .- T (B+l) TS + B (B+l) Top , (2 6) where Top' is now the maximum operation time between the multiply-subtract and division vector functions. For an N-dimensional matrix, N such row eliminations are required. Thus, an upper bound on the total elimination time is given by TN(B) = N-(B+I)-Ts + N°B°(B+I)°Top'. (2-7) 20 K-B-—-C- 'V K V K —> \l\\\\\\\\\\\\\\fi 1 B ____4;i::}.____8 .____:1 Figure 2.5 Full row elimination from a band form matrix. Now, if a parallel pipeline structure such as a systolic array is used, the total elimination time is given by a o o ' - TN(B) (B+l) Ts + N Top . (2 8) As a result, since Ts’ N and Top' are defined variables, to minimize the elimination time implies the minimization of B for both array and parallel pipeline architectures. In order to achieve the most efficient operation inside the proposed VLSI array structure for matrix triangulation, all original load flow bus and line data must be renumbered to reduced band form beforehand. Also, as will be discussed later, the size of the systolic computing structure is dependent on the bandwidth of the matrix due to 2l the structural properties of the hardware algorithm [A]. In conclusion, reordering a matrix to reduced band form prior to triangulation not only reduces the solution time but will also yield bounds on hardware dimensions in an array or pipeline processing environment. 2.A VLSI Systolic Algorithm for Matrix Triangulation A systolic array processor is a system having a two-dimensional configuration of processing elements (PE's) in a parallel pipeline fashion. The processor synchronously pumps data between levels of PE's performing part of an overall computation at each time step such that a regular flow of data is kept up in the network. The algorithms use distributed control achieved by simple local control mechanisms such as stacked one-dimensional arrays of PE's located between rows of latches. Ideally, these hardware algorithms are to be implemented on a single chip or perhaps in a modular design on a few chips with the use of VLSI technology. The modular approach will not be addressed here. VLSI technology, however, places new constraints on computer architects. For high-performance algorithms to be implemented at low cost, the algorithm must possess the properties of regularity, local communication, and parallelism and pipelining [l5]. An early example of this type of algorithms is the systolic array for solving linear systems of algebraic equations proposed by Kung and Leiserson [l6]. This structure, built of simple inner-product step and division function processors, can be used to carry out L-U decomposition on a full matrix, {A}. A revised version appeared, advancing Kung's design, in which the 22 proposed VLSI architecture would perform L-U decomposition of an entire linear system of equations, {A:Q} [3]. However, in both of these examples, practical details such as input/ouput circuits and processing cells designs were not considered. Recently, an improved systolic structure was developed by Hsu and Shanblatt [A]. This structure, by isolating a row of processing cells, will triangulate an arbitrarily large augmented band system {A:Q} to upper triangular form {gIg} aligning the A and g vectors. Moreover, in this work, the potential l/O bottleneck was studied with respect to the overall solution time, and, the processing cells as well as I/O port circuit candidates were designed and evaluated. The structure proposed by Hsu and Shanblatt utilizes both parallel and pipeline concepts and performs the triangulation in 2N+ZB time steps. Moreover, it requires B(B+l) inner-product cells and B division cells, where N and B are the full dimension and the reduced bandwidth of the coefficient matrix, respectively. In order to illustrate this, consider an Nx(N+l) augmented matrix system {A:Q}v with a bandwidth of 3 as depicted in Figure 2.6a. After the upper triangulation has been performed, the system appears as illustrated in Figure 2.6b. The array structure corresponding to this matrix example is shown in Figure 2.7. A port-coefficient time table for the triangulation of this system is given in Table 2.l. In Figure 2.7, the inner-product cells are called MAC's (Aultiply and Add gells) and perform w=xy+z, x-x and y-y. The complementation circle shown on the input of the topmost row of MAC's refers to a two's complement operation. The DC's (inision gells) perform g-e/f and f-f. II 23 312 5‘13 311+ ' bl\ a"22 a23 azh a‘25 ' I"2 a'32 a33 33h ass ' ' b3 31.2 3&3 am. ' bu a52 a53 I b5 I I . I I awn-3) aN(N-2) aN(N-l) aNN ' bu) Figure 2.6a Augmented matrix {A:b }. ”12 “I3 ”It. ' ‘I'IW u22 u23 ”2A u25 ' d2 ”33 ”3:. ”35 ° ' d3 ”III. ”u5 ' d1. 55 NN N Figure 2.6b Augmented upper triangular matrix {u:d}. 2A e f o f DC DC DC 9 y W MAC MAC MAC x ® x 2 y MAC MAC ' MAC MAC MAC MAC I1 .12 I4 15 16 17 r' """ A """"" '4 """ -, 3 X D section 1 MAC MAC MAC 1 L, ”IA. ID Figure 2.7 Computing array structure for matrix of arbitrary dimension with B-3. t7 alA t2N+6 13 32A Table 2.l Port-coefficient time table. I3 “2!. 35 “1A 25 26 The D section is isolated in order to correctly align A vector elements, which subsequently became g vector elements, during the triangulation. This right-hand-side vector is pumped into input port ID and out of output port 0 00 Input and output operands are pumped in and out of the processing array through the I/O ports, I -I l and 01-0“, 0 Synchronism is l 7' D D' provided by latch arrays which are depicted by thick black lines between the rows of PE's. The number of required input ports is given by the full breadth of the matrix and right-hand-side vector input port, or 28+2, while the number of output ports is defined by the bandwidth plus a diagonal and right-hand-side output port, or 8+2. The operands shown in the time table (Table 2.l) are interspaced with zeros and ones to provide proper synchronism among array coefficients. The use of these ”spacers” will be explained in next chapter. As the time table shows, 2N+28 (2N+6 in this example) time slots are required to obtain {gig}. The algorithm may therefore be classified as an 0(N) algorithm. This is on the same order as other systolic array algorithms, which require many more processing cells for this type of problem [3]. In conclusion, all of the above mentioned algorithms have the same important features: regularity, expandability, and extensive use of parallel and pipeline concepts. This makes them particularly well suited for VLSI implementation. Based on these algorithms, a revised version of a VLSI algorithm for band matrix triangulation more conducive to use in the load flow program will be presented next. CHAPTER III SYSTOLIC ARRAY FOR TRIANGULATION OF LOAD FLOW MATRIX FORM The development of an improved version of the VLSI algorithm for fast band matrix triangulation will be presented in this chapter. This algorithm is to replace the routine for matrix triangulation inside a serial computer load flow program such that the overall solution time is reduced at no loss of accuracy. As a starting point towards practical realization, the number of fixed-point bits for load flow convergence is determined because the present algorithm is constrained to a fixed-point architecture due to chip size and lithography limitations. This is done by executing a modified MSU load flow program version on CYBER-750 computer. A method for combining, or interfacing, this structure to the serial host computer will be explained in next chapter. 3.l VLSI Array Structure Modification In standard load» flow studies triangulating the A matrix by a . serial computer program results in a strict upper triangle with a unit diagonal [9]. Therefore, it is necessary to modify the algorithm of [A] to match this requirement. Referring to the structure and time table shown in Figure 2.7 and Table 2.1, the diagonal elements of the resultant upper triangle (uii’ 27 28 i=l,2,...N) pumped out of the 01 port, are not necessarily equal to one. In order to obtain a definate unit diagonal, all of the elements in each row must be divided by the diagonal element of that row before they are pumped out of the array. That is ”ij' = "ij/uii’ (3'1) and di' = di/uii’ (3-2) for all i=l,2,...N and j=i,i+l,...i+8. Since the diagonal elements must be equal to one .after the final division, at any time slot, the maximum number of the elements which must be divided by the diagonal is given by 8+l. This includes 8 A matrix entries and the right-hand-side entry. As a worst case result, by adding an extra row of 8+] division cells to the array, a strict upper triangle may be otained. However, the size of the structure will be increased tremendously since the size of a DC is much larger than that of an MAC [A]. Fortunately, by close examination of the structure, it is found that the additional division as described above can be done by "cycle stealing" instead of the addition of another row of DC's. Focusing now on the present row of DC's in the example, a2], a3] and ah] must be divided by all at time t“. At t5, though, null divisions take place (0+l) and this is the case during every second time slot from then on. In other words, at every ti+8+l’ where 8-3 and i-l,3,5,...2N-l, the operation inside the DC's is zero-divided-by-one. 29 Indeed, the null operations are unnecessary for the process of calculation because the results of these operations are always subsequently multiplied by zero later in the procedure. Therefore, no matter what value is obtained from the division, it has no effect on the overall solution. As a result, cycles stolen during these operations can be used to perform the additional division required to unitize the diagonal. This is accomplished by adding multiplexers (MUX's) and~ latches to the array structure. However, only 8 DC's are available and 8+l divisions are required at a time. Therefore, one additional DC must be added to the structure. In order to illustrate this modified algorithm, the same augmented system {A:Q} with a bandwidth of 3, as depicted in Figure 2.63, is considered. After the upper triangulation is complete, the system of equations is as shown in Figure 3.l. The modified computing structure r I I .7 I “12 uI3 ”IA ' “I l u23' ”2A. ”25' I d2' I UBAI ”35' I d3' l ”AS. I dA' Figure 3.l Augmented strict upper triangular matrix {U'Id'} with unit diagonal. 30 corresponding to this example is illustrated in Figure 3.2. A port-coefficient time table for the triangulation of this system is given in Table 3.l. The multiplexers (MUX's) shown in Figure 3.2 are controlled by a simple true and false signal synchronous with the clock. At every even time slot, t2, t“, t6..., all lines marked L are selected to pass the I data. However, at every odd time slot the L lines are selected. 2 The stack section (S section) depicted in the same figure is composed of arrays of latches which serve as an FIFO (Airst-ln-fiirst-Qut) memory stacks. The number of stacks, as well as the number of latches inside each stack, varies with B. The details of their relationship will be explained in Section 3.2.2. Referring to this structure (Figure 3.2) and the corresponding time table (Table 3.l), during the period tA all lines marked LI are selected to pass the operands all’ al2’ a13, and alA to the row of DC's and enter as defined in Table 2.l) into stack S (same as u At the next an ll 5' slot, t5, Lz's are selected to pass and store operands. Simultaneously, ull propagates to the next stage of $5 and u}2 Is put onto SA' At t6. L1's are selected again to pass operands as well as push u22 onto 55. At the same time, while ”II and u]2 are pushed further along in S u]3 Is put onto S3. Eventually, at the end of t8, d], ulA’ u 5 and SA’ 13’ ”12 ll are at the topmost latch of SI through 55, respectively. Then, and u during t Lz's are selected such that d and "IA-”I2 are all 99 I! ' ' ‘ I I l I simultaneously dIVIded by "II In the DC 5. Therefore, u]2 , u]3 , ulA 1-03 and OD at the end of t9. After this at every two time slots, a new row of the upper triangle and d]' are pumped out of output ports, 0 3I ,5" S Section D Section 3 Figure 3.2 Revised computing array structure for matrix of arbitrary dimension with 8-3. t2N+7 alA 25 336 Table 3.] I2 I3 aIz an 6‘23 32A 33A a35 31.5 3A6 Revised port-coefficient time table. 32 I5 I6 I7 a2I a31 aAl a32 aA2 a52 03 0D ”IA' dI uzs' d2 U36 d3 “I. d5 dN 33 elements with an implied unit diagonal, is pumped out. Because of this implied unit diagonal, the number of output ports in this modified structure is one less than that of the original structure. 3.2 Computing Structure and Timing The special purpose peripherial array processor for band matrix triangulation is mainly composed of input/output circuits and a computing array structure. This processor, consisting of those circuits, is to be attached to a serial host computer. The overhead timing and data controls of the peripheral processor are provided by the host computer through interconnected lines. The internal timing control can be supplied by either a built-in clock circuit or the host computer; however, the former method is preferred because it only requires a small increase in overall hardware size. Internal timing controlled by the host computer immediately increases the connection complexity between the two machines. Operand I/O for the computing structure fundamentally suffers from a potential bottleneck problem due to practical fabrication limitations of pin-out and packaging considerations. Fortunately, through careful design of I/O circuits, the I/O bottleneck can be avoided [17]. This will be shown among the results presented in Section 6.l. The computing array depicted in Figure 3.2 requires only five schematic logic circuit diagrams of function blocks, namely multiplexer, latch, stack, MAC and DC. The design of each function block is based on previous work [A] and crucial details of these designs, as well as 3A timing control, are explained in the following sections. 3.2.l Multiplexer A 2-to-l multiplexer (2:l MUX) is simply constructed by a pair of pass transistors under mutually exclusive control of a single signal line with an additional inverter. This is illustrated in Figure 3.3. The pulse width and frequency of the control signal (CS) depend on the sequence of data flow or operating time step. Whenever this signal is true (high), the right-hand-side transistor is turned on. This allows the data in line LI to pass through. If the signal is false (low), the Ieft-hand-side transistor will be turned on allowing data in L2 to pass. The MUX's inside the computing structure are built solely from this basic block and are simply controlled by a signal synchronous with the Figure 3.3 2:l multiplexer. 35 system clock. The frequency of this signal is half of that of the signal controlling the latches between rows of PE's. Therefore, the L1 and L2 lines are selected alternately. This control signal, which will be called a, is shown in the timing diagram presented in Section 3.2.A.2. Due to the nature of the modified triangulation algorithm, the L1 lines are selected at either every odd or even time slot depending on the bandwidth. If 8 is an odd number, as in the previous example where 8-3, LI lines will be selected at every even time slot. This is because it requires 8 time steps to initially fill up the topmost row of MAC's. Therefore, at every t8+l+i’ where i-0,2,A,...2N-2, all of the L1 lines must be selected to pass the operands to the DC's as well as to the stack. 3.2.2 Latch and Stack All latches are dynamic and consist of two pass transistors and two inverters [l8]. A two-phase non-overlapping clock is required to load and refresh the latch data. The memory stacks, 51-55, inside the 5 section shown in Figure 3.2, are merely arrays of dynamic latches. The main function of these stacks is to consecutively store the individual elements of each row of the resolved upper triangle and right-hand-side vector. When all elements. of a complete row are available in the topmost latch of each stack, they are simultaneously enabled into the DC's through the L MUX lines. 2 However, since L2 lines are selected according to the even/odd control 36 signals, the operands (a complete row of elements) may have to wait another cycle before they can reach the DC's. This is to avoid a potential data flow conflict that may happen in the MUX's. Therefore, another function of the stacks is to serve as a delay circuit if such conflict exists. By close inspection of the structure (Figure 3.2) it is seen that the last element of a particular row that is put onto the stack is the g vector element. In general, these 9 vector elements are actually available from the upper left corner latch inside the 0 section at every odd time slot starting at t . However, L lines are selected only at 28+] 2 every t8+i slot and Ll lines are selected at every Therefore, all the elememts of a complete row can be pumped into the t8+l+i slot. DC's through the MUX's at either t28+l+i if B is odd or t28+i if B is even (where i-2,A,6,...2N). In the example of 8-3, L2 lines are selected at every odd time slot and one complete row of elements is also available at every odd time slot starting from t7. Therefore, there is one latch inside the stack SI and all of the other stacks have an extra latch for producing a delay of one time step such that the sequence of data flow is properly adjusted. But, for an example of B being even, these L2 lines are selected at every even time slot, and a complete row is obtained at every t28+l+i’ for i-0,2,A,...2N-2, which is always odd. Therefore, this complete row can be latched directly into the DC's at t which 28+2+i is always even. As a result, a row of latches inside the stack section actually can be saved without disturbing the proper sequence of data flow if B is even. 37 Based on these facts, the number of stacks, j, and the number of latches, L, inside each stack Sk are proportional to the bandwidth, 8, and are given as follows: J = z-FB/zl + I. (3-3) and L(Sk) = k, (3-A) where k B l,2,...j. The notation [x] refers to the smallest integer that is not less than the real number x. 3.2.3 MAC and DC The structure of an MAC is merely a Baugh-Wooley multiplier with an extra row of full-adders at its bottom edge [A, IS]. The convergence division algorithm [20] is chosen in the DC design because the number of iteration steps can be determined a griori in terms of word size and software convergence checking is not required. In addition, since this algorithm requires mostly iterative multipling procedures, the Baugh-Wooley algorithm can be applied again such that there is no loss in structural regularity. This division algorithm can be partitioned into several subtasks. Thus, the algorithm can be realized in a pipeline structure having rows of latches located between rows of multipliers. The details of the numerical formulation of this algorithm can be found in [20] and an 38 example of the pipeline structure is presented in [A]. Due to the constraint of synchronous data flow between the rows of DC's and MAC's, the latch-to-latch time between rows of MAC's is constrainted by that of the DC. This is because the operands that will be pumped into the DC's are obtained from the topmost row of MAC's but the calculation of these operands requires previous results from the DC's first. In other words, there is a tightly coupled linkage between the DC's and the topmost MAC's. Therefore, eventhough the division algorithm can be pipelined, the worst case time slot is still the sum of all segment times. So, pipelining the division algorithm does not help in increasing throughput. In addition, due to the nature of the convergence division algorithm, the number of iteration steps, m, of multiplication varies with the word size of the operands. The number of iteration steps is given by m - [log2(n)]. (3‘5) where n is the word size of the operands. Therefore, the number of rows of multipliers in the pipeline structure is also m. This implies that the hardware size of a DC is directly proportional to m times the number of multipliers in each row. As a result, an improved DC structure with drastically reduced hardware having approximately the same speed as before is developed next. The improved structure has only one row of multipliers and is independent of m. This original design is illustrated in Figure 3.A. In this figure, the latches and MUX's have the same structure as 39 SHIFT REGISTER SHIFT BIT ..... COUNT N D n n DIVIDEND ...;. DIVISOR * ...... _ SHIFT BIT SHIFT SHIFT COUNT N' 1-6 1L2 ALT A ,L1 A? I MUX I I] MUX I 1-5i ADDER I (2-(1-51)) nxn I- J1+61 , nxn MULTIPLIER MULTIPLIER It LATCH i_, LATCH l N'(1+6)...(1+61) Is Figure 3.A An n-by-n convergence division algorithm based DC. A0 described in Section 3.2.l and Section 3.2.2. However, the pulse width and frequency of the signals for these latches and MUX's are different. These control signals will be explained and shown in the timing diagram presented in the next section. After the dividend and divisor, marked as N and D in the figure, have been shifted, lines marked L] are selected to pass the shifted results to the adder and subsequently to the multipliers. This shifting, integral to the convergence division algorithm, is necessary in order to obtain a faster convergence [20]. The Ll lines are selected only once in every PE's segment time slot. Then, the partial results, N'(l+5) and (l-8)(l+5), held by the latches are looped back to the adder and multipliers through the L2 lines. This loop continues until the result, N'(l+5)...(l+53), where i-Zm (m is defined in Equation 3'5). is obtained. Finally, a convergent quotient, Q, is available after the result has been shifted back according to the original shift bit count. 3.2.A Timing Recall that the whole computing array is pipelined in nature and therefore the operation time, top’ must be at least the worst case segment time. In the array structure shown in Figure 3.2, the maximum segment time should be the propagation delay from the MUX's to those latches between the DC's and MAC's. As a result, ' t + tDC + tLATCH' (3’6) SInce tHux and tLATCH are sImply the total delay of two pass transIstors Al and two inverters, the important focus must be placed on tDC' 3.2.A.l Internal DC Timing Referring to Figure 3.A, the number of loops routing from the MUX through the adder, multiplier, latch and back to the MUX is given by m (Equation 3-5). Let ts, ta, and tx be the shifting, adding, and multipling time, respectively. Then t -- 2-t +In-(t s + + DC ta tx t MUX I LATCH)° (3'7) The control signals for the MUX's and latches inside a DC are not the same as those controlling MUX's and latches in the external computing array. The two-phase non-overlapping clock signals of the latches in the DC, A] and A5, must be at a frequency at least m times faster than l/top such that m iteration steps, plus the shifting, can be done within top' Specifically, this frequency, fL, can be obtained from Equation 3-6 and Equation 3-7. The signal, B, controlling the MUX's inside the DC must be synchronous to that for the latches as described above. At every first cycle within top’ Ll lines must be selected in order to pass the initial operands, N' and (l-8) (See Figure 3.A). Then, during the remaining cycles, L2 lines must be selected instead. Therefore, assuming relative high voltage is "true" or "I", the control signal for these MUX's can be veiwed as a pulse signal with a pulse width approximately equal to the reciprocal of the frequency of the control signal of the latches, or l/fL. A2 3.2.A.2 External Computing Array Timing The rows of latches amid the PE's, as well as inside the stack section, oversee the operand timing between each row of PE's (See Figure 3.2). These latches are controlled by two-phase non-overlapping clock signals, ¢l and ¢2. The time period of these signals, should be approximately equal to the maximum segment time, top’ which implies that the frequency of the clock is l/top. As a result, based on the above assumption, an approximate function block timing diagram is shown in Figure 3.5. In this figure, W and A are the control signals for the latches and MUX's inside the DC, and ¢ and I: are those for the latches and MUX's outside the DC, respectively. Notice that 5 iteration steps of multiplication inside the DC (word size of 32 bits) is assumed for illustration. Finally, the overall throughput time depending on the bandwidth of the system matrix, B, is derived. As shown in Table 3.l, the time required for triangulating an Nx(N+l) matrix with 8-3 is 2N+28+l or 2N+7 time slots. This is true for all odd 8. However, for all even 8, the time required for triangulating the same system is 2N+28. This is because an extra row of latches inside the stack section is removed (See Section 3.2.2). Therefore, a general equation expressing the overall througput time, T(N,B), is T(N.a) - top-(z-N + a + 2-I'a/2'I). (3-8) A3 II II— I Figure 3.5 Latch and multiplexer control signals timing diagram. AA 3.3 Fixed-Point Consideration for Load Flow Convergence The computing structure presented in this thesis is basically constrained to a fixed-point binary number system due to the nature of its processing element designs dictated by available chip real estate. However, an approach circumventing a true floating-point solution is to consider processing large fixed-point numbers which are pre-adjusted mantissas of previous floating-point coefficients. In order to allow the PE's of the computing array to operate in a common fixed-point number system, the selected position for the radix point is at the left extreme of the magnitude position of the number. Thus, the radix point lies between the sign bit and most significant bit which dictates that all fixed-point numbers be strictly less than one. In adjusting floating-point numbers to this fixed-point scheme the host computer or an intermediate processor must first sort out the maximum operand among those that will be sent to the processing array. The mantissa value of this maximum operand is normalized and the exponent value is stored. Then, all of the operands are pre-adjusted in floating-point to the same exponent value as the maximum operand. Next, the mantissa of each operand is truncated according to the number of fixed-point bits defined by the array structure design. Upon return of the upper triangle and right-hand-side vector elements, post-adjustment or post-normalization is done by the host computer or the intermediate processor according to the retained fixed exponent value. A5 Although this approach and the ultimate convergence of the overall non-linear system are extremely vulnerable to the dynamic range of the input matrix entries, the approach still can be applied to the system matrix derived from load flow analysis if all of its coefficients are well tempered and of tight dynamic range. Since the number of fixed-point bits of each word governs this dynamic range, a study is made to determine the minimum number of bits for which load flow convergence is obtained. First, a subroutine, called NORMAL, is designed and incorporated into the MSU load flow program version. This subroutine serves to normalize all numbers that are involved in the calculation of the upper triangle and right-hand-side vector elements. These numbers are actually the elements of the "working rows" described in Section 2.2.l, plus any numbers that are directly or indirectly calculated from these elements. Specifically, these elements of the working rows are H, N, J, L, AP and A0 as depicted in Equation 2.2. The main function of the subroutine NORMAL is to simulate the effect of fixed-point round off. Whenever a number is transferred from ELIM to this subroutine, the number is first converted to a fixed-point binary format and then truncated to a predefined number of bits. Finally, this truncated binary number is converted back to the original representation and returned thus incorporating the numerical effect of truncation. The whole program (MSU load flow program version with edited ELIM and annexed NORMAL) is then individually executed with networks consisting of A3, A9, l05 and ISO busses. For each network, the program A6 is consecutively executed varying the number of fixed-point bits and determining whether or not convergence results. Too few bits cause undue rounding error and lead to divergence of the overall Newton-Raphson formulation. Therefore, the purpose of the above procedure is to determine the minimum necessary number of fixed-point bits (word size) required for convergence. Subsequently, this number will be used as the parameter in the simulation of the VLSI array structure. CHAPTER IV A PROPOSED HOST PROCESSOR/PROCESSING ARRAY INTERFACE The most effective near term approach for enhancing the load flow computation speed is to reduce the solution time of the system equations derived from the Newton-Raphson formulation. This can be achieved by attaching the VLSI systolic array processor presented in the previous chapter to a serial host computer such that the triangulation is efficiently executed in a parallel pipeline fashion. Thus, a designate goal of this research is to simulate the interface of these two computing structures and quantify the enhanced processing throughput of load flow solution. Generally, incompatibility arises when two computing systems, differing in number format or representation, are assembled. For example, the proposed VLSI array structure is limited to two's complement fixed-point fractional binary number operation while the host computer used in this project, the MSU CYBER-750, is basically a one's complement floating-point binary number computing system. In addition, the position of the radix point of floating-point numbers inside the . CYBER-750 is at the right extreme of the mantissa. Therefore, pre- and post-adjustments of the data (operands) transferring back and forth between these two systems must be made to avoid a modification or redesign of either system. A7 A8 As described in Section 3.3, the operand adjustment can be done by either the host computer or an intermediate processor. More details of these two methods are discussed in the following sections. A.l Operand Adjustment by Host Processor An obvious advantage of using the mainframe computer for operand adjustment is a savings of additional hardware cost. This is because the VLSI array processor can be attached peripherally to the host computer by simple interconnection with data and control lines. However, this is under the assumption that the host computer can provide an adequately fast memory access through an operational DMA channel such that data transfer time between the host computer and the VLSI processor is less than the I/O time of the VLSI array. First, the structure of the load flow program must be reviewed. Inside the subroutine MATGEN, the program starts looping over all the nodes in the system. A "working row” is generated for each node and then eliminated (triangulated) by calling subroutine ELIM. The elements of this working row are actually H's, J's, N's, L's, AP's and AQ's as depicted in Equation 2-2, which are exactly the elements of the augmented matrix. Therefore, this working row can at most be composed of two single rows of elements. The column size of the Jacobian matrix is, at most, equal to 2N' for a system with N'+l busses including the slack bus. Thus, the maximum number of single elements is 2N'x2N' plus 2N' right-hand-side (augmented) vector elements. If this system is A9 reduced to a bandwidth of 8', the maximum number of elements will become 2N'(AB'+3)-(28'+l)(28'+2) plus 2N' augmented vector elements. In other words, the upper bound Jacobian matrix bandwidth, 8, and dimension, N, are expressed as follows: 8 8 2°B' + I, (A-I) N - 2-N'. (A-2) The upper bound matrix full breadth then becomes AB'+3. Since the elimination (triangular factorization only) procedure is carried out for each node, it can be described as a piece-wise elimination process. Therefore, the total elimination time is the sum of all the nodal processing times. Also, the operation of finding the operand of maximum absolute value can be started when each working row is generated. Finally, according to the normalization procedure described in Section 3.3, the pre- and post-adjustments of the operands carried out by the host computer, the CYBER-750, can be summarized in the program flowcharts illustrated in Figure A.l and Figure A.2. The number of the operands which will be normalized is the only parameter in the program. Unfortunately, as described before, this number is directly proportional to NxB. Therefore, the time required for the normalization increases as the size of the network increases, which will eventually slow down the overall solution throughput. INPUT MATRIX ELEMENTS 81,.‘ 0 08k INCREMENT S's MANTISSA OVERFLOW ? NO 50 OVERFLOW ? SHIFT RIGHT AND INCREMENT ai's EXP t J j=S's EXP - ai's EXP 's MANTIS. LEFT 3 BITS SHIFT RIGHT AND INCREMENT S ' s EXEONENT I s! STORE 8 EXPONENT STORE a '8 MANTIS. I Figure A.l o INCREMENT ‘ ai's MANTIS. Program flowchart for. operand pre-adjustment. 51 C Sim) INPUT MATRIX ELEMENTS u1,...uk I {9 LET j a 8'8 EXPONENT - :. SHIFT LEFT AND 3 = 3'1 SHIFT LEFT I AND :I = 1-1 Y STORE u1 CONCATENATING WITH :1 DECREMENT ui (MANTIS.) Figure A.2 Program flowchart for operand post-adjustment. 52 A.2 Operand Adjustment by Intermediate Processor In this section, a hardware interface system between a host computer and the VLSI array processor is discussed in terms of function block operation. This interface, an intermediate processor, is responsible for the adjustments described before. It is assumed here that the operation of finding the maximum absolute operand is done by the host computer during the procedure of generating the Jacobian matrix. Therefore, the intermediate processor only serves to pre- and post-adjust the operands. The advantage of using this intermediate processor is a savings of time when compared to the method explained in the previous section. The overall block structure of the intermediate processor, which operates in pipeline fashion, is shown in Figure A.3. In this figure, the mantissa and exponent busses have m and e lines, respectively. The number m must be equal to or greater than the number of fixed-point bits, n, of the operand width inside the VLSI array processor. If m is greater than n, these n lines (input to and output from the array processor) must be connected to the most significant bit (MSB) positions (including the sign bit) of m lines. Likewise, the m lines inpUt from and output to the host computer will be connected to the MSB's of the mantissa bus lines inside the host computer. The number e is the number of bits in the exponent field of the host computer. The left half portion of the structure performs the pre-adjustment of the operands £53 | e I BUFFER ] I BUFFER I I I BUFFER ] l BUFFER I I I I I I I ; : I I I I I l ADJUSTER l ADJUSTER __i, _ I [SUBTRACTER] I SUBTRACTER I' SHE-[5%, I;-I ----- —Jl I I I ‘ I DECODER I I DECODER A I I I I I I RIGHT l SHIFTER I I i LZDC i | i_ 1* II I , I I BUFFER | I BUFFER I i-m I +m I VLSI ARRAY PROCESSOR Figure A.3 Function block structure of an intermediate processor for pre- and post-adjustment. 5h traveling from the host computer. The right half portion operates the post-adjustment of the operands resulting from the VLSI processor. Both of the adjusters depicted in the same figure function as a number format adapter. Therefore, the internal structure is dependent on the format of the input and output operands. The adjuster inside the left half portion converts the mantissa of an input operand in any format to a two's complement form. For example, if the normalized mantissa and exponent of the input operands are in one's complement representation (e.g. from the CYBER-750) and the radix point is at either the right or left extreme of the magnitude of the mantissa, the adjuster must include an incrementer for converting the mantissa to two's complement form. No increment of the exponent is required because the exponent will not be transferred to the VLSI processor. Thus, the exponent can be in either biased or unbiased form. In case an overflow results after the mantissa's increment, the whole mantissa except the sign bit will be shifted right once with trailing ”0” (if the number is positive) or "I" (if negative) and its corresponding exponent value must be incremented by one. A possible design of the adjuster is illustrated in Figure h.h based on the pipeline concept. If, however, the input operands are in two's complement format, this .portion of the adjuster can be omitted. If the input operands are not in a normalized form, a leading-zero detection circuit (LZDC), a decoder and a left shifter for the mantissa and a subtracter for the exponent may be required additionally. Even in this case the adjuster can still be divided into pipeline segments. Therefore, it is assumed that the adjuster can always be designed in a pipeline‘ structure which 55 Mant ssa Exponent m ”“9 I_____—__-___—I ’ overflow='l' I I INCREMENTER I"—— ‘ I I I I I I 1 l I I RIGHT SHIFTER I“_ ADDER I I Figure h.h A sample internal block structure of an adjuster. is crucial to avoiding a potential interface bottleneck. The adjuster inside the right half portion of the processor always changes the input mantissa from two's complement form back into the original number format. For the same example of one's complement format, the input two's complement mantissa is initially decremented and then a test determines if the M58 next to the sign bit is still "l". If not, the mantissa without the sign bit will be shifted left once with trailing "O" or "i" and the corresponding exponent value must be decremented by one. The structure of this adjuster is thus composed of a decrementer, a subtracter (or another decrementer) and a left shifter which are interconnected similar to the structure shown in Figure h.h. 56 During the pre-adjustment, the maximum operand found beforehand appears first followed by other operands in row-ordered form. These operands are pumped into the pipeline structure through the receiving buffers. At the second stage, the latches marked as L5 in Figure h.3 have the same structure as that of the others, but they have a different input control. Once the very first exponent, the exponent of the maximum operand, has been input into these latches, no other exponents can be put into them. Therefore, this exponent will be kept stable by constant refreshing for use as the subtrahend for the subtracter in the next stage as well as in the post-adjustment portion. .As a result, the maximum exponent is consistently subtracted from all other exponents. Finally, before all of the adjusted mantissas, except the one of the maximum operand, are pumped into the VLSI processor through the output buffer, they are shifted right (with trailing "0" or "l”) corresponding to the result obtained from the exponent's subtraction. During the post-normalization, triangulated results are returned from the array processor through a receiving buffer. These results are pumped into a leading-zero detection circuit (LZDC). Subsequently, the result of the detection is used as a minuend and decoded as a control signal to be sent to a subtracter and a left shifter. Inside the shifter, the mantissa is shifted with trailing "O" or ”i" again depending on its sign. Eventually, after the shifted mantissas and subtracted exponents have been adjusted back to the original number format, they are transferred to the host computer through the output buffers. All of these output buffers as well as the receiving buffers can be first-in-first-out memory buffers if necessary for speed matching 57 (e.g. if a DNA channel is operational). Latches control the timing of the process in the same way as those in the VLSI array structure. Therefore, the processing time, Tp, can be described by the standard pipeline processing equation, Tp - ts + (n-l)°top, (h-3) where t5 = setup time, tOp - longest segment time, n = number of operands. The segment time, top’ can be determined from the structure shown in Figure A.3. Assuming the adjuster can be pipelined (See Figure A.A), all of the shifters are barrel shifters, subtraction is pipelined into a complementation and addition, and, the decoder and the LZDC are sufficiently fast (a total 5 unit gate delay LZDC has been described by Chang and Fisher [Zl]), top is simply the addition time. As a whole, performing the pre- and post-adjustments of the operands by this intermediate processor is better than utilizing the host computer from the view point of data transfer time and arrangement. This is because execution in this processor is pipelined. CHAPTER V SIMULATION DEVELOPMENT The main purpose of the circuit simulation is to quantify the overall triangulation time as well as the chip area of the improved VLSI array structure including I/O circuits. The triangulation portion represents the actual triangular factorization encountered in the load flow calculation except that fixed-point arithmetic is employed. Therefore, the piece-wise elimination time as discussed in Section h.l will be replaced by the simulated processing time. The estimation of the chip area, though it does not affect the triangulating process, will project a basic trade-off comparison of chip size versus matrix bandwidth. Several fundamental parameters that may be manipulated in the simulation are minimum lithographic linewidth (A), word size, matrix or network bandwidth (8 or B'), and matrix or network dimension (N or N'+I) . In particular, the manipulation of A can track current trends in I. C. fabrication technology and thus projects the reality of the implementation of the proposed VLSI systolic array algorithm. A "A model", a layout geometry design tool introduced by Head and Conway [18], assumes that the permissible layout linewidths and spaces along the diffusion, polysilicon and metal lines can all be scaled down in linear dimension. Therefore, the geometries of layout elements can be 58 59 described proportionally in terms of current and projected minimum lithographic linewidth. As the I. C. fabrication technology advances, X decreases and thus the layout geometries also decrease proportionally. In addition to the A model, a revised “7 model", a simple arithmetic model for computing propagation delay of a transistor level circuit, is used to relate the switching speed of a gate and overall processing time of triangulation [A, IS]. The original 1 model acknowledges that the delay time of a node depends on the total capacitance of that node together with the gate capacitance and transit time of the driving transistor [18]. However, the transit time, 1, is the fundamental limit on the switching speed of the gate. In a practical circuit, the speed of MOS (fietal-Qxide-§emiconductor) device operation is determined by the speed with which capacitors can be charged and discharged [22]. Therefore, the revised model assumes that "T" is the discharge time for a basic inverter coupled with only one identical inverter. Also the gate effective capacitance of every logip gate is assumed equal. Thus, the simulation results in the propagation delay calculation are more realistic than the 1 model since it uses discharge time, T, not just transit time, 7. Moreover, this simplifies the process of estimating total load capacitance [A]. The number of bits per word (word size) is determined by the necessary number of fixed-point bits required for load flow convergence (See Section 3.3). The Jacobian matrix bandwidth and dimension, however, are .not the same as the network bandwidth and dimension. As mentioned in Section h.l, corresponding to a power utility network with 60 a system bandwidth 8' and bus dimension N'+l the upper bound Jacobian matrix bandwidth and dimensions, 8 and NxN, are given by 28'+l and 2N'x2N', respectively. This is of course due to multiple entries (up to 2x2) for the particular partial derivatives. The upper bound dimensions of the right-hand-side vector then becomes 2N'xl. 5.1 Chip Area Computation The major hardware design of the VLSI array processor in this research is.partitioned into three distinct parts, namely computing structure, input and output circuits. The design of I/O circuits for the array structure is based on the choice from several I/O circuit candidates [1?]. The input circuit, cataloged as a ”data controlled" circuit, pumps the data operands into the destination PE's without using any channel-selecting control signals. It ultilizes the FIFO stack concept employing n sets (same as the number of bits per word) of Min-stage shift register chains (where Nin is the number of input ports or 28+2). Therefore, this circuit is composed of Nin arrays of 1-bit shift registers. The output circuit, cataloged as an ”SCS" (Shift-Register Control Sequence) circuit, is controlled by an SCS signal. This sequence, a single "1" followed by B "0's", is synchronously pumped into a 1-bit shift register chain and is used to select the proper operand output channel. When the "out data operands, where Nout is the number of output ports or 8+l (See Figure 3.2), are ready to be clocked out of the array structure, N output channels are sequentially selected for out 61 routing the operands. Thus, this output circuit consists of Nout l-bit shift registers plus Nou arrays of buffers and pass transistors. Finer t details of these I/O circuits can be found in [17]. Since the layout of processing elements (PE's), multiplexers and latches in the computing array is regularly connected and supports local communication (i.e., nearest neighbors), it is possible to model each unit as a functional module and tesselate these modules into large building blocks. Also, the fundamental building blocks, such as full-adders in the PE's, can be modeled and likewise tesselated. In the same manner, the I/O circuits are modeled as unit modules such as l-bit shift registers and buffers. As a result, the I/O circuits and computing structure taken together require that relatively few specific modules be laid out by hand and quantified (with respect to the parameter A) as per the technique of Head and Conway. Specifically, these modules include a full-adder, l-bit shift register, buffer, latch, 2-input NAND gate, 2:1 multiplexer and pass transistor. 5.2 Throughput Computation The localized connectivity of the computing and I/O structure allows straightforward computation of delay parameters. Quantification of module delay involved summing the active gates' propagation time, with consideration of loading due to fan-out, and internal communication path delay. Diffusion line length and unit length delay were considered the significant path delay parameters. Moreover, 'intermodule line 62 length was negligible due to the tesselated nature of the structures. Utilizing the revised 1 model, the active gate delay time is based on T. It is assumed that T is linearly dependent on A and when A93 microns, T=O.6 nsec [18]. An upper bound on the diffusion line lengths were determined for each module as a function of line length squared. Then, using the fact that transit time is 100 nsec for a 10 millimeter length line [18], communication path delays were obtained. Finally, the total module delay is the sum of active gate delays and communication path delays. Thus, once the maximum segment delay is found, the total processing time (throughput) for triangulating a band system matrix can be obtained by using the Equation 3-8. However, the I/O delay time must be considered and taken into the account of throughput calculation. Therefore, assuming no I/O bottleneck, the overall throughput time including I/O time, tin and to is estimated ut’ by‘modifing Equation 3-8 and is shown as below. T(N.B) a tin + top-(z-N + B + 2°rB/2]) + to . (5-1) Ut where tin and tout are also known as the pipeline setup time and flush time, respectively. 5.3 Significant Module Data Chip area and propagation delay computational methods have been discussed in the previous sections. Based on the data from [A], Table 5.1 presents results of these calculations for each fundamental building block. 63 Table 5.1 Significant module data. Active Gate Module Class Areagjx K2) Delay (x T) Full-Adder 65 x 5h 28 l-Bit Shift Register 23 x 17 18 Buffer 23 x 17 2 Latch 23 X I7 I3 2-Input NAND Gate 19 x 5 8 2:1 Multiplexer 6 x 6 1 Pass Transistor 3 x 3 l CHAPTER VI RESULTS AND CONCLUSIONS The minimum number of fixed-point bits per word required for load flow convergence of several different bus networks as determined by running the revised MSU load flow program version described in Section 3.3 will be presented first. As explained in that section, the convergence of the overall Newton-Raphson formulation in the revised program is limited by the number of fixed-point bits (word size). Therefore, by varying this number, several load flow program results are obtained and shown in Table 6.1. This table indicates that the minimum numbers of bits per word required for the convergence of the test networks are 28 and 32. As a second step, the circuit simulations comparing the throughput speed and chip size versus matrix bandwidth are presented. The time results selected from the simulations are compared with those from previous works to promote the significance of using a VLSI systolic computing array structure for load flow computation. Finally, conclusions will be drawn and summarized from all of the results obtained from this research. 6h Network Size Busses Band- width I03 “9 105 150 11 I7 16 65 Table 6.1 Convergence results in number of iterations of MSU load flow program version. a Results of Convergence Original MSU Revised MSU Program Version Program Version Number of Number of Number of Iterations Fixed-Point Iterations Bits per Word 6 I6 12 28 6 h 16 Divergent 28 5 32 h 6 28 6 32 6 6 28 7 32 R Tolerant Real Power Mismatch (PMM) - 0.15 Tolerant Reactive Power Mismatch (QMM) - 0.25 66 6.1 Circuit Simulation Results It has been predicted that the patterning resolution limitation of optical lithography will be about 0.5 micron in the early twenty-first century [23]. Recently, however, many I. C. designers have been exploring new and promising techniques including electron-beam and X-ray lithography. With electron-beam methods, for example, future linewidths of 0.2 and 0.125 microns have been estimated as feasible [2A, 25]. For these reasons, 0.8, 0.5 and 0.2 micron linewidths have been chosen as the realistic goals of mid and late 1980's and early 1990's VLSI capability, and are used as quantification parameters in the simulations. Another parameter used throughout the simulations is the word size of 32 bits, which is chosen in order to assure convergence of all test networks. First, the results of a general circuit simulation with the selected parameters and the assumption of B-O.1N (where B and N are matrix bandwidth and dimension, respectively) are illustrated graphically in Figure 6.1 and Figure 6.2. Figure 6.1 contains three curves showing entire chip edge size versus matrix bandwidth with linewidths of 0.8, 0.5 and 0.2 microns. A projection of a maximum technologically feasible chip size has been made which predicts the future I. C. chip edge to be about 1.5 cm by the late 1980's [23]. Under this assumption, using a 0.5 micron linewidth the matrix bandwidth, 8, is limited to about 13. To extrapolate this result CHIP EDGE IN CH 67 O “3 w"! CHIP EDGE VS BFINOIIIDTHI o + 0.8 MICRON LINEIIIDTII a: x 0.5 MICRON LINEIIIDTI'I v" o 0.2 MICRON LINENIDTH O ‘3 *1 c, I “I m‘l O 't N! D “2 D “3 O! c: c.’ c: I’ If If I l I If T8 T I T I I l 2 4 6 8 1012141618202224262830 BHNDNIDTH Figure 6.1 Entire chip edge size versus matrix bandwidth for 0.8, 0.5, and 0.2 micron linewidths. 68 1-40 TIME VS BHNDNIDTH 4- 0.8 NICRON LINENIDTH X 0.5 I‘IICRON LINENIDTI‘I O 0.2 NICRON LINENIDTH 0.80 1:00 1.20 1 0.60 1 TIME IN HILLI-SEC O-4O O-ZO O-OO r I I I l l l r I l l l l 6 8 10 12 14 16 18 20 22 24 23 28 30 BHNDNIDTH I 4 ~— Figure 6.2 Entire chip propagation delay versus matrix band- width for 0.8, 0.5, and 0.2 micron linewidths. 69 to an actual electrical utility network size, an additional aspect must be first considered. As mentioned in Section A.l, the upper bound of the Jacobian matrix bandwidth and dimension are 28'+l and 2N', where 8' and N'+l are the utility network bandwidth and dimension, respectively. Thus, the result indicates that a power system of approximately 61 busses could be solved on a single chip assuming late 1980's technology (0.5 micron linewidth). And, if early 1990's technology (0.2 micron linewidth) with the same chip edge size is assumed, systems of about 171 busses could be solved on one chip. 0f greater significance is the fact that the triangulation time, or total chip propagation delay, is dramatically improved over traditional serial techniques. As depicted in the graph of time versus bandwidth (Figure 6.2), using the same 0.5 and 0.2 micron linewidths as before, the triangulation times of the networks of 61 and 171 busses are 0.266 and 0.286 milliseconds, respectively. Next, by using the details of the simulation time results of several major block structures, it can be shown that this type of systolic array architecture will not cause a serious I/O bottleneck of operands. As shown in Table 6.2, it is unquestionable that the DC segment time (tDc) completely dominates over the MAC time (t INPORT MAC)’ time (tin) and OUTPORT time (t ) at the given bandwidths and out linewidths. The INPORT time is actually the time required to set up the operands for the first row of input latches inside the computing structure (See Figure 3.2). Therefore, this INPORT time is proportional to the number of input ports or 28+2. Likewise, the OUTPORT time varies with the number of output ports or 8+l. However, the DC time as well as 30 20 10 70 Table 6.2 Simulation time results of A ImicronI l/O ports, MAC and DC. t t t in out MAC InsecI (nsec) Insec) 178.6 9A.7 293.8 111.6 59.2 183.6 hh.6 23.7 73.h 121.0 65.9 293.8 75.6 h1.2 183.6 30.2 16.5 73.h 63.h 37.1 293.8 39.6 23.2 183.6 15.8 9.3 73.h 38-6 22-7 293.8 21.6 lh.2 183.6 8.6 5.7 73-“ DC InsecI lh80.0 925.0 370.0 1h80.0 925.0 370.0 lb80.0 925.0 370.0 lh80.0 925.0 370.0 71 the MAC time is dependent on the word size but not B. As a result, an I/O bottleneck will happen if and only if B is large enough to make t > t As shown in Table 6.3, this l/D bottleneck occurs when 8 is in DC' greater than 255 which implies 8' and N'+l are equal to 127 and 1271, respectively. The bottleneck, though, never occurs in this work since only networks of less than or equal to 150 busses are studied. Now, utilizing the same simulation program, some practical results tailored towards the load flow study are obtained by specifing the exact dimensions and bandwidths of some typical power systems. The simulation time results, however, when used to replace the complete piece-wise elimination time, must include the time spent on the pre- and post-normalization of operands. These normalization times are found according to the intermediate processor design discussed in Section A.2. As shown in Figure h.3 and Figure A.A, the numbers of stages (including the buffer stages and the complementation stage for subtracters) in the pipeline structure are 8 (pre-adjustment portion) and 10 (post-adjustment portion). Assuming the adder is a 3-level CLA (Carry Lookahead Adder) which has a 12 unit gate delay [20], the pre- and post-normalization times are simply the setup time of the pre-adjustment portion and the flush time of the post-adjustment portion of the intermediate processor. Therefore, the pre-normalization time (tp ) is given by, re tpre = 9 x lZ-A (6-1) where A is assumed to be 1 nsec with mid 1980's technology [23]. An 260 256 255 250 Table 6.3 Simulation time results showing A Imicron) 72 an I/O timing bottleneck. t 1503. .6 939 375- 1880. 925. 370- lh7h. 921. 368. lhhs. 903. 361. in ShSECI I. 8 t' 0‘ a: 0‘ O‘ O‘ I: out (nsecI 757 . 1173. 189. 765. 0‘ \N h66.0 186.h 7&2. 56%. 185. 728. #55. 182. 293. MAC (nsec) 8 183.6 73- 293. 183. 73- 293. 183. 73- 293. 183. 73- 0‘ a: t' 0‘ 00 S‘ 0‘ a: t' r DC (nsecI lh80.0 925.0 370.0 . 1h80.0 925.0 370.0 1A80.0 925.0 370.0 1h80.0 925.0 370.0 73 extra segment time is included in tpre because the very first operand pumped into the intermediate processor is the maximum operand which will not be transferred to the VLSI processor. The post-normalization time (tpost) 15 defined by, tpost = 10 x 12-A. (6-2) FInalIy, tpre’ tpost’ the VLSI array computing tIme IncludIng I/D tIme (tcomp)’ overall solutIon tIme (tVLSI.tpre+tpost+tcomp) and the corresponding chip edge size of the VLSI structure (S ) are shown in VLSI Table 6.A. All of these _time results are found with respect to the network bandwidth (8'), network dimension (N'+l) and various lithographic linewidths. This table indicates that t and t are pre post insignificant when compared to t The solution time, , will be tVLSI compared to other results in the next section. Notice that some of the comp' chip edge sizes, S , are tremendously large compared to that of the VLSI projected 1.5 cm x 1.5 cm I. C. chip. 6.2 Comparison with Previous Results Time results obtained from the previous works which utilized a one-dimensional array processor (AP-190L) to solve the same power systems are compared with the benchmark results from the CYBER-750 and the simulation results in this research. This is illustrated in Table 6.5. This table shows that the triangulation time per iteration spent on the VLSI systolic array structure is dramatically improved over 7h Table 6.A Simulation time and chip edge size results for load flow study. pre + tpost + tcomp . tVLSI VLSI __ (micron) (nsec) (nsecI ImsecI (msecI (ch N'+l 8' A t S A3 6 0.8 108 120 0.289 0.289 2.30 0.5 108 120 0.181 0.181 1.hh 0.2 108 120 0.072 0.072 0.58 AS 11 0.8 108 120 0.355 0.355 3.89 0.5 108 120 0.222 0.222 2.A3 0.2 108 120 0.089 0.089 0.97 105 17 0.8 108 120 0.722 0.722 5.80 0.5 108 120 0.b51 0.h51 3.62 0.2 108 120 0.180 0.180 1.A5 150 16 0.8 108 120 0.983 0.983 5.58 0.5 108 120 0.61h 0.61h 3.A3 0.2 108 120 0.2h6 0.2h6 1.37 75 Table 6.5 Comparison of triangulation time per iteration of different processors. N'+I 3' tCYBER* tAP tVLSI (I) ____ __ _jm§ggL _Jm§ggL (msec (micron)) I3 6 5.500 2.717 0.289 (0.8) 0.181 (0.5) 0.072 (0.2) I9 11* 10.500 b.555 0.355 (0.8) 0.222 (0.5) 0.089 (0.2) 105 17 77.500 13.163 0.722 (0.8) 0.u51 (0.5) 0.180 (0.2) 150 16 55.833 17.6u6 0.983 (0.8) 0.6lh (0.5) 0.256 (0.2) *Average from several runs with Markowitz ordered data sets (Tinney No. 2 ordering [5,6]). Numbers are absolute lower bounds; real CPU time is considerably higher but could not be accurately trapped. 76 the benchmark result (tCYBER) and that spent on the array processor (tAP)° tCYBER is tremendously large since the program is executed in a serial fashion. Therefore, it is concluded that the VLSI array structure is very efficient in reducing the banded Jacobian matrix triangulation time which in turn enhances the overall computational throughput of large scale load flow problems. In summary, the ratios of tVLSI:tCYBER and tVLSIztAP (for A80.8 mIcron) are listed In the following table. Table 6.6 Comparison of time results of different processors in ratio. N'+l 3' tVLSI ‘ tCYBER tVLSI ‘ tAP AB 6 l : 19.0 1 : 9.h 119 11 1 : 29.6 1 : 12.8 105 17 '1 : 107.3 1 : 18.2 150 16 l : 56.8 1 : 18.0 77 6.3 Conclusions In this work, an existing VLSI algorithm for triangulating large band form matrices was successfully modified and applied to load flow analysis. By adding rows of multiplexers and latches to an existing algorithm, the revised structure can triangulate a band matrix to strict upper triangular matrix with a unit diagonal in 0(N) time steps. The pre- and post-adjustments of the operands transferring between a host computer and the VLSI array structure can be done efficiently by the proposed intermediate pipeline processor described in Section A.2. The time required for operand adjustment is simply the pipeline setup and flush time which is insignificant to the overall triangulation time. It is found that the minimum number of fixed-point bits per word for load flow convergence of the test networks used ( < 150 busses) is 32 which indicates that all of the coefficients of the system matrix are reasonably well tempered and of adequately tight dynamic range. The simulation results show that the l/O bottleneck does not occur in this VLSI structure if the maximum segment delay time in the pipeline is completely dominated by the PE time, specifically the DC time. As shown in Section 6.1, a power system network of approximately 61 busses could be solved on a single VLSI chip with the late 1980's VLSI technology. The circuit simulation time results reveal that the time of upper triangulation by the designed VLSI systolic array structure is greatly 78 reduced and hence the overall solution throughput of load flow analysis is improved. 6.A Discussion It has been shown that when the system bandwidth is large, the proposed VLSI structure cannot be implemented on a single chip. Therefore, this algorithm must be realized in a modular fashion on a few chips. If such a case is possible, the VLSI algorithm should be modified at the same time to have true floating-point capabilities since there will be no limitation on the hardware size if the intercommunication problem among chips is solved. Recall that the complete triangulation time of the VLSI structure is limited by the segment time of the DC (See Table 6.2) which is approximately five times greater than that of an MAC. Thus, the triangulation time can be further reduced if a faster DC is designed. However, as indicated in Section 6.1, the I/O bottleneck did not occur because of the domination of the DC segment time. So, it is possible that if the matrix bandwidth is extremely large and DC time is reduced tremendously, the INPORT time will dominate over other segment times. As a result, an I/D bottleneck may be encountered. Therefore, a trade-off between the segment and I/D time may result in limiting the overall triangulation throughput in the future modular designs. If the VLSI algorithm is realized in a modular fashion and designed with binary floating-point (FLP) arithmetic, all of the processing elements should be reexamined.’ An FLP PE should possess the 79 capabilities of alignment or normalization, truncation and rounding of the operands. This, of course, may double or triple the hardware size making a modular layout essential. In addition, the complexity of the cell design depends on other factors, such as degree of precision and error bounds. This may introduce trade-offs between normalized or unnormalized FLP designs. Undoubtedly, the inclusion of FLP arithmetic will. involve a huge increase in hardware size. Thus, a near term research requirement is to investigate some area-efficient FLP arithmetic algorithms (multiply-add and division) and then examine the intercommunication problem and potential modularity among these new PE's. Finally, it must be pointed out that the application of a proposed VLSI systolic array structure to load flow analysis leaves a lot of practical questions unanswered. For instance, in the future when such a VLSI structure is available, will it still be attached peripherally to a host computer of today's serial genre? Also, in order to obtain the fastest possible load flow solution, the structure of a load flow program should be reexamined such that the process of finding a maximum operand and coefficients' normalization can be avoided. On the whole, as VLSI technology advances, not only should special purpose computer architectures be designed and improved, but also the algorithms, including arithmetic and programming, must be reexamined so that all of the technology, architectures and algorithms can progress harmoniously. BIBLIOGRAPHY 10. I]. BIBLIOGRAPHY Gross, C. A., Power System Analysis, John Wiley and Sons, New York (1979). PP- 3'9- Kung, H. T., "The Structure of Parallel Algorithms," Advances 1p Computers, Vol. 19, Academic Press (1980). Pp. 65-112. Hwang, K. and Cheng, Y. H., "VLSI Computing Structures for Solving Large-Scale Linear Systems of Equation," Proc. 1980 lnt'l Conf. pp Parallel Processing (August 1980), pp. 217-230. Hsu, W. C. and Shanblatt, M. A., Evalpption pi 3 Single VLSI Chip Algorithm for Triangulating Large Band Form Matrices, Tech. Report No. MSU-ENGR 82-015, Michigan State University, East Lansing, Michigan (August 1982). Tinney, W. F. and Hart, C. E., ”Power Flow Solution by Newton Method," IEEE Trans. pp Power Apparatus ppp Systems, Vol. PAS-86, No. 11 (November 1967), pp. lhh9-lh56. Tinney, W. F. and Walker, J. W., "Direct Solutions of Sparse Network Equations by Optimally Ordered Triangular Factorization," Eggs; [555, Vol. 55, N0. 11 (November 1967). PP. 1801-1809. Shanblatt, M. A., Mickle, M. H. and Vogt, W. 6., ”Optimal Ordering Strategies for Load Flows on an Array Processor,“ Proc. 1981 Int'l Conf. pp Electrical Energy (April 1981), pp. 78-81. Abulleil, A. M., Ipp U55 9: pp Large Scale Load Fl Prob Pittsburgh (1981). Array Processor 1p the Solptjon ‘21 ems, 1 Ph. D. Thesis, University of Philadelphia Electric Company Power Flow Program, POWERFLO, Version A, Mod Level 30, System Planning Division, Philadelphia Electric Company, Philadelphia, Pa. (8 October 1975). York (1965), pp. 398-h15. Bodewig, E., Matrix Calculus, lnterscience, New York (1959), pp. 101-12A. 80 12. 13. 1h. 15. l6. l7. l8. 19. 20. 21. 22. 23. 2h. 25. 81 Alway, G. G. and Martin, D. W., ”An Algorithm for Reducing the Bandwidth of a Matrix of Symmetric Configuration,” Computer Journal (August 1965), pp. 26h-272. Cuthill, E. and McKee, J., "Reducing the Bandwidth of Sparse Symmetric Matrices,” Proc. ghth National Conf. ACM, Brandon System Press, New Jersey (1969), pp. 157-172. Gibbs, N. E., Poole, W. G. Jr., and Stockmeyer, P. K., ”An Algorithm for Reducing the Bandwidth and Profile of a Sparse Matrix,“ SIAM pp Numer. Anal., Vol. 13 (1976). pp. 236-250. Foster, M. J. and Kung, H. T., "The Design of Special-Purpose VLSI Chips," omputer, Vol. 13 (January 1980), pp. 26-h0. Kung, H. T. and Leiserson, C. E., "Algorithm for VLSI Processor Array," Symposium pp Sparse Matrix Computptions, Knoxville (1978). Hsu, W. C., Leung, Y.-Y. J. and Shanblatt, M. A., "Comparison of Input/Output Structures for Single Chip VLSI Systolic Arrays,” Proc. 25th Midwest Symp. pp Circ. and Sys. (August 1982). Mead, C. and Conway, L., Introduction pp VLSI Systems, Addison-Wesley Pub. Co., Reading, Massachusetts (1980), pp. 1-90. Baugh, C. R. and Wooley, B. A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. Computers, Vol. C-22, No. l-2 (December 1973), pp. 10h5-10h7. Hwang, K., Comppter Arithmetric, John Wiley and Sons, New York (1979). PP- 35’259- Chang, T. L. and Fisher, P. D., "High-Speed Normalization and Rounding Circuits for VLSI Floating-Point Processors," Proc. IEEE Int'l Conf. pp Circuits ppp Computers (1980), pp. 512-516. Taub, H. and Schilling, D., Digital Integrated Electronics, McGraw-Hill Inc. (1977), pp. 35-53. Keyes, R. W., "Physical Limits in Semiconductor Electronics," Science, Vol. 195 (March 1977), pp. 1230-1235. Eidson, J. C., "Fast Electron-Beam Lithography,” IEEE Spectrum, V01. 18 (July 1981). pp. 2I—28. Bernhard, R., "VLSI Lithography at the Crossroads,” IEEE Spectrum, Vol. 18 (July 1981)” p. 27.