. . . wwvuww br-‘ell Ito-v "crowns ””1 P s V '. '9... an. - 1 hfl.-~4‘ '.- m.- M-"- I _ 4p ’- Hn.’ I . E ‘ $.12-“ . . . n J _',,‘.\‘..u[ ‘xaul‘ “- —'--- 1'14! '7 This is to certify that the dissertation entitled Performance Tradeoffs in the Hierarchical Design of Regular VLSI Structures presented by Yu-Ying Jackson Leung has been accepted towards fulfillment of the requirements for Ph.D. degreein E'IeCt. EngY‘. “Ia/J51 M MM Major professor Date 12/10/85 MS U is an Affirmative Action/Equal Opportunity Institution 0- 12771 ‘ MSU LIBRARIES “- RETURNING MATERIALS: Place in book drop to remove this checkout from your record. FINES will be charged if book is returned after the date stamped below. PERFORMANCE TRADEOFFS IN THE HIERARCHICAL DESIGN OF REGULAR VLSI STRUCTURES BY Yu-Ying Jackson Leung A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree or DOCTOR OF PHILOSOPHY Department of Electrical Engineering and Systems Science 1986 ABSTRACT PERFORMANCE TRADEOFFS IN THE HIERARCHICAL DESIGN OF REGULAR VLSI STRUCTURES BY Yu-Ying Jackson Leung Hierarchical layout is the prevalent design methodology in the computer-aided generation of VLSI circuits. But, the tradeoff of design-complexity versus circuit performance among the various design paths of this hierarchy has yet to be fully assessed and compared numerically. This research provides a systematic and quantitative investigation of this tradeoff for regular structures. Systolic array structures, for which regular levels of modularity are well defined, are used as testbed structures. The hierarchical regularity levels in a systolic structure are the transistor, gate. functional device, processing element (PE) and algorithmic levels. The investigation of modularity is pursued along two distinct paths -- the bottom-up and top-down approaches. Using the bottom-up approach and standard NMOS design techniques, several PE's are independently designed using the various design entry-points and pathways of the design hierarchy. Results provide a set of area-performance versus design-complexity index figures that represent the tradeoffs among the various design pathways. The general Yu-Ying Jackson Leung expressions for the complexity, chip area and propagation time are numerically derived based on an infinite array model. The top-down approach is applied to the algorithmic and PE levels where chip-wise modularity is desired. The various chip-wise decompositions of an array structure are parameterized by an I/O bottlenecking index (BI). Results show the relationship of BI to the number of pinouts, word size in number of bits and chip size. Results obtained from the bottom-up and top-down approaches provide an analysis of several tradeoff parameters. An appropriate design pathway can thus be chosen with respect to the desired tradeoff parameters of the final design. To my mother and eldest brother Mdm. Me Chit and Mr. Wai-Ying Tommy Leung ii ACKNOWLEDGEMENTS The author wishes to express his sincere appreciation to his major advisor, Dr. Michael A. Shanblatt, for his guidance and encouragement in the course of this research. He also wishes to thank the committee members, Dr. E. Goodman, Dr. P. D.-Fisher, Dr. R. A. Schlueter and Dr. D. Yen for giving the guidance, suggestions and comments in this work. Other thanks are given to Dr. Goodman, Director of A. H. Case Center for Computer-Aided Design at MSU, for allowing the author to use the Computervision CADDS system, and Dr. Stephanie Shanblatt for the proofreading of this dissertation. Finally, the author owes a special thanks to his wife, Jasmine, for her emotional encouragement and support, and especially, for her patience while the author was in East Lansing pursuing this work. From alpha to omega, Glory to God for endlessly providing the love, wisdom, hope, faith and forgiveness. iii TABLE OF CONTENTS LIST OF TABLES 0.0.000...0.0.0.0...0......OOOOOOOOOOOOOOOOO0.0 Vi LIST OF FIGMS 0.0.0.0....OOOOOOOOOOIOOOOOOOOOOOOOOOOOOOOOOOO Vii I. INTRODUCTION ........................................... 1 1.1 Problem Statement ................................. 3 1x2 anndi.n.u.u.n.n.u.u.u.n.u.u.u.n.u 6 1.2.1 The Bottom-up Approach 7 1.2.2 The Top-down Approach ..... 9 II. BACKGROUND ............................................. 11 2.1 VLSI Design Methodology ........................... 12 2.1.1 IC Design Systems .......................... 14 2.1.2 Application-Specific IC Layout ............. 16 2.1.3 Silicon Compilation ........................ 19 2.1.4 Hierarchical Design Methodology ............ 21 .2 VLSI Technologies ................................. 22 .3 VLSI Systolic Array Structures .................... 24 4 Design for Testability ............................ 26 S Y-Chart Representation of VLSI Design Methodologies .............................. 29 III. BOTTOM-UP APPROACH ..................................... 33 3 1 Transistor and Gate Level Design .................. 36 3.2 Functional Device Level Design 39 3 3 PE Level Design ................................... 43 3 4 Systolic Arrays for Matrix Multiplication .................................... 45 3.5 Latch Design for Systolic ' Array Structures .................................. 48 iv IV. VI. TRADEOPF PARAMETERS EVALUATION AND JUSTIFICATION ............... ...... . ..... ........... 4.1 Chip Area and Propagation Time Computation .................................. 4.2 Design Complexity of Regular VLSI Structures .............................. ..... 4.2.1 Module Placement Approach to Design Complexity .......................... 4.2.2 Compactness Ratio Approach to Design Complexity .......................... 4.2.3 Tradeoff of Compactness Ratio Versus Area—Time.................. ........ . TOP-DOWN APPROACH IN ARRAY DECOMPOSITION ............... 5.1 Chip-Wise Partitioning of Systolic Structures ............................... 5.2 Tradeoff Between I/O Bottleneck and Pinout Limitations ............................ SULTS AND CONCLUSIONS 1 Compactness Versus Area-Time Ratios ............... .2 I/O Bottlenecking Index ........................... 3 Contributions ....:................................ 4 Trends in VLSI and Future Study ................... BIBLIWRAPHY ..... OOOOOOOOOOOOOOOOOOOOOODOOOOOOOOOOOOOOO APPENDIX 00.0.00...00.000.000.00.00000000000000000000000 51 52 56 S8 60 66 76 78 82 9O 90 101 108 111 114 ' 119 LIST OF TABLES Table . Page 6.1 Area-time and compactness ratios of FA. ............... 93 6.2 The typical physical parameters 0: "OSFET tGChnOIOgYO 0.0.0....OOOOOOOOIOOOOOOOOOOO0.00 94 6.3 Area-time and compactness ratios of MAC. .............. 96 6.4 Area-time and compactness ratios of various target modules. ............................ 99 6.5 Comparison of I/O bottlenecking index, pin limitation and chip area. ......................... 104 vi 2.1 2.2 2.3 3.1 3.2 3.3 3.5 3.6 3.7 3.8 3.9 3.10 3.11 LIST OF FIGURES Two types of MAC's. .. ........................ . ........ Mesh-connected MAC's for mtrix multiplication. O O O O I O O O O O O O 0 0000000000 0 000000 O O A modified Y-chart representation or tWical VLSI-MSd deSignO 0.00.0000000.000.0000.... An overview of the mttw-up awroaCh. 0.00.0...0.0000000009000000...000.. Design hierarchy in the mttm-up approacn. 0.00.0000...OOOOOOOOOOOOOOOOOOOOOOO Cell boundary of two transistor modules. .............. Typical gate-level modules. ........................... Logic diagram for the implementation of PA. ........... Alternate version of Figure 3.5. ...................... Logic diagram of a 5-by-5 “ugh-Mley “5” MAC. 0....00.0.0.0...00.00.00.000... Array network for gsgxg shown in Figure 2.2. OOOOOOOOOOOOOOOOOO0.00000000000000000000 Design entry-point and pathway 0: various tuget structures. 0000.00...00000000000000. MC latChESe O...OOOOOOOOOOOOOOOOOOOOOOO00....OOOO'. Dynamic latches with scan design. ..................... vii 27 31 34 35 37 38 4O 41 44 47 48 49 50 Figure 4.1 4.5 5.1 5.2 5.3 5.5 6.1 6.2 6.3 6.5 An N.-input NMOS NOR gate driving No iémtical outputSe OOOOOOOOOOOOOOOOOOOOOOOO00.0.0.0. A transistor-level module. ............................ A gate-level module. ......... ...... ................... Hierarchical design model otmn+1 lwel ”“18. OOOOOOOOOOOO. ..... .00.... ...... Hierarchical design model Otamc level ”“13. 00.000.00.0000000000000000000.0. An overview of the topic“ appracn. OOOOOOOOOOOOOOOOOOOIOOOOOOOOOOOOOOOO Examples of partitioned arrays ........................ Array of modularized chips MVing Ns (7 rigure 2.1 Two types 0: MAC's. inversion, multiplication, triangulation and L-U decomposition. Thus, one-dimensional (l-D) or two~dimensional (Z-D) systolic array structures can be contigurated by using these MAC's depending on the application. For example, consider a 2-D systolic array structure used for band matrix multiplication [3, 6]. The multiplication of two NxN matrices A and g with bandwidth of wl-pl+q1-l and wz-pz+q2-l, respectively, can be represented by p p I<-l->I I<---3--->I / . / \ / \ :11 :12 a o 33 :11 :12 :13 b 0 . :11 :12 :13 :14 ° q1 a21 In22 .23 a i 21 b22 b23 b24 c21 c22 c23 c24 * 31 a32 33 34 ‘ 32 b33 34 _ c31 c32 33 34 (2_1) 42 43 41 42 o . o . o . 26 In this example, both w1 and w2 are equal to 4. Figure 2.2 shows the corresponding mesh-connected network of type-2 MAC's with the appropriate intermediate latches depicted by rectangular blocks. Elements of matrices A and g are pumped into the array network synchronously until all elements of g are obtained. In general, the numbers of MAC's, and time steps, T required for the "MAC' mc' multiplication of any two NxN matrices of bandwidth w1 and w2 are expressed as follows [3, 6], and TMAC 8 3N + min{w1,w2}. (2-3) 2.4 Design for Testability The testability of a digital network is directly related to the} difficulty of controlling and observing the logical values of internal nodes from external circuit ports [39, 40]. Thus, controllability and observability are the key concepts in the design and implementation of testability. In addition, accessibility, which is the measure of the ease of controllability and observability, is also important in testing complex VLSI circuits. The two basic approaches prevalent today in VLSI design for testability are the ad hoc and structured approaches [39]. The ad hoc approaches use techniques which can be applied to a given product, but are not directed at solving the general problem. Examples of these techniques, some of which evolved from MSI and LSI techniques, are partitioning, test points and signature analysis [39]. A. ...... ..._\ fin/w /. ............. .A ......... \ .//*\ #\ .I \ / A ............. ...e A ......m...... ......... _ ............ .... l/ \ / \ 11%...- .0. ........................... / \ C22 2.2 Mesh-connected MAC's for matrix multiplication. Figure 28 Most structured testing approaches are built upon the concept that if the values in all the latches can be controlled and observed in a straightforward operation, than the test generation and simulation of a sequential network can be reduced to that of doing test generation and fault simulation for a combinational logic network. A control signal can switch the memory elements from their normal mode of operation to a mode that makes them controllable and observable. Techniques that use the structured approach are level-sensitive scan design (LSSD), scan path, scan/set logic, random-access scan, self-testing and built-in tests [39]. Among these, LSSD [13] and scan path [14] techniques have been the most popular approaches to VLSI testing. In LSSD structures the memory elements or latches are threaded together to form a serial-in/serial-out shift register. The shift register latches (SRL's) are controlled by shift clocks for loading and unloading and by system clocks for latching logic values present at their data inputs. In testing the logic networks, test inputs are applied through primary inputs as well as SRL's through a ”scan-in" operation. Outputs of the networks are observable at both primary outputs and SRL's through a ”scan-out" operation. The scan path technique has the same objectives as the LSSD approach. The memory elements used in. the scan path approach are raceless D-type flip-flops. The scan-in and scan-out operations are controlled by the master/slave operations of the flip-flops. With proper adjustment to the delays of the flip-flops, race conditions can be avoided [14]. Therefore, only a single system clock is required in. this technique. 29 In general, all designs for testing require additional hardware and I/o pinouts so as to achieve controllability and observability. Therefore, accessibility tends to trade off with hardware size and/or performance. Design for testability of systolic arrays and other structures having similar regular and pipelined characteristics are relatively straightforward in comparison with other VLSI circuits. The functional and logic tests of these structures are directly related to algorithm implementation. Therefore, once the systolic or other similar algorithm has been correctly translated and implemented into a hardware algorithm, the remaining test required is just the physical testing of the chip. In addition, systolic structures use dynamic latches for pipelining and each pipelined segment is composed of combinational logic. Thus, scan design used in LSSD and scan path techniques can be applied to these structures without a significant increase in hardware or signal delay time. 2.5 Y-Chart Representation of VLSI Design Methodologies A tripartite representation on a Y-chart has been used to define VLSI design methodologies [8, 41]. A Y—chart is a descriptive model using three axes. In general designs, the three axes of the chhart are associated with functional or behavioral representation (e.g., a Boolean expression), structural representation (e.g., the functional realization of the Boolean expression), and geometrical representation (i.e., the 30 physical implementation of a design) [41]. In the design of systolic algorithms, the three axes are related to algorithm representation (specified forms or levels of systolic algorithm), algorithm model (abstraction of the features of the algorithm) and architecture specification (physical description of the systolic array) [8]. Alternative Y-charts will be minimally related to each other if the axes among these Y-charts are not consistent in representation. However, all Y-charts utilize directed or feedback arcs to represent the transformations on the same axis or among the axes. The means for these transformations vary with the design methodologies. The means can be CAD tools, sets of parameters or even a direct mapping or transfer. Design representations using Y-charts have many advantageous features. First, the three-axes as well as arcs are all user-defined so that the Y-chart can be used to clarify the explanation of a design approach. Secondly, the tradeoffs of the different design methodologies can be initially assessed by comparing the information gleaned from. the Y-chart. (Of course, the axes of the compared Y-charts must be consistently defined.) Thirdly, Y—charts can be further subdivided into more detailed Y-charts such that informative details can be additionally depicted. This may be considered as a ”hierarchical” Y-chart representation of VLSI designs. As an example, a modified Y-chart describing the design of several VLSI structures mentioned in this chapter is shown in Figure 2.3 [8, 41]. In this Y-chart, the axis of STRUCTURE REPRESENTATION broadly shows two major types, random and regular, VLSI-based structures. Six typical random and regular structures are shown on this axis but, 31 STRUCTURE DESIGN REPRESENTATION MODEL M ICROPROCESSOR PERIPHERAL CONTROLLER I ”we“ SYSTOL 1c ARRAY gig.“ GATE ARRAY FLA SILICON REGULAR MEMORY COMPILATION HIERARCHICAL NSIGN CELL ARRAY __ SEMI-CUSTOM .. FULLY CUSTOM .. LAYOUT I SPECIFICATION Figure 2.3 A modified Y-chart representation of .typical VLSI-based design [8, 41]. 32 obviously, VLSI-based structures are not restricted to these six. This axis can also indicate the degree of either regularity or non-regularity of the structures. Regularity increases toward the origin while non-regularity or randomness increases with the direction of the axis. Note that the scales on this axis are not necessary in equal proportion. The methodologies of custom design, silicon [compilation technique and hierarchical design are depicted in the axis of DESIGN MODEL. The custom.design model includes all other traditional design methodologies such as. symbolic layout approaches, building-block techniques, etc. [17]. Design techniques must not be constrained to a single design model; therefore, a mixed design model can also be described. For example, silicon compilation techniques can also be used in some of the procedures in the hierarchical design model [29, 33]. The third axis for LAYOUT SPECIFICATION represents IC layouts 'in order of increasing complexity. The cell array approaches such as the gate-array and logic array, have the least complex layout while the fully custom layout has the highest complexity. Any directed or self-looped arcs denoting a transformation through certain means such as simulators, compilers, sets of parameters, etc., can be placed on this Y-chart. Finally, a detailed Y-chart can be further developed for a particular structure or design model. For example, a detailed Y-chart can be developed for the description of the design of regular structures using a custom design approach. CHAPTER III BOTTOM-UP APPROACH An overview of the bottom-up approach to the hierarchical design of systolic array structures is shown in Figure 3.1. The layout design starts from a selected design entry-point according to a set of desired parameters such as circuit complexity, size and performance. Design pathways can then traverse among the module levels of transistor, gate, functional device or processing element (PE) as desired. Finally, a ”target" structure can be formed by modular tessellation. Tradeoffs existing among the design hierarchy have yet to be fully assessed and compared lnumerically. In order to do so, the bottom-up approach presented here involves the layout and simulation of different target systolic array structures designed through different hierarchical design entry-points and paths. The thread of ‘this hierarchy in the bottom-up approach is further shown in Figure 3.2. The primary target systolic structure is the multiply-add cell (MAC), which is a commonly used and often the majority PE inside systolic array structures. Examples of systolic structures in which the MAC is the major component include arrays for matrix triangulation, multiplication and inversion. As an example in hierarchical design of layout, a MAC module considered at the PE level can be designed starting from transistor, gate or functional device level. Likewise, a full 33 34 DESIGN ENTRY POINT _ T ‘ TRAIE OFF 1 PARAMETERS TRANSISTOR GATE FUNCTIONAL PE LEVEL LEVEL GVICE LEVEL LEVEL MONLAR TESSELLATION ONE-PATH MULTI-PATH MDWLE ARRAY MODULE ARRAY Y I Figure 3.1 An overview of the bottom-up approach. 35 4 ENTRY POINT __J _ i FWCTIO'W- GATE TRANSISTOR 333?; . 1 MODULE * l MOOOLE J PE MODULE VLSI SYSTOL IC ARRAY STRUCTURE Figure 3.2 Design hierarchy in the bottom-up approach. 36 adder (FA) module at the functional device level can be built of either gate modules or transistor modules. As an initial approach to study the tradeoff among the various design levels, modules at several levels is designed and simulated using standard NMOS design rules. The choice of these design rules is due mainly to the widespread use and knowledge base of NMOS technology [6, 27, 37]. Other factors making NMOS attractive include high circuitry density, richness of available circuit functions and simple topological properties of circuit layout and simulation. 3.1 'Transistor and Gate Level Design The boundary of a circuit module at any level is simply defined as the rectangle which contains the complete planar layout of the functional circuit. Therefore, the area of a circuit module is simply the product of the length and width of the boundary of the module. Figure 3.3 illustrates the cell boundary of a depletion mode (pull-up) and an enhancement mode (pull-down) transistor modules. Figure 3.4 shows the four typical gate-level modules of inverter, 2-input NOR, 2-input RAND and XNOR. Both of these figures represent standard NMOS design [6, 27, 37]. In addition, due to the assumption of module tessellation, modification of predefined lower level modules is not considered when they are used to form higher level modules or circuits. Otherwise, there will be an increase in design complexity and design-time and the design parameters cannot be generalized. As an example, consider a 37 IE] DIFFUSION - CONTACT CLIT ION IMPLANTATION METAL E POLYSILICON Depletion mode Enhancement mode transistor transistor Figure 3.3 Cell boundary of two transistor modules. gate-level FA which are tessellated by using a 2-input HAND gate as shown in Figure 3.4. During the tessellation, the pull-up and pull-down transistors inside the RAND gate can be shifted, rotated or/and reflected provided that all the transistors stay within the predefined gate boundary. Otherwise, the FA is considered to be a transistor-level FA (”x.FA') instead of a pure gate-level FA ("G.FA”). This is because the RAND gate is actually broken down into transistor modules which implies an increase in the number of modules requiring placement and lines requiring routing. Transistor modules and gate level modules are defined and then "programmed”, via a compilation technique, into a VLSI design oriented CAD system in order to establish a flexible cell library. Then, the 38 ESE DIFFUSION - CONTACT CUT ICIN IMPLANTAT 1 ON METAL E PDLYS IL I CON 2 - I NPUT NAND XNUR Figure 3.4 Typical gate-level modules. 39 layout of any transistor (pull-up, pulledown) or gate (NOR, NAND, XNOR, etc.) module can be generated automatically once the required specifications of the module, such as size, pull-up to pull-down transistor ratio or I/O and geometrical orientation, have been defined. Therefore, high-level modules can be designed by tessellating the predefined lower-level transistor modules or gate modules. 3.2 Functional Device Level Design Common circuits that can be considered as functional device modules include the half adder, full adder, comparator, complementer and counter cells. In general, the functional device module is composed of a certain number of transistors or gates and is the majority device module inside the higher level PE module. The most common functional device module found inside regular VLSI computing structures is a l-bit full adder (FA). High speed arithmetic circuits such as the carry-save adder and fast multipliers, such as the Braun array and Baugh-Wooley array, are also built primarily of FA's [42, 43]. The FA adds two binary digits a1 and bi' and a carry-input c1 to produce a sum-output s1 and a carry-output c . Conventionally, the .i+l implementation of the l-bit binary FA module is according to the Boolean equations siaaiObiOci and c1+1= logic diagram of this implementation is shown in Figure 3.5. The (aiAbi )V(biAci)V(aiAc1). The schematic reasons for choosing this implementation are summarized as follows: 1. It contains various discrete and cenventional gates such that transistor and gate modules can be easily distinguished. 40 Figure 3.5 Logic diagram for the implementation of FA. 2. Its simple interconnection and small number of fan-ins and fan-outs per gate allow flexible layout design, predictable circuit behavior, straightforward circuit simulation, and speed calculation. 3. It is not particularly biased against or toward any IC technology. (Note that NMOS is biased to NOR while CMOS and TTL (Transistor-Transistor Logic) are biased to RAND [26, 27].) In designing an FA as a target structure, there are three distinct design paths as noted in Figure 3.2. Arranged in decreasing order of design complexity and circuit performance, they are "x.FA", ”G.FA” and ”FA”, where 'X', 'G', and 'FA' represent the transistor, gate and functional device module level, respectively, and '.' denotes the design path between two levels. The terms of ”functional device module level” and ”FA module level” will be used interchangeably since the majority functional device is the FA in this research. Likewise, ”PE level” and ”MAC level” will be interchangeable because the majority PE is the MAC. Therefore, "xxx" represents a transistor-level FA module which is designed via the pathway of ”entry point", ”transistor module” and 41 "functional device module” as shown in Figure 3.2. Using only the transistor modules, this FA module is carefully crafted according to the logic diagram shown in Figure 3.5. In order to utilize the standard NMOS gates as shown' in Figure 3.4, the XOR gate in Figure 3.5 is realized by an XNOR concatenated with an inverter as shown in Figure 3.6. . Since the layout positions of global I/O can affect overall module size and speed, different‘ layouts of the FA with different I/O orientations are Agenerated and compared.“ The I/O orientation is determined by the direction of data flow such that the module is best fit for tessellation. Finally, a preferred ”x.FA', which is best in performance and chip area utilization, will be used as the building blocks for higher level modules. Likewise, "G.FA" is a gate-level FA module designed through the path of ”entry point", ”gate module” and ”functional device module" again as illustrated in Figure 3.2. This module is designed by placing gate modules only and routing the interconnection lines for I/O, power Figure 3.6 Alternate version of Figure 3.5. 42 and ground. In the same manner as designing the "x.FA", layouts of ”G.FA" with different global I/O positions are obtained and compared. Thus, ”G.FA" modules heuristically optimal in both speed and area are obtained. "FA", an functional device-level FA module, requires the least design time, thus representing the FA module with lowest design complexity. This module, mimicking a standard FA cell stored in the cell library, can be extracted and directly placed for tessellation. The actual design of this FA module uses the idea of the gate-array approach as discussed in Section 2.1.2. That is, the gates required to form an FA are simply packed as close as possible together in an array form and then I/O and power/6ND lines are connected directly. Therefore, this design is completed in the shortest time compared to the other target FA's. Once the three major types of FA's (”X.FA”, "G.FA" and ”FA") [designed according to the above criteria have been obtained, they become part of the database of the CAD _system as FA cells. Although the Computervision CADD 2/VLSI design system is used as a design aid in this work, any other basic CAD system or even paper and pencil can be used instead. This is because the final results and comparison will not be expressed in terms of absolute design time such as CPU seconds or man-hours but rather in relative ratios in terms of the number of modules or density of the modules. What's more, none of the basic 'layout designs utilized any placement or routing algorithms because of the assumed properties of regularity and local connectivity. As a result, these designs are machine independent. 43 '3.3 PE Level Design The primary target structure in the design hierarchy is the multiply-add cell (MAC), which is the majority PE found in many systolic structures. The implementation of a MAC is based on the Baugh-Wooley array multiplier which allows fast and direct two's complement array multiplication [9, 42]. In addition, the Baugh-Wooley array is constructed entirely from the conventional l-bit FA shown in Figure 3.5. For illustration, Figure 3.7 shows a 5-bit by S-bit MAC which is merely a Baugh-Wooley array multiplier (enclosed by dotted lines) with an extra row of FA's at its bottom edge. The design complexity and final performance of a MAC are again quite Wdependent on the design entry-point and pathway. As shown in Figure 3.2, five distinct MAC's of different complexity corresponding to five design paths of mixed levels are possible. They are the "x.wAc", ”x.FA.MAC”, ”G.MAC”, ”G.FA.MAC” and "FA.MAC”. For example, ”x.MAC” is a transistor-level MAC which is laid out using only transistor modules and "FA.MAC” is tessellated only by "FA” modules. Obviously, the design of a ”FA.MAC” is less complex than that of ”X.MAC” simply because the former is built using larger blocks without any concern given to details inside each block while the latter one is crafted from basic transistor modules. However, the chip area and propagation time of the "FA.MAC" are expected to be greater than that of ”x.MAC” due to the relatively smaller interstitial space and shorter signal path inside the ”x.MAC”. 44 .2... ..3 De: eons. haoosucasea ALBA e no 5&er owes fin 253a oe Lo No me so me ee so me me v v 4' : V i v v. v _ a. E e. (a 2 (a \.. .. ... L .. ... \ ... .. e e e e e“ \ \ e. . k e. e. \ area 9 G e In... 6 ea.- 9 \ \ \ \ are. e. em \ \ \ .... a 9‘9» \ \ \ on e. \ .m. \ \ \ a... 3 9 9‘6 \ .... .... x \ \ .... \ .... ....v.a...vssa.....e.... so. .a .. .a e. .a .. 45 Among the five different MAC's, "x.MAC", "G.MAC” and ”FA.MAC" represent the transistor-level MAC, gate-level MAC and FA-level MAC, respectively. However, closer inspection of Figure 3.2 reveals two points-of-view with respect to the design entry-point and target-point as observed for ”x.FA.MAC” and ”G.FA.MAC". From the standpoint of design entry-point, or a straight bottom-up approach along the arrows shown in the figure, "x.FA.MAC" and ”G.FA.MAC” are considered to be transistor-level MAC and gate-level MAC, respectively. However, from the viewpoint of design target-point (PE module), or a top-down approach reversely following the arrows shown in the figure, both ”x.FA.MAC" and ”G.FA.MAC” can be alternately considered as FA-level MAC's. In addition to these two points-of-view, both ”X.FA.MAC” and "G.FA.MAC” can be simply treated as mixed-level MAC's. Therefore, the defined entry-level of ”x.FA.MAC” and ”G.FA.MAC" depends on the point-of-view of the observer. As a whole, five independent MAC's corresponding to five distinct design paths of different complexity are constructed. They are stored in the CAD system and will be used to form the final target VLSI systolic array structures whose majority PE is an MAC. 3.4 Systolic Arrays for Matrix Multiplication A VLSI systolic array structure having the MAC as its majority PE can be designed directly by tessellating the required MAC's and other minority modules. As a testbed for benchmarking, the matrix-matrix 46 multiplier, shown in Figure 2.2 in Section 2.3, is considered. By tessellating arrays of any library MAC designed from the various combinations of module levels and the external latches, a matrix-matrix multiplier of any matrix dimension and word size can be constructed. Of course, it has been assumed that there is no limitation on single chip size. If such limitation exists, a top-down approach such as algorithm decomposition, must be considered. The top-down approach will be discussed in Chapter 5. The rhombic shape of the overall connected network illustrated in Figure 2.2 is impractical for actual layout.’ Therefore, the layout of the network is ”sheared” to a rectangular shape and the hex-shaped MAC's are changed to rectangular MAC's for better chip area utilization. This modified network is shown in Figure 3.8. All the small rectangular blocks shown in this figure are arrays of identical latches for data synchronization. The number of latches in each block is equal to the word size (the number of bits per word). These latches, however, are comparatively much smaller than the MAC's and thus have an insignificant effect on overall area and speed. The design of these latches will be presented in next section. *Finally, five different designs of a VLSI matrix-matrix array multiplier are obtained with respect to the three distinct entry-points and five different design paths as shown in Figure 3.2. Specifically, they are ”x.MAC.VLSI”, ”X.FA.MAC.VLSI', "G.MAC.VLSI”, ”G.FA.MAC.VLSI” and "FA.MAC.VLSI'. With all the target structures discussed in this chapter, the tradeoffs among chip area, circuit performance and design complexity of the proposed design hierarchy can be explored in next chapter. In 47 Figure 3.8 Array network for Q-Axg shown in Figure 2.2. 48 summary, utilizing the bottom-up approach, the target structures designed previously are l-bit full adder (FA), multiply-add cell (MAC) and VLSI matrix-matrix multiplier (VLSI) as shown in Figure 3.9. Entry Target structures: point FA MAC VLSI > X.MAC ----> X.MAC.VLSI X :---> X.PA ----> X.FA.MAC ----> X.FA.MAC.VLSI > G.MAC ----> G.MAC.VLSI G -/----> G.FA --—-> G.FA.MAC ---->'G.FA.MAC.VLSI PA ----> PA ~---> FA.MAC ----> PA.MAC.VLSI Figure 3.9 Design entry-point and pathway of various target structures. 3.5 Latch Design for Systolic Array Structures Dynamic registers or latches are required for high speed, synchronous processor chip designs, especially using VLSI [6, 13, 39]. Therefore, dynamic latches using a feedback path as shown in Figure 3.10 are utilized in the design of the systolic array structure presented in this work [6]. These latches are designed as shown in the figure so that they can be directly tessellated either vertically or horizontally amid the PE's. An individual latch, enclosed by the dotted lines in the figure, uses a two-phase nonoverlapping clock to load (£1) and refresh ($2) the input data. As mentioned in the previous section, the number 49 I}. LI. LE ”' ..l... i i - ,_l—1_n 1 ,_J""L_. I ' -h‘I Figure 3.10 Dynamic latches. of individual latches inside each row or column of an array is dependent on the word size. Scan design for testability of systolic arrays and other structures having similar regular and pipelined characteristics uses special. latches to facilitate the scan-in and scan-out operation. Therefore, the latches shown in Figure 3.10 are modified in order to insure a proper scanning operation. These modified latches for scan design are shown in Figure 3.11. In this figure, the signals of O1 and 41' are exclusive to each another during normal and scanning operations. During a scan-in or scan-out operation, ‘1' and d2 are used to load the testing vectors through the line marked S-IM or unload the results through S-OUT while ¢1 is disabled. In normal operation, O1 and d2 provide the 50 — 1.1 (a... I... I... I... ...—... ..J—Lne. ..J—L... Figure 3.11 Dynamic latches with scan design. synchronism for the data pumping through the computing arrays while dl' is off. The obvious difference between the latches with and without scan design is the addition of an extra signal line, 61', and an extra pass transistor per latch. Thus, the increase in overall chip area and propagation time due to the additional scanning function is insignificant. What's more, the effect on the design complexity of a latch is so small that the overall increase in the complexity of the systolic arrays is negligible. CHAPTER IV TRADEOFP PARAMETERS EVALUATION AND JUSTIFICATION The main purpose of the bottom-up approach in this work is to study the tradeoff of design complexity versus chip area and propagation time among the various design paths. This chapter first presents the measures for chip area and propagation time and design complexity. Then, the tradeoffs of the design complexity versus area-time based on a generalized layout model are developed. The design complexity can be expressed by actual design time in terms of CPU seconds or man-hours [44, 45]. Unfortunately, these measures of design time are undesirable due to their dependence on the CAD or CAE system or the designer's expertise. Alternative approaches to parameterization of the design complexity will be discussed and justified in this chapter. . The design of input/output (I/O) circuits for the overall systolic array structure has been neglected in the bottom-up approach by assuming the area of I/O circuits is small and potential I/O bottlenecks are avoided. However, it is noted that if certain conditions exist, such as the domination of I/O time over the pipeline segment time of PE, an I/O bottleneck may be encountered which will eventually degrade the overall performance of the circuit [5, 9]. This I/O problem will be further addressed in the top-down design approach presented in Chapter 5. 51 52 4.1 Chip Area and Propagation Time Computation Since the layout of any higher level module is done solely by tessellation Of lower level modules which include communication lines, module areas are determined simply by the sum of the areas of all required lower level modules. In addition, the layout of each module uses standard NMOS design rules as per the technique of Mead and Conway [6]. Thus, the total chip area of a structure is the product of the width and length measured from the final layout of the structure in terms of the minimum lithographic linewidth, A. The fundamental limit on the switching speed of a MOSFET (MOS Field-Effect Transistor) is the transit time of a carrier between source and drain. In practical circuits, however, the switching speed is limited by the capacitance charge and discharge times. The capacitance of main concern is the effective load capacitance of an active gate, CL' which is due to the capacitance of the active gate and the input capacitance of the next gate(s), plus intercommunication lines, parasitic, fringing and Miller-effect capacitance [6, 12, 26]. As an example, Figure 4.1 shows an NMOS Ni-input NOR 'gate driving NO identical outputs. By assuming one input of this gate is switched at a given time, the total effective load capacitance is expressed as cL ' Cstray + (Ni+1)CGDe + NiCDSe + CGDd + CDSd + No(CGSe + 2cGDe)' (‘-1) where G, D, S, e and d refer to gate, drain, source, enhancement mode 53 ”*‘I 94:“ —+ Ag Figure 4.1 An N -input NMOS NOR gate driving no identical outputs [26]. device and depletion mode device, respectively [26]. The stray capacitance, C is due to the communication lines and fringing stray' capacitance. C One and CD5. are the gate-drain and drain-source capacitance of the enhancement mode (pull-down) transistor of the NOR gate. CGDd and CDSd are the gate-drain and drain-source capacitance of the depletion mode (pull-up) transistor. The last term of the equation refers to the total capacitance of the pull-down transistors of No fan-out gates. Note that both CGDe terms include a factor of 2 for the Miller effect. This is because the gate-drain capacitance is charged in one direction for one polarity of input and in the opposite direction for the opposite polarity input [6, 26]. 54 cstray is the smallest, per unit area, among all the C terms shown in Equation 4-1 [6]. However, the long length of the communication or interconnecting lines among all the gates makes the total effect of Cstray significantly large, especially when gate channels shrink down into the subcmicron region in. the near future [46]. cstray will dominate over other capacitance in future ultra dense VLSI circuits [6]. In the hierarchical design of VLSI structures, the length of the interconnecting lines and spacing among modules generally varies with different module levels. The average spacing between two modules tends to increase as the design entry-point and pathways move upward along the design hierarchy [6, 27, 37]. Therefore, since the length of the interconnecting lines between two modules is simply proportional to the spacing between the two modules, cstray is greater among high-level modules than among low-level modules. Once the total effective load capacitance of an active gate has been found, its delay time can be estimated based on the charging or discharging time of this capacitance [6, 12]. The pull-up time (t ) pu and pull-down time (t ), which correspond to the charging (high-going Pd transition) and.discharging (low-going transition) time, respectively, are estimated by tpu 8 CL(VH-VL)/Ipu (4-2) and tpd - CL(VH'VL)/Ipd (4-3) where Va and VL are the output voltages of the high and low states, respectively [47, 48]. u and IPd are the averaged pull-up and I P pull-down currents, respectively. IPu and tpd are approximated by the expressions as follows [12]. 55 2 Ipu uncoxvtu (VDD :tu/3)/zzpuvDD (4'4) an§ « Ipd ' “headv DD th) (ZVD Dwth62 pdv DD (4’5) where un - the mobility factor for n-type charge carriers, C = the gate oxide capacitance per unit area of the pull-up or pull-down transistor, vDD = the supply voltage, Vt“, vtd = the threshold voltage of the pull-up or pullsdown transistor, 2 zpd = the length-width ratio of the pull-up or Pu' pull-down transistor. The length-width ratio, 2, refers to the ratio of length to width of the planar gate region (channel) of a MOS transistor [6, 27]. Assuming VHPV DD and VL =0, and substituting IPu and IPd into Equations 4-2 and 4-3, respectivEly, yields 2 tpu‘z 22 pu vDD ch/[unc ox vtu (V DD vtu/3)]' (4 6) and tpd- szpdvDD2 cL/[un c°x(v DD- vtd)2 (2vD D+vtd>]. (4-7) As a result, Equations 4-6 and 4-7 estimate the time required for switching the output of an active gate from low to high and high to low, respectively. Finally, the propagation time of a gate is taken as the worst case of pull-down and pull-up time, i.e., max{t Note pd, tpu}. also that since all the terms of capacitance are dependent on the size of lines or gate channels, tpd or tPu can be expressed in terms of A. ‘ Based on the above capacitance charging-discharging model, the propagation time of each module or target structure is found by summing the worst case delay time of all active gates located in the most critical path. The most critical path is the signal path with the longest propagation delay among all possible signal paths. Since the 56 propagation delay of all possible paths must be known before the longest delay path can be selected, a computer search of some sort is usually required, especially for random circuits [49]. However, finding the longest path in a highly regular structure is much easier because of the relatively low number of possible signal paths. If there are p possible signal paths inside a module or target structure, the propagation time, T, of this module or target structure can be expressed numerically by 91 92 r . mad 2: maxItPu(il),tpd(il)}, Z maxItPuuz), 1181 1231 9 P tpd(12)}'°"' Z mhpuupmpdupn }. (4-8) 1 1 P- where g1, g2, ..., g are the total number of active gates inside the P st nd th 1 , 2 , ..., p possible signal paths, respectively. The index i is used to indicate the active gates in a particular signal path. 4.2 Design Complexity of Regular VLSI Structures In general, there is no unique measure for VLSI design complexity [50]. As an example, compare an 8-bit microprocessor chip with a 256K memory chip. In terms of design time, the microprocessor chip is more complex than the memory chip because the menory chip has a high degree of regularity and thus requires less design time. However, in terms of 57 the number of transistors, the memory chip, crowded with hundreds of thousands of transistors is far more complex than the microprocessor chip which contains only a few thousand of transistors. Therefore, a comparison of the design complexity of VLSI circuits can only be made if all the circuits are of same nature. For example, from all points of view the 32-bit microprocessor and 256K memory chips are more complex than their 8-bit and 128K counterparts. The measure for design complexity is thus generally expressed on a relative scale. The design complexity, or simply the complexity of a circuit is loosely defined as the degree of difficulty encountered in designing that particular circuit. The design-time is said to be proportional to the complexity of the circuit since it requires more time to layout a more complex circuit [44, 45]. As a crude estimation, the worst case (upper bound) time or number of trials needed to obtain an "optimal” placement of M modules among M points is M! time steps or trials. The design time in this example is considered to be factorially proportional to M. A circuit having a larger number M can thus be said to be more complex than the same circuit having a smaller M. Similarly, the time required to obtain an optimal routing of L interconnecting lines can also be expressed in terms of L depending on the routing requirement. In reality, however, there are more practical limitations which relate the design-time of a circuit to the circuit complexity. These are in terms of modules to be placed and/or the number of interconnecting lines to be routed. As an alternate approach, the relative complexity of a circuit can be directly related to the density of the circuit module layout, i.e., 58 active region/module area. The active region of a module is defined as the total (logical-OR) surface area of covered silicon regions, such as lines, contact cuts, gate channels, etc., within that module. This approach is based on the fact that the layout area of any functional circuit designed by higher level modules (e.g., gates) is larger than that of the same circuit packed by lower level modules (e.g., transistors). This is because the predefined boundary of the higher level module prohibits them from being flexible at the upper levels of the design layout. As a result, a target circuit layout tessellated by high-level modules is larger and contains more null interstitial space (unused silicon surface area) than the same target circuit designed using lower level modules. Also, as a consequence of designing the same target circuits using different level modules, the target circuit having the larger layout generally has longer propagation delay due to longer interconnecting paths. Since this research concentrates on regular structures and M05 IC design technology, the following sections are aimed at justifying the application of the above approach in defining the design complexity of this class of structures. 4.2.1 Mbdule Placement Approach to Design Complexity For a highly regular structure, such as a systolic array, the placement of modules and routing of lines are plainly defined due to the properties of local connectivity and regularity. What's more, routing is not even an issue once the required modules have been placed. The 59 interconnecting lines are assumed to abut since the array structure utilizes local communication (i.e., nearest neighbors). Thus, as an initial measure the design complexity of a regular structure can be directly related solely to the number of modules to be laid out. Therefore, the design complexity can be expressed as a function of the number of modules to be placed (f(M)). For example, M=19 transistor modules (3 transistors for each XNOR or 2-input NAND and 2 transistors for each inverter as shown in Figure 3.6) must be placed in order to form a transistor-level FA ("X.FA”) in NMOS. However, M87 gate modules (2 XNOR's, 2 inverters and 3 2-input NAND's) must be placed to design a gate-level FA (”G.FA"). In a CAD/CAE design environment f(M) depends on the placement algorithm utilized by the design system. In the worst case f(M)=M! as discussed previously. In general, since heuristic placement algorithms are not optimal for all designs, f(M) is determined by the specified placement algorithm that is chosen to best Suit a particular design [51]. In a non-CAD environment, on the other hand, f(M) is dependent on“ the expertise of the designer. What's more, the designer's‘ sophistication in IC technology and graph theory can also effectively influence f(M). However, as M increases, f(M) tends to increase much more rapidly due to the limited capability of human memory. Therefore, intuitively, f(M) is linearly proportional to M if M is small but can be exponentially proportional to M as the number of modules increases. As a result, there is no general expression to relate f(M) and M for all designs. It is possible, however, that an order of magnitude 60 bound can be found for certain classes of designs. For example, with highly regular structures, which have the ideal locality and regularity properties to allow straight tessellation and simple module placement, f(M) can be assumed to be 0(M). Still, the approach using f(M) is undesirable since there is no exact bound on f(M). A better approach to the design complexity is described next. 4.2.2 Compactness Ratio Approach to Design Complexity An alternative approach to the assessment othhe design complexity of regular structures is to relate the complexity of a circuit to the density of the layout of the circuit module. The density of the layout of a circuit module is also called the compactness ratio (CR) and relates to the design complexity. The compactness ratio of a circuit is given by . active region inside a circuit module CR 8 (4-9) area of the circuit module . For example, the CR's of the NMOS depletion mode and enhancement mode transistor module shown in Figure 3.3 are 11/16 and 5/9, respectively. These ratios are found by measuring the not covered area per unit square and then dividing by the overall module area per unit square. In general, the lower bound CR of a circuit module can be expressed 61 CR = ——————— = ——- (4-10) where w and s are the average line width and spacing between two lines, respectively. The pitch, P, is simply the sum of the line width and spacing. If w=s, the lower bound CR of a circuit module is 1/2. However, the field of random or custom circuit design is so broad that the general expression for the CR of any random IC module varies with the application of the circuit and the IC technology used. A general expression for the CR of any MOS circuit module or device is derived next. The most primitive element in MOS integrated systems is the MOSFET [6, 26, 27]. A MOS transistor is produced whenever a polysilicon path crosses a diffusion path (see Figure 3.3). Any functional circuit can be formed by using such a transistor. For example, the target structures of ”x.FA" and ”x.MAC" as described in Chapter 3 are tessellated from transistor modules. These transistor modules are packed together, restricted by a set of layout design rules such as the minimum linewidth of polysilicon, diffusion or metal, the spacing between any two lines, size of contact cuts, etc. Assuming that wx and 5x are the average linewidth and spacing, respectively, Figure 4.2 shows a layout model of a transistor-level structure built by an infinite number of transistors in a plane unbounded on the sides. The wx and 5x distances are not necessarily equal in scale and the dotted lines indicate the boundary of a transistor module. Let k be the unbounded number of transistor modules in each row or column. The expression for the CR of this 62 5x Wx Sx Wx Sx le s r" e. 7‘] X Wx L O O O 5x E" ""' Wx L :r i O O O 5x L- --. ..3 Fix L O O O O O O O O O O O O O O O Wx_ O O O 5x .... .L__. ...J Figure 4.2 A transistor-level module. Wx 5x 63 transistor-level module is kwx[k(wx+25x)+25x] C%{ = ____ .‘2 [k(wx+sx)+sx] kw [k(w +25 )+Zs ] Lx , where Lx is defined as the length of the overall module. By letting k-->~ representing a generalization of an unbounded module, Equation 4-ll gives w (w +25x) wx(Px+sx) limcaxs- xx 2 k-->a (wx+sx) P , (4-12) where Px is the average pitch as defined previously. Further, if wx=sx, which is the usual case for MOS design, Equation 4-12 gives the expected CRx=3/4=O.75. Likewise, a gate-level module or device, such as the ”G.FA” or ”G.MAC” as described in the last chapter, is tessellated only by gate . modules. A basic gate module can be assumed to be formed by a finite array of kxxkx transistor modules as in Figure 4.3. This figure illustrates a layout model of a general gate-level structure built by an unbounded number of gate modules. is taken as the average spacing 56 between any two gate modules and the dotted lines are used to clarify the boundary of a gate module. Usually, s is greater than 5x or w . G X Note that if sG=sx then Figure 4.3 will be exactly the same as Figure 4.2. 64 .... .. he... . i... L a. L. . WELT... ... Figure 4.3 A gate-level module. 65 Again, let k be the unbounded number of gate modules in each row or column. The expression for the CR of any gate-level structure is thus given by wx{k[kx(wx+25x)-st+ZSG]+4sx-ZsG} CR = o 2 (4-13) {k[kx(wx+sx)-sx+sG]+st-sG} . By letting k-->¢, Equation 4-13 yields ‘ w [ (wx +25 xx)-25 +25 61 ’ lim CR = kxx Rx (4-14) G k-->~ [kx(wx +s x)- sx HG] For the example of MOS design where wx=sx, the above equation becomes, w (3 w -2w +2s ) lim c126 "xx "xx x26 (4—15) k-->o .(kawx-wx+sG) . Therefore, the CR of a MOS gate-level structure is determined by the number of transistors, kxxkx, inside a gate module, the average spacing between any two gate modules, sG ,and the average line width and spacing between two lines, wx and 56, respectively. For example, for some typical value kxsz and w ”=5 =(2/3)SG [6, 37], then CRG=56/8180.69l. As a result, Equations 4-12 and 4-14 represent the general expressions for the CR's of the MOS transistor-level circuit module (CRx) and the gate-level circuit module (CR6). It is obvious that CRx is greater than CRG if sG>sx. Based on the layout models and equations derived above, a more general model can be further developed and applied to all structures that possess multi-level regularity and 66 modularity. This model is derived in next section and can provide a general comparison of CR versus module area and propagation delay. 4.2.3 Tradeoff of Compactness Ratio Versus Area-Time A generalized model for the comparison of compactness ratio (CR), chip area and propagation time of a circuit module is presented in this section and can be applied to any structure possessing multi-level regularity. The model is expanded and generalized from the models that have been shown in Figures 4.2 and 4.3. This model represents a bottom-up approach to the design task by assuming that all the high lavel modules and final target structure are basically tessellated by the lowest level modules such as the MOSFET in MOS design. Figure 4.4 shows a hierarchical design model of n+1 arbitrary level modules assuming all the high-level modules and the final target structure are formed by tessellation. The lower numbers correspond to lower level modules and the mn+1 level module represents a final target structure. Note that any one of these levels can also be considered as the target structure of a sub~module or building block. It is assumed that the design entry-point starts from the lowest level, m1. Of course design entry-points other than 1111 can also be selected. But, it will be clear that the general expression will become the same no matter which level is chosen as design entry-point. As a practical example of this model Figure 4.5 illustrates the corresponding hierarchical design of the systolic structure for matrix multiplication as described before. In terms of the notation that was 67 [ENTRY Pomfl r , um... LEVEL MODULE I Figure 4.4 Hierarchical design model of an n+1 level module. used to represent a multi-level module in Chapter 3, m1.m2, m1.m3, ..., . m1.m(n+1) denote the "one-path” modules or target structures. These one-path modules are designed via pathways marked as :11 in Figure 4.4. Specifically, ml.mn is called a one path Inn level module. Likewise, ml.m1.mj for (n+l)3j>i, are defined as two-path m level modules 3 designed via pathways chained by d1 and d2 as shown in the same figure. An n-path "n+1 level module is denoted as m1.m2....mm_1. For example, an m3 level module, considered as a target structure, can be formed by either m1 level modules or 1112 level modules. An In2 level module in turn 68 [ENTRY POINT J I IRMBNNR LEVEL MODULE IL:— FA OATE LEVEL LEVEL MODULE MODULE I MAC LEVEL MODULE Figure 4.5 Hierarchical design model of a MAC level module. is tessellated by 1111 level modules. Therefore, m1.m3 and m1.m2.m3 are one-path and two-path m3 level modules, respectively. To relate these design path transitions to the design hierarchy described in this work, consider that m is thetransistor module, m2 is 1 the gate module and m is an FA module. The figure implies that to 3 design this particular device two options exist. These options are: m1.m3 - implying the construction of the device from transistor ' modules; 69 1 2.!!! 3 — implying the construction of the device from new gate modules specifically designed for this task from transistor modules. ‘ However, if a design entry-point other than m is considered, two more 1 options included are: m2.m3 - implying the construction of the device from the gates previously designed and extracted from a cell library; m3 - implying the extraction of the device itself from a previously stored library of defined device level modules. It is clear that the design times of these four options are quite different. It will obviously take a much longer time to design an PA from transistors than it will from gates while the last option has a null design time. Selecting m as the design entry-point implies that there are l 2“”) different combinational paths for the design of an nn level module. In other words, the final target structure, an mn+1 module, can be obtained through any one of the 2(n-l) distinct design level paths as shown in Figure 4.4. However, if all other levels are also considered as possible design entry-points, there are 1+2(n-2)+2(n-3)+... different paths for the design of an mn level. module. Based on the model as illustrated in Figure 4.2, the expression for the CR of any one-path module that is built from an array of klxk1 1111 level modules is 70 k1w1[k1(w1+251)+251] CR1 8 2 [k1(w1+sl)+sl] k w [k (w +2s )+2s ] g 1 1 1 12 1 1 (4_16) L1 , where w1 is the average width of the active device within an m1 level module (e.g., the line width of a MOS transistor) and s1 is the average spacing between any two active devices. L1 is simply the length of the overall oneépath module. Indeed, this equation is the same as Equation 4-ll if kl and the subscript of l are replaced by k and the subscript of x, respectively. Likewise, based on Figure 4.3 and corresponding to the design paths chained by dl and d2 as shown in Figure 4.4, the CR of any two-path module that is built from kzxk2 one-path modules is kzklw1{k2[k1(wl+2s1)-Zsl+252]+4sl-Zsz} {k2[k1(w1+sl)-sl+szJ+251-sz} , where 52 is the average spacing between any two one-path modules. Again, the above equation is same as Equation 4-13 if k2, RI and the subscripts of 2 and l are substituted by k, kx and subscripts of G and x, respectively. Further, the CR of any three-path module tessellated by k3xk3 two-path modules with 53 as the average spacing between any two two-path module is 71 kk k w (k {k [k (w +2s 11)-2s +2s 21-25 +2s }+4s —2s ) ca 3 2 1 1 3 2 1 1 2 3 1 3 . (4_18) 2 (k3{k2[kl(wl +5 1)- s 1+521- 52+s 33+251-53) . In order to check the validity of the above equation, let k3al. Then CR3=CR2 because a lxl two-path module is simply an array of kzxk2 one-path modules. Further, if k2=k3=l, then CR3=CR2=CR1. For the derivation of CR for a general n-path module formed by knxkn (n-l)-path modules with spacing of sn, Equations 4r16 to 4-18 are modified as follows. First, let WW -w +451 and L 8w +251 be the o 3:",1' R010 1 _ initial values for the general equation. Then Equation 4-16 is rewritten and becomes 11w [k1 (w 11+dsl-4sl+25 )+4s [k1 (w l+251- 251+ sl)+Zsl-s -2s1] 512 l l W0[k1(Ro-4sl+251)+4sl-Zsl] 512 (4-19a) [k1(Lo-25fi l-s)+251 1 Physically, wo is the average width of the active region of a basic device (e.g., the line width of a MOS transistor) and R is the total 0 lump-sum length of the active region of the same device. Therefore, is simply the the logical-OR area of a basic device module. Also, H0 0 L0 is the length of the basic module. Further, let W1 = klwo, R1 = k1(Ro-4sl+251)+asl-Zsl, and L1 = k1(Lo-251+sl)+251-sl, then, 72 cal = --;-5- (4-19b) Next, Equation 4-17 is rewritten as k k W {k2[kl(Ro-4s +251)+451-25 —4s +2521+4sl-252} 2 l 0 l l 1 CR2 : {k [k (L -25 +5 )+25 -5 -25 +5 ]+25 -5 }2 2 l 0 l l l l l 2 l 2 a kzwl[k2(R1-451+252)+451-252] (4-20a) [k (L -25 +5 )+25 -5 12 2 l l 2 l 2 ‘ Again, let W2 2 kzwl' R2 = k2(R1-451+252)+451-252, and L2 = k2(Ll-251+52)+251-52, then, W ca . --3§3- - (4-20b) 2 2 L2 . Likewise, Equation 4-18 is rewritten as k3k2W1{k3[k2(R1-451 1-252-4sl+253]+4s 2 {k3[k2(L1-251+52)+251-52-251+53]+251-53} +252)+45 -253} l k W [k ( -45 +25 )+45 -25 ] . 3 2 3 R2 1 3 1 2 3 (4_213) [h3(L2-251+s3)+251-53] . Finally, let w a k w 3 3 2' R3 - k3(Rz—451+253)+451-253, 73 L3 = k3(L2-251+53)+251-53, then, CR 2 ------ (4-21b) A general recursive equation of CR for the n-path module derived from above equations is wan CR 8 ------ (4-22) n L 2 n I where, wn a knwn-l' Wo-wl; Rn = kn(Rn_1-tsl+25n)+4sl-25n, R0=w1+451; and Ln = kn(Ln_1-251+5n)+251-sn, Lo-wl+251. Physically, Wn and Rn are the lump-sum width and length of the active region, respectively, and Ln is simply the length of the overall module. By letting kn-->o as in Equations 4-12 and 4—14, Equation 4-22 becomes wn-1(Rn—1"sl*25n) lim CRn = 2 (4-23) kn-->o (Ln_1-2sl+sn) . If the :111 level module is assumed to be a transistor module (i.e., ncl=X), Equations 4-22 and 4-23 are exactly the samn as Equations 4-11 and 4-12, respectively. As a result, Equation 4-23 provides a general recursive expression for the compactness ratio of any multi-level module. What's more, the denominator or Equation 4-22 is actually the area of the n-path module. 74 Therefore, the general recursive expression of the overall area of a multi-level module is given by _ 2, _ - 2 - An- (Ln) [kn(Ln_l 251+sn)+251 Sn] , (4 24) where Lo=wl+251. In order to find a general expression for the propagation delay of a multi-level module, the delay time of the intercommunication lines among the modules is calculated independently of the internal module delay. Based on the model illustrated in Figure 4.2, the propagation time of a one-path module formed by klxkl m1 level modules is found by T1: fl(kl)(twl+tsl)+tsl, (4-25) where t“1 is the average propagation time due to the active transit time of a sumeodule or the m1 module (e.g., channel delay of a MOS transistor), and t51 is the delay of the intercommunication lines connecting the sub-modules. f1(kl), a function of kl, is the number of subpmodules through which the critical signal passes. Obviously, f1(k1) varies with the nature of different circuits. 2 1 f1(kl) approximately equal to 2k +1. In general, f1(k1) can be either For example, the MAC built by approximately k FA's in Figure 3.7 has 1 an analytical distribution function based on probabilistic assumptions or a statistical function based on stochastic programming [52, 53]. Similarly, based on Figure 4.3, the propagation time of a two-path module, built from kzxk2 one-path modules, is T2: f2(k2)(T1-2tsl+tsz)+2t (4-26) sl-tsZ' Again, f2(k2) is a function of k2 depending on the nature of the circuit. Likewise, the propagation time of a three-path module built from k xk 3 3 two-path modules is given by 75 T = 13(k3)(T2-2t51+ts3)+2tsl-ts 3. Based on Equations 4-25 to 4-27, a recursive expression for the propagation time of an n-path module is derived as 2 Tn= kn (T 2t51+tsn)+2ts1 tsn, (4-28) Note that the key comparison of different Tn is n—l where T st" +2t O l 51' based on the propagation delay of the intercommunication lines among the sub-modules, tsn‘ In summary, Equations 4-23, 4-24 and 4-28 provide the general expressions for the compactness ratio, chip area and propagation time of a general multi-level module. Once the parameters of w t l' twl' si' and fi(ki), for all i=1,2...,n, have been estimated, the tradeoffs of si' ki compactness ratio versus area-time can be obtained. In fact, in designing a particular target module, if the spacing among higher level modules is increasingly greater than that among lower level modules, >5 i.e., sn>...>s the expressions for the CRn is a decreasing 2 1' function, i.e., CRn<...°”>A2>Al and Tn>...>T2>T1. For example, if 52>51, the compactness ratio of a two-path target structure (e.g. ”x.G.FA") is less than that of a one-path target structure (e.g. ”x.FA") while the area-time parameters of the two-path structure are greater than those of the -one-path structure. As a result, CRn is inversely proportional to An and Tn' corresponding to the parameters of the number, width, and spacing of the subvmodules of different levels. CHAPTER V TOP-DOWN APPROACH IN ARRAY DECOMPOSITION The investigation of hierarchical modularity pursued along the top-down approach starts from the algorithmic level at which a systolic, or other algorithm possessing similar regularity, is physically decomposed providing chip-wise modularity. The algorithmic level structures, such as the matrix-matrix systolic multipliers, are constructed from processing elements (PE's) as defined by the design hierarchy. Therefore, the purpose of the top-down approach is mainly to- evaluate the effect of various physical partitions to the tessellated array of PE's. An overview of the top-down approach to the design of systolic structures is shown in Figure 5.1. The top-down approach is especially important due to the current IC lithographic and chip size limitations. In addition, it can provide the flexibility to customize a modularized array structure with a "best fit” size and[or performance for any system with variable dimensions. In other words, designs using a set of chips (decomposed arrays) can be more flexible and efficient if the dimensions of the problem are variable or not known a prior . In the following sections, parameters are first defined and then a general expression relating the I/O bottlenecking of operands and pin limitation is derived. Finally, the general expression is depicted graphically. 76 77 DESIGN ENTRY POINT TRAN OFF PARAMETERS PE I Aiooanwmc LEVEL LEVEL uoouun TESSELLATION ONE 'PATH MULTI'PATH MODULE ARRAY MOWLE ARRAY 4 MOWLARIZED , ARRAY Figure 5.1 An overview of the top-down approach. 78 5.1 Chip-Wise Partitioning of Systolic Structures The physical decomposition, which is also known as modularization [4], of a systolic structure is simply the partitioning of groups of interconnected PE's. For similar regular structures, the array of elements at the next lower level below the algorithmic level is to be partitioned. Each group of partitioned PE's is‘ assumed to be implementable and fabricated on a single IC chip. The maximum number of PE's inside each chip depends, of course, on their size. Thus, assuming an unbounded number of chips, a systolic structure of arbitrarily large dimension can be implemented on a set of IC chips. To maintain regularity, all the PE's within a partitioned group or chip must be identical. Let i and j be the numbers of input and output operands of each PE (local I/O), respectively, and n be the number of bits per operand. Then the total number of local I/o lines per PE is (i+j)n. Usually I/O symmetry is a "by-product", characteristic of a structure possessing regularity and local connectivity properties. Symmetrical I/O lines mean i-j, which will be assumed throughout this section. For example, the MAC's shown in Figures 2.1 and 2.2 have i=j=3, respectively. PE's which are commonly found in systolic structures have i=j=2 or 3 [3-6]. . Consider an R-by-C partitioned array of symmetrically interconnected PE's fabricated on an IC chip where R and C are the number of elements in a row and column, respectively. The maximum 79 number of pinouts (for operands only) per modularized chip is then given by POP(R,C,1=j,n) = 2[(R+C)(i-1)-(i-2)]n. (5-1) As an illustration, Figure 5.2 shows two sample arrays with RxC=3x3 and 4x2. PE's in the 3x3 array have local I/O lines of 18382 and those R368 333 L - - - - - - I. I. 2 R‘s:- 4312 I P,,(3.3.2.n)- Ian p”(4.2.3.n)‘ 22:! Figure 5.2 Examples of partitioned arrays. 80 in the 4x2 array have i=j=3. Note that the depiction of latch placement is neglected to simplify this illustration. Also, the direction of the data flow is assumed to be symmetric but not restricted to the fixed directions as shown in the figure. Assuming all I/O buses shown in Figure 5.2 contain n lines, the maximum numbers of pinouts of the partitioned arrays are then equal to Pop(3,3,2,n)812n and P (4,2,3,n)=22n. This points out an immediate OP constraint to the practical implementation of these examples in that the numbers of pinouts for n=32 bits are 384 and 704. This is far beyond the current IC pin limitation and thus requires the use of an appropriate I/O multiplexing scheme. ,Assuming two separate sets of pinouts are used for the input fan-in demultiplexing (DMUX) and output fan-out multiplexing (MUX) of I/O operands, the number of pinouts is reduced to Pop-2n. Further, if a combined I/o DMUX/MUX scheme is used then Pop-n, but this is not cost-effective as will be explained in the next paragraph. Thus, the number of pinouts for operands can be expressed in terms of n depending on the multiplexing schemes used. Multiplexing schemes using P°p=2n and 3n have been proposed for computing arrays for matrix multiplication and covariance matrix inversion [lo]. . As mentioned, it is not cost-effective to use a single set of pinouts for a combined I/o DMUX/MUX scheme. This is due to many reasons. First, the sharing of the same set of pinouts by both input and output operands complicates the scheduling of systolic operations. Secondly, extra logic circuits will be required inside each chip for the control of bidirectional flow of input and output operands. Thirdly, 81 since all data pinouts of all chips are connected to the same data buses, there is virtually ,no chip-wise parallel data transmission. Lastly, the combined scheme has the largest potential for exhibiting an I/O bottleneck because it has longest multiplexing time in comparison with other schemes using separate sets of pinouts. An I/O bottleneck occurs when the I/o operand multiplexing time dominates over the pipeline segment time [5, 9]. Based on these reasons, multiplexing schemes utilizing at least two sets (a pair of n bits) of data pinouts are considered best, i.e., popZzh. The upper bound Pop“) is simply the maximum number of pinouts for operands without I/O multiplexing as found in Equation S-l. Unfortunately, no single multiplexing scheme - is universally applicable to every modularized systolic algorithm [4, 10]. Different systolic algorithms use different schemes as well as different numbers of pairs of n data pinouts. For example, two different schemes for modularizing systolic structures for L-U decomposition and matrix inversion have been reported by Hwang and Cheng in [4]. The square-shaped multiply and divide partitioned array chips for the construction of the L-U decomposition network utilize S and 2.5 pairs of n data pinouts, respectively. But, the triangular multiply and square-shaped multiply modularized array chips for the formation of the matrix inversion network use 3 and 4.5 pairs of n pinouts, respectively. All of the modularized multiply arrays in [4] contain MAC's having 183-3 and the division arrays are . composed of division cells (DC's) having i=j=2. Since 5 pairs of n data pinouts is obviously equal to lOn pinouts, n is limited to a word size 82 of less than 32 bits. This limitation is very undesirable because many engineering problems require high precision calculations [5]. Also, the consideration of I/O bottlenecking was not mentioned in [4]. 5.2 Tradeoff Between I/O Bottleneck and Pinout Limitations Although no universal multiplexing scheme can be applied to all modularized systolic algorithms, a general expression can be derived relating R, C, i, j, and us. This expression will be used to further study the details of I/O bottlenecks between partitions. N5 is defined as the number of pairs of n data pinouts. The others have already been defined in Equation 5-1. As discussed previously, at least one pair of n data pinouts must be used for a cost-effective multiplexing scheme; therefore, N521. ‘ Interestingly, if "5 is less than isj, arrays of external demultiplexers (DMUX's) and multiplexers (MUX's) must be placed amid the modularized chips in order to route the data operands properly. As an example, consider the array of 2x2 modularized chips consisting of. RxC=2x2 PE's as shown in Figure 5.3. Each chip in this array has N581 and i=j=2. Assuming each bus is n bits wide, an array of n 1:2 DMUX's is required to demultiplex the output operands to the either right or lower neighboring chips. Similarly, n 2:1 MUX's are necessary to multiplex the input operands from either the left or the top adjacent chips. However, if uszisj, no external DHUX's or MUX's are needed. This can be illustrated by using the same example except having Ns-isjsz as shown in Figure 5.4. 83 xcxo Figure 5.3 Array of modularized chips having Wsl an I/O bottleneck exists. What's more, the greater the BI, the more severe the degradation of overall performance of the circuit will be. Assuming i=3 and substituting Equation 5-5 into Equation 5-6 produces (R+C)(i—l)-(i-2) . 31 s max{tD(l:M),tM(M:l)}. (5-7) Nstra- To depict the above equation graphically, consider a systolic array built merely by MAC's. Since MAC has 3 pairs of symmetry input and 87 output lines, i.e., i=j=3, Equation 5-7 becomes 2(R+C)-1 BI = max{tD(l:M),tM(M:l)} . (5-8) NstPE For simplicity in simulation, R=C and tD(l:M)=tM(M:l) are assumed. Note that Rsc implies that the partitioned array has R2 or C2 MAC's. Interestingly, since BI¢(R+C), BI is the minimum it R=C. Further, let tB-tD/tPE, then Equation 5-8 gives (GR-l) BI = -------- t o (5-9) N 3 Based on the above equation, Figure 5.5 graphically depicts the relationship between BI and the number or pinouts in terms 0! t3 and n. Each curve in the tigure corresponds to the number of Rae, which can also be used to represent the chip area of a modularized array. To illustrate the application of this graph, consider that R810 (i.e., lelO MAC's), N581 (i.e., number of pinouts-2n) and t380.05 (i.e., tD(l:M)-5 nsec and tpzsloo nsec); then BI-l.95. This indicates - that an I/o problem exists. In other words, the I/O time is almost twice as long as the pipeline segment time which. degrades the overall performance or the circuit by one-halt. The solution of this 1/0 problem is to either design taster DMUX and MUX circuits or to reduce the number R. Another feasible solution is to increase the number of pinouts which, however, is dictated by the I/O multiplexing scheme as well as the technological pinout limitation. 88 VD BOTTLENECKI NG INDEX (BI) 40 t .t 3 -r 30 ‘3" 20 tad- lo tB-I- \Ral‘ I I l I 7 ’ an 40 on an 10:! NUMBER OF PINS (2N3n) Figure 5.5 NO bottlenecking index versus miner of pins. 89 The I/O bottlenecking index, BI, defined in Equation 5—7 provides an indication of the occurrence of an I/O bottleneck of operands during chip-wise partitioning of a systolic or other similar structure. By examining BI and the number of pinouts allowed per modularized array or chip, one can evaluate the tradeoff of various physical partitions to the tessellated arrays of PE's versus the I/O pin limitation. CHAPTER VI RESULTS AND CONCLUSIONS The bottom-up and top-down approaches to the design methodologies and the study of tradeoffs for hierarchical modularity in regular VLSI-based circuits have been presented in previous chapters. In this chapter, the results obtained from the design of a systolic array structure as a testbed using these approaches are presented and discussed. These results numerically verify the tradeoffs among circuit complexity, chip area, propagation time, number of pinouts and the condition for I/O bottleneck existence in the design of multi-level structures. A design of a target structure can be started according to an analysis of these tradeoff parameters. The choice of the design entry-point and pathway will eventually determine the final circuit performance. Finally, future trends in the design of VLSI-based structures related to this research are discussed. 6.1 Compactness Versus Area-Time Ratios Using the bottom-up approach discussed in Chapter 3, the physical layouts of several FA's as target structures are shown in Figures A.l-A.3 in the Appendix. These represent ”x.FA", ”G.FA" and ”FA”. In each figure there are two different versions, type-A and type-B. These 90 91 two types were designed so as to enable tessellation into two types of nearest neighbor communication patterns. A S-bit MAC tessellated by both types of FA's is illustrated in Figure 6.1. The schematic layout of this target MAC is actually sheared and flipped vertically from the rhombic shaped MAC shown in Figure 3.7. Note that the summand generation takes place in the lower left corner of the FA's. This is depicted by the solid geometric symbols shown in Figure 6.1. .The required number of type-A and type-B FA's to form an n-bit MAC are n2-2n+2 and 2n+l, respectively. Type-A is the majority FA and will dominate over type-B FA's if n is large. For example, if n-32, the number of type-B FA's is just 6.76t of that of type-A FA's. Table 6.1 shows the chip area, propagation time and compactness ratio parameters corresponding to the different entry-points and pathways traversed in designing the various target FA modules. Since the relative area and time ratios are used for comparison only, the parameter of the lithographic linewidth, A, has been removed. For simplicity, the figures of area and time in Table 6.1, and all other tables presented in this chapter, are given assuming A-l um. The chip areas of the various modules were measured directly from their physical layouts. The compactness ratio (CR) is a ratio ofthe logical-OR area to the overall area of the module. The logical-OR area is the total surface area of covered silicon regions. This was found from the physical layout automatically by the CAD system utilizing a command called "LOG OR ACTIVE” which uses the coordinates of the layout to find the logical-OR area. e w": x a, .5, Figure 6.1 Schenetic layout of a 5-bit MAC. Table 6.1 Area-time and compactness ratios of FA. 93 Target Entry Design 2 Area Time ATR CR module point path um NR nsec NR x X.FA 90x34 1.00“ 2.04 1.00“ 1.00 0.755. G G.FA 94:36 1.11 2.33 1.23 1.17 0.686 FA FA FA 103x40 1.35“ 2.56 1.37“ 1.36 0.644 x x.ra 87x30 1.00“ 2.06 1.00“ 1.00 0.751 TYPE"B a a G G.FA 92x36 1.27 2.19 1.06 1.17 0.664 31 FA FA 103x37 1.46“ 2.36 1.15“ 1.31 0.671 NRBNormalized Ratio “Ratio based on X.FA ATRsArea-Time Ratio CR=Compactness Ratio The propagation time results were obtained by using the charging-discharging model presented in Section 4.1. Since the model requires physical parameters of the MOS transistor, such as threshold voltages and doping concentration, typical values of these parameters were obtained from references and are listed in Table 6.2 [6, 26, 27]. The normalized ratios (NR‘s) of both area and time shown in Table 6.1 are based on the corresponding values of either type of ”x.FA”. The area-time ratio (ATR) shown in the second from last column of the table is the mean of NR's of the corresponding design path. The table shows that CR increases, representing an increase in complexity, as ATR decreases, indicating a better area utilization and performance. 94 Table 6.2 The typical physical parameters of MOSFET technology [6,26,27]. Threshold voltages: depletion mode transistor (vtu) -4 V enhancement mode transistor (th) l v Voltage supply (VDD) S v High level gate output (VB) 5 v Low level gate output (VL) 0 V Oxide thickness between transistor . . gate and channel 250 A-lOOO A Permittivity of silicon dioxide 3.451110.ll F/m n-type impurity dOping concentration 101.6 cm-3' Electron mobility at 300°x‘ (un) 1000 cmz/V-sec A circuit with a higher complexity . has several important implications as follows. 1. 2. The high density of the circuit layout complicates the process of mask generation because masks must be perfectly aligned over each other. There is virtually no tolerance of any misalignment. Pinch-through between two devices is most likely to happen in a more complex circuit because of the closeness of the devices. Pinch-through simply causes two devices to short together. Power dissipation per unit area is much higher in a denser circuit requiring special treatment in heat removal. The difficulty of testing a circuit tends to increase with the complexity.‘ 95 The average CR's of the transistor-level FA, "x.FA”, and the gate-level FA, ”G.FA”, are 0.75 and 0.69 as shown in Table 6.1. According to Equations 4-12 and 4-15, this implies that wx=sx=(2/3)sG and kxzz. This is because the average spacing is equal to the average line width (an NMOS design rule as per Mead and Conway). Also, the average spacing between two gate modules, again by design rule, is 3/2 of that between two transistors [6, 371.‘ The average CR of both types of ”FA” is found to be 0.66. This indicates that the spacings among the submodules inside the ”FA” are larger than those of the other target FA's. Interestingly, the design of a ”FA”, which used an approach similar to a gate-array technique in order to provide a fast design time, has the smallest CR but the largest ATR among the three different target FA modules. Using the above various FA modules as basic Sells, three target MAC's, ”x.FA.MAC”, ”G.FA.MAC" and ”FA.MAC", are tessellated. In addition to these MAC's, two more MAC's designed using transistor or gate modules only are the ”x.MAC" and ”G.MAC”. The geometrical layouts of these five Inc's are presented in Figures 11.4-11.5 in Appendix. All MAC's shown in these figures are 4 bits in size for illustration. In each of these figures, a high percentage of wasted (null) areas appears in the far left and right sides of the MAC layout. In a practical application, however, MAC designs are at least 16 bits rendering the 'null area insignificant. As an example, a lG-bit ”G.FA.MAC” is depicted in Figure A.9. In this figure, the null area as mentioned above is less than 5% of the overall module area. 96 Table 6.3 presents the results obtained from the designs of five different target MAC's again with A=l um. The area LxW in umz, and time t in nsec, of the corresponding target MAC's are expressed as X.MAC : Lxxwx 8 (104n+84)x(44n+29), tx =4.07n+2.ll; X.FA.MAC : LXFxWXFg (ll4n+95)x(48n+3l), tx384.14n+2.18; G.MAC : Lwa = (llln+94)x(46n+40), t =4.67n+l.91; G G G.FA.MAC : LGFxWGF=(lan+lOl)x(50n+37), tsp-4.77n+l.90; FA.MAC : 1.waF =(lZ7n+ll4)x(54n+38), tP I5.23n+l.98. (6-1) Table 6.3 Area-time and compactness ratios of MAC. Target Entry Design 2 Area Time ATR CR module point path um NR nsec NR x.uac LxxWx 1.00“ tx 1.00“ 1.00 0.69b x b a a x.ra.xac LXwaxF 1.20 tXF 1.02 1.11 0.62 G.MAC LwaG 1.12“ tG 1.15“ 1.14 0.65b MAC G b a 3 FA FA MAC L w 1 50“ t 1 29“ 1 40 0 57b . Fx F 0 F O C . NRSNormalized Ratio “Ratio based on x.xac & n-->e ATR-Area-Time Ratio bn-->c- CR-Compactness Ratio Lxxw§ 8 (104n+84)x(44n+29) tx -4. 07n+2. ll XI,- (114n+95)x(4an+31) txr=4.14n+2. .18 §§§:% - (111n+94)x(46n+40) tG =4.67n+1. 91 "56 F-(118n+101)x(50n+37) tGF=4.77n+l. .90 -(127n+llt)x(54n+38) tF =5. 23n+l. 98 fiufi:er of bits per word 97 By letting n-->a for comparison, the normalized ratio (NR) and compactness ratio (CR) in Table 6.3 are made independent of n. ‘In fact, if n232, new NR’s and CR's will vary by less than 1% from those in Table . 6.3. This table further shows the inverse relationship between the ATR and CR of the target MAC's designed by different deSign entry-points and pathways. Interestingly, the set of CR's for the MAC as a target module is relatively lower than that of the FA as the target module (compare Tables 6.1 and 6.3). This is because of the effect of the greatly increased number of signal communication paths connecting and routing around the FA's inside the MAC. Since the average lines in NMOS design are separated by an equal width, there will be .a factor of wx/(wx+sx)=wx/2wx=0.5 effect on the overall CR if the number of signal paths is significantly large. But, if multiple metal layers are considered for global communication lines, the overall CR will not be significantly affected. This, however, increases the Inumber of fabrication steps and decreases the yield of good IC chips due partially to an increase in the number of contact cuts [54]. In the design of the final target structure, a matrix-matrix systolic multiplier, the design of latches to be placed amid the MAC's must be considered. Figures A.10 and A.ll in Appendix present the layouts of two dynamic latches with and without scan design. These layouts are based on the logic diagrams presented in Figures 3.10 and 3.11. These latches are designed to tessellate with the MAC's, either vertically or horizontally, without any routing. The area and time delay of the latch without scan design are 33x38 um2 and 0.40 nsec; and 98 those of the latch with scan design are 46x38 um2 and 0.52 nsec, respectively, for A=l um. Therefore, the increases of area and time for scan design are 39% and 8.3%, respectively. The CR of the latch with scan design is 0.698 and is lower .than that of the latch without scanning, which is 0.755. This further reflects the effect on the overall CR if the number of communication signal paths increases. Figures A.lz and A.l3 illustrate a 4x4 matrix-matrix multiplier built by the 4-bit ”x.MAC's” and the latches without scan design. Some of the details inside the MAC's shown in Figure A.lz have been omdtted for a clearer view. Figure A.l3 is a zoom view of Figure A.lz which shows the detailed tessellation of the MAC's and latches. Let L, W, t are the length, width and propagation time of the n-bit MAC. Then, based on Figure A.l2 and Equations 2-2 and 2-3, the area, A(L,W), and time, T(t), of a wl-by-w2 matrix-matrix multiplier for two NxN matrices having bandwidth of w1 and wz, are derived as A(L,W) a [w1(L+33)+33][w2(W+33)+33] 411112, (6-2) and T(t) = [3N+min(w1,w2)][t+0.48] nsec. (6-3) Only the length of the latch (33 um) was used in the calculation of the total area in the above equation. This is because of the latches placement shown in Figure A.13. The above equation also implies that the design performance of the matrix-matrix multiplier depends mainly on the MAC since the latch arrays are comparatively insignificant in both area and time if n is large. For example, if a 32-bit ”x.MAC" is used, the length of the latch (33 um) is 1.0% and 2.3% of the length and width of the "x.MAC” (3412 um and 1437 um). Likewise, the CR contributed by the latch will 99 not have a significant effect on the CR of the MAC. The area-time parameter and CR of the matrix-matrix multiplier are thus assumed to be equal to those of the corresponding MAC, for n-->o. Table 6.4 and Figure 6.2 summarize the results obtained from the bottom-up approach to the design of various target modules. Table 6.4 Area-time and compactness ratios of various target modules. Target Entry Design Area-time compactness module point path ratio (ATR) ratio (CR) x x.FA 1.00 0.755 TYPG‘A G G.FA 1.17 0.686 FA FA PA 1.36 0.644 x x.FA 1.00 0.751 Type-8‘ G G.FA 1.17 0.684 FA . FA n ; 1.31 0.671 x.MAC 1.00 0.69 x X.FA.MAC 1.11 0.62 G.MAC 1.14 0.65 MAC G G.FA.MAC 1.23 0.59 FA | mane | 1.40 | 0.57 X.MAC.VLSI 1.00 0.69 X X.FA.MAC.VLSI 1.11 0.62 G.MAC.VLSI 1.14 0.65 VLSI G ' G.FA.MAC.VLSI 1.23 0.59 FA | FA.MAC.VLSI | 1.40 | 0.57 .oflmu 03343.3 manned, 033 30502500 «.0 mean: 32.5 02%.... m2~h4 . 33 + no (nut: (n. o I 00.0 I? 3:33 #85; E0. 02.3. mmuzhnzmzoo 101 Significantly, the graph in Figure 6.2 clarifies that the CR's of the target modules using low-level design entry-point are greater than those of the modules using high-level design entry-point. However, the ATR of the target modules using higher design entry—points are larger, representing poorer area utilization and circuit performance. The figure shows that these tradeoffs are also true for the one-path and multi-path modules. For example, selecting transistor level as design entry-point, the CR of the ”X.MAC” is greater than that of the "x.FA.MAC” but the ATR of the ”xeMAC” is the smaller of the two. The CR and ATR of the target MAC are same as those of the final VLSI multiplier. This implies that the average spacing among the suhnodules inside the MAC's is same as that among the MAC modules. That is, using the notation as derived in Section 4.2.2, 5 may be equal to MAC SPA, sG.FA' sG, “x.FA' or sx depending on the target MAC used. 6.2 I/O Bottlenecking Index Using the top-down approach presented in Chapter 5, the tradeoff of various physical partitions to the array structure and pin limitations can be observed. The I/O bottlenecking index, BI, as defined in Equation 5-7 and the number of pinouts per chip for data I/O, ZMsn, are used to evaluate the effect of various decompositions to the array of MAC's and latches. The total chip area of a matrix-matrix multiplier is a function of R as shown in Equation 5-7. Near term IC lithographic and chip size limutations cap the size of R per chip. Therefore, a parameter of the 102 chip size limitation, in terms of the lithographic linewidth, A, can be incorporated into the graph used in Figure 5.5 in Section 5.2. This is done as follows. Let A=l um and n=32 bits, then Equations 6-1 to 6-3 imply that the total area and pipeline segment time of the five different matrix—matrix multipliers are expressed as A(X.MAC) = (3445R+33)(1470R+33) “m2, tPE(x.MAC) = 132.83 nsec; A(X.FA.MAC) a (3776R+33)(1600R+33) umz, tPE(x.FA.MAC) 8 135.14 nsec; ammo) = (3679R+33)(1545R+33) umz. tPB(G.MAC) - 151.83 nsec; A(G.FA.MAC) a (3910R+33)(1670R+33) umz. tPE(G.FA.MAC) s 155.02 nsec; A(FA.MAC) = (4211R+33)(l799R+33) umz, tPE(FA.MAC) = 169.80 nsec. (6-4) Note that the propagation time of the latch, 0.48 nsec, is included in' tPE' . The propagation time of the DMUX and MUX I/0 circuits must also be determined. Three I/0 circuit candidates specific for single chip VLSI systolic arrays have been presented and evaluated in [9]. A set of input and output circuits, known as a ”data controlled” input circuit and a ”shift-register control sequence” output circuit, are selected .from these candidates to apply to this work. Both of the selected I/O circuits are pipelined in nature and are thus well suited for the design 103 of systolic structures. Through simulation, the worst cast propagation time of these I/O circuits is found to equal tD(l:M)=3.6 nsec, for 1:1 an [5, 9]. The total area of the I/O circuits is significantly small (less than 0.1% of the overall chip area) and is neglected [5, 9]. Table 6.5 and Figure 6.3 present the comparison among the I/O bottlenecking index, pinout and chip size limitations. In Table 6.5, tPE as listed in Equation 6-4. The chip area and BI were found by using Equations 5-9 and 6-4. t was calculated by dividing tD(l:M)=3.6 nsec by the corresponding All values of BI in Table 6.5 are well under 1.0 indicating no I/0 bottleneck. However, some of target arrays have a chip size greater than the current standard chip size of 1 cm2 [55]. This means that the number of MAC's on a single chip is more seriously constrained by the chip size limitation than the I/O bottleneck. The index of tB can provide a quick insight into the potential for encountering an I/O bottleneck. A ratio of less than 0.05 can be reasonably assumed as an index threshold unlikely to have an 1/0 problem. The number of bits per word (n) also affects the chip size as well as the propagation time (See Equation 6-1). If n decreases, the overall chip size and tPE decrease implying that R may increase and t3 will increase as well. Although BI does not directly depend on n as shown in Equation 5-9, BI is directly proportional to R and tB' Therefore, there is a chained relationship among BI, R, t3 and n. It must b. emphasized that Figure 6.3 does not provide the direct relationship between BI and n. It indicates that BI is inversely 104 Table 6.5 Comparison of 1/0 bottlenecking index, pin limitation and chip area. Modularized Parameter* R array tB 3 4 5 Area(cm2) 0.461 0.817 1.274 Ns=l| 0.297 0.405 0.513 X.MAC.VLSI 0.027 ------ 81 ns=2| 0.149 0.203 0.257 Ns=3| 0.099 0.135 0.171 Area(cm2)l 0.549 0.974 1.519 xs=1| 0.297 0.405 0.513 x.FA.MAC.VLSI 0.027 ------ 81 Ns=2| 0.149 0.203 0.257 xs=3| 0.099 0.135 0.171 Area(cm2)l 0.517 0.916 1.430 Ns=l| 0.264 0.360 0.456 G.MAC.VLSI 0.024 ------ 81 Ns=2| 0.132 0.180 0.228 Ns=3| 0.088 0.120 0.152 Area(cm2)l 0.593 1.052 1.642 Ns=l| 0.253 0.345 0.437 G.FA.MAC.VLSI 0.023 ------ BI Ns=2l 0.127 0.173. 0.219 Ns=3| 0.084 0.115 0.146 Area Zn 411 Sn NUMBER OF PINS (ZNSnI Figure 6.3 BI versus ZMsn for 32-bit "x.MAC.VLSI". 106 related to the number of pairs of n pinouts for data 1/0 (NS). Note that the graph in the figure uses the data of ”x.MAC.VLSI" from Table 6.5 for illustration. The limitation of current standard chip size of l umz is projected on the graph by a dotted curve between the curves of R24 and 5. This shows, for instance, that the maximum number of 32-bit "11.141165" that can be implemented on a single 1 cm2 chip is about 20. To further show directly the relationship between BI and n consider Figure 6.4. All the parameters shown in Figure 6.4 are the same as those in Figure 6.3 except that n-l6 bits. In comparing both figures, BI is seen to be inversely proportional to n if all other parameters remain unchanged. Figure 6.4 also shows an example of a minor I/O bottleneck where BI=1.026 for R85 and Nssl. To resolve this problem, select R to be less than 5 or Us to be more than 1. Alternatively, designing faster I/O DMUX/MUX circuits can also alleviate such a problem. In summary, the tradeoffs among the I/O bottlenecking potential, numbers of pinouts, word size and chip size limitation have been illustrated in the graphs developed in the top-down approach. Specifically, the I/O bottlenecking index, BI, is inversely proportional to the number of pairs of n pinouts, NS, which is dictated by the I/O multiplexing scheme and current pin limitation. 0n the other hand, BI directly relates to the number of PE's on a single chip which is constrained by the word size and chip size limdtation in terms of A. 107 I/O BOTTLENECKING II} INDEX (80 AREA-0.35 an“ LC ..I... 3.5 BI V8 ZNsn roe may 0,22 “.2 IG-BIT X.MAC.VLSI 0.8.... Ree , 2 0.6-..ARE‘ DJ: on R93 0.4... 0.2-.. — + + 9 I l I Zn 471 60 NUMBER OF PINS (ZNsn) Figure 6.4 31 versus 2x5n tor lG-bit ”X.MAC.VLSI". 108 6.3 Contributions Tradeoffs involved in the bottom-up approach to the hierarchical design of regular VLSI-based structures are parameterized by the design complexity, chip area and circuit performance. A generalized model using infinite (unbounded) modular arrays was presented in Section 4.2.3. Based on this model, the general recursive expressions for the compactness ratio (a measure of complexity), chip area and propagation time were derived. These general expressions reveal that a tradeoff among the compactness ratio and area-time parameter exists within the multi-levels of the design hierarchy. The compactness ratio was found to vary among the one-path and multi-path. modules. The area-time parameter was “shown to be inversely proportional to the coipactness ratio. This is specifically shown in Equations 4-23, 4-24 and 4-28. The use of a systolic array structure as a testbed further verifies the inverse relationship between the compactness ratio and the area-time parameter as the design entry-point and pathways move upward along the design hierarchy. Testability of VLSI circuits using scan design has been of special interest. A dynamic latch with scanning ability was designed and specifically applied to systolic testbed structure. This latch required 39% more chip area compared to a standard latch without scan design. The propagation time of the latch with scan design had an increase of only 8.3% which had negligible effect on the overall performance of the 109 array structure. The top-down approach to the various decompositions of regular array structures were parameterized by an I/O bottlenecking index. The I/O bottlenecking of operands tends to increase as the number of sets of pinouts for data I/O and the word size in number of bits decrease. Results from the testbed systolic structure indicate that the current lumitation of the number of PE's on a single standard chip is constrained more by the chip size and lithographic linewidth than by the I/O bottleneck problem. As a whole, the bottom-up and top-down approaches to the hierarchical design of regular structures presented in this research are summarized in Figure 6.5. According to this figure, a designer can choose an appropriate design entry-point with respect to the desired tradeoff parameters of the final design. For example, the tradeoff parameters derived in this work are area-time versus circuit complexity and I/O bottlenecking index versus pin limitation. Once the design entry-point has been selected, the design path can proceed either back or forth along the design hierarchy, representing a topodown or bottom-up approach, as dictated by desired performance parameters. Finally, by modular tessellation, the IC layout of the regular structure is obtained. This layout can be implemented on either a single chip or a set of multiple chips depending on the size of the system as well as the I/O limitation. However, if the layout requires modification or redesign, for reasons such as being less optimal than expected or having an intolerable I/O bottleneck, a new top-down or bottom-up approach may be required. 110 DESIGN ENTRY l 1 4:45:22. l l I BOTTOM -UP TW-OOWII LOWEST MONLAR IGIEST LEVEL TESSELLATION LEVEL ' ONE-PATH MULTI-PATH MOWLE ARRAY MOWLE ARRAY $52428}— I I I Figure 6.5 A sumary of bottom-up and top-dwon approaches . 111 6.4 Trends in VLSI and Future Study As the density of digital IC circuits grows, the design-time, complexity, difficulties in simulation and testing, mask generation and fabrication, and percentage of yield loss tend to increase tremendously. Designs using silicon compilation, semi-custom approaches, symbolic layout techniques or hierarchical design currently provide a fast design turnaround for highly complex and dense circuits. The performance tradeoffs in the hierarchical design of regular structures have studied in this dissertation. However, tradeoffs in the design of random structures can also be examined based on the principles and techniques derived in this work. The average compactness ratio of a ‘ random circuit can be obtained by proportionally averaging the compactness ratios of all regular and non-regular sub-modules. If the compactness ratio is used to relate to the design time, a "regularity factor" must be found and incorporated into the compactness ratio of the regular sub-modules [44, 45]. This regularity factor can be determined either experimentally or statistically [45, 53]. Likewise, the area-time parameters of a random circuit can also be derived using heuristic or statistical models [53, 56]. The evaluation and derivation of these regularity factors and models for all VLSI structures are thus essential next steps. The recent development of design frames shows great promise in simplifying the design of VLSI-based systems [57, 58]. A design frame is 112 a hardware frame that contains all the details of I/O circuits, drivers, power supplies or other interface counterparts. A standard design frame can be used to interface with any new circuit design. Whenever a new circuit is designed, it can be directly inserted into a design frame for fabrication. As an approach to the design of regular structures as studied in this dissertation, a designer can select a design frame to match the needs of the array. For the bottom-up approach, the designer can concentrate more on the design of the array structure giving 1ess concern to the I/O or other interfacing problems. For the top-down approach, the designer can use the data of the design frame, such as the bandwidth size and propagation delay of the DMUX and aux circuits, to find a suitable way to decompose the array structure. However, the design of a standard design frame for systolic array structures is not a simple matter because the required I/O bandwidth is greatly dependent on the size of the array. The size of the drivers for I/O also varies with the number of PE's. What's more, the design frame must be compatible with standard IC package constraints [59, 60]. Another recent development in VLSI is wafer-scale integration (WSI) circuits. The idea behind a WSI approach is to assemble an entire system or circuit structure on a single wafer [61-63]. There is virtually no pin limitation inside the wafer. Therefore, WSI is ideally suited for the implementation of regular structures, especially systolic array structures. For example, all MAC's and latches can simply be laid out on a single wafer and joined directly by interconnecting lines. Assuming a silicon wafer with 4-inch diameter is used to implement the 113 matrix-matrix systolic multiplier designed in this research, as many as 1600 32-bit MAC's with latches can be put on this wafer. This implies that the multiplication of any matrix having a dimension up to 40 can possibly be done on one wafer. The current major problem with WSI is that some of the cells, such as MAC's, are likely to be defective. A high percentage of defective cells will render the wafer useless. It is very costly to discard the wafer if some of the cells are bad. To overcome this problem, current research is aimed at designing restructurable wafer-scale arrays or using redundancy [63]. Future work relating the WSI approach to the design of regular structure described in this research has two trends. The first one is to study the tradeoff between the circuit complexity and the yield of defective cells [64-66]. That is, designs] using different design entry-points and pathways will have different percentages of defective cells. Secondly, the multiple levels of regular structure can be further extended to the wafer-wise modularity. This extended level will become the wafer module level in between the FE and algorithmic levels. On the whole, current trends project that design methodologies using ~design frames and wafer-scale integration show a great promise in VLSI circuit design. And, these new frontiers will eventually allow any VLSI algorithmic structure to be directly implemented on a silicon or GaAs wafer without any concern of density or I/O constraints. BIBLIOGRAPHY BIBLIOGRAPHY Niessen, C., "Hierarchical Design Methodologies and Tools for VLSI Chips,” Proc. 1833, Vol. 71, No. 1 (January 1983). PP. 66-75. Wallich, P., ”The One-Month Chip: Design,” Spectrum, Vol. 21, No. 9 (September 1984). PP. 30-34. Kung, H. T., ”The Structure of Parallel Algorithms,” Advances in Computers, Vol. 19, Academic Press, New York (1980). PP. 65-112. Hwang, K. and Cheng, Y. H., VLSI Arithmetic Arrays and Modular Networks for 'solving Large-Scale Linear System of Equations, Tech. Report No. TR-EE 80-8, Purdue University, W. Lafayette, Indiana (March 1980). Leung, Y.-Y. J. and Shanblatt, M. A., A VLSI Systolic Array for Matrix Triangulation in Load Flow Analysis, Tech. Report No. MSU-ENGR-83-003, Michigan State Univ., East Lansing, MI (January 1983). .Mead, C. and Conway, L., Introduction to VLSI Systems, Addison-Wesley Pub. Co., Reading, Massachusetts (1980). Zhang, C. N. and Yun, D. Y. 2., ”Multi-Dimensional Systolic Networks for Discrete Fourier Transform,” Proc. 1888 11th Annual Int'l Sym. on Computer Architecture (June 1984), pp. 215-222. Fortes, J. A. 8., Pu, K. S. and wan, B.W., ”Systematic Approaches to the design of Algorithmically Specified Systolic Arrays,” Proc. 1985 Int'l Conf. on Acoustics, Speech, and Signal Processing (March 1985). pp. 300-303. Hsu, w. C. and Shanblatt, M. A., Evaluation of a Single VLSI Chip Algorithm for Triangulating Large Band Form Matrices, Tech. Report No. MSU-EMGR 82-015, Michigan State University, East Lansing, Michigan.(August 1982). 114 10. ll. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 115 Liu, P. S. and Young, T.Y., ”VLSI Array Design Under Constraint of Limited 1/0 Bandwidth,” IEEE Trans. on Computers, Vol. C-32, No. 2 (December 1983). PP. 1160-1170. Rose, 8., ”Application of a Hierarchical Layout System to VLSI,” Tech. Digest IEEE 1984 Int'l Conf. on CAD (November 1984), pp. 125-127. Reinhard, D. K., Integrated Circuits Engineering, Course Notes, Michigan State Univ., E. Lansing, MI (Winter 1985). Eichelberger, E. B. and Williams, T. W., ”A Logic Design Structure for L51 Testing,” Proc. 14th Design Automation Conf. (June 1977), pp. 462-468. Masui, T., Niimi, P. and Iwase, M., "A New Approach to Design for Testability in an LSI Logic Synthesis System,” Tech. Digest IEEE Int'l Conf. on CAD (November 1985), pp.105-107. Moore, G. 8., ”Progress in Digital Integrated Electronics,” Tech. Digest IEEE 1975 Int'l Electron DeVices Meeting (December 1975), pp. 11-13. Wallich, P. ”A Review of Engineering Workstations,” IEEE Spectrum, Vol. 21, No. 10 (Gotober 1984). PP. 48-53. Avenier, J. P., ”Digitizing, Layout, Rule Checking - The Everyday Tasks of Chip Designers," Proc. IEEE, Vol. 71, No. 1 (January 1983). PP. 49-56. Chen, C. P. et al, ”The Second Generation MOTIS Mixed-Mode Simulator,” Proc. let Design Automation Conf. (June 1984). PP. 10-17. Butt, H. H. and 81-219, Y. M., ”Impact of Mixed-Mode Self-Test on Life Cycle Cost of VLSI Based Designs,” Proc. IEEE 1984 Int'l Test Conf. (October 1984), pp. 338-347. De Man, 8. et al, ”A Debugging and Guided Simulation System for MDSVLSI Design,” Tech. Digest IEEE 1983 Int'l Conf. on CAD (September 1983). PP. 137-138. Wong, Y., ”Hierarchical Circuit Verification," Proc. 22nd Design Automation Conf. (June 1985). pp. 695-701.‘ Daniel, M. E. and .Gwyn, C. W., "CAD Systems for IC Design,” IEEE Trans. on CAD of IC and Sys., Vol. CAD-1, No. 1 (January 1982) ' PP. 2-12. Hughes, 5. C., Lewis, D. B. and Rimkus, C. J., ”A Technique for Distributed Execution of Design Automation Tools," Proc. 22nd 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 116 Design Automation Conf. (June 1985). pp. 23-30. Collett, R., ”Designer's Guide to Semi-Custom IC's," Digital Design (June 1984). PP. 64-85. Kessler, A. J. and Ganesan A., ”Standard Cell VLSI Design: A Tutorial," IEEE Circ. and Devices Magazine, Vol. 1, No. 1 (January 1985). PP. 17-34. Barna, A., VHSIC Technologies and Tradeoffs, Wiley- Interscience, New York (1981). PP. 19-106. Mavor, J., Jack, M. A. and Denyer, P. 8., Introduction to MOS LSI Design, Addison-Wesley Pub. Co., Reading, Massachusetts (1983). PP. 18-162. Werner, J., ”Progress Toward the 'Ideal' Silicon Compiler,” VLSI Design, Vol. 4, No. 6 (September 1983). PP. 38-41. Southard, J. R., "MacPitts: An Approach to Silicon compilation,” Computer, Vol. 16, No. 12 (December 1983). PP. 74-82. Crawford, J. D., ”EDIF: A Mechanism for the Exchange of Design Information,” Proc. IEEE 1984 Custom IC Conf. (May 1984), pp. 446-449. Duyn, B. and Trimberger, S., ”Structured-Design System Takes Over the Complexities in VLSI Circuits,” Electronics, Vol. 56 (August 25, 1983), pp. 127-130. vanCleemput, W. M., ”An Hierarchical Language for the Structural Description of Digital Systems,” Proc. 14th Design Automation Conf. (June 1977), pp. 377-385. Curtis, W., ”Re-Implementing the MicroVAx I on the GENESIL,” Digest IEEE 1985 Compcon Spring (February 1985), pp. 132-136. Salsbury, P. J., "Nonvolatile Memories,” Tech. Digest IEEE 1984 Int'l Solid-State Circ. Conf. (February 1984). P. 135. Fiebiger, J., ”C-MOS: A Designer's Dream with the Best Yet to Come,” Electronics, Vol. 57 (April 5, 1984), pp. 113-115. Sangiorgi, E. C. et a1, ”Two-Dimensional Numerical Analysis of Latchup in a VLSI COMS Technology,” IEEE Trans. on CAD of IC and Sys., Vol. can-4, No. 4 (October 1985), pp. 561-574. Newkirk, J. A. and Mathews, R., The VLSI DeSigner's Library, Addison-Wesley Pub. Co., Reading, Massachusetts (1983). Whitaker, S., "Pass-Transistor Networks Optimize n-MOS logic," 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 117 Electronics, Vol. 56 (September 22, 1983), pp. 144-148. Williams, '1'. w. and Parker, R. in, "Design for Testability - A Survey,” Proc. IEEE, Vol. 71, No. 1 (January 1983), pp. 98-112. McCluskey, E. J., "VLSI Design for Testability," Tech. Digest 1984 Sym. on VLSI Technology (September 1984). PP. 2-5. Gajski, D. D. and Kuhn, R. 3., ”Guest Editor's Introduction: New VLSI Tools," Computer, Vol. 16, No. 12 (December 1983), pp. 11-14. Baugh, C. R. and Wooley, B. A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. on Computers, Vol. C-22, No. 12 (December 1973). PP. 1045-1047. Hwang, R., Computer Arithmetic, John Wiley and Sons, New York (1979), pp. 85-254. Fey, C. F., ”Custom LSI/VLSI Design Manpower Model,” Proc. IEEE 1984 Custom IC Conf. (May 1984), pp. 246-250. Fey, C. F., "Custom LSI/VLSI Chip Design Productivity,” IEEE Journal of Solid-State Circ., Vol. SC-20, No. 2 (April 1985), pp. 555-561. Broers, A. N., ”Practical Methods for Sub-Micron Lithography," Microcircuit Engineering 84, edited by Heuberger, A. and Beneking, H., Academic Press, New York (1985), pp. 3-20. Hamilton, D. J. and Howard, W. G., Basic Integrated Circuits Engineering, McGraw-Hill, New York (1975), pp. 515-541. Wagner, K. D. and McCluskey, E. J., ”Effect of Supply Voltage on Circuit Propagation Delay and Test Applications,” Tech. Digest IEEE 1985 Int'l Conf. on CAD (November 1985), pp. 42-44. Jouppi, N. P., ”Timing Analysis for nMOS VLSI," Proc. 20th Design Automation Conf. (June 1983). PP. 411-418. Sequin, C. 8., ”Managing VLSI Complexity: An Outlook," Proc. IEEE, Vol. 71, No. 1 (January 1983), pp. 149-166. Blanks, J. P., "Near-Optimal Quadratic-Based Placement for a Class of IC Layout Problems,” IEEE Circ. and DeVice Magazine, Vol. 1, No. 5 (September 1985), pp. 31-37. El Gamal, A. and Syed, 2., ”A New Statistical Model for Gate Array Routing,” Proc. IEEE 20th Design Automation Conf. (June 1983). pp. 671-674. Shragowtiz, E. and Desmonds, D., "Statistical Methods in 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 118 Hierarchical Design of VLSI,” Proc. IEEE Int'l Conf. on Computer Design: VLSI in Computers (October 1984). PP. 468-473. Okabayashi, H., ”Multilevel Metallization Technology for VLSIs," Tech. Digest IEEE 1984 Sym. on VLSI Technology (September 1984). PP. 20-23. Reyes, R. W., ”The Evolution of Digital Electronics Towards VLSI,” IEEE Journal of Solid-State Circ., Vol. SC-14, No. 2 Feuer, M., ”Connectivity of Random Logic,” IEEE Trans. on Computers, Vol. C-31, No. 1 (January 1982). PP. 29-33. Katz, R. H., Bell, A. G. and Conway, L., ”VLSI System Design by the Numbers," IEEE Spectrum, Vol. 22, NO. 2 (February 1985). pp. 44-50. Bell, A. G., ”Using System Kits and Design Frames to Enhance VLSI Prototyping,” Digest IEEE 1985 Compcon Spring (February 1985). PP. 182-183. Balde, J. W., ”New Packaging Strategy to Reduce System Costs," Proc. IEEE 1984 Int'l Conf. on Computer Design: VLSI in Computers (October 1984). PP. 579-584. Mahalingam, M., ”Thermal Management in Semiconductor Device Packaging,” Proc. IEEE, Vol. 73, No. 9 (September 1985). PP. 1396-1404. McDonald, J. F. et al, ”The Trials of Wafer-Scale Integration," IEEE Spectrum, Vol. 21, No. 10 (October 1984). PP. 32-39.‘ Johnson, R. R., ”The Significance of Wafer Scale Integration in Computer Design,” Proc.: IEEE 1984 Int'l Conf. on Computer Design: VLSI in Computers (October 1984), pp. 101-105. Raffel, J. I. et al, ”A Wafer-Scale Digital Integrator Using Restructurable VLSI,” IEEE Journal of Solid-State Circ., Vol. sc-zo, No. 1 (February 1985), pp. 399-406. Ferris-Prabhu, A. v., ”Modeling the Critical Area in Yield Forecasts," IEEE Journal of Solid-State Circ., Vol. SC-20, No.4 (August 1985). PP. 874-878. Perry, 5., Mitchell, M. and Pilling D., ”Yield Analysis Modeling," Proc. 22nd Design Automation Conf. (June 1985). PP. 425-428. Pichler, P. et a1, ”Simulation of Critical IC-Fabrication Steps," IEEE Trans. on CAD of IC and Sys., CAD-4, No. 4 (October 1985), pp. 384-397. APPENDIX 119 {SS DIFFUSION - CONTACT CUT ION IMPLANTAT [ ON METAL g FOLYS 1 L I CON .mm TYPE-A FA U1 E ’4 I llllllllllllllll’.lll : '4 4i I h.) U1 Illlillllllllllll Figure A.1 Physical layouts of "x.FA". 120 -16% TYPE-A FA . TYPE-B FA “-4 U1 Li“; 5 n ='."- 33 -_ F‘ 3%?E ==§€ .sa =3- algal-s; 1 .c __' — _- -- ..— — -n I .25 unuuuuu. nuvnu' Ziillllllll - IS U1 U1 :9 ‘44 U1 Figure A.2 Physical layouts of "G.FA'. 121 TYPE-A FA TYPE-B FA Ill , 'nunnga. ' ii 1 .- -: ll,"3 III . 3: ' .- .. ._ .- . a _ _ _ . — h _ _ _ _ - _ _ .— _ _. _ — — _ _ _ _ i: I I lllllllllllllilllllllll I. '.::: Illllllllll!, .I III III III ' 4 Hi!“ llllllm m '.'] l'l III 8' Illll‘ .a'ufln Figure A.3 Physical layouts of "FA". -588 -488 -188 122 4-BIT X.MAC i . ............. "—111.11 .13.! '.1 .371 _quMFLmflNlflMWluMFLwW1 all! _ .-!!!11 I: I": ! ..__ _ 1"- 1111 |I|1-m!1||-.1111L.f g! 1{ 1“ #1117173 LL11!!! !,, 1.: 11 35111 1!!!! IL' .nIIIIL' 31!!!!L 111I-11111 1“ ~11" Jar-111: ' 11?; 1!!I111 z'l'1::: !L1 '— 511113.111" “fa "111-1111! 8 II 218 388 Figure A.4 Physical layout of a 4-bit "x.MAC". 123 4-BIT X.FA.MAC -588 -488 .388 -188 Figure A.5 Physical layout of a 4-bit "x.FA.M.AC”. .488 .388 -288 -188 "11111 .1111 11711111111 124 4-BIT G.MAC 1.118.r ”g- 1_ru"=.m11.p1 gm” 111,1 1111111 711' 111111111111111181111111“ 111111 h— if???"51‘171‘11’1'1111 1ii1r.“‘11'_' f I "I l!:.i—J —-'=L| 34-81 1:11 21:11! l-IJ-__-_ 1J1" 1Jlll_1-_111:.Jr'11|1|| -:—_- -— :n "'1 1'1 ""-1"11'1“.111111 111E- 111111.. 1111117.- 111II11111 .11,11'131L1II1111! 1“ 1111'! .11" 1I11!!! 1'1.r 111.F 111 111lI||111 —‘1 388 Figure A.6 Physical layout of a 4-bit "G.MAC”. .688 -588 -488 .na .188 1.. "188 1111--111 n1 Jail—III L' qIIiI .517 ‘IIIiIrr —' IIiI “a; -II III Iii-411155;]: ”II JIIfiIlI :JrI1 1II!I'._' "-IrI iI II ILIMLIHIIIM. WHIWM llrr 1111 11' I1 .11 1111 11111 I'MI‘II”E$,IWIfl IH1fi IWI 9 J—1 [3 2—j1 - I y FHMI I!1F1IIIII .IIIIh IIIP'HIIIIIII ‘11 ""‘1-11'2111‘11 1" “.1111 11.111111: 11111.- 1 1:11 1 13 1'11 Ifi'f. II ail-1::- IIIl II!I”‘ . .;-* I V III Ilgl. ma 21 Ill .1] "||11_ 112-11111151:IIIIIIHII f.71j-1. 1.1 I-;.1'-'—"" .—j1_ I I ffi '34 3I8 L Figure A.7 Physical layout of a 4-bit "G.PA.HAC". .633 41 a ‘II 9;.mm%m ' “I . I .. a I . . . Il| u :3 ~ - 'thw.‘ _usc‘fliléi II II ,. _ . . L— II I I|l| MKHUHE? - EEE.-I-P.. ,EEE. fiat—59'1- 131’ —IL-I ' r -Sflfl ‘.J ‘ 99 _9 _- I 9:.-:2-2‘ :' 2 E i .. ‘L. Lflfifll Ill—l '— L. 11 9 a-iIi—n 'fi'r'II 'fiw 9 ‘1' I‘lh' ' ' III‘I I I 1'1 9 "=— ml"- l l- H -1 '.—-"“u 1’ ‘l ‘- II Hm:=" nflfi II IM% : -- : u".,_uu . I .59 ‘ ‘1'“ Br.- ' ""3 . n i 3;: i —-1 : - - :. ”9.. ‘ . 9. _ I III . ‘ I ' . ' I" . I - -; I J- l I III ..L. ...- 9435 _—_.'—-=;§ iii" - LI _ 429.323", . . 1- ll ; ..; . ;.; . ”w .1 ;A ‘ ~‘ - " u '- 9 ‘ ‘ cll‘ -. . ' "_ 9.. ‘9. J'.‘ '. 5 ' llll .__. l r:: III" 9 -.—_. »—--— |-,:.ll r: | .9 =ll .:;=_l:';: 5i! L. Ii!" III. I i? :4- 23.22.91!- I .fi‘ ‘VLWH f '31 ’. _ II: | £ .9-’_n9 "2 Aa‘lllln f ----- III‘ *99i9 . _' "'1' " L‘ 5’ ,. 3939 |9 9'9;- ...]9 .3flfl =- -_' H 9“" iIl L .- 9 lug- 91 99 I "III 1‘. l gifll I ll ,- 99 _._ ha _1 _ 2g. 2.133 R i ... l —I I- I l ."fll ,: 9.3199"- _. 'II. _. I I —‘ . ‘ z: . ,— L 1.1" .lflfl ..J _. 9 _Z: 9 . -. "III .5 i '" “ L———Vfi Figure A.8 Physical layout of a l-bit 'PA.MAC'. 127 IE'BIT G.FA.MAC I .2333 .1F53 .1533 l .... m (I! a .1333 .?53 -533 P258 .3 -2§3 3 233 533 753 1333 Figure A.9 Physical layout of a 16-bit "G.FA.MAC" without "G.PA's". 128 ESE D 1 FFUS 1 ON - CONTACT CUT ION IMPLANTAT I OH METAL E POLYS 19. 9 CON -4l .33 -2I -13 -3 9 9 _ 29 4 Figure 19.10 Physical layout of a latch without scanning ability. 129 .53 3 1323 3| 4| I I I L I Figure A.11 Physical layout or a latch with scanning ability. '1 II‘ '~ .;-.II I. . 9|] I .I- - . . '991" ..99. Figure A.lz —. III —p. - _L‘Ii”!_ u ..9!|”! ‘III 'I' = _. '|'-~ , 'a -II' L.I||- 9!] ! I: l!lI -' _. Physical layout of a ‘34 matrix-matrix multiplier. 131 9;; all] _,-‘9PP,II U9 -—'P--’:':": P ”P; MlllIMf ‘.‘-3P P I—I; :5: 9‘: $39‘-1?=99r:99_ '9 29 ,7 9 “PI" .4; _mg ~99 -' ‘9 . ; !JP[9P.21999.-_.9m:9~. _ 3‘ ~ ' .. - sfitgi -— “:5. 1'. P'.‘ 7;?! .719‘ . . ‘— -— 1.79 r—Erg—A“ " 9 -|:l L; . L ' I. _ J- _ ___ _.__ —;F-‘l' .r——;._1_-g-. - I ... ' 9 P. ....?l . or; . .721 731.3.JI: l‘vjmgglilft 9 :’jllg;..".;' . [9 9.Ir1.nl9r|.l "94 iii. P ii 1199.:- UPII lL!49 iii-(I‘lllrlsmll _ Pl_r Pi_ ‘ III um}: __]99| UPI EPP' ". :1in E- II F:- PIIIIII III'... ‘ 2! PJ '.P'tv P‘: =- 2' 9 f: ‘ PPPL PPP "III '.'”:4-1' ”"19 _i. .IIIP'P9 ‘ :_r- ' P- 17:199; P919 ' , ‘.P ‘PilP'PPQ 9999:2999. - "‘4" i ’.‘-[P'— .._:-r .I“!!__ 'PIcf- II! g1- lJmt tLJIII ‘J 999. 9. ~ ':.II -Ir:'JII"i:9.'.;PPP 9 ir _I Y ._; l '_-—: _ . i-I_lllfl99n " .- l ill jllmln . ...I . :1 .. '- v'i‘ —.-—£-. . 'ziLMII '.’-Ii _L IPP| 1 _ ' I I- [I I“, L . 44 ——TT_.T: .9 .1 _ _. =-.—~___.::_‘~;. ; I! ; .t'_ ,1 .9991II I 999I9 T—w—:_ A: '. -?' ‘.’- " .___ ' ... .. PPIPII =" I I '— 9% '.P' 5:: '.‘.i" .- P? .'P (.9? "III " 2|.qu '9"; ‘.‘ '99- . . I '.'-’.'. I I u I u , A 4 . Figure A.l3 A zoom view 0: Figure A.12.