0 → 1 or 1 → 0 transition was being propagated. In order to include this block in the probabilistic propagation of the transition density, a stochastic model for the filtering block itself was developed. To develop this model it was assumed that the cumulative distribution functions F_0(t) and F_1(t) of the low and high pulses transmitted by the block were exponential. After analyzing the filter behavior and introducing some simplifying assumptions about the pulses at the filter input, the expressions obtained for propagating the equilibrium probabilities and transition densities through the filter block were

P(y) = P_F × P(x) / [ ((μ_1 - τ_1) + τ_1 F_1(τ_1)/2) / (1 - F_1(τ_1)) + ((μ_0 - τ_0) + τ_0 F_0(τ_0)/2) / (1 - F_0(τ_0)) ]    (2.9)

and

D(y) = P_F × D(x),    (2.10)

where x and y were assumed to be the filter input and output respectively, μ_0 and μ_1 the average lengths of the low and high pulses at the filter inputs, and P_F a relation between the distributions of the high and low pulses given by

P_F = [1 - F_0(τ_0)] [1 - F_1(τ_1)] / [1 - F_0(τ_0) F_1(τ_1)].    (2.11)

The results obtained indicate that the inclusion of the low-pass filter in the module outputs attained the goal of reducing the overestimation error of the transition density at high frequencies at a relatively small computational cost. On the other hand, the overall approach still had the accuracy problems identified in the original algorithm due to the node correlations created by reconvergent fanout, and its applicability remained limited to combinational circuits.

Ghosh, et al. proposed methods for estimating the average switching activity in combinational and sequential logic circuits [5].
Their methodology was capable of partially accounting for temporal correlations in internal and output nodes, spatial correlations created by reconvergent fanout, and the glitch activity created by spurious transitions. The primary inputs (PIs) to a circuit were assumed to be independent, and their behavior had to be specified in terms of their transition probabilities. These probabilities for a given signal x were described in the form p_x^00, p_x^01, p_x^10, and p_x^11, where the superscript ij denoted the probability of x experiencing a transition from state i to j. At the PIs, these probabilities could be obtained from the long-term behavior of the input signals. For example, the probability of signal x changing from 1 to 0 could be obtained as p_x^10 = lim_{N→∞} (1/N) × (number of 1 → 0 transitions of x in N clock cycles). The signal probability of a node x was obtained from its transition probabilities as P_s(x) = p_x^1 = p_x^10 + p_x^11. Consequently, p_x^0 = 1 - p_x^1. The signal probabilities were propagated from the inputs to internal and output nodes using a disjoint cover of the Boolean function of the node in terms of the PIs. Thus, if the disjoint cover of the Boolean function of node y were f_y(X) = Σ_k C_k, where X = x_1, ..., x_n represents the vector of the PIs and C_k is a cube of f_y(X), then P_s(f_y) = Σ_k P_s(C_k). Since C_k is the product of independent signals, P_s(C_k) is the product of the probabilities of the variables in it. Transition probabilities were propagated from the inputs to internal and output nodes by finding the signal probability of the XOR of the Boolean function of each node in two consecutive time frames. Thus for node y, P_t(y) = P_s(y(t) ⊕ y(t+1)). The resulting expression for P_t(y) can be expressed in terms of the signal and transition probabilities of the PIs. As an example consider a two-input AND gate, y = x_1 · x_2:

P_t(y) = P_s[x_1(t) · x_2(t) ⊕ x_1(t+1) · x_2(t+1)] = P_s[g(y)].
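These long-term transition probabilities are easy to estimate from a sampled waveform. The sketch below is illustrative (the bit stream and function name are mine, not from the original work); it counts consecutive-cycle state pairs and checks the identities quoted above.

```python
def transition_probs(bits):
    """Estimate p^ij (i, j in {0, 1}) of a signal from a sampled 0/1 stream.

    Follows the long-term-behavior definition: p^ij is the fraction of
    consecutive clock-cycle pairs in which the signal moves from i to j."""
    n = len(bits) - 1  # number of consecutive-cycle pairs
    p = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, b in zip(bits, bits[1:]):
        p[(a, b)] += 1
    return {k: v / n for k, v in p.items()}

bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
p = transition_probs(bits)
# Signal probability of being high: the pairs that start at 1
p1 = p[(1, 0)] + p[(1, 1)]
p0 = p[(0, 0)] + p[(0, 1)]
assert abs(p0 + p1 - 1.0) < 1e-12  # p^0 = 1 - p^1
```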
(2.12)

A disjoint cover of g(y) defined as above is

g(y) = x_1(t) · x̄_1(t+1) · x_2(t) + x_1(t) · x_1(t+1) · x_2(t) · x̄_2(t+1)
     + x̄_1(t) · x_1(t+1) · x_2(t+1) + x_1(t) · x_1(t+1) · x̄_2(t) · x_2(t+1).    (2.13)

Thus, the transition probability of y can be written as

P_s[g(y)] = p_{x1}^10 (p_{x2}^11 + p_{x2}^10) + p_{x1}^01 (p_{x2}^11 + p_{x2}^01) + p_{x1}^11 (p_{x2}^10 + p_{x2}^01).    (2.14)

The implementation of the propagation algorithms for signal and transition probabilities was based on ordered BDDs (OBDDs) [30]. To account for gate delay effects in the computation of transition probabilities at a given node y, l Boolean functions of y were constructed, where l was the number of levels found when advancing in the circuit from the PIs to node y. Each of the l functions described the logic expression of y at the time points where it can experience a change. The transition probability of y was obtained as the signal probability of the XOR of each two successive functions of y. As an example, consider the circuit of Figure 2.3. Assuming the AND gate and the inverter have the same delay, node z could experience two transitions as a consequence of a single change in the input vector. In this case y(1) = x̄_1(1), y(2) = x̄_1(2), z(1) = x̄_1(1) · x_2(1), z(2) = x̄_1(1) · x_2(2), and z(3) = x̄_1(2) · x_2(2). Thus P_t(z) = P_s[z(1) ⊕ z(2)] + P_s[z(2) ⊕ z(3)].

Figure 2.3. Glitch activity caused by gate delays.

For extending the above technique to sequential circuits, Ghosh et al. proposed a decomposition of the structure of the FSM such that the temporal correlation between two consecutive vectors in the present state (PS) lines could be captured. The transformation started from the common model of a sequential circuit, illustrated in Figure 2.4. By replicating the symbolic simulation equations corresponding to the internal and output nodes in
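The closed form in (2.14) can be checked by brute force: with independent inputs, the transition probability of the AND output is the sum of p_{x1}^{aa'} · p_{x2}^{bb'} over all joint input transitions that change the output. A small sketch (the numeric probabilities are illustrative, not from the text):

```python
def and_transition_prob(p1, p2):
    """P_t(y) for y = x1 AND x2 with independent inputs, by enumerating
    all joint transitions (a -> a') of x1 and (b -> b') of x2.
    p1, p2: dicts mapping (i, j) to the transition probability p^ij."""
    total = 0.0
    for (a, ap), q1 in p1.items():
        for (b, bp), q2 in p2.items():
            if (a & b) != (ap & bp):  # output changes state
                total += q1 * q2
    return total

def and_transition_closed(p1, p2):
    """Closed form obtained from the disjoint cover of g(y) in (2.13)."""
    return (p1[(1, 0)] * (p2[(1, 1)] + p2[(1, 0)])
            + p1[(0, 1)] * (p2[(1, 1)] + p2[(0, 1)])
            + p1[(1, 1)] * (p2[(1, 0)] + p2[(0, 1)]))

p1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p2 = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.2}
assert abs(and_transition_prob(p1, p2) - and_transition_closed(p1, p2)) < 1e-12
```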
the next state logic, using two time instances of the primary inputs, PI(t) and PI(t+1), and using the present and next state lines, two time-spaced expressions of the logic functions at the combinational logic block, f(t) and f(t+1), were obtained. Based on this transformation, illustrated in Figure 2.5, the transition probabilities could be obtained as the signal probability of the expression resulting from the XOR of f(t) and f(t+1). The application of this method assumed that the present state lines were spatially uncorrelated and that after power-up, the probability of the FSM being in any of its 2^N states, where N is the number of state variables, is uniformly distributed. Although the proposed decomposition was very clever, the last two assumptions do not necessarily hold for a large range of sequential circuits, resulting in many cases in considerable estimation errors. The results presented from the application of this approach did not provide a reference measurement which could allow for comparison of the estimation errors.

The technique proposed in [5] to estimate the switching activity in sequential circuits was revisited by Monteiro, et al. and Tsui, et al. in two simultaneously published studies proposing exact and approximated estimation methods [31, 3]. The exact method proposed by Monteiro, et al. used the same decomposition of the FSM structure suggested by Ghosh, et al. to capture the temporal correlation in the present state lines. However, instead of assuming uniform state probabilities, the actual steady-state probabilities of the FSM were found when the number of transitions approached infinity (t → ∞). This computation was performed by regarding the state transition graph (STG) of a FSM as a Markov chain and using the Chapman-Kolmogorov equations to compute its limiting probabilities [21].
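For a small STG, computing the limiting probabilities amounts to solving the linear balance equations π = πT together with Σ π_i = 1, a system whose size equals the number of states (hence 2^N for an FSM with N flip-flops, which is what makes the exact method intractable). A minimal sketch with an illustrative 2-state chain:

```python
import numpy as np

def limiting_probabilities(T):
    """Stationary distribution pi of an ergodic Markov chain with
    row-stochastic transition matrix T, from the balance equations
    pi = pi T together with sum(pi) = 1."""
    n = T.shape[0]
    # (T^T - I) pi = 0; replace the last equation with the normalization
    A = T.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Illustrative chain: stays in state 0 w.p. 0.9, in state 1 w.p. 0.6
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = limiting_probabilities(T)
assert np.allclose(pi @ T, pi)       # pi is a fixed point
assert abs(pi.sum() - 1.0) < 1e-12
```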
The probabilities computed in this way would include the spatial correlations of the present state lines, and for this reason it was considered an exact method. However, this approach required an explicit enumeration of the state space of the FSM, implying that the linear system to be solved had an exponential complexity in N, the number of flip-flops of the FSM. To make the problem tractable for large FSMs, an approximate method was proposed which required the solution of a system of nonlinear equations of size N, formulated as follows. Let ns_1, ..., ns_N be the next state lines in a FSM. Then ns_i = f_i(X, PS), where X = x_1, ..., x_M is the set of primary inputs and PS = ps_1, ..., ps_N is the set of present state lines.

Figure 2.4. Conventional FSM.

Figure 2.5. FSM transformation to capture temporal correlations.

When the FSM is in equilibrium, P_s(ps_i) = P_s(ns_i) = p_i, and therefore p_i = P_s[f_i(X, PS)]. Given that the probabilities of the PIs are known, the state line probabilities can be written as p_i = g_i(p_1, ..., p_N), where g_i(·) is the function resulting after replacing the values of P_s(X) in P_s[f_i(·)]. This yields a set of nonlinear equations in terms of the state line probabilities of the form

y_1 = p_1 - g_1(p_1, ..., p_N) = 0
y_2 = p_2 - g_2(p_1, ..., p_N) = 0
...
y_N = p_N - g_N(p_1, ..., p_N) = 0,

that is, Y(P) = 0.    (2.15)

The set of equations Y(P) = 0 was iteratively solved using the Newton-Raphson method: P^{k+1} = P^k - J(P^k)^{-1} Y(P^k), k = 0, 1, ..., with P^0 = [0.5, ..., 0.5]. The convergence of the method was proved under the assumption that ||Y(P^0)|| is small.
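The Newton-Raphson iteration just described can be sketched generically. Here g(·) is an illustrative stand-in for the OBDD-based probability evaluation of the original method, and the Jacobian is approximated by finite differences rather than by the analytical expression the authors derive:

```python
import numpy as np

def newton_raphson(g, p0, tol=1e-10, max_iter=50):
    """Solve Y(P) = P - g(P) = 0 by Newton-Raphson with a
    finite-difference Jacobian."""
    p = np.asarray(p0, dtype=float)
    n = len(p)
    for _ in range(max_iter):
        y = p - g(p)
        if np.linalg.norm(y) < tol:
            break
        J = np.empty((n, n))
        h = 1e-7
        for j in range(n):  # column j of the Jacobian of Y
            dp = p.copy()
            dp[j] += h
            J[:, j] = ((dp - g(dp)) - y) / h
        p = p - np.linalg.solve(J, y)
    return p

# Illustrative contractive g (not derived from a real FSM)
g = lambda p: np.array([0.25 + 0.2 * p[1], 0.5 - 0.3 * p[0]])
p = newton_raphson(g, [0.5, 0.5])
assert np.allclose(p, g(p), atol=1e-8)  # fixed point reached
```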
In each iteration, to evaluate the Jacobian J(P^k), each entry ∂y_i/∂p_j of the matrix would be needed. It was noted that for i ≠ j, ∂y_i/∂p_j = -∂g_i/∂p_j and ∂y_i/∂p_i = 1 - ∂g_i/∂p_i, making it possible to evaluate any entry in J(·) from ∂g_i/∂p_j. The expression for ∂g_i/∂p_j could be obtained from f_i(·) through the application of Shannon's expansion theorem as follows: f_i(X, PS) = ps_j · f_i(X, PS)|_{ps_j=1} + p̄s_j · f_i(X, PS)|_{ps_j=0}. Taking probabilities and considering that P_s(ps_j) → p_j as t → ∞ and that the primary inputs were independent, it resulted in

P_s[f_i(X, PS)] = g_i = p_j · P_s[f_i(·)|_{ps_j=1}] + (1 - p_j) · P_s[f_i(·)|_{ps_j=0}].    (2.16)

Then, partial derivatives were obtained as

∂g_i/∂p_j = P_s[f_i(·)|_{ps_j=1}] - P_s[f_i(·)|_{ps_j=0}],    (2.17)

where the probabilities in the expression were evaluated using OBDDs. Although this method allowed an accurate computation of the line probabilities, the state probabilities would not be accurately computed due to the spatial correlations existing among the state lines. For approximating these correlations Monteiro, et al. proposed obtaining m-wise correlated sets of state lines, with 1 ≤ m ≤ N a user-specified value, and using them to construct the set of nonlinear equations. The accuracy of this approach as well as the number of equations in the system to be solved increased with m, degenerating into the exact method (Chapman-Kolmogorov) when m = N. The case m = 1 corresponded to assuming spatially uncorrelated state lines. The results of applying this method, when compared to the exact solution obtained through the Chapman-Kolmogorov method, indicated that setting m = 1 produced errors of less than 10% on the selected test cases. Using m = 2, the maximum recorded error was less than 5%, and m = 3 produced errors equal to or less than 3.6%.

The method proposed by Tsui, et al. made use of Ghosh's decomposition of FSMs to capture the temporal correlations in the present state lines [3].
Their exact method also used the Chapman-Kolmogorov equations to compute the limiting probabilities of the Markov chain obtained from the STG of the FSM. Here again, the exponential dependence of the number of equations to be solved on the number of state variables severely limited the size of the largest sequential circuit which could be analyzed with this method. An approximate method was proposed which unrolled and cascaded the next state logic k times to form a k-unrolled network. The logic equation of the ith bit of the next state outputs at the kth stage was written

ns_i^k = f_i(X^k, PS^k) = f_i(X^k, NS^{k-1})
       = f_i[X^k, f_1(X^{k-1}, PS^{k-1}), ..., f_N(X^{k-1}, PS^{k-1})]
       = f_i[X^k, f_1(X^{k-1}, f_1(X^{k-2}, ..., f_1(X^0, PS^0) ...)), ..., f_N(·)].    (2.18)

Taking probabilities, replacing the values of P_s(X^k), and considering that in equilibrium P_s(ns_i) = P_s(ps_i) = p_i, a set of nonlinear equations was obtained in the form p_i = g_i(p_1, ..., p_N), which could be written as Y(P) = 0 with y_i = p_i - g_i(p_1, ..., p_N) = 0. A solution of this set of nonlinear equations was obtained by using the Picard-Peano iteration method. For a nonlinear function f(x), this method starts with an initial guess x_0 and calculates x_{i+1} = f(x_i) until convergence is reached. A proof of convergence of this method was included under the assumption that f(x) was contractive. The results of applying the above approximation approach indicated that using k = 3, the average error relative to the exact method was around 5%.

Tsui, et al. later revisited their approximated approach proposed in [3] to compare it to that of Monteiro, et al. and found that the Picard-Peano method usually required more iterations but less time than the Newton-Raphson method [32]. This is because Newton-Raphson converges at a quadratic rate while Picard-Peano does so linearly, but the computation of the Jacobian in the Newton-Raphson method is more computationally expensive than the Picard-Peano iteration.
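A minimal sketch of the Picard-Peano iteration, with an illustrative contractive scalar function standing in for the state-line probability equations; note that each step costs only one function evaluation, which is the source of its per-iteration advantage over Newton-Raphson:

```python
import math

def picard_peano(g, p0, tol=1e-10, max_iter=1000):
    """Fixed-point (Picard-Peano) iteration p_{k+1} = g(p_k).
    Converges linearly when g is contractive; no Jacobian is needed."""
    p = p0
    for k in range(max_iter):
        nxt = g(p)
        if abs(nxt - p) < tol:
            return nxt, k + 1
        p = nxt
    return p, max_iter

# Contractive example: |g'(x)| = 0.5 |sin x| <= 0.5 < 1 everywhere
g = lambda x: 0.5 * math.cos(x)
x, iters = picard_peano(g, 0.5)
assert abs(x - g(x)) < 1e-9  # fixed point reached
```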
A problem found in the Picard-Peano method is that for circuits where the contractiveness condition is not met, the iterations might become oscillatory, keeping the method from reaching convergence.

All the probabilistic methods discussed above assumed the primary circuit inputs to be independent. This assumption is generally true for data lines; however, it does not necessarily hold for control inputs. In typical controller applications, the input patterns are restricted to sequences of commands which are fed to the circuit in some determined fashion. As a consequence, the inputs of those circuits have a pseudorandom behavior which typically presents high levels of spatial and temporal correlations. Assuming that the inputs of such circuits are independent frequently leads to considerable errors in the estimates of their switching activities. Accounting for all kinds of correlations is a very difficult problem, however. The estimation techniques discussed below propose approximate models which account for part of these correlations.

Schneider, et al. proposed an estimation technique capable of taking into account concurrent transitions and temporal correlations on the primary inputs of combinational circuits under a zero-delay model [33]. This technique was proposed as part of a power optimization mechanism which retargeted the functional decomposition of Boolean expressions to find solutions with low power consumption. To determine the transition probability of an internal node in a circuit, a transition function g was built by performing the XOR of two consecutive time instances of the Boolean function of the node expressed in terms of the circuit primary inputs. The signal probability of the transition function was equivalent to the transition probability of the node. Thus, for a given node y = f(X), where X = x_1, ..., x_n is the set of primary inputs, g(y) = f(X^0) ⊕ f(X^T).
To evaluate the probability of the transition function, the Boolean expression was recursively decomposed using Shannon's theorem until it was expressed in terms of the probabilities of the primary inputs, which were considered to be spatially independent. Applying Shannon's decomposition with respect to x_i^0 and x_i^T on g(y) gave

g(y) = x_i^0 · x_i^T · g(y)|_{x_i^0 x_i^T} + x_i^0 · x̄_i^T · g(y)|_{x_i^0 x̄_i^T} + x̄_i^0 · x_i^T · g(y)|_{x̄_i^0 x_i^T} + x̄_i^0 · x̄_i^T · g(y)|_{x̄_i^0 x̄_i^T}.    (2.19)

Taking probabilities and expressing in terms of the transition probabilities of the primary inputs yields

P_s[g(y)] = P_t(y) = p_11(x_i) · P_s(g(y)|_{x_i^0 x_i^T}) + p_10(x_i) · P_s(g(y)|_{x_i^0 x̄_i^T}) + p_01(x_i) · P_s(g(y)|_{x̄_i^0 x_i^T}) + p_00(x_i) · P_s(g(y)|_{x̄_i^0 x̄_i^T}),    (2.20)

where p_jk(x_i) represents the probability of signal x_i changing from state j to state k in two consecutive clock cycles. The values of these probabilities are obtained at the primary inputs as p_10(x_i) = p_01(x_i) = P_t(x_i)/2, p_11(x_i) = P_s(x_i) - P_t(x_i)/2, and p_00(x_i) = 1 - P_s(x_i) - P_t(x_i)/2.

The implementation of the algorithm was based on the construction of reduced OBDDs (ROBDDs) for each node. A single ROBDD traversal provided the transition probability of the node, which meant that the computational time was linear in the size of the BDD. Note, however, that these are global BDDs (in terms of the circuit PIs), which will limit the size of the largest circuit that can be handled due to their exponential size in the number of PIs. The study did not provide comparative results of the power estimation mechanism.

In another attempt to account for input correlations in probabilistic estimation, Marculescu, et al. proposed an analytical model which accounted for spatiotemporal correlations in combinational circuits fed with pseudorandom and biased input streams [34]. Their methodology used lag-one Markov chains to compute signal correlation (SC) and transition correlation (TC) coefficients between pairs of signals in a circuit.
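The consecutive-cycle pair probabilities quoted above can be checked quickly: for any valid P_s(x_i) and P_t(x_i) they must form a distribution, the pairs starting at 1 must sum to P_s, and the two changing pairs must sum to P_t. A small sketch (the numeric values are illustrative):

```python
def input_pair_probs(ps, pt):
    """Consecutive-cycle state-pair probabilities of a primary input,
    given its signal probability P_s and transition probability P_t,
    under the equilibrium assumption p10 = p01 = P_t / 2."""
    p10 = p01 = pt / 2.0
    p11 = ps - pt / 2.0
    p00 = 1.0 - ps - pt / 2.0
    return {"00": p00, "01": p01, "10": p10, "11": p11}

p = input_pair_probs(ps=0.6, pt=0.4)
assert abs(sum(p.values()) - 1.0) < 1e-12      # a valid distribution
assert abs(p["10"] + p["11"] - 0.6) < 1e-12    # pairs starting at 1 = P_s
assert abs(p["01"] + p["10"] - 0.4) < 1e-12    # changing pairs = P_t
```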
Higher order correlations (those which involve three or more signals) were approximated by pairwise correlation coefficients. Once the pairwise correlation coefficients of the input nodes were obtained, OBDDs were used to propagate them through the circuit. Two propagation approaches were evaluated: global and incremental. In the global approach, the OBDDs of internal and output nodes were built in terms of the PIs of the circuit, while in the incremental approach the OBDDs were developed in terms of the immediate fan-ins of the nodes. The computation of the transition probabilities at an internal node f with support set x_1, ..., x_n was carried out using the sets of OBDD paths in the ON-set and OFF-set of f, identified as Π_1 and Π_0 respectively. The probability of f making a transition from state i to j, denoted p(f_{i→j}), i, j ∈ {0, 1}, is the probability of the input vector changing from π ∈ Π_i to π' ∈ Π_j. This probability can be found in terms of the transition probabilities of the PIs and their pairwise transition correlation coefficients as a sum over all pairs π ∈ Π_i, π' ∈ Π_j of the probability of the change π → π'.

g_k(N_k) is a function returning the capacitance of the wires interconnecting the terminals that make up network N_k. Find an assignment S of each module m_i ∈ M to a unique location l_j ∈ L such that the cost of the assignment, denoted C(S) and measured by the weighted summation of equation (3.1), is minimized, subject to the condition that the intersection of the domains of any two modules is empty (no two modules can overlap).

C(S) = Σ_{k=1}^{r} a_k · g_k(N_k)    (3.1)

Term C(S) in equation (3.1) actually represents the switched capacitance of the network wires in the placement solution S. This value is directly related to changes in the factor a_k · C_{Lk} in component P_d of the total power dissipation defined in equation (2.3), as shown below.
Analyzing in some detail the load capacitance C_{Lk} appearing in equation (2.3), it can be seen that its value can be expressed as the sum of three components, as shown by equation (3.2).

C_{Lk} = C_{ok} + C_{ik} + C_{wk}    (3.2)

In this formula, C_{ok}, C_{ik}, and C_{wk} represent the equivalent capacitances of the net driver output, the net fan-in, and the wires interconnecting the terminals in the net, respectively. In the search for a power-efficient solution S, the capacitance component affected by changes in the placement of modules is C_{wk}, since the portion corresponding to C_{ok} + C_{ik} will be the same for any solution. This establishes the indicated direct relationship between equation (3.1) and changes in the P_d component of the total power dissipation. Moreover, in typical macro-block circuits the value of C_{wk} is several orders of magnitude larger than C_{ok} + C_{ik}. This implies the relevance of minimizing equation (3.1) in a power optimizing placement technique.

Equation (3.1) can be taken to an even more elemental level. The value of C_{wk} results from the equivalent capacitance between metal and polysilicon layers and the substrate. This capacitance can be approximated using a parallel-plate model similar to that in equation (3.3).

C_{wk} = (ε / t) · A_k    (3.3)

In this model, ε represents the permittivity of the insulating material between layers, t is the insulator thickness, and A_k is the plate area. The model of equation (3.3), although not considering fringing fields, multi-layer structures, and mutual capacitance of adjacent wires, provides a reasonable approximation of C_{wk} for the placement application. More accurate models which consider those factors enumerated above and others can be found in [108]. The plate area A_k in equation (3.3) can be obtained as the product A_k = w_k · d_k, where w_k and d_k represent the width and length, respectively, of the wires making up net k.
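As a rough illustration of equation (3.3) with w_k · d_k substituted for A_k (the material constants below are illustrative, not taken from the dissertation):

```python
def wire_capacitance(eps, t, w, d):
    """Parallel-plate estimate C_wk = (eps / t) * A_k with A_k = w * d.
    Ignores fringing, multi-layer structure, and coupling, as in (3.3)."""
    return (eps / t) * (w * d)

# Illustrative numbers: SiO2 permittivity, 1 um oxide, 1 um wide, 1 mm long
eps_ox = 3.9 * 8.854e-12   # F/m
c = wire_capacitance(eps_ox, t=1e-6, w=1e-6, d=1e-3)
assert c > 0
# Doubling the wire length doubles C_wk -- the lever the placer exploits
assert abs(wire_capacitance(eps_ox, 1e-6, 1e-6, 2e-3) - 2 * c) < 1e-20
```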
Considering that changes in placement do not affect parameters ε and t, and assuming that w_k = w is constant for all nets, C_{wk} becomes a direct function of d_k. Thus, a reduction of the switched capacitance, and therefore of the power dissipation of the circuit, can be effectively achieved by reducing the product a_k · d_k of the networks of the circuit. If function g_k(N_k) is redefined to return the wirelength of net N_k, the objective function of the placement problem can be rewritten as equation (3.4).

D(S) = Σ_{k=1}^{r} (a_k · d_k)    (3.4)

This equivalent objective function, now named D(S), is called the switched length of the circuit. Note that wirelength and switched length are in general different objective functions, and in fact, in some cases they can be conflicting objectives, as illustrated in the case below.

Consider the circuit in Figure 3.1, where each gate, considered as a block (p = 8), is to be placed in a rectangular layout. To reduce the size of the solution space it is assumed that the layout contains a number of locations equal to the block count (q = 8), that module dimensions as well as inter-module spaces are of unit length, and that input and output pads are attached to their associated modules. This last assumption reduces the net count to 7 (r = 7) with an equal number of signals, whose switching activities have been previously estimated. Despite these simplifying assumptions, the solution space S remains quite large, |S| = 8! = 40320.

Figure 3.1. Circuit to be placed with estimated switching activities per net.

Figure 3.2 shows two solutions for this problem. The first one optimizes wirelength only, while the second optimizes the switched length of the layout. Note that optimizing one parameter renders the other suboptimal. This situation, typical in most circuits, makes it necessary to redefine the objective function guiding the mechanism used to optimize a placement for low power dissipation.
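The conflict between the two objectives is easy to reproduce with toy numbers (the nets below are invented for illustration, not those of Figure 3.1):

```python
def wirelength(nets):
    """Total wirelength: sum of d_k over all nets (a_k ignored)."""
    return sum(d for a, d in nets)

def switched_length(nets):
    """Switched length D(S) = sum of a_k * d_k, as in equation (3.4)."""
    return sum(a * d for a, d in nets)

# Two hypothetical placements of the same nets, as (activity, length) pairs
sol1 = [(0.30, 1), (0.05, 2), (0.40, 2), (0.10, 4)]  # shorter wires overall
sol2 = [(0.30, 2), (0.05, 3), (0.40, 1), (0.10, 4)]  # longer, but hot net short
assert wirelength(sol1) < wirelength(sol2)
assert switched_length(sol2) < switched_length(sol1)  # objectives conflict
```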
The example discussed above highlights two important characteristics of the placement problem for low power. First is its intractability. The placement problem has been proved to be NP-hard regardless of the optimization objective and the simplifying assumptions introduced by the specifications of the application [53]. Second is the importance of the correct estimation of the switching activity a_k and wirelength d_k of the nets of the circuit.

3.3 Approach

The central objective of this work is to develop a placement methodology capable of producing circuit layouts with near-minimal switched length. As was established in the previous section, this will result in a reduction of the total power dissipation of placed circuits with respect to conventional techniques, which traditionally target the expected wirelength for reducing either the overall delay or layout area.

Figure 3.2. Placement solutions with objective functions for: a) wirelength (Σ d_k = 9, Σ a_k · d_k = 1.975), b) switched length (Σ d_k = 11, Σ a_k · d_k = 1.883).

This new placement methodology also aims at a higher objective of supporting the potential development of complementary tools which, when combined, could provide a complete physical design methodology for low power circuits. In order to achieve these goals, a bottom-up approach was followed in which each task was specifically designed with the power reduction goal in mind. An outlined description of the most relevant tasks is presented below. The relationships between tasks are graphically depicted in Figure 3.3.

Task 1: Development of a power annotated netlist format. Given a circuit specified through a set of design rules, a standard physical netlist, and a functional netlist, the first task consisted of generating a netlist format for optimizing the power dissipation of the circuit through its physical design.
This task turned out to be a relevant step in this research, due to the lack of power dissipation parameters in conventional netlist formats. The new format is used to specify both the input and output of the placement process. The transformation process is made with the aid of the switching activity estimator previously selected. The new format is made externally available for use in other design tools that could be developed for reducing the power dissipation of circuits during the physical design process.

Figure 3.3. Relationships among tasks in the solution approach.

Task 2: Design of localized representations of the layout geometry and network set. In this task, a suitable internal representation of the circuit was designed which allowed an efficient application of algorithms that extract and update placement-dependent information. This was one of the key factors for the speed of the optimization mechanism, due to the vast number of placement configurations generated in the processing of a single layout. The designed representation captures the geometric features of the circuit through a line search engine. The same mechanism also provides information to a network generation mechanism which handles the geometrical and electrical characteristics of the network structures.

Task 3: Design of cost evaluating algorithms. The efficient computation of the amount of overlap, switched length, and general routability of the circuit at each step of the optimization process were key factors for assessing the accuracy and speed of the evaluation of new placement solutions.
In this task, algorithms for computing these parameters were developed. The algorithms were designed to execute on the internal representation of the circuit designed in Task 2.

Task 4: Implementation of the optimization mechanism. As previously established, the mechanism selected for optimizing the circuit placement was simulated annealing. In this task, the algorithms that make up the core of the simulated annealing process were designed. These algorithms were designed to adapt themselves to the problem being solved, especially the determination of the annealing schedule and its interaction with the solution perturbation mechanism. Additional algorithms developed in this stage included those dedicated to the collection of statistics for the calibration of the parameters governing the annealing process.

Task 5: Extraction of data, validation, and evaluation. In this last task, a number of experiments were performed to collect data from the placement of a set of benchmark circuits selected to ascertain the performance of the newly designed placement tool. Comparative analyses were carried out with three central purposes: first, to validate the results against those obtained with an established placement tool; second, to assess the gains in power reduction achieved with the new tool; and third, to measure the scaled sub-optimality of the heuristic.

3.4 Summary

This chapter provided a statement of the problem of placing a circuit for low-power optimization and reviewed the outlined course of actions executed for providing a solution to it. Terms and definitions essential for the statement were given first. The statement itself, in addition to presenting the problem with some formality, made evident the complexity involved in the search for a solution. The outlined task description provided a map of the remaining chapters, where the detailed descriptions of the procedure are presented.
CHAPTER 4

New Steiner Tree Heuristics for Physical Design Problems

This chapter presents two new time-efficient greedy heuristics for the minimum rectilinear Steiner tree (MRST) problem in graphs. These heuristics, originally developed for the wirelength estimation problem in the low-power placement methodology of this dissertation, can also be applied to diverse problems found in the physical design process of VLSI circuits and in other areas. The organization given to the chapter begins with a discussion of the reasons why new heuristics were needed for the low-power application. Next, definitions of the terminology used throughout the rest of the chapter are given. The algorithms used to transform circuit layouts into suitable graphs for application of the MRST heuristics are discussed in Section 4.3. This section also describes a set of reduction mechanisms applied to the graph prior to the MRST heuristics to limit the size of the search space. Section 4.4 describes the developed heuristics and their performance analysis. The last two sections present experimental results of the application of the heuristics and a summary of the results of the chapter.

4.1 Why more MRST heuristics?

Finding a minimum length network interconnecting sets of points in the plane is a central task in the physical design of VLSI circuits, particularly at the stages of placement and routing. Regardless of the objective function used in either or both of these stages, the end results of these procedures will be affected by the quality of the algorithm used to find the interconnecting network. Within a typical placement mechanism, a quick wirelength estimator like the semi-perimeter method is commonly used. In the routing stages, although some other more accurate but slower technique is used, in most cases the behavior of the route estimation heuristic in the largest, least frequent nets is not a main concern.
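The semi-perimeter estimate mentioned above is simply half the perimeter of the smallest axis-aligned rectangle enclosing a net's terminals; a minimal sketch:

```python
def half_perimeter(terminals):
    """Semi-perimeter (half-perimeter) wirelength estimate of a net:
    half the perimeter of the bounding box of its terminal points."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# Exact rectilinear tree length for 2- and 3-terminal nets;
# a lower bound (hence an underestimate) for larger nets
assert half_perimeter([(0, 0), (3, 4)]) == 7
assert half_perimeter([(0, 0), (2, 2), (5, 1)]) == 7
```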
This approach frequently yields practical results in cases where objective functions are not severely affected by non-homogeneous errors in the estimates. Further, the errors from the placement stage have little or no effect in the routing stage, since typical wirelength estimators used in placement do not provide the detailed route information required in later stages.

A different situation arises when optimizing the placement or routing of macro block-based layouts for low-power dissipation. Here the goal is to reduce the power dissipation of a circuit by reducing the wire capacitance of highly active interconnects. In this case, the placement objective function is especially sensitive to the wirelength of specific nets, since the capacitance of a net is a direct function of its length. An additional requirement introduced by the layout style is that the interconnecting mechanism has to consider networks with multiple points and obstacles in the routing plane. Moreover, the information obtained from such nets during placement would benefit the global routing phase when providing continuity to the common objective of reducing the switched capacitance. Under these requirements, a logical approach would be to use an accurate wirelength estimator during placement, capable of yielding results within a reasonable time frame. In addition, if the results contain enough detail, they can be used later to guide the channel definition and assignment process during the global routing phase. Using these guidelines and considering the review of wirelength estimators presented in Chapter 2, it can be concluded that the best estimation mechanism for such a constrained application should be based on a fast and yet accurate SMT heuristic. This makes the class of greedy SMT heuristics referred to in Chapter 2 the first choice in the selection process.
A detailed analysis of several reported heuristics in this class revealed that although their worst-case performance is bounded, their estimation error grows with the number of points to be connected by the tree [72-75]. Moreover, it was suspected that larger nets in typical circuits had significant activity, making them critical nets in the power dissipation of the layout. To verify this suspicion, an analysis of switching activity versus net size was performed on a group of selected circuits from the ISCAS benchmark suite [109]. A set of four circuits that included sequential and combinational designs was selected, providing a total of 1323 nets with terminal counts between 3 and 57 terminals per net. The switching activity estimator previously selected was applied, and the data obtained, plotted as a scatter of activity versus terminal count per net, is shown in Figure 4.1. The figure also shows the central tendency of the data, measured as the average activity per net size. The figure confirms the initial suspicion: an increase in the terminal count per net does not imply a reduction in the switching activity of the nets. In fact, single nets with as many as 56 terminals were found to have activities over 0.30, consuming as much as 6% of the total power in circuits with over 600 nets.

Figure 4.1. Switching activity vs. number of terminals in selected circuits.

Therefore, an error skew in the wirelength estimation that worsens as the terminal count per net increases is an undesirable characteristic for an estimator used in the low-power placement or routing application. An error skew of this type is, however, unavoidable in a heuristic estimator.
Thus, the search was focused on an SMT-based estimator with better handling of nets with large terminal counts. This search was unsuccessful, so the development of a new heuristic with the desired characteristics was undertaken. The remaining sections of this chapter discuss the development of two new time-efficient greedy SMT heuristics which, given a layout graph of a circuit instance containing n vertices of which t (t ≤ n) are terminals, find an approximate solution in O(n² · log n) time. Both heuristics grow trees guided by the distance of the graph vertices measured with respect to the terminal vertices. In order to produce data sets from actual layouts, the implementation of algorithms for transforming macro block circuit netlists into graphs is discussed. Graph reduction algorithms are developed and implemented to reduce the search space of the SMT heuristics. To measure the quality of solutions and the suitability of the proposed heuristics for the low-power placement problem, they are evaluated using a set of benchmark circuits and compared to three other known approaches of the same class. The results obtained highlight the strengths and limitations of each of them.

4.2 Terminology and Definitions

The terminology used throughout this chapter includes definitions found in the classical graph theory literature as well as terms which apply to the specifics of the developed algorithms. Given a layout with obstacles, an escape segment is a horizontal or vertical line which extends along the width or height of the layout to either the internal perimeter of the routing region or to the border of an obstacle, without intersecting either of them [110]. An escape grid is obtained by drawing escape segments around obstacles and through each terminal point along an unobstructed direction [71]. An escape grid can be transformed into a layout graph by representing the grid intersections as vertices and the segments between intersections as edges.
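To make the grid-to-graph transformation concrete, the sketch below builds such a graph from lists of horizontal and vertical escape segments. All names are illustrative, and a naive all-pairs intersection pass stands in for an optimized line sweep:

```python
def build_layout_graph(h_segs, v_segs):
    """Turn escape segments into a layout graph: intersections become
    vertices, and segment pieces between intersections become weighted
    edges.  h_segs holds (x1, x2, y) triplets; v_segs holds (x, y1, y2)."""
    edges = {}  # (vertex, vertex) -> rectilinear length

    def add_edge(a, b):
        w = abs(b[0] - a[0]) + abs(b[1] - a[1])
        edges[(a, b)] = edges[(b, a)] = w

    for (x1, x2, y) in h_segs:
        # every vertical segment crossing this horizontal one yields a vertex
        pts = sorted((x, y) for (x, y1, y2) in v_segs
                     if x1 <= x <= x2 and y1 <= y <= y2)
        for a, b in zip(pts, pts[1:]):
            add_edge(a, b)
    for (x, y1, y2) in v_segs:
        pts = sorted((x, hy) for (hx1, hx2, hy) in h_segs
                     if hx1 <= x <= hx2 and y1 <= hy <= y2)
        for a, b in zip(pts, pts[1:]):
            add_edge(a, b)
    vertices = {v for e in edges for v in e}
    return vertices, edges
```

The quadratic intersection test matches the O(m²) worst case quoted in Section 4.3 for the case where all horizontal and vertical segments intersect among themselves.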
Thus, a layout graph G(V, E) is defined as a connected, undirected, edge-weighted graph representing a layout with obstacles. In this notation, V represents the resulting finite set of vertices and E represents the set of unordered pairs of distinct vertices which symbolize the edges. Edge weights are determined by a cost function w : E → R+ through the assignment of the segment lengths as weights. Set V is made up of subsets H and T, where T contains the vertices resulting from grid intersections falling on terminals of the layout. Subset H includes the vertices resulting from the intersections of escape segments at points where there are no terminals. Layout graphs are also referred to as escape graphs [71]. Within a layout graph G(V, E), the distance between two vertices, d(v_i, v_j), is the length of a shortest path from v_i to v_j. If d(v_i, v_j) is maximal with respect to all other pairs of vertices in G, then d(v_i, v_j) is the diameter of G, D(G), and v_i and v_j are peripheral vertices in the graph. The distance of a vertex, d(v), in graph G is the sum of the distances from v to all other vertices in G. The median M(G) of a graph G is the subgraph induced by the set of vertices having minimum distance [59]. When computing the distance of a vertex u considering only terminal vertices as destinations, the value obtained is the terminal distance, d_t(u), given by

    d_t(u) = Σ_{v_i ∈ T} d(u, v_i).    (4.1)

The terminal diameter D_t(G) of a layout graph G(V, E) is defined as the length of a longest shortest path between two terminal vertices. The terminal median M_t(G) of a layout graph G is the subgraph induced by the set of terminals having minimum terminal distance and the shortest paths connecting them. Note that M_t(G) might contain non-terminal vertices as well. A v_i–v_j path is said to be median-biased, denoted P^M_{v_i v_j}, if it runs over a set of vertices with minimum terminal distance.
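The terminal distance of Equation (4.1) can be computed with one Dijkstra run per terminal. A minimal sketch, with illustrative names and assuming a connected graph:

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest path lengths; adj maps u -> {v: weight}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def terminal_distances(adj, terminals):
    """Eq. (4.1): d_t(u) is the sum of d(u, v_i) over all terminals v_i."""
    dt = {u: 0 for u in adj}
    for t in terminals:
        dist = dijkstra(adj, t)
        for u in adj:
            dt[u] += dist[u]
    return dt
```

With a binary heap this costs O(|T|) Dijkstra runs, consistent with the O(p² · log p) labeling cost derived for Step 1 of the heuristics in Section 4.4.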
Note that d(v_i, v_j) will not change if the path on which it is defined is median-biased.

4.3 Layout Graph Extraction and Reduction

The concept of using layout graphs for routing multi-terminal nets in the presence of obstacles was studied by Ganley and Cohoon [71]. They proved that if a problem instance is solvable, then there is an optimal solution which is a subgraph of its escape graph. By using a line sweep algorithm, the generation of all line segments from a problem instance can be accomplished in time O(m · log m), where m represents the sum of the number of terminal points to be interconnected plus the number of module sides in the layout [110]. Finding the intersections of the line segments can be completed in the worst case in time O(m²), if all horizontal and vertical segments intersect among themselves [111]. This step produces all vertices and edges of the layout graph. Figure 4.2 shows an example of a macro block-based circuit layout and its corresponding layout graph.

Figure 4.2. Macro block layout and its equivalent graph: (a) macro block-based layout; (b) equivalent layout graph.

Layout graphs generated in this manner often contain vertices and edges that do not lie in any optimal solution. Reduction algorithms can be used to identify and eliminate these elements from the graph, thus reducing the search space for the heuristic used to find a solution. Techniques capable of deleting as much as 90% of the potential Steiner points created in graphs constructed from unobstructed layouts have been reported [65]. However, not all of these techniques can be directly applied to the kind of graphs generated for this application, for either compatibility or time efficiency reasons. Some of these reductions require transformations which are possible only in graphs obtained from obstacle-free layouts.
Additionally, not every technique applicable to graphs obtained from layouts with obstructions is suitable for the target application; some of them are too computationally expensive. For these reasons, only a selected set of reductions was adapted (when necessary) for use in layouts with obstacles. The first type of reduction deletes non-terminal vertices of degree two and their incident edges. In the construction of most layout graphs, at least four such vertices will be initially produced in the corners of the layout. These vertices can be easily detected during the construction of the graph. If v_c is a vertex with deg(v_c) = 2, with adjacent vertices v_x and v_y forming a path P1 = {v_x, v_c, v_y}, it has been shown that v_c can be deleted from a Hanan grid by establishing an alternate (v_x, ..., v_y) path P2 of the same cost as P1 which does not contain v_c [65]. This reduction can be applied to escape graphs by relaxing the selection criterion to searching for the existence of a path P2 such that Cost(P2) ≤ Cost(P1). Note that this criterion does not require finding a shortest (v_x, ..., v_y) path, making it possible to use a breadth-first search (BFS) algorithm to find P2. Such an algorithm would identify a suitable path in O(p) time, where p is the total number of vertices in the input graph. Moreover, since the graph was directly obtained from the layout geometry, positional information of vertices can be used to reduce the scope of the search algorithm, thus reducing its average run time. The search for potentially removable vertices of degree two can be efficiently performed by holding a FIFO queue initialized with those vertices identified during the graph generation stage as potential candidates. As removable candidates are identified and deleted from the graph, new candidates may be created. These candidates can be added to the FIFO queue for later processing.
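The reduction pass with its FIFO queue can be sketched as follows. Names are illustrative; a cost-bounded Dijkstra stands in for the positionally pruned BFS described above, and candidates whose alternate-path test fails are joined into a single merged edge:

```python
import heapq
from collections import deque

def has_cheap_path(adj, src, dst, bound):
    """Is there a src..dst path of total cost <= bound?  (A stand-in for
    the positionally pruned BFS described in the text.)"""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return True
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd <= bound and nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return False

def reduce_graph(adj, terminals):
    """adj: vertex -> {neighbor: weight}, undirected.  FIFO queue of
    degree-two non-terminal candidates, as described in the text."""
    q = deque(v for v in adj if v not in terminals and len(adj[v]) == 2)
    while q:
        vc = q.popleft()
        if vc not in adj or len(adj[vc]) != 2:
            continue                      # candidate changed since enqueued
        (vx, w1), (vy, w2) = list(adj[vc].items())
        p1 = w1 + w2
        # detach vc before searching so the alternate path cannot use it
        del adj[vx][vc], adj[vy][vc], adj[vc]
        if not has_cheap_path(adj, vx, vy, p1):
            # no alternate path: join the incident edges into one edge
            w = min(adj[vx].get(vy, float('inf')), p1)
            adj[vx][vy] = adj[vy][vx] = w
        for v in (vx, vy):                # deletions may create candidates
            if v in adj and v not in terminals and len(adj[v]) == 2:
                q.append(v)
    return adj
```

Each vertex enters the queue a bounded number of times, and each alternate-path search is confined to the bound Cost(P1), matching the single-pass behavior claimed for these reductions.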
The second type of reduction involves those candidates of degree two which do not satisfy the alternate path condition. In this case, the vertex is removed and its two incident edges are joined into a single edge connecting vertices v_x and v_y. A third type of reduction involves non-terminal vertices whose degree drops to one as a result of other reductions. These vertices are unconditionally removed from the graph as they are created. Figure 4.3 shows the layout graph previously obtained in Figure 4.2 after the application of the reduction algorithms.

Figure 4.3. Reduced layout graph of Figure 4.2.

This set of reductions can be performed in a single pass over the input graph with n ≤ p terminals through the use of the queue indicated above. The worst-case time complexity of these computations is O((p − n)²) for the case when all non-terminal vertices must be checked for alternate paths.

4.4 Median-biased Trees

Given a layout graph G(V, E) with a set of terminal vertices T ⊆ V, the problem of estimating the wirelength of the interconnection of the set of points in T can be regarded as that of finding the cost of a Steiner minimal tree (SMT) on a network. This problem calls for the establishment of a minimum-length subnetwork S*(T) spanning all terminal vertices in the graph. To offer an approximate solution to this problem, two heuristic algorithms which grow Steiner trees guided by the terminal distance of the vertices of the graph are presented. In this way, the solutions they provide will be in the context of median-biased trees. The rationale behind growing median-biased trees comes from the observation that in many practical cases, an SMT solution runs over a set of vertices with minimum terminal distance. Thus, by biasing the growth of a tree by this metric, the heuristics search for this type of solution.
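As an illustration of this terminal-distance-guided growth, the sketch below attaches terminals in non-decreasing order of d_t and splices in shortest paths. It is an approximation: plain shortest paths are used where the heuristics use median-biased ones, and all names are assumptions:

```python
import heapq

def shortest_paths(adj, src):
    """Dijkstra with parent pointers; adj maps u -> {v: weight}."""
    dist, parent = {src: 0}, {src: None}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, parent

def grow_tree(adj, terminals):
    """Grow a Steiner tree by attaching terminals in non-decreasing
    order of terminal distance d_t (Eq. 4.1)."""
    dt = {u: 0 for u in adj}
    for t in terminals:                  # label every vertex with d_t
        d, _ = shortest_paths(adj, t)
        for u in adj:
            dt[u] += d[u]
    order = sorted(terminals, key=lambda t: (dt[t], str(t)))
    tree, cost = {order[0]}, 0           # root: smallest terminal distance
    for x in order[1:]:
        d, parent = shortest_paths(adj, x)
        c = min(tree, key=lambda v: d[v])  # closest vertex already in tree
        cost += d[c]
        v = c
        while v is not None:             # splice the x..c path into the tree
            tree.add(v)
            v = parent[v]
    return tree, cost
```

Intermediate vertices on the spliced x..c path may be non-terminals, which is how Steiner points enter the solution.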
The operation of both proposed heuristics is preceded by a pre-processing of the layout graph to identify and eliminate redundancies in the input graph. The criteria described in the previous section are used to perform these reductions.

4.4.1 Median-biased Heuristic (MBH)

This heuristic starts by labeling all vertices in the reduced graph with their terminal distances. This allows one to identify the terminal median of the graph. A terminal vertex of smallest terminal distance is selected as root of the tree. If there is more than one, an arbitrary one is selected. The rest of the tree is grown by selecting the remaining terminal vertices of G in non-decreasing order of their terminal distance and connecting them to the partially built tree through a median-biased shortest path. The MBH is outlined below in Heuristic (4.1).

Heuristic 4.1 (MBH) Finds an SMT approximation on a graph.
Input: Reduced layout graph G(V, E) with terminal set T ⊆ V, |T| > 2.
Output: Median-biased Steiner tree S(T).

1. Initialization: The solution tree S is initialized as empty. All vertices are labeled with their terminal distances. As each terminal vertex is labeled, it joins an ascending priority queue (APQ) indexed by the terminal distance of its elements.
   Let S ← ∅. ∀v_i ∈ V, find d_t(v_i). If v_i ∈ T then insert v_i into APQ.

2. Root: A terminal r with smallest terminal distance is removed from the queue to become the root of the solution tree.
   Remove r from APQ and do S ← r.

3. Selection: A terminal x with smallest terminal distance is removed from the queue to be added to the solution tree.
   Remove x from APQ.

4. Growth: A vertex c in the solution tree closest to x is identified. A median-biased shortest path P^M_{xc} is added to the solution tree.
   Find a c ∈ S such that d(x, c) is minimum. Establish P^M_{xc} and add it to S.

5. Termination: If all terminals have been processed, the solution tree is returned. Otherwise the next terminal is processed.
   If APQ = ∅ then output S, otherwise return to Step 3.

End

To analyze the time complexity of the above algorithm, it will be assumed that |V| = p and |T| = n, where n ≤ p. The initialization step requires finding shortest paths from all vertices to all terminals and sorting the terminals by their terminal distances. By using a heap implementation of Dijkstra's algorithm, the distance calculation can be completed in time O(p² · log p). By also implementing the APQ as a heap, the sorting of terminals can be completed in time O(n · log n), rendering the distance computation the dominant process in Step 1. The use of heaps allows Steps 2 and 3 to be completed in O(log n) time, while the growth step requires O(p · log p) time. Since Steps 3 through 5 are repeated n times, their complexity will be less than or equal to that of Step 1. Therefore, the overall time complexity of the algorithm is O(p² · log p).

4.4.2 Diameter-biased Heuristic (DBH)

In this heuristic, as in the MBH, the initialization step labels all vertices with their terminal distances. However, the solution tree is not initialized as empty. Instead, two peripheral terminals are identified and used to establish a median-biased path whose length is equal to the terminal diameter of the reduced layout graph. This path will be the trunk of the solution tree. After this, the remaining terminals are selected one at a time in non-decreasing order of their terminal distances and connected to the partially built tree through a median-biased shortest path. The reason for selecting a median-biased terminal diameter as the trunk of the solution tree lies in the observation of SMT solutions in which such a path was frequently part of the solution. In addition, a terminal diameter provides a lower bound on the cost of the SMT, as shown in Lemma 4.1.

Lemma 4.1 Let D_t(G) be the terminal diameter of a layout graph G(V, E), measured between vertices x and y from set T. Then for any SMT S*(T) of G it is verified that

    |S*(T)| ≥ D_t(G).    (4.2)

Moreover, if |S*(T)| = D_t(G) then S*(T) is a median-biased shortest x–y path.

Proof: Let D_t(G) be a terminal diameter in a layout graph G, measured between terminals x and y. The first part of the lemma can be proven by contradiction. Assume there exists an SMT S* in G such that |S*| < D_t(G). Then, since both x and y are part of S*, there exists an x–y path P_xy in S* satisfying |P_xy| < d(x, y). Since S* is a subgraph of G, P_xy can also be established in G, contradicting that D_t(G) is the length of a shortest x–y path in G. Therefore the relation of Equation (4.2) must be satisfied. The proof of the second part of the lemma requires showing that if an SMT S* of G has length D_t, then all terminal vertices fall on a shortest x–y path P_xy. Note that if there were at least one vertex z ∈ T such that z ∉ P_xy, then removing z and all non-terminals connecting it to P_xy would yield D_t < |S*|, contradicting that |S*| = D_t. So, all non-terminal vertices falling on P_xy will also be shortest-path connected to all terminals in G. This implies that their terminal distances are minimal, making P_xy median-biased. □

The outline of the DBH, as can be inferred from the above discussion, is very similar to that of the MBH. The main difference lies in Steps 1 and 2, which are restated as indicated below:

Heuristic 4.2 (DBH) Finds an SMT approximation on a graph.

1. Initialization: All vertices are labeled with their terminal distances. As each terminal vertex is labeled, it joins an ascending priority queue indexed by the terminal distances of its elements.
   ∀v_i ∈ V, find d_t(v_i). If v_i ∈ T then add v_i to APQ.

2. Trunk: Identify a pair of peripheral terminals (x, y) and establish a median-biased shortest path P_xy between them. Make the solution tree equal to P_xy.
   S ← P_xy, where P_xy is median-biased and |P_xy| = max(d(u, v)) ∀(u, v) ∈ T.
End

As in the MBH, the time-critical part of this heuristic is Step 1, the labeling of the vertices of the graph. Therefore its time complexity remains O(p² · log p). Two additional steps could be added to both the MBH and the DBH which, if inserted after the termination step, could in some cases improve the quality of the solution without affecting the overall time complexity of the heuristics.

6. Rearrangement: Build a minimum spanning tree S′ for the subnetwork of G induced by the set of vertices in S.

7. Pruning: Remove from S′, one at a time, all non-terminal vertices of degree 1.

Given that Heuristics (4.1) and (4.2) provide approximate solutions, it is of interest to know how far from the optimal these solutions could be. The analysis methodology presented by Takahashi and Matsuyama in [72] can be applied to the proposed heuristics with minor modifications to provide answers to this question. This methodology shows that the worst-case solutions of the class of heuristics to which the MBH and DBH belong are upper bounded by twice the cost of the optimal Steiner tree. Specifically, for a problem with |T| = k, the relation of the cost of the heuristic solution S to the cost of the optimal solution S* is given by

    |S| / |S*| ≤ 2 (1 − 1/k).    (4.3)

The methodology also shows that this bound is tight, i.e., it is not possible to obtain a lower upper bound, through scaling cases for which the relationship asymptotically tends to the limit [72].

4.5 Experimental Results

To assess the quality of the solutions provided by the MBH and DBH, their performance was compared to that of other heuristics considered close competitors in terms of the approach used to produce solutions. The heuristics selected for comparison were the median heuristic (DMH) proposed by Diané and Plesník and the distance-network heuristic (DNH) proposed by Kou et al. [73], [74]. The DNH operates as follows.
Given a graph G(V, E) and a terminal set T ⊆ V, begin by constructing the distance network of G, DN(G). Next, obtain an MST Q on DN(G) and prune it one vertex at a time until obtaining a tree Q′ in which all leaves are terminal vertices. Tree Q′ is returned as S. The distance network DN(G) is a complete graph defined on set T such that any edge between terminals v_i and v_j in DN(G) represents a shortest v_i–v_j path in G.

The second reference heuristic, the DMH, operates as follows. Given a graph G(V, E) and a terminal set T ⊆ V, begin by finding the median of G, M(G). Then, find a set of shortest paths from M(G) to all vertices in T and construct on it an MST K. Prune K to form a tree K′ where all leaves are terminals. Next, find a DNH solution Q′ for G(V, E) and T, and return S = {D : |D| = min(|K′|, |Q′|)}.

In addition to the DNH and the DMH, the results of the MBH and the DBH were also compared to those of a very simple MST heuristic. In this case, a solution S is constructed by building an MST of the layout graph and pruning all non-terminal vertices, one at a time, until the leaves of the tree are all terminals. This heuristic is based on Prim's algorithm and will be referred to as the MST. The reason for including such a simple heuristic is to evaluate whether the added time complexity of the other four heuristics is worthwhile in terms of the quality of the results.

A set of five macro block-based circuit layouts was selected from the MCNC benchmark suite (apte, xerox, hp, ami33, and ami49) to support the experiments. The layouts in these data sets contained between 9 and 49 obstructing blocks, with about one thousand nets whose numbers of terminals ranged between 2 and 74. The circuits were given an arbitrary placement before being processed. The selected layouts and their networks were transformed into appropriate layout graphs over which the SMT-based wirelength estimation heuristics would work.
Prior to the application of the heuristics, the reductions described in Section 4.3 were applied to the input graphs. This preprocessing significantly reduced the size of the graphs, limiting the search space of the heuristics. The reduction results are summarized in Table 4.1. It is observed that the average reduction obtained with the implemented techniques decreased as the number of terminals per net increased. However, it is also noticed that in more than 90% of the nets, the average reductions exceeded 95% of the Steiner candidates in the input graph. Although these results include two-terminal nets, such entries were considered only to report the outcome of the reduction techniques. The Steiner estimation results consider only nets with three or more terminals.

Table 4.1. Results of graph reduction algorithms.

    Terminals   Proportion     Average
    per net     in data (%)    reduction (%)
    2           73.52          99.58
    3           15.54          97.33
    4            3.61          96.09
    5            1.97          87.14
    6            1.31          91.50
    7            0.88          81.16
    8            0.33          95.27
    9            0.88          56.70
    10+          1.97          41.18

The reduced layout graphs of each circuit and their nets were fed to the five heuristics under evaluation. To measure the relative quality of the results provided by the heuristics without explicit knowledge of the optimal solution, an analysis was made of the frequency with which each heuristic yielded the best and worst result. In this analysis, given a group of nets with an equal number of terminals, the proportion of best and worst results given by each heuristic was computed. The results of this analysis are summarized in Figures 4.4 and 4.5 for the best and worst results, respectively. From Figure 4.4 it is observed that for nets with 3 and 4 terminals, both the DMH and the MBH provided the best result in more than 85% of the cases. For nets with five or more terminals, the MBH consistently provided the best result most of the time.
This behavior in the frequency of best solutions is very significant, as it indicates that the skew in the estimation error of the MBH will be smaller than that of the rest of the evaluated heuristics. It must also be noted that although in typical circuits most nets have 4 or fewer terminals, it is also true that nets with 5 or more terminals are more likely to have larger capacitances than those with 4 terminals or fewer. Moreover, the analysis presented in Section 4.1 indicates that nets with 5 or more terminals are not less active than those with a lesser number of terminals. Therefore, the estimation of their wire requirements should not be less accurate, because of their likelihood of becoming critical power-dissipating nets during placement.

Figure 4.4. Proportion of best solutions given by each heuristic.

Figure 4.5. Proportion of worst solutions given by each heuristic.

This reasoning then points to the MBH as the most appropriate wirelength estimator for the low-power placement application. The remaining heuristics in general performed from average to poor, with the MST being the worst. The DBH, DNH, and MST were found to provide good solutions in nets with a low terminal count. However, they show a growing tendency to provide the worst solution as the terminal count per net increases. Figure 4.5 shows this behavior through the proportion of times each heuristic provided the worst estimate as the number of terminals per net increased. To assess the relative speed of the heuristics, their average running times were recorded for each set of nets with an equal number of terminal points. Table 4.2 summarizes these statistics, obtained from the execution of the heuristics on a Sun SPARC Ultra 1. The results show that the MST, as expected from its reduced time complexity, provided the fastest solutions.
It can also be observed that although all distance-based heuristics have the same order of complexity, the DNH is faster than the others in its class. A closer analysis of its time complexity shows that the polynomial describing its time requirements has smaller coefficients than those of the other three heuristics. The time requirements of the MBH, DBH, and DMH were in the same neighborhood of values.

Table 4.2. Running time of the heuristics in ms.

    Terminals                    Heuristic
    per net     DBH      DMH      DNH     MBH      MST
    3           19.2     10.7     0.5     19.5     0.3
    4           4.8      3.5      0.5     5.2      0.2
    5           170.7    115.5    3.9     272.9    1.7
    6           52.5     29.8     2.2     78.7     0.7
    7           313.2    131.8    4.8     227.5    1.4
    8           1.7      1.8      0.8     1.5      0.1
    9           80.6     76.3     9.2     117.3    1.8
    10+         2397.4   1492.2   183.7   2149.2   7.0

4.6 Summary

Techniques for efficiently estimating interconnection networks in a power-optimizing physical design environment have been studied. Given the accuracy constraints of the application framework, techniques falling in the class of greedy Steiner minimal tree (SMT) estimators were selected as the study target. Two new heuristics were proposed and evaluated: the median-biased heuristic (MBH) and the diameter-biased heuristic (DBH). Evaluations included analyses of their time and space complexities as well as their error bounds to establish their validity as approximation heuristics. Algorithms for transforming macro block netlists into appropriate graphs were implemented. To limit the search space of the estimation heuristics, selected graph reduction techniques were adapted and implemented. These reductions were found to be both time-efficient and powerful: results indicate average deletions of over 95% of Steiner candidates in more than 90% of the cases. The results from the SMT approximation heuristics favored the MBH in the consistency of quality results, a desired characteristic in the application framework of the estimators. Thus the MBH offers estimates with the minimum skew over the other heuristics for all tested network sizes.
CHAPTER 5

Implementation of the Power Optimizing Placement Tool

This chapter describes the implementation of the SAPPO tool. SAPPO stands for Simulated Annealing-based Placer for Power Optimization. The chapter begins with an introduction to the input format accepted by the program, followed by a description of the internal layout representation used in the program. Section 5.3 describes the tasks completed in the implementation of the simulated annealing heuristic. These include the design of the cost function and the annealing schedule. Special emphasis is placed on the description and analysis of a new algorithm developed for the handling of rectilinear shapes. Another area of emphasis is a description of the process followed in the calibration of the cost function parameters and the annealing schedule. Results from calibration experiments are presented to assess the consistency and stability of the schedule.

5.1 Input Format

The SAPPO program accepts as input the specification of a circuit layout, which includes the geometry and connectivity information of its modules as well as the switching activity of its networks. Given that no standard netlist format was found that could support power optimization goals, a new physical layout netlist format was developed. This new format was based on the well-known YAL netlist language used in the MCNC benchmarks. It extends the capabilities of its predecessor to support power-optimizing physical design. The switching activity information embedded in the LPY netlist format can be obtained from the functional description of the circuit through the use of the previously selected switching activity estimator. In addition to the netlist description, the input format must include a second set of data specifying design rules such as the electrical characteristics of the routing layers used in the layout, their minimum wire widths, inter-wire spaces, and constraints on the overall layout area, among others.
Appendix A describes in detail the extensions made to the base language to create the low-power YAL (LPY) format, as well as the rules and constraint specifications accepted by the SAPPO tool.

5.2 Internal Layout Representation

The input layout specification accepted by the placement tool is internally split into two data sets which capture the geometry, the inter-module relationships, and the electrical characteristics of the layout. The first data set, called the line command set, is a line-based representation of the layout which uses several line types to identify the boundaries of all modules and all interconnecting points. The second data set, called the network list, contains the connectivity among modules and the power parameters of each net. The network-module associations are made through a set of line commands representing the terminal sets joined together by a network. The line-based representation was selected for several reasons, which include:

- The speed of the line segment generation process. The process can be completed in time O(m), where m is the sum of the number of module sides plus terminal points in the layout.

- The simplicity of performing geometric manipulations. Module transformations involving displacement, orientation changes, and aspect ratio changes can be efficiently carried out in a time proportional to the number of sides and terminals of the modules involved in the operation.

- The suitability of the line command set for implementing line sweep-based overlap area computation algorithms. The specific implementation in this research represents an improvement over previous approaches to this problem by handling general rectilinear shapes as single objects, without having to break them down into constituent rectangles.
This overlap extraction algorithm was also used in the development of a channel pre-allocation mechanism in which the placement process produces sufficient inter-module space to route all terminals associated with neighboring blocks.

- The support it provides to other algorithms. The line-based representation allowed for the effective use of a line sweep mechanism as the basis of the algorithm that generates input graphs for the Steiner approximation heuristics.

A line command set is an instance-indexed list of line units representing module boundaries and terminals. Indexing of lines is made through an AVL tree [112]. Six basic command types are used. The first four, left, right, top, and bottom, denoted LC, RC, TC, and BC respectively, represent module sides. The last two commands, vertical and horizontal, denoted VC and HC respectively, represent terminal-related lines. The designation of VC or HC to a given command is made depending on the unobstructed direction of a line escaping from the associated terminal. Each line in a command is specified through a triplet (i, j, k). In horizontal lines, i and j represent the starting and ending abscissas of the line, while k represents its ordinate. For vertical lines, i represents the abscissa while j and k represent the bottom and top ordinates. A network list is an instance-name-indexed list of terminal commands associated with a common signal name and sharing a single switching activity value. Networks contain localized terminal commands which have received their (i, j, k) triplet after the modules have been assigned a specific location and orientation. Additionally, in the graph-based wirelength estimators, each entry in the network list contains the set of escape lines associated with the group of terminals joined by the net.
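A minimal sketch of the line command representation described above. The class layout and helper names are assumptions, not the SAPPO tool's actual data structures (in particular, no AVL-tree index is shown; a plain list stands in for it):

```python
from dataclasses import dataclass

@dataclass
class LineCmd:
    kind: str  # 'LC','RC','TC','BC' (module sides) or 'VC','HC' (escapes)
    i: int
    j: int
    k: int

def module_lines(x, y, w, h):
    """Generate the four side commands of a rectangular module.
    Horizontal triplets are (x1, x2, y); vertical triplets are (x, y1, y2)."""
    return [
        LineCmd('LC', x, y, y + h),      # left side (vertical)
        LineCmd('RC', x + w, y, y + h),  # right side (vertical)
        LineCmd('BC', x, x + w, y),      # bottom side (horizontal)
        LineCmd('TC', x, x + w, y + h),  # top side (horizontal)
    ]

def translate(cmds, dx, dy):
    """Module displacement in time proportional to the number of lines,
    illustrating the geometric-manipulation property claimed above."""
    out = []
    for c in cmds:
        if c.kind in ('LC', 'RC', 'VC'):  # vertical: (i, j, k) = (x, y1, y2)
            out.append(LineCmd(c.kind, c.i + dx, c.j + dy, c.k + dy))
        else:                             # horizontal: (i, j, k) = (x1, x2, y)
            out.append(LineCmd(c.kind, c.i + dx, c.j + dx, c.k + dy))
    return out
```

Orientation changes and aspect-ratio changes would follow the same per-line pattern, touching each triplet exactly once.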
The main advantage of using a line-based representation of modules resides in the reduced time complexity of the algorithms used in the construction of the layout graphs required by graph-based wirelength estimation mechanisms and in the computation of module overlaps in intermediate placement solutions. The process of constructing and manipulating a layout graph was discussed in Chapter 4. Section 5.3.1 below describes the algorithms developed for computing module overlaps and performing the routing channel pre-allocation.

5.3 Annealing Algorithm

Simulated annealing, as previously established, was selected as the optimization mechanism used by the SAPPO tool. The discussion presented in Chapter 2 identified three central tasks in the implementation of a simulated annealing-based placement strategy: the design of the cost function, the perturbation function, and the annealing schedule. This section discusses the implementation of these tasks in the SAPPO tool.

5.3.1 Cost Function

The cost function of the SAPPO tool was designed to reflect in a balanced way the different factors that measure the quality of a placement solution. In the current implementation, these factors are, in order of importance, the switched length, the amount of module overlap, the number of blocked terminals, and a penalty for insufficient inter-module space. The switched length, according to the discussion presented in Chapter 3, reflects the power cost of the current solution. The remaining factors represent constraints introduced into the objective function to guarantee the routability of all nets in the final solution. Other constraints, such as the maximum length of one or several delay-critical nets, the length of the maximum signal path in the circuit, or limitations for power-critical nets, can be easily accommodated in the current implementation through the inclusion of appropriate penalties in the cost function.
The general expression for the total cost of a placement solution S, denoted C(S), is obtained from the sum of two components as

C(S) = C1(S) + C2(S).   (5.1)

Components C1(S) and C2(S) represent the costs related to the switched length and the overlap of the current solution S. The value of C1(S), obtained through equation (5.2), evaluates the switched length and the penalty assigned to nets detected as unroutable in the current solution:

C1(S) = Σ_{j∈N1} a_j · L_j + Σ_{k∈N2} B(n_k).   (5.2)

The first term in the sum of equation (5.2) represents the switched length of the set of routable nets N1 in S. Factor a_j is the switching activity of net j, and L_j its estimated wirelength. The second term assigns a penalty length to each unroutable net in the current solution. Denoting the set of unroutable nets in S by N2, the penalty of a net n_k ∈ N2, represented B(n_k), is obtained as

B(n_k) = C · SPM(S) · |T_{nk}|,   (5.3)

where C is a scaling constant, SPM(S) represents the semi-perimeter of the bounding rectangle of the current solution, and |T_{nk}| is the terminal count of the net being evaluated. The purpose of the scaling constant C is to adjust the weight of the penalty so that it does not overload the cost function. The semi-perimeter of the core area was included to ensure that the penalty automatically scales with the dimensions of the layout, while the terminal count ensures a linear growth of the penalty with the net size.

The second component of the global cost, C2(S), introduces the penalties associated with the overlap of modules and inter-module spacing. The expression used in the evaluation of this cost component is

C2(S) = ψ · √(TO(S) + AO(S)) · C1(S).   (5.4)

In the evaluation of C2(S), ψ represents a scaling constant, TO(S) is a function returning the overlap penalty among modules, and AO(S) returns the penalty associated with violations of the minimum inter-module space.
As in the case of C, the purpose of ψ is to adjust the weight of the penalty in such a way that it does not overload the cost function.

Function TO(S) measures the total overlap resulting from the assignment S of a set of locations L to the set of modules M in the circuit being placed. Given a core boundary constraint and an assignment S, the measure returned by TO(S) includes the overlap area among modules as well as the overlap of modules with the exterior of the core boundaries.

Function AO(S) measures the penalty associated with violations of the minimum inter-module space in the form of an overlap area. In doing so, the boundaries of each internal module are augmented outward by an amount of space equivalent to the number of terminals on each side of the module times the sum of the wire width and the inter-wire space of the routing layer. The boundaries of the layout are modeled as a square hole in a module which extends from minus to plus infinity in each direction. With this model, the augmentation of the core boundaries occurs inward, opposite to that of general modules. Figure 5.1 shows a sample layout with the augmented dimensions of its internal modules and core boundaries.

This inter-module space model assumes that each wire connecting to a module has exclusive use of a routing track spanning the entire side of the module. Also, one additional routing track is allocated on the right and bottom sides of each module, effectively creating a minimum inter-module space for general routing of the layout. Although this model represents the worst case of a pair-wise inter-module space allocation mechanism, for simplicity reasons it does not consider routing congestion issues which arise around the center of the layout. However, it is expected that when the placed layout is taken to the routing stage, the over-allocation produced by the augmentation mechanism will compensate for the extra requirements of congested areas.
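The augmentation rule can be sketched as follows. This is an illustrative reading of the model, not the SAPPO code; the function names, the side-naming convention, and the sample values are assumptions, while the per-side margin rule (one track per terminal, plus one extra track on the right and bottom sides) follows the text above.

```python
# Illustrative sketch (hypothetical names, not SAPPO code) of the boundary
# augmentation used by AO(S): each side grows by one routing track per
# terminal on that side, where a track is one wire width plus one inter-wire
# space, and the right and bottom sides get one extra track for general routing.
def side_margin(n_terminals, wire_width, wire_space, extra_track=False):
    track = wire_width + wire_space
    return n_terminals * track + (track if extra_track else 0)

def augmented_bounds(x0, y0, x1, y1, terms, wire_width=1, wire_space=1):
    """terms maps a side name to the terminal count on that side."""
    return (
        x0 - side_margin(terms["left"], wire_width, wire_space),
        y0 - side_margin(terms["bottom"], wire_width, wire_space, extra_track=True),
        x1 + side_margin(terms["right"], wire_width, wire_space, extra_track=True),
        y1 + side_margin(terms["top"], wire_width, wire_space),
    )

bounds = augmented_bounds(10, 10, 20, 20,
                          {"left": 2, "right": 1, "top": 0, "bottom": 3})
```

With unit wire width and spacing, a side with two terminals grows by four units, and the bottom side with three terminals grows by eight (three tracks plus the extra general-routing track).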
Once the module boundaries are augmented as indicated above, violations of inter-module spaces can be measured as the overlap of the augmented areas. This condition is illustrated in Figure 5.1 by the dark area in the augmented area between modules m1 and m2, and between m3 and the core.

Figure 5.1. Sample layout with augmented dimensions.

It should be noted that since AO(S) is a measure of augmented overlap, it also contains the quantity measured by TO(S). However, experimentation has indicated that the redundant inclusion of TO(S) made it possible for the cost function to differentiate between overlaps among only the augmented portions of a set of modules and overlaps among the true module areas.

The similarity in the evaluation of functions TO(S) and AO(S) allowed their implementation to be based on a common line search algorithm. Details of the implementation of these functions are discussed in Section 5.3.1.

In addition to the balancing constant and the overlap functions, the value of C2(S) depends on the switched length cost represented by C1(S). This dependence was introduced to increase the relative weight of the overlap penalties with respect to the switched length costs when both switched length and overlap costs are small. The square root was introduced to drop the quadratic growth of area measurements to the linear scale of the layout dimensions. This was necessary to ensure that in layouts with a fine grid resolution, where linear measures grow large, the quadratic weight of the overlap does not overcome the switched length costs, which grow linearly with the layout dimensions.

The relative weights of all factors making up the cost function have to be balanced in such a way that at the end of the annealing process the switched length component has been minimized while the cost of all other penalties is reduced to zero.
This balance should, however, allow for the coexistence of penalties during most of the annealing process, making them converge to zero only close to the end of the simulation. Experiments have shown that an early disappearance of penalties, especially those due to module overlaps, tends to produce poorer results than those obtained when penalties are allowed to coexist with the main optimization objective for most of the annealing process. Figures 5.2 and 5.3 illustrate typical cost behaviors for these situations.

The cost function in Figure 5.2 was set up to over-penalize the overlap. As a consequence, at an early point during the annealing process, intermediate solutions containing overlaps became unacceptably expensive. This forced the algorithm to enter a phase where only overlap-free configurations were accepted. Then, once the overlap disappeared, the reduction rate of the central objective slowed down considerably. In Figure 5.3, the relative weights of the cost components had a balance which allowed the penalties to coexist with the central objective for a longer period of time. In this case, the initial slope of the reduction rate was maintained longer and the overall final cost went from 84170 to 69098, an improvement of 18% over the previous case.

The current implementation of the cost function uses experimentally determined values of C and ψ. The value assigned to C was chosen so that at high temperatures the penalty B(·) was not so large as to reject all moves containing blocked terminals; as the global cost and temperature decrease, the relative cost of blocked nets grows, making it less likely to accept solutions with unroutable terminals. The selected value of ψ corresponds to the cases where the overlap penalty was held the longest without carrying over into the final solution.

Figure 5.3. Cost behavior with late penalty removal (total cost, switched length, and overlap versus iterations).
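The structure of equations (5.1) through (5.4) can be sketched as follows. This is only a minimal illustration of how the components combine; the function names, parameter defaults, and sample values are assumptions, not SAPPO's actual constants or data structures.

```python
# A minimal sketch of the cost function structure of equations (5.1)-(5.4);
# all names and sample values are illustrative, not SAPPO's.
import math

def c1(routable, unroutable, C, spm):
    """routable: list of (activity, wirelength) pairs;
    unroutable: list of terminal counts of blocked nets."""
    switched = sum(a * L for a, L in routable)           # sum of a_j * L_j
    penalty = sum(C * spm * t for t in unroutable)       # B(n_k) = C*SPM(S)*|T_nk|
    return switched + penalty

def c2(psi, to_overlap, ao_overlap, c1_value):
    # The square root drops the quadratic growth of area terms to a linear
    # scale; multiplying by C1 keeps overlap penalties significant when both
    # terms are small.
    return psi * math.sqrt(to_overlap + ao_overlap) * c1_value

def total_cost(routable, unroutable, to_overlap, ao_overlap,
               C=0.5, psi=0.1, spm=100):
    c1v = c1(routable, unroutable, C, spm)
    return c1v + c2(psi, to_overlap, ao_overlap, c1v)

cost = total_cost([(0.2, 50), (0.4, 30)], [3], to_overlap=16, ao_overlap=20)
```

With these sample values the switched length is 22, the blocked-net penalty 150, and the overlap term 0.1 · √36 · 172, so the penalties clearly dominate until overlaps and blockages are resolved.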
Overlap Computation

In order for a placement solution to be feasible, it should not contain module overlaps. However, it has been reported that in iterative improvement mechanisms, allowing overlaps in intermediate solutions leads to better final placements [100], [113]. Overlap computation for such intermediate solutions has been performed in different ways, depending on the types of shapes supported in the layout. A few representative approaches are discussed below.

Onodera, et al. use an analytical overlap computation method based on a set of equations describing the topological relationships between pairs of modules [114]. This method requires O(p²) computations for a layout consisting of p modules. Their computation model was designed to handle only rectangular blocks.

Sechen, in the TimberWolf program, uses a bin-based approach which supports general rectilinear shapes [113]. Support for general rectilinear blocks is provided through a mechanism that breaks non-rectangular modules into an equivalent covering set of rectangular tiles. The overlap model divides the layout area into a grid of rectangular bins indexed by their row-column position. When a module is assigned to a specific location, a record is created in the bins intercepted by the rectangular tiles of the module. If tiles from other modules intercept any of the bins already intercepted by another block, an overlap computation is performed with the cells occupying the shared bins. This model requires O(p²) time in the worst case, which occurs when a module intercepts all bins. In a later revision of the model, Swartz found that it might fail to detect overlaps for small modules [100]. A modification was introduced which refined the computation of the bin size to reduce the possibility of an occurrence of this error.
Several other macro block placement techniques were also analyzed, but none of them considered either non-rectangular shapes or intermediate overlapped solutions [58], [115-117].

The SAPPO tool, as was inferred earlier in the discussion of the cost function components, was designed to accept intermediate configurations with module overlaps through the assignment of a properly balanced penalty cost. The overlap computation mechanism selected was, however, different from any of those discussed above. A method based on a line search contour detection algorithm was implemented instead.

Efficient line search-based algorithms have been reported in the literature for several planar applications in computational geometry. These include counting and reporting intersecting rectangles, the batched range search problem, and finding the contour of a set of intersecting rectangles, among others [118-120]. The direct application of these algorithms to rectilinear objects was found to be limited to rectangular shapes. Non-rectangular shapes could only be handled by first decomposing them into an equivalent set of non-overlapping tiles. This decomposition approach, although practical, increases the number of objects to be handled by algorithms that manipulate general rectilinear shapes. Moreover, in placement applications, it also requires the introduction of additional constraints into the objective function to keep the constituent rectangles from being separated by the perturbation mechanisms.

The overlap computation mechanism implemented in the SAPPO program is based on a new contour detection algorithm for rectilinear objects which does not require breaking up non-rectangular shapes into constituent tiles.
This new algorithm builds on the work reported by Lipski and Preparata for rectangular shapes [119], and extends it by introducing algorithmic modifications which allow for the direct handling of general rectilinear shapes without increasing either the average time or the space complexity of the former algorithms. The descriptions of the contour and overlap computation algorithms follow.

Let M = {m1, m2, ..., mp} be a set of general rectilinear modules which has received a placement S in a bounded plane, assigning to each m_i ∈ M a unique location l_i. Then the union of the domains of the elements in M, denoted F = D1 ∪ D2 ∪ ... ∪ Dp, consists of one or more disjoint connected pseudo-domains, whose contour 𝒞 = {C1, C2, ..., Cs} is a set of disjoint cycles, each of which is representable by a set of alternating vertical and horizontal module commands. Each cycle C_i is assigned a direction such that it rotates clockwise on the boundaries of a hole, and counter-clockwise otherwise. The contour problem is that of finding the set 𝒞.

The algorithm used in the computation of 𝒞 consists of two separate phases in which the vertical and horizontal edges of the contour cycles are respectively obtained. The first phase is the critical stage in the handling of general rectilinear shapes. The second phase follows the same approach reported in [119] for extracting the set of horizontal lines. A description and analysis of each phase follows.

Phase I: Detection of Verticals. In the first phase, the detection of the vertical segments of the contour cycles is made through a scan of the vertical lines of each shape. This is achieved by symbolically moving a vertical line from left to right over the layout space. As portions of the scan line are occluded or uncovered by modules, the vertical edges of the contour cycles are detected and reported. Algorithm 5.1 shows the sequence of operations performed in the first phase of the contour detection algorithm.
Algorithm 5.1 (Contour, Phase I)
Input: H and V = horizontal and vertical command sets of M. |V| = |H| = b/2.
Output: Vertical line command set of 𝒞
1. W ← H sorted in ascending order of k (ordinates), without repetitions;
2. ST ← segment tree of M made using W;
3. V' ← V sorted in ascending order of i, type, j, and k;
4. ∀v ∈ V' do:
4.1. If (v.type = LC) then Stack ← Insert(v.bottom, v.top, ST, af ← FALSE);
4.2. Else Stack ← Delete(v.bottom, v.top, ST, af ← FALSE);
5. Return Stack;

Performing the scan requires sorting the set of vertical edges of all modules in increasing order of their abscissas. For a layout with b module sides, this step can be completed in O(b·log b) time and with space complexity O(b). The detection of which portions of the module sides contribute to the set of vertical edges of the contour cycles is made with the aid of a segment tree.

A segment tree is a data structure based on a perfectly balanced binary tree, proposed by Bentley and Wood, to represent a line segment made up of a number of smaller sub-segments whose end points are known in advance [118]. Each vertex of a segment tree contains four fields: bottom, top, cover, and status, in addition to the pointers to its left and right children. Fields bottom and top represent the end points of the interval represented by the node, cover is a counter of the number of segments that cover the node, and status is a flag indicating whether the coverage of the current node is full, partial, or empty. A full status implies that at least one segment covers the current vertex from bottom to top. A partial status indicates that at least one child of the current vertex is covered, while empty indicates that neither the current vertex nor any of its children is covered.

The construction of a segment tree, invoked in Step 2, is completed as follows. Given an ordered set of intervals W making up a line segment, a new vertex (root) is created for the entire set W.
If the number of elements in W is larger than one (|W| > 1), the input set is partitioned into two halves W1 and W2 such that |W2| − |W1| ≤ 1. Subsets W1 and W2 are assigned to the left and right child of root, respectively, and the same procedure is then repeated on each, until subsets with only one interval are created. As can be seen, the whole process can be completed in time O(|W|) with the same space complexity. Given that |W| = b/2, the time and space complexity of Step 2 is O(b).

In the traversal of the layout, finding line commands corresponding to a left or right edge might increase or decrease the coverage of one or several segments. These events, symbolized by the procedures Insert and Delete respectively, change the status and coverage of nodes in the segment tree. If the coverage of a certain node increases from 0 to 1, or decreases from 1 to 0, and none of its ancestors is fully covered, the interval represented by the node will contribute to a contour cycle. The parameter af used in the invocation of Insert and Delete keeps these procedures from reporting false contour contributions deep inside the segment tree. Parameter af indicates whether or not an ancestor of the current root in the tree was detected to have full status. The implementation of the Insert procedure is outlined in Algorithm 5.2.

Algorithm 5.2 (Insert)
Input: (b, e) = end points of line command, root = current vertex, af = ancestry status
Output: Contribution of line command to the contour
1. If (root.status = FULL) then af ← TRUE;
2. If (b ≤ root.bottom and e ≥ root.top)
2.1. If (af = FALSE) then Stack ← Report(root);
2.2. root.cover ← root.cover + 1;
3. Else
3.1. If (b < root.left.top) then Stack ← Insert(b, e, root.left, af);
3.2. If (e > root.right.bottom) then Stack ← Insert(b, e, root.right, af);
4. Update(root);
5. Return(Stack);

In the above implementation, Stack represents an external stack used for reporting contributions.
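The segment tree construction and the cover-and-report behavior of Insert can be sketched as follows. This is a simplified reading of the mechanism, not the SAPPO code: status flags are replaced by direct cover-count checks, and the Join step is approximated by merging adjacent reported intervals after the fact.

```python
# Simplified sketch (not SAPPO code) of the Phase I segment tree: it tracks
# cover counts and reports the intervals newly exposed by an insertion when
# no ancestor is already fully covered.
class SegNode:
    def __init__(self, bottom, top):
        self.bottom, self.top = bottom, top
        self.cover = 0                  # number of segments covering this node
        self.left = self.right = None

def build(points):
    """Build a balanced segment tree over consecutive elementary intervals."""
    if len(points) == 2:
        return SegNode(points[0], points[1])
    mid = len(points) // 2
    node = SegNode(points[0], points[-1])
    node.left = build(points[: mid + 1])
    node.right = build(points[mid:])
    return node

def report(node, out):
    """Collect the currently uncovered leaf intervals under node."""
    if node.cover > 0:
        return
    if node.left is None:
        out.append((node.bottom, node.top))
    else:
        report(node.left, out)
        report(node.right, out)

def insert(node, b, e, af, out):
    """Cover [b, e]; append the newly exposed contour intervals to out."""
    if node.cover > 0:                  # this node acts as a FULL ancestor
        af = True
    if b <= node.bottom and e >= node.top:
        if not af:
            report(node, out)
        node.cover += 1
    else:
        if b < node.left.top:
            insert(node.left, b, e, af, out)
        if e > node.right.bottom:
            insert(node.right, b, e, af, out)

def merged(intervals):
    """Merge adjacent intervals, approximating the Join operation."""
    res = []
    for b, e in intervals:
        if res and res[-1][1] == b:
            res[-1] = (res[-1][0], e)
        else:
            res.append((b, e))
    return res

tree = build([0, 1, 2, 3])
out1 = []
insert(tree, 0, 2, False, out1)   # first left edge: exposes [0, 2]
out2 = []
insert(tree, 1, 3, False, out2)   # overlapping edge: only [2, 3] is new
```

The second insertion illustrates the role of af: the portion [1, 2] is already under a covered node, so only the genuinely new piece [2, 3] is reported as a contour contribution.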
In the operation of Algorithm 5.1, all contour commands pushed (←) onto Stack are retrieved whenever there is a change in the input command type or in the abscissa value of the scan process.

The general implementation of the Insert procedure is very similar to that for rectangular shapes, except in two details. The first is the conditioning on the ancestry of the current root in Step 2.1. The second is the form in which children nodes are tested for recursive insertion in Steps 3.1 and 3.2. This indexing avoids the need for the content-associative search required in the former implementation, where normalized ordinates are used to index the segments.

The reporting procedure invoked in Step 2.1 of the Insert routine places on the stack a set of line segments corresponding to the contribution of the current node to the contour. Algorithm 5.3 outlines the implementation of Report.

Algorithm 5.3 (Report)
Input: root = current vertex
Output: Contribution of current vertex to the contour
1. If (root.status = EMPTY)
1.1. line ← (root.bottom, root.top);
1.2. line ← Join(line, Stack);
1.3. Stack ← line;
2. Else if (root.status = PARTIAL)
2.1. Report(root.left);
2.2. Report(root.right);
3. Return(Stack);

When reporting the contribution of general rectilinear shapes, a single segment can be reported as several intervals. The objective of the Join function called in Step 1.2 is to merge segments already inserted into the stack which might be collinear with line. The joining operation retrieves from Stack any segment identified to be lower or upper collinear with line, joins them together, and returns to Stack the equivalent segment. This procedure fixes a problem identified in the former version of the contour algorithm, which fails to properly report the contribution of a set of stacked shapes aligned on either their leftmost or rightmost sides.
The new implementation of Join has a worst case time and space complexity of O(b), reached when all shapes are stacked and aligned on one of their vertical sides. Based on the fact that Join will not merge more than b/2 − 1 intervals in one entire recursive invocation of Insert, and that the rest of the operations in the insertion process are completed, in the worst case, in time O(log b), the resulting time and space complexity of the entire insertion process is O(b + log b), which is equivalent to O(b).

The Update function, called in Step 4 of the Insert procedure, is outlined in Algorithm 5.4. Its objective is to update the status of vertices affected by a segment insertion, based on the resulting coverage count of each vertex and its children.

Algorithm 5.4 (Update)
Input: root = current vertex
Output: None. The input vertex status field is updated.
1. If (root.cover) then root.status ← FULL;
2. Else
2.1. If (root.left)
2.1.1. If (root.left.status ≠ EMPTY or root.right.status ≠ EMPTY)
2.1.1.1. root.status ← PARTIAL;
2.1.2. Else root.status ← EMPTY;
2.2. Else root.status ← EMPTY;
3. Return;

The extension of the contour extraction algorithm to support general rectilinear shapes required a completely redesigned Delete procedure. When only rectangular shapes are supported, every call to Delete uncovers from the segment tree an interval of length equal to some other interval previously covered by Insert. For non-rectangular shapes, the segments removed by Delete are not necessarily of the same length as any previously covered with Insert. Two general cases may occur, as illustrated in Figure 5.4. In the first case, a left-to-right scan of module m1 will insert into the tree a segment sequence l1, ..., ln for some n > 1 but will delete a single segment r, which covers the interval ∪_{z=1}^{n} l_z.

Figure 5.4. General cases in the scan of non-rectangular modules.

In this case the former algorithm will erroneously decrement by one the cover of the node representing segment r. A converse situation will happen in a similar scan of module m2, where the covers of the segments in the sequence r1, ..., rm would be erroneously reduced by one. To properly handle these cases, the new Delete procedure introduces the operations Split and Uncover. These new functions break down or uncover, as required, segments in the tree to correctly report the contribution of each module side to the contour. After either of these operations is completed, the status in the segment tree is updated to reflect the changes. Algorithm 5.5 outlines the implementation of the new Delete procedure.

Algorithm 5.5 (Delete)
Input: (b, e) = end points of line command, root = current vertex, af = ancestry status
Output: Contribution of line command to the contour
1. If (b ≤ root.bottom and e ≥ root.top)
1.1. root.cover ← root.cover − 1;
1.2. If (root.cover < 0)
1.2.1. root.cover ← root.cover + 1;
1.2.2. Uncover(b, e, root);
1.3. Update(root);
1.4. If (af = FALSE) then Stack ← Report(root);
2. Else
2.1. If (root.cover) then Split(root);
2.2. If (root.cover) then af ← TRUE;
2.3. If (b < root.left.top) then Stack ← Delete(b, e, root.left, af);
2.4. If (e > root.right.bottom) then Stack ← Delete(b, e, root.right, af);
2.5. Update(root);
3. Return(Stack);

The implemented Delete procedure will correctly uncover and report rectangular and non-rectangular shapes. The case when a previously inserted large left side is being uncovered by a set of smaller right sides, as in the case of m2 above, is properly handled in Step 2.1 with the Split function. This function partitions a large covered root by reducing its cover by one, increasing by the same amount the covers of its two immediate children, and updating accordingly the status of the three vertices involved.
This operation keeps the tree from being corrupted with negative cover values.

The case posed by m1 in Figure 5.4 could still cause a node cover to become negative. In this case the Uncover procedure, invoked in Step 1.2.2, searches the segment tree rooted at root to uncover a set of previously inserted sub-segments covering the interval spanned by the large right segment. Algorithm 5.6 outlines the implementation of the Uncover function.

Algorithm 5.6 (Uncover)
Input: (b, e) = end points of line command, root = current vertex
Output: None. Uncovers a segment set spanning [b, e].
1. If (b ≤ root.bottom and e ≥ root.top and root.status = FULL)
1.1. root.cover ← root.cover − 1;
2. Else
2.1. If (b < root.left.top) then Uncover(b, e, root.left);
2.2. If (e > root.right.bottom) then Uncover(b, e, root.right);
3. Update(root);
4. Return;

The worst case time complexity of the Uncover procedure occurs when it is invoked from the root of the segment tree and the set of covering sub-segments lies in the leaves of the tree. In this case the algorithm takes O(b·log(b)) time to complete. When analyzing how these requirements manifest in the calling procedure, it can be observed that although Delete has a recursive execution, an invocation of Uncover with the worst case time complexity can occur at most once in the entire execution of Delete. In cases where multiple invocations of Uncover could occur, the worst case would be when uncovering a segment whose length is covered by b/2 − 1 leaves of the tree (recall that there will be at most b/2 sub-segments in the tree). In this case, the Uncover algorithm would be called log(b/2) − 1 times, with a decreasing sub-tree depth at each new call. The resulting complexity of the calls to Uncover would in this case be given by

Σ_{i=1}^{log(b/2)−1} [ (b/2^i) · log(b/2^i) ] < b · log(b).   (5.5)

This complexity, as indicated in equation (5.5), is bounded by the worst case single-call complexity obtained earlier.
Considering the total time requirements of all functions called during the execution of the new Delete procedure, the resulting time complexity is

O(2b·log(b) + log(b)) = O(b·log(b)).   (5.6)

It can be inferred from the above analyses that the worst case time complexity of the first phase of the contour detection algorithm is dominated by the requirements of the scanning loop (Step 4 in Algorithm 5.1), which would run in O(b²·log(b)) time. The worst case complexity, however, is very unlikely to occur in practical circuits, since it would require that all modules be non-rectangular and span one entire dimension of the layout. In reality, the dimensions of internal blocks are always smaller than those of the core, and non-rectangular modules are typically found in mixed layouts that also contain mostly rectangular shapes. For these reasons, the average complexity is more representative of the time requirements of this algorithm. In the average case the Delete procedure executes in time O(log(b)), since no call to Uncover takes time O(b·log(b)). As a consequence, the average complexity of the first phase of the contour algorithm is O(b·log(b)).

Something to also be considered is that the additional operations performed by the new algorithm on non-rectangular shapes are less expensive than decomposing them into multiple rectangles and later recombining them. This is especially true in iterative placement applications, where the decomposition carries over to the solution perturbation mechanisms, where the extra objects exponentially increase the size of the solution space to be searched.

Phase II: Extraction of Horizontals. The process of extracting the horizontal sides produces the actual contour cycles used in the overlap area computation. Given the set of verticals detected in the first phase, the horizontal extraction process proceeds as outlined in Algorithm 5.7.
Algorithm 5.7 (Contour, Phase II)
Input: Q' = set of vertical contour commands. |Q'| = d ≤ b/2.
Output: Set of contour cycles
1. Q ← Q';
2. ∀v' ∈ Q' with v' = (i, j, k, type) do
2.1. v ← (i, k, j);
2.2. Q ← Q ∪ v;
3. Q ← Q sorted in increasing order of k, i;
3.1. (Resulting command sequence: Q = {v1, v2, ..., v_{2d}})
4. C ← ∅; 𝒞 ← ∅; /* Single cycle and cycle list */
5. ∀t, t = 1, 2, ..., d do
5.1. Assume: v_{2t−1} = (i, j, k, type) and v_{2t} = (i', j', k', type');
5.2. h_t ← (j, i, i', null); /* Type to be assigned */
5.3. If ((v_{2t−1}.type = LC and j < k) or (v_{2t−1}.type = RC and k < j))
5.3.1. h_t.type ← BC;
5.4. Else
5.4.1. h_t.type ← TC;
5.5. C ← C ∪ (v_{2t−1}, h_t, v_{2t});
5.6. If (h_t closes cycle C)
5.6.1. 𝒞 ← 𝒞 ∪ C;
5.6.2. C ← ∅;
6. Return(𝒞);

The loop in Step 2 augments the command set by duplicating the vertical commands with their ordinates j and k exchanged. Sorting the resulting set by k and i produces an ordering in which the horizontal sides are delimited by the abscissa values of commands v_{2t−1} and v_{2t}, for t = 1, 2, ..., d. The type of a horizontal command is given by the rule in Step 5.3, and the rotating direction is defined by the orientation of the leftmost of the lowest horizontals in the cycle. A cycle is closed when the newly created horizontal joins two vertical sides already in the cycle.

The space complexity in this phase of the algorithm is determined by Step 2 to be O(b), while the time complexity, determined by Step 3, remains O(b·log(b)). The overall average time complexity of the complete contour extraction is O(b·log(b)) and its space complexity O(b).

Area Computation. Once the contour cycles are known, the overlap area of solution S, denoted F(S), can be obtained as

F(S) = Σ_{m_i∈M} |D_i| − Σ_{C_i∈𝒞} |C_i|,   (5.7)

where |D_i| represents the domain area of module m_i and |C_i| the area enclosed by cycle C_i.
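The evaluation of equation (5.7) with orientation-signed cycle areas can be sketched with the shoelace formula. This is only an illustration, not the SAPPO code; the example contour below is a hypothetical placement of two 2×2 squares overlapping in a 1×1 region.

```python
# Illustrative sketch (not SAPPO code) of equation (5.7): the overlap area is
# the total module area minus the area enclosed by the contour cycles, where
# each cycle area is signed by its orientation (shoelace formula:
# counter-clockwise cycles positive, clockwise hole cycles negative).
def signed_area(cycle):
    """Shoelace formula over a closed list of (x, y) vertices."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(cycle, cycle[1:] + cycle[:1]):
        s += x1 * y2 - x2 * y1
    return s / 2.0

def overlap_area(module_areas, cycles):
    # F(S) = sum of |D_i| minus the sum of signed cycle areas |C_i|.
    return sum(module_areas) - sum(signed_area(c) for c in cycles)

# Two 2x2 squares overlapping in a 1x1 region: their union contour is a
# single counter-clockwise cycle enclosing area 7, so F(S) = 4 + 4 - 7 = 1.
squares = [4.0, 4.0]
contour = [(0, 0), (2, 0), (2, 1), (3, 1), (3, 3), (1, 3), (1, 2), (0, 2)]
```

Because hole cycles run clockwise, their shoelace areas come out negative and are thereby subtracted from the enclosing cycle's area, which is exactly the sign convention the text describes next.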
When computing the individual terms of each sum, the areas of clockwise cycles are assigned negative values, while those of counter-clockwise loops are taken as positive.

5.3.2 Perturbation Function

The block perturbation mechanism used in the SAPPO tool produces new placement solutions by introducing small incremental changes to a given configuration. The new solution is then evaluated with the cost function described in the previous section and subjected to the Metropolis criterion for acceptance. The changes induced on a given solution by the implemented perturbation mechanism can be directed to affect one or two modules at a time, depending on the selected type of perturbation. Changes affecting a single module m_i are:

• MD1(m_i): Displacement of module m_i to a randomly selected new location, preserving its current orientation.

• MD2(m_i): Displacement of m_i to a randomly selected new location with an orientation change affecting its aspect ratio.

• MD3(m_i): A randomly selected orientation change of m_i in its current location.

Perturbations affecting a pair of modules m_i and m_j at a time are:

• MX1(m_i, m_j): Exchange the locations of modules m_i and m_j, preserving their original orientations.

• MX2(m_i, m_j): Exchange the locations of m_i and m_j, with an aspect ratio change applied to both modules.

The ratio of double to single module perturbations produced by the program is controlled by a predetermined parameter. This parameter was experimentally determined by analyzing the variations it produced in the quality of the final solution of a test case with ten macro blocks. Several values between 0.5 and 0.04 were tested while leaving all other parameters unchanged. The best resulting ratio among those tested was 0.10. Table 5.1 summarizes the tests performed.

Table 5.1. Effects of the displacement to exchange ratio.

Ratio21    | 1/2    | 1/4    | 1/10  | 1/16   | 1/25
Final Cost | 139817 | 116237 | 85413 | 107914 | 163309

The implementation of the perturbation function begins by determining the number of modules to be perturbed and selecting them.
If a single module is selected, an attempt is first made with an MD1-type perturbation. If this operation is rejected, an MD2 move is attempted by changing the orientation of the selected module. If this move is also rejected, an MD3 perturbation is attempted. If any of the three attempts is successful, the perturbation is integrated into the current solution; otherwise the solution is left unchanged. When two modules are selected, the initial attempt is an MX1-type move. If this fails, the aspect ratios of the selected blocks are changed to attempt an MX2 exchange.

Table 5.1. Effects of the displacement to exchange ratio.
Ratio21      1/2      1/4      1/10     1/16     1/25
Final Cost   139817   116237   85413    107914   163309

An outline of the implemented perturbation function is given in Algorithm 5.8. The function RndInt invoked throughout the procedure generates a random integer uniformly distributed in the range specified by its two parameters. The Range function invoked in Step 2.2 is a range limiter which keeps the process from generating large perturbations at low temperatures, when they are most likely to be rejected [95]. The limit imposed on the range of moves varies linearly with the logarithmic rate of the temperature changes. The expression used in the range limit computation (RL) is

    RL = ε · (log10(T_k) / log10(T_0)) · x_max,    (5.8)

where T_0 and T_k represent the initial and current temperature values, respectively, x_max the width of the core area, and ε an experimentally determined constant. The current implementation uses ε = 1.0, a value found to allow a uniform exploration of the solution space when T_k is relatively close to T_0. At low temperatures, a lower bound of min(9, 0.1 · x_max) layout units was assigned to RL_x. Equation (5.8) gives the formula used by RangeX; function RangeY performs a similar computation based on the height of the core, y_max.
Function Accept evaluates the cost of the changes suggested by the perturbation and returns an acceptance flag based on the Metropolis criterion. Figure 5.5 shows a typical space exploration achieved by Perturb. The plots in Figures 5.5a and 5.5b show how the perturbation function uniformly explores the width and height of the layout at the beginning of the iterations, when T_k is relatively high. As the process advances (lower temperature values), the range limiter reduces the scope of the space exploration to concentrate the search around the position established to be best for each module. The seven clusters identified in Figure 5.5c reveal the spatial exploration around the final locations assigned to the seven modules of the selected test case.

Algorithm 5.8 (Perturb)
Input: Sol[NB] = input solution with NB blocks
Output: Perturbed or previous solution
Variables: NBTP = number of blocks to perturb, bid = block ID, ornt = orientation
1. NBTP ← RndInt(1, 2) with prob(1) = 1 - Ratio21;
2. If (NBTP = 1)
   2.1. bid1 ← RndInt(1, NB);
   2.2. {x, y} ← {RndInt(RangeX(Sol[bid1])), RndInt(RangeY(Sol[bid1]))};
   2.3. ornt ← current orientation of Sol[bid1];
   2.4. If (Accept(Sol[·], bid1, (x, y, ornt)))
        2.4.1. Sol[bid1] ← (x, y, ornt); Return(Sol[·]);
   2.5. ornt ← RndInt(1, 4);
   2.6. If (Accept(Sol[·], bid1, (x, y, ornt)))
        2.6.1. Sol[bid1] ← (x, y, ornt); Return(Sol[·]);
   2.7. ornt ← RndInt(1, 7);
   2.8. (x, y) ← current coordinates of Sol[bid1];
   2.9. If (Accept(Sol[·], bid1, (x, y, ornt)))
        2.9.1. Sol[bid1] ← (x, y, ornt);
3. Else
   3.1. {bid1, bid2} ← {RndInt(1, NB), RndInt(1, NB)};
   3.2. If (Accept(Sol[·], bid1, (x2, y2, ornt1), bid2, (x1, y1, ornt2)))
        3.2.1. Make the exchange permanent in Sol[·];
        3.2.2. Return(Sol[·]);
   3.3. ornt ← RndInt(1, 4);
   3.4. If (Accept(Sol[·], bid1, (x2, y2, ornt), bid2, (x1, y1, ornt)))
        3.4.1.
Make the exchange permanent in Sol[·]; Return(Sol[·]);
4. Return(Sol[·]);

Figure 5.5. Space exploration performed by the perturbation function (c: layout area exploration).

5.3.3 Annealing Schedule

The SAPPO tool was implemented around an adaptive annealing schedule in which most parameters are computed from statistical properties of the problem to be solved. This approach dynamically fits the schedule to each individual problem. The design of the schedule combines ideas and concepts discussed in several published studies on annealing schedules [91], [95], [96].

Melting Temperature

The initial temperature T_0 of the annealing process is obtained using a widely accepted criterion proposed by White, which establishes that the system is melted if σ << T_0, where σ is the standard deviation of the cost distribution [95]. To obtain a value of T_0 satisfying this criterion, an initial exploration of the configuration space is performed assuming that the temperature is infinite, which leads to the acceptance of all proposed moves by the Metropolis criterion. It has been found that under these conditions the cost function of typical problems has a uniform distribution [96]. Thus, by selecting a temperature high enough to accept with probability P a configuration whose cost is k · σ worse than the current one, with k an arbitrary constant, the initial temperature can be obtained as

    T_0 = -(k / ln P) · σ.    (5.9)

The current implementation of the melting function in the SAPPO tool uses the values k = 3 and P = 0.85 in the evaluation of equation (5.9). The standard deviation of the cost distribution is computed after at least 2p moves have been made, where p is the number of blocks in the layout being placed.

Inner-loop Criterion

The number of iterations performed at each temperature value is dynamically determined using an equilibrium detection mechanism proposed by Huang et al. [96].
It establishes that, once equilibrium is achieved, the ratio of the number of newly accepted states whose costs lie within a certain range δ of the average cost μ to the total number of newly accepted states reaches a stable value χ. A normal distribution of the cost function is assumed. Thus, the ratio between the number of accepted states whose cost is in the range (μ - δ, μ + δ), referred to as the within count, and the total number of accepted states is given by erf(δ/σ), where erf() is the error function [121]. The value of δ should be a fraction of σ so that the cost of the final state in each iteration is close to μ.

Given the above ratio and a problem size measured by the number of blocks to be placed, p, the parameters needed to specify the equilibrium condition are a target within count β and a tolerance limit γ. The tolerance limit is the maximum number of states accepted outside the target range. Equilibrium is considered to be reached when the within count reaches β before the tolerance reaches γ; otherwise both counts are reset to zero and the process is restarted. Considering that at low temperatures the target within count might never be reached, an upper limit B is set on the number of newly generated states. To guarantee the validity of the statistics, the detection of equilibrium is not started until a minimum number G of states has been accepted. This additional condition requires an ultimate limit U on the generation of states at low temperatures, when G might not be reached. The specific values assigned to the above set of parameters in the implementation of the SAPPO tool are controlled through a set of parameters specified at run time. A discussion of how these values are assigned is included below in Section 5.3.3.

Temperature Decrement Rate

The temperature decrement rate ν was experimentally determined through a process in which the behavior of several test circuits was analyzed.
Given the constraint that consecutive temperature changes had to be sufficiently small to allow a quick reestablishment of quasi-equilibrium, an initial rate of ν = 0.96 was set and kept fixed for the entire process in all test cases. The experiment showed the following.

At high temperatures, quasi-equilibrium was easily attained and the average solution cost decreased, in almost all cases, by about 10% of its total reduction. It was deduced that this phase could tolerate a more aggressive temperature reduction rate without significantly altering its overall behavior. This behavior was observed to persist for about 1.6 decades of temperature decrease, starting at the melting point.

At medium-high temperatures the solution cost exhibited its most drastic reduction, along with an increasing number of attempted moves in the inner loop. This implied that during this period larger temperature changes could make it harder to reach quasi-equilibrium. This phase was observed to last about two decades in most circuits.

A region of medium-low temperatures was also identified, exhibiting a less drastic reduction in the solution cost and even longer inner loops. The cost behavior suggested that this zone could support a slightly faster temperature reduction scheme. This phase was observed to last for about one decade.

At low temperatures, virtually all temperature steps reached the ultimate limit U in the number of trials, each with very few accepted moves and a very small reduction in the average solution cost. The attainment of limit U in virtually every inner-loop iteration indicates that equilibrium was being forced due to the extremely low acceptance ratio. This behavior is very costly in terms of time and led to the conclusion that an aggressive temperature reduction scheme would positively impact the running time of the heuristic without severely degrading the quality of the solution.
These observations led to the implementation in the SAPPO tool of the temperature reduction strategy summarized in Table 5.2.

Table 5.2. Temperature reduction rule.
Temperature region   Reduction rate (ν)   Length in decades
High                 0.87                 1.6
Medium-high          0.96                 2.0
Medium-low           0.85                 1.0
Low                  0.80                 Remainder

Tests performed with the above strategy did not show a significant change in the overall behavior of the algorithm with respect to using the fixed reduction rate. Figure 5.6 shows the solution cost behavior before and after the application of the implemented strategy for a typical case.

Figure 5.6. Solution cost vs. temperature for fixed and variable values of ν.

Table 5.3 shows the average changes induced in the number of temperature iterations, attempted moves, final cost, and running time obtained with the new strategy on the test case. It can be observed that the new decrement strategy produced the same quality of solution while reducing the total annealing time by almost 50%.

Table 5.3. Effects of the temperature reduction rate.
            Iterations   Moves   Final Cost   Time (s)
Fixed       334          32471   87625        176.2
Variable    174          16583   87469        89.0
Reduction   47.9%        48.9%   0.17%        49.5%

Stop Criterion

The stop criterion implemented in the SAPPO tool terminates the annealing process when the average solution cost exhibits a variation of less than ±0.05% during a sequence of w temperature iterations. During the annealing process, the stop criterion is tested only after the range limiter has reached its minimum value. The value assigned to w depends on the convergence speed parameter specified at run time.

Schedule Parameterization

In the execution of the SAPPO program, different sets of values can be assigned to the parameters defining the inner loop and the stop criterion.
This allows a run-time control of the convergence speed of the annealing process. Seven different convergence rates, t_i, i = 1, 2, ..., 7, can be specified to the algorithm. Each increasing value of i selects a different set of predefined values which slow down the convergence rate in an attempt to increase the quality of the final solution. The set of values listed in Table 5.4 was experimentally tuned to set, with increasing i, tighter conditions on the acceptance region for the within count and on the number of times the average cost variation is expected to remain below the threshold value specified by the stop criterion. The values in Table 5.4 are used as indicated in equations (5.10) through (5.14) to set the parameters identified in the inner-loop and stop criteria.

Table 5.4. Parameters controlling the convergence rate.
Rate index   ζ       k_1   k_2   w
t_1          0.675   1     1/4   3
t_2          0.600   2     1/2   3
t_3          0.500   3     1     4
t_4          0.350   6     2     5
t_5          0.250   9     4     6
t_6          0.150   12    8     7
t_7          0.050   15    16    7

    δ = ζ · σ    (5.10)
    β = k_1 · p · erf(ζ)    (5.11)
    γ = k_1 · p · (1 - erf(ζ))    (5.12)
    B = k_2 · (β + γ) + 1    (5.13)
    U = 4 · B    (5.14)

Experimental Results

Several experiments were conducted to assess the stability of the implemented annealing schedule and to quantify the effects of the run-time parameters on the overall behavior of the algorithm. Such an assessment provides guidelines for the application of the tool and validates the produced results.

The first set of tests was designed to quantify the effects of the different convergence rates on the algorithm behavior and final cost. In the experiment, a single circuit containing seven modules was placed multiple times, once per convergence rate, and the resulting number of trials, temperature behavior, final cost, and number of iterations were recorded. Figure 5.7a shows the growth of the number of attempted moves (trials) as a function of the successive convergence rate factors.
It was found that the convergence factors exponentially increased the number of moves required for the algorithm to complete. The quality of the final solution, shown in Figure 5.7b as a function of the number of moves, decreased asymptotically, approaching a lower bound as the number of trials increased. This plot suggests that a convergence rate of t_4 or t_5 offers the best accuracy-speed ratio. The number of temperature steps as a function of the convergence rate factor, illustrated in Figure 5.7c, grew at an approximately logarithmic rate, a growth driven by the narrowing tolerance in the stopping criterion. The values of the initial and final temperatures per convergence rate are shown in Figure 5.7d.

Figure 5.7. Effects of the convergence rates: a) trials per rate, b) cost per trial, c) iterations per rate, d) T_0 and T_f per rate.

The values of the melting temperature, shown at the top of Figure 5.7d, were not sensitive to the changes in the convergence factors. This was an expected behavior, due to the independent formulation given to these parameters. The values of the final temperature decreased at an approximately exponential rate with the growth of the convergence factors. This was also expected, given the tighter conditions imposed on the stopping criterion. In general, the behavior of all tested factors was as expected, showing the consistency of the schedule.

An additional set of tests was performed to assess the stability of the annealing schedule.
This assessment can be performed by measuring the spread of the solution cost as a function of the temperature when the same problem is solved several times [95]. Figure 5.8 shows the outcome of this experiment for three successive solutions of a single circuit with ten macro blocks. It can be observed that at high temperatures all solutions oscillate in the same neighborhood. As the process converges, the solutions approach the same asymptotic value with very small spread (μ = 87464, σ = 1323).

Figure 5.8. Stability of the annealing schedule.

5.4 Summary

This chapter described the implementation details of the SAPPO tool, covering the input-output formats of the program, its internal algorithms, and the calibration process of the sensitive parameters in the simulated annealing heuristic. Several important achievements were made throughout this process. The developed input-output formats represent a new physical netlist specification supporting power parameters. This is an essential step not only for the SAPPO tool, but also toward the development of other mechanisms that could be created to optimize power dissipation during the physical design stage of VLSI circuits.

During the development of the functions used in the cost evaluation of placement solutions, new algorithms emerged that allow the contour detection of a set of complex rectilinear shapes in a plane without the need to decompose them into constituent rectangles. An analysis was performed showing why previously reported contour detection algorithms targeting rectangular shapes would fail if presented with a non-rectangular object. This was followed by the introduction of two algorithms proposed to fix this problem.
The correct behavior of these algorithms on the identified problem cases was verified, and analyses of their time and space complexities were included, demonstrating their efficiency. Considering that the solution space of placement configurations grows exponentially with the number of modules in the layout, these new algorithms avoid the increase in problem size that would result from decomposing non-rectangular blocks into rectangles.

The implementation of the annealing heuristic, although conceptually simple, required the calibration of a large number of parameters, which turned its practical implementation into a laborious task. A number of small experiments were set up for the calibration process, and they proved effective for their purpose. Although this task did not follow a strictly theoretical procedure, it provides useful guidelines for parameter refinement in similar implementations.

To validate the implementation of the simulated annealing algorithm, several experiments were performed. These experiments were directed at assessing the correct space exploration induced by the perturbation function, the proper balance of the cost function components, and the consistency and stability of the annealing schedule. The test results, obtained through the application of well-established methods for evaluating the quality, stability, and repeatability of annealing algorithms [91], [93], [95], [96], point to a sound implementation of the simulated annealing algorithm.

CHAPTER 6

Results and Analysis

This chapter presents the results and analysis of the application of the placement methodology developed as part of this dissertation. A number of experiments performed on a set of circuits selected from the MCNC benchmark suite were used to evaluate and validate the contributions of this research. The contents of the chapter are organized as follows.
A description of the characteristics of the circuits used for the evaluations and the general approach followed in the experiments is presented in the next section. Section 6.2 discusses the tests performed to validate the heuristic, using as a reference results previously reported on the set of benchmarks. The gains in switched length, as well as the effects of using the median-biased heuristic in the placement tool, are also discussed in that section. The last two sections contain an analysis of the asymptotic behavior of the methodology and a summary of related ideas.

6.1 Circuit Profiles

The evaluation of the capabilities of the SAPPO tool as a layout placer was carried out on a set of five circuits selected from the MCNC benchmark suite. These circuits, in addition to containing a representative mix of block and net sizes, have been widely used to report comparative results of physical design tools. The characteristics of these circuits, along with some statistics about their contents, are summarized in Table 6.1.

Table 6.1. Characteristics of the circuits used in the evaluation of results.
          Number of               Average                Standard deviation
Circuit   blocks   nets   pads   block size   net size   block size   net size
ami33     33       123    42     0.035        4.21       0.018        9.23
ami49     49       408    22     0.732        2.34       1.030        1.33
apte      9        97     73     5.173        2.96       1.852        1.93
hp        11       83     45     0.802        3.72       0.664        3.03
xerox     10       203    2      1.935        3.44       1.060        6.01

Two types of analyses were performed on these circuits. The first set of tests was aimed at establishing the quality of the results produced by the SAPPO tool with respect to previously reported placement approaches. The second analysis focused on quantifying the power savings introduced by the switched length approach in the optimization of the circuits. Section 6.2 presents the results of these experiments and their analysis. A second set of custom-configured circuits was used to perform an analysis of the asymptotic behavior of the placement tool.
In this experiment, a single circuit containing a mix of blocks of different shapes and sizes was replicated, creating several scaled versions of the basic problem posed by the original circuit. Different scaling factors were used in the construction of the replicas, and each resulting circuit was placed with the SAPPO tool. The details of this experiment are discussed in Section 6.3.

6.2 Comparative Analysis

The comparative analysis of the results produced by the SAPPO tool was divided into two stages: the first to establish the quality of the results produced by the tool and the second to quantify the power savings attained through its use. To validate the results produced by the SAPPO tool, the cost function in the simulated annealing algorithm was customized to produce placements optimized for wirelength only. This enabled the collection of a set of results that could be compared with those from previous approaches reporting wirelength results on the target circuits. Once this reference was established, the target circuits were placed again, this time with the original objective function. The new set of results was then compared, in terms of switched length, with those from the previous placements.

6.2.1 Wirelength Results

The formulation given to the cost function in the annealing algorithm allowed the placement objective to be easily retargeted by changing the terms making up equation (5.1) to reflect wirelength only. The resulting expressions for these terms, denoted C_1'(S) and C_2'(S), were

    C_1'(S) = Σ_{j ∈ N_1} L_j + Σ_{k ∈ N_2} B(n_k)    (6.1)

    C_2'(S) = ψ' · √(T_O(S) + A_O(S)) · C_1'(S),    (6.2)

where L_j, N_1, N_2, and B(n_k) are defined as in Section 5.3.1. A new scaling constant ψ' was required to properly balance the weights of the wirelength and overlap terms in the cost function.

The wirelength measurements obtained with the customized objective function are summarized in Table 6.2, along with similar results reported in the literature for a placement heuristic with a similar objective [114].
The results from the SAPPO program were obtained for a convergence factor t_5 and using the semi-perimeter method as the wirelength estimator. This last condition was introduced to use the same estimator as in most of the compared techniques. All wirelength measurements are given in mm and areas are expressed in mm^2. Table 6.3 shows additional results reported for the two most frequently reported circuits.

Table 6.2. Wirelength optimization for the SAPPO tool and a reference heuristic.
          SAPPO                 BB [114]
Circuit   Wirelength   Area     Wirelength   Area
ami33     77.18        2.50     109.0        2.24
ami49     835.03       60.14    1021.0       51.49
apte      366.10       54.69    460.0        54.05
hp        223.59       15.36    278.0        12.15
xerox     613.37       29.71    628.0        26.60

Table 6.3. Additional results reported on the set of benchmark circuits.
          SAPPO            Bear [122]     MBP [58]       TimberWolf [100]
Circuit   Wire     Area    Wire   Area    Wire   Area    Wire   Area
ami33     77.18    2.50    131    2.83    91     2.42    105    2.57
xerox     613.37   29.71   633    28.47   601    25.79   -      -

The outcome of this experiment shows that the placements produced by the SAPPO tool are, in terms of wirelength and area, competitive with the best results published for the target circuits. It should be considered that, although the same parameters are reported on the same set of circuits, a direct comparison among the techniques in these tables might be difficult to establish, because each of the referenced tools operates under different conditions. Examples of these differences include the fact that all the reference tools optimize area, while SAPPO considers it only as a constraint. Also, some of the tools report their results after routing has been completed on the circuits [58], [122], and some even include a stage of compaction [100]. Regardless of these differences, the results from the SAPPO tool can still be validated through these tables. An additional item of information regards the CPU time requirements.
Although it is practically impossible to make a meaningful comparison of execution times for the above set of placement mechanisms, CPU times are reported to give some idea of the time requirements. Table 6.4 indicates the amount of CPU time taken by the SAPPO tool to produce the reported results. The tool was written in C, and the reported execution times were obtained on a Sun Sparc Ultra 10. The times taken by BB to produce the solutions above for circuits xerox and ami33 on a DEC3100 were 66.7 minutes and 89.1 minutes, respectively.

Table 6.4. Running times for the SAPPO program.
Circuit       ami33   ami49    apte   hp     xerox
Time (min.)   50.94   332.86   8.07   7.77   8.17

6.2.2 Switched Length Results

To produce results from which an assessment of the power dissipated by the wires of the layout could be made, a new set of placements was generated using the original cost function C(S), which optimizes switched length. Four sets of data were recorded from these placements: global wirelength (WL), total switched length (SL), length of the longest net (DCN), and the cost of the switched length critical net (PCN). Similar sets of parameters were also recorded in the previous placements optimized for wirelength, whose results were discussed in Section 6.2.1.

The goal of collecting the data on the above four parameters was to quantify the savings in power that could be obtained by targeting the cost function to reduce the circuit switched length rather than its wirelength and/or area. In these experiments, the core area was established as a fixed constraint on all placements, and no constraints at all were placed on either the maximum DCN or PCN. The reason for tracking these last two parameters, even though they were left free in the processes, was to quantify the effect of the new placement technique on the critical delay path and the power critical net of each circuit.
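The distinction between the WL and SL metrics can be illustrated as follows, under the assumption that the switched length of a net weights its estimated length by its switching activity (the Net record and field names are hypothetical, for illustration only):

```c
#include <stddef.h>

/* Hypothetical per-net record: estimated length (mm) and switching
 * activity (average transitions per unit time). */
typedef struct { double length; double activity; } Net;

/* Global wirelength (WL): plain sum of net lengths. */
double total_wirelength(const Net *nets, size_t n) {
    double wl = 0.0;
    for (size_t i = 0; i < n; i++)
        wl += nets[i].length;
    return wl;
}

/* Total switched length (SL), assumed here to be the activity-weighted
 * sum of net lengths, so highly active nets dominate the power cost. */
double total_switched_length(const Net *nets, size_t n) {
    double sl = 0.0;
    for (size_t i = 0; i < n; i++)
        sl += nets[i].activity * nets[i].length;
    return sl;
}
```

Under this formulation a short but very active net can contribute more to SL than a long, mostly idle one, which is why the two objectives lead to different placements.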
Table 6.5 contains the results obtained with the wirelength cost function C'(S), including the additional recorded parameters, along with the results from the switched length cost function C(S). As before, all lengths, including switched length, are expressed in mm. A summary of the improvements obtained by the application of the switched length cost function on the recorded parameters is provided in Table 6.6.

Table 6.5. Wirelength results versus switched length on the set of benchmarks.
          Cost function: C'(S)              Cost function: C(S)
Circuit   WL       SL       DCN     PCN     WL       SL       DCN     PCN
ami33     77.18    16.67    2.70    1.29    78.31    14.48    2.74    1.20
ami49     835.03   206.34   12.07   5.04    884.74   205.76   13.31   5.06
apte      366.10   76.62    12.47   4.85    353.26   69.10    11.63   4.99
hp        223.59   51.85    7.49    2.77    223.57   50.57    7.83    2.90
xerox     613.37   137.48   10.10   3.45    537.75   115.15   9.44    3.02

Table 6.6. Improvements obtained when switched length was used as target.
Circuit   WL (%)   SL (%)   DCN (%)   PCN (%)
ami33     -1.46    13.14    -1.67     7.19
ami49     -5.95    0.28     -10.33    -0.29
apte      3.51     9.82     6.74      -2.89
hp        0.01     2.46     -4.61     -4.43
xerox     12.33    16.25    6.56      12.45

The results in Tables 6.5 and 6.6 indicate that targeting the objective function to reduce the switched length of the circuits consistently produced modest improvements in the power dissipation of the layouts. An average reduction of about 7% over all circuits was achieved, with several circuits showing improvements close to 10% or more. At the same time, these savings in power did not degrade the already established good wirelength results for most of the circuits. An example of an output layout is illustrated in Figure 6.1.

A small degradation, however, was observed in the length of the DCN for several circuits. This behavior can be attributed to the redistribution of weights induced by the switching activities on the circuit networks. It was observed that in most cases the nets flagged as DCN were among the least active of the layout.
The average degradation, though, was very small: -0.3% over the entire set of circuits. The power critical net, although showing a small average improvement of 1.4%, also had an oscillatory behavior. The results for these last two parameters indicated that meaningful behaviors could only be established if the cost function were modified to optimize critical power dissipation along with the overall switched length.

Figure 6.1. Placement of benchmark hp.

6.2.3 Other Results

Additional tests were performed to quantify the effects of the wirelength estimator on the SAPPO results. Two estimators were considered: the MBH, discussed in Chapter 4, and the semi-perimeter method (SPM), described in Chapter 2. The interest in the SPM originates from the fact that it is used by virtually all reviewed placement techniques, including those used for comparisons in the previous section. The specific aspects investigated were the impact on the quality of the final solution and the effect of the estimator on the run time of the placement heuristic. Three circuits were used in the tests, ami33, apte, and hp, each of which was optimized for switched length under the MBH and compared to the previous results, which were based on the SPM. Prior to making any comparisons, the final wirelength of the circuits placed under the SPM was re-estimated with the MBH. This conversion was necessary to fairly compare both measurements under the same metric.

The analysis of the converted data revealed that, despite the underestimation trend known to affect SPM measurements in circuits with regularly shaped modules like ami33 and apte, the improvements induced by the MBH on the final switched length were minimal (less than 3%). For less regular blocks like those in benchmark hp, the MBH produced a final switched length 13% smaller. More consistent values were observed in the DCN and PCN, where the MBH produced values reduced by up to 35% and 38%, respectively, and never by less than 11%.
The above improvements were outweighed, however, by the excessive time requirements of the MBH. To illustrate the disparity in CPU times, the placement of benchmark hp with a convergence speed factor of t_3 was completed under the SPM in 107 seconds; under the same conditions, the MBH required 6117 seconds. It should be recalled from Chapter 4 that, although the MBH was shown to be competitive with other heuristics in its class, it has a time complexity of O(n^2 log n) on a graph with n vertices. In the implementation of the SAPPO algorithm, this is added to the overhead of the graph construction and reduction mechanisms, bounded by an O(n^2) polynomial. In contrast, the SPM has a time complexity of O(n) for a network with n terminals.

6.3 Asymptotic Behavior

An issue seldom discussed for most heuristic techniques is the method used to evaluate their effectiveness. When a deterministic algorithm is applied to a problem for which the optimum value can somehow be estimated or measured, the evaluation of the algorithm's effectiveness is a trivial task. However, for NP-hard problems like those found in VLSI layout, the optimum solutions are usually intractable, and comparative evaluations therefore require careful handling. This becomes even more critical when the underlying optimization mechanism is of a stochastic nature, as in the case of simulated annealing. The problem of assessing the effectiveness of this class of heuristics has been addressed by Hagen et al. [123]. Their approach establishes that, given a solution cost C_H(I) obtained by executing heuristic H on problem instance I, a k-scaled instance of I for some k > 0 will yield a solution C_H(k·I) >= k · C_H(I) when solved using H. The scaled suboptimality of heuristic H for a scale factor k, η_H(k), can then be computed as

    η_H(k) = C_H(k·I) / (k · C_H(I)) - 1.    (6.3)

A heuristic is considered suboptimal if η_H(k) > 0.
The growth of this factor can be used to predict how bad a heuristic can become as the problem size increases. Tests were performed to measure the asymptotic behavior of the SAPPO heuristic using the scaled suboptimality criterion explained above. In these tests, a single circuit consisting of seven macro blocks was replicated 2, 4, and 8 times to create scaled versions of the initial layout, as illustrated in Figure 6.2.

[Figure: a) Base circuit, b) Scaled instances]
Figure 6.2. Base and scaled circuits used for asymptotic tests.

Given that the heuristic to be tested works on a per-net basis, the scaling factor k for each new instance was computed based on the increasing number of nets of the new circuits. Table 6.7 summarizes the characteristics of the original and scaled layouts.

Table 6.7. Characteristics of scaled layouts used for asymptotic behavior tests.

    Circuit   Blocks   Nets   I/O   Area ×10³   k
    ex7       7        24     18    19.2        1.00
    ex8       14       42     24    38.4        1.75
    ex9       28       78     36    76.8        3.25
    ex10      56       144    48    153.6       6.00

Multiple runs of each circuit were carried out and the resulting cost, reflecting the layout switched length, was averaged to compute the suboptimality factor η for each instance. The results of the experiment are summarized in Table 6.8.

Table 6.8. Results of suboptimality tests.

    Circuit   Runs   SL     η            Deviation
    ex7       5      534    0.0          —
    ex8       3      937    2.68×10⁻³    0.27%
    ex9       3      1806   40.62×10⁻³   4.06%
    ex10      2      3427   69.60×10⁻³   6.96%

The results of the scalability tests indicate a satisfactory behavior of the heuristic, as can be appreciated from the small values of η. In the largest of the test cases, the deviation from the optimal scaling behavior was less than 7% and, within the obtained values, the overall behavior experiences a near-logarithmic growth.
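The η values in Table 6.8 follow directly from Equation 6.3 applied to the averaged switched lengths (SL) and the scale factors k of Table 6.7, with ex7 as the base instance. A short numeric check:

```python
def scaled_suboptimality(cost_scaled, cost_base, k):
    """Eta_H(k) = C_H(k*I) / (k * C_H(I)) - 1, per Equation 6.3."""
    return cost_scaled / (k * cost_base) - 1.0

base_sl = 534  # averaged SL of the base instance ex7 (k = 1.00)
for name, sl, k in [("ex8", 937, 1.75), ("ex9", 1806, 3.25), ("ex10", 3427, 6.00)]:
    eta = scaled_suboptimality(sl, base_sl, k)
    print(f"{name}: eta = {eta:.5f} ({eta:.2%} deviation)")
```

Running this reproduces the tabulated deviations: roughly 0.27% for ex8, 4.06% for ex9, and 6.96% for ex10.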
6.4 Summary

The results obtained from the application of the placement methodology developed as part of this dissertation were presented and discussed in this chapter. A comparative analysis of the wirelength quality of the layouts it produced ranked its results among the best known for a selected set of benchmark circuits. The same set of high-quality results was then evaluated for its power dissipation characteristics and compared to those obtained when the newly developed methodology was used to reduce the switched length of the circuits. It was found that through its application, modest reductions in the power dissipation of the test circuits were achieved without degrading the quality of the placements. Moreover, an analysis of the asymptotic behavior of the heuristic pointed to a good scalability of its results as the size of problems increases. Additional results were analyzed showing that further improvements were possible by replacing the semi-perimeter-based wirelength estimation mechanism used in most tests with an estimator based on the median-biased heuristic. However, the improvements obtained in switched length and critical nets are outweighed by the time requirements of this alternate method. It is inferred from this outcome that such a heuristic might be more appropriate for the stages of global and detailed routing, where its good stability and consistency might counterbalance its time complexity.

CHAPTER 7

Conclusion

The design, implementation, and evaluation of a power-reducing placement methodology for macro block-based VLSI layouts has been presented in this dissertation. The results of its application have been presented and analyzed, indicating that power reductions are possible without significantly degrading the quality of the output layouts in terms of area or delay.
This chapter summarizes the work developed in this dissertation, highlighting the contributions that have emerged through its completion, and identifying future research directions.

7.1 Research Summary

The review presented in Chapter 2 exposed the relationships among circuit parameters and input data patterns in the power dissipation of digital circuits. This analysis identified the supply voltage, the switching activity, and the network capacitance as the variables defining the solution space for low-power tools. An analysis of the problem of estimating switching activity in CMOS circuits was also presented, discussing salient techniques, identifying the different estimation trends, and highlighting their strengths and limitations. This analysis supported the selection of a probabilistic switching activity estimator for the power-driven placement environment. The study of power minimization techniques made it evident that although numerous approaches have been proposed with different degrees of success, the physical design stage has received the least attention. Moreover, a review of placement techniques indicated that although this has been the most intensively researched physical design problem, virtually all reviewed techniques have neglected the effects of placement configurations on the power dissipation of digital circuits. These facts made evident the existence of a gap in the development of power-conscious physical design techniques. The problem statement in Chapter 3 describes the circuit placement for minimum power dissipation. This statement, in addition to formally presenting the problem, made evident the complexity involved in the search for a solution. An important point stressed in this chapter was the difference between wirelength and switched length optimization. In Chapter 4, error skew effects on techniques used for estimating minimum interconnection networks in a power-optimizing physical design environment were studied.
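The three variables identified in the Chapter 2 review above interact through the standard first-order model of dynamic power dissipation in CMOS, P = ½·α·C·V²·f, where α is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. A small numeric sketch under this textbook model (all parameter values are illustrative):

```python
def dynamic_power(activity, capacitance_farads, vdd_volts, freq_hz):
    """First-order CMOS dynamic power: P = 0.5 * alpha * C * Vdd^2 * f."""
    return 0.5 * activity * capacitance_farads * vdd_volts ** 2 * freq_hz

# Illustrative node: alpha = 0.2, C = 1 pF, Vdd = 3.3 V, f = 50 MHz
p = dynamic_power(0.2, 1e-12, 3.3, 50e6)  # about 5.445e-05 W (~54 microwatts)
```

The model makes the leverage of each variable explicit: activity and capacitance (and hence switched length, their product over the nets) enter linearly, while the supply voltage enters quadratically.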
Considering the speed-accuracy constraints of the application framework, techniques falling in the class of greedy Steiner minimum tree (SMT) estimators were considered. The study found that the error behavior of typical estimators could have a drastic effect on the indicated application framework. Two new heuristics, the MBH and the DBH, were proposed and evaluated for their error behavior. Results showed that among several SMT heuristics, the MBH had a better consistency of quality results for large nets, a desired characteristic in the application framework of the estimators. To limit the search space of RSMT heuristics, selected graph reduction techniques were adapted to the requirements of layout graphs. They were found to be both time-efficient and powerful. Results indicate average deletions of over 95% of Steiner candidates in more than 90% of the cases. The implementation of the placement tool supporting the proposed power reduction mechanisms was presented in Chapter 5. Details explaining the input-output formats of the program, internal algorithms, and the calibration process of sensitive parameters in the simulated annealing algorithm were discussed. Among the important achievements of this process were the development of a new physical netlist format supporting power parameters and new algorithms for the manipulation of complex rectilinear shapes as single objects. Analyses of the time and space complexities of these algorithms were performed, demonstrating their efficiency. In the implementation of the simulated annealing algorithm, a careful experimental calibration process was described for the annealing schedule, the cost function, and the perturbation mechanism. This calibration led to a very stable and consistent adaptive implementation of the annealing algorithm. The main results of this dissertation were presented and evaluated in Chapter 6.
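The three calibrated elements named above (annealing schedule, cost function, perturbation mechanism) are the knobs of the classic simulated annealing skeleton. The following generic sketch shows that structure; the function names, the geometric cooling schedule, and all parameter defaults are illustrative assumptions, not the SAPPO implementation:

```python
import math
import random

def anneal(initial, cost, perturb, t0=100.0, alpha=0.95, moves=50, t_min=0.1):
    """Generic simulated annealing: accept a worsening move with
    probability exp(-delta/T); cool geometrically by factor alpha."""
    state, state_cost = initial, cost(initial)
    best, best_cost = state, state_cost
    t = t0
    while t > t_min:
        for _ in range(moves):
            cand = perturb(state)
            cand_cost = cost(cand)
            delta = cand_cost - state_cost
            if delta <= 0 or random.random() < math.exp(-delta / t):
                state, state_cost = cand, cand_cost
                if state_cost < best_cost:
                    best, best_cost = state, state_cost
        t *= alpha
    return best
```

For example, minimizing cost(x) = x² from an initial state of 10.0 with a small random perturbation drives the state toward zero; in a placement setting, the state is a layout, the cost is the (switched) wirelength estimate, and the perturbation moves or rotates a module.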
A comparative analysis of wirelength and area on a set of benchmark circuits placed for minimum wirelength was used to validate the results produced by the placement tool. Comparing these to the placements produced with the developed low-power methodology, it was found that modest reductions in the switched length of the test circuits were achieved without degradations in wirelength or delay. In about 60% of the tested cases, power reductions in signal nets from around 10% to over 16% were measured. Moreover, an analysis of the asymptotic behavior of the heuristic pointed to a good scalability of these results as the size of problems increases. Additional results were analyzed showing that further improvements were possible by replacing the semi-perimeter-based wirelength estimation mechanism with an estimator based on the MBH. Despite the high cost in execution time at which these additional improvements were obtained, this analysis confirmed the hypothesis of the effect of the skew error of estimators in power-driven physical design applications. Of special interest were the effects caused by this approach on the critical nets of the circuit, where peak reductions in excess of 35% were measured.

7.2 Contributions

The objectives of this dissertation required an approach that combined diverse areas of knowledge. At the same time, the development of this work produced several contributions which enhance several of these areas. Five major contributions can be attributed to this research. A summary of these new concepts is presented below.

A New Low-power Placement Methodology: The methodology developed as part of this dissertation was shown to be an effective way of reducing the power dissipation of new physical layouts. The results obtained indicate that stable and consistent reductions in the switched length of the output layouts can be obtained with very small degradations in wirelength and delay.
Although this was an experimental program, the concept has been proven to work. This implies it could be used either in emerging industrial-grade low-power design tools or incorporated into existing programs.

A Better Understanding of Power Dissipation Patterns in Physical Layouts: The results obtained supporting the newly developed methodology have provided a better insight into the power dissipation patterns of macro block-based layouts. Researchers have conjectured that it might be possible to control the power dissipation of circuits by introducing changes in their layout design methodology [124]. However, no research has been found reporting on this behavior for the wiring of macro block-based layouts. The experiments performed with the power-optimizing placement tool have quantified these gains for the indicated layout style. Moreover, these results show how and at what cost diverse degrees of reduction can be achieved. An additional item in the quantification of power savings achievable during physical layout design is that, for the studied style, the obtained results point to the placement stage as the one producing the highest power reductions. An aspect for which no previous discussion has been found is the effect of the skew error of network interconnect estimators on the resulting power dissipation of layout designs. The results of this research have pointed out that nets with a large number of terminals typically become critical in the power dissipation. However, these nets are frequently the least accurately modeled by traditional interconnect estimators. This research quantified part of the effect caused by wirelength estimators on the total switched length during the placement stage. Evidence was presented showing that there could be a more dramatic impact on the results of performance-driven placement techniques.
It is conjectured at this point that the same reasoning could be applied to a hypothetical router for low power, or a performance-driven routing application.

New Heuristics for the Minimum Rectilinear Steiner Tree Problem: Two new heuristics for the minimum rectilinear Steiner tree (MRST) problem in graphs were developed. These new algorithms were found to be competitive in terms of space-time complexity and error bound with other heuristics in their class. Most importantly, the MBH was found to outperform similar heuristics when presented with configurations of increasing numbers of terminals. This result might prove useful in many areas where rectilinear Steiner trees find application. This is especially true for the development of power-reducing layout techniques, as was demonstrated in the wirelength estimation application in low-power placement. An additional contribution relevant to the MRST problem was the development of an efficient method to perform reductions on graphs produced from two-dimensional spaces with obstructions. A set of mechanisms previously reported for graphs extracted from obstacle-free layouts were selected and adapted for application in graphs generated from macro block-based layouts. The selected subset was implemented and found to be as effective as other mechanisms having larger time complexity.

Improved Algorithms for the Manipulation of General Rectilinear Shapes: A set of algorithms for the manipulation of general rectilinear shapes capable of handling non-rectangular objects as single entities were developed. These algorithms were shown to correctly handle the special cases arising during the processing of non-rectangular shapes. In addition, they were found to have space and average time complexity similar to that of mechanisms which handle non-rectangular shapes as a set of constituent rectangles.
The relevance of this contribution resides in the reduced number of objects that need to be considered when handling complex rectilinear shapes. A representative example of this situation occurs in the case of an iterative improvement placement application. When non-rectangular shapes need to be decomposed into constituent rectangles, the size of the solution space explored by the application dramatically increases. This is because of the exponential growth of the solution space with the number of modules.

An Extended Netlist Format Supporting Low-power Layout Techniques: Extensions were created to a standard physical netlist format, expanding its capabilities to support power-optimizing physical design tools. With this contribution, part of the foundation for the development of a unified set of low-power tools covering all physical layout design subproblems has been established.

7.3 Future Work

The creation of a power-optimizing placement tool for macro block-based layouts, as was indicated earlier, only partially covers the need for power-conscious physical design tools. Similar approaches are needed in the remaining stages of layout design, and even in the placement stage itself, to provide a continuum of tools enabling power-optimized layouts from beginning to end. This necessity points to multiple development objectives for tools that could partition, route, and compact physical layouts. The development of such tools could take advantage of several mechanisms already developed for the placement tool. Another interesting development that would enhance the low-power placement methodology would be its expansion to handle layout styles other than macro blocks. Given the high degree of placement freedom offered by the macro block style, the introduction of appropriate constraints could adapt the algorithms to handle standard cells, gate arrays, or even mixed layouts.
In all of these cases, suitable rules for low-power optimization could be devised using the switched length of nets as a measure of quality in the resulting layouts. In the analysis of the effects of wirelength estimation accuracy, results consistently showed large differences in the costs of critical nets. A more detailed analysis of this trend in performance-driven physical design techniques could provide an insight into a problem that might not have been noticed before. Despite the quality of the results reported from the application of the developed placement tool, the current implementation has not been optimized for speed. Several options have been identified as possible sources of improvement. The most time-consuming tasks during the MBH-based operation are the wirelength estimation and the overlap computation. An alternative that could be explored for speeding up these processes would be their parallelization. Some data dependencies exist between the processes, but given that overlap computation is based on modular information and wirelengths on network topologies, the data in the two stages have a large degree of coarse-grain parallelism. The small dependency arises in the detection of feasible nets, made with the use of the segment trees of the overlap detection mechanism. Another alternative worth exploring for speeding up the process would be partitioning the layout into subcircuits and parallelizing their optimization. In this case, processor interaction would occur due to the presence of nets joining the subcircuits. To minimize this interdependence in the data, a partitioning algorithm that minimizes the cut size could be used. Issues needing additional consideration include the possible performance degradation that might occur due to the fragmentation of the solution space.
7.4 Closing Remarks

The idea behind this dissertation, more than simply providing an isolated method for power-efficient placement of a certain type of layouts, is the creation of a unified design methodology that could transform the way in which physical design automation tools operate from beginning to end. This is not an easy goal. After reaching this point and reflecting on all the work it took to complete this dissertation, one realizes that this is just a small step in reaching that ultimate objective. The foundation has been laid and the objectives are clear.

APPENDICES

APPENDIX A

Using the SAPPO program

This appendix briefly describes the use of the SAPPO program. It includes details of command line options used in its invocation, the input format specification, and the different output files it produces. Also, summarized information is provided on a translation tool developed to facilitate the task of importing layouts from other languages.

A.1 Command Line Syntax

The SAPPO program can be invoked by its name from the command prompt. If no command line parameters are specified, the program produces the following message on the standard output device:

    SAPPO: Simulated Annealing Placer for Power Optimization.
    Usage: sappo [options] filename
        filename - name of file with netlist in Low-Power YAL (LPY) format.
    OPTION  EFFECT
    -f      Produce a file.FIG with a plot of placement in xfig 3.1 format.
    -l      Produces a log file file.LOG of the program execution.
    -o      Produces an LPY file with the resulting layout.
    -c      Evaluates the cost of a solution, bypassing the annealing process.
    -t#     Annealing time control. Value #: 7-slow, 1-fast, default-3.
    -r#     Specifies a seed (# > 0) for random number generation, default-auto.
    -i      Generates an INITIAL solution different from that in the input file.
    -w      Annealing process to optimize Wire length only.
    Steiner approximation obtained through:
    -m      Median-biased path heuristic (MBH) - default
    -d      Distance-network heuristic (DNH)
    -p      Prim's Minimum Spanning Tree (MST)
    -s      Semi-perimeter method (SPM)

The different command line arguments can be combined to specify concurrent actions in the processing of the circuit. Only one Steiner approximation method can be specified at a time. The program will use the provided file name to search for an LPY circuit description file and a supplemental rules and constraints file with the same name as the input circuit and extension RCF. The formats of these two files are described in the sections below.

A.2 Input Format

The layout input/output format was developed based on the YAL layout language, originally developed by B. Preas and K. Roberts and used in the MCNC benchmark circuits. It also includes modifications introduced later by J. Rose. The new format, named LPY (low-power YAL), extends the YAL constructs to support the specification of power dissipation parameters in physical design netlists. The description of the LPY format presented below closely follows that of the YAL language. It includes the newly added constructs that extend its support to physical design for low power. The new features added to YAL include an additional section in the layout parent module specifying the switching activity factors of input and internal nodes of the circuit. Terminal descriptions in all module types have been extended to allow the specification of their equivalent resistance and capacitance values, facilitating the computation of all power components of the layout as well as extending the previous support for delay computations. The description of critical nets has been complemented with a new section added to specify module propagation times and timing constraints for signals traveling through the entire chip.
Module dimensions in the LPY format remain specified by the coordinates of the corners of their rectilinear shapes in counter-clockwise order, as in its predecessor. However, the LPY format enforces that the leftmost of the lower corners is specified first. In the YAL format, design rules are not part of the language. This remains true for the LPY, but a separate file is now provided which allows for specifying a reduced set of design rules and constraints.

[Footnote: YAL is Copyright (C) 1987 Bryan Preas and Ken Roberts. All rights reserved, March 1987. Modified by Jonathan Rose to accommodate rectilinear cells, power and ground nets with current and voltage restrictions, and the specification of critical net lengths. The new YAL is not compatible with the old. November 1987.]

A.2.1 The LPY Netlist Format

An LPY file is composed essentially of comments and data lines. Comments are delimited by “/*” and “*/” and may extend across physical line boundaries. Data lines are composed of a series of keywords and tokens organized to form diverse sections in the format. White spaces are used as delimiters between tokens. A white space can be one or more of any combination of space, tab, line feed, or carriage return characters. The data is free format. A semicolon, “;”, is the logical line terminator. Any number of white space characters may precede a semicolon. A logical line can occupy more than one physical line. Any number of logical lines can occupy a physical line. Within the format, “name” fields are used for module or cell definitions, module or cell instances, and signals or nets. Names are case sensitive, may contain any combination of characters except white spaces or semicolons, and are not required to begin with an alphabetic character. In the descriptions below, “|” means OR, implying the selection of one item from the list of fields separated by the “|”. Items enclosed in square brackets, “[” and “]”, indicate optional fields. An ellipsis, “. .
.”, either horizontal or vertical, indicates that the preceding field or line can be repeated as necessary. Tokens enclosed by angle brackets, “<” and “>”, are used as intermediate definitions. Words in upper case are keywords of the language. A “number” is a real value and can specify a measure in microns or the value of some other quantity.

Module Definitions

A module is a definition of a circuit being laid out or a constituent primitive cell. The definition for each circuit or primitive cell begins with “MODULE <modulename>”. The definition ends with the “ENDMODULE” keyword. A template of a module description is presented below.

    MODULE <modulename>;
      TYPE <moduletype>;
      DIMENSIONS <xposition> <yposition> . . . ;
      IOLIST;
        <signalname> <terminaltype> [<xposition> <yposition> [<width> [<layer>]]]
          [CURRENT <current>] [VOLTAGE <voltage>]
          [RESIS <resistance> CAPAC <capacitance>];
        . . .
      ENDIOLIST;
      NETWORK;
        <instancename> <modulename> <signalname> . . . ;
        . . .
      ENDNETWORK;
      ACTIVITY;
        <signalname> <sigactivity>;
        . . .
      ENDACTIVITY;
      PLACEMENT;
        <instancename> <xlocation> <ylocation> [<reflection>] [<rotation>];
        . . .
      ENDPLACEMENT;
      CRITICALNETS;
        <netname> <maxlength>;
      ENDCRITICALNETS;
      PIN2PINTIME;
        <terminal1> <terminal2> <risetime> <falltime>;
      ENDPIN2PINTIME;
    ENDMODULE;

The different fields in the module template described above are defined as follows:

    <modulename>   ::= name
    <moduletype>   ::= STANDARD | PAD | GENERAL | PARENT | FEEDTHROUGH
    <width>        ::= number
    <height>       ::= number
    <signalname>   ::= name
    <terminaltype> ::= I | O | B | PI | PO | PB | F | PWR | GND
    <side>         ::= BOTTOM | RIGHT | TOP | LEFT
    <layer>        ::= PDIFF | NDIFF | POLY | METAL1 | METAL2
    <xposition>    ::= number
    <yposition>    ::= number
    <position>     ::= number
    <instancename> ::= name
    <current>      ::= number
    <voltage>      ::= number
    <resistance>   ::= number
    <capacitance>  ::= number
    <sigactivity>  ::= number
    <xlocation>    ::= number
    <ylocation>    ::= number
    <reflection>   ::= RFLNONE | RFLY
    <rotation>     ::= ROT0 | ROT90 | ROT180 | ROT270
    <netname>      ::= name
    <maxlength>    ::= number
    <risetime>     ::= number
    <falltime>     ::= number

The “TYPE” line is required for all modules. Allowable module types are: PAD for a driver or pad cell that should be placed around the periphery of the chip. GENERAL for an internal, primitive general cell or building block. PARENT for a higher-level module to be laid out. Additional types supporting standard cell-based layouts are: STANDARD for an internal, primitive standard cell that must be placed on a row or column, and FEEDTHROUGH for a cell used to connect a signal across a row.
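The lexical rules stated in Section A.2.1 (comments delimited by “/*” and “*/”, possibly spanning physical lines, free-format whitespace, and the semicolon as logical line terminator) can be illustrated with a small tokenizer. This is an illustrative sketch, not the SAPPO parser:

```python
import re

def logical_lines(text):
    """Split LPY source into logical lines: strip /* ... */ comments
    (which may span physical lines), then split on the ';' terminator,
    collapsing all runs of whitespace to single spaces."""
    no_comments = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
    return [" ".join(part.split())
            for part in no_comments.split(";") if part.strip()]

src = "/* a comment\n spanning lines */ MODULE m1;\n TYPE GENERAL ;"
# logical_lines(src) yields ["MODULE m1", "TYPE GENERAL"]
```

Note how the second logical line tolerates the stray whitespace before its semicolon and the physical line break, exactly as the free-format rules above require.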
Whether or not a line like DIMENSIONS or a section like NETWORK or PLACEMENT is required within a module depends on the TYPE of module being defined. A TYPE and an IOLIST are always required. A primitive cell is assumed to have been laid out previously and therefore has the following data: DIMENSIONS and an IOLIST giving, for each terminal, <xposition> and <yposition>, or <side> and <position>. Tokens <width> and <layer> are required for each of the IO terminals within the IOLIST for primitive cells. NETWORK, ACTIVITY, and PLACEMENT sections are ignored for primitive cells. A feedthrough cell (TYPE = FEEDTHROUGH) should not appear in a NETWORK but is inserted into a layout by the layout system as necessary. A module that defines a circuit to be laid out has TYPE PARENT and must have IOLIST, NETWORK, and ACTIVITY sections. Its DIMENSIONS line can be overridden by information in the file of rules and constraints. The DIMENSIONS line gives the coordinates of the corners of the rectilinear box in counter-clockwise order, with the leftmost of the lowest corners being first. The IOLIST section defines the external connections or terminals of the module and should be included in definitions of all modules. Each line in this section defines a segment of connection to a higher level of hierarchy for the module. Each line must contain at least a <signalname> of an IO terminal and its <terminaltype>. Within a module being laid out (TYPE = PARENT), a <signalname> on a terminal line in the IOLIST is in the same name space as the <signalname>s in the NETWORK of that same module. Each module has its own separate name space. Other fields may be required depending on the terminal and the module that contains it, as described below. If a primitive module (TYPE = STANDARD, PAD, GENERAL, or FEEDTHROUGH) is being defined, the lines within an IOLIST must contain the <side>, <position>, <width>, and <layer>. Terminals appear in the order in which they will be referenced in the NETWORK section of any higher-level module. Electrically equivalent terminals (e.g., in a dual-ported primitive module) must have the same <signalname>s.
In this case, binding between the terminal <signalname>s of a primitive module and the <signalname>s of the higher-level module is performed in the order of unique signal names. It should be assumed that electrically equivalent terminals can be used as “feedthroughs” for the signals connected to them. If additional over-cell routing is allowed, i.e., for signals not connected to the cell, terminal locations for these feedthroughs will appear at the end of the IO terminal list. Valid <terminaltype>s are: “I” for input. “O” for output. “B” for bidirectional. “Px” for a pad terminal on a primitive, with x replaced by either I, O, or B for input, output, and bidirectional pads, respectively. “F” for an independent feedthrough. “PWR” for VDD power input. “GND” for VSS ground input. All signal terminals can optionally have their equivalent resistance and capacitance values specified to allow for delay computations in timing-constrained designs. Resistance values are given in milliohms and capacitance values in picofarads. Terminal locations are specified by <xposition> (or <side>), <yposition> (or <position>), <width>, and <layer>. <xposition> and <yposition> give the coordinate of the center of the terminal in microns on the edge of the cell. To be compatible with the old YAL, terminals can also be specified by <side> and <position>. <side> may be BOTTOM, RIGHT, TOP, or LEFT. The token <position> gives the coordinate of the center of the terminal in microns along the indicated <side> from the left or bottom of the module as appropriate. Field <width> specifies the dimension of the terminal along the indicated edge; it extends from <xposition> - <width>/2 to <xposition> + <width>/2 or <position> - <width>/2 to <position> + <width>/2 as appropriate, and <layer> specifies the conductor layer of the terminal. A fixed pad placement for higher-level circuits (TYPE = PARENT) requires a slightly different interpretation of the fields and is specified in the following manner. In this case, the <side> and <position> specifiers must be used, rather than the <xposition> and <yposition>.
If the terminal is bound to the pad of an IO driver, then the <side> refers to the side of the chip and <position> refers to the position of the origin of that driver cell. <width> and <layer> are ignored. If no <position> is specified, then the pad is restricted to the indicated <side> but not to a <position> on that <side>. If no <side> is specified for a terminal bound to an IO driver, then that IO driver may be placed on any <side>. The NETWORK section defines the internal connectivity for the module. If a NETWORK section is included in the definition of a primitive module (TYPE = STANDARD, PAD, GENERAL, or FEEDTHROUGH), it will be ignored. Each line defines an instance of a primitive module that is included in the circuit. Each instance definition has the following fields: <instancename> is the name of the instance of a module. An instance has signal bindings, a location, and an orientation to be determined by the layout system in the PLACEMENT section. <modulename> is the name of the module to which the instance is bound. The remaining fields are a list of <signalname>s connected to the module, in an order determined by the module definition. Connections to the feedthrough terminals are not included in this list; they should be determined by the layout system. Single-terminal <signalname>s are used to specify an unconnected terminal. An unconnected terminal may also result if the list of <signalname>s on an instance definition is shorter than the list of unique <signalname>s in the IOLIST of the primitive module being bound. The estimated switching activity values of circuit nodes are specified in the ACTIVITY section. This new section is expected to be found in the bounding module of the layout (TYPE = PARENT). Each line in the ACTIVITY section contains two fields: <signalname> and <sigactivity>. The <signalname> field defines a unique signal name of those included in the NETWORK section. Entries are expected to have the following order: terminals listed in the IOLIST section of the PARENT module, keeping their original relative order, followed by internal net definitions in the NETWORK section, in their relative order of appearance.
The second field, <sigactivity>, specifies the estimated switching activity of the node identified by <signalname>. The PLACEMENT section is used to input initial placements or to report placement results, one logical line per instance. If a PLACEMENT section is included in the definition of a primitive module (TYPE = STANDARD, PAD, GENERAL, or FEEDTHROUGH), it will be ignored. The fields in a placement section are defined as follows: <instancename> is the cell instance, and must match an instance name in the NETWORK section. <xlocation> and <ylocation> specify the location of corner (0, 0) of a module after any orientation change of the instance. <reflection> and <rotation> are the optional reflection and rotation of an instance. The default for <reflection> is RFLNONE (the normal or “defined” reflection), while RFLY means reflected about the Y axis. The default value for <rotation> is ROT0 (no rotation); ROT90, ROT180, and ROT270 indicate a counter-clockwise rotation of the appropriate number of degrees. Reflection is applied before rotation. The dimensions in all primitive cells are forced to the lower-left corner of the layout (0,0). If the circuit contains an initial placement specification, it must be given through the placement section of the format. The CRITICALNETS section is used to specify which nets are performance-critical, and their maximum allowable lengths. Field <netname> identifies the critical net, and must match a <signalname> in the NETWORK section. Value <maxlength> is a number in microns specifying the maximum length to which the signal wire is restricted. This information is useful when performing performance-constrained placement. A new section, PIN2PINTIME, has been added which allows for specifying pin-to-pin propagation times. Two fields specify a pair of terminals in the module of concern. These fields must match declarations in the IOLIST section of the module. Fields <risetime> and <falltime> specify the estimated times taken by rising and falling signals, respectively, propagating between the pair of specified terminals.
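Each line of the ACTIVITY section described above pairs a <signalname> with its <sigactivity> value. A minimal illustrative parser for that section (not the SAPPO implementation; assumes the comments and semicolons of the enclosing logical lines have already been stripped):

```python
def parse_activity(lines):
    """Map each <signalname> to its <sigactivity> from the logical
    lines found between ACTIVITY; and ENDACTIVITY;."""
    activity = {}
    for line in lines:
        name, value = line.split()
        activity[name] = float(value)
    return activity

# Sample entries mirroring the ACTIVITY section of the example layout
section = ["OUT1 .05", "IN1 .50", "NET1 .23"]
```

Calling parse_activity(section) produces a dictionary suitable for weighting each net's length by its switching activity when computing switched length.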
Note that for modules other than PARENT, the pin-to-pin propagation times are input data, while for PARENT modules they represent timing constraints for critical signals traveling through the chip.

A Macro Block-based Example

The description below corresponds to a layout consisting of three primitive macro cells, each instanced once in the layout.

/* An example containing placement and constraint sections */
MODULE m1;
TYPE GENERAL;
DIMENSIONS 0 0 55 0 55 20 25 20 25 45 0 45;
IOLIST;
M1I1 I 0 30 1 METAL2 RESIS 2.58 CAPAC ...;
M1I2 I 0 10 1 METAL2 RESIS 0.16 CAPAC ...;
M1I3 I 40 0 1 METAL2 RESIS 2.03 CAPAC ...;
M1I4 I 40 20 1 METAL2 RESIS 1.99 CAPAC ...;
M1O1 O 15 0 1 METAL2 RESIS 3.12 CAPAC ...;
M1O2 O 55 10 1 METAL2 RESIS 3.76 CAPAC ...;
M1O3 O 10 45 1 METAL2 RESIS 4.21 CAPAC ...;
ENDIOLIST;
PIN2PINTIME;
M1I2 M1O2 30 30;
M1I4 M1O1 15 15;
ENDPIN2PINTIME;
ENDMODULE;

MODULE m2;
TYPE GENERAL;
DIMENSIONS 0 0 25 0 25 45 0 45;
IOLIST;
M2I1 I 0 20 1 METAL2 RESIS 3.71 CAPAC ...;
M2I2 I 25 30 1 METAL2 RESIS 2.06 CAPAC ...;
M2I3 I 10 45 1 METAL2 RESIS 1.06 CAPAC ...;
M2O1 O 0 35 1 METAL2 RESIS 4.48 CAPAC ...;
M2O2 O 0 5 1 METAL2 RESIS 2.61 CAPAC ...;
M2O3 O 25 15 1 METAL2 RESIS 4.96 CAPAC ...;
ENDIOLIST;
ENDMODULE;

MODULE m3;
TYPE GENERAL;
DIMENSIONS 0 0 30 0 30 25 0 25;
IOLIST;
M3I1 I 0 10 1 METAL2 RESIS 2.22 CAPAC ...;
M3I2 I 30 15 1 METAL2 RESIS 4.22 CAPAC ...;
M3I3 I 20 25 1 METAL2 RESIS 2.30 CAPAC ...;
M3I4 I 5 25 1 METAL2 RESIS 4.66 CAPAC ...;
M3O1 O 0 15 1 METAL2 RESIS 0.23 CAPAC ...;
ENDIOLIST;
PIN2PINTIME;
M3I3 M3O1 10 10;
ENDPIN2PINTIME;
ENDMODULE;

MODULE bound;
TYPE PARENT;
DIMENSIONS 0 0 120 0 120 80 0 80;
IOLIST;
OUT1 PO ...;
IN3 PI ...;
OUT2 PO ...;
IN4 PI ...;
IN5 PI ...;
IN6 PI ...;
IN7 PI ...;
OUT3 PO ...;
OUT4 PO ...;
IN8 PI ...;
IN1 PI ...;
IN2 PI ...;
ENDIOLIST;
NETWORK;
PM1 m1 IN1 IN2 IN3 NET1 OUT1 OUT2 OUT4;
PM2 m2 NET1 IN5 NET2 OUT3 NET2 NC1;
PM3 m3 IN1 NET2 IN7 IN8 NET1;
ENDNETWORK;
ACTIVITY;
OUT1 .05;
IN3 .50;
OUT2 .02;
IN4 .50;
IN5 .50;
IN6 .50;
IN7 .50;
OUT3 .33;
OUT4 .09;
IN8 .50;
IN1 .50;
IN2 .50;
NET1 .23;
NET2 .30;
NC1 0.16;
ENDACTIVITY;
PLACEMENT;
PM1 10.0 10.0 ROT270 RFLY;
PM2 85.0 10.0;
PM3 45.0 45.0;
ENDPLACEMENT;
CRITICALNETS;
NET1 25.00;
ENDCRITICALNETS;
PIN2PINTIME;
IN2 OUT2 50 50;
IN7 OUT1 65 65;
ENDPIN2PINTIME;
ENDMODULE;

A.2.2 Rules and Constraints Specifications

In addition to the LPY circuit description, a file specifying a subset of the design rules and constraints for the layout is required. This file is expected to have the same name as the circuit description, with extension .RCF. Its contents should be organized as follows.

RULES;
LAYER <layername> <width> <spacing> <conductance> <capacitance>;
ENDRULES;
CONSTRAINTS;
[ASPECT <ratio>;]
[CORE <width> <height>;]
[DELAY <netname> <length>;]
ENDCONSTRAINTS;

The RULES section allows specifying the wire width and wire spacing for each layer in the layout, as well as the sheet conductance and capacitance of each layer. The CONSTRAINTS section allows optionally imposing constraints on the aspect ratio or the core dimensions of the resulting layout. The maximum length of the delay-critical net can also be specified in this section.

A.3 Output Files

The number and type of output files produced during the execution of the SAPPO program will depend on the command-line parameters specified during its invocation. The files produced by the program will be in the same directory as the input LPY file. Output files will have the same name as the input file, with the corresponding extension. An exception to this rule is the output LPY file, which will be named "filename_o.LPY".
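The naming convention can be sketched as follows; the helper below is a hypothetical illustration built only on the stated rules (the same base name with a new extension for the report file, and an "_o" suffix for the output LPY file), not part of the SAPPO program itself.

```python
from pathlib import Path

# Hypothetical helper illustrating the stated output-file naming convention;
# it is not part of the SAPPO program itself.
def output_names(lpy_path):
    p = Path(lpy_path)
    return {
        "report": p.with_suffix(".RPT").name,  # same base name, new extension
        "placed_netlist": p.stem + "_o.LPY",   # exception: "_o" suffix
    }

print(output_names("circuit.LPY"))
# {'report': 'circuit.RPT', 'placed_netlist': 'circuit_o.LPY'}
```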
The output LPY file contains the placement produced by the program in its PLACEMENT section. Additionally, a report file with extension RPT will also be created. The output files are formatted in plain ASCII and their contents are self-explanatory.

A.4 Support for Other Netlist Formats

In order to provide a means for processing with SAPPO layouts specified in netlist formats other than LPY, a translation tool was developed. This translator, called yal2lpy, converts YAL netlists into the LPY format, provided that the switching activities of the input and internal nodes are known. When invoked without parameters, the following output is produced:

yal2lpy: Converts YAL files to LPY format.
Usage: yal2lpy infile [outfile | -]
where,
infile  - Input YAL file (required).
outfile - Optional output LPY file. If no outfile is specified,
          infile.LPY is created. Using "-" sends output to the
          stdout device.

The switching activity specifications should be provided in a file with extension ACT, formatted according to the output produced by the SIS power estimation tool.