This is to certify that the dissertation entitled

ACTIVITY-AWARE MODELING AND DESIGN OPTIMIZATION OF ON-CHIP SIGNAL INTERCONNECTS

presented by

KRISHNAN SUNDARESAN

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

MSU is an Affirmative Action/Equal Opportunity Institution

ACTIVITY-AWARE MODELING AND DESIGN OPTIMIZATION OF ON-CHIP SIGNAL INTERCONNECTS

By

Krishnan Sundaresan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Electrical and Computer Engineering

2006

ABSTRACT

On-chip global signal bus energy dissipation, thermal reliability, and latency all depend upon transmitted word values. Real-world microprocessor workloads cause bus traffic that exhibits significant spatial, temporal, and value locality. However, existing signal interconnect modeling and optimization schemes are oblivious to the correlated nature of such traffic and were developed with random or worst-case (highly changing) traffic conditions in mind, which limits their effectiveness. To address this, we present activity-aware methods to model and optimize bus energy dissipation, thermal reliability, and latency.

In the area of modeling, we present an activity-aware bus energy and thermal model that permits monitoring of energy dissipation and temperature, both spatially (horizontally across wires and longitudinally along individual wires) and temporally, during microarchitectural simulation of real programs.
We find that final temperatures of wires in global signal buses carrying data (instruction) in the processor core increase by as much as 37 (58) degrees Celsius during a simulation run of only a billion instructions in 130-nm (45-nm) fabrication technology. We also find that highly active wires in these buses attain absolute temperatures of up to 104 (123.7) degrees in 130-nm (45-nm) processors, which exceed the 100-degree temperature typically assumed during interconnect design. In addition, wire temperature gradients between the sending and receiving ends, with magnitudes between 16 and 25 degrees, were also detected. These conditions were found to degrade processor performance by at least 4% (11.92%) in 130-nm (45-nm) processors.

In bus design, we present a traffic-profile-guided approach to optimize bus energy subject to designer-specified thermal constraints and to reduce worst-case bus crosstalk and latency conditions. Our methodology does this by evaluating several options for signaling individual bit values and all possible ways of mapping bits to bus lines (bit ordering), and then choosing, based on traffic value characteristics, an optimal encoding scheme (the combination of bit signaling and ordering) statically at design time to support in hardware. Our energy-optimal static encoding techniques provide bus energy reductions of 30.2% (52.1%) for processor core data (instruction) buses, compared to existing, more complex dynamic encoding schemes that yield only 4.19% (5.32%) reductions for the same buses. Our static encoding technique with thermal constraints added during optimization reduces peak wire temperatures by up to 12.26 (12.96) degrees for data (instruction) buses, while still providing significant energy savings.
Finally, we also present a static encoding technique that reduces worst-case bus crosstalk conditions by at least 29.35% and a variable-cycle bus architecture that takes advantage of this reduced crosstalk to improve bus performance by 17.42%.

Our work represents a significant advancement over existing approaches that are activity-oblivious and/or consider worst-case traffic conditions. The microarchitecture-level activity-driven spatiotemporal bus energy and thermal model we present is the first of its kind. Our static value-aware bit reordering and signaling techniques are also highly novel solutions that work remarkably well in real applications.

Dedicated to Mom, Dad, and Deepa, for their unending love, support, and encouragement

ACKNOWLEDGEMENTS

The completion of this research and the writing of this dissertation have been one of the most significant academic challenges that I have ever had to face. Without the support, guidance, and patience of many people this endeavor would not have been possible. I owe my thanks to all of them.

I have been fortunate to learn from many excellent teachers, from grade to graduate school, and I am indebted to all of them for helping me reach where I am today. In particular, I thank my advisor, Dr. Nihar Mahapatra, for his technical guidance and support over the last five years. His mentorship has instilled in me the skill and confidence to identify, analyze, and efficiently solve research problems and present results in a clear and lucid manner. I have also learnt much from his classes and from our research meetings and discussions. I also thank my dissertation committee members, Dr. Anthony Wojcik, Dr. Andrew Mason, and Dr. Peixin Zhong, for their very insightful review and comments, which have helped me improve this work.

I have also been fortunate to be in the company of a lot of good friends, many of them my lab-mates, and I thank them all for their support.
Jiangjiang Liu helped me get my feet wet in research and was a great colleague during the early years. Kaushal Gandhi and Srivathsan Krishnamohan have been great friends and lab-mates, and I have benefited greatly from many technical discussions I have had with them. I cherish their friendship and the good times we had together, and I look forward to more Friday-night pizza-and-beer get-togethers in the Bay area, where all three of us are starting our professional careers.

My family (Mom, Dad, and sister) has been a great source of encouragement through the years, and their continuing love and affection have made me what I am today. I owe much more to them than what a few sentences can express. This dissertation is dedicated to them.

Last but not least, I thank all members of the Greater Lansing Bhagavad Gita group for their good thoughts and prayers. My association with them has helped me keep my sanity during these years and taught me to live by the Bhagavad Gita's motto: yogah karmasu kausalam, "Efficiency in Action leads to (the Ultimate) Knowledge."

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
SELECTED LIST OF SYMBOLS
1 INTRODUCTION AND OVERVIEW
  1.1 Interconnect Scaling Trends: Delay, Power, Temperature, and Reliability
  1.2 Material, Process, and Architectural Advances
  1.3 Impact of Interconnects on Architecture and VLSI
    1.3.1 Wire Delay
    1.3.2 Power and Temperature
    1.3.3 Computer-Aided Design Tool Requirements
  1.4 Drawbacks in Existing Techniques
  1.5 The Need for Activity-Aware Design
  1.6 Our Contributions
    1.6.1 Activity-Aware Design Methodology
    1.6.2 Accurate Energy, Temperature, and Delay Modeling
    1.6.3 Profile-Guided Optimization Techniques
    1.6.4 Novel Thermal Optimization Methodology
    1.6.5 Performance-Oriented Adaptive Bus Design
  1.7 Dissertation Outline
2 PRELIMINARIES
  2.1 Interconnect Analysis Methods
    2.1.1 Global, Semiglobal, and Local Wires
    2.1.2 Interconnect Models: RC and RLC
    2.1.3 Effect of Inductance on Global Signal Lines
    2.1.4 Energy Estimation
    2.1.5 Delay and Performance
  2.2 Interconnect Optimization Techniques
    2.2.1 Data Encoding
    2.2.2 Wire Spacing and Shielding
  2.3 Architecture-Level Simulators and Early-Stage Design
  2.4 Our Experimental Methodology
    2.4.1 Interconnect Geometry and Technology Data
    2.4.2 Parasitic Capacitance Extraction
    2.4.3 Simulation Infrastructure and Verification of its Correctness
    2.4.4 Target Systems and Benchmarks
3 ACTIVITY-DRIVEN ENERGY AND TEMPERATURE MODEL
  3.1 Introduction
  3.2 Related Work and Our Contributions
  3.3 Bus Line Energy Dissipation Model
    3.3.1 Energy Dissipated due to Line Self Capacitance
    3.3.2 Energy Dissipated due to Inter-Wire Capacitance
    3.3.3 Distributed-RC Line Energy Model
  3.4 Thermal Model
    3.4.1 Chip Thermal Structures and Heat Transfer
    3.4.2 Detailed Thermal Model
    3.4.3 Steady-State Thermal Model
  3.5 Simulation Environment and Methodology
    3.5.1 Benchmarks and Sample Sizes
    3.5.2 Thermal Warmup and Initial Temperatures
    3.5.3 Granularity of Thermal Simulation
  3.6 Experiments and Results
    3.6.1 Energy Dissipation in Processor Buses
    3.6.2 Correlation between Energy and Temperature
    3.6.3 Final and Peak Wire Temperatures
    3.6.4 Wire Temperature Gradients
  3.7 Summary
4 DATA- AND TEMPERATURE-DEPENDENT DELAY VARIABILITY MODEL
  4.1 Introduction
  4.2 Related Work and Our Contributions
  4.3 Temperature Dependent Delay Variability Model
    4.3.1 Wire Delay Considering Temperature Impact
    4.3.2 Wire Delay Variability Considering Crosstalk and Temperature
  4.4 Results and Discussion
    4.4.1 Maximum Wire Temperatures and Gradients
    4.4.2 Frequency of Timing Violations
    4.4.3 Performance Impact
  4.5 Summary
5 ACTIVITY-AWARE ENERGY AND TEMPERATURE OPTIMIZATION
  5.1 Introduction
    5.1.1 Need for Energy and Temperature Aware Bus Design
    5.1.2 Key Contributions and Results
  5.2 Related Work
  5.3 Methodology
    5.3.1 Target Scenarios
    5.3.2 Bus Layout and Wire Geometry
  5.4 Static Techniques for Bus Energy and Temperature Optimization
    5.4.1 Choice of Signaling Modes
    5.4.2 Minimum Energy Signaling (MES)
    5.4.3 Minimum Energy Bit Ordering (MEBO)
    5.4.4 Simultaneous Bit Ordering and Signaling (SBOS)
    5.4.5 Thermal Optimization Methodology
    5.4.6 Routing Overheads
  5.5 Results and Discussion
    5.5.1 Energy Dissipation in Processor Buses
    5.5.2 Energy Reduction for General-Purpose Design
    5.5.3 Energy Reduction for Workload-Specific Design
    5.5.4 Energy Reduction for Program-Specific Design
    5.5.5 Wire Temperature Reduction
  5.6 Summary
6 ACTIVITY-AWARE PERFORMANCE OPTIMIZATION
  6.1 Introduction
  6.2 Related Work
  6.3 Techniques for Performance Optimization
    6.3.1 Variable Cycle Bus (VCB) Design
    6.3.2 Minimum Crosstalk Bit Ordering (MCBO)
    6.3.3 MCBO with Signaling (MCBOS)
  6.4 Results and Discussion
    6.4.1 Peak Crosstalk Reduction
    6.4.2 Performance Improvement with VCB
  6.5 Summary
7 CONCLUSION
  7.1 Contributions and Key Results
  7.2 Directions for Future Research
BIBLIOGRAPHY

LIST OF TABLES

2.1 Bus crosstalk conditions and models for a rising transition in the middle (victim) wire.

2.2 Technology, wire geometry, and equivalent circuit parameters for topmost layer interconnect. Values in the top eight rows are from the International Technology Roadmap for Semiconductors (ITRS) document [1]. Values listed in the next three rows are from Mui et al. [2]. The values for the self and coupling capacitances were extracted using the FastCap tool, and the value for $r_i$ was calculated using the formula $r_i = \rho_{Cu}/(w_i \cdot t_i)$, where $\rho_{Cu} = 2.2 \times 10^{-8}\ \Omega\cdot\mathrm{m}$. Values of $h$ and $k$ were found using expressions given in Section 2.1.5, and $r = c_{i,i\pm1}/(C_{line} + h \times C_0)$.

2.3 Configuration of our target system and benchmarks. This processor-memory system configuration is based on the Alpha 21264 processor.
3.1 Comparison of normalized energy dissipated in wire subsegments obtained using our model and Cadence Spectre simulations for 10 subsegments.

3.2 Maximum wire temperatures in °C recorded during a simulation of one billion committed instructions for data and instruction buses using 130 nm and 45 nm parameters.

4.1 Maximum wire temperatures recorded for the ALU result bus. Ambient temperature is 318.15 K.

4.2 Performance impact expressed as percentage IPC degradation.

5.1 Optimization scenarios considered in this work.

5.2 Correlation coefficients $r_{xy}$ between test and training set data for various signaling schemes discussed in Section 5.4.1. Since $r_{xy}$ values are close to 1, our training and test sets are well correlated.

5.3 Number of iterations and running times for various problem types and sizes.

5.4 Optimal signaling and ordering obtained for workload-specific design of the data bus (0=LSB, 63=MSB). Markers denote the signaling modes org, inv, trs, itr, and mm.

5.5 Thermal Optimization Results. Peak wire temperatures (K) in data and instruction buses for SBOS scheme with and without thermal constraints (TC) applied during optimization. The methodology described in Section 5.4.5 was used to obtain the trade-off curves in Figures 5.16-5.20 and the wire permutations that resulted in bus energy reduction E t closest to 0.5(1 - BEL) were chosen from each benchmark's tradeoff on curve. Results shown here are for detailed thermal simulations with this permutation.

LIST OF FIGURES

1.1 Gate and interconnect delay scaling for current and future nanometer-scale technologies. Local interconnects scale with gate delay whereas global interconnect delays do not [3].
1.2 Interconnect power dissipation due to global and local wires. Global lines are responsible for 21% of total dynamic power dissipation at 130 nm [4].

1.3 Projected wire temperature rise in multi-layer interconnects for various technologies under worst-case conditions. Global metal lines will be the hottest, with temperatures expected to reach as much as 209°C in 45 nm technology [5].

1.4 Pipeline stages and loops in a typical out-of-order processor. More frequently used loops like fetch, LSQ, and bypass are affected strongly by wire delay.

1.5 Power dissipation in Intel processors showing an exponential trend [6]. Since 2001, low-power and power management techniques have been used widely in microprocessors and have helped slow down the trend somewhat.

2.1 Layout of wires routed in the topmost layer metal. Self and coupling capacitances are shown. The bottom plate represents the VDD/GND plane.

2.2 Wire segment of length $l_{opt}$ between two repeaters.

2.3 Distribution of self and coupling capacitance values for the middle wire of a 32-bit bus extracted using the FastCap tool [7]. The plotted components are the self capacitance of the wire; Cc1 = coupling capacitance between the wire and its adjacent neighbor; Cc2 = coupling capacitance between the wire and a non-adjacent wire with 1 wire between them; Cc3 = coupling capacitance between the wire and a non-adjacent wire with 2 wires between them; Cc_rest = sum of coupling capacitances between the wire and other wires with 3 or more wires between them. For current and near-future ITRS technology nodes (up to 45 nm), non-adjacent coupling capacitances are not negligible: they contribute approximately 8-10%.

3.1 Distributed-RC model of the wire segment divided into n subsegments.
3.2 Figure shows the view of different thermal structures of a C4/CBGA chip and the primary and secondary heat transfer paths.

3.3 Thermal model. (a) Complete equivalent thermal-RC network for a 5-wire bus. PIJC = Pék = = 5,k, $R_{1,k} = R_{2,k} = \dots = R_{5,k}$, $C_{1,k} = C_{2,k} = \dots = C_{5,k}$, and $P_{1,k}, P_{2,k}, \dots, P_{5,k}$ are bus-activity dependent in the model shown. (b) Geometry for calculating equivalent thermal resistances for a wire based on previous work of Chiang et al. The lightly shaded regions and arrows represent heat flow between the conductors or between layers (from a hotter to a cooler one).

3.4 Steady state thermal equivalent circuit for three wires. Heat transfer between wires is modeled by $R_{inter}$ and heat loss to surroundings by $R_{th}$. $P_i$ represents power dissipated in each wire due to switching activity, and it can be found using a microarchitecture-level simulator.

3.5 Total energy dissipated in a 64-bit data bus for various benchmarks. 'Cc1 only' represents the existing energy models, which consider only self and adjacent coupling capacitances. 'Cc1+Cc2+Cc3' represents our model, which considers self capacitances, adjacent coupling capacitances (Cc1), and two non-adjacent capacitances (Cc2 and Cc3) on each side. The % energy mismatch shown by the line is plotted with respect to the right-hand side Y-axis.

3.6 Total energy dissipated in a 128-bit instruction bus for various benchmarks. The % energy mismatch shown by the line is plotted with respect to the right-hand side Y-axis.

3.7 Total energy dissipated in a 64-bit data bus with various encoding schemes. 'Self' denotes self energy, 'C/D' denotes the coupling charge/discharge energy, and 'Toggle' denotes the coupling toggle energy dissipation.
'Cc1 only' refers to existing energy models that consider self and adjacent coupling capacitance only, and 'Cc1+Cc2+Cc3' refers to our energy model that considers self, adjacent coupling, and two non-adjacent coupling capacitances.

3.8 This plot shows average energy dissipation and wire temperature of the bus for a simulation interval of 10 billion cycles. The continuing temperature rise can be clearly observed.

3.9 Plots show the wire temperature rise recorded for benchmarks gcc and gzip for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

3.10 Plots show the wire temperature rise recorded for benchmarks mcf and lucas for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

3.11 Plots show the wire temperature rise recorded for benchmarks ammp and applu for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

3.12 Plots show the wire temperature rise recorded for integer benchmarks gcc and gzip for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

3.13 Plots show the wire temperature rise recorded for integer benchmarks mcf and lucas for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

3.14 Plots show the wire temperature rise recorded for integer benchmarks ammp and applu for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.
3.15 A three-dimensional plot showing spatial and temporal variations in wire temperature for the lower-order 32 bits of the load/store data bus for the gcc benchmark.

3.16 Frequency distribution of maximum wire temperature gradients for 130 nm and 45 nm processor wires.

4.1 Distribution of maximum wire temperature gradients in result bus wires for the 130 nm processor.

4.2 The number of temperature-induced violations per hundred bus references occurring across ten benchmark programs in a 130 nm processor.

4.3 The number of temperature-induced violations per hundred bus references occurring across ten benchmark programs in a 45 nm processor.

4.4 This plot shows the frequency of occurrence of five different crosstalk conditions on the bus. See Section 4.3.2 and Table 2.1 for an explanation of these crosstalk conditions. The crosstalk condition determines the actual propagation delay without considering thermal effects.

4.5 Figure shows the percentage of temperature-induced delay violations that correspond to a given crosstalk condition.

5.1 Markov model-based signaling technique. (a) A 4-bit prediction table for the Markov model for bits 0-7 of the data bus obtained by analyzing training set benchmarks. Depending on which bits are selected for Markov model signaling, the corresponding row of the table can be translated to hardware using logic minimization tools. (b) Examples of sending end hardware that would be required for 2 bits (0 and 7) assuming these are chosen to be signaled using the m scheme. As can be seen, the logic overhead required for m signaling is minimal.

5.2 Sample peak wire temperature versus bus energy trade-off curve. The thermal optimization steps can be used to obtain curves similar to the one shown here.

5.3 Routing strategy and overheads for re-ordering.
(a) Definition of the routing channel. (b) Matching diagram showing ten crossing points. (c) Two-layer routing strategy using eight horizontal tracks and ten vias.

5.4 Transition densities for the 13 integer SPEC CPU2000 benchmarks for the 64-bit data bus.

5.5 Transition densities for the 13 floating-point SPEC CPU2000 benchmarks for the 64-bit data bus.

5.6 Fraction of bus energy dissipated in self and coupling (charge/discharge+toggle) transitions for the 32-bit data address bus in the Alpha 21264 target system while running SPEC CPU2000 programs.

5.7 Fraction of bus energy dissipated in self and coupling (charge/discharge+toggle) transitions for the 32-bit instruction address bus in the Alpha 21264 target system while running SPEC CPU2000 programs.

5.8 Fraction of bus energy dissipated in self and coupling (charge/discharge+toggle) transitions for the 64-bit data bus in the Alpha 21264 target system while running SPEC CPU2000 programs.

5.9 Fraction of bus energy dissipated in self and coupling (charge/discharge+toggle) transitions for the 128-bit instruction bus in the Alpha 21264 target system while running SPEC CPU2000 programs.

5.10 Energy dissipation results for general-purpose design for the 64-bit data bus. Statistics collected on 13 training set benchmarks were used to obtain the optimal static encoding schemes. These were tested on 13 other (test set) benchmarks. Average energy reductions are MES: 7.81%, MEBO: 11.91%, and SBOS: 20.04%.

5.11 Energy dissipation results for general-purpose design for the instruction bus. Average energy reductions are MES: 10.96%, MEBO: 19.85%, and SBOS: 38.78%.

5.12 Energy dissipation results for workload-specific design of the 64-bit data bus.
Statistics collected for SimPoint samples from 13 training set benchmarks were aggregated and used to obtain the optimal static encoding schemes. These were then tested on a non-overlapping sample from the same set of benchmarks. The average energy reductions are MES: 9.73%, MEBO: 15.97%, and SBOS: 22.79%.

5.13 Energy dissipation results for workload-specific design for the 128-bit instruction bus. The average energy reductions are MES: 10.43%, MEBO: 21.25%, and SBOS: 40.77%.

5.14 Energy reduction results for program-specific design. Statistics collected for SimPoint samples of each benchmark were used to obtain the optimal static encoding schemes specific to that benchmark for our schemes, MES, MEBO, and SBOS. These were then tested on the same sample. Results for dynamic encoding schemes BI and OEBI proposed in previous work are also shown. The average energy reductions for the data bus are BI: 4.19%, OEBI: 1.58%, MES: 19.7%, MEBO: 23.25%, and SBOS: 30.2%.

5.15 Energy reduction results for program-specific design. Statistics collected for SimPoint samples of each benchmark were used to obtain the optimal static encoding schemes specific to that benchmark for our schemes, MES, MEBO, and SBOS. These were then tested on the same sample. Results for dynamic encoding schemes BI and OEBI proposed in previous work are also shown. The average results for the instruction bus are BI: 2.63%, OEBI: 5.32%, MES: 21.7%, MEBO: 32.1%, and SBOS: 52.1%.

5.16 Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for ammp and crafty. The permutation selected for each benchmark was the one

5.17 Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for eon and gcc.

5.18 Energy vs. temperature trade-off curves.
Plots show the energy vs. temperature tradeoff curves obtained for the data bus for gzip and lucas.

5.19 Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for mesa and mgrid.

5.20 Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for swim and twolf.

6.1 Three-bit crosstalk analyzer truth table and circuit. (a) Truth table showing only the ON-set. "-" indicates a don't care input. (b) Logic circuit implementing the truth table.

6.2 Variable cycle bus. (a) Complete bus crosstalk analyzer for an n-bit bus. (b) Sender and receiver logic for VCB.

6.3 Crosstalk reduction results for workload-specific design of the 64-bit ALU result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO: 24.89% and MCBOS: 30.61%. (b) Average reductions in number of 1+3r cycles. For MCBO: 19.21% and MCBOS: 23.42%.

6.4 Crosstalk reduction results for general-purpose design of the 64-bit ALU result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO: 21.22% and MCBOS: 29.35%. (b) Average reductions in number of 1+3r cycles. For MCBO: 16.77% and MCBOS: 20.29%.

6.5 Reduction in the number of cycles taken to transmit the information with MCBO and MCBOS applied to the result bus. (a) Workload-specific optimization. (b) General-purpose optimization.

SELECTED LIST OF SYMBOLS

$C_{i,k}$  Thermal capacitance of the kth subsegment of the ith wire
$\mathcal{R}_{i,k}$  Thermal resistance along the heat transfer path of the kth subsegment of the ith wire
$\theta_0$  Ambient temperature inside the computer box (45°C)
$\varepsilon_r$  Relative permittivity of dielectric
$C_0$  Capacitance of minimum size inverter in fF
$c_{i,i\pm1}$  Adjacent coupling capacitance per unit length in pF/m
$c_{i,j}$  Coupling capacitance between line $i$ and any other line $j$, $i \neq j$
$C_{line}$  Self/area capacitance of wire per unit length in pF/m
$f_{clk}$  Clock frequency
$k_{ild}$  Thermal conductivity of dielectric
$R_0$  Resistance of minimum size inverter in kΩ
$r_i$  Resistance of wire per unit length in kΩ/m
$t_{ild}$  Thickness of the inter-layer dielectric
$t_i$  Thickness of wire $i$
$t_p$  End-to-end propagation delay of a wire
$V_{DD}$  Supply voltage
$w_i$  Width of wire $i$

CHAPTER 1

INTRODUCTION AND OVERVIEW

High-speed systems and circuits are increasingly facing the limitations posed by shrinking physical dimensions of transistors and their interconnections [8]. As circuits become denser, smaller transistors naturally speed up. Interconnects, in general, do the reverse: they introduce delays that reduce or even cancel the speed gains due to smaller transistors. The problems due to interconnects are exacerbated by the fact that parasitic resistance, inductance, and capacitance (RLC) effects increase as wires scale to smaller dimensions, which in turn aggravates delay and power consumption and causes signal integrity and reliability problems. Thus, on-chip interconnect design has been recognized as one of the most important challenges to address in nanometer-scale integrated circuits [9,10].

1.1 Interconnect Scaling Trends: Delay, Power, Temperature, and Reliability

According to the data available from the International Technology Roadmap for Semiconductors (ITRS) documents, the intrinsic gate delay improved ten times, from 10 ps to 1 ps, in the 20 years between 1980 and 2000. However, in the same period, the interconnect delay in a 1 mm line degraded 100 times, from 1 ps to 100 ps [1]. This growing disparity between gate and interconnect delays is also highlighted in Figure 1.1 for current and future technologies [3]. The figure shows that while local interconnect delays scale with gate delays, global interconnect delays do not.
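To make the size of this disparity concrete, the following minimal Python sketch expresses the delay of a 1 mm line in units of gate delay for both years. It uses only the four delay values quoted above; the function and variable names are illustrative and are not part of the original work.

```python
# Delay figures quoted in the text above (attributed there to ITRS data [1]).
# All names in this sketch are illustrative, not taken from the dissertation.
GATE_DELAY_PS = {1980: 10.0, 2000: 1.0}    # intrinsic gate delay
WIRE_DELAY_PS = {1980: 1.0, 2000: 100.0}   # delay of a 1 mm interconnect line


def wire_to_gate_ratio(year):
    """Delay of a 1 mm line expressed in units of gate delay."""
    return WIRE_DELAY_PS[year] / GATE_DELAY_PS[year]


def disparity_growth(start=1980, end=2000):
    """Factor by which the wire-to-gate delay ratio grew between two years."""
    return wire_to_gate_ratio(end) / wire_to_gate_ratio(start)


if __name__ == "__main__":
    for year in (1980, 2000):
        print(year, "wire/gate delay ratio:", wire_to_gate_ratio(year))
    print("growth of the ratio:", disparity_growth())
```

With these inputs, a 1 mm line goes from a tenth of a gate delay to a hundred gate delays, so the ratio grows by about three orders of magnitude.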
Such trends have forced costly performance compromises, like the allocation of two out of twenty pipeline stages for communication in the Pentium-4 microprocessor [11].

[Figure 1.1 plot: delay versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for global interconnect without repeaters, global interconnect with repeaters, local interconnect (M1, M2), and gate delay (FO4). Source: ITRS.]

Figure 1.1. Gate and interconnect delay scaling for current and future nanometer-scale technologies. Local interconnects scale with gate delay whereas global interconnect delays do not [3].

Interconnects are also responsible for about 50% of the power dissipation, as shown by results from studies on a 130 nm Intel microprocessor [4]. Figure 1.2 shows the distribution of power dissipation by the type of net/wire. As can be seen, global signal lines account for 34% of the total interconnect power dissipated and hence 21% of the total dynamic (switching-related) power dissipation at 130 nm.

[Figure 1.2 charts: Interconnect Power and Total Dynamic Power; one recoverable label reads "Local Signals, 37%".]

Figure 1.2. Interconnect power dissipation due to global and local wires. Global lines are responsible for 21% of total dynamic power dissipation at 130 nm [4].

Due to increased Joule heating in the global wires, their temperatures are also increasing alarmingly. The spatial temperature distributions along the vertical direction from the Silicon (Si) substrate, obtained using finite element models and simulations, are shown in Figure 1.3 [5]. This analysis assumed that all wires in the interconnect stack carried currents with the maximum rated current density for that technology, which represents an extreme worst case. Nevertheless, the results show how temperatures will be distributed across interconnect layers. It can be observed that as technology scales down, the temperature gradient between the top metal lines and the substrate becomes larger.
Global metal lines were found to be the hottest in all technologies using this worst-case analysis, with temperatures reaching as much as 209°C in 45 nm technology [5]. For the 35 nm node, the temperature gradient is smaller than that for the 50 nm node due to the larger fraction of metallization (Cu) layers compared to inter-layer dielectric (ILD) layers, an artifact of the ITRS scaling scenario that was used for this analysis. It should be noted that the total height of the (Cu+ILD) layers decreases as scaling continues, due to the smaller vertical dimensions of wires and insulators, despite the increase in the number of metal layers. It can also be observed that the maximum chip temperature occurs in the long global wires, which are most prone to electromigration failures and also give rise to the highest RC delays. This has important implications for both reliability and performance.

[Figure 1.3: temperature versus distance from substrate (μm); the global wires reach 209°C; panel label: 50 nm node.]
Figure 1.3. Projected wire temperature rise in multi-layer interconnects for various technologies under worst-case conditions. Global metal lines will be the hottest, with temperatures expected to reach as much as 209°C in 45 nm technology [5].

1.2 Material, Process, and Architectural Advances

Many methods, such as utilizing Cu and low-k insulators [12-14], short-wire architectures [15,16], on-chip networks [17], optical interconnects [18], and three-dimensional interconnect structures [19,20], have been suggested to help alleviate the impact of interconnect scaling on current and future nanometer-scale fabrication. The pros and cons of these techniques are discussed next.

Material and process enhancements: Copper interconnects in high-speed microprocessors were introduced by IBM in its 400 MHz PowerPC 750 processor. Although the resistivity of Copper is 40% less than that of Aluminum, the performance improvement from using the former is limited to about 15% [14].
The thickness and resistivity of the Tantalum (Ta) liner, used in the dual-damascene process for Copper electrodeposition, also limit the performance advantage of Copper interconnects. Low-k dielectrics also help improve chip performance. For example, the performance of a metal wire improves 25% for 0.25 μm technology using a k=2.5 dielectric material compared to conventional silicon dioxide, which has k=3.9. However, the use of Copper metal and low-k dielectrics is known to aggravate thermal issues in interconnects and cause reliability problems, during both fabrication and chip lifetime [21].

Novel architectures: Short-wire architectures such as systolic arrays can be employed to overcome some of the problems imposed by long global interconnects [16]. Although these architectures are not applicable to all microprocessors, they can be useful in specific applications, such as pattern recognition, multiprocessor systems, and arithmetic computation. On-chip networks can be used instead of global interconnects to reduce global interconnect congestion [22]. Since most of the global wires are not utilized in every clock cycle, it is more efficient to send packets over a global network than to send signals over dedicated global wires. However, this requires a completely new architecture, tools, and design methodology, different from those of conventional microprocessors.

Optical interconnections and 3D integration: It has been shown that optical interconnections have higher bandwidth and consume less power for long-distance communication than electrical interconnections [23]. However, because of their incompatibility with standard CMOS technology, optical interconnects have not been widely deployed in current microprocessors. Their primary application has been restricted to clock distribution networks in some designs [24].
Three-dimensional interconnection schemes are also expected to significantly reduce global wiring requirements and to have a significant impact on reducing interconnect delay and power [25]. However, vertical pitch limitations resulting from alignment tolerances in the bonding of wafers [26] and heat removal capacity limitations [27] are some of the problems limiting the use of three-dimensional architectures.

1.3 Impact of Interconnects on Architecture and VLSI

Interconnect-related problems have affected chip design to such an extent that the product roadmaps of almost all chip design companies have been drastically redrawn, as it is becoming evident that high-speed processors with clock frequencies exceeding 10 GHz are no longer economically viable due to restrictions imposed by power, temperature, and reliability [28-30]. Interconnect scaling, power, and performance issues affect the very first architectural design decisions of today's processors [31,32].

1.3.1 Wire Delay

Processor clock speeds have increased continuously, due to faster transistors and also due to deeper pipelines. However, since global wire delays (for example, the delay of register bypass wires) scale much more slowly than transistor delays, deeper superscalar pipelines have experienced increased latencies and a significant degradation in instruction throughput. Several studies have pointed out that rising wire delays dictate that deeper pipelines will not perform better than shallower ones in future technologies, and also conclude that superscalars do not have sufficient parallelism to tolerate the relative rise in wire delays [33,34]. Hence the industry trend is toward multi-core and multithreaded architectures [35].

[Figure 1.4: pipeline diagram with stages FETCH, DECODE, RENAME, ISSUE, REG. READ, EXEC, and MEM, annotated with the fetch, rename, issue, EX bypass, LSQ, MEM-EX bypass, WB-EX bypass, branch misprediction, and load mis-speculation loops.]
Figure 1.4. Pipeline stages and loops in a typical out-of-order processor.
More frequently used loops like fetch, LSQ, and bypass are affected strongly by wire delay.

We briefly examine next why wire delay trends affect architectural decisions. An out-of-order superscalar pipeline is composed of two in-order half-pipelines, called the front-end and back-end, connected by the issue queue. Figure 1.4 shows this configuration and the various loops in the pipeline [35]. Wire delay affects many of these loops significantly, as discussed next.

The fetch loop arises because the current program counter (PC) is used to predict the next PC. The delay of this loop includes the instruction bus and cache delays. The rename loop is due to the dependence between a previous instruction assigning a rename tag and a later instruction reading the tag, and the issue loop is due to the dependence between a producer instruction and the wakeup of a consumer instruction. The rename and issue loop delays are sensitive to the delay on the tag lines. The load misspeculation loop is due to the use of speculation and the need for load-miss replay. The load/store queue (LSQ) loop is due to the dependence between a previous store and a later load to the same address, and includes the load/store bus and data cache delays. The various bypass loops (EX-EX, EX-MEM, and Writeback-EX) are all affected by the wire delays on the ALU result bus.

Also, the more frequently a loop is used, the higher its impact on performance. The fetch, rename, issue, and bypass loops are all fairly frequent and hence have the highest impact. The load misspeculation and branch misprediction loops, which are used only upon load misses and branch mispredictions, respectively, are relatively less frequent and have less impact.

1.3.2 Power and Temperature

Power has become a first-class constraint in the design of nanometer-scale ICs. Figure 1.5 shows the trend observed in the power dissipation of Intel microprocessors [6].
In 2001, it was predicted that, at the scaling rates of the time, the power density in microprocessors would reach that of the Sun by 2015, following an almost exponential trend [6]. Since then, various steps have been taken to reduce power dissipation in logic and memories with techniques at various levels of abstraction. These have reduced the trend to a linear one, as shown by the dotted line in Figure 1.5.

[Figure 1.5: power-dissipation trend for Intel processors, including the Pentium processors, with the years 2000 and 2004 marked.]
Figure 1.5. Power dissipation in Intel processors showing an exponential trend [6]. Since 2001, low-power and power-management techniques have been used widely in microprocessors and have helped slow the trend somewhat.

Among the three sub-systems of a high-performance processor (computation, storage, and communication), the communication or interconnect sub-system, which carries address, instruction, data, and control signals, is still responsible for the bulk of the on-chip power dissipation as discussed earlier, in part due to interconnect scaling trends. With increasing interconnect power dissipation, wire temperatures rise as a result of Joule heating, wire resistance increases due to temperature-dependent resistivity, which degrades performance further, and wire reliability decreases sharply due to electromigration-induced breakage. Even with the advent of multi-core processors, clock frequencies and datapath widths have continued to increase, and hence all of the above effects are bound to worsen further. Interconnect power dissipation and temperature therefore remain among the primary issues facing microarchitects and VLSI designers.

Popular low-power and power-management techniques like fine-grained clock gating and power gating can also significantly affect on-chip temperature profiles by creating localized hot spots and/or temperature gradients on the chip.
These gradients cause delay variability, setup and hold time violations and, in the worst case, failure of interconnects that are routed across regions with varying temperatures. Designing for these issues is almost impossible because accurate techniques to estimate temperature gradients in interconnects are currently unavailable. Thus, the study of the thermal impact of architectural techniques is also becoming important.

1.3.3 Computer-Aided Design Tool Requirements

In conventional ASIC design, signal and power integrity were checked in the later stages of the design cycle and the design was modified if these checks were found to be unsatisfactory. However, with the explosion in the number of transistors and the highly complex designs of nanometer-scale technologies, iterating between upstream (architecture or high-level) design changes and layout to achieve design closure has become increasingly futile, leading to longer time-to-market schedules and higher design costs [9]. The design-productivity gap, exemplified by the lack of proper CAD tools to identify and correct issues at an early stage, exacerbates this problem. While the push toward ever-higher performance still drives the semiconductor industry, there is growing awareness that winning designs need to balance multiple objectives: high performance, low power, low cost, robustness (noise immunity), and reliability. As such, it is becoming imperative to: (1) model interconnect-related effects accurately and efficiently for different system architectures (superscalar, multi-core, and network-on-chip) and fabrication technologies (130 nm, 90 nm, etc.) and (2) design the interconnect system, at an early stage, to alleviate or mitigate these effects without incurring unsustainable performance, energy, and/or area/cost overheads.

1.4 Drawbacks in Existing Techniques

Next, we discuss some drawbacks of existing models and design techniques for signal interconnects.
First, almost all existing work addressing signal interconnect analysis, design, and optimization is not activity-aware, i.e., such interconnect models and design techniques are not developed with accurate knowledge of the characteristics of the data transmitted on these interconnects. An average wire switching factor, such as the 0.15 suggested in [36], is used to estimate energy dissipation, wire temperature, delay impact, and/or reliability impact [37,38]. Such average estimates lead to over-design because switching activity in interconnects is actually information- and time-dependent. It depends on the type of information (address, instruction, data, or control) being transmitted because the information type influences switching activity factors; for example, the activity factor is expected to be higher for data and instruction streams since they are more random in nature than addresses. It also varies with time because, during execution of most typical programs, there are substantial periods when a bus may remain idle; for example, when there are no level-one (L1) cache misses, the bus connecting to the level-two (L2) cache will remain idle. These idle cycles help bring down wire temperatures and hence may reduce wire delay and electromigration impact. Hence, to facilitate interconnect design that can be tuned to the requirements of different architectures, activity-aware modeling and design optimization techniques are necessary.

Second, as mentioned earlier, increasing the number of iterations between high-level design and physical layout to achieve design closure has become exorbitantly costly, time-consuming, and impractical in nanometer designs. Hence, growing emphasis is being placed on making accurate early-stage design decisions using microarchitecture-level simulations of benchmark programs.
Interconnect models that have been built into existing execution-driven simulators lack the detail needed to accurately estimate the impact of interconnect power dissipation, temperature, and related effects, since many do not consider the influence of wire coupling and thermal heat dissipation paths. For example, the amount of energy dissipated due to the parasitic coupling capacitance between wires is much greater than the energy dissipated due to the area capacitance. Similarly, thermal coupling or heat transfer through the inter-metal dielectric occurs between adjacent wires, affecting temperatures in both wires.

1.5 The Need for Activity-Aware Design

Existing techniques that target bus energy and crosstalk reductions perform well only when the patterns transmitted on the bus are randomly distributed in time. However, this is rarely the case in actual microprocessor buses. Information transmitted on these buses shows high degrees of correlation across programs as well as across sections of the same program, due to the presence of temporal, spatial, and value localities. Temporal locality describes the likelihood that a recently referenced item will be referenced again soon, while spatial locality describes the likelihood that a close neighbor of a recently referenced item will be referenced soon. Value locality refers to the likelihood of a previously seen value recurring repeatedly in the information stream.

Address, instruction, and data streams in microprocessor buses exhibit substantial amounts of temporal and spatial locality due to the reasons discussed next. Instruction addresses issued by the processor to the L1 cache are typically sequential, except when branches or jumps occur, and even then the target addresses are typically not very far away from the last address. This is the reason why many instruction sets use PC-relative addressing with shorter-than-full-word-size offsets for branch and jump instructions.
Data addresses issued by the processor also exhibit these localities, primarily because loops scan data arrays that are placed in contiguous memory locations.

The dynamic instruction stream executed by a processor corresponds to the instruction addresses issued by the fetch unit, and hence instructions exhibit the same temporal and spatial locality as instruction addresses. Also, not all instructions, instruction sequences, opcodes, register operands, and immediate constants are present equally frequently in the dynamic instruction mix, leading to more predictability in the instruction stream. Such redundancies are present because all programs share certain basic characteristics: procedures and procedure calls, branches every few instructions (typically every six instructions [39]), and loops and if-then-else clauses that lead to repetitive instruction sequences.

Data buses in the processor, such as load/store and ALU result buses, also exhibit temporal and spatial locality, although to a lesser extent than address and instruction buses. There is an additional element of redundancy present in the magnitude of values communicated by these buses and stored in registers, data caches, and/or CAM structures in the processor core. This redundancy is due to the fact that for any given type of data (character, integer, floating-point, etc.) not all values are equally likely. For instance, many programs do not tend to use the entire range of possible integer values; rather, the values used tend to be concentrated around certain values, especially zero. For such small-magnitude two's complement numbers, most high-order bits of the data bus are likely to be either all zero (positive) or all one (negative) due to sign extension. The concept of value locality also adds to the redundancies present in data buses.
For example, the number of times each static load (or store) instruction retrieves a value from (or writes a value to) memory that matches a previously seen value is quite high. Studies have shown that this fraction is around 50% for most superscalar processors running standard benchmark applications [40].

The presence of temporal, spatial, and value localities in information streams opens opportunities for activity-aware design of high-performance buses, i.e., design that is tailored to the unique characteristics of the different types of data transmitted on these buses, as well as to the typical applications executed on the processor. Such design can be achieved with the following steps: (1) profile the information transmitted on target buses using cycle-accurate microarchitecture-level simulators for a representative workload, (2) identify opportunities by correlating, for example, the number of self and coupling transitions with the objective function (bus energy, temperature, crosstalk, etc.), and (3) design techniques that minimize the value of the objective function. Although the technique is designed using a representative workload, it is likely to work well for any real application in the same domain due to the similarities in program characteristics. In fact, computer architects continue to use similar methodologies to design efficient branch, load-value, and other prediction-based techniques to improve instruction-level parallelism in modern superscalar processors.

1.6 Our Contributions

As presented earlier, accurate modeling and cost-effective design of global signal interconnects is a critical issue in current and future nanometer-scale design. Since interconnect performance (wire delay) and energy dissipation depend closely on the switching characteristics of the data stream, activity-aware modeling and design approaches are important.
Furthermore, the introduction of Cu and low-k dielectrics exacerbates problems such as wire self-heating, which need to be modeled along with the impact of temperature on wire delay variability. Finally, newer design techniques are needed to deal with rising interconnect power dissipation and temperature since existing techniques are not effective in most real architectures, workloads, and applications.

The objective of this research is to provide a methodology to model and design signal interconnects in nanometer-scale ICs and to address power, temperature, and performance concerns during early-stage design. To accomplish this goal, four research tasks were identified and novel contributions are made in each.

1.6.1 Activity-Aware Design Methodology

Our research is perhaps the first attempt to propose and examine activity-aware design techniques for global signal buses. Existing techniques rely on worst-case estimates to design high-performance buses, resulting in overly pessimistic energy, temperature, and clock cycle time estimates. Due to the lack of accurate models suitable for early-stage design exploration, interconnect design is done late in the design cycle, offering very limited opportunities to optimize the architecture for performance, power, and cost. In contrast, the methodology we propose examines typical applications, collects statistics for different types of data, and optimizes the design of target buses, all using early-stage simulation. Thus interconnect design can be completed early in the design cycle and can be used as a parameter in design space exploration.

1.6.2 Accurate Energy, Temperature, and Delay Modeling

We introduce accurate modeling techniques to help estimate the impact of activity-dependent interconnect energy dissipation, wire temperature rise due to Joule heating, and delay variation due to temperature, using a microarchitecture-level simulator.
In addition to self capacitance, our model incorporates the effects of capacitive coupling between adjacent as well as non-adjacent pairs of wires and of repeater insertion on switching energy, uses lateral heat transfer between adjacent wires when estimating wire temperatures, and estimates wire temperature gradients and their impact on wire delay; none of these were available in earlier models. Simulations using our model for the 130 nm technology node show that, during the time interval taken to commit one billion instructions in the pipeline, high-performance bus wire temperatures rise by 10-37°C for various SPEC CPU2000 benchmarks. This rise is solely due to Joule heat dissipated by wire switching activity. In a future 45 nm technology node, the wire temperature rise for the same set of benchmarks and simulation sample was found to be between 20 and 58°C. We observed that instruction and data bus wires attained absolute temperatures in the ranges 80.3-104°C and 97.6-123.7°C in 130 nm and 45 nm processors, respectively, during the course of our simulation, showing that signal lines attain significant temperatures too. Significant wire temperature gradients, most commonly with magnitudes between 16 and 25°C, were found between the sending and receiving ends of the wires during the course of simulation. Notable correlation was found between energy dissipation behavior and wire temperature rise in buses across time; short, intermittent bursts of high-energy switching activity trigger steep changes in temperature.

We also developed models that track the impact of changing wire temperature on timing/delay violations occurring in global signal buses during microarchitecture-level exploration.
Results show that, for a 130 nm processor with no power or thermal management, the number of temperature-induced clock cycle time violations in an ALU result bus, which is on the critical path, is 2.27 per hundred bus references, averaged over ten programs in the SPEC CPU2000 workload. It increases to an average of 6.20 per hundred bus references for the same processor at the 45 nm technology node. Our analysis also shows that conventional techniques like bus encoding, which seek to reduce energy dissipation and potentially wire temperatures, have limited impact on alleviating temperature-induced delay violations.

1.6.3 Profile-Guided Optimization Techniques

Efforts to reduce bus energy dissipation, particularly in long global signal buses, are becoming increasingly important in nanometer-scale technologies as interconnects continue to aggravate performance, power, and cost. While dynamic encoding schemes have been proposed to reduce bus switching energy, they do not work well for correlated traffic such as that found in typical workloads like the SPEC CPU2000 benchmarks. Hence, we develop static bus encoding techniques and present a methodology to design such schemes in an optimal manner. Being completely static, such schemes can be designed during early-stage microarchitectural exploration and incur minimal run-time hardware area/cost, power, and latency compared to dynamic encoding logic. We use a microarchitecture-level simulator, profile representative samples of SPEC CPU2000 benchmarks to collect data, and use integer linear programming to design our encoding scheme. Results show that, for the SPEC CPU2000 workload, i.e., workstation/PC class processors, total bus energy dissipation was reduced by as much as 22.79%/40.77% for data/instruction buses when our best static encoding scheme was applied. In contrast, existing dynamic bus encoding techniques yield only 4.19%/5.32% reductions for the same type of bus traffic.
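The idea of choosing a static encoding from a traffic profile can be illustrated with a deliberately small sketch. The code below is not the ILP formulation used in this work: it assumes a toy trace, considers only bit ordering (not bit signaling), uses exhaustive search over permutations in place of integer linear programming, and scores an ordering with a simple adjacent-coupling cost in which a toggle counts twice as much as a single-line charge or discharge. The function names `coupling_cost` and `best_static_order` are hypothetical.

```python
from itertools import permutations


def coupling_cost(trace, order):
    """Adjacent-line coupling activity of a trace under a given bit ordering.

    order[k] is the index of the original bit routed on physical line k.
    For each adjacent pair of lines, the cost is the absolute change in the
    difference of their logic values: 1 for a charge/discharge on one line,
    2 for a toggle (01 -> 10 or 10 -> 01), 0 otherwise.
    """
    cost = 0
    for prev, nxt in zip(trace, trace[1:]):
        p = [(prev >> b) & 1 for b in order]
        n = [(nxt >> b) & 1 for b in order]
        for k in range(len(order) - 1):
            cost += abs((n[k] - n[k + 1]) - (p[k] - p[k + 1]))
    return cost


def best_static_order(trace, width):
    """Exhaustively pick the ordering with the lowest coupling cost.

    Feasible only for very narrow buses (width! candidates); it stands in
    here for a proper optimizer.
    """
    return min(permutations(range(width)),
               key=lambda order: coupling_cost(trace, order))


if __name__ == "__main__":
    # Toy profile: bits 0 and 2 always switch together, bit 1 switches
    # in the opposite direction.
    profile = [0b101, 0b010, 0b101, 0b010]
    order = best_static_order(profile, 3)
    print(order, coupling_cost(profile, order))
```

On this toy profile, the search moves the oppositely switching bit to the edge of the bus so that it toggles against only one neighbor, which is the kind of opportunity a profile-guided static scheme is meant to exploit.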
1.6.4 Novel Thermal Optimization Methodology

Apart from bus energy, rising wire temperatures are also becoming an important issue to address in high-performance buses since they affect wire delay and reliability. We propose a first-of-its-kind methodology to design temperature-aware encoding schemes by trading off some of the energy gains we obtain with static encoding techniques to achieve wire temperature reduction. In this methodology we add temperature constraints during energy optimization, and our ILP produces a static encoding scheme that reduces maximum/hottest wire temperatures by up to 15.23 K/16.17 K for data/instruction buses while still producing significant total bus energy reductions.

1.6.5 Performance-Oriented Adaptive Bus Design

The rate at which signals can be transmitted in a high-speed processor bus is decided based on the worst-case crosstalk pattern. This pessimistic estimation gives rise to significant performance penalties since the worst case never occurs, or occurs with very low frequency, in actual applications. Hence, we propose an adaptive bus design, called the variable cycle bus (VCB) architecture, that examines incoming data patterns and transmits them using a variable number of clock cycles, improving bus performance significantly. To maximize the effectiveness of our adaptive bus architecture, we propose a profile-guided optimization approach, like the one described earlier in Section 1.6.3, to reorder and signal bits so as to minimize bus crosstalk. Results on SPEC CPU2000 benchmarks, in a general-purpose optimization scenario, show a 29.35% reduction in 1+4r cycles, a 20.29% reduction in 1+3r cycles, and a bus performance improvement of 17.42% for a VCB with a static reordering and signaling technique targeting bus crosstalk minimization.

1.7 Dissertation Outline

The remainder of this dissertation is organized as follows.
Chapter 2 presents background on interconnect analysis and optimization for delay and power and provides a general overview of our experimental methodology and simulation infrastructure. Following that, in Chapter 3 we present the model for estimating activity-driven energy and temperature in processor buses and study the energy and temperature characteristics of data and instruction buses. Then, in Chapter 4, we present the model for estimating data- and temperature-dependent delay variability and examine the impact of delay variability on the performance of a processor in current and future fabrication technologies. Next, in Chapter 5, we discuss novel interconnect optimization techniques to reduce processor bus energy and temperatures. Then, we discuss delay optimization techniques in Chapter 6. Finally, we conclude and present directions for future work in Chapter 7.

CHAPTER 2
PRELIMINARIES

Integrated circuits (ICs) consist of two basic components: transistors and their interconnections. As more and more devices are integrated on a single die, wires or interconnections gain importance and play an important role in determining the speed, area, reliability, and yield of VLSI circuits [41]. In this chapter, we provide a brief introduction to some terminology used in the context of interconnect design and discuss interconnect analysis and optimization methods. We also discuss the role of architecture-level simulators in interconnect analysis and design. Finally, we outline the general experimental methodology followed in our experiments.

2.1 Interconnect Analysis Methods

Interconnect analysis as applied to power and timing seeks to answer three questions: (1) what is the effective loading due to the interconnect, which is needed for driver/repeater sizing to minimize delay and to estimate power dissipation; (2) what are the delay and slew at the receivers;
and (3) what is the effect of switching of this and other neighboring nets on power dissipation and propagation delay? This analysis can be performed with dynamic circuit simulation, in which specific stimuli are applied to the circuits and interconnect in question. Unfortunately, this technique cannot be practically applied to the millions of transistors on a digital integrated circuit. Hence interconnect analysis is performed using simpler models. Interconnects in a VLSI chip can be grouped into three categories, based on their length, as discussed next.

2.1.1 Global, Semiglobal, and Local Wires

Since it is not possible to connect millions of transistors on the die using only one level of interconnect, multi-layer interconnect structures are commonly used. The metal layers closest to the Silicon (Si) substrate are called local interconnects/wires. The next few layers are called semiglobal or intermediate interconnects, and the top layers are called global interconnects. The wires in the global layers are wider and thicker, which yields shorter propagation (or RC) delays since wire resistance, and hence delay, is inversely proportional to the cross-sectional area. Consequently, these layers are used to route high-performance buses in the core of the microprocessor. Wider and thicker wires at higher layers are also used to provide low-resistance power/clock distribution lines to different regions of the chip. Layer assignment, i.e., the decision to route a wire/net in the local, semiglobal, or global layer, is performed based on stochastic wire length estimates [42]. In our research, we are interested in power, temperature, performance, and reliability optimization of longer wires, i.e., the semiglobal and global wires that are used to route high-performance buses. These interconnects are analyzed using the models discussed next.
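As a brief aside on the cross-section argument in this subsection, the dependence of wire resistance on geometry can be written out explicitly. The symbols below ($\rho$ for resistivity, $\ell$ for wire length, $w$ for width, $t$ for thickness) are introduced here only for illustration and are not necessarily the notation used elsewhere in this dissertation.

```latex
R_{wire} = \frac{\rho \, \ell}{w \, t}
\qquad\Longrightarrow\qquad
\frac{R_{wire}(2w,\, 2t)}{R_{wire}(w,\, t)}
  = \frac{\rho \, \ell / (2w \cdot 2t)}{\rho \, \ell / (w \, t)}
  = \frac{1}{4}
```

Doubling both the width and the thickness of a wire of fixed length therefore quarters its resistance. The corresponding change in RC delay also depends on how the wire capacitance varies with geometry and spacing, which this expression does not capture.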
2.1.2 Interconnect Models: RC and RLC

Interconnects, in general, have three important electrical characteristics: resistance (R), capacitance (C), and inductance (L). All three depend on the interconnect geometry and its position relative to the surrounding structures. These parasitics affect circuit performance: capacitance adds load to driving gates; resistance, inductance, and capacitance all add signal delay; and inductive and capacitive coupling between interconnects adds signal noise. The circuit parasitics of a wire are distributed along its length and are not lumped into a single position. As long as the resistive component of the wire is small and the switching frequencies are in the low to medium range, it is meaningful to consider only the capacitive component of the wire and to lump the distributed capacitance into a single capacitor. This is the simple capacitive model, and it is not very accurate. On-chip metal interconnects of more than a few millimeters in length have significant resistance. The π-model lumps the total wire resistance of each wire segment into a single resistor $R$ and represents the total capacitance $C$ as two capacitances of $C/2$ each. This model, called the lumped-RC model, is however pessimistic and inaccurate for long interconnects, which are more adequately represented by a distributed-RC model. In practice, the distributed-RC model is represented as a π-ladder network. Like the resistance and capacitance of an interconnect, the inductance is also distributed over the wire. Thus, a distributed RLC model of interconnects, also known as the transmission line model, is the most accurate approximation of the actual behavior of interconnects.

2.1.3 Effect of Inductance on Global Signal Lines

In spite of shrinking dimensions and increasing clock frequencies in nanometer-scale technologies, it has been shown that inductance can be safely ignored for global signal lines that are longer than 10 mm [2,43]. This is due to various factors discussed next.
First, it has been shown that, for long global signal lines, the signal response to a step input is over-damped when the line is modeled using the complex distributed RLC model. This response can be approximated using a distributed-RC model without significant error [43]. Second, inductance is not as significant a problem in minimum-width global lines as it is in clock and power/ground lines that are several times minimum width. It has been estimated that inductance becomes an issue in a global line only if its width is at least eight times the minimum width [2]. Third, in the high-performance buses that we consider in this research, designers minimize inductive effects by ensuring that current return paths for worst-case input patterns are kept within limits. This is normally achieved by placing power/ground planes above and/or below the layer in which the high-performance bus is routed, and also by routing shield wires in the same layer as the bus [44]. Finally, in recent times, architectural trends have shifted toward improving power/performance (or Watt/MIPS) efficiency by using shorter pipelines and multi-core architectures, rather than just improving performance by increasing clock speed. Thus, in current and future generation microprocessors, clock frequencies are not expected to increase exponentially as was predicted until a few years ago. This trend also contributes to keeping inductive effects in check for global lines.

Due to the reasons outlined above, we do not consider inductive effects in our work. Using an RC model, interconnect energy can be estimated as discussed next.

2.1.4 Energy Estimation

Self transitions are defined as transitions on the self or area capacitance, which is the parasitic capacitance between a bus line and the ground/V_DD plane. Coupling transitions are defined as transitions that occur on the coupling capacitance, which is the parasitic capacitance between two wires on the same plane.
Figure 2.1 shows self and coupling capacitances for a 5-bit bus. Note that there can be two types of coupling capacitances for a wire of length $l_{wire}$: adjacent coupling capacitance $C_{c1} = l_{wire} \times c_{i,i\pm1}$ and non-adjacent coupling capacitance $C_{cx} = l_{wire} \times c_{i,i\pm x}$, where $x \ge 2$. The adjacent coupling capacitance is dominant. Hence, it is the one most often considered in energy and delay estimation, and the other (non-adjacent) capacitances are ignored.

Self transitions in a wire are of two types: charge (0 → 1) and discharge (1 → 0). Coupling transitions in a pair of adjacent wires are of three types: coupling charge transitions (00 → 01 [footnote 1], 00 → 10, 10 → 11, and 01 → 11), coupling discharge transitions (01 → 00, 10 → 00, 11 → 10, and 11 → 01), and toggle transitions (01 → 10 and 10 → 01). Note that if the total number of self and coupling (charge, discharge, and toggle) transitions is reduced, bus energy dissipation will reduce significantly.

Footnote 1: For two lines i and j, this notation gives the initial values of the pair followed by their final values, i.e., $V_i^{ini}V_j^{ini} \to V_i^{fin}V_j^{fin}$.

The energy consumption and energy dissipation of a bus in a given time interval t are given by:

$$E_{cons,avg} = \left[ N_s \cdot C_w + C_{c1} \cdot (N_c + 2 \cdot N_t) \right] \cdot V_{DD}^2 \cdot f_{clk} \cdot t, \tag{2.1}$$

$$E_{diss,avg} = \left[ N_s \cdot C_w + C_{c1} \cdot \left( \frac{N_c}{2} + \frac{N_d}{2} + N_t \right) \right] \cdot V_{DD}^2 \cdot f_{clk} \cdot t, \tag{2.2}$$

where $C_w = C_{line} + C_{rep} = l_{wire} \times c_{line} + C_{rep}$ is the self capacitance of the wire including the contribution of repeaters, $N_s$ is the total number of self-charge transitions recorded on the bus in time interval t, and $N_c$, $N_d$, and $N_t$ are the numbers of coupling-charge, coupling-discharge, and coupling-toggle transitions, respectively, recorded in the same interval. Thus, only charging transitions, which require current flow from the power supply to charge the parasitic capacitances, are used to determine energy consumption, whereas both current flow from the power supply (during charging) and current flow into the ground (during discharging) of the parasitic capacitances account for energy dissipation. Energy consumption and dissipation are equal on average, though their instantaneous values may differ.

2.1.5 Delay and Performance

When designing circuits, it is necessary to ensure that a signal is fully transmitted across a wire in a given time. This time should be at least the propagation delay of the wire, which depends on wire and driver sizes and also on the interaction with neighboring wires, referred to as inter-wire crosstalk. Due to crosstalk, the propagation delay $t_p$ of a wire k (called the victim), which is a function of transitions in its neighboring wires k − 1 and k + 1, can be expressed as follows, including the effect of load (receiver) capacitance [45]:

$$t_p = 0.69 (R_D + R_w) \cdot C_r + g_0 \cdot C_w \cdot (0.38 R_w + 0.69 R_D), \tag{2.3}$$

where $R_w$ and $R_D$ are the wire and driver resistances, respectively, $C_r$ is the input (gate) capacitance of the receiver, and $g_0$ is the delay correction factor due to inter-wire coupling between wires separated by the minimum spacing; $g_0$ is a function of the capacitance ratio $r = C_{c1}/C_w$. The wire resistance $R_w$ is estimated using the resistivity at a design temperature of 100°C. The various crosstalk conditions occurring when the victim wire k experiences a rising (0 → 1) transition (denoted as ↑) are listed in Table 2.1. A corresponding table of delay factors can be constructed for a victim wire experiencing a falling (1 → 0) transition (↓).

Crosstalk mode | k − 1, k, k + 1 | Delay factor ($g_0$)
mode-0         | ↑, ↑, ↑         | 1 + 0r
mode-1         | ↑, ↑, -         | 1 + 1r
mode-2         | ↑, ↑, ↓         | 1 + 2r
mode-2         | -, ↑, -         | 1 + 2r
mode-3         | -, ↑, ↓         | 1 + 3r
mode-4         | ↓, ↑, ↓         | 1 + 4r

Table 2.1. Bus crosstalk conditions and models for a rising transition in the middle (victim) wire.

In the worst case (toggle or oppositely switching transitions on both sides of the victim), the delay is:

$$t_{wc} = 0.69 (R_D + R_w) \cdot C_r + (1 + 4r) \cdot C_w \cdot (0.38 R_w + 0.69 R_D). \tag{2.4}$$

It is clear that the width of the clock pulse to the circuit should be more than $t_{wc}$ to ensure that the signal propagates completely to the destination, i.e., $t_{bus\_clk} \ge t_{wc}$. To ensure that this does not impact performance, repeaters/buffers are used to divide long wires into several sections and hence reduce propagation delay. Assuming that the size of each repeater is h times the size of a minimum-sized inverter (which is technology-dependent) and k is the number of repeaters needed to achieve optimum delay on the interconnect, these can be calculated using:

$$h = \sqrt{\frac{R_0 \cdot C_{int}}{C_0 \cdot R_{int}}} \quad \text{and} \tag{2.5}$$

$$k = \sqrt{\frac{0.4 (R_{int} \cdot C_{int})}{0.7 (C_0 \cdot R_0)}}, \tag{2.6}$$

where $C_{int} = c_{line} + 4 \cdot c_{i,i\pm1}$ is the total per-unit-length capacitance of a wire leading to the worst-case delay impact, $C_0$ and $R_0$ are the capacitance and resistance of a minimum-sized inverter, and $R_{int} = r_{line}$ is the per-unit-length wire resistance [46].

2.2 Interconnect Optimization Techniques

Several techniques have been proposed to ensure that interconnect power and performance are not affected by technology scaling. We discuss these next.

2.2.1 Data Encoding

In general, system-level encoding techniques fall under three categories, based on whether they use redundancy in space (extra bus lines), time (extra cycles), or voltage (number of distinct voltage levels) [47]. In particular, time redundancy has been demonstrated to be as effective as space redundancy for decreasing the average switching activity, and issues due to extra cycle overheads have been addressed by using compression [48-50]. Different modes of signaling (level and transition signaling) can also be used to reduce bus switching activity.

The bus-invert (BI) code is a low-power encoding scheme designed to limit the average power of the bus [51]. It performs well when patterns to be transmitted are randomly distributed in time and no information about pattern correlation is available.
Therefore, this method is most appropriate for encoding the information on data buses. A redundant control line INV is needed to signal to the receiving end of the bus the encoding mode in the current cycle. The encoding depends on the Hamming distance (i.e., the number of bit differences) between the value of the encoded bus lines at time t − 1 (also counting the redundant line at time t − 1) and the corresponding value at time t. The Hamming distance is compared to n/2, where n is the bus width (assuming it is even without loss of generality). If the Hamming distance between two successive patterns is larger than n/2, the current value is transmitted with inverted polarity and the control line is asserted; otherwise, the current value is transmitted as is, and the INV line is de-asserted. If the words transmitted on the bus are independent and uniformly distributed, the average number of transitions per clock cycle is lowered by less than 25% of the original value, due to the binomial distribution of the distance between consecutive patterns [52].

Major drawbacks of the BI technique are the required redundant bus line and the overhead of the encoder logic that decides whether the Hamming distance exceeds n/2. The encoding latency, in particular, is quite significant, as discussed next. In BI, encoding consists of three sequential steps. First, the Hamming distance is computed. To do this, the current n-bit pattern and the n-bit pattern that was transmitted on the bus in the previous cycle are bitwise XOR-ed and the number of "1"s in the result is counted. This step requires a constant-time operation for the bitwise XOR and $O(n)$ to $O(\log_2 n)$ time for counting, depending upon the counter structure used. In the second step, the Hamming distance is compared with n/2 to check which is greater; this can be completed in $O(n)$ to $O(\log_2 n)$ time, again depending on the hardware structure used.
Finally, the current pattern is inverted or sent as-is, and this takes constant time. Thus, BI encoding takes at least $O(\log_2 n)$ time.

More recently, odd/even bus invert (OEBI) [53] and coupling-driven bus invert (CBI) [54] encoding schemes, designed to reduce transitions on the coupling capacitance between adjacent bus lines, were proposed. In OEBI, even and odd bit positions can be encoded (with bus inversion) independently, and two invert lines are used to indicate one of four modes of transmission: 00, none of the bits are inverted; 01, only the even-numbered bits are inverted; 10, only the odd-numbered bits are inverted; and 11, all bits are inverted. This is based on the observation that by inverting only the odd or even bits, a coupling toggle transition can be reduced to a coupling charge/discharge transition [53]. The scheme assigns weights of 1 and 4 to coupling charge/discharge and toggle transitions, respectively, to estimate coupling energy dissipation. Based on the current and previous input patterns, the total coupling energy dissipation for each of the four modes is estimated. Then the mode that will result in the least coupling energy dissipation is chosen and data is transmitted on the bus in that form. In a similar manner, the CBI encoding technique examines pairs of adjacent bits in the same position for the current and previous input patterns and estimates coupling activity. The differences here are: (1) only one invert line is used to indicate whether the transmitted data is in inverted or non-inverted form; and (2) it uses weights of 1 and 2 for coupling charge/discharge and toggle transitions, respectively. Note that neither OEBI nor CBI considers self transitions when deciding the inversion mode, while BI considers only self transitions.

Bus encoding is also often used to reduce crosstalk. Crosstalk-aware encoding schemes can be one of two types: those that have memory or those that are memoryless.
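As an aside on the bus-invert (BI) scheme described earlier in this subsection, the following is a minimal Python sketch of the three encoding steps (XOR and count, compare with n/2, invert or pass through). It is an illustration only, not an implementation taken from [51]. The function names, the all-zero reset state of the bus, and the example word sequence are assumptions introduced here.

```python
def popcount(x: int) -> int:
    """Number of 1 bits in x."""
    return bin(x).count("1")


def bus_invert_encode(words, n):
    """Bus-invert encode a sequence of n-bit words.

    Returns a list of (bus_value, inv) pairs. The bus and the INV line are
    assumed (hypothetically) to start at all zeros.
    """
    mask = (1 << n) - 1
    prev_bus, prev_inv = 0, 0
    encoded = []
    for w in words:
        w &= mask
        # Step 1: Hamming distance between the lines at time t-1 (including
        # the INV line) and the new word sent as-is (INV would then be 0).
        dist = popcount(prev_bus ^ w) + prev_inv
        # Steps 2 and 3: compare with n/2, then invert or send as-is.
        if dist > n / 2:
            bus, inv = w ^ mask, 1
        else:
            bus, inv = w, 0
        encoded.append((bus, inv))
        prev_bus, prev_inv = bus, inv
    return encoded


def bus_invert_decode(encoded, n):
    """Recover the original words from (bus_value, inv) pairs."""
    mask = (1 << n) - 1
    return [bus ^ mask if inv else bus for bus, inv in encoded]


def count_transitions(values, start=0):
    """Total number of bit flips along a sequence of line values."""
    total, prev = 0, start
    for v in values:
        total += popcount(prev ^ v)
        prev = v
    return total


if __name__ == "__main__":
    n = 8
    words = [0x00, 0xFF, 0xFE, 0x01]  # illustrative, highly-changing sequence
    enc = bus_invert_encode(words, n)
    # Fold the INV line in as an extra (n+1)-th wire when counting flips.
    lines = [(inv << n) | bus for bus, inv in enc]
    print("raw transitions:    ", count_transitions(words))
    print("encoded transitions:", count_transitions(lines))
    print("round trip ok:      ", bus_invert_decode(enc, n) == words)
```

The sketch counts only self transitions, which matches the observation above that BI ignores coupling activity; an OEBI- or CBI-style encoder would replace the distance computation with a weighted count of coupling transitions.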
If an encoding scheme has memory, then each codeword depends on the word that came before it. Thus, each codeword has its own codebook of valid words that can come after it. On the other hand, if an encoding is memoryless, then any codeword can follow any other codeword. The minimum number of wires needed to encode 32 bits is 40 with memory and 46 without memory [55]. Thus, the extra wiring overhead is 25% for an encoding scheme with memory and 44% for optimal encoding without memory.

2.2.2 Wire Spacing and Shielding

Inserting V_DD/GND wires, known as shields, is a popular method to avoid crosstalk in high-performance buses. Signal isolation due to the presence of shields prevents both noise and the increase in delay due to coupled lines switching. A dense fabric interconnect architecture with shield lines inserted after every signal wire was proposed in [56]. Shield insertion also reduces inductive effects because it creates a shorter return path to ground for the current flowing through signal wires. However, inserting shield wires between every pair of signal wires results in a large area cost, increases wire congestion, and may end up requiring more metal layers, leading to higher production costs.

Alternatively, wires can simply be spaced apart to produce a similar effect. Though spacing does not eliminate coupling noise, it reduces the value of the coupling capacitance (since capacitance is inversely proportional to the spacing) and at the same time reduces power dissipation, since the total capacitive load of the line also decreases. In many cases, this is a significant gain compared to shielding, which eliminates the noise at the cost of extra power dissipation [57].

2.3 Architecture-Level Simulators and Early-Stage Design

At the very early stages of design definition, microarchitects start with analytical cycles-per-instruction (CPI) performance models that lead to trace or execution-driven, cycle-by-cycle simulators.
Full or sampled benchmark traces are processed through such simulators, driven by a microarchitecture parameter file. The goal of this design space exploration phase is to optimize the choice of microarchitectural parameters for CPI performance under the design constraints known at that stage. The performance model is typically written in a standard systems programming language such as C or C++ and is designed to project execution times (in cycles) for input application traces; it typically does not model the actual execution of the instructions, but only the execution timing. More recently, power dissipation models that are based on counting the number of transitions occurring in microarchitecture blocks have also been added to these simulators.

Several architecture-level simulators have been developed and used in academia and industry: Wattch [58], SimplePower [59], TEM2P2EST [60], WArPE [61], Sim-Panalyzer [62], IBM Turandot/PowerTimer [31], AccuPower [63], and HotSpot [64]. Interconnect/bus models used in these simulators suffer from many drawbacks. First, none of the existing simulators has models for estimating power consumption and delay that depend on inter-wire coupling activity. For example, the SimplePower tool, which models only memory system buses (between different levels of caches and/or main memory), uses an interconnect model that considers only the self capacitance of bus lines, calculated using an empirical formula [65]. The Wattch simulator, which models only the result bus in the microarchitecture, also does not take inter-wire coupling activity into account when estimating power dissipation. Second, thermal models for buses are not available in most current simulators. The HotSpot tool addresses this need to some extent, but it contains a temperature model for the interconnect system as a whole rather than for each bus, and hence cannot track activity-dependent temperature changes in key processor buses [66].
Temperature gradients and delay variations cannot be estimated using this tool.

2.4 Our Experimental Methodology

2.4.1 Interconnect Geometry and Technology Data

All the interconnects considered in this work are assumed to be routed in the top-most metal layer. A representation of wires in this layer is shown in Figure 2.1.

Figure 2.1. Layout of wires routed in the top-most layer metal. Self and coupling capacitances are shown. The bottom plate represents the V_DD/GND plane.

Values for wire geometry (wire width, spacing, etc.) and for technology and equivalent circuit parameters, such as the capacitance and resistance of a global line for various nanometer-scale technologies, were obtained from the ITRS document and are listed in Table 2.2. Note that wire spacing is assumed to be equal to wire width per ITRS [1]. In this work, we use 130 nm and 45 nm as the representative technologies for a current-generation and a future-generation microprocessor, respectively, and compare our results for these designs.

Figure 2.2. Wire segment of length $l_{opt}$ between two repeaters.

In current generation microprocessors, a global signal bus is typically a few millimeters long; we consider a bus of length 6 mm using the numbers reported in [44] for a Pentium-4 microprocessor. Using this length ($l_{wire}$), we estimate the number of repeaters (k) that need to be inserted to enable non-inverting transmission using Equation 2.6, and then we find the inter-repeater segment length $l_{opt} = \frac{6 \times 10^{-3}}{k}$ m. In the remainder of this work, all experiments and analysis focus on a single wire segment of length $l_{opt}$, driven by a sending end repeater of size h and connected to a receiving end repeater of the same size, as shown in Figure 2.2.
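The repeater calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in this work, and it rests on three assumptions that are not stated in the text: all per-unit-length capacitances in Table 2.2 are treated as pF/m, the value from Equation 2.6 is treated as a per-unit-length density and multiplied by $l_{wire}$, and the repeater count is rounded up to the next even integer so that the bus is non-inverting. The names `NODES` and `repeater_design` are hypothetical.

```python
import math

# Per-node parameters taken from Table 2.2, converted to SI units.
# Assumption: per-unit-length capacitances are in pF/m.
NODES = {
    "130nm": dict(R0=6.23e3, C0=4.65e-15, c_line=44.06e-12,
                  c_adj=91.72e-12, r_line=98.02e3),
    "45nm": dict(R0=13.2e3, C0=1.5e-15, c_line=19.05e-12,
                 c_adj=58.12e-12, r_line=905.05e3),
}


def repeater_design(R0, C0, c_line, c_adj, r_line, l_wire):
    """Return (h, k, l_opt) for a wire of length l_wire (metres)."""
    # Worst-case per-unit-length capacitance, as defined for Eqs. 2.5-2.6.
    c_int = c_line + 4.0 * c_adj
    # Equation 2.5: repeater size relative to a minimum-sized inverter.
    h = math.sqrt((R0 * c_int) / (C0 * r_line))
    # Equation 2.6, read here as repeaters per metre (assumption).
    k_per_m = math.sqrt(0.4 * (r_line * c_int) / (0.7 * (C0 * R0)))
    k = math.ceil(k_per_m * l_wire)
    if k % 2:
        k += 1  # assumption: even repeater count for a non-inverting bus
    return h, k, l_wire / k


if __name__ == "__main__":
    for name, params in NODES.items():
        h, k, l_opt = repeater_design(**params, l_wire=6e-3)
        print(f"{name}: h = {h:.2f}, k = {k}, l_opt = {l_opt * 1e3:.3f} mm")
```

Under these assumptions the sketch gives h of about 74.95 and k = 6 for the 130 nm node, and h of about 49.45 and k = 16 for the 45 nm node, which agree with the corresponding entries in Table 2.2.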
In addition to its self capacitance, this wire segment has a capacitance, due to its sending and receiving end repeaters, that can be calculated as $C_{rep} = h \times C_0$, where $C_0$ is the sum of the input and output capacitances of a minimum-sized inverter.

2.4.2 Parasitic Capacitance Extraction

The ITRS roadmap provides values only for self and adjacent-wire coupling capacitance for current and future technology nodes. Hence, to estimate the coupling capacitances between all pairs of wires (adjacent as well as non-adjacent wire pairs), we employed the publicly available three-dimensional capacitance extraction program called FastCap [7]. Using the wire geometry parameters from ITRS (see Table 2.2 for values) to model a coplanar global bus layout, similar to the one shown in Figure 2.1, we extracted values of self and all coupling capacitances for the middle wire of a 32-bit bus. Figure 2.3 shows the percentage distribution of these capacitances for various technologies. From the figure, we observe that, for current 130 nm and 90 nm technologies, non-adjacent coupling capacitances are somewhat non-negligible (they contribute ≥10%), while even in a future 45 nm node, non-adjacent capacitances account for about 8% of the total capacitance. Our energy model, which is described in a later chapter, considers the effect of two non-adjacent coupling capacitances, $C_{c2}$ and $C_{c3}$, for better accuracy.

Parameter | 130 nm | 90 nm | 65 nm | 45 nm
Number of metal layers | 8 | 9 | 10 | 10
Wire width, $w_i$ (nm) | 335 | 230 | 145 | 103
Wire thickness, $t_i$ (nm) | 670 | 482 | 319 | 236
Relative permittivity of dielectric, $\epsilon_r$ | 3.3 | 2.8 | 2.5 | 2.1
Thermal conductivity of dielectric, $k_{ild}$ (W/mK) | 0.6 | 0.19 | 0.12 | 0.07
Clock frequency, $f_{clk}$ (GHz) | 1.68 | 3.99 | 6.73 | 11.51
Supply voltage, $V_{DD}$ (V) | 1.1 | 1.0 | 0.7 | 0.6
Maximum current density in a wire, $j_{max}$ (MA/cm²) | 0.96 | 1.5 | 2.1 | 2.7
Height of inter-layer dielectric, $t_{ild}$ (nm) | 724 | 498 | 329 | 243
Resistance of minimum size inverter, $R_0$ (kΩ) | 6.23 | 9.04 | 9.6 | 13.2
Capacitance of minimum size inverter, $C_0$ (fF) | 4.65 | 3.14 | 2.25 | 1.5
Self capacitance of wire, $c_{line}$ (pF/m) | 44.06 | 32.77 | 25.07 | 19.05
Adjacent coupling capacitance, $c_{i,i\pm1}$ (pF/m) | 91.72 | 76.84 | 68.42 | 58.12
Non-adjacent coupling capacitance, $c_{i,i\pm2}$ (pF/m) | 6.49 | 4.65 | 3.56 | 2.76
Non-adjacent coupling capacitance, $c_{i,i\pm3}$ (pF/m) | 2.53 | 1.76 | 1.29 | 0.98
Resistance of wire, $r_{line}$ (kΩ/m) | 98.02 | 198.45 | 475.62 | 905.05
Optimal repeater size, h | 74.95 | 70.25 | 51.77 | 49.45
Optimal number of repeaters for non-inverting bus, k | 6 | 8 | 12 | 16
Coupling ratio including effect of repeaters, r | 2.065 | 2.329 | 2.716 | 3.039

Table 2.2. Technology, wire geometry, and equivalent circuit parameters for topmost layer interconnect. Values in the top eight rows are from the International Technology Roadmap for Semiconductors (ITRS) document [1]. Values listed in the next three rows are from Mui et al. [2]. The values for the self and coupling capacitances were extracted using the FastCap tool, and the value for $r_i$ was calculated using the formula $r_i = \rho_{Cu}/(w_i \cdot t_i)$, where $\rho_{Cu} = 2.2 \times 10^{-8}$ Ω-m. Values of h and k were found using the expressions given in Section 2.1.5, and $r = c_{i,i\pm1}/(c_{line} + h \times C_0)$.

2.4.3 Simulation Infrastructure and Verification of its Correctness

Computer simulators have been used for a long time to study both hardware and software behavior. They allow the collection of information and statistics during the execution of programs. Various types of information, such as memory profiles, instruction profiles, and timing statistics, can be gathered from these simulators. For this research, we use the sim-outorder out-of-order processor simulator from the SimpleScalar microarchitecture tool set, which is very widely used in academia [67]. Many microarchitectural simulators used in industry also closely resemble and/or are derived from SimpleScalar or its derivatives [31, 58-64].
We added several enhancements to the sim-outorder simulator to facilitate our analysis and optimization efforts. These are described next.

[Figure 2.3: stacked bar chart titled "Scaling of Self and Coupling Capacitances"; x-axis: Technology Node; y-axis: Percentage of Total Capacitance (0% to 100%); series: self capacitance, C_c1, C_c2, C_c3, C_c_rest.]

Figure 2.3. Distribution of self and coupling capacitance values for the middle wire of a 32-bit bus extracted using the FastCap tool [7]. The first series is the self capacitance of the wire; $C_{c1}$ = coupling capacitance between the wire and its adjacent neighbor; $C_{c2}$ = coupling capacitance between the wire and a non-adjacent wire with 1 wire between them; $C_{c3}$ = coupling capacitance between the wire and a non-adjacent wire with 2 wires between them; $C_{c\_rest}$ = sum of coupling capacitances between the wire and other wires with 3 or more wires between them. For current and near-future ITRS technology nodes (up to 45 nm), non-adjacent coupling capacitances are somewhat non-negligible: they contribute approximately 8-10%.

Support for analyzing bus data: We added support for tracing and analyzing the data transmitted on high-performance processor core buses. The original sim-outorder contains only a functional model of a superscalar processor and does not have the ability to track the data that is transmitted between the microarchitectural blocks in the pipeline. We modified the simulator to track and analyze, on a cycle-accurate basis, the data transmitted on load/store address, load/store data, instruction, and result buses in the processor core.

Wire energy, temperature, and delay models: We also added our wire energy, temperature, and delay models to the simulator. While energy dissipation and delay of our target buses (including the temperature impact on delay) can be estimated on a per-cycle basis, temperature estimates can be obtained at a coarser granularity, i.e., after every 100K cycles or so.
This is because temperature is a slowly changing effect that does not warrant per-cycle estimation. More details on how we determine the granularity of temperature simulation, depending on the fabrication technology used, are discussed later in Section 3.5.3.

Integration with other thermal analysis tools: Recently, a tool called HotSpot [64], also based on SimpleScalar, was developed to estimate substrate (active layer) temperatures using the Wattch model for energy estimation [58]. Even though the on-chip interconnect system is a major contributor to the power budget, it was not modeled accurately in HotSpot. We have integrated our models with HotSpot, thus creating a microarchitecture-level simulation tool for full-chip energy and thermal analysis.

As a result of our enhancements to the simulator, the running time is somewhat longer. The original sim-outorder without enhancements executes ~200K instructions per second [67], whereas our modified simulator executes ~110K instructions per second while running detailed energy and temperature simulations at the granularities described earlier in this subsection. To reduce simulation time when analyzing a large number of programs on our simulator, we used a shared Linux cluster for our experiments [68].

We verified the correctness of our modified simulator with regard to four aspects, as discussed next.

Functional correctness: All the changes we made to the simulator add to its instrumentation capabilities and do not change it functionally with regard to the microarchitectural model it seeks to implement. We verified this in two ways. First, we executed a suite of six microbenchmarks, supplied along with the SimpleScalar toolset, on the original (unmodified) simulator and on our modified version, and compared the outputs. As expected, the program outputs from both versions matched exactly.
Second, we compared several performance metrics recorded by the simulator (number of instructions executed, L1/L2 cache misses, branch misprediction rate, etc.) and found that these matched in the original simulator and our modified version for the six microbenchmarks we tested. These tests show that the functional correctness of our modified simulator has not changed compared to the original one.

Instrumentation correctness: The original sim-outorder simulator contains a microarchitectural model detailed enough to enable us to gather the data transmitted on our target buses in each cycle. Thus, instruction addresses and instructions were gathered from the program counter and the fetch stage of the simulator, respectively; data addresses by computing the target address for load/store instructions; load/store data by monitoring L1 cache reads/writes; and ALU result bus data by monitoring the outputs of the functional units in the execute stage. As such, the instrumentation capabilities we added to the simulator are correct by design.

Model correctness: We tested whether the models we constructed represent actual energy/thermal behavior of buses consistent with previously known data and/or estimates. For our energy model, discussed later in Section 3.3, results were compared with circuit simulation of a distributed-RC wire using the Cadence Spectre simulator. Our model yielded energy results that differed by only about 4.53% from those of Spectre, while being faster and much less complex. Our thermal model, discussed in Section 3.4, is based on the well-known analogy between electrical and thermal quantities that has been used widely in earlier work to model chip thermal structures [66, 69-71] and verified using finite element modeling (FEM) simulations [72, 73].
The average and maximum temperatures obtained using our model, while running SPEC CPU2000 benchmarks on the simulator, were consistent with previously published data in [66], although our model estimated bus energies more accurately by considering actual bus traffic values, estimated interconnect temperatures at a finer granularity, and tracked spatiotemporal variation of temperature, all of which were absent in earlier models. The worst-case temperatures that global signal lines may potentially attain, assuming they carry currents at maximum density all the time, were estimated using FEM-based techniques in [72,73]. Signal lines, which are the focus of our work, do not carry currents at maximum density all the time, and hence their temperatures are likely to be somewhat lower than the estimates obtained using worst-case FEM analysis. We verified that results using our model were consistently lower than the worst-case estimates and remained so for the different technology nodes we tested: 130 nm, 90 nm, and 45 nm.

Implementation correctness: We also tested that the modifications were implemented correctly in the simulator and that the desired outputs were obtained. For all six microbenchmarks, we collected tracedumps of various buses using our simulator and verified manually that the data in each tracedump matched the expected value for that type of data. For example, each entry in the instruction address tracedump should match the program counter value, which is in a known range of memory addresses, and each entry in the instruction tracedump should correspond to a known instruction in the processor's instruction set architecture. We found these to be true in all the tracedumps we tested. We also prepared several small synthetic traces of data streams and verified that results obtained from hand calculations matched those obtained using the equations from our energy model as implemented in the simulator.
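The kind of hand-calculation check mentioned above can be illustrated with a small Python sketch that counts self and adjacent-coupling transitions on a synthetic trace, using the transition classes of Section 2.1.4, and then evaluates the bracketed terms of Equations 2.1 and 2.2 multiplied by $V_{DD}^2$. This is only a sketch under stated assumptions, not the simulator code: the function names, the 4-bit trace, and the unit capacitance and voltage values are made up for illustration, non-adjacent coupling is ignored, and the $f_{clk} \cdot t$ scaling of the equations is left out.

```python
# Coupling transition classes for an adjacent wire pair, written as
# (initial pair value, final pair value), following Section 2.1.4.
CHARGE = {("00", "01"), ("00", "10"), ("10", "11"), ("01", "11")}
DISCHARGE = {("01", "00"), ("10", "00"), ("11", "10"), ("11", "01")}
TOGGLE = {("01", "10"), ("10", "01")}


def count_bus_transitions(trace, width):
    """Return (N_s, N_c, N_d, N_t) for a list of bus words."""
    n_s = n_c = n_d = n_t = 0
    for prev, cur in zip(trace, trace[1:]):
        pb = format(prev, f"0{width}b")
        cb = format(cur, f"0{width}b")
        # Self-charge transitions: a line going from 0 to 1.
        n_s += sum(1 for p, c in zip(pb, cb) if p == "0" and c == "1")
        # Coupling transitions on each pair of adjacent lines.
        for i in range(width - 1):
            key = (pb[i:i + 2], cb[i:i + 2])
            if key in CHARGE:
                n_c += 1
            elif key in DISCHARGE:
                n_d += 1
            elif key in TOGGLE:
                n_t += 1
    return n_s, n_c, n_d, n_t


def energy_terms(counts, c_w, c_c1, vdd):
    """Bracketed terms of Eqs. 2.1 and 2.2 times V_DD^2 (no f_clk * t)."""
    n_s, n_c, n_d, n_t = counts
    e_cons = (n_s * c_w + c_c1 * (n_c + 2 * n_t)) * vdd ** 2
    e_diss = (n_s * c_w + c_c1 * (n_c / 2 + n_d / 2 + n_t)) * vdd ** 2
    return e_cons, e_diss


if __name__ == "__main__":
    trace = [0b0000, 0b0101, 0b1010, 0b0000]  # synthetic 4-bit stream
    counts = count_bus_transitions(trace, width=4)
    print("N_s, N_c, N_d, N_t =", counts)
    # Unit values (hypothetical) keep the result easy to verify by hand.
    print("E_cons, E_diss =", energy_terms(counts, c_w=1.0, c_c1=2.0, vdd=1.0))
```

For this trace, a hand count gives four self-charge transitions and three coupling transitions of each class, so the two printed energy terms can be checked directly against Equations 2.1 and 2.2.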
2.4.4 Target Systems and Benchmarks

The SimpleScalar platform can simulate various RISC microarchitectures. For our work, we use the Alpha 21264 microarchitecture to represent general-purpose superscalar processors. The Alpha 21264 architecture is modeled as a 4-issue, superscalar processor with out-of-order execution and with 32-bit address, 64-bit data, and 128-bit (fetch width = 4) instruction buses between the processor and the L1 cache [74]. Other details of the microarchitecture and memory system for our target system are presented in Table 2.3.

For evaluation on the Alpha target system, we use the SPEC CPU2000 benchmark suite, which consists of 26 programs drawn from real user CPU-intensive applications [75]. The little-endian SPEC benchmark executables we used were downloaded from the SimpleScalar Website [76]. These programs were compiled for the Alpha 21264 instruction set using a Compaq Alpha compiler with SPEC peak settings and included all linked libraries. We ran our experiments using the ref input set from the SPEC CPU2000 suite.

Since the time taken to simulate an entire SPEC CPU2000 benchmark is very long (typically several days on a cycle-accurate simulator), we used the 100-million-instruction single simulation points recommended by the SimPoint toolset to collect results from only a representative slice of the program [77,78]. Although the accuracy of representative samples from SimPoint has not been explicitly validated using energy/temperature metrics, its use in the design and evaluation of microarchitecture-level energy reduction techniques is widespread in the literature. Several works that use phase classification techniques like SimPoint for microprocessor energy evaluation have been surveyed in [79].

Processor Core
Clock rate | 1.68 GHz (130 nm), 11.51 GHz (45 nm) [1]
Fetch/Issue width | 4 each
LSQ | 8 entries

Memory System
P↔L1 bus | Non-pipelined; 64-bit data and 128-bit instruction
L1 D-cache | Virtually-indexed physically-tagged (VIPT), 64KB, 2-way set associative, 64B block size, LRU policy, 3 cycle hit latency, write-through cache
L1 I-cache | Virtually-indexed virtually-tagged (VIVT), 64KB, 2-way set associative, 64B block size, LRU policy, 1 cycle hit latency
L1 MAF | 8 entries
L1↔L2 bus | Non-pipelined; 128-bit data/instruction lines and 38-bit address lines (21 bits for block index and 17 bits for tag)
L2 cache | Physically-indexed physically-tagged (PIPT), 2MB, direct-mapped, 64B block size, LRU policy, 12 CPU cycles hit latency, write-back policy, operating at 2x CPU clock cycle
L2↔M bus | Non-pipelined; 64-bit data/instruction lines and 38-bit address lines

Table 2.3. Configuration of our target system and benchmarks. This processor-memory system configuration is based on the Alpha 21264 processor.

CHAPTER 3

ACTIVITY-DRIVEN ENERGY AND TEMPERATURE MODEL

Accurate early-stage modeling techniques for signal interconnect energy dissipation and temperature are becoming necessary for current designs. This chapter describes a detailed energy model and a first-of-its-kind thermal model for interconnects [80,81].

3.1 Introduction

As fabrication technologies scale down, interconnects are becoming the dominant factor in determining performance, power, cost, and reliability characteristics of a system. Interconnect scaling impacts performance because wire delay has continued to increase relative to that of logic.
In recent years, power density in microprocessors has doubled every three years, primarily because feature sizes and clock frequencies have scaled faster than operating voltages [82]; this rate is expected to increase further in future technology generations [64]. The on-chip interconnect system is already the most important contributor to dynamic power; in current microprocessors (130 nm technology), interconnects are reported to contribute about 51% of the total on-chip dynamic power dissipation, and global signal lines (address, instruction, data, and control buses routed in the top-most metal layer) about 21% [4]. As technology scales down, dynamic power dissipation will still remain important even as leakage power increases. It has been estimated that even in the 45 nm technology node, dynamic power will contribute about 46% of the total power dissipation [2]. Supply voltage scaling and smaller sizes will reduce dynamic power dissipation due to logic in future technologies at a faster rate than in interconnects and hence, interconnect dissipation will contribute a larger share of total dynamic power. Rising interconnect power dissipation will lead to localized Joule heating and temperature rise in wire metal that can affect wire delay due to the temperature dependence of resistivity and/or cause wire breakage due to thermal stresses and electromigration.

As power densities continue to increase, thermal effects in wires are becoming important for the reasons outlined next. Signal transmission over a line/wire $i$ is associated with current flow, which results in $I^2R$ power dissipation, where $I$ is the magnitude of current and $R$ is the resistance of the wire.
This dynamic switching power depends on: (1) the self capacitance (capacitance between the line and ground) of the wire, $C_{line}$, (2) the coupling capacitance $C_{i,j}$ between line $i$ and any other line $j$, (3) the self and coupling activity factors (which in turn depend on self transitions on line $i$ and coupling transitions between line $i$ and any other line $j$, respectively), (4) the supply voltage, and (5) the bus clock frequency. Advances in technology have resulted in ever-higher coupling capacitance $C_{i,j}$ relative to $C_{line}$ due to higher wire aspect ratios and smaller inter-wire spacings; among all $C_{i,j}$, the adjacent coupling capacitance ($C_{i,i\pm1}$) dominates the other (non-adjacent) coupling capacitances. With newer technologies, bus clock frequency has also continued to increase. The supply voltage is scaling down, but not fast enough to offset the rate of increase in the other two. Thus, the net effect is that the $I^2R$ power is continuing to increase as technology scales down, and consequently local heating in wires is becoming a concern. Further, since global signal wires are separated by multiple layers of low-K dielectrics from the substrate that is connected to the heat sink, and since these dielectrics have poor thermal conductivities, heat cannot be removed from the wire efficiently. Energy dissipation and/or thermal effects in global signal lines are further aggravated for the following reasons: (1) increasing use of repeaters in long signal lines to reduce delay leads to higher energy dissipation [46]; (2) a steady increase in the number of metal layers, particularly the number of global metal layers, also increases overall energy dissipation; and (3) long via separations in upper metal layers contribute to higher average wire temperatures, since vias are normally better thermal conductors than the surrounding low-K dielectrics [83].
Because they carry smaller currents than power supply lines, the energy dissipation and thermal characteristics of signal (both clock and data) lines have not been the subject of serious study. But this will need to change as clock frequencies increase with technology scaling. Higher frequency also means that the large fluctuating line currents drawn by the bus driving circuitry can influence resistive and inductive voltage drop in power supply lines, since long global signal lines present a high load capacitance. In this work, we develop a model for activity-dependent bus line energy dissipation and temperature rise, and apply it to different types of microprocessor core buses. While we do not study clock lines in this work, our model can be easily applied to thermal analysis of clock networks to estimate the temperature impact on signal delay, skew, and reliability.

The dynamic power dissipated in a bus wire, which ultimately determines its temperature as discussed earlier, is both time and information dependent. It depends on the type of information (address, instruction, data, or control) being carried by the bus because the information type influences the self and coupling activity factors; for example, the number of coupling transitions is expected to be higher for data streams that are more random in nature than for others. The type of information also directly influences the temperature characteristics of the wire because of the presence of unequal numbers of idle cycles between successive transfers; address and instruction buses typically carry new information every cycle, as opposed to data buses, where more idle cycles are likely to be present between data accesses. These idle cycles, during which no power is dissipated in the bus lines (assuming they hold the last value that was transmitted), present opportunities for cooling.
Hence, interconnect thermal models that estimate temperature and reliability based on the assumption that all bus lines carry the maximum RMS current density (worst-case scenario) [83,84], and models that use switching activity factors to estimate average self-heating power and determine temperature rise [66], may result in inaccuracies. This may, in turn, lead to incorrect interconnect lifetime prediction, since dynamic heating and cooling effects are not taken into account. Also, designers will be forced to allow higher-than-required safety margins and, as a result, the system will incur higher packaging costs. Hence, energy dissipation and thermal effects in buses are best studied using microarchitectural simulators and real workloads; in this work, we present models to facilitate this.

Detailed thermal models and workload-based studies for estimating temperature distributions in the substrate [64] and interconnects are essential for facilitating early-stage design of future high-performance processors. For such designs, a pessimistic temperature assumption will lead to costly and perhaps unrealistic guard bands and high cooling system costs. On the other hand, an optimistic assumption will lead to underestimation of the chip power and leakage, and may lead to shorter lifetime and lower reliability. Higher wire temperatures can have a dramatic impact on performance since temperature directly affects wire delay. Typically, the Elmore delay of an on-chip wire increases by approximately 5% for every 20°C rise in temperature [37]. In addition to its absolute temperature, wire delay also depends on the temperature gradient between the sending and receiving ends. The growing popularity of chip multiprocessing (CMP) and simultaneous multi-threading (SMT) will increase bus switching activities, since potentially uncorrelated data from different streams are transferred on the same bus, resulting in higher per-wire energy dissipation and temperatures.
Thus, realistic temperature models and early-stage estimates are essential for meeting design goals and avoiding temperature-induced problems in silicon.

The organization of the rest of this chapter is as follows. Section 3.2 briefly reviews related work. Next, in Sections 3.3 and 3.4, we present our energy dissipation and thermal models for global signal lines. Following that, in Section 3.5, we discuss our simulation environment and methodology. Then, in Section 3.6, we present results from simulations by applying our models in an execution-driven simulator. Finally, we summarize in Section 3.7.

3.2 Related Work and Our Contributions

Some methods for architecture-level interconnect power analysis have been proposed [59,85]. Earlier modeling methods estimated bus energies based on self transitions only [59], whereas recent models also consider adjacent inter-wire capacitances for energy calculations [85]. Thermal effects in interconnects and their implications for performance, current density, and reliability have been studied in [21,37]. Recently, interconnect thermal models have been proposed in [66,83]. But these models either perform a worst-case analysis using maximum current metrics suitable only for power supply lines [83] or consider average switching activities [66]. Such approaches are not suitable for analyzing signal lines since: (1) signal lines carry much less current than power supply lines, and (2) their energy dissipation and thermal characteristics are tied to actual traffic patterns (with intermittent idling) carried on the bus.

A large body of work exists on low-power bus encoding, much of which also uses bus energy models similar to the ones described in [59] or [85]. Some of the older bus encoding schemes have been surveyed in [86]. Newer schemes include odd/even bus-invert [53], coupling-driven bus-invert [54], transition pattern coding [87], and leakage-aware bus encoding [88].

The contributions of this work are outlined next.
First, we present an accurate model to estimate bus line energy dissipation that can be used in a trace-driven setup or in an execution-driven simulator. Existing bus energy models, like the one proposed in [85], only estimate energy dissipation considering the bus as a whole, not in each line, whereas our model is capable of estimating energy dissipated in each bus line. Also, these models do not account for the non-uniform dissipation of energy across the wire length, which we do in our model. As we shall see later, these factors are necessary to model dynamic temperature effects in buses, both temporally and spatially, across wires. Our bus model is also more accurate because it considers the effect of capacitive coupling between adjacent and non-adjacent wire pairs on switching energy in addition to energy dissipated in the self capacitance. Our work is the first to show that switching transitions in parasitic capacitances between non-adjacent wire pairs account for a significant (7-8%) portion of the total energy dissipation and hence this contribution should not be neglected in bus energy models. Further, we model the effect of repeaters, which increase the self capacitance and hence self energy dissipation. This is so because the output capacitance of a repeater adds to the self capacitance of the line segment that it drives, and the input gate capacitance of a repeater adds to that of its input line.

Second, using our bus line energy dissipation model, we study the effectiveness of some existing low-power bus encoding techniques when used for data and instruction bus encoding. To our knowledge, no previous work has studied these bus encoding techniques using realistic traffic from SPEC CPU2000 benchmark programs; most of them have used random traffic patterns that do not behave like real-world instruction and data streams.
In this context too, we use realistic technology parameters from the ITRS roadmap for current and future nanometer technology nodes.

Finally, we present a thermal model and a methodology to estimate the temperatures of individual wires of a global signal bus during dynamic simulation. Our model incorporates the effect of inter-layer heat transfer (heat conduction from the substrate and lower metal layers through the inter-layer dielectric) and intra-layer heat transfer between adjacent bus lines through the inter-metal dielectric. It can also estimate the temperature gradient between the sending and receiving ends of the bus and hence, it can be used to estimate any dynamic delay variations due to Joule heating. Our model can also be used to estimate the effect of varying substrate temperatures on wire self heating, although in this work, we assume a constant substrate temperature for simplicity. Specific results we obtained are listed next.

• We estimate from simulations using our model for the 130 nm technology node that, during the time interval taken to commit one billion instructions in the pipeline, high-performance bus wire temperatures rise by 10-37°C for various SPEC CPU2000 benchmarks. This is solely due to Joule heat dissipated due to wire switching activities. In the future 45 nm technology node, the wire temperature rise for the same set of benchmarks and simulation sample was found to be between 20-58°C.

• We observed that instruction and data bus wires attained absolute temperatures in the range 80.3-104°C and 97.6-123.7°C, in 130 nm and 45 nm processors, respectively, during the course of our simulation, showing that signal lines attain significant temperatures too.

• Significant wire temperature gradients, most commonly of magnitude between 16-25°C, were found between the sending and receiving ends of the wires during the course of simulation.

• Some significant correlation was found to exist between energy dissipation behavior and wire temperature rise in buses across time; short, intermittent cycles of high energy-dissipating switching activity trigger step changes in temperature.

3.3 Bus Line Energy Dissipation Model

In this section, we develop our bus line energy dissipation model, which calculates the energy dissipated as a result of a switching (both self and coupling) transition. This energy model is then used to determine the change in wire temperature that occurs due to the combined effect of self-heating in the wire and heat conduction into the surrounding medium. Values for wire geometry (wire width, spacing, etc.) and technology and equivalent circuit parameters, like capacitance and resistance of a global line, that we used for various nanometer-scale technologies were listed in Table 2.2.

As described earlier, the energy drawn from the supply rails by the driving gates of a bus line is dissipated as $I^2R$ losses in the bus line. This results in temperature rise in wires due to the self-heating effect. Existing bus energy models, like that in [85], only provide expressions for total energy dissipated in the bus. From the thermal design point of view, the energy dissipated in each bus line is important since it helps determine the temperature rise in each individual wire separately. This can be estimated using our model described below. First, we describe how the energy dissipated due to line self capacitance can be found; the procedure to estimate the contribution of repeaters to this self energy is also explained. Next, we explain how energy dissipated due to inter-wire coupling capacitances, including adjacent coupling and non-adjacent coupling capacitances, can be estimated.

3.3.1 Energy Dissipated due to Line Self Capacitance

Define $V_i = V_i^{fin} - V_i^{in}$, i.e., the difference between the final and initial voltages on line $i$.
Note that $V_i^{in}$ and $V_i^{fin}$ can each take one of two values: 0 or $V_{DD}$. Thus, $V_i = V_{DD}$ implies that the self capacitance of line $i$ charges due to a rising transition (0 → 1), whereas $V_i = -V_{DD}$ means that it discharges due to a falling transition (1 → 0). For each transition, the energy that is dissipated in wire $i$ due to charging or discharging of the self capacitance of the wire can be calculated as:

$$E_i^S = 0.5 \times (C_{line} + C_{rep}) \cdot V_i^2,$$

where $C_{line}$ is the self capacitance of the wire and $C_{rep}$ is the total capacitance of repeaters on the line. The energy $E_i^S$ is called self energy since it involves only the self or line capacitance (including the contribution of repeaters). Values for $C_{line}$ are obtained by multiplying the per-unit-length capacitances given in Table 2.2 by the wire length, and values for $C_{rep}$ are computed using Equation 2.6.

3.3.2 Energy Dissipated due to Inter-Wire Capacitance

The second component of energy dissipation is coupling energy, which is influenced by the charging, discharging, or toggling of the coupling capacitance $C_{i,j}$ between two lines $i$ and $j$. A coupling charge transition occurs between the two lines when $V_i = 0$ or $V_j = 0$, and $V_i + V_j = V_{DD}$; 00 → 01, 00 → 10, 10 → 11, and 01 → 11 are the possible cases. A coupling discharge transition occurs when $V_i = 0$ or $V_j = 0$, and $V_i + V_j = -V_{DD}$; 01 → 00, 10 → 00, 11 → 10, and 11 → 01 are the possible cases. A coupling toggle transition occurs when $V_i, V_j \neq 0$ and $V_i = -V_j$, i.e., when a 01 → 10 or 10 → 01 transition occurs. In all three cases, the coupling energy dissipated in line $i$ due to $C_{i,j}$ is obtained as:

$$E_{i,j}^C = 0.5 \times C_{i,j} \left( V_i^2 - V_i \cdot V_j \right), \quad \forall\, i \neq j.$$

Values of $C_{i,j}$ are given in Table 2.2. It can be seen that the toggle case dissipates an equal amount of energy in both coupled lines ($E_{i,j}^C = E_{j,i}^C$; substituting $V_j = -V_i$ in the expression above gives $C_{i,j} \cdot V_{DD}^2$ per line), but the charge and discharge transitions result in coupling energy dissipation equal to $0.5 \times C_{i,j} \cdot V_{DD}^2$ only in the line that charges/discharges.
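The self and coupling energy expressions of Sections 3.3.1 and 3.3.2 can be evaluated line by line for a single bus transition. The following is a minimal sketch rather than the simulator code used in this work: the function name `line_energies`, its argument names, and the representation of coupling capacitances as a dictionary keyed by wire distance (1 for adjacent wires, 2 for next-nearest, and so on) are assumptions made for illustration.

```python
def line_energies(prev_bits, next_bits, vdd, c_self, c_rep, c_couple):
    """Energy dissipated in each bus line for one transition (hypothetical helper).

    prev_bits, next_bits -- bus values (0/1 per line) before and after the transition
    vdd                  -- supply voltage
    c_self, c_rep        -- line self capacitance and total repeater capacitance
    c_couple             -- assumed layout: {wire distance: coupling capacitance}
    """
    n = len(prev_bits)
    # V_i = V_i^fin - V_i^in, which is +VDD, 0, or -VDD for each line.
    dv = [(after - before) * vdd for before, after in zip(prev_bits, next_bits)]

    energies = []
    for i in range(n):
        # Self energy: 0.5 * (C_line + C_rep) * V_i^2
        energy = 0.5 * (c_self + c_rep) * dv[i] ** 2
        # Coupling energy with every other line j: 0.5 * C_ij * (V_i^2 - V_i * V_j)
        for j in range(n):
            if j == i:
                continue
            c_ij = c_couple.get(abs(i - j), 0.0)
            energy += 0.5 * c_ij * (dv[i] ** 2 - dv[i] * dv[j])
        energies.append(energy)
    return energies
```

Calling the function once per bus cycle and accumulating the returned list gives a per-line energy profile; lines that do not switch contribute zero in that cycle, which matches the assumption that idle lines hold their last value.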
Thus, the total energy dissipated in a segment of the wire between two repeaters of bus line $i$ is the sum of the self energy and coupling energies and is given by the following equation.

$$E_i = E_i^S + \sum_{\forall j,\, j \neq i} E_{i,j}^C \tag{3.1}$$

3.3.3 Distributed-RC Line Energy Model

This energy is dissipated non-uniformly across the length of the segment, as we show next. Consider the schematic of the segment of a distributed RC wire shown in Figure 3.1. For this segment of length $l_{opt}$, the total wire electrical resistance $R_w$ and the parasitic capacitance $C_w$, which includes the self and coupling capacitance, can be divided equally across $n$ subsegments. Thus, each subsegment has a resistance of $R_w/n$ and a capacitance of $C_w/n$. The driving repeater is represented by its resistance $R_d$. At the end of the wire is the receiving repeater, contributing a gate capacitance $C_r$ to the load. Let the energy dissipated in the $k$th subsegment of wire $i$ be represented by $E_{i,k}$.

[Figure 3.1: sender (driver resistance $R_d$), distributed RC wire ($n$ subsegments), and receiver (load capacitance $C_r$).]

Figure 3.1. Distributed-RC model of the wire segment divided into n subsegments.

Consider the 4-stage RC network corresponding to the wire shown in Figure 3.1; this represents a distributed RC line. For a unit input signal $u(t)$, the s-domain voltages at the four nodes will be:

$$v_1(s) = V_{DD}\left(s^{-1} - m_0^1 + m_1^1 s + m_2^1 s^2 + \cdots\right)$$
$$v_2(s) = V_{DD}\left(s^{-1} - m_0^2 + m_1^2 s + m_2^2 s^2 + \cdots\right)$$
$$v_3(s) = V_{DD}\left(s^{-1} - m_0^3 + m_1^3 s + m_2^3 s^2 + \cdots\right)$$
$$v_4(s) = V_{DD}\left(s^{-1} - m_0^4 + m_1^4 s + m_2^4 s^2 + \cdots\right),$$

where $m_0^i$, $m_1^i$, $m_2^i$, etc. represent the first moment, the second moment, and so on. The corresponding currents through the capacitors are $i_{c1}(s) = Y_1(s) \cdot v_1(s)$, $i_{c2}(s) = Y_2(s) \cdot v_2(s)$, $i_{c3}(s) = Y_3(s) \cdot v_3(s)$, and $i_{c4}(s) = Y_4(s) \cdot v_4(s)$, where $Y_i(s) = sC_i$ is the admittance of each subsegment [89]. This gives:

$$i_{c1}(s) = sC_1\left(s^{-1} - m_0^1 + m_1^1 s + m_2^1 s^2 + \cdots\right)V_{DD}$$
$$i_{c2}(s) = sC_2\left(s^{-1} - m_0^2 + m_1^2 s + m_2^2 s^2 + \cdots\right)V_{DD}$$
$$i_{c3}(s) = sC_3\left(s^{-1} - m_0^3 + m_1^3 s + m_2^3 s^2 + \cdots\right)V_{DD}$$
$$i_{c4}(s) = sC_4\left(s^{-1} - m_0^4 + m_1^4 s + m_2^4 s^2 + \cdots\right)V_{DD}.$$

From the circuit, it is clear that $I_1 = i_{c1} + i_{c2} + i_{c3} + i_{c4}$, $I_2 = i_{c2} + i_{c3} + i_{c4}$, $I_3 = i_{c3} + i_{c4}$, and $I_4 = i_{c4}$. In general, we can write the following equation for the current through a resistor $i$ after discarding higher-order moments:

$$I_i(s) = V_{DD}\left[\sum_{j \in D_i} C_j - s\sum_{j \in D_i} C_j m_0^j + s^2 \sum_{j \in D_i} C_j m_1^j\right], \tag{3.2}$$

where the set $D_i$ represents all the downstream nodes of node $i$, $v_j$ is the voltage at the $j$-th node, and $C_j = C_w/n$ is the capacitance of the $j$-th subsegment. The downstream capacitance of node $i$, which is the sum of the capacitances of subsegments $i$ through $n$, can be evaluated as:

$$\sum_{j \in D_i} C_j = \frac{C_w}{n}(n - i). \tag{3.3}$$

We can express the power series in Equation 3.2 in transfer function form with poles ($p_1^i$, $p_2^i$) and zeros ($z_1^i$, $z_2^i$). However, it has been shown in [90] that for interconnect lines, the transfer function using two-pole analysis has a special form in which the numerator polynomial is a constant, as shown in the following equation.

$$H_i(s) = \frac{1}{1 + b_1^i s + b_2^i s^2} \tag{3.4}$$

Expanding this transfer function about $s = 0$, we have $H_i(s) = 1 - b_1^i s + \left((b_1^i)^2 - b_2^i\right)s^2$ [89]. Comparing with Equation 3.2 (after factoring out the downstream capacitance), we get:

$$b_1^i = \frac{\sum_{j \in D_i} C_j m_0^j}{\sum_{j \in D_i} C_j} = \frac{\sum_{j \in D_i} C_j t_j^{ED}}{\sum_{j \in D_i} C_j}, \tag{3.5}$$

since the Elmore delay $t_j^{ED}$ of the line until the $j$-th subsegment is given by the first moment $m_0^j$ [90]. Thus we have:

$$I_i(s) = V_{DD}\frac{C_w}{n}(n-i)\cdot\frac{1}{\left(1 + \frac{s}{p_1^i}\right)\left(1 + \frac{s}{p_2^i}\right)} = V_{DD}\frac{C_w}{n}(n-i)\cdot\frac{p_1^i p_2^i}{p_1^i p_2^i + (p_1^i + p_2^i)s + s^2} \tag{3.6}$$

$$I_i(t) = \mathcal{L}^{-1}[I_i(s)] = V_{DD}\frac{C_w}{n}(n-i)\cdot\frac{p_1^i p_2^i}{p_2^i - p_1^i}\left(e^{-p_1^i t} - e^{-p_2^i t}\right), \tag{3.7}$$

where $\mathcal{L}^{-1}[\cdot]$ is the inverse Laplace transform operator. In Equation 3.6, we have used the transfer function of the form:

$$G_i(s) = \frac{p_1^i p_2^i}{p_1^i p_2^i + (p_1^i + p_2^i)s + s^2}.$$

By equating $H_i(s)$ and $G_i(s)$, we obtain $b_1^i = \frac{p_1^i + p_2^i}{p_1^i p_2^i}$. Now the amount of Joule heat dissipated in the $i$-th subsegment can be estimated as:

$$E_i = \int_0^\infty [I_i(t)]^2\,\frac{R_w}{n}\,dt = \frac{(n-i)^2 C_w^2 V_{DD}^2 R_w}{n^3}\times\frac{p_1^i p_2^i}{2(p_1^i + p_2^i)} = \frac{(n-i)^2 C_w^2 V_{DD}^2 R_w}{n^3}\times\frac{1}{2 b_1^i}.$$

Substituting for $b_1^i$ from Equation 3.5 (with $C_j = C_w/n$ and Equation 3.3), and rearranging, we get:

$$E_i = \frac{C_w}{2n}V_{DD}^2 \times \frac{(n-i)^3 R_w C_w}{n^2 \sum_{j \in D_i} t_j^{ED}}. \tag{3.8}$$

We observe that the first term in Equation 3.8, i.e., $0.5 \times \frac{C_w}{n} V_{DD}^2$, corresponds to the Joule heat dissipated in a subsegment assuming that energy is dissipated uniformly across the wire length. The second term can be regarded as a correction factor indicating that the energy dissipation is non-uniform across the length, i.e., higher energy is dissipated at the subsegments near the sending end than at those near the receiving end. This is because for increasing $i$ ($0 \le i \le n - 1$), the numerator reduces and the denominator increases in value and hence the correction factor reduces overall.

We validated our model by comparing with the energy distribution obtained using the Cadence Spectre simulator for different numbers of subsegments ($n$ = 10, 50, and 100). The normalized energy dissipated in each wire subsegment for the $n$ = 10 case, obtained using our model and Cadence Spectre simulations with 130 nm ITRS parameters, is shown in Table 3.1. The average error of our model is 4.53% and the maximum error is 7.75% compared to Spectre results. Note that this difference arises because the derivation of Equation 3.8 ignored higher-order moments of the node voltages. For $n$ = 50 and 100 subsegments too, we found that energy values from our model are very close to those from Spectre; the average errors in these cases were 3.94% and 3.51%, respectively. As a trade-off between model complexity, in terms of its simulation time, and its accuracy, we use $n$ = 10.

Subsegment #   Normalized energy           % Error
               Equation 3.8    Spectre
0              0.132565        0.123033    7.75
1              0.125624        0.117550    6.87
2              0.118420        0.112207    5.54
3              0.112894        0.107001    5.50
4              0.106988        0.101931    4.96
5              0.101150        0.096996    4.28
6              0.095827        0.092194    3.94
7              0.089974        0.087525    2.79
8              0.084548        0.082986    1.88
9              0.080009        0.078578    1.82
Average                                    4.53

Table 3.1. Comparison of normalized energy dissipated in wire subsegments obtained using our model and Cadence Spectre simulations for 10 subsegments.
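The error figures quoted for the $n$ = 10 case can be recomputed from the two normalized-energy columns of Table 3.1. The short sketch below does this, assuming the error is taken relative to the Spectre value; the list and function names are illustrative only, and the numbers are copied from the table.

```python
# Normalized energies per subsegment (0..9), copied from Table 3.1.
model_energy = [0.132565, 0.125624, 0.118420, 0.112894, 0.106988,
                0.101150, 0.095827, 0.089974, 0.084548, 0.080009]
spectre_energy = [0.123033, 0.117550, 0.112207, 0.107001, 0.101931,
                  0.096996, 0.092194, 0.087525, 0.082986, 0.078578]


def percent_errors(estimate, reference):
    """Relative error of each estimate with respect to its reference, in percent."""
    return [100.0 * (e - r) / r for e, r in zip(estimate, reference)]


errors = percent_errors(model_energy, spectre_energy)
average_error = sum(errors) / len(errors)
maximum_error = max(errors)

for k, err in enumerate(errors):
    print(f"subsegment {k}: {err:.2f}%")
print(f"average {average_error:.2f}%, maximum {maximum_error:.2f}%")
```

The largest error falls on the subsegment nearest the sender and the error shrinks toward the receiver, which is the trend visible in the table.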
3.4 Thermal Model

In this section, we present our thermal model. Compared to our earlier model [81], this enhanced model can also estimate the distribution of wire temperatures across the length of the wire segment. Next, we briefly introduce chip thermal structures and discuss the heat transfer mechanism in modern chip packages.

3.4.1 Chip Thermal Structures and Heat Transfer

Figure 3.2 shows the cross-sectional view of various layers in a chip package that influence the way heat is transferred away from the active areas. The figure shows a C4/CBGA (flip-chip) package with an attached heat sink and no forced air cooling. For this type of packaging and cooling system, it has been found that there are two heat transfer paths: a primary path that conducts away heat generated at the active layer (substrate) through the heat spreader, attach material, and the heat sink, and a secondary path that transfers heat from the substrate through the dielectric layers (heat flows from the bottommost to the topmost interconnect layer) and finally through C4 bumps, ceramic substrate, CBGA joints, and the printed circuit board to the ambient air [66]. As mentioned earlier, models for estimating substrate temperatures are available in tools like HotSpot [64,66], but detailed activity-dependent models for estimating global signal wire temperatures are not. Next, we present the model that will help estimate spatially-distributed wire temperatures in a wire segment.

[Figure 3.2: cross section of a C4/CBGA package showing the heat sink, thermal paste, heat spreader, Si substrate, metal layers, and C4 pads, with the primary and secondary heat transfer (HT) paths marked.]

Figure 3.2. View of the different thermal structures of a C4/CBGA chip and the primary and secondary heat transfer paths.

3.4.2 Detailed Thermal Model

In the thermal model presented next, we consider any subsegment $k$ as a point source of Joule heat, called a thermal node.
Using the well-known analogy between thermal and electrical quantities, we can consider that the temperature difference between two nodes corresponds to a voltage difference and the heat transfer rate to a current. The ability of the wire segment to hold heat is modeled by its thermal capacitance, and the ability of the surrounding dielectric to conduct heat away from the wire segment is modeled as the thermal resistance. These thermal circuit parameters are brought together to form a thermal-RC network, shown in Figure 3.3(a) for a 5-wire bus, across the same subsegment $k$ in all wires. By equating the rate of heat flowing into a node in the thermal equivalent circuit to the rate of heat flowing out (analogous to Kirchhoff's current law in electrical circuits), we obtain the following. For the two edge wires:

$$P_{i,k} + P'_{i,k} = C_{i,k}\frac{d\theta_{i,k}}{dt} + \frac{\theta_{i,k} - \theta_0}{R_{i,k}} + \frac{\theta_{i,k} - \theta_{i\pm1,k}}{R_{inter}} \tag{3.9}$$

and for the middle wires:

$$P_{i,k} + P'_{i,k} = C_{i,k}\frac{d\theta_{i,k}}{dt} + \frac{\theta_{i,k} - \theta_0}{R_{i,k}} + \frac{2\theta_{i,k} - \theta_{i-1,k} - \theta_{i+1,k}}{R_{inter}}, \tag{3.10}$$

where $P_{i,k}$ is the instantaneous power dissipated in the $k$th subsegment of the $i$-th wire, $P'_{i,k}$ is the equivalent power due to the effect of switching activity in lower metal layers and the substrate, and $\theta_0$ is the ambient temperature (45°C or 318.15 K) inside the computer box. Note that these equations do not include heat that may potentially flow through the vias. The reason for neglecting the via effect is given in Section 3.4.2.

[Figure 3.3: (a) thermal-RC network for wires (1) through (5), each with a power source $P_{i,k}$, a source $P'_{i,k}$, a thermal capacitance $C_{i,k}$, and a thermal resistance $R_{i,k}$, with neighboring wires joined by $R_{inter}$, between a layer at substrate temperature and a layer at ambient temperature; (b) wire cross-section geometry below a layer at ambient temperature.]

Figure 3.3. Thermal model. (a) Complete equivalent thermal-RC network for a 5-wire bus. $P'_{1,k} = P'_{2,k} = \cdots = P'_{5,k}$, $R_{1,k} = R_{2,k} = \cdots = R_{5,k}$, $C_{1,k} = C_{2,k} = \cdots = C_{5,k}$, and $P_{1,k}, P_{2,k}, \ldots, P_{5,k}$ are bus-activity dependent in the model shown. (b) Geometry for calculating equivalent thermal resistances for a wire based on previous work of Chiang et al. The lightly shaded regions and arrows represent heat flow between the conductors or between layers (from a hotter to a cooler one).

The instantaneous or cycle-by-cycle power $P_{i,k}$ can be obtained by dividing the energy $E_{i,k}$ obtained using Equation 3.8 by the clock cycle time. However, in our microarchitectural simulations, we record the energy $E_{i,k}$ for a finite interval and then divide it by the duration of the time interval to obtain the power dissipated. This time duration is set as explained later in Section 3.5.3. In the above equations, $C_{i,k}$, the thermal capacitance of the wire segment, is given by $C_{i,k} = C_s \cdot (t_i \cdot w_i)$, where $C_s$ is the specific heat per unit volume of the wire metal, and $w_i$ and $t_i$ are wire dimensions as shown in Figure 3.3(b), with values given in Table 2.2. $R_{i,k}$ is the thermal resistance of the wire segment along the heat transfer path as shown in Figure 3.3(b), and it can be calculated from the following expression using wire geometry and the thermal conductivity $k_{ild}$ of the inter-layer dielectric (ILD) as described in [83]:

$$R_{i,k} = R_{spr} + R_{rect} = \frac{\ln\left(\frac{w_i + s_i}{w_i}\right)}{2 \cdot k_{ild}} + \frac{t_{ild} - 0.5 s_i}{k_{ild}(w_i + s_i)}. \tag{3.11}$$

The above expression is the sum of two terms: the first is the spreading resistance $R_{spr}$ due to the spreading of heat from the face of the wire exposed to a cooler layer (away from the substrate) in a trapezoidal manner, and the second is the thermal resistance $R_{rect}$ due to rectangular heat flow as depicted in Figure 3.3(b). Equations 3.9 and 3.10 can be solved to determine the wire temperature $\theta_{i,k}$.

Heat transfer from lower layers through the dielectric

Next we consider the temperature rise in global signal lines due to heat transfer from underlying layers.
This is needed because, in current C4/CBGA packages, a secondary heat transfer path exists from the substrate through the interconnect layers. Thus, some heat flows from the substrate through the metal layers (bottommost to the topmost interconnect) and finally through C4 bumps, ceramic substrate, CBGA joints, and the printed circuit board to the ambient air [66]. The temperature increase due to this effect in each global wire can be estimated using the following closed-form expression [83]:

[Equation (3.12): closed-form expression for the temperature rise due to the $N$ underlying metal layers, written in terms of $t_{ild,i}$, $k_{ild,i}$, $s_i$, $t_j$, $\rho_j$, and $j_{max}$; see [83].]

where $N$ is the number of layers of metal and $\rho_j$ is the resistivity of the metal line (copper). The values for $t_{ild,i}$, $k_{ild,i}$, $s_i$, and $t_i$, corresponding to different layers of metal, were obtained from the ITRS roadmap. Note that Equation 3.12 neglects the thermal capacitance of wire segments in the lower layers. This is because wires at lower layers are usually thinner and narrower (smaller $w$ and $t$) and also have smaller lengths. Thus, $R_{inter}$, which depends on $t \cdot l$, and $C_{th}$, which depends on $w \cdot t \cdot l$, both have negligible values, and only the dominant $R_{th}$ terms are considered in this equation. The above equation also assumes that all wiring tracks underneath the global bus are populated with power supply wires that carry current at their maximum density ($j_{max}$).

The net effect of the secondary heat transfer path (from the substrate and lower metal layers) is depicted as the constant current source $P'_{i,k}$ in the network shown in Figure 3.3(a). Note that the $P'_{i,k}$ values are all equal since spatial variation in substrate temperature across the width of the bus is neglected. This is valid because, in almost all cases, the area footprint of the buses we study is well within the dimensions of the underlying circuit block for which we know the substrate temperature.
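Equations 3.9 and 3.10 form a set of coupled first-order differential equations, one per wire, that can be integrated numerically once the powers $P_{i,k}$ and $P'_{i,k}$ are known for an interval. The sketch below uses a simple forward-Euler update for one subsegment index $k$; this integration scheme, the function names `euler_step` and `simulate`, and the assumption that all wires share the same thermal resistance and capacitance (as in Figure 3.3(a)) are illustrative choices, not a description of the solver used in this work.

```python
def euler_step(theta, p_wire, p_sub, c_th, r_th, r_inter, theta_amb, dt):
    """Advance the wire temperatures of one subsegment index by one time step.

    theta     -- current temperature of each wire
    p_wire    -- activity-dependent power P_{i,k} of each wire
    p_sub     -- equivalent power P'_{i,k} from lower layers (equal for all wires)
    c_th      -- thermal capacitance C_{i,k} (assumed equal for all wires)
    r_th      -- thermal resistance R_{i,k} to the ambient (assumed equal)
    r_inter   -- lateral thermal resistance R_inter between adjacent wires
    theta_amb -- ambient temperature theta_0
    dt        -- time step; must be small compared to the thermal time constants
    """
    n = len(theta)
    updated = []
    for i in range(n):
        # Heat leaving wire i toward the ambient through the ILD.
        q_vertical = (theta[i] - theta_amb) / r_th
        # Heat leaving wire i toward its neighbors through the IMD:
        # edge wires have one neighbor (Eq. 3.9), middle wires two (Eq. 3.10).
        q_lateral = 0.0
        if i > 0:
            q_lateral += (theta[i] - theta[i - 1]) / r_inter
        if i < n - 1:
            q_lateral += (theta[i] - theta[i + 1]) / r_inter
        dtheta_dt = (p_wire[i] + p_sub - q_vertical - q_lateral) / c_th
        updated.append(theta[i] + dt * dtheta_dt)
    return updated


def simulate(p_wire, p_sub, c_th, r_th, r_inter, theta_amb, dt, steps):
    """Start every wire at the ambient temperature and apply constant powers."""
    theta = [theta_amb] * len(p_wire)
    for _ in range(steps):
        theta = euler_step(theta, p_wire, p_sub, c_th, r_th, r_inter,
                           theta_amb, dt)
    return theta
```

In a microarchitectural simulation the powers would change from one recording interval to the next, so `euler_step` would be called with a new `p_wire` list per interval; with all powers set to zero the update lets the wires cool toward the ambient temperature, which is how idle cycles appear in this formulation.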
Heat transfer from lower layers through vias

Joule heat generated in the lower metal layers can flow to the global metal layer through the ILD (as described in the previous subsubsection) and also, in parallel, through the vias. However, heat transfer through vias occurs only within the range of the thermal characteristic length $L_H$ of the wire [37,83]:

$$L_H = \sqrt{\frac{t_i \cdot t_{ild} \cdot k_m}{k_{ild}\left(1 + 0.88\frac{t_{ild}}{w_i}\right)}}, \tag{3.13}$$

where $k_m$ = 401 W/mK is the thermal conductivity of copper. If a wire is longer than $L_H$, the via heat transfer is negligible. Using parameters in Table 2.2, $L_H$ was found to be 10.56 µm for 130 nm and 10.33 µm for 45 nm, which are much smaller than our inter-repeater segment length $l_{opt}$. Hence, the heat transfer through vias will always be negligible in the global buses we consider.

Lateral thermal coupling between wires

The lateral heat transfer between adjacent wires can be significant due to the large exposed sidewall area in high aspect-ratio global lines and due to the difference in activity rates of the neighboring lines (which creates a temperature difference and hence lateral heat flow). It has been shown using FEM simulations that thermal coupling is a significant phenomenon in global lines, particularly when high-activity wires are placed next to low-activity ones [73]. In our model, this effect is captured with a lateral inter-wire thermal resistance whose value depends on wire geometry parameters, as shown in Figure 3.3(a), and the inter-metal dielectric (IMD) thermal conductivity, $k_{imd}$, and is given by the expression:

[Equation (3.14): expression for $R_{inter}$ in terms of the wire spacing $s_i$, the wire thickness $t_i$, the segment length $l_{opt}$, and $k_{imd}$.]

Previous work on interconnect thermal modeling did not consider the effect of inter-wire heat transfer [66]; our model incorporates this for better accuracy. For simplicity, we assume that the ILD and IMD are the same material.
Hence $k_{imd} = k_{ild}$. Thus, the temperature $\theta_{i,k}$ of the $k$-th subsegment of wire $i$ is affected by the rate of heat $P_{i,k}$ generated in it as a result of activity-dependent current flow, the thermal capacitance $C_i$ of the wire metal, the thermal resistances $R_i$ and $R_{inter}$ of the surrounding inter-layer and intra-layer dielectric, respectively, and the temperature $\theta_{i \pm 1,k}$ of the $k$-th subsegments of its adjacent wires, all of which are considered in our model. A distribution of wire temperatures across the wire length can be obtained by solving Eqs. 3.9 and 3.10 for a number of subsegments $k = 0, 1, \ldots, n$. The temperature gradient $\Delta\theta_i$, or difference between the sending and receiving end temperatures, can be estimated using $\Delta\theta_i = \theta_{i,0} - \theta_{i,n}$, where $n$ is the number of subsegments.

3.4.3 Steady-State Thermal Model

The detailed thermal model discussed above is used to track activity-dependent temperature variations in bus wires across time. However, due to its complexity, it is somewhat difficult to use in the temperature optimization methodologies that we propose later in our research. Hence we develop an approximate version of this model, known as the steady-state thermal model. This model is also used to estimate the initial temperatures for the bus wires before starting detailed thermal simulations.

The steady-state model for three wires is discussed next. Consider three consecutive wires $w_i$, $w_j$, and $w_k$ on a bus. When there is no bit reordering, data bits $b_i$, $b_j$, and $b_k$ are carried on these lines. Let the corresponding power dissipation on these wires be $P_i$, $P_j$, and $P_k$, respectively. We assume a steady state temperature model for thermal analysis of this wire set. In this model, the final temperature $T_{fin}$ of a structure with initial temperature $T_{init}$ is $T_{fin} = T_{init} + P \times R_t$, where $P$ is the power dissipated by the structure and $R_t$ is its thermal resistance.
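As a minimal sketch of the three-wire steady-state model treated in this subsection, the Python fragment below evaluates the closed-form middle-wire temperature (Eq. 3.15 with the weights of Eq. 3.16) and recovers the outer-wire temperatures from their node equations. The function names and the numeric values in the usage note are illustrative assumptions, not values from this work; only the 45°C ambient temperature comes from the text.

```python
def middle_wire_temperature(p_i, p_j, p_k, r_th, r_inter, t_amb=45.0):
    """Closed-form steady-state temperature of the middle wire (Eq. 3.15).

    Powers are in watts, resistances in K/W, temperatures in degrees Celsius.
    """
    denom = 3.0 * r_th + r_inter
    alpha = r_th ** 2 / denom          # weight of the neighbours' power
    beta = r_th * r_inter / denom      # extra weight of the wire's own power
    return (p_i + p_k) * alpha + p_j * (alpha + beta) + t_amb


def outer_wire_temperature(p_outer, t_mid, r_th, r_inter, t_amb=45.0):
    """Solve P = (T - Ta)/Rth + (T - Tj)/Rinter for an outer wire's T."""
    return (p_outer + t_amb / r_th + t_mid / r_inter) / (1.0 / r_th + 1.0 / r_inter)


def middle_node_residual(p_i, p_j, p_k, r_th, r_inter, t_amb=45.0):
    """Residual of the middle-node heat balance; near zero if Eq. 3.15 holds."""
    t_j = middle_wire_temperature(p_i, p_j, p_k, r_th, r_inter, t_amb)
    t_i = outer_wire_temperature(p_i, t_j, r_th, r_inter, t_amb)
    t_k = outer_wire_temperature(p_k, t_j, r_th, r_inter, t_amb)
    balance = (t_j - t_amb) / r_th - (t_i - t_j) / r_inter - (t_k - t_j) / r_inter
    return p_j - balance
```

For example, with hypothetical values `middle_node_residual(1e-3, 4e-3, 2e-3, 5000.0, 2000.0)`, the residual is zero to within floating-point rounding, which is a quick consistency check between the node equations and the closed form.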
Thermal resistances of global signal wires can be estimated based on their geometry using equations given in [66,81], and wire power dissipation can be obtained using a microarchitecture-level simulator. For three adjoining wires, the steady state thermal equivalent circuit is shown in Figure 3.4.

Figure 3.4. Steady state thermal equivalent circuit for three wires. Heat transfer between wires is modeled by $R_{inter}$ and heat loss to surroundings by $R_{th}$. $P_i$ represents power dissipated in each wire due to switching activity, and it can be found using a microarchitecture-level simulator. $T_a$ is the ambient temperature.

Using Kirchhoff's law on the three nodes, we get the following equations:

$$P_i = \frac{T_i - T_a}{R_{th}} + \frac{T_i - T_j}{R_{inter}},$$
$$P_j = \frac{T_j - T_a}{R_{th}} - \frac{T_i - T_j}{R_{inter}} - \frac{T_k - T_j}{R_{inter}},$$
$$P_k = \frac{T_k - T_a}{R_{th}} + \frac{T_k - T_j}{R_{inter}}.$$

In these equations, $R_{th}$ is the inter-layer thermal resistance, $R_{inter}$ the intra-layer thermal resistance, and $T_a$ is the ambient temperature, assumed to be 45°C inside the computer box. Solving this set of simultaneous equations using Mathematica, the expression for the temperature of the middle wire is found to be:

$$T_j = (P_i + P_k) \cdot \alpha + P_j \cdot (\alpha + \beta) + T_a, \qquad (3.15)$$

$$\text{where } \alpha = \frac{R_{th}^{2}}{3R_{th} + R_{inter}} \text{ and } \beta = \frac{R_{th} R_{inter}}{3R_{th} + R_{inter}}. \qquad (3.16)$$

Thus, we find that the temperature rise ($\Delta T_j = T_j - T_a$) in the middle wire is proportional to a weighted sum of the power dissipated in itself and in its neighboring wires.

3.5 Simulation Environment and Methodology

We used the Alpha 21264 platform for this work. Details of the simulation infrastructure for this platform were described earlier in Section 2.4.4.

3.5.1 Benchmarks and Sample Sizes

Previous work on temperature-aware microarchitecture design has characterized benchmarks, mostly in the SPECint suite, as hot, medium, or cold benchmarks based on the percentage of cycles during which they are in violation of an 81.8°C threshold [64].
From the benchmarks used in that work, we chose three benchmarks that were reported to result in extreme thermal stress (gcc, crafty, and vortex), and two from the medium thermal stress group (gzip and mesa). We randomly chose seven benchmarks that have not been characterized previously to complete the 12 benchmarks in our set. Thus, our workload represents a mix of benchmarks that have been shown to result in severe to moderate thermal violations (those listed above) and those which operate well below the threshold of 81.8°C. Hence, with this workload, we can also analyze the extent to which high silicon die temperatures and thermal stress, which [64] studied, correlate with global interconnect temperatures.

We collected energy and temperature results for a simulation sample of one billion committed instructions after a fast-forward phase of five billion instructions that skips over the program startup phase. We did not use techniques like SimPoint [77] to choose representative samples because our thermal simulations needed a single, large sampling window that possibly covers multiple phases of benchmark execution and that captures the effects of idling of processor units and buses, which provide dynamic opportunities for wire temperatures to cool down.

3.5.2 Thermal Warmup and Initial Temperatures

As reported in earlier work, it is computationally impractical to simulate long enough for the heat sink temperature to reach steady state, since its thermal RC time constant is significantly larger than that of any on-chip structure [64,91]. Hence, we followed the methodology suggested in [64] to obtain accurate results from our thermal simulations. First, we used the Wattch power/performance simulator to obtain average power consumption values for various on-chip structures [58]. Then, we fed these values to the HotSpot tool to obtain the steady state heat sink temperature, and used this value to initialize the heat sink when running our simulations.
Also, to avoid "cold start" effects during the initial period of our wire temperature simulation, we ran all simulations using our wire model twice. In the first pass, we obtained an approximate steady state temperature value for each wire by estimating the power dissipated in each wire for one billion cycles using the model discussed in Section 3.4.3. In the second pass, we initialized the temperature of each wire of our target bus using its steady state temperature (Equation 3.15) and performed the temperature simulation as described in the next subsection. Note that, using this approach, the initial temperatures of the bus wires will not necessarily be equal, since they depend on the distribution of energy across the wires.

3.5.3 Granularity of Thermal Simulation

After the fast-forward phase, which skips through the unrepresentative initial section of the benchmark program, wire temperatures were set to the steady state temperatures estimated as described in the previous subsection. Then, for the next one billion instructions (our simulation window), we recorded energy and temperature results every 100K cycles. For thermal simulations, the energy dissipated per wire was divided by the time taken for each window ($10^5 / f_{clk}$), and a fourth-order Runge-Kutta (RK4) method was used to solve the differential equations for the thermal-RC network (Eqs. 3.9 and 3.10) to obtain the individual wire temperatures at the end of the interval. The RK4 simulation loop, which was implemented using the method described in [92], iterates a number of times that depends on the interval size (100K cycles) and the thermal RC time constant of the wire. This ensures that each RK4 iteration advances the solution by a time interval $dt$ that is substantially less than the thermal RC time constant.
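The stepping scheme described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: it replaces the full thermal-RC network of Eqs. 3.9 and 3.10 with a single wire node (one thermal resistance to ambient and one thermal capacitance), and the function names, the `factor` parameter (the multiplier the text calls $dt$), and the resistance and capacitance values in the usage note are assumptions. The clock frequency (1.68 GHz), window size (100K cycles), and time constant (3.6171 µs) are the 130 nm values reported in this section.

```python
import math


def rk4_step(temp, h, power, r_th, c_th, t_amb):
    """Advance C*dT/dt = P - (T - Ta)/R by one RK4 step of size h seconds."""
    def dtemp(t):
        return (power - (t - t_amb) / r_th) / c_th

    k1 = dtemp(temp)
    k2 = dtemp(temp + 0.5 * h * k1)
    k3 = dtemp(temp + 0.5 * h * k2)
    k4 = dtemp(temp + h * k3)
    return temp + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def steps_per_window(window_cycles, f_clk, t_rc, factor):
    """Number of RK4 iterations: factor * t_window / t_RC, rounded up."""
    t_window = window_cycles / f_clk
    return math.ceil(factor * t_window / t_rc)


def simulate_window(temp, energy, window_cycles, f_clk, r_th, c_th, t_amb, factor):
    """Wire temperature at the end of one recording window."""
    t_window = window_cycles / f_clk
    power = energy / t_window              # average power over the window
    n_steps = steps_per_window(window_cycles, f_clk, r_th * c_th, factor)
    h = t_window / n_steps                 # step well below the RC constant
    for _ in range(n_steps):
        temp = rk4_step(temp, h, power, r_th, c_th, t_amb)
    return temp
```

With the 130 nm numbers, `steps_per_window(100_000, 1.68e9, 3.6171e-6, 3)` gives 50 iterations per window. Because one window spans roughly 16 time constants, a single isolated node with constant power ends the window very close to its steady-state value $T_a + P \times R_{th}$; in the full model, lateral coupling and window-to-window power changes keep the temperatures moving.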
In this way, each step of the temperature simulation yields sufficiently accurate temperature estimates without the rigor of cycle-by-cycle simulation, which would require huge computation time and memory resources. Using experimentation, we found that setting the value of $dt$ to three (130 nm) and two (45 nm) gave the best tradeoff between simulation time and the nature of temperature characteristics we obtained. For example, with the clock frequency in the 130 nm process (1.68 GHz), the time taken by the processor to execute 100K cycles is $t_{window} = 59.52$ µs and the thermal RC time constant of the wire, calculated using wire geometry parameters in Table 2.2, is $t_{RC} = 3.6171$ µs. For these values, the RK4 simulation should iterate $dt \times \frac{t_{window}}{t_{RC}} = 3 \times \frac{59.52}{3.6171} \approx 50$ times to ensure the best granularity of temperature simulation.

3.6 Experiments and Results

In this section, we present results from simulations using our bus-line energy dissipation and thermal models and discuss their implications.

3.6.1 Energy Dissipation in Processor Buses

In this subsection we show that, in addition to adjacent wire coupling capacitances, energy dissipated in switching transitions between non-adjacent wires also affects bus energy dissipation significantly for current and future technologies. In global signal lines, the wire-aspect ratio (the ratio of wire thickness to wire width) is increasing faster than the wire-spacing ratio (the ratio of inter-wire spacing to inter-layer spacing). This causes the sidewall (inter-wire) coupling capacitance to dominate the area capacitance. In sub-100 nanometer bus lines, the reduced inter-wire distance further causes increased fringing effects with adjacent as well as non-adjacent neighbors of a wire.
With capacitance values we extracted using FastCap for the 130 nm technology node (values are given in Table 2.2) and using our model from Section 3.3 to estimate the coupling energy dissipation in each line, we found that the energy dissipation is underestimated by up to 7.8% in data buses and 7.6% in instruction buses when non-adjacent coupling capacitances are neglected, for bus traffic in the nine benchmarks we analyzed. Results for this experiment are shown in Figures 3.5 and 3.6. Also, we found that, although the non-adjacent coupling capacitance values are decreasing with technology scaling, this energy estimation error remains more or less constant in future technologies. Thus, we conclude that accurate bus energy dissipation models must also consider the influence of non-adjacent coupling capacitances. Previous work did not consider the effect of non-adjacent coupling capacitances and its influence on energy; ours is the first to do so.

Non-adjacent coupling capacitances are especially important to consider when evaluating the benefits of microarchitectural techniques for low-power buses. In current literature, only schemes that aim to reduce energy dissipation due to self and adjacent inter-wire coupling transitions exist. Such schemes can potentially increase the relative contribution of energy dissipated in transitions involving non-adjacent coupling capacitances.

Effectiveness of Low-Power Bus Encoding Schemes

We evaluated the effectiveness of some popular bus encoding schemes like bus-invert (BI) [51], odd/even bus-invert (OEBI) [53], and coupling-driven bus-invert (CBI) [54] on wide data and instruction buses. To our knowledge, this is the first study to report energy dissipation results for microprocessor buses using SPEC benchmarks that represent real-world programs; most previous studies, including the ones cited above, reported energies for random traffic patterns.
Additionally, we also implemented a variant of the BI scheme called segmented bus invert, where the bus is divided into four groups and BI encoding is applied to each group separately. This arrangement requires four extra invert lines that are placed in the four higher order bit positions. In our experiments, BI was implemented with the one invert line at the MSB position (we found this to result in less energy dissipation than when the invert line is at the LSB position), and CBI was implemented with the invert line in the LSB position as mentioned in [54]. OEBI was implemented with two invert lines (LSB as the odd-invert line and MSB as the even-invert line) as described in [53].

Figure 3.5. Total energy dissipated in a 64-bit data bus for various benchmarks. 'Cc1 only' represents the existing energy models which consider only self and adjacent coupling capacitances. 'Cc1+Cc2+Cc3' represents our model that considers self capacitances, adjacent coupling capacitances (Cc1), and two non-adjacent capacitances (Cc2 and Cc3) on each side. The % energy mismatch shown by the line is plotted with respect to the right-hand side Y-axis.

Figure 3.6. Total energy dissipated in a 128-bit instruction bus for various benchmarks. The % energy mismatch shown by the line is plotted with respect to the right-hand side Y-axis.

The total bus energy dissipated for unencoded and encoded data is shown in Figure 3.7. The energy values reported in this plot have been averaged across the nine benchmarks, with each benchmark being simulated for 500 million committed instructions. From the results shown for existing bus models (Cc1 only), we find that all four encoding schemes reduce self energy, with segmented BI being the best. Coupling charge/discharge energy dissipation increases marginally with BI and CBI encoding but reduces somewhat when OEBI encoding is used. Here too, segmented BI shows the best reductions. The amount of energy dissipated due to toggle transitions decreases when any of the four encoding schemes is used, with segmented BI again giving the best results, followed by OEBI, BI, and CBI in that order.

Figure 3.7. Total energy dissipated in a 64-bit data bus with various encoding schemes (unencoded, bus invert, coupling-driven BI, odd/even BI, and segmented BI). 'Self' denotes self energy, 'C/D' denotes the coupling charge/discharge energy, and 'Toggle' denotes the coupling toggle energy dissipation. 'Cc1 only' refers to existing energy models that consider self and adjacent coupling capacitance only, and 'Cc1+Cc2+Cc3' refers to our energy model that considers self, adjacent coupling, and two non-adjacent coupling capacitances.

A significant observation from these results is that existing coupling-aware encoding schemes (like CBI and OEBI) have limited impact for wide data buses.
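The BI and segmented BI schemes compared here can be sketched as follows. This is a minimal Python illustration of the textbook bus-invert rule (invert when more than half of the lines would switch), not the encoder evaluated in this work: the function names are hypothetical, and the transition of the invert line itself is ignored for simplicity.

```python
def hamming(a, b, width):
    """Number of bit positions in which two width-bit words differ."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def bus_invert(prev_lines, word, width):
    """Return (value to drive, invert bit) for plain bus-invert coding."""
    mask = (1 << width) - 1
    if hamming(prev_lines, word, width) > width // 2:
        return (~word) & mask, 1       # more than half would switch: invert
    return word & mask, 0


def segmented_bus_invert(prev_lines, word, width, segments=4):
    """Apply bus-invert independently to equal-width segments of the bus."""
    seg_width = width // segments
    seg_mask = (1 << seg_width) - 1
    encoded, invert_bits = 0, []
    for s in range(segments):          # segment 0 holds the lowest-order bits
        shift = s * seg_width
        prev_seg = (prev_lines >> shift) & seg_mask
        word_seg = (word >> shift) & seg_mask
        enc_seg, inv = bus_invert(prev_seg, word_seg, seg_width)
        encoded |= enc_seg << shift
        invert_bits.append(inv)
    return encoded, invert_bits
```

As a usage example, sending `0x1FF` on an idle (all-zero) 64-bit bus switches only 9 of 64 lines, so `bus_invert(0, 0x1FF, 64)` leaves the word unencoded, whereas `segmented_bus_invert(0, 0x1FF, 64)` inverts the lowest 16-bit segment because 9 of its 16 lines would switch. This mirrors why a smaller effective bus width leads to more frequent inversions on low-activity traffic.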
Furthermore, we observed that the average number of bit transitions between consecutive cycles was very low (much less than half the bus width) for the SPEC benchmarks we analyzed. This is most likely the result of the higher order 32 bits of data not being utilized. Hence the number of inversions was small, even for CBI and OEBI, and most of the time data was transmitted in its original (unencoded) form. Segmented BI performed the best in these situations because, as the effective bus width for each segment was smaller, the number of cycles during which data inversions took place was greater. Thus, overall, while segmented BI encoding resulted in the lowest energy dissipation, OEBI and BI were almost similar in impact, while CBI was significantly worse. Note that none of the coupling-aware schemes we examined yielded improvements on the order of what had been reported earlier for these schemes (36% for OEBI and 30% for CBI with respect to unencoded random data) [53, 54].

When our energy model (considering Cc1, Cc2, and Cc3) was used, all coupling (charge, discharge, and toggle) energies increased, and the trend in charge/discharge energies remained unaffected. For toggle energies, however, we observed that OEBI performed significantly worse than the others. This is clearly the effect of toggles on coupling capacitances between non-adjacent wire pairs. The net effect is that, with our new bus energy model, OEBI and CBI both perform significantly worse than BI and segmented BI. Based on our results, we can conclude that bus-inversion based encoding schemes do not work well for wide buses and for realistic data streams (from SPEC benchmark programs) where the number of bits that transition between consecutive cycles is low.

Impact on Wire Temperature Distribution

The influence of energy dissipation due to non-adjacent coupling capacitances on wire temperature can be illustrated with a simple example of a 5-wire bus like the one shown in Figure 2.3.
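For a five-wire bus such as this, the per-line toggle coupling energy can be tallied with a short Python sketch. It rests on two stated assumptions rather than on the full model of Section 3.3: each pair of wires that switch in opposite directions contributes $C \cdot V_{DD}^2$ to each of the two lines, and pairs in which only one wire switches (charge/discharge) are ignored. The function name and the capacitance values in the usage note are hypothetical.

```python
def toggle_coupling_energy(transitions, coupling_caps, vdd):
    """Per-line coupling energy from toggles only.

    transitions   -- list of +1 (rising), -1 (falling) or 0 (steady), MSB first
    coupling_caps -- dict mapping wire distance (1 = adjacent) to capacitance
    vdd           -- supply voltage
    """
    n = len(transitions)
    energy = [0.0] * n
    for i in range(n):
        for j in range(n):
            distance = abs(i - j)
            if distance == 0 or distance not in coupling_caps:
                continue
            # A toggle is a pair of wires switching in opposite directions.
            if transitions[i] * transitions[j] == -1:
                energy[i] += coupling_caps[distance] * vdd ** 2
    return energy
```

With the pattern `[1, 1, -1, 1, 1]` (the center line falls while the other four rise) and illustrative capacitances `{1: 2.0, 2: 0.5}` in arbitrary units, the center line receives `2 * (2.0 + 0.5)` units of energy at `vdd = 1.0`, while its adjacent neighbors receive `2.0` and the outermost lines `0.5`, so the dissipation is concentrated in the center line.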
Consider transitions on the five bus lines, from the most significant bit (MSB) line to the least significant bit (LSB) line, as follows: ↑↑↓↑↑. The notation ↑ indicates that, in the current cycle, the line charges to $V_{DD}$ from its previous ground state, and ↓ indicates that the line discharges in the current cycle from $V_{DD}$ held in the previous cycle. This set of transitions represents the relative thermal worst case since most of the energy dissipation is concentrated in the center line. Numbering the bus lines from 0 (MSB) to 4 (LSB) and noting that all inter-wire transitions, if any, are toggles, the coupling energy dissipated in each line, estimated using our energy model described earlier in Section 3.3, can be written as follows:

$$E^c_0 = E^c_4 = C_{0,2} V_{DD}^2 = C_{c2} V_{DD}^2$$
$$E^c_1 = E^c_3 = C_{1,2} V_{DD}^2 = C_{c1} V_{DD}^2$$
$$E^c_2 = (C_{0,2} + C_{1,2} + C_{2,3} + C_{2,4}) V_{DD}^2 = 2(C_{c1} + C_{c2}) V_{DD}^2$$

where $C_{i,j}$ represents the coupling capacitance between wires $i$ and $j$. Note that the self energy dissipated in all five wires is the same ($\frac{1}{2}(C_w + C_{rep}) V_{DD}^2$) and hence it contributes equally to temperature rise in all five wires. The energy dissipated in the middle wire, $E^c_2$, is the highest even if $C_{c2}$ is neglected, and hence this wire is likely to have the maximum temperature. Furthermore, if non-adjacent coupling capacitances are non-negligible, the middle wire dissipates much higher energy and its temperature is likely to be even higher.

3.6.2 Correlation between Energy and Temperature

In this subsection, we examine the correlation between energy and temperature characteristics obtained using our model. We report and analyze time-varying energy and temperature profiles for only one benchmark, gcc, for a simulation interval of 10 billion cycles in the 130 nm technology node. We found that other benchmarks exhibited similar behavior; hence these are not reported. The energy and temperature profiles are shown in Figure 3.8.
In this figure, energy and temperature, plotted on the y-axes, have been averaged across the number of bus lines. The temperature plot clearly shows that the average wire temperature continues to rise with time, although the rate of change is not linear; the trend line shown on the plot is only a very coarse approximation. Nevertheless, the results are significant because they show that the average wire temperature increases by about 10 degrees over six seconds of execution of a typical program like gcc on a 130 nm microprocessor. We also observe that short, intermittent cycles of high switching activity can trigger changes in temperature, as evidenced by the regions marked 1 and 2 on the plot. Also, we notice that such bursts of energy dissipation, likely caused by increased bus utilization, cause the temperature rise to 'linger' for a short period of time, as shown by the step-like changes at the beginning of regions 1 and 2.

Figure 3.8. This plot shows average energy dissipation and wire temperature of the bus (averaged across 64 wires, for gcc) for a simulation interval of 10 billion cycles. The continuing temperature rise can be clearly observed. The temperature panel includes a linear trend line, y = mx + c, with slope m = 7.0448e-10.

3.6.3 Final and Peak Wire Temperatures

In this subsection, we present results obtained from simulations using our thermal model. During our simulations, we recorded three types of temperature information: (1) the temperature change in each wire between the start and end of simulation, (2) the highest temperature reached by each wire during the simulation, and (3) the temperature gradient of each wire between its sending and receiving ends. These results are presented next.

We observed that wire temperatures increased significantly over the time interval of simulation for most wires. Figures 3.9 through 3.11 show the wire temperature rise that we recorded for three integer and three floating-point programs, each for one billion committed instructions of execution, for all bits of the 64-bit data bus. The corresponding results for the 128-bit instruction bus are in Figures 3.12 through 3.14. We show detailed results for only these six benchmarks since they exhibit interesting behavior. The highest temperature rise recorded for any wire during our simulation, for the 12 benchmarks we analyzed, is given in Table 3.2. We show results for both 130 nm and 45 nm technologies in the figures and in the table.

The time taken to commit a billion instructions in the pipeline, which is typically on the order of a few seconds, is much longer than the thermal RC time constant of the wire, which is only a few microseconds. Thus our simulation interval is large enough to allow temperatures to settle to their characteristic values. Furthermore, we initialized wire and heat sink temperatures to their steady state values as described earlier in Section 3.5.2, to prevent cold-start effects.

From the knowledge of characteristics of instruction and data traffic, all lines in an instruction bus, which is 128 bits wide (fetch width: 4 instructions), can be considered equally active, while in a load/store data bus, which is 64 bits wide, the lower order 32 bits are expected to be most active due to data value locality. The results shown in Figures 3.9 through 3.14 reflect these observations to some extent. For integer data, we observe that the hottest wires are the ones that carry lower-order bits. One notable exception is gzip, in which all wires show significant temperature rise across the simulation.
This is expected because, when executing gzip, the data bus will carry primarily 8-bit characters packed in the 64-bit bus. Another observation is that, for mcf, the middle wire is the hottest at the end of the simulation interval. For floating-point benchmarks, temperature rise is somewhat evenly distributed across the 64 bits because the higher-order wires, which carry the exponent bits, are also quite active. Also, lucas shows higher temperatures in some lower order bits. Finally, we notice that the highly active wires are likely to end up at higher temperatures when executing integer workloads than when executing floating-point workloads.

During the course of simulation, we observed that wire temperatures rose and fell as bus activity, the number of transitions, and the energy dissipation varied. A three-dimensional plot showing the variation across time and across the lower-order 32 bits of the data bus, plotted for three billion cycles of execution of the gcc benchmark, is shown in Figure 3.15. This plot shows that there are intervals during which wire temperatures rise to higher values due to a sudden rise in energy dissipation and then settle at lower values. Such intervals neither occur synchronously across wires nor are uniformly distributed among them.

Figure 3.9. Plots show the wire temperature rise recorded for benchmarks (a) gcc and (b) gzip for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark. (Axes: temperature rise in K versus wire number, 0 = LSB, 63 = MSB.)

Figure 3.10. Plots show the wire temperature rise recorded for benchmarks (a) mcf and (b) lucas for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

Figure 3.11. Plots show the wire temperature rise recorded for benchmarks (a) ammp and (b) applu for the data bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

Figure 3.12. Plots show the wire temperature rise recorded for integer benchmarks (a) gcc and (b) gzip for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark. (Axes: temperature rise in K versus wire number, 0 = LSB, 127 = MSB.)

Figure 3.13. Plots show the wire temperature rise recorded for benchmarks (a) mcf and (b) lucas for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.

Figure 3.14. Plots show the wire temperature rise recorded for benchmarks (a) ammp and (b) applu for the instruction bus in 130 nm and 45 nm technology nodes over a simulation interval of one billion committed instructions for each benchmark.
Figure 3.15. A three-dimensional plot showing spatial and temporal variations in wire temperature for the lower-order 32 bits of the load/store data bus for the gcc benchmark. (Axes: wire temperature in K, simulation time in units of 100K cycles, and wire number with 0 = LSB.)

Table 3.2 lists the absolute maximum temperatures attained by any wire during the course of simulation. As can be seen, we found that wire temperatures may reach up to 104°C for data bus wires and 89.6°C for the instruction bus in the 130 nm technology node. For the 45 nm node, data bus wire temperature was found to go as high as 123.7°C and instruction bus wire temperature as high as 104.9°C. Note that these values are higher than 100°C, which is the maximum temperature assumed during interconnect design. We also observed that maximum temperature trends for data buses are very similar to those observed earlier for temperature rise. That is, the largest temperature change over the simulation interval occurs for bus wires whose transient temperature also reaches the maximum value, showing that different data bus wires experience varying amounts of thermal stress depending on their location. For instruction buses, the maximum temperatures observed across bus wires were more or less similar. Hence, all instruction bus wires experience similar amounts of transient thermal stress during the simulation interval.

Table 3.2. Maximum wire temperatures (in °C) recorded for data and instruction buses using 130 nm and 45 nm parameters, for the benchmarks ammp, applu, crafty, mcf, gcc, gzip, lucas, mesa, mgrid, swim, twolf, and vortex.

3.6.4 Wire Temperature Gradients

Next, in Figures 3.16(a) and (b), we show the frequency distribution of the wire temperature gradients that we recorded during our simulations, for 130 nm and 45 nm load/store data bus wires. These plots show that, on average across the benchmarks we analyzed, the temperature gradient in this bus can be expected to be between 6 and 15°C for 130 nm technology. For 45 nm technology, temperature gradients between 16 and 34°C were most commonly observed for the same set of benchmarks and simulation sample. During our simulations, the maximum temperature gradient observed was 31°C for 130 nm and 42°C for 45 nm simulations. These wire temperature gradients are the result of two factors: (1) the non-uniform dissipation of Joule heat along the wire length, which is modeled using Equation 3.8, and (2) the difference in temperature of the underlying substrate blocks, which was obtained using HotSpot and applied during our thermal simulation. Temperature gradients across the length of the wire also affect delay. It has been reported that for a 1 mm long wire with the driver in the hot region and receivers in a cooler region, a temperature difference of 10°C results in a 5 ps (≈ 8%) additional delay at the receiver [93].

Figure 3.16. Frequency distribution of maximum wire temperature gradients for (a) 130 nm and (b) 45 nm processor wires, shown as the percentage of cycles in each gradient range for each benchmark and for the average.
3.7 Summary

In this chapter, we presented a unified nanometer-scale bus energy dissipation and thermal model that can help designers monitor energy dissipation and temperature change in individual wires during trace- or execution-driven simulation. In addition to self capacitance, our model incorporates the effects of adjacent and non-adjacent capacitive coupling on bus energy dissipation, the effect of repeater insertion, the effect of lateral heat transfer between adjacent wires, and the effect of inter-layer heat transfer. Unlike existing models, which provide estimates only for total bus energy, our model can estimate the energy dissipated in each bus line; this feature also makes it possible to estimate individual wire temperatures.

Using this integrated model in a first-of-its-kind study, we examined the energy and thermal characteristics of instruction and data buses using an execution-driven simulation of a billion or more instructions of nine SPEC CPU2000 benchmarks. We found that existing bus energy models provide estimates that are about 7-8% less accurate than those of our energy model. This is because they do not account for the effects of coupling between non-adjacent wire pairs of a bus. Our model, which incorporates these effects, is the first of its kind to do so. Our results also show that, in wide instruction and data buses used in modern processors executing SPEC CPU2000 workloads, existing bus encoding schemes show no significant energy benefit due to the nature of data traffic. When non-adjacent coupling effects between wire pairs are considered, energy dissipation savings reduce considerably.

Based on simulations using our thermal model, we found that average wire temperatures in data and instruction buses may rise 10-37°C during a simulation run of only a billion cycles in a 130 nm superscalar processor executing SPEC CPU2000 benchmark programs. This temperature rise is primarily due to heat generation as a result of currents flowing in the wire during bit switching.
Changes in substrate temperature may cause other effects in the temperature profile, which we did not explore in this work. In a future 45 nm technology node, wire temperature rise for the same set of benchmarks and simulation sample was found to be between 20 and 58°C. We observed that instruction and data bus wires attained absolute temperatures in the ranges 80.3-104°C and 97.6-123.7°C in 130 nm and 45 nm processors, respectively, during the course of our simulation, showing that signal lines also attain significant temperatures. Significant wire temperature gradients, with magnitudes between 16 and 25°C, were found to be most common between the sending and receiving ends of the wires during the course of simulation. A notable correlation was found between energy dissipation behavior and wire temperature rise in buses across time; short, intermittent cycles of high energy-dissipating switching activity trigger step changes in temperature.

The impact of these results, especially the energy and temperature profiles of instruction and data buses that fluctuate highly in both time and space, is the following. Since the energy dissipation of the wire roughly represents the square of the time-varying current, fluctuations in the energy mean that a highly varying load is being placed on the power supply network by the driving circuits through which the currents flowing in the wires are drawn. This varying load can cause inductive voltage drops, or $L\,di/dt$ noise. This motivates the need to smooth temporal variations in the energy dissipation of wires with appropriate techniques. Also, the substantial disparity in wire temperatures across the bus motivates schemes that, based on information from interconnect thermal sensors, can migrate bus transmissions dynamically to cooler wires.

CHAPTER 4
DATA- AND TEMPERATURE-DEPENDENT DELAY VARIABILITY MODEL

4.1 Introduction

Rising wire temperatures are becoming an important issue in high-performance bus design, especially in current and future nanometer technology nodes, as the previous chapter showed. Higher temperatures adversely impact wire delays, due to the temperature dependence of metal resistivity, causing timing violations when the end-to-end propagation delay exceeds the designed value. The factors that influence the dynamic propagation delay of a signal transmitted on the wire can be classified into two types: intrinsic factors that are related to the switching activity of the wire and/or its neighbors, and extrinsic factors like process and voltage variations. As shown in the earlier chapters, the temperature distribution along the wire is a function of the switching activity in the wire and hence it is also an intrinsic factor.

In the context of global interconnect lines, temperature variations occur due to two reasons that are both important to study. First, energy is dissipated in a non-uniform manner across the length of the wire. In Chapter 3.4, we showed that the temperature at the sending end of a wire will be higher than that of the receiving end of the wire. In this chapter, we develop a model to estimate the impact of this temperature gradient on the propagation delay of the signal. Substrate temperature gradients, when present, will exacerbate thermal gradient-dependent delay. Second, temperature variations are also non-uniform across time, since the characteristics of programs dictate the amount of switching activity in signal wires and, consequently, the energy dissipated in them and their temperature. A rise in switching activity also increases the wire temperature gradient.

Due to the lack of detailed models, existing early-stage design exploration methods lump together the effects of process, voltage, and temperature (PVT) variations. This results in overly pessimistic and/or incorrect delay estimates.
Even in later stages of the design process, a constant temperature value across the chip is assumed when analyzing the electrical characteristics of devices and interconnects. In reality, given that on-chip power dissipation in devices as well as interconnect is workload-dependent, the temperature distribution within the chip is far from uniform. The constant-temperature assumption therefore leads to a design that runs into problems during validation and necessitates costly re-spins. Using the detailed temperature models developed previously, this chapter examines the impact of data- and temperature-dependent delay variations for various on-chip high-performance processor buses.

The organization of the rest of the chapter is as follows. In Section 4.2, we discuss related work. Following that, in Section 4.3, we describe our models. Then, we present results and discuss them in Section 4.4. Finally, we summarize in Section 4.5.

4.2 Related Work and Our Contributions

This section reviews related work. The impact of increasing interconnect temperatures has been well studied in [21, 38, 69]. However, these studies do not use real data from benchmark programs and hence their estimates are somewhat pessimistic. Also, these models are not amenable to use in microarchitecture-level exploration tools. Recent interest in temperature- and reliability-aware microarchitectures has led to the development of tools [64, 66, 94] and techniques [58, 71] for processor thermal and reliability management. However, these tools do not address an important temperature-related reliability issue in on-chip interconnects: transient faults or timing violations due to temperature-dependent resistivity changes. In contrast, we seek to develop activity-dependent models that estimate the distribution of Joule heat across the length of a wire, the wire temperature gradient across it, and finally, the actual delay due to crosstalk and temperature-induced resistivity changes.
Using these models, we analyze the number of delay violations occurring for different benchmark programs from the SPEC CPU2000 suite in 130 nm and 45 nm processor designs.

In current design methodologies, temperature-related wire reliability problems are identified late in the design cycle and hence their rectification involves substantial cost and effort. This overhead can be avoided by properly accounting for temperature-related effects in early-stage design. To our knowledge, no early-stage microarchitecture exploration tool currently offers the capability of estimating temperature-induced timing violations in high-performance buses; our work is likely the first of its kind to develop such a model. Our model can also be used in temperature-aware delay and skew analysis in clock trees, although we do not examine this aspect in this paper. Specific contributions and key results from this paper are outlined next.

• Using a cycle-accurate microarchitectural simulator, we show that timing violations due to temperature gradients are somewhat likely in 130 nm designs (an average of 2.27 per hundred bus references for an ALU result bus across ten SPEC CPU2000 programs) and that this rate increases in the future 45 nm technology node to 6.20 per hundred for the same processor design.

• We found that, by an optimistic analysis, the performance impact of overcoming temperature-induced timing violations by re-transmitting data will be about 4% in a superscalar design at 130 nm and 11.92% at 45 nm.

• We also found that conventional techniques like bus encoding, which seek to reduce energy dissipation and potentially wire temperatures, have limited impact on alleviating temperature-induced timing violations. Reducing the bus clock frequency had a larger effect, reducing the average error rate to 1.07 per hundred references in the 130 nm processor, compared to encoding, which reduced it to only 1.93 per hundred references.
4.3 Temperature Dependent Delay Variability Model

In this section, we present analytical models for estimating the spatial distribution of Joule heat, temperature, and temperature-dependent delay in RC interconnects. Versions of the well-known energy model for a lumped-RC wire, discussed in Chapter 2.1.2, have traditionally been used in interconnect analysis to estimate energy dissipated due to self and coupling transitions. But this model assumes that Joule heat is dissipated uniformly across the length of a wire and hence leads to a conservative temperature estimate for the wire. Furthermore, it does not capture the spatial distribution of Joule heat, without which the impact of temperature on delay cannot be estimated accurately. In Chapter 3.3.3, we derived a new expression for energy distributed along the length of a wire and validated it using circuit simulation. We also constructed a thermal model and found wire temperature gradients using this model. The effect of this temperature gradient on wire delay is found as discussed next.

4.3.1 Wire Delay Considering Temperature Impact

The propagation delay of a lumped-RC wire considering only data-dependent crosstalk was presented in Chapter 2.1.5. For a distributed RC line of length $L$ partitioned into $n$ subsegments each of length $l$, the Elmore delay $D$ of a signal passing through the line is the following:

$$D = R_d \left( C_r + \int_0^L c_0(x)\,dx \right) + \int_0^L r_0(x) \left( \int_x^L c_0(\tau)\,d\tau + C_r \right) dx, \tag{4.1}$$

where $c_0(x)$ and $r_0(x)$ are the per-unit-length wire capacitance and resistance, respectively, $R_d$ is the driver resistance, and $C_r$ is the receiver capacitance. Since the resistance of a wire segment changes with temperature, we can write $r_0(x) = \rho_0 (1 + \beta \cdot T(x))$, where $T(x)$ represents the temperature profile along the length of the wire, $\rho_0$ is the unit-length resistance at a reference temperature (273 K), and $\beta$ is the temperature coefficient of resistance for copper ($\beta = 3.9 \times 10^{-3}$ per °C).
Substituting in Equation 4.1 and taking the per-unit-length capacitance $c_0$ to be uniform along the wire, we get:

$$D = D_0 + (c_0 L + C_r)\,\rho_0 \beta \int_0^L T(x)\,dx - c_0 \rho_0 \beta \int_0^L x\,T(x)\,dx, \tag{4.2}$$

where

$$D_0 = R_d (C_r + c_0 L) + \left( \frac{c_0 \rho_0 L^2}{2} + \rho_0 L C_r \right). \tag{4.3}$$

$D_0$ is the Elmore delay corresponding to a unit-length resistance at reference temperature. In Equation 4.2, $\int_0^L T(x)\,dx$ represents the area under the temperature curve, denoted as $A$, in a plot of temperature vs. wire length. Let $T(x)$ be a straight line with $T(x = 0) = T_A$ and $T(x = L) = T_B$, $T_A \ge T_B$. The area under $T(x)$ gives the value of $A$. The x-coordinate of the centroid of this region is given by $x_c = \frac{1}{A} \int_0^L x\,T(x)\,dx$ [95]. Thus $\int_0^L x\,T(x)\,dx = x_c \times A$. Note that both $x_c$ and $A$ can be found easily using geometry if $T(x)$ is assumed linear.

Thus, by estimating $T_A = \theta_{i,0}$ and $T_B = \theta_{i,n}$ using the model in Chapter 3.4, and the area under the temperature curve for any given sampling window, we can estimate the actual delay, which includes the effect of temperature-dependent resistance. Using this, we can determine if a timing violation has occurred, as described next.

4.3.2 Wire Delay Variability Considering Crosstalk and Temperature

During early-stage design exploration, the designer's aim is to ensure that the microarchitecture meets all its performance expectations at the target clock frequency. The target frequency itself is decided based on knowledge of typical operating conditions that determine parameters like temperature, and knowledge of process variations that are used to account for deviations from expected values. Based on estimates available from prior work, we assume that the delay can increase by up to 20% due to back end of line (BEOL) process variations and an additional 10% due to voltage drop and temperature variability [96, 97]. Thus, we assume a guard band of 30% for the delay of a global wire due to PVT. Hence $t_{bus\_clk} = 1.3 \times D$.
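As an illustration of the delay expression in Section 4.3.1, the following is a minimal sketch, not the simulator code used in this work. It evaluates the closed form of Equations 4.2 and 4.3 for a linear temperature profile and cross-checks it against a direct midpoint-rule integration of Equation 4.1. The function names and all wire parameter values are hypothetical and chosen only for illustration; temperatures are taken in °C relative to the 273 K reference, and $c_0$ is assumed uniform.

```python
BETA_CU = 3.9e-3  # temperature coefficient of resistance for copper, per deg C (value from the text)


def elmore_delay_linear_profile(R_d, C_r, c0, rho0, L, T_A, T_B, beta=BETA_CU):
    """Closed form of Equations 4.2 and 4.3 for a linear T(x) from T_A (x=0) to T_B (x=L)."""
    D0 = R_d * (C_r + c0 * L) + (c0 * rho0 * L**2 / 2 + rho0 * L * C_r)
    area = 0.5 * (T_A + T_B) * L              # A: area under the temperature curve
    moment = L**2 * (T_A + 2.0 * T_B) / 6.0   # x_c * A: first moment of the trapezoid
    return D0 + (c0 * L + C_r) * rho0 * beta * area - c0 * rho0 * beta * moment


def elmore_delay_numeric(R_d, C_r, c0, rho0, L, T_A, T_B, beta=BETA_CU, n=2000):
    """Midpoint-rule evaluation of Equation 4.1 with uniform c0 and the same linear T(x)."""
    dx = L / n
    wire_term = 0.0
    for k in range(n):
        x = (k + 0.5) * dx
        T = T_A + (T_B - T_A) * x / L
        r = rho0 * (1.0 + beta * T)                 # r0(x) = rho0 * (1 + beta * T(x))
        wire_term += r * (c0 * (L - x) + C_r) * dx  # resistance times downstream capacitance
    return R_d * (C_r + c0 * L) + wire_term


# Hypothetical wire parameters (SI units), for illustration only.
PARAMS = dict(R_d=100.0, C_r=10e-15, c0=200e-12, rho0=5e4, L=5e-3)

hot_driver = elmore_delay_linear_profile(T_A=110.0, T_B=90.0, **PARAMS)
hot_receiver = elmore_delay_linear_profile(T_A=90.0, T_B=110.0, **PARAMS)
check = elmore_delay_numeric(T_A=110.0, T_B=90.0, **PARAMS)
print(f"hot driver: {hot_driver:.4e} s, hot receiver: {hot_receiver:.4e} s, numeric: {check:.4e} s")
```

With the same mean temperature, the profile that is hotter at the sending end gives the larger delay in this sketch, because resistance near the driver charges more downstream capacitance.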
We described earlier in Chapter 2.1.5 the procedure to estimate the worst-case data-dependent delay (Equation 2.4) and the safe clock frequency at which the bus can be operated. From that discussion, we note that not all bus references will trigger the worst case for delay in a bus line, resulting in varying amounts of delay slack across lines and also across time. As such, the actual delay for a line estimated using Equations 4.2 and 4.3 depends on the wire temperature gradient and the nature of its crosstalk with its neighboring wires. If the neighbors both switch oppositely with respect to the line, the delay will be $t_{wc}$ and, if the temperature gradient is sufficiently high, the actual delay may exceed $t_{bus\_clk}$. This is a timing violation. Note that, when this occurs, the temperature impact on delay overwhelms the 30% guard band that we have allocated to account for worst-case PVT variations.

Given the current and previous data to be transmitted on the bus, we do the following in our cycle-accurate simulator to determine if a temperature-induced timing violation has occurred for the bus as a whole. First, for each wire in the bus that changed state from the previous cycle, we compute the delay slack by examining coupling transitions with respect to its neighbors and determining its nominal delay $t_p$. Then, depending on the Joule heat dissipated and the thermal gradient across its length, we determine its actual delay using Equations 4.2 and 4.3. Finally, we consider a temperature-induced timing violation to have occurred for the bus as a whole if the actual delay in any of the lines exceeds $t_{bus\_clk}$. We report the number of such violations per hundred bus references in our results.

4.4 Results and Discussion

We study the delay variability of the 64-bit result bus that runs over the integer and floating-point execution units of the processor.
This bus was chosen since it is highly capacitive and dissipates a substantial amount of energy in the processor core [45, 58]. Also, it is routed over the execution unit consisting of ALUs and register files that are highly active; hence, the substrate temperature under the result bus will also be significantly higher than in other units. The result bus is also on the critical path and will be impacted most by any temperature-dependent timing violations, which may require retransmission of the data to maintain correct program execution.

4.4.1 Maximum Wire Temperatures and Gradients

The maximum wire temperatures that we recorded during the simulation of the result bus are shown in Table 4.1 for the 130 nm and 45 nm technology nodes. It can be seen that the wire temperature can be as high as 103°C in a 130 nm processor and about 117°C in a 45 nm processor. Note that the design temperature for global wires was assumed to be 100°C, but significantly higher temperatures were observed during our simulation. As mentioned earlier, higher wire temperatures increase wire delays by about 5% for every 20°C rise in temperature [38].

Next, in Figure 4.1, we show the distribution of the maximum wire temperature gradient that we determined using our model in Chapter 3.4. This plot shows that, on average, the maximum temperature gradient in a wire can be expected to be between 16 and 24°C. These temperature gradients across the length of the wire also affect delay. It has been reported that for a 1 mm long wire with the driver in the hot region and receivers in a cooler region, a temperature difference of 10°C results in a 5 ps (≈ 8%) additional delay at the receiver [93].

Having shown that significant wire temperatures and gradients occur when executing the benchmark programs in our workload, we examine next whether these result in timing violations in the ALU result bus.

[Table 4.1. Maximum wire temperatures recorded for the ALU result bus in the 130 nm and 45 nm nodes, per benchmark. The rotated table body did not survive text extraction.]

[Figure 4.1 plot, titled "Distribution of Maximum Wire Temperature Gradient in 130 nm Result Bus Wires"; bins <6°C, 6-15°C, 16-24°C, >24°C; x-axis: gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, Avg.; y-axis: percentage number of cycles.]

Figure 4.1. Distribution of maximum wire temperature gradients in result bus wires for the 130 nm processor.

4.4.2 Frequency of Timing Violations

Figures 4.2 and 4.3 show the temperature-induced delay violations per hundred bus references for a 130 nm and a 45 nm processor, respectively, in the ALU result bus using our temperature-dependent delay model when running different benchmarks. The base case, in which the processor operates at its nominal clock frequency (1.68 GHz for 130 nm and 11.51 GHz for 45 nm), is represented by the data series labeled "@ Nominal Freq." in the two plots. For this case, we observe that the average error rate across our benchmark set was 2.27 per hundred bus references for the 130 nm design. For the same processor in the 45 nm technology node, the error rate increased to 6.2 per hundred references on average. Some benchmarks like gcc, gzip, bzip2, and art show higher than average error rates because they had higher wire temperatures and/or gradients than other benchmarks, as seen in the results of the previous subsection. It should be noted that the timing violation error rates reported here represent temperature-induced violations only; other factors like process variations and voltage drops are not included, as mentioned earlier in Section 4.3.2.
In fact, our results show that, in many cases, the extra temperature-induced delay can easily overwhelm the voltage drop and process variation safety margins allocated by a designer.

[Figure 4.2 plot, titled "Temperature-Induced Delay Violations in 130 nm Wires"; series: "@ Nominal Freq.", "@ Nominal Freq. with OEBI Encoding", "@ 0.9 x Nominal Freq."; x-axis: gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, Avg.; y-axis: delay violations per hundred references.]

Figure 4.2. The number of temperature-induced violations per hundred bus references occurring across ten benchmark programs in a 130 nm processor.

[Figure 4.3 plot, titled "Temperature-Induced Delay Violations in 45 nm Wires"; same series and axes as Figure 4.2.]

Figure 4.3. The number of temperature-induced violations per hundred bus references occurring across ten benchmark programs in a 45 nm processor.

Most superscalar processor designs adopt overly conservative methods to work around problems related to dynamic delay variability, such as allocating an extra pipeline stage to account for wire propagation delays [44]. Temperature-distribution-aware delay models, such as the one we have developed, can help explore the extent of the timing violation problem during early-stage design. Using this knowledge, a designer can implement schemes that address delay variability issues and avoid over-design. For example, results presented in the next subsection show that, by increasing the overall bus clock cycle time by only 10%, the error rates can be halved for a 130 nm design.
As mentioned earlier in Section 4.3.2, not all bus references, even those incurring worst-case delays due to peak crosstalk, are likely to trigger timing violations. Cycles in which peak crosstalk conditions occur in a wire, coupled with high Joule heat dissipation and large temperature gradients, have a high probability of causing a violation. Violations can occur during non-peak crosstalk conditions too, if wire temperatures and/or gradients are large enough. The following results attempt to characterize how temperature-induced delay variations are distributed across various crosstalk conditions.

[Figure 4.4 plot, titled "Distribution of Crosstalk Conditions in ALU Result Bus"; series: 1+4r delay, 1+3r delay, 1+2r delay, 1+1r delay, 1+0r delay; x-axis: gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, Avg.; y-axis: frequency of occurrence.]

Figure 4.4. This plot shows the frequency of occurrence of five different crosstalk conditions on the bus. See Section 4.3.2 and Table 2.1 for an explanation of these crosstalk conditions. The crosstalk condition determines the actual propagation delay without considering thermal effects.

Figure 4.4 shows the frequency with which different crosstalk conditions occur on the ALU result bus for the programs we analyzed. This is at nominal temperature. It can be seen that the peak crosstalk condition labeled "1+4r delay" occurs only about 10% of the time on average across the benchmark set. The dominant condition is "1+2r delay", which occurs about 40% of the time. Next, in Figure 4.5, we show the distribution of temperature-induced delay violations across the different crosstalk conditions. As the figure shows, crosstalk conditions "1+1r delay" and "1+0r delay" contribute a very low percentage (<1%) of the total number of delay violations. Other cases have more significant contributions, suggesting that eliminating or reducing these crosstalk conditions can potentially reduce delay variability.
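The five crosstalk conditions discussed above can be tallied from a bus trace with a few lines of code. The sketch below is only illustrative and rests on assumptions, since Table 2.1 is not reproduced in this section: it uses the commonly used model in which each neighbor adds 0, r, or 2r to a switching wire's delay when it switches in the same direction, stays quiet, or switches in the opposite direction, and it treats the positions beyond the two edge wires as quiet shield lines. The function names are hypothetical.

```python
from collections import Counter


def crosstalk_class(prev, curr, i):
    """Return k for the '1+kr delay' class of wire i, or None if wire i did not switch."""
    def delta(j):
        # Positions beyond the bus edges are treated as quiet shields (assumption).
        if j < 0 or j >= len(curr):
            return 0
        return curr[j] - prev[j]  # +1 rising, -1 falling, 0 no change

    d = delta(i)
    if d == 0:
        return None
    # Each neighbor contributes 0 (same direction), 1 (quiet), or 2 (opposite direction).
    return abs(d - delta(i - 1)) + abs(d - delta(i + 1))


def crosstalk_histogram(trace):
    """Count how often each '1+kr delay' class occurs over consecutive bus words."""
    counts = Counter()
    for prev, curr in zip(trace, trace[1:]):
        for i in range(len(curr)):
            k = crosstalk_class(prev, curr, i)
            if k is not None:
                counts[k] += 1
    return counts


# Tiny hypothetical 3-bit trace: all wires rise, then the two outer wires fall.
trace = [[0, 0, 0], [1, 1, 1], [0, 1, 0]]
hist = crosstalk_histogram(trace)
for k in sorted(hist, reverse=True):
    print(f"1+{k}r delay: {hist[k]} occurrence(s)")
```

Normalizing such a histogram by the total count gives the kind of per-benchmark distribution plotted in Figure 4.4.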
[Figure 4.5 plot, titled "Percentage of Temperature-Induced Delay Violations Caused Under Various Crosstalk Conditions for 130 nm"; series: 1+4r delay, 1+3r delay, 1+2r delay, 1+1r delay, 1+0r delay; x-axis: gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, Avg.; y-axis: percentage of total delay violations.]

Figure 4.5. The percentage of temperature-induced delay violations that correspond to a given crosstalk condition.

From the above discussion, it can be argued that the impact of temperature-dependent delay can be reduced by reducing energy dissipation and hence temperature. We examined two methods of reducing power: (1) a static design-time technique that uses a lower bus clock frequency and (2) a dynamic low-power bus encoding scheme called odd/even bus-invert (OEBI) that reduces toggle transitions [53]. The former is represented by the data series labeled "@ 0.9 x Nominal Freq." and the latter by "@ Nominal Freq. with OEBI Encoding" in Figures 4.2 and 4.3. We observe that slowing the bus down reduces delay violation rates more than applying the encoding scheme. This is because reducing the bus clock frequency has two outcomes, both of which contribute to reducing wire temperature: (1) it slows down the processor, resulting in a lower number of bus references per unit time, and (2) it increases the clock cycle time over which bus switching energy is dissipated. This combined effect reduces wire power dissipation and hence lowers wire temperatures. In contrast, the encoding scheme only reduces the total amount of bus switching energy dissipated in a cycle and does not affect the cycle time. Hence its impact on wire temperature is smaller. Although the OEBI encoding scheme is designed to reduce the number of toggle transitions in wires, it has the side effect of increasing the number of coupling charge/discharge transitions.
Thus, in the context of crosstalk, an OEBI-encoded stream will have more "1+2r delay" cases. We observed earlier that a somewhat significant number of temperature-induced violations are possible for this case, and this may have contributed additionally to the ineffectiveness of OEBI in reducing error rates. We also observe that frequency reduction is less effective at the 45 nm node than at the 130 nm node.

4.4.3 Performance Impact

Delay violations, if unchecked, will impact the performance of the processor, since an extra cycle is required to retransmit the data on the result bus. Also, dependent instructions may need to wait longer for dependencies to be resolved, and this may cause pipeline stalls. Table 4.2 shows the instructions-per-cycle (IPC) degradation observed across our benchmark set; the average performance degradation was 4.08%. Note that this is an optimistic estimate since we have assumed that the re-transmission is not affected by delay violations, which is strictly valid only if the bus has cooled down compared to its state during the previous transmission. Our focus in this work is not to implement a dynamic scheme that inserts an appropriate number of wait cycles to cool the bus after a delay violation is detected. However, such a scheme would only cause the data re-transmission to wait longer than what we have assumed here. Hence, our IPC degradation estimates are lower-bound values. In reality, since the operating clock frequency at 45 nm is much higher than at 130 nm, the performance impact will be much higher at the smaller technology node. Our simulations with 45 nm technology parameters found that the average performance degradation across the ten benchmarks was 11.92% (Table 4.2).

4.5 Summary

This chapter presented models for estimating the Joule heat and wire temperature across the length of a global wire, and for determining the resulting temperature-dependent delay impact.
We showed that temperature gradients exist between the sending and receiving ends of a wire and that this may lead to dynamic delay variations that can exceed design margins. We used our models to explore the extent of temperature-induced delay violations that may occur in the ALU result bus of a processor in the 130 nm and 45 nm technology nodes using real data from ten SPEC CPU2000 programs.

[Table 4.2. Performance impact expressed as percentage IPC degradation for the 130 nm and 45 nm nodes, per benchmark. The rotated table body did not survive text extraction.]

Microarchitectural simulation results show that delay violations due to temperature gradients are somewhat likely in 130 nm designs, with an average of 2.27 per hundred bus references for the ALU result bus. In the future 45 nm technology node, the error rate was found to increase to 6.20 per hundred for the same processor design. Commercial 130 nm processors adopt techniques like an extra pipeline stage to combat the influence of dynamic delay variations in wires. However, this leads to over-design. Temperature-aware delay models like the one we have developed can be used to explore the design space efficiently and avoid over-design. We also found that, by an optimistic analysis, the performance impact of overcoming temperature-induced delay violations by re-transmitting data will be about 4.1% in a superscalar design at 130 nm and about 11.9% at the 45 nm technology node. We also found that conventional techniques like bus encoding, which seek to reduce energy dissipation and potentially wire temperatures, have limited impact on alleviating temperature-induced delay violations. Reducing the bus clock frequency had a larger effect, reducing the average error rate to 1.07 per hundred references, compared to encoding, which reduced error rates to only 1.93 per hundred references.
CHAPTER 5
ACTIVITY-AWARE ENERGY AND TEMPERATURE OPTIMIZATION

With increasing energy dissipation and wire temperature in processor bus wires and the inability of existing low-power encoding schemes to address these problems adequately, novel approaches need to be examined. This chapter examines a family of such energy-efficient techniques that rely on data statistics and a first-of-its-kind optimization methodology to reduce bus wire temperatures [98].

5.1 Introduction

On-chip wires are a major impediment to realizing the performance gains that motivate CMOS technology scaling in integrated circuits. At smaller technology nodes, transistors become faster and somewhat more energy-efficient, but wires become slower because their smaller cross-sectional area increases their resistance. To counter this, wires are scaled less aggressively than transistors. However, this leads to taller and thinner wires that exacerbate parasitic effects like inter-wire coupling capacitance, thus leading to relatively more energy dissipation when wire switching charges or discharges these capacitances. Global signal-carrying wires already contribute a major portion of total chip power dissipation: about 34% in an Intel 130 nm microprocessor [4]. As a result, rising wire temperatures are becoming an important issue in high-performance processor design, especially in current and future nanometer technology nodes, since higher temperatures can impact wire delays and electromigration reliability [21, 66, 99].
Wires routed in global metal layers, like those that constitute address, instruction, data, and ALU result buses, are much more susceptible to higher temperatures for the following reasons: (1) with higher clock frequencies, the amount of energy dissipated in the wire as Joule heat increases compared to the energy dissipated in the repeaters [100]; (2) they are furthest away from the substrate, which is attached to the heat sink, and they are surrounded by low-K dielectrics that have poor thermal conductivity, resulting in inefficient heat removal; and (3) their relatively large geometries result in higher thermal capacitance, i.e., the ability to retain heat. Rising wire temperatures increase wire delays by about 5% for every 20°C rise in temperature [38]. Wire temperature gradients across the length of the wire also affect delay. It has been reported that for a 1 mm long wire with the driver in a hot region and receivers in a cooler region, a temperature difference of 10°C results in a 5 ps (≈ 8%) additional delay at the receiver [93].

5.1.1 Need for Energy and Temperature Aware Bus Design

Real workloads cause bus traffic (in instruction, data, and address buses) that exhibits substantial spatial and temporal locality and value redundancy. Switching activities are therefore not random. Further, there is a high degree of correlation between switching (self and coupling) activities of traffic in different execution regions of the same benchmark and across different benchmarks. These characteristics can be exploited using value-aware design of encoding schemes. Previous techniques are typically (inversion-based) dynamic encoding schemes that support a set of encoding modes, one of which is dynamically chosen at run-time in a given cycle in an attempt to reduce bus energy. These suffer from several drawbacks.
First, the encoding modes supported are effective only for random or worst-case (highly-changing) traffic, which is not what realistic workloads produce. Such value obliviousness limits their effectiveness. We present results showing average energy reductions for dynamic encoding schemes to be only 4.19% (5.32%) at best for data (instruction) traffic across SPEC CPU2000 benchmarks. Second, being dynamic, they incur a latency overhead in encoding and decoding and extra area for hardware and control lines. Also, as several earlier works have demonstrated, the efficacy of inversion-based encoders falls rapidly as bus width increases [101, 102], and bus partitioning schemes have been proposed to address this issue [103]. However, with partitioned buses, the number of extra lines required for control signals increases, which restricts their attractiveness. Third, previous schemes attempt to reduce either self or coupling energy, not total bus dynamic energy. Hence, their effectiveness will change as the ratio of self to coupling energy changes with technology scaling. Finally, energy-aware and temperature-aware design of high-performance buses are only loosely related. Reducing energy (switching activity) through encoding reduces only the average temperature of a wire ($t_{avg}$), since this depends on the total energy dissipated over time, which encoding lowers. However, existing encoding techniques do not explicitly reduce the maximum temperature of wires ($t_{max}$), since this depends not only on the amount of energy dissipated in the wire itself but also on that dissipated in its neighbors. For example, a low-activity wire (victim) with highly-active neighbors (aggressors) experiences a rise in temperature due to thermal coupling [73]. The effects of thermal coupling can exacerbate electromigration and other related reliability problems in high-performance bus wires.
Further, due to data locality, a few bus lines are highly active most of the time, and this makes them more susceptible to temperature-induced failures. Such problems can be remedied by combining encoding that reduces $t_{avg}$ with static bit reordering or permutation that seeks to reduce $t_{max}$ by minimizing thermal coupling.

5.1.2 Key Contributions and Results

We evaluate several possible ways of signaling a bit value at design time, and then choose, based on traffic value characteristics, exactly one signaling mode for each bit statically to support in hardware to minimize total bus dynamic energy. We also consider all possible ways of mapping bits to bus lines (bit ordering or permutation), and then choose, again depending upon traffic value characteristics, exactly one bit ordering statically at design time to support in hardware to minimize total bus dynamic energy. The combination of a particular way of signaling different bits and ordering them on the bus constitutes a static encoding scheme. We present an integer linear program (ILP) methodology that evaluates $q$ possible bit signaling modes and all possible bit orderings for an $n$-bit bus (i.e., it evaluates a total solution space of $q^n \times n!$ encoding modes) based on traffic value characteristics and then chooses an optimal encoding mode that minimizes total bus (self + coupling) dynamic energy. This selection is done at design time using data from microarchitectural simulations, and the ILP problems are solved optimally in a matter of minutes. Since only one encoding mode is statically supported in hardware, encoding/decoding (latency, area, and energy) overhead is virtually non-existent and no control lines are needed.
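To give a feel for the size of the $q^n \times n!$ solution space mentioned above, the following minimal sketch simply evaluates that expression. The function name is hypothetical, and the only assumption is that every bit may independently take any of the $q$ signaling modes and that every bit ordering is allowed.

```python
import math


def encoding_space_size(n_bits: int, q_modes: int) -> int:
    """Number of static encodings: q^n signaling choices times n! bit orderings."""
    return q_modes ** n_bits * math.factorial(n_bits)


if __name__ == "__main__":
    # Illustrative bus widths only; q = 5 matches the number of candidate
    # signaling modes used later in this chapter.
    for n in (4, 8, 16, 64):
        size = encoding_space_size(n, 5)
        print(f"n = {n:2d}: about 10^{len(str(size)) - 1} encodings")
```

Even for small bus widths the count grows far too quickly for exhaustive enumeration, which is the motivation for posing the selection as an ILP.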
Since there is substantial correlation between switching characteristics across benchmarks, our static encoding scheme optimized for one set of training benchmarks works very well for a different set of test benchmarks. We refer to this as general-purpose optimization; in this case, we obtain 20.04% (38.78%) average total bus energy reductions with our best scheme for data (instruction) buses. With increasing degrees of customization (suitable for particular application domains or embedded systems), effectiveness improves: we obtain average bus energy reductions of 22.79% (40.77%) for workload-specific and 30.2% (52.1%) for program-specific optimization scenarios for data (instruction) buses. These average percentage bus energy reductions for our static encoding schemes are 5 to 10 times better than those of existing, more complex dynamic encoding schemes.

We present a new way of bit signaling based on Markov models. Markov models have been used in a variety of situations (e.g., branch prediction and instruction compression), but never in the context of bus encoding or low-power bus design.

We show that lowering bus energy, even significantly as with our static encoding schemes, does not necessarily lower peak wire temperature (the highest temperature attained by a bus wire during a program run); in fact, it may often increase it slightly. To address this, we present a novel method of efficiently exploring the trade-off space between peak wire temperature and total bus dynamic energy using a steady-state wire temperature model. Based on this, we present a new method of introducing thermal constraints into our energy optimization methodology that allows a designer to trade off peak wire temperature against total bus dynamic energy as desired.
For this thermally-constrained, energy-optimal static encoding scheme, we then perform simulations using a detailed per-wire bus thermal model to determine the actual reductions in peak temperature, which we find to be significant. For example, by sacrificing approximately 50% of the energy savings provided by the thermally-unconstrained, energy-optimal version of our scheme, we obtain peak wire temperature reductions of up to 12.26°C (12.96°C), and 8.03°C (9.24°C) on average, for data (instruction) buses, while at the same time providing significant average energy savings: 14.24% (16.17%) for data (instruction) buses (still much better than previous work). No previous work has attempted thermally-constrained energy optimization of buses. A recently proposed spreading encoding technique, which targets only peak wire temperature reduction and does not perform any energy optimization, has a number of drawbacks: the latency, hardware, and energy overhead of a crossbar switch network, and the use of a counter. We also find that, for the same benchmarks, it does not provide as much temperature reduction. Finally, if needed, appropriate dynamic bus encoding schemes and the spreading technique for temperature reduction can be applied after our static encoding schemes to further optimize bus energy and temperature. In this sense, our work is orthogonal to, although much more effective than, these previous works.

The rest of this chapter is organized as follows. In Section 5.2, we discuss related work. Next, in Section 5.3, we discuss our methodology. Following that, in Section 5.4, we present our techniques. Then, we present results in Section 5.5. Finally, we summarize in Section 5.6.
5.2 Related Work

Prior work on low-power bus design can be classified into three categories: (1) memory bus encoding schemes that reduce only self transitions, many of which are surveyed in [86], (2) on-chip bus encoding schemes that target both self and inter-wire coupling energy reduction [53, 54, 104], and (3) wire permutation techniques like those proposed in [105-109] that seek to minimize coupling energy. Memory bus and on-chip bus encoding schemes are dynamic in nature, whereas wire permutation techniques are static.

Our proposed optimization approach differs from the prior work discussed above in many ways. First, the wire permutation schemes discussed in [105-107] optimize only inter-wire coupling energy, whereas our scheme combines the benefits of signaling that reduces self transitions with permutation that seeks to minimize coupling energy. In contrast to the optimization technique suggested in [108], our work considers a wider array of signaling schemes and solves the combined signaling and permutation problem optimally, while they use a greedy algorithm. This contributes to better results with our optimization technique. Compared to the address bus ordering scheme proposed in [109], which can be applied to 8-bit buses only, our scheme can be applied to any bus regardless of bus width or transmitted data. Furthermore, their optimization uses simulated annealing, whereas we solve the problem optimally using integer linear programming, for much larger bus sizes and with comparable time complexity. Our optimal static encoding scheme also results in much better energy reductions than well-known dynamic low-power bus encoding schemes.

Most importantly, our technique incorporates a thermal optimization methodology for buses, which has not been addressed by any previous work.
Rising wire temperatures are becoming an important issue in high-performance processor design, especially in current and future nanometer technology nodes [21, 66]. To analyze temperature-related issues, microarchitecture-level thermal models like HotSpot [64, 66] have been proposed to estimate substrate (active-layer) temperatures. Interconnect thermal models have also been proposed recently [71]. It has been shown that, in global-layer interconnects, activity-dependent Joule heat dissipation in the metal leads to thermal coupling between adjacent wires, causing the maximum wire temperature to rise beyond safe design limits [73].

A recent work proposed a methodology called thermal spreading encoding to reduce wire temperatures [110]. In that work, data is bit-shifted periodically before being transmitted on the bus, in an attempt to equalize wire temperatures across the bus by averaging out the Joule heat dissipated across all lines. This technique does not reduce energy dissipation, since the coupling energies dissipated in the bus lines remain more or less the same after each shift. Furthermore, it does not alleviate the problem of temperature rise due to thermal coupling between wires. In contrast, our work addresses both these issues through the use of bit reordering instead of circular shifting. Spreading encoding, as discussed in [110], is a dynamic technique and uses an n × n crossbar for an n-bit bus and control logic (counters, etc.) to generate periodic shift signals. Our technique is completely static, incurs negligible overhead, and achieves much better temperature reductions.

5.3 Methodology

We used the SimpleScalar/Alpha microarchitecture-level simulator to design and evaluate our techniques [67]. The Alpha 21264 architecture modeled by this simulator uses a 64-bit (load/store) data bus between the processor and L1 data cache and a 128-bit instruction bus (fetch width = 4) between the processor and L1 instruction cache.
Since we have assumed our processor implementation technology to be 130 nm, the clock rate was taken to be 1.68 GHz. We used little-endian Alpha executables of all 26 benchmarks from the SPEC CPU2000 suite with the ref input set and ran our simulations on a shared Linux cluster. We selected the SPEC suite as our target workload since pre-compiled little-endian executables for our target platform (Alpha 21264) were readily available for this suite from the SimpleScalar Website [76]. However, our optimization methodology is equally applicable to other application and benchmark suites.

We divided the 26 SPEC benchmarks into a training set and a test set, with 13 programs in each set chosen arbitrarily. The training set comprised gzip, vpr(route), gcc, crafty, gap, vortex, wupwise, mgrid, mesa, art, facerec, lucas, and sixtrack, and the test set comprised mcf, parser, eon, perlbmk, bzip2, twolf, swim, applu, galgel, equake, ammp, fma3d, and apsi. For these benchmarks, we used the 100 million single simulation point (SimPoint) sample to collect data for our analysis [77, 78].

5.3.1 Target Scenarios

The three scenarios that we consider are, in order of increasing degree of customization, general-purpose, workload-specific, and program-specific. We consider these scenarios to show that our value-aware optimization techniques work well across all of them. Specific details of the analysis, design, and test steps for these scenarios are shown in Table 5.1 and are elaborated next.

Step     | General-Purpose | Workload-Specific | Program-Specific
Analysis | Collect energy/cost matrices from SimPoint samples of the 13 training set programs and aggregate them. | Same as general-purpose. | Collect energy/cost matrices from SimPoint samples of each program individually.
Design   | Obtain the static encoding scheme using the CPLEX ILP optimizer. | Same as general-purpose. | Same as general-purpose.
Test     | Apply the static encoding scheme on SimPoint samples of the 13 test set programs. | Apply the static encoding scheme on a sample of 100M committed instructions that does not overlap with the SimPoint sample for the 13 training set programs. | Apply the static encoding scheme on the same SimPoint sample used in the analysis step.

Table 5.1. Optimization scenarios considered in this work.

Analysis Step: Data Collection and Aggregation

We consider several possible ways of signaling a bit value, with exactly one signaling mode for each bit chosen statically at design time depending on traffic value characteristics to minimize total bus dynamic energy. We also consider all possible ways of mapping bits to bus lines (bit ordering or permutation) and then choose exactly one bit ordering statically at design time, again depending on traffic value characteristics. Hence, in the analysis step, we collect energy information for all possible bit signalings and reorderings for all pairs of wires; these are represented in the form of energy/cost matrices whose elements are denoted $e_{l,m}[i][j]$, $\{l, m\} \in \{0, \ldots, q-1\}$, $\{i, j\} \in \{0, \ldots, n\}$, where $q$ is the number of signaling mode choices that we consider. These signaling modes are discussed in detail in the next section.

Each element $e_{l,m}[i][j]$ is obtained by adding two components, both of which are collected using the bus line energy dissipation model [81] in the cycle-accurate simulator for our target buses: the coupling energy $c_{l,m}[i][j]$ dissipated when bits $i$ and $j$, signaled using modes $l$ and $m$, respectively, are placed next to each other on the bus, with $j$ being the right-adjacent neighbor of $i$, and one-half of the self energies $s_{l,m}[i]$ and $s_{l,m}[j]$ of the bits when they are signaled using modes $l$ and $m$, respectively.

When individual energy/cost matrices need to be aggregated across benchmarks $(B_0, B_1, \ldots, B_{13})$, as required in the general-purpose and workload-specific optimization scenarios (see Table 5.1), we add the corresponding elements of the matrices across all benchmarks:

\[ e_{l,m}[i][j] = \sum_{b} e^{B_b}_{l,m}[i][j], \quad \forall\, l, m, i, j. \tag{5.1} \]

Design Step: Integer Linear Programming

We use ILOG CPLEX 9.0, a commercial mathematical programming optimizer, to solve the ILP problems [111]. CPLEX provides a C++ interface and a callable library that facilitates reading input files (containing our energy/cost matrices), examining candidate solutions, and re-solving the problem after adding appropriate constraints. To improve solution times, we also added a greedy approach to find subtours at each node and included elimination constraints for such subtours in our ILP.

Test Step: Getting Results

After the static encoding techniques are designed, results are collected for the benchmarks/samples mentioned in Table 5.1, depending on the scenario being considered. The effectiveness of our optimization methodology depends on the degree of similarity between the training and test benchmarks/samples. To probe the extent of similarity, we calculated the correlation coefficient $r_{xy}$, with $x$ representing the test set energy matrix linearized into a vector and $y$ representing the training set energy matrix also linearized into a vector, using MATLAB for the various signaling schemes listed in Section 5.4.1. These values are shown in Table 5.2. The correlation of two variables reflects the linear dependence between them, i.e., it provides an estimate of how well the value of one variable can be predicted from the value of the other. If $r_{xy}$ is close to unity, the variables are strongly correlated, which we find is the case with our training and test set coupling energy values for both the general-purpose and workload-specific optimization scenarios.

                     r_xy for signaling mode
Optimization type    org      inv      trs      itr      mm
General-purpose      0.9602   0.9602   0.9609   0.9609   0.9451
Workload-specific    0.9644   0.9644   0.9687   0.9687   0.9610

Table 5.2. Correlation coefficients $r_{xy}$ between test and training set data for the various signaling schemes discussed in Section 5.4.1.
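The aggregation and similarity check described in this section can be sketched as follows. This is only an illustration in Python rather than the MATLAB used in this work; the function names and the tiny 2 x 2 matrices are hypothetical, and the coefficient computed is the standard Pearson correlation of the two linearized matrices.

```python
import math


def aggregate(matrices):
    """Element-wise sum of equally sized per-benchmark energy/cost matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(cols)]
            for i in range(rows)]


def flatten(matrix):
    """Linearize a matrix into a vector, row by row."""
    return [value for row in matrix for value in row]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up matrices standing in for per-benchmark energy data.
    training = aggregate([[[1.0, 2.0], [3.0, 4.0]], [[0.5, 1.0], [1.5, 2.0]]])
    test = [[2.0, 4.1], [5.9, 8.0]]
    print(pearson_r(flatten(test), flatten(training)))
```

A value near 1 for real data would indicate that an encoding tuned on the training matrices is likely to transfer to the test traffic.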
Since the $r_{xy}$ values are close to 1, our training and test sets are well correlated.

5.3.2 Bus Layout and Wire Geometry

We assume a standard model of a bus consisting of a sequence of $n + 2$ parallel, minimum-width, minimum-spaced, identically-dimensioned, co-planar wires $(W_{n+1}, W_n, \ldots, W_1, W_0)$ from left to right, where $W_1, W_2, \ldots, W_n$ are signal lines and $W_0$ and $W_{n+1}$ are power/ground lines that act as shields. The bus is assumed to use static logic; therefore, it retains a previously transmitted value until a different one is transmitted. We assume the bus to be 6 mm long, routed in the topmost metal layer, and buffered by identical repeaters spaced equally apart in a microprocessor fabricated in the 130 nm technology node. This global interconnect length is typical in many modern microprocessor floor plans [112]. A uniform repeater insertion methodology was used in this bus to ensure that the propagation delay did not exceed one clock cycle [46]. Several earlier works have also used this repeater model to evaluate buses. Wire geometry parameters were obtained from ITRS [1], and we used FastCap, a three-dimensional capacitance extraction program, to estimate the parasitic capacitances of each wire [7].

5.4 Static Techniques for Bus Energy and Temperature Optimization

In this section, we present three optimization techniques for designing static encoding schemes for on-chip signal buses and minimizing energy dissipation and wire temperature based on their value characteristics.

5.4.1 Choice of Signaling Modes

We use five candidate signaling modes in our optimization technique, one of which is selected for each bit: original (org), inverted (inv), transition signaling (trs), inverted transition signaling (itr), and Markov model signaling (mm).
In inv, the data on the bit line is always transmitted in inverted form; in trs, the XOR of the previous and current original values of the bit is transmitted; and in itr, the XNOR of the previous and current original values of the bit is transmitted. We chose candidate signaling modes based on three characteristics: (1) potential to reduce self switching energy, (2) potential to reduce coupling energy with neighboring bits, and (3) potential to reduce the temporal distribution of energy-causing transitions. We evaluate our candidate schemes according to these characteristics next.

Inverted signaling

Our optimization uses static inverted signaling (inv) as a candidate mode, i.e., the ILP is used to decide whether data on a bit line is always to be sent in inverted form, depending on the value characteristics that we obtain for that bit from our training set. For any bit, this mode will be chosen if the amount of self and inter-wire coupling activity it causes with its neighboring wires is less than that for the original mode of transmission. Signaling a bit line with inv does not reduce the self switching activity and alters the temporal distribution of energy-dissipating transitions only slightly, but it can potentially reduce coupling transitions significantly. For example, a two-bit stream can be made completely toggle-free by inverting one of the bits and keeping the other in original mode; this can save a significant amount of energy, since toggles dissipate more energy than charge/discharge and self transitions.

Transition signaling

This signaling mode (trs) and its dual (itr) affect all three characteristics listed earlier. For bit streams that are highly changing, this mode can reduce self switching activity significantly and also reduce coupling transitions with a neighboring org- or inv-signaled line, since every toggle transition is converted to a lower-energy-dissipating charge/discharge transition.
It also reduces the temporal distribution of energy-dissipating transitions by converting a highly-changing pattern into a run of ones or zeros.

Markov model signaling

In this candidate signaling technique, we use a small amount of hardware at the sending and receiving ends for only the bits that are selected to be signaled using this scheme. To our knowledge, this work is the first to use Markov model signaling (mm) to reduce bus energy dissipation in a value-aware framework. For bits chosen to be signaled using mm, we maintain a history of the $k$ previous bits from the original data stream that was to be transmitted. These $k$ bits define the current state of the Markov model, which is maintained at both the sending and receiving ends. At both ends, the encoding/decoding logic uses this current state to predict the next bit to be sent on the bus. At the sending end, if this prediction matches the actual bit value to be sent, the bus line is held at its current value. Otherwise, we signal a transition on the bus line, which indicates a misprediction to the receiver. The receiver can retrieve the actual data by sampling the state of the bus lines (transition or no transition) at the end of the clock cycle, since it also has information on the current state of each bus line.

The key to an efficient implementation of this signaling scheme is the design of the encoding logic. We analyze SimPoint samples of our 13 training benchmarks to build a prediction table. A portion of the 4-bit/16-state prediction table, for bus lines 0 to 7 of the data bus, is shown in Figure 5.1(a). This can be translated into hardware using standard logic synthesis tools. As an example, we show in Figure 5.1(b) the logic circuits required for implementing the prediction table for bits 0 through 7, obtained by logic minimization using the Espresso tool [113]. These circuits have at most two levels of logic, and hence the hardware overheads they impose will be negligible.
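The five signaling modes described above can be sketched for a single bit line as follows. This is a minimal illustration, not the hardware design: the function names are hypothetical, the previous value, line level, and Markov history are all assumed to start at 0, and the prediction table is assumed to be built by a simple majority vote over a training stream (the text does not specify how the table is derived).

```python
from collections import Counter


def encode_simple(bits, mode, prev=0):
    """Encode one bit stream with org, inv, trs (XOR) or itr (XNOR) signaling."""
    out = []
    for b in bits:
        if mode == "org":
            out.append(b)
        elif mode == "inv":
            out.append(1 - b)
        elif mode == "trs":
            out.append(prev ^ b)        # XOR of previous and current original bit
        elif mode == "itr":
            out.append(1 - (prev ^ b))  # XNOR of previous and current original bit
        else:
            raise ValueError(f"unknown mode: {mode}")
        prev = b
    return out


def self_transitions(line, start=0):
    """Count value changes on a bus line that initially holds `start`."""
    count, prev = 0, start
    for v in line:
        count += int(v != prev)
        prev = v
    return count


def build_table(training_bits, k):
    """Assumed majority-vote next-bit predictor keyed by the last k bits."""
    votes, state = {}, (0,) * k
    for b in training_bits:
        votes.setdefault(state, Counter())[b] += 1
        state = state[1:] + (b,)
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def mm_encode(bits, table, k, line=0):
    """Hold the line on a correct prediction, toggle it on a misprediction."""
    state, out = (0,) * k, []
    for b in bits:
        if table.get(state, 0) != b:
            line ^= 1
        out.append(line)
        state = state[1:] + (b,)
    return out


def mm_decode(line_values, table, k, line=0):
    """Recover the original bits from the observed line transitions."""
    state, out = (0,) * k, []
    for v in line_values:
        pred = table.get(state, 0)
        b = pred if v == line else 1 - pred
        line = v
        out.append(b)
        state = state[1:] + (b,)
    return out


if __name__ == "__main__":
    stream = [0, 1] * 8  # a highly-changing pattern
    table = build_table(stream, k=2)
    for mode in ("org", "inv", "trs", "itr"):
        print(mode, self_transitions(encode_simple(stream, mode)))
    print("mm", self_transitions(mm_encode(stream, table, 2)))
```

Because the encoder and decoder update the same history with the same original bits, the decoder reproduces the stream exactly as long as both ends share the table and the initial state.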
[Figure 5.1: (a) next-bit prediction table indexed by the 4-bit current state ($s_3 s_2 s_1 s_0$) for bus lines 0 to 7; (b) Markov model signaling logic for bit 0 and for bit 7, where $s^x_y$ denotes bit $y$ of the current state for the $x$-th bit line.]

Figure 5.1. Markov model-based signaling technique. (a) A 4-bit prediction table for the Markov model for bits 0-7 of the data bus, obtained by analyzing training set benchmarks. Depending on which bits are selected for Markov model signaling, the corresponding row of the table can be translated to hardware using logic minimization tools. (b) Examples of the sending-end hardware that would be required for two bits (0 and 7), assuming these are chosen to be signaled using the mm scheme. As can be seen, the logic overhead required for mm signaling is minimal.

We tested Markov model based prediction schemes of varying depth, from 1-bit (2 states) to 10-bit (1024 states), for their prediction accuracy. As expected, the prediction accuracy improved as the depth of the model increased. However, we found that beyond a depth of 4 (2) for data (instruction) buses, the rate of improvement in prediction accuracy dropped significantly. Hence, we chose the 4-bit Markov model for the data bus and the 2-bit Markov model for the instruction bus.

Henceforth, in this paper, we shall denote the candidate signaling schemes using the subscript numbers 0 through 4 instead of org, inv, trs, itr, and mm, respectively. Let $q$ represent the number of candidate signaling schemes; $q = 5$ in this work. Our ILP formulations use energy/cost matrices or vectors whose individual elements we represent as $e_{l,m}[i][j]$, $\{l, m\} \in \{0, \ldots, 4\}$, $\{i, j\} \in \{0, \ldots, n\}$.
For example, $e_{0,1}[i][j]$ represents the energy dissipated between bits $i$ and $j$ when they are placed next to each other on the bus and wire $i$ is signaled using the org scheme and wire $j$ using the inv scheme. Since there are five signaling schemes, we have a total of 25 energy/cost matrices collected for the training set benchmarks and/or simulation sample, depending on the scenario that we consider. Note that all energy/cost matrices are $(n + 1) \times (n + 1)$ matrices because we treat the two shield wires as one node, which we call a dummy node. The solutions to our ILPs for MEBO and SBOS are obtained as Hamiltonian cycles, and we use the location of the dummy node to break the cycle into a linear bit order. However, the dummy node is not used in the ILP formulation for MES. The ILP formulations using these notations are discussed next.

5.4.2 Minimum Energy Signaling (MES)

In minimum energy signaling (MES) optimization, we seek to find a static signaling scheme for each bit line of the bus, from among the five possible schemes discussed in Section 5.4.1, with the goal of minimizing the total self and coupling energy dissipated. In the ILP formulation, for each adjacent bit pair $(i, i+1)$, we associate 25 binary variables $y_{l,m}[i]$, $\{l, m\} \in \{0, \ldots, q-1\}$, representing all combinations of signaling two bits using five schemes. Thus, the binary variable $y_{0,0}[i] = 1$ if both the $i$-th and $(i+1)$-th bits are to be signaled using the original mode (i.e., the bits are transmitted as in the original traffic). Otherwise, $y_{0,0}[i] = 0$.
The formulation of MES in terms of the $y$ variables is given next:

\[ \text{Minimize} \quad \sum_{i=0}^{n} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} e_{l,m}[i] \cdot y_{l,m}[i] \]

subject to:

\[ y_{l,m}[i] \in \{0, 1\}, \quad \forall\, \{l, m\} \in \{0, \ldots, q-1\},\ \forall\, i \tag{5.2} \]

\[ \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} y_{l,m}[i] = 1, \quad \forall\, i \tag{5.3} \]

\[ \sum_{l=0}^{q-1} y_{l,m}[i] = \sum_{l=0}^{q-1} y_{m,l}[i+1], \quad \forall\, m,\ \forall\, i \tag{5.4} \]

Constraint 5.2 ensures that the variables take only binary values, Constraint 5.3 ensures that there is only one unique signaling scheme associated with each wire pair, and Constraint 5.4 ensures that the signaling schemes chosen for adjacent wire pairs are consistent. Solving this ILP yields an optimal (minimum energy) signaling scheme for the bus.

5.4.3 Minimum Energy Bit Ordering (MEBO)

In contrast to MES, the next technique, minimum energy bit ordering (MEBO), seeks to minimize inter-wire coupling energy by reordering the bits. Thus, in MEBO, all bits are signaled using the original mode. It is formulated as an instance of the traveling salesman problem (TSP), which is one of the most widely studied combinatorial optimization problems. Simply stated, in the TSP, a salesman needs to visit $n$ cities, visiting each exactly once, and return to the starting city with the minimum total trip cost. In graph theory terminology, MEBO is expressed as follows: consider a complete digraph $G = (V, A)$, where $V = \{1, \ldots, n+1\}$ is the vertex set that represents the $n + 1$ bits including the dummy node, $A = \{(i, j) : i, j \in V\}$ is the arc set, and $e_{0,0}[i][j]$ is the energy/cost associated with arc $(i, j)$, i.e., the total energy dissipated if bit $j$ is placed as the right-adjacent neighbor of bit $i$ on the bus, with $e_{0,0}[i][i] = \infty,\ \forall\, i$. Note that we use only $e_{0,0}[i][j]$ in MEBO since all bits are signaled using the original mode only. The problem is to find a minimum energy cycle that includes every node in the graph exactly once, i.e., to find the minimum weight Hamiltonian cycle in $G$. The MEBO formulation has one binary variable associated with each arc of $G$, represented by $x[i][j]$.
In the solution, $x[i][j] = 1$ if bits $i$ and $j$ are to be placed next to each other on the bus, with bit $j$ as the right-adjacent neighbor of $i$, and $x[i][j] = 0$ if $i$ and $j$ are not to be placed next to each other. The ILP formulation in terms of the variables $x[i][j]$ is given next:

Step 1: Minimize $\sum_{(i,j) \in A} e_{0,0}[i][j] \cdot x[i][j]$ subject to:

\[ x[i][j] \in \{0, 1\}, \quad \forall\, i, j \in V \tag{5.5} \]

\[ \sum_{j \in V} x[i][j] = 1,\ \forall\, i \in V \quad \text{and} \quad \sum_{i \in V} x[i][j] = 1,\ \forall\, j \in V \tag{5.6} \]

Step 2: Solve the ILP to get the solution.

Step 3: Check if the solution has subtours. If none, go to Step 6. Otherwise, let there be $t$ subtours: $S = \{S_0(n_0), S_1(n_1), \ldots, S_t(n_t)\}$, where $S_k(n_k)$ means that subtour $S_k$ has length $n_k$.

Step 4: Add the subtour elimination constraint:

\[ \sum_{(i,j)\ \text{arcs in}\ S_k(n_k)} x[i][j] < n_k, \quad \forall\, S_k \in S \tag{5.7} \]

Step 5: Go to Step 2.

Step 6: The desired solution (Hamiltonian cycle) has been obtained. Stop.

In the procedure described above, Constraint 5.5 ensures that the variables take only binary values. Constraint 5.6 ensures that the in- and out-degrees of every vertex are one, i.e., every bit occurs exactly once in the ordering. Eliminating all possible subtours at the beginning would increase the number of constraints substantially and may lead to a huge time overhead when solving the problem. Hence, we adopt an iterative approach to solve the problem in a shorter time. First, we solve the problem with constraints eliminating all possible subtours of two nodes only. Then, we search the solution for the presence of subtours, and if any are found, we add constraints to eliminate those subtours and then re-solve. We found that almost all problems converge to a feasible solution (i.e., a Hamiltonian cycle) within a few hundred iterations using this iterative method, and in a matter of minutes (see Table 5.3).

5.4.4 Simultaneous Bit Ordering and Signaling (SBOS)

In simultaneous bit ordering and signaling (SBOS), we seek to combine the MES and MEBO optimizations described above.
Thus, for each bit, the best signaling scheme (one of the five schemes listed in Section 5.4.1) and the appropriate position of the bit on the bus lines are to be determined simultaneously. Note that combining MES and MEBO does not mean that the energy reductions with SBOS (the combined technique) will be exactly equal to the sum of the savings obtained separately with MES and MEBO. In fact, the motivation for combining these problems is to enable the optimizer to select the optimal solution from a richer set of possibilities. Thus, we can view the problem as similar to MEBO but consisting of $n + 1$ supernodes corresponding to the $n$ bits of the bus and the dummy wire. A supernode contains five nodes, each representing a signaling scheme choice for a bit. By adding constraints that ensure that only one of these nodes is selected for each supernode and that the incoming and outgoing nodes for each supernode are the same, the ILP for SBOS is formulated as described next:

\[ \text{Minimize} \quad \sum_{(i,j) \in A} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} e_{l,m}[i][j] \cdot x_{l,m}[i][j] \]

subject to:

\[ x_{l,m}[i][j] \in \{0, 1\}, \quad \forall\, \{i, j\} \in V \tag{5.8} \]

\[ \sum_{j \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[i][j] = 1, \quad \forall\, i \in V \tag{5.9} \]

\[ \sum_{k \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[k][i] = 1, \quad \forall\, i \in V \tag{5.10} \]

\[ \sum_{i \in V} \sum_{l=0}^{q-1} x_{l,m}[i][j] = \sum_{k \in V} \sum_{l=0}^{q-1} x_{m,l}[j][k], \quad \forall\, j \in V,\ \forall\, m \tag{5.11} \]

Constraint 5.8 ensures that all $x_{l,m}[i][j]$ take only binary values. Constraints 5.9 and 5.10 ensure that there is exactly one outgoing and one incoming node selected, respectively, for each of the $n + 1$ supernodes. Constraint 5.11 ensures that the optimal tour enters and exits through the same node in a supernode (i.e., the signaling schemes chosen for adjacent pairs of bits in the final ordering are consistent). The costs $e_{l,m}[i][i],\ \forall\, i \in V$, are set to $\infty$ (a very large integer value). Finally, in SBOS too, constraints for eliminating all subtours with two nodes are added initially, and the problem is solved iteratively as described earlier in Section 5.4.3 until a Hamiltonian cycle that visits all supernodes exactly once is found.
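The subtour check used in the iterative MEBO and SBOS procedures, and the final step of cutting the Hamiltonian cycle at the dummy node, can be sketched as follows. This is a minimal illustration that omits the ILP solve itself; it assumes, hypothetically, that a candidate solution has already been read back as a successor map `succ`, where `succ[i] = j` means the arc from `i` to `j` was selected, and all function names are illustrative.

```python
def find_cycles(succ):
    """Split a degree-1 solution (every node has one successor) into cycles."""
    seen, cycles = set(), []
    for start in succ:
        if start in seen:
            continue
        cycle, node = [], start
        while node not in seen:
            seen.add(node)
            cycle.append(node)
            node = succ[node]
        cycles.append(cycle)
    return cycles


def subtour_cuts(cycles, n_nodes):
    """For each subtour of length n_k, return (arcs, n_k - 1): the selected
    arc variables on that subtour must sum to at most n_k - 1."""
    cuts = []
    for cyc in cycles:
        if len(cyc) < n_nodes:
            arcs = [(cyc[t], cyc[(t + 1) % len(cyc)]) for t in range(len(cyc))]
            cuts.append((arcs, len(cyc) - 1))
    return cuts


def linear_order(succ, dummy):
    """Break a Hamiltonian cycle at the dummy (shield) node into a bit order."""
    order, node = [], succ[dummy]
    while node != dummy:
        order.append(node)
        node = succ[node]
    return order


if __name__ == "__main__":
    # A candidate with two subtours: 0 -> 1 -> 0 and 2 -> 3 -> 4 -> 2.
    candidate = {0: 1, 1: 0, 2: 3, 3: 4, 4: 2}
    print(subtour_cuts(find_cycles(candidate), len(candidate)))
    # A Hamiltonian cycle; node 0 plays the role of the dummy node.
    tour = {0: 2, 2: 1, 1: 3, 3: 0}
    print(linear_order(tour, dummy=0))
```

In an actual flow, each returned cut would be added to the ILP as a new constraint before re-solving, and the loop would stop once `find_cycles` returns a single cycle covering every node.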
5.4.5 Thermal Optimization Methodology

As described earlier, two adjacent high-activity wires are likely to cause a hot spot on the bus due to intra-layer heat transfer or thermal coupling between the wires. The peak temperature on the bus occurs at such hot spots. In the energy-optimal bit orderings obtained using MEBO or SBOS, a special class of constraints called thermal constraints can be added to prevent high-activity wires from being placed next to each other. Similarly, in MES, signaling schemes can be chosen to prevent hot spots in a cluster of wires by adding such constraints. Note that although adding thermal constraints may decrease the energy saving potential of the energy-optimal bit ordering to some extent, it gives a designer the flexibility to effect a trade-off between optimizing energy and reducing peak wire temperatures. We use the steady-state model described earlier in Section 3.4.3 to determine, approximately, the thermal impact of various orderings and prune thermally-inefficient orderings by adding these constraints. We do this because it is virtually impossible to perform detailed thermal simulations, using the model and methodology described in Section 3.4.2, for every candidate solution that we encounter during MEBO/SBOS optimization and then select the thermally-superior solution. Using the steady-state model, the procedure to effect a trade-off between energy and temperature reductions is discussed next.

Steps for thermal optimization

The switching activities of buses vary widely across bits due to the characteristics of the data carried on them. Hence, the solution space of energy-efficient bit orderings found using MEBO and SBOS also contains bit orderings in which the wire temperatures are reduced. The steps listed next enable us to find these thermally-superior orderings without affecting the energy optimality by much.
Note that all temperature estimates used in the steps listed below are from the steady-state model.

1. Find the energy dissipated $E_{orig}$ and peak wire temperature $T_{peak-orig}$ of the unmodified bus.

2. Find the energy-optimal bit ordering and/or signaling without any temperature constraints using MEBO/SBOS. Let the total energy dissipated in the bus with this (energy-optimal) ordering/signaling be $E_{opt}$ and the ordering/signaling be represented by $\beta_0$. Let $T_{peak-opt}$ represent the peak wire temperature corresponding to the permutation $\beta_0$, obtained using the steady state model.

3. Next, we aim to reduce the peak wire temperature by a fixed fraction (say $\eta$) from its original value in a step-by-step manner. Our target peak wire temperature in the pth step is $T_p = (1 - p \cdot \eta) \cdot T_{peak-opt}$, where p = 1, 2, ..., etc. To find a permutation that achieves this peak temperature, we eliminate arcs to/from bit pairs (i, k) for any wire j that has $T(j) \geq (1 - p \cdot \eta) \cdot T_{peak-opt}$. Such a constraint takes the following form in the ILP:

$$x[i][j] + x[j][k] \leq 1 \;\text{ and }\; x[k][j] + x[j][i] \leq 1, \quad \forall j \ni T(j) \geq (1 - p \cdot \eta) \cdot T_{peak-opt} \tag{5.12}$$

Adding this set of constraints and solving the ILP, we obtain a wire permutation $\beta_p$ that has a peak temperature of $T_p \leq (1 - \eta) \times T_{peak-opt}$. Note that since $T_p$ is estimated using the steady state model after obtaining the wire permutation, it can be less than the target temperature. Further, the energy dissipated by this permuted bus, $E_p$, will be somewhat worse than $E_{opt}$.

The iterative process of adding the thermal constraints and re-solving continues until one of two conditions occurs: (1) the ILP becomes infeasible to solve, or (2) the energy of the bit ordering/permutation $E_p$ becomes worse than that of the original bus ($E_p > E_{orig}$). Figure 5.2 shows a sample temperature vs. energy trade-off curve that will be obtained by following the steps listed above.
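The iteration in the steps above can be sketched as a short driver loop. This is a minimal illustration under stated assumptions, not the CPLEX automation used in this work: `solve_with_target` is a hypothetical callback that stands in for "add the thermal constraints of Eq. 5.12 for this target and re-solve the ILP", returning either `None` (infeasible) or an (energy, peak temperature) pair, and `toy_solver` is an invented stand-in with made-up numbers.

```python
def thermal_tradeoff(solve_with_target, e_orig, t_peak_opt, eta=0.05):
    """Collect (target, energy, peak_temp) points of the trade-off curve.

    Stops when the solver reports infeasibility or when the energy of the
    constrained solution exceeds that of the unmodified bus (e_orig).
    """
    points = []
    p = 1
    while p * eta < 1.0:
        target = (1.0 - p * eta) * t_peak_opt  # target for the p-th step
        result = solve_with_target(target)
        if result is None:  # condition (1): ILP infeasible
            break
        energy, peak_temp = result
        if energy > e_orig:  # condition (2): worse than the original bus
            break
        points.append((target, energy, peak_temp))
        p += 1
    return points


def toy_solver(target, t_peak_opt=100.0, e_opt=0.80):
    """Hypothetical solver: infeasible below 72, energy grows as target drops."""
    if target < 72.0:
        return None
    energy = e_opt + 0.004 * (t_peak_opt - target)
    return energy, target - 0.5  # achieved peak may be below the target


curve = thermal_tradeoff(toy_solver, e_orig=1.0, t_peak_opt=100.0)
for target, energy, peak in curve:
    print(round(target, 2), round(energy, 3), round(peak, 2))
```

Keeping the solver behind a callback separates the stopping rules, which come directly from the steps above, from the ILP machinery, which is specific to the optimizer in use.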
The curve shows points $(T_1, E_1), (T_2, E_2), \ldots, (T_{10}, E_{10})$, corresponding to target temperatures $0.95 \times T_{peak-opt}$, $0.90 \times T_{peak-opt}$, ..., $0.50 \times T_{peak-opt}$.

[Figure 5.2: plot of peak wire temperature versus bus energy, marking the unmodified bus at $T_{peak-orig}$, the energy-optimal bus at $T_{peak-opt}$, and the target temperatures $0.95\,T_{peak-opt}$, $0.9\,T_{peak-opt}$, ..., $0.5\,T_{peak-opt}$.]

Figure 5.2. Sample peak wire temperature versus bus energy trade-off curve. The thermal optimization steps can be used to obtain curves similar to the one shown here.

The thermal constraint presented in Eq. 5.12 allows only one arc, among x[i][j], x[j][k], x[j][i], and x[k][j], to be present in the solution if the presence of both bits i and k as neighbors causes the temperature in bit j to equal or exceed the target temperature. In the CPLEX ILP optimizer, the inclusion of thermal constraints using the methodology outlined above can be fully automated. In our experiments, we used $\eta = 0.05$ and succeeded in reducing peak wire temperatures significantly across several benchmarks, as shown by the results in Section 5.5.5. Furthermore, the extra time taken for temperature optimization did not increase the overall solution time significantly compared to energy-only optimization. The running times are compared later in Section 5.5.

5.4.6 Routing Overheads

In this subsection, we analyze the overheads for the wire ordering network required to implement MEBO and SBOS. We draw from previous work on efficient techniques for solving the crossing distribution problem [114-116] and use these principles to estimate the area/cost of any ordering network. Consider two rows, called the lower and upper rows (see Figure 5.3(a)), of points called terminals and a collection of two-terminal nets $N = \{N_1, N_2, \ldots, N_n\}$, with each net $N_k$ connecting the terminal numbered k on the lower row to the corresponding numbered terminal on the upper row.
The terminals in the lower row are numbered in order as 1, 2, ..., n from left to right. The left-to-right ordering on the upper row defines the final re-ordered bus. Let this new ordering be represented by $\Pi = (\pi_1, \pi_2, \ldots, \pi_n)$, $1 \leq k \leq n$. For example, for the figure shown, $\pi_1 = 5, \pi_2 = 3, \ldots, \pi_8 = 7$.

DEFINITION: Two nets $N_i$ and $N_j$ are defined as crossing if i > j and $\Pi(i) < \Pi(j)$, or vice versa. Else, they are non-crossing.

DEFINITION: A matching diagram is a straight line drawing of the nets for a given permutation $\Pi$, as shown in Figure 5.3(b), and the straight line representing a net $N_i$ is called a chord. The intersection of two chords $N_i$ and $N_j$ defines a crossing point $C_{ij}$. There are ten crossing points shown in Figure 5.3(b).

[Figure 5.3: (a) routing channel between a lower row of terminals numbered 1 to 8 and an upper row ordered 5 3 6 1 4 8 2 7, with channel height and channel width marked; (b) matching diagram; (c) two-layer routing drawn with Metal-1, Metal-2, and vias.]

Figure 5.3. Routing strategy and overheads for re-ordering. (a) Definition of the routing channel. (b) Matching diagram showing ten crossing points. (c) Two-layer routing strategy using eight horizontal tracks and ten vias.

The notion of inversions can be used to calculate the minimal total number of crossing points $\xi$ for any given $\Pi$ in the upper row [116]. An inversion is any pair $(\pi_i, \pi_j)$ such that i < j and $\pi_i > \pi_j$ [117]. Accordingly, $\xi = 10$ can be calculated for the example. The total number of crossing points $\xi$ determines the area/cost overhead of the sorting network in two ways. Intuitively, the number of horizontal wiring tracks in the channel will not exceed $\xi$, since each crossing point can be taken care of by assigning it a separate track and by using a two-layer wiring strategy. Also, the total number of vias required will not exceed $2\xi$ in the worst case. However, in practice, the number of horizontal tracks and vias required will be less than $\xi$ and $2\xi$, respectively.
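The inversion count that bounds the routing overhead can be computed directly from the upper-row ordering. The sketch below uses a simple quadratic pair scan, and the four-net permutation is a made-up illustration rather than the ordering of Figure 5.3; `count_inversions` is a name introduced here, not one from this work.

```python
def count_inversions(upper_row):
    """Count pairs (i, j) with i < j and upper_row[i] > upper_row[j]."""
    n = len(upper_row)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if upper_row[i] > upper_row[j]
    )


# Hypothetical 4-net ordering: the inversions are (3,1), (3,2), and (4,2).
example = (3, 1, 4, 2)
xi = count_inversions(example)
print("crossing points:", xi)
print("upper bound on tracks:", xi, "upper bound on vias:", 2 * xi)
```

For the bus widths considered here (64 or 128 lines) the quadratic scan is more than fast enough; a merge-sort based count would only matter for much larger permutations.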
Figure 5.3(c) shows that the routing for this example can be achieved using two metal layers, eight horizontal tracks, and ten vias. Hence the number of crossing points $\xi$, which is the number of inversions of the MEBO/SBOS order that we obtain, can be used as a metric to select the re-ordering solution with the best energy-cost tradeoff.

5.5 Results and Discussion

In this section, we present results for energy and wire temperature reductions obtained using our optimal static encoding schemes. In all results, percentage energy reductions are reported with respect to the energy dissipated in an unmodified bus. Table 5.3 lists the running times and number of iterations for problems of different sizes that we solved using CPLEX on a SunFire-880 server with two 750-MHz UltraSparc-III CPUs and 8 GB of RAM. The running times for MES optimization were negligible compared to those of MEBO and SBOS and hence are not shown. As can be seen, these problems can be solved to optimality in a reasonable amount of time.

[Table 5.3: running times and number of iterations for MEBO and SBOS problems of different sizes.]

5.5.1 Energy Dissipation in Processor Buses

We profiled 100M SimPoint samples of all benchmarks in the SPEC CPU2000 suite and recorded their self and coupling activity characteristics. The results of our analysis are shown in Figures 5.4 and 5.5. For the data bus, we observed that the transition density per bit did not exceed 0.45 for any benchmark. As expected, the higher order bits (32-63) of the data bus exhibited significantly lower switching activities in integer programs compared to floating-point programs, due to small values being predominant in integer traffic.
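As an illustration of the per-bit statistic discussed above, the sketch below computes a transition density for each bit line from a sequence of bus words. It assumes that transition density means the fraction of consecutive word pairs in which the bit toggles; the function name and the two-bit trace are hypothetical and are not taken from the profiling infrastructure used in this work.

```python
def transition_density(words, width):
    """Return, for each bit line, the fraction of cycles in which it toggles."""
    if len(words) < 2:
        return [0.0] * width
    toggles = [0] * width
    for prev, curr in zip(words, words[1:]):
        diff = prev ^ curr  # set bits mark the lines that changed value
        for bit in range(width):
            toggles[bit] += (diff >> bit) & 1
    cycles = len(words) - 1
    return [count / cycles for count in toggles]


# Hypothetical 2-bit trace: bit 0 toggles every cycle, bit 1 toggles once.
trace = [0b00, 0b01, 0b00, 0b11]
density = transition_density(trace, 2)
print(density)
```

A trace dominated by small integer values leaves the high-order bits mostly unchanged from word to word, which is why those lines show low densities for integer programs.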
For instruction buses, switching activities were spread more or less equally across the higher and lower order portions and, here too, the transition density did not exceed 0.5 for any benchmark, with the exception of vpr, which caused transition densities in the range 0.5-0.8 in a few bit lines.

[Figures 5.4 and 5.5: transition density versus bit position for SPEC CPU2000 benchmarks on the data and instruction buses.]

Next, we present results showing the ratio of self, coupling charge/discharge, and coupling toggle energy dissipated for four kinds of buses: data address, instruction address, data, and instruction. To our knowledge, no previous work has profiled such an extensive set of benchmarks and reported their energy dissipation behavior. Such results help designers quantify the important contributors to bus energy dissipation, like self, charge/discharge, or toggle transitions, and explore appropriate static, dynamic, or hybrid encoding techniques to reduce energy dissipation. Figures 5.6-5.9 show the fraction of energy dissipated in self, charge/discharge, and toggle transitions for various benchmarks from the SPEC CPU2000 suites on the Alpha 21264 target systems. As can be seen, coupling (charge/discharge + toggle) energy forms a substantial portion of the total bus energy dissipation: it contributed 70-75% in the processor buses we analyzed. Among coupling transitions, charge/discharge transitions dominate. Toggle transitions are responsible for less than 20% of total energy; in data buses, they are responsible for 10% or less. We observed no significant difference between integer programs, shown in the first 14 bars in Figures 5.6-5.9, and floating-point programs in the SPEC workload. Next, we present results for energy reductions obtained with our static encoding schemes.

[Figures 5.6-5.9: fraction of bus energy dissipated in self, charge/discharge, and toggle transitions for SPEC CPU2000 benchmarks on the Alpha 21264 target system, one figure per bus type.]

5.5.2 Energy Reduction for General-Purpose Design

For the general-purpose design scenario, our static bus encoding schemes were designed using data collected from SimPoint samples for the training benchmarks and then evaluated on test benchmarks. Results are shown in Figures 5.10 and 5.11. They show that the average bus energy reductions obtained are as follows. MES: 7.81% and 10.96%, MEBO: 11.91% and 19.85%, and SBOS: 20.04% and 38.78% for data and instruction buses, respectively. On average, we find that optimizations on the instruction bus yield better results than on the data bus. We also observe that SBOS is easily the best scheme for both data and instruction buses.

[Figure 5.10: bar chart titled "General-Purpose Design: Energy Reductions for Data Bus", showing percentage energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.10. Energy dissipation results for general-purpose design for the 64-bit data bus. Statistics collected on 13 training set benchmarks were used to obtain the optimal static encoding schemes. These were tested on 13 other (test set) benchmarks. Average energy reductions are MES: 7.81%, MEBO: 11.91%, and SBOS: 20.04%.

[Figure 5.11: bar chart titled "General-Purpose Design: Energy Reductions for Instruction Bus", showing percentage energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.11. Energy dissipation results for general-purpose design for the instruction bus. Average energy reductions are MES: 10.96%, MEBO: 19.85%, and SBOS: 38.78%.

5.5.3 Energy Reduction for Workload-Specific Design

To evaluate the effectiveness of our techniques in the workload-specific design scenario, statistics collected for SimPoint samples from 13 training set benchmarks were aggregated and used to obtain the optimal static encoding schemes. The scheme was then tested on non-overlapping samples from the same set of benchmarks. This non-overlapping sample was arbitrarily selected as a block of 100M committed instructions after the first 10 billion instructions of program execution. From the results shown in Figures 5.12 and 5.13, we observe that the average energy reductions across the benchmarks for the three schemes are as follows. MES: 9.73% and 10.43%; MEBO: 15.97% and 21.25%; and SBOS: 22.79% and 40.77% for data and instruction buses, respectively. Our results indicate that workload-specific energy optimizations on the instruction bus are likely to yield better results than on the data bus. Among the three schemes we proposed, SBOS gives the best results. This is expected because it combines the benefits of signaling as well as bit ordering. Table 5.4 shows the actual bit ordering and signaling for the data bus that was obtained using the training set. The corresponding table for the instruction bus is not shown due to space constraints. For both data and instruction buses, all five signaling schemes were chosen. In particular, the original mode of signaling was retained for 36 (38) lines, inversion was chosen for 12 (45) lines, and Markov model signaling for 11 (40) lines in the data (instruction) bus.
Transition and inverted transition signaling were chosen for relatively few wires: a total of 5 (5) nodes in the data (instruction) bus.

5.5.4 Energy Reduction for Program-Specific Design

In program-specific design, coupling energy/cost matrices collected for the SimPoint samples of each benchmark are used to design a signaling/encoding scheme, which is then tested on the same benchmark and sample. This is expected to yield the best results, as the static encoding schemes are specific to that sample and benchmark. Results for [...]

[Table 5.4: bit ordering and signaling scheme obtained for the data bus using the training set.]

[Figure 5.13: bar chart of percentage energy reduction per benchmark for the instruction bus.]

Figure 5.13. Energy dissipation results for workload-specific design for the 128-bit instruction bus. The average energy reductions are MES: 10.43%, MEBO: 21.25%, and SBOS: 40.77%.

[...] because it does not take into account both self and coupling activities when deciding on the inversion mode. As a result, self switching activities increase significantly in the encoded data stream, since the mode chosen to reduce coupling energy does not necessarily reduce total (self + coupling) energy. The switching activity in the instruction stream is coupling dominant. Hence OEBI performs better on this type of data. However, the energy reductions are only marginally better compared to BI. Our static encoding schemes, which optimize for both self and coupling energy by considering signaling and reordering, show much better energy reductions than previous dynamic encoding schemes for all benchmarks.
The average energy reductions are MES: 19.7% and 21.7%, MEBO: 23.25% and 32.1%, and SBOS: 30.2% and 52.1% for data and instruction buses, respectively.

[Figures 5.14 and 5.15: percentage bus energy reduction per benchmark for program-specific design on the data and instruction buses.]

[Table 5.5: peak wire temperatures and energy reductions for the original bus, SBOS without thermal constraints, and SBOS with thermal constraints, for data and instruction buses.]

5.5.5 Wire Temperature Reduction

Our work is the first of its kind to design static encoding schemes that seek to reduce peak wire temperatures in addition to reducing bus energy. The thermal optimization methodology was explained earlier in Section 5.4.5, and the thermal models used to estimate activity-dependent wire temperatures were described in Sections 3.4.2 and 3.4.3. Table 5.5 shows the reductions in peak temperature that we obtained for different benchmarks with and without the thermal optimization methodology. In this table, we show the peak wire temperature observed for the unoptimized (original) bus and the wire temperatures after SBOS with thermal constraints was applied.
We show results for temperature-optimized SBOS only, since the best results were obtained using this technique; temperature reductions for MEBO were consistently lower. This is expected because the SBOS optimization technique has a larger solution space from which it can choose the best solution. From Table 5.5, we note that applying SBOS without thermal constraints, which reduces the energy of the bus by 20% or more for data buses (Figure 5.13), does not always reduce the peak wire temperature observed in the simulation window. In fact, for the data bus, the average peak temperature across the ten benchmarks studied actually rises slightly above that of the original bus, by 0.35°C, and for the instruction bus it falls only slightly, by 0.49°C, which is small considering the significant energy reductions we obtained for these buses. This can be attributed to the fact that the energy optimization does not explicitly consider thermal coupling when deciding on the bit ordering and signaling. However, by adding explicit thermal constraints using the methodology in Section 5.4.5, the temperature of the hottest wire can be reduced. Recall that our thermal optimization methodology trades off some of the energy savings for more thermally-efficient orderings at each step. The steady-state temperature vs. energy tradeoff curves for nine benchmark programs are shown in Figures 5.16-5.20.

[Figures 5.16-5.20: two panels per figure, each titled "Trade-Off Curve for" followed by the benchmark name; horizontal axis: Normalized Energy; vertical axis: Temperature. Each panel marks the original bus, the energy-optimal bus, and the selected permutation.]

Figure 5.16. Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for ammp and crafty. The permutation selected for each benchmark was the one that resulted in bus energy reduction closest to $0.5(1 - E_{opt}/E_{orig})$ compared to the original bus.

Figure 5.17. Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for eon and gcc.

Figure 5.18. Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for gzip and lucas.

Figure 5.19. Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for mesa and mgrid.

Figure 5.20. Energy vs. temperature trade-off curves. Plots show the energy vs. temperature tradeoff curves obtained for the data bus for swim and twolf.

For each point shown in these graphs, thermal constraints were added and the ILPs were re-solved to get a new wire ordering and permutation.
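The rule used to pick one permutation from each trade-off curve (the point whose energy reduction is closest to $0.5(1 - E_{opt}/E_{orig})$, as stated in the caption of Figure 5.16) can be sketched as follows. The function name and the four curve points are illustrative assumptions, not values read from the figures; energies are taken to be normalized to $E_{orig}$, so the energy-optimal point is the one with the lowest normalized energy.

```python
def select_midpoint(points):
    """Pick the (normalized_energy, temperature) point nearest the midway reduction.

    Energies are normalized to the original bus, so 1.0 - energy is the
    fractional energy reduction of a point.
    """
    e_opt = min(energy for energy, _ in points)  # energy-optimal bus
    target_reduction = 0.5 * (1.0 - e_opt)       # half of the best reduction
    return min(points, key=lambda pt: abs((1.0 - pt[0]) - target_reduction))


# Hypothetical curve: energy-optimal bus first, then thermally constrained points.
curve = [(0.80, 326.0), (0.85, 324.0), (0.90, 322.0), (0.95, 320.0)]
chosen = select_midpoint(curve)
print(chosen)
```

Choosing the midway point is one way to balance the two objectives; a designer with a hard thermal limit could instead pick the lowest-energy point whose temperature meets that limit.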
As can be seen, in all cases ILP infeasibility occurred before the energy of the reordered bus approached $E_{orig}$, and hence the optimization terminated. Using these curves, we selected the wire permutation (marked by the arrow in the plots) that resulted in bus energy reduction closest to $0.5(1 - E_{opt}/E_{orig})$, since this represents the midway point for trading off temperature with energy savings. The peak wire temperature obtained for this selected thermally-efficient permutation is shown in the third row of Table 5.5. Note that the temperatures reported in this row are derived from detailed thermal simulations using the model in Section 3.4 and not the steady state model. Temperature reductions we obtained with temperature-optimized SBOS range from 3.55 to 12.26 degrees for the data bus and from 5.69 to 12.96 degrees for the instruction bus, while still resulting in total energy reductions of 6.59 to 15.23% and 11.67 to 16.17% for data and instruction buses, respectively.

Compared to the dynamic spreading encoding technique proposed in [110], our temperature-optimized SBOS provides much better temperature reductions. We compare results for three benchmarks that are common to their work and ours. The temperature reductions they report for the instruction bus are gzip: 6.5 K, mesa: 6.25 K, and ammp: 4.75 K. Our results shown in Table 5.5 are much better for these benchmarks, gzip: 15.89 K, mesa: 11.67 K, and ammp: 12.29 K. Note that our techniques are static and incur negligible overhead compared to the overheads for the crossbar switch and control logic used in the spreading encoding technique.

5.6 Summary

In this chapter, we presented a value-aware optimization methodology to design static encoding schemes to reduce energy dissipation and temperature of global signal buses.
Our methodology examines two aspects: (1) several possible ways of signaling a bit value, with exactly one signaling mode chosen for each bit, and (2) all possible ways of mapping bits to bus lines (bit ordering or permutation), with exactly one bit ordering chosen. Both choices are made statically at design time, depending upon traffic value characteristics, to minimize total bus dynamic energy. We present an integer linear program (ILP) methodology that evaluates several possible bit signaling modes and all possible bit orderings for an n-bit bus based on traffic value characteristics and then chooses an optimal encoding mode that minimizes total bus (self + coupling) dynamic energy. We use the SimpleScalar/Alpha simulator, profile SimPoint samples of SPEC CPU 2000 benchmarks to collect data, and use the CPLEX ILP optimizer to design our encoding scheme. Results for three degrees of customization show increasingly better average bus energy reduction: general-purpose optimization: 20.04% (38.78%), workload-specific optimization: 22.79% (40.77%), and program-specific optimization: 30.2% (52.1%), for 64-bit data (128-bit instruction) buses, respectively. In contrast, existing dynamic bus encoding techniques yield only 4.19% (5.32%) reductions at best for data (instruction) buses for the same set of programs.

We show that lowering bus energy, even significantly as with our static encoding schemes, does not necessarily lower peak wire temperatures. To address this, we present a novel method of efficiently exploring the trade-off space between peak (hottest) wire temperature and total bus dynamic energy using a steady-state wire temperature model. Based on this, we present a new method of introducing thermal constraints into our energy optimization methodology that allows a designer to trade off peak wire temperature against total bus dynamic energy as desired.
For this thermally-constrained, energy-optimal encoding scheme, we then perform simulations using a detailed per-wire bus thermal model to determine the actual reductions in peak temperature, which we find to be significant, up to 12.26°C (12.96°C) for data (instruction) buses, while at the same time providing significant average energy savings of 14.24% (16.17%) for data (instruction) buses, which are still much better than previous work.

CHAPTER 6

ACTIVITY-AWARE PERFORMANCE OPTIMIZATION

The data-dependent nature of inter-wire crosstalk requires the bus cycle time to be designed for the worst case. This pessimistic approach incurs a significant performance penalty since the worst case arises least frequently in actual applications. In this chapter, we examine an activity-aware technique that substantially reduces the frequency of worst-case crosstalk and improves bus performance by using a variable cycle bus architecture.

6.1 Introduction

Inter-wire capacitive crosstalk is the primary factor that affects the propagation delay of interconnects. In high-performance processor buses, crosstalk on a victim wire depends on the nature of transitions on its two adjacent wires, known as aggressors. Designers estimate the worst-case crosstalk condition for a wire and set the bus clock cycle time greater than the corresponding delay, ensuring that signal transmission occurs correctly. However, this is a pessimistic approach since worst-case crosstalk conditions do not occur across all wires very frequently.

An introduction to interconnect analysis and the impact of crosstalk on bus design was presented earlier in Section 2.1.5.
Table 2.1 listed five different crosstalk conditions based on transitions in the victim and aggressor wires: 1 + 0r (mode-0), 1 + 1r (mode-1), 1 + 2r (mode-2), 1 + 3r (mode-3), and 1 + 4r (mode-4), where the coupling ratio r is the ratio of the adjacent coupling capacitance to the line capacitance including the contribution of repeaters. The coupling ratio is greater than unity for nanometer-scale technologies, as can be seen from Table 2.2.

We address two aspects of the bus crosstalk problem to improve the overall performance of global processor buses in the presence of crosstalk. First, we reduce the frequency of various crosstalk conditions by using a profile-guided wire reordering and signaling approach. Second, we propose a bus clocking approach that eliminates the need to use a pessimistic cycle time. Instead, our approach dynamically controls the number of cycles required for transmission of the data depending on its crosstalk mode. By doing so, we can use the average or most frequent crosstalk pattern to design the cycle time of the bus.

This chapter is organized as follows. Section 6.2 briefly reviews related work. We present our techniques in Section 6.3 and our results in Section 6.4. Finally, we summarize in Section 6.5.

6.2 Related Work

Many crosstalk reduction techniques have been proposed in the literature. These are reviewed briefly next. Several techniques such as dense wire fabrics [56] and net ordering and shield insertion techniques [118,119] have been proposed to reduce crosstalk noise in signal interconnects. The effectiveness of shielding and spacing techniques has also been explored [57].
Many coding techniques to reduce crosstalk have also been proposed, all of which rely on using a significant number of extra wires to eliminate worst-case crosstalk conditions: crosstalk protection code (CPC) [55], transition pattern code (TPC) [120], crosstalk avoidance code (CAC) [121], and the codes proposed in [122]. A technique that uses variable cycle transmission to improve bus performance has also been suggested, but it does not address crosstalk reduction [123].

6.3 Techniques for Performance Optimization

In this section, we describe techniques to optimize bus performance by reducing crosstalk and using a non-pessimistic approach to bus clocking.

6.3.1 Variable Cycle Bus (VCB) Design

We propose an adaptive bus architecture called a variable cycle bus (VCB) that uses a faster bus clock and dynamically controls the number of cycles required for transmission based on the estimated delay of the data pattern to be transmitted. This removes the need to design the bus clock cycle in a pessimistic manner based on the worst-case crosstalk pattern.

The VCB works as follows. The data to be transmitted in the current cycle is compared to the data that was transmitted in the previous cycle, and the crosstalk group that it belongs to is determined. There are two groups: a Group-I data word is one whose crosstalk patterns are at most mode-2 (i.e., only mode-2, mode-1, or mode-0 patterns and none higher), and a Group-II data word is one that has at least one mode-3 or mode-4 pattern. The crosstalk group is determined using the crosstalk analyzer (CA) circuit described next. In the VCB, we transmit Group-I data in one clock cycle and Group-II data in two clock cycles. A DATA_READY line indicates to the receiver when to latch the current value being transmitted on the bus. The DATA_READY control line is completely shielded, i.e., it is routed with VDD/GND lines on each side so that it is completely unaffected by crosstalk.
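The grouping rule described above can be illustrated with a small behavioral sketch. It is not the hardware CA circuit; it assumes the common classification in which each adjacent wire pair contributes 0 (both quiet or switching in the same direction), 1 (exactly one wire switching), or 2 (opposite transitions), and a wire's mode is the sum over its two neighbors. The function names `pair_mode`, `worst_mode`, and `vcb_cycles` are hypothetical and chosen only for illustration.

```python
def bit(word, i):
    """Return bit i of an integer bus word."""
    return (word >> i) & 1


def pair_mode(prev_a, cur_a, prev_b, cur_b):
    """Crosstalk mode (0, 1, or 2) of one adjacent wire pair.

    Assumed classification: the mode is the absolute difference of the
    two transitions, where a transition is -1, 0, or +1.
    """
    return abs((cur_a - prev_a) - (cur_b - prev_b))


def worst_mode(prev, cur, n):
    """Largest per-wire mode (0..4) for an n-bit word transition."""
    worst = 0
    for j in range(n):
        mode = 0
        for k in (j - 1, j + 1):  # left and right neighbors, if present
            if 0 <= k < n:
                mode += pair_mode(bit(prev, j), bit(cur, j),
                                  bit(prev, k), bit(cur, k))
        worst = max(worst, mode)
    return worst


def vcb_cycles(prev, cur, n):
    """Group-II words (mode-3 or mode-4 present) take two cycles, else one."""
    return 2 if worst_mode(prev, cur, n) >= 3 else 1
```

For example, under these assumptions the transition 010 to 101 on a three-bit bus is a mode-4 pattern and would take two cycles, whereas 000 to 111 has no opposing transitions and would take one.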
Inputs              Output
U   W   S0  S1  S2  f
0   0   1   1   1   1
-   -   1   1   0   1
-   -   0   1   1   1

Figure 6.1. Three-bit crosstalk analyzer truth table and circuit. (a) Truth table showing only the ON-set. "-" indicates a don't-care input. (b) Logic circuit implementing the truth table.

Our crosstalk analyzer (CA) circuit identifies the crosstalk mode for each transmission in an efficient manner. It compares the current information, three bits at a time, with the corresponding bits in the pattern transmitted in the previous clock cycle and determines which of the two crosstalk groups the current pattern falls under. The way to determine the crosstalk group for a three-bit case is shown next.

Consider two three-bit vectors, $X^{t-1} = (X_0^{t-1}, X_1^{t-1}, X_2^{t-1})$ representing the data transmitted in the previous cycle and $X^{t} = (X_0^{t}, X_1^{t}, X_2^{t})$ representing the data to be transmitted in the current cycle. At the first level of the CA circuit, the following logic outputs are evaluated in parallel:

$$S_0 = X_0^{t-1} \oplus X_0^{t}, \tag{6.1}$$
$$S_1 = X_1^{t-1} \oplus X_1^{t}, \tag{6.2}$$
$$S_2 = X_2^{t-1} \oplus X_2^{t}, \tag{6.3}$$
$$U = X_0^{t-1} \cdot X_1^{t-1} + X_1^{t-1} \cdot X_2^{t-1}, \text{ and} \tag{6.4}$$
$$W = X_0^{t} \cdot X_1^{t} + X_1^{t} \cdot X_2^{t}. \tag{6.5}$$

Using these signals, the truth table and a gate-level representation of the three-bit CA circuit can be constructed as shown in Figure 6.1. The truth table in Figure 6.1(a) shows only the ON-set of the Boolean function, i.e., the inputs for which the output evaluates to logic "1". The corresponding two-level realization of this table is obtained using Espresso [113]:

$$f = S_0 \cdot \overline{S_1} \cdot S_2 + S_0 \cdot S_1 \cdot \overline{S_2} + \overline{U} \cdot \overline{W} \cdot S_0 \cdot S_2 \tag{6.6}$$
$$\phantom{f} = S_0 \cdot (S_1 \oplus S_2 + \overline{U} \cdot \overline{W} \cdot S_2). \tag{6.7}$$

The CA circuit outputs a logic "1" if the three bits it examined result in a Group-II pattern and a logic "0" if not. Thus, for an n-bit bus there are n − 2 three-bit CA circuits working in parallel to determine the crosstalk group. At the second level, these n − 2 outputs can be combined using the wired-OR logic style, in which the outputs from the three-bit CA circuits are simply connected together, as shown in Figure 6.2(a).
Thus, the final wired-OR output F is high if the output of at least one of the three-bit CA circuits is high. The wired-OR connection is used to simplify the hardware required at the sending end. The signal DATA_READY obtained from the bus crosstalk analyzer synchronizes the sender and receiver. When F = 0, the data can be transmitted in one cycle and hence DATA_READY is taken high. Otherwise, the data is transmitted in two cycles; in this case, DATA_READY is kept low for the first cycle and taken high in the second. The receiver uses a clock signal gated by DATA_READY, which ensures that the data is latched and read correctly.

[Figure 6.2. Variable cycle bus. (a) Complete bus crosstalk analyzer for an n-bit bus. (b) Sender and receiver logic for VCB.]

6.3.2 Minimum Crosstalk Bit Ordering (MCBO)

Our basic technique for profile-guided optimization was discussed earlier in Section 5.4. The objective function that we minimized earlier was the total energy of the bus. In the current problem, we minimize the combined probability of occurrence of the worst-case crosstalk condition for the bus as a whole.

Let $\Psi_{2r}$, $\Psi_{1r}$, and $\Psi_{0r}$ be three n × n bit-pair crosstalk probability matrices which record the probability of occurrence of the three crosstalk conditions possible for the bit pair (i, j), $\forall (i, j) \in \{0, \ldots, n-1\}$, $i \neq j$: mode-2, mode-1, and mode-0. Note that $\Psi_{2r} + \Psi_{1r} + \Psi_{0r} = J_n$, where $J_n$ is the n × n unity matrix, since all the probabilities sum up to unity.
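As a rough illustration of how the three bit-pair matrices defined above could be estimated from a trace of bus words, the sketch below counts pair modes over consecutive words. It assumes the same pair classification as before (0 for both quiet or same-direction, 1 for exactly one wire switching, 2 for opposite transitions); the function name `crosstalk_matrices` and the list-of-integers trace format are hypothetical, not the profiling infrastructure used in this work.

```python
def crosstalk_matrices(words, n):
    """Estimate psi[m][i][j] = P(bit pair (i, j) shows a mode-m pattern).

    words: sequence of integer bus words (at least two entries).
    n:     bus width in bits.
    Returns a list [psi0, psi1, psi2] of n x n matrices; diagonal
    entries are left at zero because only pairs with i != j are defined.
    """
    counts = [[[0] * n for _ in range(n)] for _ in range(3)]
    transitions = len(words) - 1
    for prev, cur in zip(words, words[1:]):
        # Per-bit transition: -1 (falling), 0 (quiet), +1 (rising).
        delta = [((cur >> i) & 1) - ((prev >> i) & 1) for i in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    counts[abs(delta[i] - delta[j])][i][j] += 1
    return [[[c / transitions for c in row] for row in mat] for mat in counts]
```

For every pair with i ≠ j, the three estimated probabilities sum to one, which mirrors the relation between the three matrices and the unity matrix stated above.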
These matrices are collected by aggregating data obtained by analyzing information patterns transmitted on the target bus when running the training set benchmarks, similar to the procedure outlined in Section 5.3.1.

For three neighboring wires i, j, and k, the worst-case (1 + 4r or mode-4) crosstalk on the victim wire j occurs when both bit pairs (i, j) and (j, k) have a mode-2 crosstalk pattern. Similarly, the next worst case (1 + 3r or mode-3) crosstalk occurs when one bit pair has a mode-2 pattern and the other has a mode-1 pattern. Both of these situations necessitate transmission in two cycles with our VCB. Let event "A" represent the occurrence of a mode-1 or mode-2 pattern in the first bit pair (i, j) and event "B" the occurrence of a mode-2 or mode-1 pattern in the second bit pair (j, k), i.e., $P(A) = 1 - \psi_{0r}[i][j]$ and $P(B) = 1 - \psi_{0r}[j][k]$. Note that we use lower-case symbols ($\psi$) to represent individual elements of the crosstalk matrix $\Psi$. Since events A and B are mutually exclusive, we have P(A or B) = P(A) + P(B). We are interested in obtaining P(A or B) since this represents the probability of a mode-3 or a mode-4 crosstalk on the bus. Thus, we have:

$$P(A \text{ or } B) = (1 - \psi_{0r}[i][j]) + (1 - \psi_{0r}[j][k])$$

Following the example above, we combine the bit-pair crosstalk matrices $\Psi_{2r}$, $\Psi_{1r}$, and $\Psi_{0r}$ to get one matrix $\Psi = J_n - \Psi_{0r}$. As noted earlier, our VCB design transmits mode-4 and mode-3 patterns in two clock cycles and mode-2, mode-1, and mode-0 patterns in one clock cycle. Hence, we seek to minimize the total probability of occurrence of mode-4 and mode-3 patterns across all bit pairs through wire reordering and signaling using integer linear programming. Thus, the objective function is the sum of all these probabilities, since the events are mutually exclusive and the occurrence of a mode-4 or mode-3 event in any one bit pair means that the transmission takes two cycles instead of one. The simple wire reordering formulation, called
The simple wire reordering formulation, called 169 minimum crosstalk bit ordering (MCBO) using this objective function is discussed next. As before, the MCBO problem is formulated as an ILP by considering binary variables :1:[2'][j] associated with each bit pair (2, 3'). In the solution, :r[2'][j] = 1 if bits 2' and j are to be placed next to each other on the bus and :1:[2[j] = 0, otherwise. Let V = {1, . . . ,n} be the vertex set that represents the bits, A = {(2',j) : 2',j E V} represent the set of possible triplets of bits, and M2] [j] is the bit-pair crosstalk matrix. The ILP formulation in terms of the variables :1:[2][ j] and the iterative procedure used to solve the ILP is given next: Step 1 : Step 2 : Step 3 : Step 4 : Step 5 : Step 6 : Minimize Z wl’il [j l ‘ 5’3 lzl [j l V(z’,j) e A subject to: :c[2'][j] e {0,1},V 2,3" 6 V, (6.8) E a:[2][j]==1,‘v’2'€Vand Z :1:[2'][j]=1,Vj€V, Vj e V V 2' e V Solve ILP to get the solution. Check if the solution has subtours. If none, go to Step 6. Else, let there be t subtours: s = {30(n0),51(n1), . . .,St}, where Sk(nk) means that subtour S I: has length n 13' Add subtour elimination constraint: :(z[2][j]: (233') are in Sk(nk)) < nk,V s. (6.10) Go to Step 2. The desired solution (Hamiltonian cycle) has been obtained. Stop. In the above procedure, Constraint 6.8 ensures that the variables take only binary values and Constraint 6.9 ensures that the in- and out-degrees of every vertex are one, i.e., every bit occurs exactly once in the ordering. As explained in Section 5.4.3, we add subtour eliminations iteratively and solve the ILP efficiently with the CPLEX 170 optimizer tool. 6.3.3 MCBO with Signaling (MCBOS) In MCBO with signaling (MCBOS), the best signaling scheme—one of the five schemes listed in Section 5.4.1—and the appropriate position of the bits on the bus lines is determined simultaneously. 
As in the case of energy optimization, the motivation for using signaling is to enable the optimizer to select the optimal solution from a richer set of possibilities. Thus, we can view the problem as similar to MCBO but consisting of n supernodes corresponding to the n bits of the bus. Each supernode contains five nodes, each representing a signaling scheme choice for a bit. By adding constraints that ensure that only one of these nodes is selected for each supernode and that the incoming and outgoing nodes for each supernode are the same, the ILP for MCBOS is formulated as given next:

$$\text{Minimize} \sum_{\forall (i,j) \in A} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} \big( \psi_{l,m}[i][j] \cdot x_{l,m}[i][j] \big)$$

subject to:

$$x_{l,m}[i][j] \in \{0, 1\}, \ \forall \{i, j\} \in V, \tag{6.11}$$
$$\sum_{\forall j \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[i][j] = 1, \ \forall\, i \in V, \tag{6.12}$$
$$\sum_{\forall k \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[k][i] = 1, \ \forall\, i \in V, \tag{6.13}$$
$$\sum_{l=0}^{q-1} x_{l,m}[i][j] = \sum_{l=0}^{q-1} x_{m,l}[j][k], \ \forall \{i, j, k\} \in V, \ \forall\, m. \tag{6.14}$$

Constraint 6.11 of MCBOS ensures that all variables $x_{l,m}[i][j]$, $(l, m) \in \{0, \ldots, q-1\}$, each of which represents a choice of signaling schemes for a pair of bits, take only binary values. Constraints 6.12 and 6.13 ensure that there is only one outgoing and one incoming node, respectively, for each of the n supernodes. Constraint 6.14 ensures that the optimal tour enters and exits through the same node in a supernode (i.e., the signaling schemes chosen for adjacent pairs of bits in the final ordering are consistent). Crosstalk probabilities $\psi_{l,m}[i][i]$, $\forall\, i \in V$, are set to ∞ (a very large integer value). Finally, constraints for eliminating all subtours with two nodes are added initially, and the problem is iteratively solved as described earlier in Section 5.4.3 until a Hamiltonian cycle that visits all supernodes exactly once is found.

6.4 Results and Discussion

We study the effect of MCBO and MCBOS on the 64-bit ALU result bus of our superscalar processor architecture.
As explained earlier in Section 5.5, the result bus is on the critical path and is sensitive to delay variations due to crosstalk. Also, the performance of the processor can be improved if faster transmissions are enabled on this bus. We present two results for this bus next: crosstalk reduction using MCBO and MCBOS, and performance improvement when VCB is used with MCBO and MCBOS.

6.4.1 Peak Crosstalk Reduction

In workload-specific design, statistics collected for SimPoint samples from 13 training set benchmarks were aggregated and used to obtain the optimal static encoding schemes. The scheme was then tested on non-overlapping samples from the same set of benchmarks. The non-overlapping sample was selected as explained in Section 5.3.1. As explained earlier, our crosstalk optimization techniques MCBO and MCBOS seek to reduce the number of cycles that carry mode-4 and mode-3 patterns. From the results shown in Figures 6.3(a) and (b), we observe that both MCBO and MCBOS reduce mode-4 and mode-3 patterns significantly. The average reductions in the number of 1+4r delay cycles were MCBO: 24.89% and MCBOS: 30.61%, and the average reductions in the number of 1+3r cycles were MCBO: 19.21% and MCBOS: 23.42%.

For the general-purpose design scenario, our static schemes were designed using data collected from SimPoint samples for the training benchmarks and then evaluated on test benchmarks. Results are shown in Figures 6.4(a) and (b) for reductions in the number of mode-4 and mode-3 cycles, respectively. We observe that the average reductions in the number of 1+4r delay cycles were MCBO: 21.22% and MCBOS: 29.35%, and the average reductions in the number of 1+3r cycles were MCBO: 16.77% and MCBOS: 20.29%.

6.4.2 Performance Improvement with VCB

The reduction in the number of cycles required to transmit the information with our techniques applied is shown in Figure 6.5(a) and (b).
On average, MCBOS, which is our best technique, reduces the number of cycles by 17.68% for workload-specific optimization and by 18.30% for general-purpose optimization, while MCBO reduces the number of cycles by 13.89% and 14.44% for workload-specific and general-purpose optimizations, respectively.

[Figure 6.3. Crosstalk reduction results for workload-specific design of the 64-bit ALU result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO: 24.89% and MCBOS: 30.61%. (b) Average reductions in number of 1+3r cycles. For MCBO: 19.21% and MCBOS: 23.42%.]

[Figure 6.4. Crosstalk reduction results for general-purpose design of the 64-bit ALU result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO: 21.22% and MCBOS: 29.35%. (b) Average reductions in number of 1+3r cycles. For MCBO: 16.77% and MCBOS: 20.29%.]

[Figure 6.5. Reduction in the number of cycles taken to transmit the information with MCBO and MCBOS applied to the result bus. (a) Workload-specific optimization. (b) General-purpose optimization.]

6.5 Summary

This chapter presented a performance-oriented adaptive bus design technique that helps reduce the frequency of crosstalk conditions and adopts an adaptive approach to improve bus performance.
We presented a variable cycle bus (VCB) architecture and a crosstalk analyzer circuit that can transmit the data using either one or two clock cycles depending on the type of crosstalk pattern. Consequently, the bus clock cycle time no longer needs to be greater than the delay of the worst-case (1+4r) crosstalk pattern; instead, it can be designed using the average case or the most frequent (1+2r) crosstalk pattern. We also presented a profile-guided optimization that reduced the frequency of occurrence of 1+4r and 1+3r crosstalk patterns and thus helped improve the performance of the VCB significantly. Results on SPEC CPU 2000 benchmarks, in a general-purpose optimization scenario, show a 29.35% reduction in 1+4r cycles, a 20.29% reduction in 1+3r cycles, and a bus performance improvement of 17.42% for a static reordering and signaling technique targeting bus crosstalk minimization.

CHAPTER 7

CONCLUSION

In this dissertation, we presented our research on activity-aware modeling and design optimization for on-chip interconnects in current and future nanometer-scale technologies. We addressed three important issues in high-performance bus design for nanometer-scale microprocessors: accurate energy and thermal modeling, energy optimization techniques, and crosstalk reduction. Key contributions and results from our research are summarized next.

7.1 Contributions and Key Results

In Chapter 3, we presented a unified nanometer-scale bus energy dissipation and thermal model that can help designers monitor energy dissipation and temperature change in individual wires during trace- or execution-driven simulation.
In addition to self capacitance, our model incorporates the effects of capacitive coupling between adjacent as well as non-adjacent pairs of wires and of repeater insertion on switching energy, uses the effect of lateral heat transfer between adjacent wires to estimate wire temperatures, and also estimates wire temperature gradients and their impact on wire delay, all of which were not available in earlier models. Using this model, we studied energy and thermal characteristics of instruction and data buses using an execution-driven simulation of a billion or more instructions of nine SPEC CPU2000 benchmarks. We found that existing bus energy models provide estimates that are about 7-8% less accurate compared to our energy model. This is because they do not account for the effects of coupling between non-adjacent wire pairs of a bus. Our model, which incorporates these effects, is the first of its kind to do so. Our results also showed that, in wide instruction and data buses used in modern processors executing SPEC CPU2000 workloads, existing bus encoding schemes show no significant energy benefit due to the nature of data traffic. When non-adjacent coupling effects between wire pairs are considered, energy dissipation savings reduce considerably. Based on simulations using our thermal model, we found that average wire temperatures in data and instruction buses may rise 10–37°C during a simulation run of only a billion cycles for a 130 nm superscalar processor running SPEC benchmarks. This temperature rise is primarily due to heat generation as a result of currents flowing in the wire during bit switching. In a future 45 nm technology node, wire temperature rise for the same set of benchmarks and simulation sample was found to be between 20–58°C.
We observed that instruction and data bus wires attained absolute temperatures in the range 80.3–104°C and 97.6–123.7°C in 130 nm and 45 nm processors, respectively, during the course of our simulation, showing that signal lines attain significant temperatures too. Significant wire temperature gradients of magnitude between 16–25°C were found to be most common between the sending and receiving ends of the wires during the course of simulation. Notable correlation was found to exist between energy dissipation behavior and wire temperature rise in buses across time; short, intermittent cycles of high energy-dissipating switching activity trigger step changes in temperature.

In Chapter 4, we developed models that track the impact of changing wire temperature on timing/delay violations occurring in global signal buses during microarchitecture-level exploration. Results show that for a 130 nm processor with no power and thermal management, the rate of temperature-induced clock cycle time violations in an ALU result bus, which is on the critical path, was 2.27 per hundred bus references, averaged over ten programs in the SPEC CPU2000 workload. It increased to an average of 6.20 per hundred bus references for the same processor at the 45 nm technology node. We found that wire delay variability led to degradation in overall performance by about 4.1% in 130 nm processors and about 11.9% in 45 nm processors. Our analysis also showed that conventional techniques like bus encoding that seek to reduce energy dissipation and potentially wire temperatures have limited impact on alleviating temperature-induced delay violations.

In Chapter 5, we formulated an optimization methodology to design energy- and temperature-optimized static bus encoding schemes through early stage microarchitecture-level exploration, exploiting value characteristics of a target workload.
Binary integer linear programs (ILPs) were formulated and solved optimally to determine the signaling, bit ordering, or a combination of both that minimizes bus energy dissipation. For the SPEC CPU2000 workload, our static bit ordering and signaling (SBOS) technique reduced total bus energy dissipation by 22.79%/40.77% for data/instruction buses in an application-specific scenario, where the technique was designed individually using statistics collected for each benchmark and tested on the same benchmark. In a much more general scenario, where the scheme was designed using statistics collected from 13 out of 26 benchmarks and tested on the remaining 13, the corresponding reductions were 20.04%/38.78%. These reductions are significantly higher compared to those obtained from dynamic encoding schemes for the same benchmarks. We also proposed a first-of-its-kind methodology to design temperature-aware encoding schemes by trading off some of the energy gains we obtain with static encoding techniques to achieve wire temperature reduction. In this methodology we add temperature constraints during energy optimization, and our ILP produces a static encoding scheme that reduces maximum (hottest) wire temperatures by up to 12.26 K/12.96 K for data/instruction buses while still producing significant total bus energy reductions.

Finally, in Chapter 6, we examined techniques to reduce bus crosstalk and improve overall bus performance. We presented a variable cycle bus (VCB) architecture and a crosstalk analyzer circuit that can transmit the data using either one or two clock cycles depending on the type of crosstalk pattern. Consequently, the bus clock cycle time no longer needs to be greater than the delay of the worst-case crosstalk pattern; instead, it can be designed using the average case or the most frequent crosstalk pattern, which results in roughly doubling the bus clock frequency.
We also presented a profile-guided optimization that reduced the frequency of occurrence of worst-case crosstalk patterns and thus helped improve the performance of the VCB significantly. Results on SPEC CPU 2000 benchmarks show at least a 29.35% reduction in the number of worst-case crosstalk cycles and a bus performance improvement of 17.42% for a VCB with a static reordering and signaling technique targeting bus crosstalk minimization.

Our work represents a significant advancement over existing approaches that are activity-oblivious and/or consider worst-case traffic conditions. The microarchitecture-level activity-driven spatiotemporal bus energy and thermal model we present is the first of its kind. Our static value-aware bit reordering and signaling techniques are also highly novel solutions that work remarkably well in real applications.

7.2 Directions for Future Research

Some potential research directions for the future are outlined next.

• A methodology to dynamically select between different static wire orderings and signaling strategies for energy and/or thermal optimization can be investigated. In such a scheme, a controller will select a particular strategy based on input or hints from the compiler through data stored in the program's executable.

• The wire ordering and signaling strategies can be used to create configurable interconnect intellectual property (IIP) blocks similar to configurable IP blocks available today for logic circuits. Such an IIP block will contain the routing specification for all on-chip high-performance signals between logic blocks, suitably optimized for power, temperature, crosstalk, or a combination of the three, automatically synthesized by a CAD tool by analyzing the user-supplied workload.

• The thermal model can be enhanced to investigate thermal issues in clock trees, and a temperature-aware clock-tree synthesis approach can be developed.
• The thermal model can also be used as a starting point for analyzing issues related to three-dimensional interconnects. In such systems, the presence of multiple vertically connected interconnect stacks emphasizes the need to investigate thermal issues, since heat dissipation paths from interconnect layers may be several times longer than in conventional designs.

BIBLIOGRAPHY

[1] Semiconductor Industry Association, “International Technology Roadmap for Semiconductors (ITRS), 2005 edition,” URL: http://public.itrs.net.
[2] M. Mui, K. Banerjee, and A. Mehrotra, “A Global Interconnect Optimization Scheme for Nanometer Scale VLSI with Implications for Latency, Bandwidth, and Power Dissipation,” IEEE Transactions on Electron Devices, vol. 51, no. 3, pp. 195–203, Feb. 2004.
[3] S. Rusu, “Circuit Technologies for Multi-Core Design,” Talk at the IEEE Santa Clara Valley Solid-State Circuits Society, slides at: http://www.ewh.ieee.org/r6/scv/ssc/Apri106.pdf, Apr. 2006.
[4] N. Magen, A. Kolodny, U. Weiser, and N. Shamir, “Interconnect-Power Dissipation in a Microprocessor,” in Proceedings of the 2004 International Workshop on System Level Interconnect Prediction (SLIP ’04). New York, NY, USA: ACM Press, 2004, pp. 7–13.
[5] S. Im and K. Banerjee, “Full Chip Thermal Analysis of Planar (2-D) and Vertically Integrated (3-D) High Performance ICs,” in Proceedings of the IEEE International Electron Devices Meeting (IEDM). Piscataway, NJ, USA: IEEE Press, Dec. 2000, pp. 727–730.
[6] P. Gelsinger, “Microprocessors for the New Millennium: Challenges, Opportunities and New Frontiers,” in Proceedings of the IEEE Solid-State and Circuits Conference (ISSCC). Piscataway, NJ, USA: IEEE Press, Dec. 2001, pp. 22–25.
[7] K. Nabors, S. Kim, J. White, and S. Senturia, “Fast Capacitance Extraction of General Three-Dimensional Structures,” in Proceedings of International Conference on Computer Design (ICCD). Washington DC, USA: IEEE Computer Society, Oct. 1991, pp. 479–484.
[8] M.
Bohr, “Interconnect Scaling: The Real Limiter to High Performance ULSI,” in Proceedings of the International Electron Devices Meeting (IEDM). Piscat- away, NJ, USA: IEEE Press, Dec. 1995, pp. 241—244. L. Lev and P. Chao, “Down to the Wire: Requirements for N anometer Design Implementation,” White Paper, Cadence Design Systems Inc., 2002. 183 [10] W. Li, B. Mbouombouo, and L. Tsai, “Needed: High-Level Interconnect Methodology for N anometer ICs,” EE Times, http://www.eetimes.com/story/ OEG2003062380039, June 2003. [11] P. Green, “A GHz IA-32 Architecture Microprocessor Implemented on 0.18am Technology with Aluminum Interconnect,” in Proceedings of the IEEE Solid- State and Circuits Conference (ISSCC). Piscataway, NJ, USA: IEEE Press, 2000, pp. 98-99. [12] J. Heidenreich, D. Edelstein, R. Goldblatt, W. Cote, C. Uzoh, N. Lustig, T. McDevitt, A. Stamper, A. Simon, J. Dukovic, P. Andriacacos, R. Wash- nik, H. Rathore, T. Katsetos, P. McLaughlin, S. Luce, and J. Slattery, “Copper Dual Damascene for sub—0.25am CMOS,” in Proceedings of the IEEE Interna- tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 1998, pp. 151—153. [13] B. Zhao, D. Feiler, V. Ramanathan, Q. Liu, M. Brongoa, J. Wu, H. Zhang, J. Kuei, and D. Young, “A Cu Low-k Dual Damascene Interconnect for High- Performance and Low Cost Integrated Circuits,” in Proceedings of the IEEE Symposium on VLSI Technology. Piscataway, NJ, USA: IEEE Press, June 1998, pp. 28—29. [14] P. Zarkesh—Ha, J. Davis, and J. Meindl, “The Impact of Cu/Low-k on Chip Performance,” in Proceedings of the IEEE International ASIC/SOC Conference. Piscataway, NJ, USA: IEEE Press, 1999, pp. 257—261. [15] H. Feng, F. Ercal, and F. Bunyak, “Systolic Algorithm for Processing RLE Images,” in IEEE Southwest Symposium on Image Analysis and Interpretation. Piscataway, NJ, USA: IEEE Press, 1998, pp. 127-131. [16] S. Chai, A. Gentile, W. Lugo—Beauchamp, J. Fonseca, J. Cruz-Rivera, and D. 
Wills, “Focal Plane Processing Architectures for Real-Time Hyperspectral Image Processing,” Applied Optics: Special Issue on Optics in Computing, vol. 39, pp. 835—849, Feb. 2000. [17] W. Dally, “Interconnect-limited VLSI architecture,” in Proceedings of the IEEE International Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 1999, pp. 15-17. [18] J. Goodman, R. Kostuk, and B. Clymer, “Optical Interconnects: An Overview,” in Proceedings of the 2"" International IEEE VLSI Multilevel Interconnection Conference. Piscataway, NJ, USA: IEEE Press, 1985, pp. 219—224. 184 [19] [20] [21] [22] [23] [24] [25] [26] [27] A. Rahman, A. Fan, J. Chung, and R. Reif, “Wire-Length Distribution of Three-Dimensional Integrated Circuits,” in Proceedings of the IEEE Interna- tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 1999, pp. 233—235. S. Souri and K. Saraswat, “Interconnect Performance Modeling for 3D Inte- grated Circuits with Multiple Si Layers,” in Proceedings of the IEEE Interna- tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 1999, pp. 24—26. K. Banerjee, “Trends for ULSI Interconnections and Their Implications for Thermal, Reliability and Performance Issues (Invited Paper),” in Proceedings of the Seventh International Dielectrics and Conductors for ULSI Multilevel Interconnection Conference (DCMIC). Tampa, FL, USA: IMIC, Mar. 2001, pp. 38—50. W. Dally and J. Poulton, Digital Systems Engineering. Cambridge University Press, 1998. A. Krishnamoorthy and D. Miller, “Scaling Optoelectronic-VLSI Circuits into the 2lst century: A Technology Roadmap,” IEEE Journal on Selected Topics in Quantum Electronics, vol. 2, no. 1, pp. 55—76, 1996. T. Mule, S. Schultz, T. Gaylord, and J. Meindl, “An Optical Clock Distribution Network for Gigascale Integration,” in Proceedings of the IEEE International Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 2000, pp. 176—179. J. 
Joyner and J. Meindl, “Opportunities for Reduced Power Dissipation Using Three-Dimensional Integration,” in Proceedings of the IEEE International In- terconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June 2002, pp. 148-150. J. Joyner, P. Zarkesh-Ha, J. Davis, and J. Meindl, “Vertical Pitch Limitations on Performance Enhancement in Bonded Three-Dimensional Interconnect Ar- 9 chitectures,’ in Proceedings of the International Workshop on System-Level In- terconnect Prediction. New York, NY, USA: ACM Press, 2000, pp. 123—127. K. Saraswat, S. Souri, K. Banerjee, and P. Kapur, “Performance Analysis and Technology of 3-D ICs,” in Proceedings of the International Workshop on System-Level Interconnect Prediction. New York, NY, USA: ACM Press, Apr. 2000, pp. 85—90. 185 [28] [29] [30] [31] [32] [33] [34] [35] [37] [38] A. Shilov, “Intel to Cancel NetBurst, Pentium 4, Xeon Evolution: Tejas, Jayhawk Reportedly Shelved,” X—Bit Laboratories, http://www.xbitlabs.com/ news / cpu / display / 20040507000306.htm1, May 2004. J. Kovar, “Sun Cancels UltraSPARC V, Gemini, But Not Future Processor De- velopment,” CMP Media’s CRN, http://www.crn.com/sections/breakingnews/ dailyarchivesjhtml?articleId=18841521, Apr. 2004. K. Krewell, “Multicore Mania is Here to Stay,” Electronic Design News (EDN), http: / / www.edn.com / article / CA6302 185.html?partner=eb&pubdate= 2%2F1%2F2006, Feb. 2006. D. Brooks, M. Martonosi, J. Wellman, and P. Bose, “Power-Performance Model- ing and Tfadeoff Analysis for a High End Microprocessor,” in Proceedings of the First International Workshop on Power-Aware Computer Systems (PACS’OO) held with ASPLOS-IX, Nov. 2000. J. Cong, “An Interconnect-Centric Design Flow for Nanometer Technologies,” Proceedings of the IEEE, vol. 89, no. 4, pp. 505—528, April 2001. V. Agarwal, M. Hrishikesh, S. Keckler, and D. 
Burger, “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures,” in Proceedings of the Annual Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM Press, July 2000, pp. 248—259. R. Ho, K. Mai, and M. Horowitz, “The Future of Wires,” Proceedings of the IEEE, vol. 89, no. 4, pp. 490-504, Apr. 2001. T. N. Vijaykumar and Z. Chishti, “Wire Delay is Not a Problem for SMT (In the Near Future),” in Proceedings of the Annual Symposium on Computer Architecture (ISCA). Washington, DC, USA: IEEE Computer Society, July 2004, pp. 40—50. J. Meindl, J. Davis, P. Zarkesh—Ha, C. Patel, K. Martin, and P. Kohl, “Inter- connect Opportunities for Gigascale Integration,” IBM Journal of Research and Development, vol. 46, no. 2, pp. 245-263, Mar. 2002. K. Banerjee and A. Mehrotra, “Global Interconnect Warming,” IEEE Circuits and Devices, vol. 17, pp. 16—32, Sept. 2001. A. Ajami, K. Banerjee, and M. Pedram, “Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects,” IEEE Transac- tions on Computer Aided Design of Integrated Circuits and systems, vol. 24, no. 6, pp. 849—860, June 2005. 186 [39] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Ap- proach, Third Edition. Morgan Kaufmann Publishers, 2003. [40] J. Shen and M. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors. McGraw Hill, 2004. [41] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison- Wesley, 1990. [42] J. Davis, V. De, and J. Meindl, “A Stochastic Wire-Length Distribution for Gigascale Integration—Part I: Derivation and Validation,” IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 580—589, Mar. 1998. [43] K. Banerjee and A. Mehrotra, “Analysis of On—Chip Inductive Effects for Dis- tributed RLC Interconnects,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 8, pp. 904—915, Aug. 2002. [44] R. 
Kumar, “Interconnect and Noise Immunity Design for the Pentium 4 Proces— sor,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DAC). New York, NY, USA: ACM Press, 2003, pp. 938—943. [45] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, Second Edition. Prentice-Hall, Dec. 2002. [46] A. Naeemi, R. Venkatesan, and J. D. Meindl, “Optimal Global Interconnects for GSI,” IEEE Transactions on Electron Devices, vol. 50, no. 4, pp. 980—987, Apr. 2003. [47] M. Stan and W. Burleson, “Low—Power Encodings for Global Communication in CMOS VLSI,” IEEE Transactions on VLSI Systems, vol. 5, no. 4, pp. 444—455, Dec. 1997. [48] J. Liu, N. Mahapatra, and K. Sundaresan, “Hardware—Only Compression to Reduce Cost and Improve Utilization of Address Buses,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Los Alamitos, CA, USA: IEEE Computer Society, Feb. 2003, pp. 220—221. [49] J. Liu, K. Sundaresan, and N. Mahapatra, “Energy-Efficient Compressed Ad- dress Transmission,” in Proceedings of the 18‘” International Conference on VLSI Design (VLSID). Washington, DC, USA: IEEE Computer Society, Jan. 2005, pp. 592—597. [50] ————, “Fast Perfomance—Optimized Partial Match Compression for Low-Latency On-Chip Address Buses,” in Proceedings of International Conference on Cam- puter Design (ICCD). Piscataway, NJ, USA: IEEE Press, Oct. 2006. 187 [51] [52] [53] [54] [55] [56] [57] [58] [60] M. Stan and W. Burleson, “Bus-Invert Coding for Low-Power I/O,” IEEE Transactions on VLSI Systems, vol. 3, no. 1, pp. 49—58, Mar. 1995. L. Benini, G. D. Micheli, E. Macii, D. Sciuto, and C. Silvano, “Address Bus Encoding Techniques for System-Level Power Optimization,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer Society, Feb. 1998. Y. Zhang, J. Lach, K. Skadron, and M. 
Stan, “Odd/Even Bus Invert with Two-Phase Transfer for Buses with Coupling,” in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). New York, NY, USA: ACM Press, Aug. 2002, pp. 80—83. K. Kim, K. Back, N. Shanbhag, C. Liu, and S. Kang, “Coupling-Driven Signal Encoding Scheme for Low-Power Interface Design,” in Proceedings of the Inter- national Conference on Computer-Aided Design (ICCAD). Washington, DC, USA: IEEE Computer Society, Nov. 2000, pp. 318—321. B. Victor and K. Keutzer, “Bus Encoding to Prevent Crosstalk Delay,” in Pro— ceedings of the International Conference on Computer-Aided Design (I CCAD). Piscataway, NJ, USA: IEEE Press, Nov. 2001, pp. 57—63. S. Khatri, A. Mehrotra, R. Brayton, R. Otten, and A. Sangiovanni-Vincentelli, “A Novel VLSI Layout Fabric for Deep Sub-Micron Applications,” in Proceed— ings of the Annual ACM/IEEE Design Automation Conference (DA C). New York, NY, USA: ACM Press, 1999, pp. 491—496. R. Arunachalam, E. Acar, and S. Nassif, “Optimal Shielding/Spacing Metrics for Low Power Design,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI. Washington DC, USA: IEEE Computer Society, Feb. 2003, pp. 167—171. D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,” in Proceedings of the Annual Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM Press, 2000, pp. 83—94. W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DAC). New York, NY, USA: ACM Press, June 2000, pp. 340—345. A. Dhodapkar, C. Lim, G. Cai, and W. 
Daasch, “TEM2P2EST: A Thermal Enabled Multi-model Power/Performance ESTimator,” in Lecture Notes In 188 [61] [62] [63] [64] [65] [56] [67] [68] [69l [70] Computer Science, Proceedings of the First International Workshop on Power- Aware Computer Systems (PACS’OO) held with ASPLOS—IX, November, 2000. Springer-Verlag, 2001, pp. 112—125. J. Smith, L. He, A. Dhodapkar, and N. Nidhi, “WArPE: Wisconsin Architecture Power Estimator,” URL: http://eda.ee.ucla.edu/ntool/. The Sim-Panalyzer Team, “SimpleScalar-ARM Power Modeling Project,” URL: http: //www.eecs.umich.edu/~panalyzer/ . D. Ponomarev, G. Kukuk, and K. Ghose, “AccuPower: An Accurate Power Estimation Tool for Superscalar Microprocessors,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer Society, Mar. 2002, pp. 124—128. K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-Aware Microarchitecture,” in Proceedings of the An- nual Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM Press, June 2003, pp. 2-13. Y. Zhang, R. Chen, W. Ye, and M. Irwin, “System Level Interconnect Power Modeling,” in Proceedings of the IEEE ASIC/SOC Conference. Piscataway, NY, USA: IEEE Press, Sept. 1998, pp. 289-293. W. Huang, M. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy, “Compact Thermal Modeling for Temperature-Aware Design,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DA C). New York, NY, USA: ACM Press, June 2004, pp. 878—883. D. Burger and T. Austin, “The SimpleScalar Tool Set, version 2.0,” Computer Architecture News, vol. 25, no. 5, pp. 13—25, June 1997. Michigan State University High Performance Computing Center, “128 Node Opteron Cluster from Western Scientific,” https://hpc.msu.edu/twiki/bin/ view/Main/WesternScientificCluster. T.-Y. Chiang, K. Banerjee, and K. 
Saraswat, “Compact Modeling and SPICE- Based Simulation for Electrothermal Analysis of Multilevel ULSI Inteconnects,” in Proceedings of the International Conference on Computer-Aided Design (IC- CAD). Washington, DC, USA: IEEE Computer Society, Nov. 2001, pp. 165— 172. T.-Y. Wang and C.-P. Chen, “SPICE-Compatible Thermal Simulation with Lumped Circuit Modeling for Thermal Reliability Analysis based on Modeling 189 [71] [72] [73] [74l [75l [76] [77] [78] [79] [80] Order Reduction,” in Proceedings of International Symposium on Quality of Electronics Design (ISQED), 2004. L. Shang, L.-S. Peh, A. Kumar, and N. K. Jha, “Thermal Modeling, Character- ization and Management of On—Chip Networks,” in Proceedings of the Annual ACM/IEEE International Symposium on Microarchitecture (MICRO). Los Alamitos, CA, USA: IEEE Computer Society, Dec. 2004, pp. 67—78. K. Banerjee, A. Mehrotra, A. Sangiovanni—Vincentelli, and C. Hu, “On Thermal Effects in Deep Sub-Micron VLSI Interconnects,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DAC). New York, NY, USA: ACM Press, 1999, pp. 885—891. D. Chen, E. Li, B. Rosenbaum, and S. Kang, “Interconnect Thermal Modeling for Accurate Simulation of Circuit Timing and Reliability,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 2, pp.197-205,be.2000. R. Desikan, D. Burger, S. Keckler, and T. Austin, “Sim-alpha: A Validated, Execution-Driven Alpha 21264 Simulator,” The University of Texas at Austin, Department of Computer Sciences, Tech. Rep. TR—01-23, 2001. Standard Performance Evaluation Council, “CPU2000 Version 1.2,” http:// www.spec.org/cpu2000, 2001. SimpleScalar LLC, http://www.simplescalar.com. T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Char- acterizing Large Scale Program Behavior,” in Proceedings of the International Conference on Architectural Support for Programming Languages and Systems (ASPLOS). New York, NY, USA: ACM Press, Oct. 2002, pp. 45—57. 
——-, “SimPoint Single Simulation Points for SPEC CPU 2000,” URL: http: / /www.cse. ucsd.edu/~calder/simpoint/single— sim— pionts.htm. G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster and More Flexible Program Analysis,” The Journal of Instruction-Level Parallelism, vol. 7, 2005, http://www.jilp.org/vol7/v7paper14.pdf. K. Sundaresan and N. Mahapatra, “An Accurate Energy and Thermal Model for Global Signal Buses,” in Proceedings of the 18th International Conference on VLSI Design. Washington DC, USA: IEEE Computer Society, Jan. 2005, pp. 685—690. 190 [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] —, “Accurate Energy Dissipation and Thermal Modeling for Nanometer- Scale Signal Buses,” in Proceedings of International Symposium on High Per- formance Computer Architecture {HPCA). Washington DC, USA: IEEE Com- puter Society, Feb. 2005, pp. 51-60. S. Borkar, “Design Challenges of Technology Scaling,” IEEE Micro, vol. 19, no. 4, pp. 23—29, Jul—Aug. 1999. T.-Y. Chiang and K. Saraswat, “Closed-Form Analytical Thermal Model for Accurate Temperature Estimation of Multilevel ULSI Interconnects,” in 2003 Symposium on VLSI Circuits Digest of Papers. Piscataway, NJ, USA: IEEE Press, June 2003, pp. 275-279. K. Banerjee and A. Mehrotra, “Coupled Analysis of Electromigration Reliabil- ity and Performance in ULSI Signal Nets,” in Proceedings of the International Conference on Computer-Aided Design (ICCAD). Washington, DC, USA: IEEE Computer Society, Nov. 2001, pp. 158—164. P. Sotiriadis and A. Chandrakasan, “A Bus Energy Model for Deep Submicron Technology,” IEEE Transactions on VLSI Systems, vol. 10, no. 3, pp. 341—350, June 2002. W.-C. Cheng and M. Pedram, “Memory Bus Encoding for Low-Power: A Th- torial,” in Proceedings of International Symposium on Quality of Electronics Design (ISQED). Washington, DC, USA: IEEE Computer Society, Mar. 2001. P. Sotiriadis and A. 
Chandrakasan, “Low Power Bus Coding Techniques Con- sidering Inter-wire Capacitances,” in Proceedings of Custom Integrated Circuits Conference (CICC). Washington DC, USA: IEEE Computer Society, May 2000, pp. 414—419. H. Deogun, R. Rao, D. Sylvester, and D. Blaauw, “Leakage- and Crosstalk- Aware Bus Encoding for Total Power Reduction,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DAC). New York, NY, USA: ACM Press, June 2004, pp. 779—782. N. Menezes and L. Pillegi, “Analyzing On-Chip Interconnect Effects,” in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, W. Bowhill, and F. Fox, Eds. Piscataway, NJ, USA: IEEE Press, 2000, pp. 331—351. A. Kahng, K. Masuko, and S. Muddu, “Analytical Delay Models for VLSI Inter- connects under Ramp Input,” in Proceedings of the International Conference on Computer-Aided Design (ICCAD). Washington, DC, USA: IEEE Computer Society, Nov. 1996, pp. 30—36. 191 [91] J. Srinivasan and S. Adve, “The Importance of Heat-Sink Modeling for DTM and a Correction to Predictive DTM for Multimedia Applications,” In Pro- ceedings 0f the Fourth Annual Workshop on Duplicating, Deconstructing, and Debunking, Madison, WI, USA, June 2005. [92] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes in C: The Art of Scientific Computing. New York, NY, USA: Cambridge University Press, 1992. [93] R. Chandra, “Impact of Thermal Analysis on Large Chip Design,” Elec- tronic Design Process Symposium (EPDS 2005) talk slides, URL: http://www. gradient-da.com/pdf/ EDP_for.website.pdf , 2005. [94] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, “The Case for Lifetime Reliability-Aware Microprocessors,” in Proceedings of the Annual Symposium on Computer Architecture (ISCA). Washington, DC, USA: IEEE Computer Society, June 2004, pp. 276—286. [95] E. W. Weisstein, “Geometric Centroid,” From MathWorld—A Wolfram Web Resource, http: //mathworld.wolfram.com/GeometricCentroid.html. [96] K. Agarwal, D. Sylvester, D. 
Blaauw, F. Liu, S. Nassif, and S. Vrudhula, “Vari- ational Delay Metrics for Interconnect Timing Analysis,” in Proceedings of the Annual ACM/IEEE Design Automation Conference (DA C). New York, NY, USA: ACM Press, 2004, pp. 381—384. [97] P. Bose, “Power- and Reliability-Aware (Integrated) Design: Challenges and Opportunities,” Talk slides URL: ee.usc.edu / news / dls / talks / bose_presentation. pdf, Oct. 2005. [98] K. Sundaresan and N. Mahapatra, “Value-Based Bit Ordering for Energy Op- timization of On—Chip Global Signal Buses,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Leuven, Belgium: European Design and Automation Association, Mar. 2006, pp. 624—625. [99] Z. Lu, W. Huang, J. Lach, M. Stan, and K. Skadron, “Interconnect Lifetime Prediction under Dynamic Stress for Reliability-Aware Design,” in Proceedings of the International Conference on Computer-Aided Design (ICCAD). Wash- ington, DC, USA: IEEE Computer Society, Nov. 2004, pp. 327—334. [100] Q. Zhou and K. Mohanram, “Elmore Model for Energy Estimation in RC Trees,” in Proceedings of the Annual ACM/IEEE Design Automation Confer- ence (DAC). New York, NY, USA: ACM Press, July 2006, pp. 965—970. 192 [101] [102] [103] [104] [105] [106] [107] [108] [109) [110] [111] S. Ramprasad, N. Shanbhag, and I. Hajj, “Information-Theoretic Bounds on Average Signal Transition Activity,” IEEE Transactions on VLSI Systems, vol. 7, no. 3, pp. 359—368, Sept. 1999. R.-B. Lin and C.-M. T sai, “Theoretical Analysis of Bus-Invert Coding,” IEEE Transactions on VLSI Systems, vol. 10, no. 6, pp. 929—935, Dec. 2002. Y. Shin and K. Choi, “Narrow Bus Encoding for Low Power Systems,” in Pro— ceedings of Asia and South Pacific Design Automation Conference (ASPDA C). New York, NY, USA: ACM Press, Jan. 2000, pp. 217—220. P. Sotiriadis and A. 
Chandrakasan, “Bus Energy Minimization by Transition Pattern Coding (TPC) Using a Detailed Deep Sub-Micron Bus Model,” in Pro- ceedings 0f the International Conference on Computer-Aided Design (ICCAD). Washington, DC, USA: IEEE Computer Society, Nov. 2001, pp. 322—328. L. Macchiarulo, E. Macii, and M. Poncino, “Low-Energy Encoding for Deep— Submicron Address Buses,” in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED). New York, NY, USA: ACM Press, 2001, pp. 176—181. —, “Wire Placement for Crosstalk Energy Minimization in Address Buses,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer Society, Mar. 2002, pp. 158—162. E. Macii, M. Poncino, and S. Salerno, “Combining Wire Swapping and Spacing for Low-Power Deep-Submicron Buses,” in Proceedings of Great Lakes Sym- posium on VLSI (GLSVLSI). New York, NY, USA: ACM Press, 2003, pp. 198—202. E. Naroska, S.-J. Ruan, and U. Schwiegelshohn, “An Efficient Algorithm for Simultaneous Wire Permutation, Inversion, and Spacing,” in Proceedings of International Symposium on Circuits and Systems (ISCAS). Piscataway, NJ, USA: IEEE Press, May 2005, pp. 109—112. L. Deng and M. Wong, “Energy Optimization in Memory Address Bus Structure for Application-Specific Systems,” in Proceedings of Great Lakes Symposium on VLSI (CLSVLSI). New York, NY, USA: ACM Press, Apr. 2005, pp. 232—237. F.Wang, Y. Xie, N. Vijaykrishnan, and M. Irwin, “On-Chip Bus Analysis and Optimization,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Leuven, Belgium: European Design and Automation Association, Mar. 2006, pp. 850-855. ILOG, Inc., “CPLEX 9.0 ,” http://www.ilog.com/products/cplex, 2003. 193 [112] R. Kumar, “Interconnect and Noise Immunity Design for the Pentium 4 Proces- sor,” Intel Technology Journal, Ist Quarter, vol. Q1, 2001. [113] Berkeley Espresso minimization tool, “Web version,” http://embsys. technikum-wien. 
at / espresso / html / espressohtml. [114] P. Groeneveld, “Wire Ordering for Detailed Routing,” IEEE Design and Test, vol. 6, no. 6, pp. 6—17, 1989. [115] M. Marek-Sadowska and M. Sarrafzadeh, “The Crossing Distribution Prob— [116] [117] [118] [119] [120] [121] [122] lem,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, no. 4, pp. 423—433, Apr. 1995. X. Song and Y. Wang, “On the Crossing Distribution Problem,” ACM Trans- actions on Design Automation of Electronic Systems, vol. 4, no. 1, pp. 39—51, 1999. D. Knuth, The Art of Computer Programming. Reading, MA: Addison-Wesley Longman, 1973. L. He and K. Lepak, “Simultaneous Shield Insertion and Net Ordering for Capacitive and Inductive Coupling Minimization,” in Proceedings of the Inter- national Conference on Computer-Aided Design (I CCAD). Los Alamitos, CA, USA: IEEE Computer Society Press, 2000, pp. 55—60. J. Ma and L. He, “Formulae and Applications of Interconnect Estimation Con- sidering Shield Insertion and Net Ordering,” in Proceedings of the International Conference on Computer-Aided Design (I CCAD). Piscataway, NJ, USA: IEEE Press, 2001, pp. 327—332. P. Sotiriadis and A. Chandrakasan, “Reducing Bus Delay in Sub-Micron Tech- nology Using Coding,” in Proceedings of Asia and South Pacific Design Au- tomation Conference (ASPDAC). New York, NY, USA: ACM Press, Jan. 2001, pp. 109—114. S. Sridhara, A. Ahmed, and N. Shanbhag, “Area and Energy-Efficient Crosstalk Avoidance Codes for On-Chip Buses,” in Proceedings of International Confer- ence on Computer Design (ICCD). Washington, DC, USA: IEEE Computer Society, Oct. 2004, pp. 12—17. C. Duan and S. Khatri, “Exploiting Crosstalk to Speed up On—Chip Buses,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer Society, 2004, pp. 20 778—20 782. 194 [123] L. Li, N. Vijaykrishnan, M. Kandemir, and M. 
Irwin, “A Crosstalk Aware Interconnect with Variable Cycle Transmission,” in Proceedings of Conference on Design Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer Society, 2004, pp. 10102—10106. 195 I[III][[I[][IIQ[II[[I
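The static reordering and signaling idea summarized in the conclusions can be illustrated with a small sketch. It is not the dissertation's implementation; it rests on several assumptions made only for illustration. A worst-case crosstalk cycle is taken to be one in which some wire switches in the opposite direction to both of its adjacent neighbors. The only signaling option modeled is static inversion of individual wires. The search is exhaustive, which is feasible only for very narrow buses. The function names (`apply_encoding`, `worst_case_cycles`, `best_static_encoding`) and the toy trace are hypothetical.

```python
from itertools import permutations


def apply_encoding(word, order, invert_mask):
    """Place data bit order[i] on wire i, inverting it if bit i of invert_mask is set."""
    encoded = 0
    for wire, bit in enumerate(order):
        value = (word >> bit) & 1
        value ^= (invert_mask >> wire) & 1
        encoded |= value << wire
    return encoded


def worst_case_cycles(trace, order, invert_mask, width):
    """Count transitions in which some inner wire switches opposite to both neighbors.

    Assumed definition of a worst-case crosstalk cycle; edge wires are ignored
    because they have only one neighbor.
    """
    if not trace:
        return 0
    count = 0
    prev = apply_encoding(trace[0], order, invert_mask)
    for word in trace[1:]:
        cur = apply_encoding(word, order, invert_mask)
        for w in range(1, width - 1):
            # Signed change on the left neighbor, the wire itself, and the right neighbor.
            left, mid, right = (
                ((cur >> k) & 1) - ((prev >> k) & 1) for k in (w - 1, w, w + 1)
            )
            if mid != 0 and left == -mid and right == -mid:
                count += 1
                break  # count each cycle at most once
        prev = cur
    return count


def best_static_encoding(trace, width):
    """Exhaustively search all bit orderings and inversion masks (tiny widths only)."""
    best = None
    for order in permutations(range(width)):
        for mask in range(1 << width):
            cost = worst_case_cycles(trace, order, mask, width)
            if best is None or cost < best[0]:
                best = (cost, order, mask)
    return best


if __name__ == "__main__":
    # Toy 4-bit trace that alternates between two complementary words.
    trace = [0b0101, 0b1010, 0b0101, 0b1010]
    print(worst_case_cycles(trace, (0, 1, 2, 3), 0, 4))
    print(best_static_encoding(trace, 4))
```

In this sketch the encoding is fixed once from a profiled trace, mirroring the design-time (static) selection described in the text. A realistic bus width would need a heuristic or mathematical-programming formulation rather than brute force.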