REDUCED OCCUPANCY FOR PROCESSOR EFFICIENCY THROUGH DOWNSIZED OUT-OF-ORDER ENGINE AND REDUCED TRAFFIC

By

Ziad Youssfi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Electrical and Computer Engineering

2011

ABSTRACT

REDUCED OCCUPANCY FOR PROCESSOR EFFICIENCY THROUGH DOWNSIZED OUT-OF-ORDER ENGINE AND REDUCED TRAFFIC

By

Ziad Youssfi

Reducing microprocessor power consumption can alleviate expensive thermal management, improve reliability, allow increased performance, and reduce carbon emissions. Microarchitecture techniques that reduce instruction occupancy in the out-of-order engine, ultimately creating more empty segments that can be disabled, provide an attractive approach to reducing power consumption. This thesis presents two guiding principles for reducing occupancy to implement such an approach. The first involves downsizing the out-of-order engine as it becomes less effective in issuing instructions out of order. The second reduces instruction traffic by dampening traffic bursts. This thesis develops two techniques that implement these principles. One downsizes the out-of-order engine based on a data dependence index (DDI); the second moderates instruction traffic based on the sustained dispatch rate (SDR). The SDR, in addition to reducing occupancy, reduces the i-cache's power consumption. Total processor power savings are reported. Component power savings are also reported to show their relative effects on total savings. Simulations find that occupancy was reduced by 27% to 33% in the out-of-order engine and by 40% in the fetch buffer. These occupancy reductions are, to our knowledge, the largest reported to date, with a performance loss of only 2.6%. Reduced occupancy results in fewer active segments, which directly contributes to power savings. Power models estimate that the savings for a single-threaded, high-performance processor can total 27% of overall dynamic power consumption, including level-1 caches.

Acknowledgments

I would like to express my deep appreciation to my wife, Maya, whose love gives me comforting perspective on life and outweighs any hardship; to my mother, whose love makes me a better person; to my sister, Rasha, and her family, whose support is steadfast; to my father, whose pride in me gives me confidence; and to Maya's parents, Andi and Baruch, whose supportive words always remind me that I am on the right path.

I would also like to express my appreciation to my advisor, Dr. Michael Shanblatt, for taking me on as a Ph.D. student and for his patience over the many years it took to finish this dissertation. Finally, I express my appreciation to my committee members, Dr. Richard Enbody, Dr. Erik Goodman, and Dr. Shlomo Levental, for their helpful comments and support along the way.

Table of Contents

List of Tables ................................................................ vii
List of Figures ............................................................... viii
1 Introduction: Reducing Processor Power Consumption ....................... 1
1.1 Motivation ................................................................ 1
1.2 Thesis Summary ............................................................ 2
1.3 Dynamic Power Consumption ................................................. 4
1.3.1 CMOS VLSI Design ........................................................ 4
1.3.2 Estimating Dynamic Power ................................................ 6
1.3.3 Increasing Trend ........................................................ 8
1.4 Static Power Consumption .................................................. 10
1.5 Power Density ............................................................. 11
1.6 Thesis Contribution ....................................................... 12
1.6.1 Current Occupancy Reduction Techniques .................................. 12
1.6.2 New Principles for Active Occupancy Reduction ........................... 13
1.6.3 Implementation of Principles ............................................ 14
2 Background & Related Work ................................................... 16
2.1 Processor Power Reduction Levels .......................................... 17
2.1.1 Device Level ............................................................ 17
2.1.2 Circuit Level ........................................................... 18
2.1.3 Microarchitecture Level ................................................. 19
2.1.4 System Level ............................................................ 21
2.1.5 Compiler Level .......................................................... 22
2.2 Complexity of the Out-of-Order Issue ...................................... 22
2.2.1 Out-of-Order Pipeline Flow .............................................. 23
2.2.2 Circuit Complexity ...................................................... 24
2.3 Power Reduction for the Out-of-Order Engine ............................... 27
2.3.1 Classification .......................................................... 27
2.3.2 Downsizing with Neutral Occupancy Reduction ............................. 29
2.3.3 Downsizing with Active Occupancy Reduction .............................. 30
2.3.4 Fixed-Size Techniques ................................................... 33
2.4 Complexity Reduction of the Out-of-Order Issue ............................ 33
2.4.1 Associative Techniques .................................................. 33
2.4.2 Non-Associative Techniques .............................................. 35
2.4.3 Scheduling Techniques ................................................... 38
2.5 Power Models at the Microarchitecture Level ............................... 40
2.5.1 The Wattch Approach ..................................................... 41
3 Methods ..................................................................... 45
3.1 Efficiency Principles for the Out-of-Order Engine ......................... 46
3.1.1 Principle 1 ............................................................. 46
3.1.2 Principle 2 ............................................................. 48
3.1.3 Tandem Principle ........................................................ 49
3.2 Downsizing the Out-of-Order Engine ........................................ 49
3.2.1 The Data Dependence Index ............................................... 49
3.2.2 Token Passing Algorithm ................................................. 52
3.2.3 Extended Register-Renaming Algorithm .................................... 55
3.2.4 Validation .............................................................. 56
3.2.5 Downsizing Policy ....................................................... 56
3.3 Moderating Front-End Traffic Pressure ..................................... 59
3.3.1 Fetch Pause Based on Data Dependence .................................... 59
3.3.2 Fetch Pause Based on Sustained Dispatch Rate ............................ 60
3.3.3 Fetch Pause Based on Sustained Dispatch Rate and Occupancy .............. 60
3.4 Clustering Fetched Instructions ........................................... 62
3.5 Evaluation Framework ...................................................... 62
3.5.1 Architecture Models ..................................................... 62
3.5.2 Power Models ............................................................ 62
3.5.3 Simulation Methods ...................................................... 63
4 Results ..................................................................... 65
4.1 Instruction Occupancy Reduction ........................................... 66
4.2 Traffic Reduction ......................................................... 68
4.3 Fetch Clustering .......................................................... 69
4.4 Power Savings ............................................................. 71
4.5 Performance Effect ........................................................ 73
5 Discussion of Results ....................................................... 75
5.1 Validation of Principle 1 ................................................. 75
5.2 Validation of Principle 2 and the Tandem Principle ........................ 76
5.3 Significance of Clustering Fetched Instructions ........................... 77
5.4 Significance of Power Reduction Results ................................... 78
5.5 Comparative Discussion .................................................... 79
6 Conclusion & Future Research ................................................ 80
6.1 Conclusion ................................................................ 80
6.2 Future Research ........................................................... 83
6.2.1 Applications to Power Efficiency ........................................ 83
6.2.2 Applications to Performance Improvement ................................. 84
References .................................................................... 89

List of Tables

Table 1 Capacitive load equations ............................................. 43
Table 2 Processor parameters used for simulations ............................. 63
Table 3 IPC performance for all benchmarks and for all methods ................ 74

List of Figures

Figure 1. CMOS process for an inverter gate ................................... 5
Figure 2. Logic circuit diagram for an inverter gate with a capacitive load .. 6
Figure 3. The relative increase in processor clock frequency .................. 9
Figure 4. The rise in dynamic power consumption with technology generations .. 9
Figure 5. The percentage of power consumption due to leakage current ......... 11
Figure 6. Power density and equivalent temperature density ................... 12
Figure 7. Simplified processor pipeline and stages ........................... 23
Figure 8. CAM array for the wakeup logic in the issue queue .................. 25
Figure 9. Circuit diagram of a CAM cell for one bit position of the wakeup logic .. 27
Figure 10. Classification of power reduction techniques for the out-of-order engine .. 28
Figure 11. Active versus no occupancy reduction for an abstract queue ........ 29
Figure 12. Associative complexity reduction techniques ....................... 34
Figure 13. Block diagram for the first-use indexing .......................... 36
Figure 14. Array representation ............................................... 37
Figure 15. Using an in-order issue queue with latency scheduling ............. 39
Figure 16. Schematic diagram of word-lines and bit-lines for a circuit array structure .. 42
Figure 17. Performance as a function of window size and data dependence level .. 47
Figure 18. Trading window size for performance at different data dependence levels .. 47
Figure 19. Downsizing with active occupancy reduction ........................ 48
Figure 20. Traffic burst effect on occupancy ................................. 49
Figure 21. Data dependence graphs ............................................. 51
Figure 22. Counter-intuitive dependence graphs ................................ 51
Figure 23. Token passing logic in the issue queue ............................. 53
Figure 24. Token passing at rename stage ...................................... 56
Figure 25. DDI during two randomly selected periods ........................... 57
Figure 26. A graphical representation for the DDI policy ...................... 58
Figure 27. Fetch pause based on the fetch buffer occupancy and the sustained dispatch rate .. 61
Figure 28. Reduction in instruction occupancy for the FB, the IQ, the LSQ, and the ROB .. 67
Figure 29. Percentage of time for the number of active segments ............... 68
Figure 30. The average instruction fetch rate, which is reduced the most in M5 .. 69
Figure 31. Average number of instructions per fetch, which is highest in M5 ... 70
Figure 32. Percentage of fetches by the number of instructions they carry ..... 70
Figure 33. Average fetch event rate, which is also lowest in M5 ............... 71
Figure 34. Total power savings for components broken down by methods .......... 72
Figure 35. Total power savings for methods broken down by component ........... 72

1 INTRODUCTION: REDUCING PROCESSOR POWER CONSUMPTION

1.1 Motivation

Steadily increasing with each technology generation, the power consumption of processor chips has become so high that it limits the performance of new designs. High power consumption requires expensive heat-sinking and thermal management, and it leads to high currents and chip hot spots that reduce reliability [1, 2]. At the same time, data centers stacked with high-power processors have costly cooling systems, high energy consumption, and greater carbon emissions. In fact, processor power consumption has been declared a grand challenge, and it is a problem for which even small savings are considered important gains [3].

The processor power consumption challenge is relatively new in computing. In 1997, when billion-transistor chips were becoming feasible, an IEEE Computer Magazine issue surveyed new processor challenges [4]. None of the authors in the issue predicted that power consumption would become a challenge. By 2004, however, power consumption was already becoming a challenge to higher performance.
A 2004 Computer Magazine retrospective issue discussed how the challenge of power consumption was missed in the 1997 predictions [5]. By 2005, the International Technology Roadmap for Semiconductors (ITRS) had declared processor power consumption a "grand challenge" for processor design [3].

When processor power consumption increases, more elaborate heat-sinking and thermal management are required to remove the generated heat in order to keep the chip temperature below its operating limit. Moreover, above a certain power consumption threshold, air-cooling techniques cannot keep the chip below that temperature limit. Resorting to alternative cooling techniques, such as liquid cooling, can result in a sharp increase in the cost of chip package integration [1, 6].

Another aspect of this challenge is environmental. In 2008, the number of PCs in use (as opposed to the number shipped) in the world was estimated at more than one billion units [7]. The Hoover Dam power plant is rated at 4 mega-horsepower [8]. If we conservatively assume a processor chip consumes on average 75 Watts, then powering the world's PCs requires at least 33 power plants like Hoover Dam! (Actually, many more than 33 if we consider each system's power consumption and not just the processor's.) Moreover, the number of PCs in the world is expected to double by the year 2014 [7]. With the rise in processor power consumption, this growth rate may not be sustainable.

This thesis details microarchitecture techniques to reduce the average power consumption of a processor chip. The techniques presented reduce instruction occupancy in the out-of-order engine components, creating empty segments that can be dynamically shut down to save power. Instruction occupancy is reduced by dynamically downsizing the out-of-order engine components and by reducing instruction traffic. The same techniques used to reduce traffic also cluster more instructions into a single i-cache fetch, thus reducing i-cache power consumption as well.

1.2 Thesis Summary

This thesis is divided into six chapters. The problem of increased processor power consumption is introduced in Chapter 1. To show the main sources of power consumption, a basic CMOS circuit design is given in Section 1.3. Factors behind the increasing trend in dynamic and static power consumption, as well as the increase in power density, are discussed in Sections 1.3–1.5. A summary of the proposed microarchitecture techniques to reduce processor power consumption is given in Section 1.6. To help establish a context for the proposed techniques, a brief review and a classification of previous related research are also presented in that section. The principles for downsizing the out-of-order engine and for reducing instruction traffic are discussed there, and the contribution section also presents a summary of the methods that implement those two principles.

A more extensive review of related work on processor power reduction techniques is presented in Chapter 2. A general taxonomy of these techniques is presented in Section 2.1, starting at the device level and moving all the way up to the software level. A brief review of the circuit complexity of the out-of-order engine components is presented in Section 2.2. An in-depth review of power reduction techniques for the out-of-order engine is presented in Section 2.3, where the techniques are classified, based on a new definition of occupancy reduction, as either neutral or active.
Active occupancy reduction techniques are further subdivided into downsizing or traffic reduction techniques. A review of complexity reduction techniques for the out-of-order engine is given in Section 2.4. Complexity reduction techniques are intended to increase performance rather than reduce power consumption; nonetheless, including them here offers a more complete review. Chapter 2 ends in Section 2.5 with a review of power modeling at the microarchitecture level.

The methods used to implement the two proposed principles for reducing instruction occupancy are presented in Chapter 3. The principles for downsizing the out-of-order engine and for reducing traffic are restated in Section 3.1. The method developed to implement the principle of downsizing the out-of-order engine is detailed in Section 3.2. The three developed variations of the method for reducing instruction traffic are detailed in Section 3.3. One variation of the traffic reduction method exhibits significantly better clustering of fetched instructions from the i-cache; this effect and its benefit for reducing i-cache power are detailed in Section 3.4. The framework used for the architecture models, the power models, and the simulation workload is presented in Section 3.5.

Simulation results are reported in Chapter 4. First, the reductions in instruction occupancy for all out-of-order engine components using all methods are reported in Section 4.1. The results include average occupancy as well as time distributions for the number of active segments. Traffic reduction results are reported in Section 4.2. In Section 4.3, the results for improved clustering of fetched instructions from the i-cache are presented. Dynamic power savings results are reported in Section 4.4. In order to show the effect of components on the processor's total dynamic power savings, both component and total dynamic power savings, including level-1 caches, are reported. The effects of all methods on performance are reported in Section 4.5.

Validation of the principles and their implementation is discussed in Chapter 5. The significance of the occupancy reduction results is also discussed in Chapter 5. Finally, conclusions and future research are presented in Chapter 6. The future research discussion includes using the proposed principles and methods for further power efficiency, such as reducing static power in addition to dynamic power. It also includes using the proposed methods to improve scheduling and throughput on heterogeneous multi-core processor systems.

1.3 Dynamic Power Consumption

1.3.1 CMOS VLSI Design

To help understand the rising trend in power consumption, we first examine processor circuit design. The dominant VLSI technology for microprocessor integrated circuits has been Complementary Metal Oxide Silicon (CMOS) [9]. In CMOS, two types of transistors (also called devices) are created: an n-type device (nMOS) and a p-type device (pMOS). In general, starting with a silicon wafer, CMOS fabrication involves successive steps of chemical processing to create sandwich-like layers with different conductive properties. For example, diffusion of electron-donor or -acceptor impurities is used to create positively or negatively charged silicon. Oxidation is used to create insulating layers of SiO2. Deposition and etching are used to connect devices with poly-silicon or aluminum, much like the traces that connect circuit board components. Chemical etching is guided by mask patterns and lithography techniques.
A layout and cross section of an n-well CMOS process for an inverter gate is shown in Figure 1 [9]. In an n-well process, p-devices are created within an n-type well formed in the p-substrate.

[Figure 1. CMOS process for an inverter gate: (a) logic circuit diagram, (b) layout top view, (c) abstract cross-section view, and (d) actual cross-section view [9]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.]

The n+ and p+ regions serve as the current source and drain for the n- and p-devices. Their polarity is interchangeable, based on the direction of current. The voltage applied to the polysilicon gate above the gate oxide controls the conduction of current between the source and the drain. When a high voltage (logic 1) is applied to the inverter input, the p-device is off, and the n-device conducts current to ground, exerting logic 0 at the output. When a low voltage is applied to the input, the n-device is off, and the p-device conducts current from the voltage supply, exerting logic 1 at the output.

The complementary use of n- and p-devices allows solid sourcing and sinking of current from the power supply and to ground, respectively. This maintains reliable signal propagation against noise and is one advantage of CMOS. Another advantage that CMOS had when it was introduced was lower power consumption than that of other VLSI technologies [9]. Because of the high input impedance at CMOS device gates, a negligible current is drawn by a gate's input when it is not switching, which saves power. However, CMOS circuits still consume power in two ways. One is dynamic power dissipation to charge and discharge the output capacitive load when the input changes. The second is static power dissipation due to substrate leakage current.

1.3.2 Estimating Dynamic Power

When the input logic changes from '1' to '0' for an inverter gate (Figure 2), the n-device is turned off, and the p-device is turned on and has to charge the capacitive load, $C_L$. The capacitive load represents the fan-out capacitive load of later logic stages. When the inverter input changes from '0' to '1', the p-device is turned off, and the n-device is turned on and has to discharge $C_L$. This charging and discharging consumes dynamic power, which constitutes the largest portion of power consumption for CMOS.

[Figure 2. Logic circuit diagram for an inverter gate with a capacitive load, $C_L$, and repetitive step input, $V_{in}$.]

To calculate the average dynamic power for an inverter gate, we assume a repetitive step input, such as a clock. If the input frequency is $f_p = 1/t_p$, then the average power consumed can be calculated by integrating the product of current and voltage over the period $t_p$. This integral can be broken into two integrals: the first integrates the current through the n-device and the voltage across it over the first half period ($0$ to $t_p/2$), and the second integrates the current through the p-device and the voltage across it over the second half period ($t_p/2$ to $t_p$).
This is expressed as:

$$P_{avg} = \frac{1}{t_p}\left[\int_{0}^{t_p/2} i_n(t)\,v_{out}(t)\,dt + \int_{t_p/2}^{t_p} i_p(t)\,\big(V_{DD} - v_{out}(t)\big)\,dt\right] \qquad (1)$$

where $i_n(t)$ is the n-device transient current and $i_p(t)$ is the p-device transient current. Substituting the current values and changing the variable of integration from time to voltage, the following is obtained:

$$P_{avg} = C_L\,V_{DD}^2\,f_p \qquad (2)$$

One observation that can be made from equation (2) is that the average dynamic power scales linearly with capacitive load and input frequency, and scales quadratically with the supply voltage. Another observation is that power consumption is independent of device parameters.

The average dynamic power was calculated above for a single inverter with a repetitive step input switching charge on a load capacitance. For simulating a system with a large number of gates (e.g., thousands or millions) with non-repetitive inputs, the above calculations are too complex and impractical to carry out for each gate. A faster estimate can be obtained by using the total capacitance, $C_{total}$, for all gates in the circuit and the fraction of time, $a$, during which they are active:

$$P_{avg} \approx a\,C_{total}\,V_{DD}^2\,f \qquad (3)$$

where $f$ is the clock frequency. The accuracy of equation (3) depends on the accuracy of the active time fraction, which is only an approximation. A more accurate power estimate can be obtained during simulation by summing up capacitances only for components that switch in any given cycle. To estimate the power consumed in running a program on a processor, for example, the total switched capacitance, $C_{switched}$, can be summed over all switching components during simulation. At the end of simulation, the total switched capacitance can be used along with the total number of cycles to run the program, $N$, and the cycle time, $t_{cycle}$, to estimate power consumption:

$$P_{dynamic} = \frac{C_{switched}\,V_{DD}^2}{N\,t_{cycle}} \qquad (4)$$

Equation (4) can be used to show how different processor design parameters affect dynamic power consumption, $P_{dynamic}$, as will be discussed in the next section.

1.3.3 Increasing Trend

Microarchitecture techniques that increase performance generally require physically large units (typically arrays) due to a high device count. The high device count and the long transmission lines within the arrays quickly add to $C_{switched}$ in equation (4). Capacitive cross-coupling among transmission lines also quickly adds to $C_{switched}$, especially with new generations of fabrication technology, due to the closer physical proximity of devices. Achieving higher performance also requires decreasing the total number of cycles, $N$, as well as the cycle time, $t_{cycle}$, in order to decrease runtime, which also rapidly increases $P_{dynamic}$. Combining all of these variable changes results in a superlinear increase in $P_{dynamic}$.

Figure 3 shows how clock frequency is increasing (or how $t_{cycle}$ is decreasing) across Intel processor generations [10].

[Figure 3. Relative increase in clock frequency for Intel processors, from the 286 through the 386, 486, P5, P6, and P4P (data source: Hinton et al. [10]).]

Reducing the supply voltage, $V_{DD}$, can reduce dynamic power quadratically. However, reducing $V_{DD}$ also requires reducing the threshold voltage, which leads to an exponential increase in leakage current and static power consumption. This will offset any decrease in dynamic power. In fact, $V_{DD}$ had been scaling down about 30% per technology generation, but it has now bottomed out because of this offsetting effect [11]. Even if $V_{DD}$ continued to scale down by 15% per generation, clock frequency is increasing simultaneously by 40% [11]. At this rate of clock frequency increase, $P_{dynamic}$ can reach prohibitive levels.
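Before examining these trends further, the per-cycle accounting behind equations (3) and (4) can be made concrete with a minimal Python sketch. It is an illustration only: the component names and capacitance values are hypothetical, and this is not the Wattch implementation used later in this thesis.

```python
# Minimal sketch of equation (4): accumulate switched capacitance per
# simulated cycle, then divide the total switching energy by the runtime.
# Component names and capacitance values are hypothetical illustrations.

VDD = 1.2            # supply voltage (V)
T_CYCLE = 1.0 / 3e9  # cycle time for a 3 GHz clock (s)

# Effective switched capacitance per access, per component (F).
CAP = {"issue_queue": 8e-12, "rob": 6e-12, "icache": 20e-12}

def dynamic_power(trace):
    """trace: one set of switching components per simulated cycle.

    Returns P = (C_switched * VDD^2) / (N * t_cycle), i.e., total
    switching energy divided by total runtime, as in equation (4).
    """
    c_switched = 0.0
    n_cycles = 0
    for active in trace:
        c_switched += sum(CAP[c] for c in active)
        n_cycles += 1
    energy = c_switched * VDD ** 2        # joules
    return energy / (n_cycles * T_CYCLE)  # watts

# Four example cycles; an empty set models a fully idle (gated) cycle.
trace = [{"icache", "issue_queue"}, {"issue_queue"}, set(), {"rob", "icache"}]
print(f"average dynamic power: {dynamic_power(trace):.2f} W")
```

The sketch makes the leverage point of equation (4) visible: power falls when fewer components switch per cycle (smaller $C_{switched}$), which is exactly what disabling empty segments accomplishes.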
Figure 4 shows the trend for dynamic power for Intel processors as a function of technology generation.

[Figure 4. The rise in dynamic power consumption with technology generations for Intel processors (data source: Borkar [11]).]

Some recent processors fabricated in 32 nm technology and running at over 3 GHz clock frequency, such as some of the Intel Xeon 5000 series processors, consume well over 100 Watts [12].

1.4 Static Power Consumption

Fundamentally, static power consumption is due to leakage current at sub-threshold voltage in reverse-biased parasitic diodes. These parasitic diodes, which are inherent in silicon CMOS, are formed between diffusion regions and the substrate [9]. Unlike dynamic power, static power is always dissipated, whether devices are switching or not. The total static power, $P_{static}$, dissipated for a processor is independent of time, and it is summed over all devices as:

$$P_{static} = V_{DD} \sum_{i=1}^{D} I_{leak,i} \qquad (5)$$

where $I_{leak,i}$ is the leakage current of device $i$ and $D$ is the total number of devices.

Static power consumption, just like dynamic power, has also shown an increasing trend. The reason for this increasing trend is twofold. First, scaling down $V_{DD}$ to reduce dynamic power has the undesired side effect of increasing static power: as $V_{DD}$ is reduced, the threshold voltage also has to be reduced, and the leakage current of the parasitic reverse-biased diodes increases exponentially with the decreasing threshold voltage. With Intel processors, for example, a continued decrease in supply voltage for fabrication technologies of less than 90 nm feature size would actually result in static power accounting for more than 50% of total chip power dissipation. This trend is shown in Figure 5.

[Figure 5. Percentage of total chip power consumption due to leakage current as a function of technology feature size (data source: Borkar [11] and Grove [13]).]

The second reason for the increase in static power is the growth in the number of devices on processor chips. With the doubling of devices per technology generation, the sum of all the leakage currents of these devices, as indicated in equation (5), is quickly reaching a critical level.

1.5 Power Density

Increasing the number of devices on a chip and increasing the clock frequency with every technology generation increase both the total chip power consumption and the power density per unit area. As more heat is generated by more devices crammed into each unit of area, the heat can detrimentally affect the operating temperature in certain areas of the chip. Figure 6 shows the trend over time of the power density and equivalent temperature density for Intel processors [13]. These heat densities, approaching that of a rocket nozzle, cannot be sustained on a silicon chip under normal operation.

[Figure 6. Power density for Intel processor generations over the years and equivalent temperature density (data source: A. Grove [13]).]

Since processor chip power dissipation is not spatially or temporally uniform, localized heating areas, in the form of hotspots, occur on a shorter time scale than chip-wide heating. The time scale for these hotspots is on the order of hundreds of microseconds to milliseconds. However, the rapid increase in power density is such that the temperature at these hotspots can cause timing errors or even physical damage [2, 14].

1.6 Thesis Contribution

1.6.1 Current Occupancy Reduction Techniques

The out-of-order execution engine consumes a large proportion of the power used by a chip. As a result, dynamically shutting down empty segments in its components can reduce power.
Shutting down segments that are normally empty, with no additional attempt to reduce instruction occupancy, can be classified as neutral occupancy reduction [15, 16]. In contrast, microarchitecture techniques that cause components to reduce instruction occupancy based on dynamic thread behavior can be classified as active occupancy reduction. As a thread executes, its code behavior and its demands on different components change. This dynamic creates opportunities to actively reduce occupancy, creating additional empty segments. The challenge is to dynamically detect when such behavioral changes occur, so as to determine when a component's occupancy can be reduced to save energy and when it should be increased to protect performance.

Active occupancy reduction offers more attractive power savings than neutral occupancy reduction does [17]. However, current approaches have yet to fully realize that potential. One approach to active occupancy reduction is downsizing triggered by changes in a general performance metric such as the IPC (instructions committed per cycle) [18, 19]. However, the IPC is only a symptom of performance determined by many behaviors, and it can itself be penalized by downsizing. To minimize the penalty, a downsized component must periodically be upsized to measure any effect on the IPC. This trial-and-error approach requires long "tuning" periods (on the order of hundreds of thousands to millions of cycles) in order to determine the optimal size for a component, especially one that has many size configurations. With often short-lived execution behaviors, this approach can miss power saving opportunities [20].

A second approach to active occupancy reduction is reducing traffic into the out-of-order engine, for example by monitoring behavioral changes in branch prediction and throttling the fetch when the front-end encounters highly mispredicted branches [21]. However, reducing traffic alone will not always reduce occupancy. A queue's occupancy might be reduced when traffic going into it is reduced, but if the queue's service rate is also reduced, the queue's steady state remains at high occupancy. Moreover, relying solely on one behavior, such as branch prediction, does not take advantage of other behavioral changes.

1.6.2 New Principles for Active Occupancy Reduction

Because both downsizing and reducing traffic can reduce occupancy, we propose an approach that combines them, while overcoming some of their individual limitations. Specifically, we propose improvements to downsizing and traffic reduction based on two principles:

Principle 1: Gradually downsize the out-of-order engine as it becomes less effective in issuing instructions out of program order. There are two states in which queues in the out-of-order engine become less effective. One occurs when instructions line up in a data dependence chain. The more instructions line up, the more rapidly performance becomes insensitive to instruction window size. That insensitivity can be exploited by setting a smaller window that yields the same performance. The second ineffective state is the opposite, when instructions become virtually independent. Then they need much less (if any) out-of-order movement, making large queues energy inefficient. In both states, components can be forced to reduce their occupancy by downsizing. Components should maintain maximum size when the engine holds a roughly equal mix of independent and dependent instructions, in order to allow out-of-order issue.
The sizing decision must be based solely on the state of data dependences; otherwise, the engine's ability to find and issue independent instructions might be hindered.

Principle 2: Continuously moderate the front-end pressure to reduce instruction traffic. The front end is normally designed to fetch instructions from the i-cache and push them into the out-of-order engine as fast as possible. However, the engine may not always be able to move instructions as fast as they come in. It might be slowed down, for example, by data cache misses, function unit latencies, and resource conflicts. When that happens, the engine is under constant front-end traffic pressure to raise its occupancy. By moderating front-end pressure, traffic into the out-of-order engine can be reduced.

1.6.3 Implementation of Principles

This thesis reports two methods implementing these two principles in order to reduce occupancy in all of the out-of-order engine components. The first downsizes components based on a newly developed data dependence index (DDI). The DDI is a single number that expresses how well instructions line up in a data dependence chain. Previous work has shown that a metric based on data dependences can by itself be used to reduce occupancy in the out-of-order engine [22]. Here, a new algorithm that uses token passing in the issue queue to measure the DDI is reported. The algorithm has the advantages of low hardware complexity and of not requiring periodic upsizing to keep components at their most efficient size.

The second method reduces traffic by using the fetch buffer to moderate the front-end pressure, matching it to the currently sustained dispatch rate (SDR). The SDR is the average dispatch rate that the engine can sustain over a short interval. This moderation dampens traffic bursts above the SDR, thereby reducing average traffic. In addition to reducing traffic, this method provides an energy saving bonus for the i-cache. Because it drains the fetch buffer before fetching is resumed, this method can cluster significantly more instructions into each i-cache access. Because the energy cost of an i-cache access is the same regardless of how many instructions are fetched, this method significantly reduces i-cache power consumption.

The two methods operate independently but in tandem, matching resources to behavior changes that are both microarchitecture independent and microarchitecture dependent. The downsizing method responds to data dependence changes, which are microarchitecture independent. The fetch rate moderation method responds to changes that are microarchitecture dependent. Similar matching is desired in heterogeneous multi-core processing (HMP), where the objective is to match an inherent thread behavior to one of many dissimilar processors. HMP research has found that techniques that directly measure inherent thread behavior changes are more effective than ones relying on a general performance metric such as the IPC [23]. These findings directly support our findings in the power efficiency domain.

To evaluate the methods' power efficiency, the Wattch power models were used [24]. These power models can estimate dynamic power consumption for the SimpleScalar architecture [25]. For a realistic estimate of performance, occupancy, and power consumption, the SPEC CPU 2000 benchmark suite from the University of Minnesota was simulated as the workload [26].
This benchmark suite is derived from SPEC CPU 2000 and is recognized by SPEC as a valid tool for simulation-based computer architecture research. The suite uses statistical sampling to maintain the function-level execution patterns, instruction mix, and cache behaviors of the full benchmarks. The version of the suite compatible with the "pisa" SimpleScalar architecture was used. The suite contains a mix of compression, graphics, scientific, and database benchmarks.

Power savings per component, as well as total processor savings in dynamic power including level-1 caches, are reported. It is asserted that a meaningful quantitative analysis of efficiency should consider both component and total processor savings in dynamic power, in order to show the relative effect of components on the total savings.

2 BACKGROUND & RELATED WORK

This chapter reviews related methods and research on reducing processor power consumption. Section 2.1 reviews the design levels into which current methods can be grouped, from the device level up to the compiler level. Since the focus of this thesis is on the out-of-order engine, Section 2.2 briefly reviews the function of the out-of-order issue in a typical multi-issue pipeline and its circuit complexity. Section 2.3 then examines power reduction techniques specific to the out-of-order engine.

The out-of-order engine techniques reviewed in Section 2.3 are classified based on their approach. This taxonomic classification helps emphasize common strengths that should be kept and common weaknesses that should be addressed in a new design. The classification also helps define a new method that combines the strengths of different classes. The first level of classification of out-of-order techniques is based on whether they downsize components. Those techniques that downsize components are further classified by whether they are active or neutral regarding occupancy reduction. Active occupancy reduction techniques are further subdivided by whether they use the IPC or traffic reduction. Section 2.3 also discusses the need for a tandem technique that draws on the strengths of the active occupancy reduction techniques while addressing their weaknesses.

Section 2.4 reviews complexity reduction techniques for the out-of-order issue. These techniques' objective is performance rather than power efficiency; since this thesis's work deals with out-of-order issue design, they are reviewed here for completeness.

Finally, Section 2.5 reviews power models at the microarchitecture level, specifically the Wattch power models. These models are critical for a quantitative analysis of efficiency that considers both component and total processor power savings. Obtaining the total power consumption helps show the relative effect of components on the total savings.

2.1 Processor Power Reduction Levels

2.1.1 Device Level

The minimum threshold and supply voltages for a CMOS device can be lowered by reducing the device size, which in turn helps reduce chip dynamic power consumption. Pursuing smaller device sizes through successive silicon fabrication technologies has helped curb the increase in power consumption, but this approach is running into limitations [27]. One such limitation is that lowering the threshold voltage leads to an exponential increase in device leakage current and hence an exponential increase in chip static power consumption [9]. For feature sizes less than 90 nm, the increase in static power will offset any decrease in dynamic power consumption (Section 1.3.3).
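For reference, this exponential sensitivity can be made explicit with the standard first-order sub-threshold leakage model (a textbook approximation added here for illustration; it is not reproduced from this dissertation):

$$I_{leak} \propto e^{\,(V_{GS} - V_t)/(n\,v_T)}, \qquad v_T = kT/q \approx 26\ \text{mV at } 300\ \text{K}$$

where $V_t$ is the threshold voltage and $n$ is a process-dependent factor, typically between 1 and 2. Under this model, each reduction of $V_t$ by $n\,v_T \ln 10$ (roughly 60 to 100 mV) increases the sub-threshold leakage current about tenfold, which is why aggressive threshold scaling quickly erodes the dynamic power savings of a lower supply voltage.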
Silicon on Insulator (SOI) fabrication technology was proposed to reduce leakage current; however, SOI is subject to other limitations, including a higher cost of fabrication than bulk CMOS [9]. Even if leakage current is reduced for feature sizes less than 90 nm, CMOS fabrication technologies are not thought capable of producing feature sizes less than 10 nm [28].

A radical solution to the limitations of shrinking device sizes is finding alternative materials and devices to those of CMOS. Advances in nanotechnology are helping find potential devices. One such device is the memristor, which is basically a resistor with memory [29]. It can be used as a memory storage device; it is denser and faster than flash memory and uses much less power. While it has been shown to be feasible for memory storage, the memristor still requires much more development for use in general logic or processor circuits [28]. Another explored alternative for implementing a logic switch is nano-scale germanium-silicon wires that can be used in "nano-processor" tiles [30]. Researchers have also looked at switching a material state from crystalline to amorphous using electric current, in what are called phase-change memories [31]. Using carbon nano-tubes is yet another alternative being pursued as a logic switch [28]. As no clear alternative device has emerged yet, the challenge remains not just the feasibility of these devices, but also the cost of manufacturing them.

2.1.2 Circuit Level

Circuit techniques to reduce power consumption involve reducing the total switched capacitive load of devices and their connections. This basically amounts to reducing $C_{switched}$ in equation (4) of Section 1.3.2. The capacitive load can be reduced either statically at design-time or dynamically at run-time.

To reduce the capacitive load at design time, the geometries of some devices can be reduced to lower their capacitances. Some devices must still have large geometries so that they can source or sink enough current to ensure fast switching speed. The challenge is finding those devices on the chip whose geometries can be reduced without sacrificing too much performance. A power-delay product curve can be used in such a case to trade off as little performance as possible for significant power savings [1]. Cell layout libraries that are used repeatedly on the chip can be targeted with this approach.

Circuit design algorithms can also be used at design time to minimize the switched capacitance. During the synthesis phase in a computer-aided design (CAD) tool, these algorithms can optimize logic gate connections for minimum switching activity. Unfortunately, only 10% of total processor power is from synthesized logic that can be optimized by CAD tools [1].

To reduce the capacitive load at run-time, empty units or segments of units can be disabled to avoid wasteful switching activity. Extra control logic can detect when units or segments become empty and disable clocking or signal propagation. One challenge for this approach is finding the optimal granularity for the units or segments to be disabled. If the granularity is too coarse, a unit would rarely be disabled, because some part of it, even a small one, would almost always be in use. If the granularity is too fine, the power overhead of the many control gates would offset the power savings. In addition, at a fine granularity, the large number of control gates can create clock skew or glitches that can be difficult to mitigate [1].
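This granularity tradeoff can be illustrated with a minimal Python sketch. All constants are hypothetical illustration values, and the model simplifies by assuming occupants are packed into contiguous segments:

```python
import math

# Minimal sketch of the segment-granularity tradeoff: enabled segments
# burn power for every entry they contain, while every segment, enabled
# or not, pays a fixed overhead for its gating logic. All constants are
# hypothetical illustration values.

QUEUE_SIZE = 64      # total queue entries
ENTRY_POWER = 1.0    # relative dynamic power per enabled entry
CTRL_OVERHEAD = 2.0  # relative power of the gating logic per segment

def queue_power(occupancy, seg_size):
    n_segments = QUEUE_SIZE // seg_size
    active = math.ceil(occupancy / seg_size)  # segments holding instructions
    return active * seg_size * ENTRY_POWER + n_segments * CTRL_OVERHEAD

# Sweep the granularity at a low occupancy: coarse segments rarely turn
# off, while very fine segments drown the savings in control overhead.
for seg_size in (64, 32, 16, 8, 4, 2):
    print(f"segment size {seg_size:2d}: relative power {queue_power(10, seg_size):5.1f}")
```

With these made-up constants, the relative power bottoms out at an intermediate segment size, matching the qualitative argument above: neither the coarsest nor the finest granularity is optimal.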
The issue queue in the out-of-order engine, for example, can be segmented, and control logic can then disable segments whenever they become empty. Because the issue queue is a "power-hungry" component on the chip, this segmentation is an effective power saving scheme. Ponomarev et al. proposed such a scheme by segmenting the bit-lines of the issue queue and using a row decoder to activate only those segments that contain accessed instructions [16].

From a power efficiency perspective, asynchronous circuits are the most efficient. Without a global clock, asynchronous units process data only when it is available, and thus circuit activity is never wasteful. However, the challenge for asynchronous circuits has been their slow performance. Novel techniques to improve their performance, such as interlocked pipelined CMOS [32], have been proposed. But even with these improvements, the performance of asynchronous circuits still lags behind that of synchronous circuits.

2.1.3 Microarchitecture Level

At the microarchitecture level, the functional organization between and within processor units can be designed to maximize efficiency with little or no impact on performance. The design for efficiency can be either static or adaptive.

Static Design: In a static design, the microarchitecture parameters are fixed at design time so that both power efficiency and performance are optimized. The design can be guided by an optimization metric that includes both performance and power. One such metric is the energy-delay product. For a processor, the delay can be the cycles per instruction (CPI) and the energy can be the power consumption in Watts. Other metrics derived from the energy-delay product can emphasize either the delay or the energy aspect more heavily. For example, the energy $\times$ CPI$^3$ metric emphasizes optimizing the delay aspect over power because the CPI is raised to the third power.

In a static design, the energy-delay product metric for a superscalar processor can be used to determine the tradeoff between power consumption and performance as cache and core sizes are varied [6]. An optimal size can then be chosen that minimizes power consumption at an acceptable performance level. The energy-delay product has also been used to show how simultaneous multi-threading and chip multi-processor designs can scale up better, in terms of throughput and power consumption, than a single wide-issue processor [6].

Adaptive Design: In an adaptive microarchitecture design, the processor's functional organization is made to adapt to certain conditions at run-time to reduce power consumption. Many of the functional components in a superscalar processor are designed to provide high performance in the long run. However, as a thread executes, its code and data behavior change, resulting in varying demands on these components. If one or more components become a performance bottleneck, other components can become temporarily underutilized, thus wasting energy. The goal of adaptive design is to take advantage of such situations and disable unused components, or portions of them, to save power. The design also has to detect when the thread behavior changes again and disabled components are needed back in service; the disabled components should then be quickly re-enabled for optimal performance.

The challenge for adaptive designs is finding indicators of behavioral change in executing threads to trigger adaptation. The choice of indicator also depends on the targeted component. For example, Balasubramonian et al.
propose using a combination of cache miss rate, instruction commit rate, and branch frequency as an indicator to trigger reconfiguration of the cache hierarchy [33]. By dynamically reallocating different amounts of cache memory to different levels, they claim to improve cache performance and reduce cache power consumption.

This thesis proposes microarchitecture techniques to reduce power consumption in the out-of-order engine's components (issue queue, reorder buffer, load-store queue, and connecting buses). It uses two indicators to reduce instruction occupancy in the out-of-order engine. The data dependence index (DDI) is one indicator, used to downsize the out-of-order engine. The sustained dispatch rate (SDR) is a second indicator, used to reduce instruction traffic bursts into the out-of-order engine. Both indicators work in tandem to adapt the out-of-order engine to behaviors that are microarchitecture dependent and microarchitecture independent: the data dependence index is microarchitecture independent, and the sustained dispatch rate is microarchitecture dependent. This tandem approach results in a very effective reduction in instruction occupancy, which leads to more disabled segments and greater reductions in power consumption. A brief review of other adaptive designs for the out-of-order engine was given in Section 1.6.1. A more detailed review of adaptive designs for the out-of-order engine is given in the rest of this chapter.

Temperature Aware Design: Techniques for reducing processor power consumption reduce power either on average or in the long run. They may not reduce transient spikes in power consumption that result in hotspots on the chip. These hotspots tend to vary temporally and spatially, causing steep temperature gradients that can compromise chip reliability. Microarchitecture techniques can be used to moderate occurrences of chip hotspots using workload and instruction-level-parallelism indicators [2, 14].

2.1.4 System Level

Power savings can be achieved by considering the processor and its interaction with memory, I/O devices, and peripherals. While the processor waits for memory to retrieve data, its idle time can trigger transitions into one of many possible low power modes. Memory and I/O devices can transition into low power modes as well when not in use. A system power management control is responsible for detecting idle components and triggering their transitions into, or out of, lower power states. Dynamic voltage and frequency scaling (DVFS) techniques can be used to transition the processor into low power modes [34]. The Advanced Configuration & Power Interface (ACPI) is a good example of a system power management standard that has been widely adopted by the industry [35].

The operating system can also be part of system power management. For example, the OS can help detect process or user inactivity and force the system into a low power mode.

2.1.5 Compiler Level

A compiler can assign different power levels to a code section based on the section's behavior. A compiler technique can be either static or dynamic. A static technique uses offline profiling analysis to determine code behavior. Dynamic compilation techniques can capture program behavior dynamically by inserting an extra execution layer between the application binaries and the lower OS and hardware levels.
Dynamic techniques have the advantage of capturing program behavior related to the run-time environment, such as instruction and data cache performance and branch prediction, that would not be visible to static techniques. Dynamic techniques, however, have the disadvantage of potentially slowing execution if the code analysis is not fast or simple enough. Both static and dynamic techniques can trigger processor low power modes through dynamic voltage and frequency scaling (DVFS). Wu et al., for example, propose to reduce processor power consumption by using DVFS and dynamic compilation to detect idle CPU cycles during memory operations [34].

2.2 Complexity of the Out-of-Order Issue

In order to improve single-thread performance, microarchitecture techniques try to exploit instruction-level parallelism (ILP) to reduce execution time. The out-of-order engine helps exploit ILP by allowing instructions to bypass each other in the pipeline. If an executing instruction stalls, other independent instructions can bypass it, keeping the instruction pipeline flowing. Unfortunately, this ILP exploitation does not scale linearly with the out-of-order engine's resources: exploiting a small amount of additional ILP requires a much larger increase in the number of entries in the engine [36, 37]. This non-linearity between ILP and the number of entries is compounded by high circuit complexity. As will be shown later in this section, the out-of-order issue does indeed have high circuit complexity, which impacts not only performance but also power consumption. Out-of-order engine components tend to be among the highest power consumers on the chip [15, 16, 38]. They also tend to suffer the most intense hotspots [2, 14].

This section briefly reviews the function of the different out-of-order engine components within the pipeline. It then qualitatively analyzes the circuit complexity of the out-of-order issue, which is critical in terms of power consumption (for a quantitative analysis of out-of-order issue complexity, refer to [39]). This analysis provides a context for reviewing previously reported techniques (presented in later sections) that reduce power in some of the engine's components. Similarly, the analysis provides a context for the adaptive techniques presented in this thesis, which reduce power consumption in all of the engine's components.

2.2.1 Out-of-Order Pipeline Flow

To analyze the circuit complexity of the out-of-order issue, a brief review of the logic function of its different components in the pipeline is given first.

Fetching, Decoding, and Renaming Stages: In each cycle, the pipeline starts by fetching a group of instructions from the instruction cache and then decoding them. A standard pipeline is depicted in Figure 7.

[Figure 7. Simplified processor pipeline and stages: Fetch, Decode, Rename, Issue, Register Read, Execute, Bypass, and Retire.]

For a branch instruction, the branch predictor determines whether the branch is taken and what the target address is. The compiler usually has access only to the logical registers made available by the instruction set. This limitation introduces dependences among destination registers (write-after-write dependences) in a thread. A particular microarchitecture alleviates these dependences by offering many more physical registers, allowing multiple instructions with the same destination register to proceed in the pipeline. The rename stage, following fetch and decode, maps logical destination registers to available physical registers. Source register operands are also mapped, and their status is flagged in one of two ways. If the source register already contains the desired result, then the source operand's status is flagged as ready. Otherwise, the status is flagged as not
Source register operands are also mapped, and their status is flagged in one of two ways. If the source register already contains the desired result, then the source operand status is flagged as ready. Otherwise, the status is flagged as not ready, because the desired result will be produced by an instruction yet to be executed (an "in-flight" instruction). Renamed instructions are then dispatched in program order into the issue queue and the reorder buffer.

Out-of-Order Issue, Execute, and Retire Stages: Instructions that depend on results from producer instruction(s) still in flight are called consumer instructions. Since the relation between a producer and a consumer instruction is not determined a priori, the result tag of an in-flight instruction is broadcast after execution to all instructions in the issue queue. Every instruction in the issue queue then tries to match the result tag against its source operand tags. If matched, that source operand is marked as ready. When both operands for a consumer instruction are ready, and the function unit needed for its execution is available, the select logic can issue this instruction to execute out of program order. When an instruction is issued for execution, its entry in the issue queue is freed. The broadcasting of result tags, the matching of them against operand tags, and the marking of operands as ready are collectively referred to as the wakeup logic. The bypass logic can also pass result values directly from a producer instruction executed in the last cycle, so that a producer and a consumer instruction can execute in back-to-back cycles, avoiding pipeline bubbles. Sometime after an instruction executes, the reorder buffer commits it in program order and frees its entry, marking the end of its pipeline journey.

2.2.2 Circuit Complexity

To implement the result tag matching against all source operands, the conventional wakeup logic implements associative matching using content-addressable memory (CAM). Figure 8 shows a block diagram for the CAM in the issue queue. When an instruction is selected for execution, its result register tag is driven on the tag lines, so that the tag is available to all instructions in the queue. Every instruction then compares the tags of its own source operands against the result tag. If matched, the source operand status is changed to ready. Once both source operands of an instruction are ready, and the needed execution unit for that instruction is available, the select logic can issue that instruction for execution.

Figure 8. CAM array for the wakeup logic in the issue queue [39]. Each of the N instruction entries holds left and right operand tags with ready bits; comparators match the entries against tag lines 1 through IW, and OR logic combines the match results into each ready bit.

In Figure 9, a detailed CAM cell for one bit position of the wakeup logic is illustrated [39]. The cell compares one bit of the operand tag (DATA) with the corresponding bit position of all result tags (TAG1 through TAG_IW, where IW is the issue width). The operand tag bit is stored in a RAM cell. Other bit positions are placed on the same row, connected to the same match line. In every cycle, the horizontal match lines are pre-charged high. The pull-down stack pulls a match line low if any of the operand bits mismatch the result tag. Otherwise, the match line stays high, indicating that all operand bits match all bits of one result tag. The horizontal match line and its associated pull-down stack correspond to one comparator in Figure 8. If an operand tag is matched against any of the result tags, then the OR logic indicates the ready status for that operand.
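The wakeup operation just described can be summarized in software form. The following C sketch is a minimal behavioral model of the tag broadcast and match; the entry layout and function names are illustrative assumptions, not the thesis's hardware:

    #include <stdbool.h>

    #define QUEUE_SIZE 32  /* illustrative issue queue size */

    /* One issue queue entry: two source operand tags and their ready flags. */
    typedef struct {
        bool valid;          /* entry holds a dispatched instruction      */
        int  tagL, tagR;     /* physical register tags of source operands */
        bool readyL, readyR;
    } IssueEntry;

    /* Behavioral model of the wakeup logic: broadcast one result tag to
     * every entry; any matching operand is marked ready. */
    void broadcast_result_tag(IssueEntry q[], int resultTag)
    {
        for (int i = 0; i < QUEUE_SIZE; i++) {
            if (!q[i].valid) continue;
            if (q[i].tagL == resultTag) q[i].readyL = true;
            if (q[i].tagR == resultTag) q[i].readyR = true;
        }
    }

    /* An instruction becomes a candidate for the select logic once both
     * operands are ready (functional unit availability is checked there). */
    bool wakeup_complete(const IssueEntry *e)
    {
        return e->valid && e->readyL && e->readyR;
    }

In hardware, all the per-entry comparisons happen in parallel on the CAM match lines; the loop serializes them only for clarity.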
To exploit more ILP, both the issue width (IW) and the number of entries in the issue queue need to increase. The capacitive load in the CAM array then increases superlinearly due to the following:

1. Increasing the IW implies increasing the number of tag lines, which in turn widens the CAM array horizontally; that is, the match lines must lengthen.

2. Increasing the number of instruction entries implies increasing the height of the CAM array vertically; that is, the tag lines must lengthen.

3. The number of devices and their capacitive loading also increase as the tag lines and match lines lengthen.

The superlinear increase in the capacitive load of the associative CAM array leads to a superlinear increase in the power consumption of the out-of-order issue [39]. The reorder buffer and the load-store queue are usually implemented as RAM arrays rather than CAM arrays. However, to exploit more ILP, the capacitive loading in these units also increases superlinearly due to the following:

1. With increasing IW, the RAM arrays have to support multiple read and write ports. This increases the array size horizontally.

2. Increasing the number of entries increases the array size vertically, lengthening the bit-lines.

3. Just as with the CAM array, the number of devices and their capacitive loading also increase.

With an increasing number of entries and issue width, the result bus, which connects all components together in the out-of-order engine, has to support more ports for simultaneously reading and writing instructions. This increases its capacitive loading. The result bus also has higher circuit activity. As a result, the superlinear increase in capacitive loading across all out-of-order engine components, coupled with higher circuit activity in all of those components, leads to a superlinear increase in the engine's power consumption.

Figure 9. Circuit diagram of a CAM cell for one bit position of the wakeup logic [39].
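The qualitative scaling argument above can be made concrete with a first-order approximation in the document's notation. Assuming each of the IW tag lines spans all N entries, and each of the N match lines spans all IW tag lines (an illustrative simplification, not a formula from [39]):

    C_tagline ∝ N,    C_matchline ∝ IW
    E_wakeup per cycle ∝ IW × C_tagline + N × C_matchline ∝ IW × N

Scaling the issue width and the number of entries together by a factor of k therefore grows wakeup energy by roughly k², the superlinear behavior described above.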
2.3 Power Reduction for the Out-of-Order Engine

2.3.1 Classification

To reduce the high power consumption of the out-of-order engine, techniques with different approaches have been proposed. Figure 10 shows a graphical classification of these techniques based on their approach.

Figure 10. Classification of power reduction techniques for the out-of-order engine: fixed-size techniques versus downsizing techniques, with downsizing split into no occupancy reduction and active occupancy reduction (IPC based and traffic based). The boxes on the bottom right (data dependence; burst reduction plus i-cache fetch clustering) represent the techniques proposed by this thesis to replace the IPC-based and traffic-based techniques.

Techniques that keep a fixed size for components rely on dynamic reconfiguration or circuit techniques to reduce power consumption. Such techniques are outside the scope of this thesis and are briefly reviewed in this section. Downsizing techniques dynamically reduce the effective size of components to reduce power consumption. The trigger conditions for resizing in these techniques vary, but the variations share some common patterns. This section identifies and classifies these patterns in order to help address any limitations they might have in common. This classification also helps in the design of a new technique that combines the strengths of the different class patterns.

A common pattern for most downsizing techniques is to segment one or more components in the out-of-order engine and disable empty segments for power savings. The segmentation is implemented at the circuit level by disabling signal propagation to a segment to reduce activity and total capacitive load. Beyond this common downsizing, the techniques differ on whether a limit on the number of active segments is set. Shutting down only those segments that happen to be empty, with no additional attempt to reduce instruction occupancy, is classified here as having no (or neutral) occupancy reduction. In contrast, microarchitecture techniques that force components to reduce instruction occupancy based on dynamic thread behavior are classified here as active occupancy reduction. This difference is graphically illustrated in Figure 11.

Figure 11. Active versus no occupancy reduction for an abstract queue. Without actively limiting the number of active segments, only a few segments can be disabled (left). When a limit is set based on thread behavior (middle), segments are forced to drain, creating more empty segments that can be disabled (right).

Active occupancy reduction techniques proposed thus far downsize their components based on monitoring the per-cycle instruction commit rate (IPC) or based on reducing instruction traffic. Active occupancy reduction offers more attractive power savings than neutral occupancy reduction. The advantage of actively adjusting processor resources based on thread behavior was also discussed in [17]. However, current approaches to active occupancy reduction have yet to fully realize that potential. It is worth noting that all proposed active occupancy reduction techniques downsize only the issue queue; the only technique that downsizes all out-of-order engine components does not reduce occupancy. The following sections review the proposed neutral and active occupancy reduction techniques in detail.

2.3.2 Downsizing with Neutral Occupancy Reduction

Only one technique, proposed by Ponomarev et al., uses neutral occupancy reduction, but it is also the only one that downsizes all major components in the out-of-order engine (issue queue, reorder buffer, and load-store queue) [15]. In this technique, empty segments are disabled for power savings, but they are reactivated as soon as the front-end exerts pressure to dispatch more instructions. The reactivation of empty segments takes place regardless of whether it helps performance or not. Power savings for Ponomarev et al.'s technique were reported for the out-of-order engine components. However, total processor power savings were not reported due to the limited use of power modeling in their methodology. Moreover, clustering of fetched instructions to reduce i-cache power consumption was not addressed [15].

2.3.3 Downsizing with Active Occupancy Reduction

A number of the proposed techniques to reduce power in the out-of-order engine can be classified as active occupancy reduction. They either implement IPC monitoring algorithms to trigger resizing or they limit instruction traffic based on microarchitecture indicators. However, none of the active occupancy reduction techniques proposed so far have targeted all components of the out-of-order engine for power reduction.

Using IPC: Downsizing the issue queue based on an IPC metric was proposed by Folegnani et al. [38]. In this technique, the issue queue is segmented, and the top segment is monitored for its contribution to the IPC.
If the top segment does not contribute, it is disabled to save power, and the monitoring then moves on to the next lower segment. Since the only way to determine whether a segment contributes to performance is by enabling it and then disabling it, the algorithm can spend a substantial amount of time upsizing the issue queue. With often short-lived execution behaviors, this periodic upsizing can miss power savings opportunities and may not react fast enough to thread behavioral changes [20]. This is a typical adverse side effect of downsizing based on the IPC. In Folegnani et al.'s technique, only power savings for the issue queue are reported; occupancy reduction is not reported. Moreover, a large portion of the power savings comes from disabling the wakeup logic of entries that either are empty or have ready operands, not from downsizing per se. Total processor power savings for their technique are not reported due to limitations of power modeling.

Another technique that monitors the IPC to actively reduce instruction occupancy was proposed by Karkhanis et al. [19]. In this technique, an algorithm measures the IPC over an initial interval while setting the number of instructions in the pipeline stages from fetch through issue to the maximum possible. Over subsequent intervals, the algorithm lowers the maximum number of instructions. If a lower maximum is found to yield the same IPC as the initial interval, then this new maximum is maintained. Otherwise, the algorithm reverts back to the initial state. This process is repeated any time the IPC or branch behavior changes. This technique suffers from the same IPC limitation as Folegnani et al.'s: the algorithm has to go back to the maximum possible state and then "hunt" for a lower instruction limit. The duration of the initial interval, as well as of all subsequent intervals, is on the order of 100K cycles. This means that finding an optimal state might take on the order of a million cycles. With thread behaviors acting on a much shorter scale [20], the processor can constantly be in a "hunting," inefficient state. Moreover, changes in IPC or branch behavior can trigger this "hunting" process even if the original state is optimally efficient. Segmentation of out-of-order engine components and disabling of empty segments were not implemented in Karkhanis et al.'s technique. Thus, the high power consumption of the tag lines was not reduced, and only power savings for the issue queue comparators are reported. Reduction in instruction activity is reported, but reduction in instruction occupancy is not. Total processor power savings were not reported due to limited power modeling.
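The interval-based search that these IPC-driven techniques share can be sketched in C. This is a simplified illustration of the class of algorithms, not Karkhanis et al.'s exact controller; the interval length, tolerance, and the measure_ipc() hook are assumed placeholders:

    #define MAX_LIMIT 8        /* maximum instruction limit, in segments (placeholder) */
    #define INTERVAL  100000   /* cycles per measurement interval (placeholder)        */
    #define TOLERANCE 0.02     /* acceptable relative IPC loss (placeholder)           */

    /* Assumed simulator hook: run for `cycles` cycles with the pipeline
     * limited to `limit` instructions and return the observed IPC. */
    double measure_ipc(int limit, int cycles);

    /* One "hunting" pass: measure IPC at the maximum size, then probe
     * smaller limits; keep the smallest limit that matches the baseline. */
    int hunt_for_limit(void)
    {
        double baseline = measure_ipc(MAX_LIMIT, INTERVAL);
        int best = MAX_LIMIT;

        for (int limit = MAX_LIMIT - 1; limit >= 1; limit--) {
            double ipc = measure_ipc(limit, INTERVAL);
            if (ipc < baseline * (1.0 - TOLERANCE))
                break;        /* performance lost: revert to previous limit */
            best = limit;     /* same IPC at a smaller size: keep it        */
        }
        return best;          /* re-run whenever IPC or branch behavior changes */
    }

Because each probe costs a full interval, one pass can take on the order of MAX_LIMIT × INTERVAL cycles, which is the sluggishness described above.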
Using Traffic Reduction: A traffic reduction technique based on branch behavior was reported by Manne et al. [21]. In this technique, the front-end stops fetching instructions when it encounters a branch that it cannot predict confidently. Instructions from an often-mispredicted path are then less likely to enter the pipeline, thereby reducing total traffic. However, reducing instruction traffic alone will not always reduce occupancy. A queue's occupancy might be reduced when the traffic going into it is reduced, but if the queue's service rate is also reduced, the queue's steady state remains at high occupancy. Moreover, reducing traffic based on a single behavior, such as branch prediction, does not take advantage of other thread behavioral changes.

Further traffic reduction over Manne et al. was reported by Baniasadi and Moshovos [40]. Data dependences and flow rate were used to control traffic. However, this technique shares the same traffic reduction limitation as Manne et al.'s. Moreover, their data dependence metric only counts data dependences; this simple count does not consider how well instructions line up in a dependence chain.

Another traffic reduction technique, proposed by Buyuktosunoglu et al., uses issue queue utilization and a parallelism indicator to pause instruction fetch [41]. When utilization is high and parallelism is low, fetching is paused to lower utilization. When utilization becomes low, fetching is resumed. This dynamic, however, causes the utilization to fluctuate between high and low levels during low-parallelism periods, instead of remaining low. Only issue queue utilization is addressed in the out-of-order engine. Moreover, the fetch pausing is not timed with the fetch buffer occupancy to cluster i-cache instructions more effectively.

Need for a New Tandem Approach: Because the IPC is only a symptom of performance, determined by many behaviors, downsizing the out-of-order issue queue based on the IPC can hurt the instruction window's ability to find and issue instructions out of order. To avoid this effect, IPC techniques have to employ periodic upsizing, which renders them too slow to react to short-lived thread behaviors [20]. Moreover, a traffic reduction technique alone will not always reduce occupancy, especially when the out-of-order engine's service rate slows down. What is needed is a technique that resizes the out-of-order engine but does not depend on the IPC: the out-of-order engine should be resized whenever it becomes less effective at issuing instructions out of program order. This technique should be coupled with one that reduces traffic bursts, especially when the out-of-order engine's service rate is slow. Reducing traffic bursts above the sustained dispatch rate lessens the chance of congesting the out-of-order engine. This thesis develops both techniques. It develops a new data dependence index to resize the out-of-order engine, and it develops a burst reduction technique that uses the fetch buffer to moderate traffic. The latter technique also clusters more instructions into a single i-cache access, thus reducing i-cache power consumption.

2.3.4 Fixed-Size Techniques

These techniques do not downsize components. However, they might be applied in conjunction with microarchitecture techniques that reduce occupancy for more power savings. For example, dynamic reconfiguration of the issue queue based on performance counters was proposed by Bai and Bahar [42]. Reducing tag comparisons in the issue queue was proposed by Huang et al. [43]. A pointer-based issue queue was proposed by Ramirez et al. [44].

2.4 Complexity Reduction of the Out-of-Order Issue

The techniques discussed in this section are designed to reduce the complexity of the associative wakeup logic in the issue queue to improve performance. None of these techniques address power consumption. It is theoretically possible to combine active occupancy techniques with complexity reduction to improve both performance and efficiency. Such an approach is beyond the scope of this work and will be discussed in the future research section. These techniques are reviewed here for completeness.
They are grouped into three categories: associative techniques, in which the associative logic is still used; non-associative techniques, in which other circuit forms (such as indexing and arrays) are used; and scheduling techniques, in which instruction latencies are used to reduce complexity.

2.4.1 Associative Techniques

Associative techniques try to reduce issue queue complexity by using a smaller out-of-order issue queue and complementing it with a larger, simpler queue or buffer. Deciding which instructions go to the small, complex queue and which go to the large, simple queue is critical for performance. Lebeck et al. complement a small conventional issue queue with a large, simple, 2K-entry waiting buffer [45]. If a load instruction misses in the cache, it is moved to the waiting buffer along with all the instructions that depend on it. The load and its dependent instructions stay in the wait-buffer until the cache miss is serviced, as Figure 12 (a) shows. The authors demonstrate that such a design can achieve significant speedup over a conventional, monolithic issue queue; however, they do not estimate the impact on power consumption.

Some high-performance processors allow instructions that depend on a load to issue speculatively, before knowing whether the load will actually hit in the data cache. Based on decisions from a load-hit predictor, load-dependent instructions are speculatively issued early to save some pipeline cycles. When the predictor is wrong and the load does not hit in the cache, all dependent instructions that were issued speculatively must be reissued after the cache miss has been serviced. After having been issued, the speculative instructions must stay in the issue queue until their status is resolved, thereby occupying more entries and requiring a larger issue queue. For the purpose of having a small issue queue capable of high clock speed, Moreshet and Bahar propose using a replay buffer to store load-dependent instructions after they are issued speculatively from the main issue queue [46]. If the load hits in the cache, these instructions are removed from the replay buffer; if the load misses in the cache, then the instructions are reissued from the replay buffer when the cache miss is serviced, as Figure 12 (b) shows. The replay buffer can be implemented with simpler, slower circuitry, since loads that miss in the cache take multiple cycles to be serviced. The authors claim 18% power savings for the issue queue with no degradation of performance [46].

Figure 12. (a) Using a CAM queue with a waiting buffer for instructions dependent on a load that missed in the cache; (b) a replay buffer for speculative loads; (c) moving critical (oldest) instructions from a large, slow issue queue to a fast associative queue.

Brekelbaum et al. propose a simple, slow, and large issue queue to which all instructions are dispatched [47]. As non-ready instructions age, they become more "critical". In every cycle, the oldest critical instructions are moved to a fast, smaller queue, from which they are issued, as Figure 12 (c) illustrates. Since the large queue is outside the critical timing path, this hierarchical scheme yields a larger instruction window capable of extracting more ILP, but with the benefit of a fast, low-complexity issue queue.
Beyond the performance benefits reported in this study, no detailed power consumption tradeoffs were reported.

Ernst and Austin leverage the fact that most instructions have at most one non-ready operand [48]. They propose using three queues: a non-CAM queue for instructions that are ready at dispatch, a second CAM-based queue for instructions having only one operand ready, and a third, small CAM-based queue for instructions having none of their operands ready. Since the two CAM-based queues are smaller than a conventional large CAM queue, the authors show that such an arrangement consumes 10-20% less power while running at a clock speed 20-40% faster than the conventional design [48]. However, total processor power savings are not reported.

2.4.2 Non-Associative Techniques

Indexing: Canal and Gonzalez offer indexing as an alternative to the associative CAM design [49]. This technique uses RAM and an index of register tags to send each result to only one consumer instruction. Cruz et al. observed that, for a majority of instructions, a result produced by one instruction is used by only one other instruction [50]. Based on this observation, Canal and Gonzalez proposed a queue for instructions that are the first user of a result produced by another instruction. This queue is called the first-use instruction queue, as Figure 13 shows. Other instructions that are not "first-use" are dispatched to a small, conventional, associative queue. Instructions that have all operands ready are dispatched directly into an in-order queue. The reduced complexity of the combined three queues allows them to be clocked faster than a conventional associative issue queue. The authors do not, however, estimate power savings for this technique.

Figure 13. Block diagram for first-use indexing [50]. Dispatched instructions are routed by result register tag to the first-use queue, a small issue queue, or a ready queue feeding the functional units.

Array: Another alternative to a conventional issue queue based on associative search is an array structure that keeps track of instruction dependences. The array consists of rows representing instructions and columns representing registers, as shown in Figure 14. A cell in the array is set to "1" when an instruction depends on a physical register whose value will be written by another instruction that is still in flight. When a register is written by an instruction, the wakeup logic resets all cells in that register's column to "0". An instruction is determined to be ready when all of its row cells are reset to "0".

Figure 14. Array representation. The add instruction (add R6, R2, R4) stored in row 3 depends on both registers 2 and 4.

Compared to the CAM methods, the array technique reduces complexity by eliminating the associative comparisons of register tags. However, it still has the disadvantage of requiring a large number of decoders for selecting and clearing a register column. Determining when an instruction is ready also requires a large number of comparisons to check all cells of a row. Goshima et al. improved on the basic array method by distributing the array into three arrays storing different types of instructions: integer, floating-point, and load-store [51]. They report improvement in clock speed due to the reduced cycle time of the wakeup and select logic, but they do not study potential power reduction benefits.
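The array operations described above reduce to simple bit manipulation. The following C sketch is a minimal behavioral model of the dependence matrix; the sizes and names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_INSNS 64   /* rows: in-flight instructions (illustrative) */

    /* One row per instruction; bit r is set while the instruction waits on
     * physical register r. With at most 64 registers, a row fits one word. */
    static uint64_t depend[NUM_INSNS];

    /* Dispatch: mark a register this instruction still waits on. */
    void set_dependence(int insn, int reg)
    {
        depend[insn] |= 1ULL << reg;
    }

    /* Wakeup: when register reg is written, clear its column in every row.
     * In hardware, all rows are cleared in parallel. */
    void register_written(int reg)
    {
        uint64_t mask = ~(1ULL << reg);
        for (int i = 0; i < NUM_INSNS; i++)
            depend[i] &= mask;
    }

    /* Ready check: an instruction can issue once its whole row is zero. */
    bool is_ready(int insn)
    {
        return depend[insn] == 0;
    }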
Deeper pipelining has been used consistently by microprocessor designers to increase performance, as deeper pipelines allow higher exploitation of parallelism. Increasing pipeline depth also allows faster clocking, as there is less logic computation per stage. Over the past 20 years, for example, pipeline depths have increased from 1 for the Intel 286 processor, to 5 for the Intel 486, to 10 for the Intel Pentium Pro, to 40 (based on the fast clock) for the Intel Pentium 4 [10].

Critical loops, however, pose a challenge for deeper pipelining. A critical loop is a sequence of logic operations that must produce results within one pipeline stage so that the results can be used in the next cycle; otherwise, a pipeline "bubble" is propagated, reducing IPC performance. The wakeup and select logic form such a critical loop. When an instruction is selected for execution, its dependent instructions have to wake up in the same cycle so that they can execute in the next cycle; otherwise, the instruction and its dependent instructions cannot execute in consecutive cycles; that is, a pipeline bubble is produced. To allow pipelining of the wakeup and select logic, Brown et al. have proposed breaking the critical loop of wakeup and select into two loops: a critical, single-cycle loop for the wakeup logic and a non-critical loop for the select logic [52]. They accomplish this breakdown by speculating that all waking instructions are immediately selected for execution. They use an array to keep track of instruction dependences across the two loops. The authors claim potential power savings, but they do not report power savings data for their technique.

2.4.3 Scheduling Techniques

Scheduling techniques try to take advantage of complete or partial knowledge of instruction latencies to avoid the wakeup logic's high wire latency in the issue queue. The challenge for these techniques is how to accurately estimate latency or issue time. Generally, the issue time of an instruction is based on the issue times of all instructions it depends on. If all load instructions hit in the cache, this estimation would be simple; the difficulty arises from the variable latency of memory access instructions.

Canal and Gonzalez have proposed two similar but distinct techniques to deal with the variable latency of loads [49]. In one technique, they propose using a wait-buffer to hold load instructions, and all instructions dependent on them, until it is known whether the loads hit in the cache. Once the cache latency for a load is determined, the load and all dependent instructions are scheduled with proper priority into an in-order queue, as Figure 15 (a) illustrates. The second technique also uses a wait-buffer, but with a different arrangement [53]. All load instructions are assumed to have cache-hit latency as they enter the in-order queue. If they suffer a cache miss and cannot issue when they reach the head of the queue, they are relocated to the wait-buffer along with their dependent instructions. Once the cache miss is serviced, instructions can then issue directly from the wait-buffer, as Figure 15 (b) illustrates. For both techniques, the authors assert that complexity is reduced in comparison to using one large out-of-order issue queue and that the designs are, therefore, amenable to faster clocking and scalability. The authors, however, do not conduct a study of potential power savings benefits.
Other scheduling techniques, reporting innovative circuit designs with similar results, have also been published [54-56]. From a power consumption perspective, scheduling instructions is not power efficient in two cases. One is when most instructions line up in a data dependence chain and cannot be scheduled. The second is when most instructions are independent and performance does not benefit from scheduling them. It is theoretically possible that the active occupancy reduction techniques developed here could be combined with scheduling techniques to achieve additional optimization in performance and power consumption.

Figure 15. Using an in-order issue queue with latency scheduling. Only instructions at the head of the queue can issue. In (a), loads and their dependent instructions are dispatched to a wait-buffer; in (b), loads are moved to a wait-buffer when they cannot issue due to a cache miss.

2.5 Power Models at the Microarchitecture Level

Since power consumption has become a critical design constraint, microprocessor designers have to evaluate power consumption early in the design process. Traditional circuit or device level power estimation tools, however, are not practical at the architectural definition stage. These tools rely on circuit parameters obtained after the chip layout and floor-planning stages, and it is usually too late after these stages to make architecture modifications for power reduction. Moreover, the dynamic power component depends on actual workloads. Architecture simulations of workloads using detailed layout circuit parameters are time consuming, making it impractical for designers to quickly explore the power-performance tradeoff space.

Recently, researchers have developed new techniques capable of quick dynamic power estimation at the architecture level. The key feature of these techniques is the estimation of the total switched capacitive loading variable (Ctot) in the processor dynamic power equation (3) in Chapter 1. Recall from that equation that dynamic power depends on the variables Ctot, VDD, Nc, and tp. All of these variables depend on the technology process, but the total switched capacitance, Ctot, and the number of cycles, Nc, additionally depend on the architecture configuration. Nc can easily be obtained from a cycle-level architecture simulator; Ctot is the key variable that needs to be estimated efficiently at the architecture level to estimate dynamic power consumption. Brooks et al. [24], Vijaykrishnan et al. [57], and Cai [58] have implemented dynamic power models that estimate Ctot based on architecture parameters, such as pipeline width, issue queue size, cache sizes and configuration, number of registers, and other architecture parameters.
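For reference, the relationship that these models exploit can be written out. Assuming equation (3) follows the standard switched-capacitance formulation (this is a reconstruction consistent with the variables named above, not a reproduction of Chapter 1), the average dynamic power over a simulated run is:

    Pdyn = (Ctot × VDD² / 2) / (Nc × tp)

where Ctot is the total capacitance switched during the run, Nc is the total cycle count, and tp is the cycle time. Only Ctot and Nc vary with the architecture configuration, which is why estimating Ctot efficiently is the central problem.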
2.5.1 The Wattch Approach

To estimate the switched capacitance efficiently, Brooks et al. [24], in their Wattch power models, simplify the capacitance calculations by grouping typical processor circuits into four major circuit categories:

1. Memory array structures: used for the instruction and data caches, register files, branch predictor, and buffers.

2. Fully associative CAM: used for the issue queue wakeup logic, the reorder buffer, and the translation look-aside buffer.

3. Combinational logic and wires: used for the functional units, the issue queue selection logic, and the result buses.

4. Clock structures: used for distributing clock signals to all units on the die.

For each of the above circuit categories, Brooks et al. then derive formulas for the capacitive load of all essential circuit stages based on a typical physical layout. To illustrate how they derive the capacitance equations for the word-line and bit-line activation stages of the register file structure, the schematic circuit diagram in Figure 16 is used. The register file is a memory array type. Its word-lines are laid out horizontally and its bit-lines vertically in the schematic. The word data bits are stored in static memory cells across the array. The array access has four stages: in the first stage, the array decodes the requested address (left in the schematic); in the second, it activates the resulting word-line; in the third, the cell transistors discharge the bit-lines after the initial pre-charge; and in the fourth, the output data is driven from the sense amplifiers connected to each complementary bit-line pair.

The capacitance of a single word-line is composed of the diffusion capacitance of the word-line driver (left in Figure 16), plus the gate capacitance of the cell access transistors, plus the capacitance of the line's metal wire. Similarly, the capacitance of a single bit-line is composed of the diffusion capacitance of the pre-charge driver (top in Figure 16), plus the diffusion capacitance of the cell access transistors, plus the capacitance of the line's metal wire. Table 1 lists the total capacitance equations for all the word-lines and bit-lines of a register file structure.

Figure 16. Schematic diagram of word-lines and bit-lines for a circuit array structure (adapted from Brooks et al. [24]).

The capacitance equations in Table 1 factor in the register file size, so they are a function of architecture parameters, which is a key feature of these equations. The number of word-lines, the number of bit-lines, the diffusion capacitances of the word-line and pre-charge drivers, the word-line length, and the bit-line length are all functions of the register file size. The register file size is, in turn, an architecture parameter dependent on the register file width, the number of entries, and the number of read-write ports. Complete details of the capacitance equations for other structures can be found in [24] and in the Wattch source code available from the authors.

Table 1. Capacitive load equations for the word-line and bit-line stages of a memory array circuit (register file structure).

    Register file word-line capacitance = Cdiff(Wordline Driver) + Cgate(Cell Access) × Num Bitlines + Cmetal × Wordline Length

    Register file bit-line capacitance = Cdiff(Precharge) + Cdiff(Cell Access) × Num Wordlines + Cmetal × Bitline Length
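The Table 1 equations translate directly into code. The following C sketch mirrors their form; the capacitance constants and geometry values are illustrative placeholders, not values from the Wattch source:

    /* Illustrative per-device and per-length capacitances (fF and fF/um). */
    #define CDIFF_WL_DRIVER   2.0   /* word-line driver diffusion        */
    #define CDIFF_PRECHARGE   1.5   /* pre-charge driver diffusion       */
    #define CGATE_CELL_ACCESS 0.5   /* cell access transistor gate       */
    #define CDIFF_CELL_ACCESS 0.6   /* cell access transistor diffusion  */
    #define CMETAL_PER_UM     0.2   /* metal wire capacitance per micron */

    /* Register file geometry, derived from architecture parameters: the
     * width (bits), number of entries, and read/write ports set these. */
    typedef struct {
        int    num_wordlines, num_bitlines;
        double wordline_um, bitline_um;   /* physical line lengths */
    } RegFileGeom;

    /* Table 1, word-line row: driver diffusion + one access-transistor
     * gate per bit-line crossed + wire capacitance. */
    double wordline_cap(const RegFileGeom *g)
    {
        return CDIFF_WL_DRIVER
             + CGATE_CELL_ACCESS * g->num_bitlines
             + CMETAL_PER_UM * g->wordline_um;
    }

    /* Table 1, bit-line row: pre-charge diffusion + one access-transistor
     * diffusion per word-line crossed + wire capacitance. */
    double bitline_cap(const RegFileGeom *g)
    {
        return CDIFF_PRECHARGE
             + CDIFF_CELL_ACCESS * g->num_wordlines
             + CMETAL_PER_UM * g->bitline_um;
    }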
The Wattch power models integrate with the sim-outorder SimpleScalar cycle-level simulator, and Wattch uses the architecture parameters given to sim-outorder. When simulating a program, and before sim-outorder enters its main cycle-level loop, the Wattch models are invoked to calculate the capacitance load of each architecture structure. The capacitance load values are then stored in memory for fast dynamic access. Each time a unit is accessed during the cycle-level loop, that unit's capacitance is added to the running sum for Ctot. When the simulated program ends, Wattch uses Ctot and the total number of cycles, Nc, obtained from sim-outorder to calculate the average dynamic processor power (equation (3), Chapter 1). Wattch uses constant VDD and tp values for a given technology process.

Clock gating: For multi-ported units, clock gating has become a common approach in processor designs to curb power consumption. When a multi-ported unit is accessed on some but not all ports, usually only the accessed ports are activated, to reduce power consumption. The Wattch authors have developed three models of clock gating for multi-ported units:

1. A unit consumes 100% of its full power whenever it is accessed on any port. The unit consumes zero power when none of its ports are accessed.

2. The power that a unit consumes increases linearly with the number of accessed ports. The unit consumes zero power when none of its ports are accessed.

3. The power that a unit consumes increases linearly with the number of accessed ports. The unit, however, consumes 10% of its full power when none of its ports are accessed. This option reflects the power overhead of the control logic needed to disable part or all of a unit.

In our work, we used option (3) above, and our reported power savings data reflect the 10% overhead.
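The three gating models can be stated compactly in code. The following C sketch shows how per-access power would be accounted under each model; it follows the descriptions above rather than the actual Wattch source:

    /* Power drawn this cycle by a unit with `total` ports, of which `used`
     * are accessed, under the three Wattch clock gating models.
     * full_power is the unit's power with all ports active. */
    double gated_power(int model, double full_power, int used, int total)
    {
        if (used == 0)
            /* Model 3 charges a 10% idle overhead for the gating control
             * logic; models 1 and 2 assume perfect gating. */
            return (model == 3) ? 0.10 * full_power : 0.0;

        if (model == 1)
            return full_power;                       /* all-or-nothing */

        return full_power * (double)used / total;    /* models 2 and 3 */
    }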
3 Methods

This chapter presents the methods used to reduce instruction occupancy and ultimately increase processor efficiency. Section 3.1 states the two guiding principles for the methods. The principles were introduced in Chapter 1; they are elaborated here in more depth. Principle 1 states two cases for downsizing the out-of-order engine based on data dependences. Principle 2 states the case for reducing instruction traffic bursts. The tandem principle explains why both principles must be applied to achieve significant occupancy reduction.

Section 3.2 describes the methods used to implement Principle 1. It first develops the data dependence index (DDI), which is needed to quantify data dependences. The token passing algorithm shows how to measure the DDI in hardware. An alternative algorithm that measures the DDI by extending register renaming is also proposed. Validation of the algorithms is then presented. The downsizing policy gives the rationale for when, and by how much, to downsize the engine as a function of the DDI.

Section 3.3 describes the methods used to implement Principle 2. Three methods are investigated to determine the most effective one at moderating traffic bursts: the first uses the DDI as a condition to pause fetching; the second uses the sustained dispatch rate (SDR) and the fetch buffer occupancy to pause fetching; and the third uses both the SDR and the issue queue occupancy to pause fetching. Section 3.4 then briefly shows that the second and third methods have the advantage of clustering fetched instructions, which can reduce i-cache power consumption. Finally, Section 3.5 describes the simulation framework in detail, covering the architecture and power models, the workload benchmarks, and the simulation methods.

3.1 Efficiency Principles for the Out-of-Order Engine

3.1.1 Principle 1

Gradually downsize the out-of-order engine as it becomes less effective in issuing instructions out of program order.

This principle can be applied to queues in the out-of-order engine in two cases.

Case 1: Upper levels of data dependence. If window instructions exhibit long chains of data dependence, the data dependence level is said to be high. The more instructions line up, the higher the dependence level. When this happens, the window's performance rapidly becomes insensitive to window size. This insensitivity effect can be observed in Wall's study on available instruction-level parallelism [36]. Figure 17 shows this effect for two benchmarks using Wall's data. For an ideal processor, no parameter other than data dependence and window size affects the issue rate. Although the figure shows data dependence for entire benchmarks, data dependence levels vary dynamically, as will be seen later in this chapter. This effect is observed for an ideal processor with an unlimited issue width; however, it can still be exploited in a non-ideal processor. High data dependence levels naturally limit the issue rate, so that the limited issue width of a non-ideal processor does not come into play. Thus, the trade-off between window size and performance at upper levels of data dependence can still be exploited for efficiency in a non-ideal processor. Figure 18 graphically shows this exploitation.

Case 2: Lower levels of data dependence. This case exploits the limited issue width of a non-ideal processor at low levels of data dependence. In Case 1, it was asserted that, for an ideal processor, performance is sensitive to window size at low levels of data dependence. However, the limited issue width of a non-ideal processor renders performance insensitive to window size at very low levels of data dependence. In this case, a small window is able to find enough independent instructions to issue and saturate the issue width, even in the presence of other microarchitecture latencies, such as data cache misses. This effect can then be exploited for efficiency.

Figure 17. Ideal processor performance (instruction issues per cycle) as a function of window size for different data dependence levels (low dependence: tomcatv; high dependence: li; long chain dependence). The data dependence level is higher in li than in tomcatv. Thus, li's performance is less "sensitive" to window size: if the window size is reduced from 128 to 32 entries, for example, performance drops much less significantly for li than it does for tomcatv. For a long chain of data dependence, performance is completely insensitive. Data from Hennessy and Patterson [37].

Figure 18. Ideal processor trading off window size for performance at different data dependence levels. At higher data dependence levels, a more significant decrease in window size can be traded for a smaller performance loss than can be traded at lower data dependence levels.

In both cases, components can be forced to reduce their occupancy by downsizing. This concept is illustrated in Figure 19. Components should maintain maximum size when data dependence exhibits neither high nor low levels; that is, when the engine holds a roughly equal mix of independent and dependent instructions.
In other words, maximum size should be maintained when neither Case 1 nor Case 2 applies. The sizing decision must be based solely on the state of data dependences; otherwise, the engine's ability to find and issue independent instructions might be hindered.

Figure 19. Downsizing with active occupancy reduction. Segments above the max-segments pointer are allowed to drain so they can be disabled.

3.1.2 Principle 2

Continuously moderate the front-end pressure to reduce traffic bursts.

The front end is normally designed to fetch instructions from the i-cache and push them into the out-of-order engine as fast as possible. However, the engine may not always be able to move instructions as fast as they come in. It might be slowed down, for example, by data cache misses, function unit latencies, or resource conflicts. Because of this constant front-end pressure, any empty entries in the engine are quickly filled by short dispatch bursts. Even if traffic is reduced, engine queues can still maintain high occupancy. This scenario is graphically illustrated in Figure 20. To help maintain a lower occupancy, it is important not only to reduce traffic, but also to dampen short bursts above the currently sustained dispatch rate.

Figure 20. A short traffic burst caused by front-end pressure, in (a), can raise occupancy in the queue, in (b). The queue can then stay congested even if traffic is reduced.

3.1.3 Tandem Principle

For a significant occupancy reduction, both Principles 1 and 2 must be applied. Applying Principle 1 alone does not cover cases where the engine is slowed down by microarchitecture-dependent latencies. For such cases, applying Principle 2 helps reduce occupancy when Principle 1 alone cannot. Conversely, applying Principle 2 alone does not cover cases where the engine can be downsized based on microarchitecture-independent latencies (namely, data dependences). In such cases, applying Principle 1 helps reduce occupancy when Principle 2 alone cannot. For example, when data dependences are very few, traffic can be high, so Principle 2 would not apply, yet the engine can still be downsized under Case 2 of Principle 1.

3.2 Downsizing the Out-of-Order Engine

3.2.1 The Data Dependence Index

In order to downsize the out-of-order engine's components based on data dependences, as outlined in Principle 1, a metric for data dependences is needed. The latency metric proposed by out-of-order scheduling techniques often involves a data dependence order [54-56]. However, downsizing components based on such a metric results in a large performance loss, because the metric includes microarchitecture latencies unrelated to data dependences. As Principle 1 states, downsizing on any condition other than data dependences hinders the instruction window's ability to find and issue independent instructions.

Dynamically building an entire dependence graph in the issue queue is impractical due to high hardware complexity. Measuring the longest data dependence path is more practical. However, the longest path alone does not account for instructions that are not part of it, and thus does not fully capture data dependence parallelism.
Therefore, we propose the Data Dependence Index (DDI) as the ratio of the number of instructions in the longest path, Nmax, to the total number of instructions in the graph, N:

    DDI = Nmax / N, where Nmax = Pmax + 1

and Pmax is the number of edges in the longest path. If all instructions are independent (the graph has no edges), then Nmax is defined as 0. The range of the DDI is a fixed scale from zero to one (0 ≤ DDI ≤ 1). As more instructions become independent, the DDI approaches zero, because the longest path becomes shorter, containing fewer instructions. At the other end, as more instructions become dependent, the DDI approaches one, because the longest path becomes longer, containing more instructions. If the DDI = 1, then all instructions form one long dependence chain. Figure 21 gives examples of the DDI: in (a) the DDI is greater than in (b), indicating greater data dependence. Figure 22 gives examples of the DDI that are counterintuitive: the graph in (a) appears to have more data dependence than the graph in (b); however, both graphs have the same DDI.

The DDI can be theoretically justified using the ideal processor concept, such as the one defined by Wall [36]. In such a processor, the issue rate is a function only of data dependences. To derive an expression for this issue rate, observe that all N instructions in a graph can issue in the time it takes to issue the longest path's instructions. Thus the ideal issue rate can be expressed as N/(Pmax + 1), where Pmax is the number of edges in the longest path. The ideal issue rate is almost the inverse of the DDI, except that its range is not bounded between zero and one as the DDI's is.

Figure 21. Data dependence graphs with instructions numbered in their program order. In (a), Nmax = 4 and N = 6, giving DDI = 4/6. In (b), the DDI = 4/9.

Figure 22. The graph in (a) appears to have more data dependence than the graph in (b); however, counterintuitively, both graphs have the same DDI = 4/7.

Applying the ideal issue rate directly, instead of the DDI, is impractical because of its range. As data dependences decrease and become very few, the ideal issue rate becomes very large and falls outside the useable dynamic range of size adjustment for the out-of-order engine's components. The DDI, on the other hand, quickly goes toward zero, and equals zero when all instructions are independent. In other words, the fixed DDI scale between zero and one is more practical for size adjustment.

The DDI measures data dependences for all instructions, including those from a predicted path. This implies that the window size is adjusted to the data dependences of whichever path is currently taken. If the prediction turns out to be correct, then the window size was properly adjusted. If the path was mispredicted, the window will adjust when the correct-path instructions are fetched and dispatched.

Modern processors cannot determine data dependences through memory at the issue stage. Likewise, the DDI does not measure dependences through memory; it measures only data dependences through registers. However, modern processors still issue loads and stores out of order by relying on special disambiguation techniques and memory address speculation [59, 60]. These disambiguation techniques verify at the retirement stage that data dependences were not violated. So, as far as adjusting the window is concerned, loads and stores are considered independent (at least at the issue stage) and do not affect the DDI.
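To make the definition concrete, the DDI of a small dependence graph can be computed offline as follows. This C sketch operates on an explicit edge list and is an illustration of the metric only, not the hardware mechanism (which is developed next); called on the six-instruction graph of Figure 21 (a), it would return 4/6:

    #define MAX_N 64   /* assumes n <= MAX_N */

    /* depth[i] = number of instructions on the longest dependence path
     * ending at instruction i. Instructions are numbered in program order,
     * so every edge (src -> dst) has src < dst and one forward pass works. */
    double compute_ddi(int n, int num_edges, const int edges[][2])
    {
        int depth[MAX_N];
        int nmax = 0;                    /* Nmax = 0 when there are no edges */

        for (int i = 0; i < n; i++)
            depth[i] = 1;                /* a node by itself */

        for (int i = 0; i < n; i++) {
            /* depth[i] is final here: all producers have smaller indices. */
            if (depth[i] > 1 && depth[i] > nmax)
                nmax = depth[i];         /* longest path seen so far */
            for (int e = 0; e < num_edges; e++) {
                int src = edges[e][0], dst = edges[e][1];
                if (src == i && depth[src] + 1 > depth[dst])
                    depth[dst] = depth[src] + 1;     /* extend the chain */
            }
        }
        return (double)nmax / n;         /* DDI = Nmax / N */
    }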
3.2.2 Token Passing Algorithm

To dynamically measure the longest data dependence path, we use a token-passing algorithm inspired by Fields et al.'s algorithm for tracking the critical path in instruction scheduling [54]. Our algorithm tracks only data dependences as instructions issue and does not need a predictor, resulting in a simpler hardware implementation. The algorithm proposed here operates on issue queue instructions; an alternative algorithm for the reorder buffer is proposed in the Future Directions chapter.

Each instruction entry in the issue queue has a token field that represents the instruction's depth in the dependence chain. When instructions are dispatched into the issue queue, their token is initially set to zero, meaning that they do not yet have a token. Independent (ready) instructions, whose operands are available in registers, are never part of a chain, and thus never inherit a token. Dependent instructions will eventually inherit a token from previous instructions before they issue.

Token creation and increment: When an instruction issues for execution, its token is incremented by one. If it is incremented from zero to one, then a token is said to be "created"; this could be the head of a new chain.

Token passing: Normally, when an instruction issues to execute, it passes its result tag to the issue queue so that waiting instructions can compare the result tag against their operand tags. If there is a match, the corresponding operand is marked as ready. Token passing can happen at the same time. When an instruction executes, it passes its incremented token along with its result tag. Waiting instructions can then inherit this token when they compare their operand tags to the result tag. To ensure that the token from the longer path is propagated when chains merge, each waiting instruction has to compare two tokens (assuming dyadic instructions) and store the larger one, as follows:

- If the waiting instruction matches two result tags (A and B), then compare token A to token B and store the larger.

- If the waiting instruction matches one result tag, then compare the corresponding token (A or B, depending upon which tag matches) to the instruction's existing token and store the larger.

Figure 23 (a) summarizes the token passing algorithm in a graph, and Figure 23 (b) shows the logic needed to compare and store instruction tokens.

Figure 23. Token passing is shown in (a). As instructions issue, instruction No. 1's token (small black circle) is passed all the way to instruction No. 6. The logic for comparing and storing the largest token is shown in (b). The mux either passes both tokens A and B, or passes token A or B together with the existing token; it is controlled by the matching of result tags with operand tags.

Token termination: If an instruction executes and no waiting instruction inherits its token, that token is said to "terminate", indicating the end of a chain; that is, the window is not large enough to pass this token to future instructions.
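A behavioral sketch of the token update follows. It models the compare-and-store logic of Figure 23 (b) in C; the entry layout and the two-broadcast-port interface are illustrative assumptions:

    #include <stdbool.h>

    typedef struct {
        int tagL, tagR;   /* source operand tags                     */
        int token;        /* depth in the dependence chain; 0 = none */
    } IQEntry;

    /* Issue side: increment the issuing instruction's token (a token is
     * "created" if it goes from 0 to 1) and broadcast it with the tag. */
    int issue_token(IQEntry *producer)
    {
        return ++producer->token;
    }

    /* Wakeup side: a waiting instruction inherits the larger token when
     * its operand tags match the broadcast result tags. A negative tag
     * means no broadcast on that port this cycle. Ready-bit marking
     * happens in the normal wakeup path and is omitted here. */
    void token_wakeup(IQEntry *w, int tagA, int tokenA, int tagB, int tokenB)
    {
        bool matchA = (tagA >= 0) && (w->tagL == tagA || w->tagR == tagA);
        bool matchB = (tagB >= 0) && (w->tagL == tagB || w->tagR == tagB);

        if (matchA && matchB)                       /* two chains merge */
            w->token = (tokenA > tokenB) ? tokenA : tokenB;
        else if (matchA && tokenA > w->token)
            w->token = tokenA;                      /* keep longer path */
        else if (matchB && tokenB > w->token)
            w->token = tokenB;
    }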
The DDI: As instructions issue during each cycle, the maximum token among the issue group is compared to a global maximum. The global maximum stores the issue group's maximum token if it is greater than the existing value. The global maximum then holds the longest path, Pmax, that the window "sees" over any time period, until Pmax is reset again. A sample DDI is computed using Pmax over the time period it takes to issue N = 8 instructions, after which Pmax is reset. Once 128 DDI samples are accumulated, a DDI interval average is taken and used to adjust the window size for the next interval. On average, this interval lasts about 512 clock cycles.

Complexity: Because Pmax is computed over 8 instructions, a 3-bit token is sufficient to hold the largest value. Compared to the other issue queue entry fields, such as the instruction opcode, operands, operand flags, and operand tags with their matching logic, the 3-bit token adds negligible storage to the issue queue. The token does affect the tag-drive width, because the token is passed along with the result tag. It also slightly increases the number of bit-lines for reading tokens as instructions issue. In the worst-case scenario, if the issue queue is completely full all the time (empty segments are not turned off) and all devices are switching, the power models used in this work estimate about 6% power overhead in the issue queue and 0.5% in the total processor. However, with segmentation of the bit-lines, reduced instruction occupancy, and empty segments turned off, this overhead becomes practically undetectable, especially for the total processor. Moreover, computing Pmax requires eight comparators (for an issue width of 8), each 3 bits wide, which we assert requires a negligible power increase. Computing Pmax also requires 3-4 logic levels, which is well within the optimum cycle of 6-8 fan-out-of-four (FO4) logic levels [61]. The DDI sample can be computed with a small ROM lookup table. Also, both the sample DDI and the interval DDI computations are not on the critical path, so they can overlap with the Pmax and token passing activities and can take more than one cycle if necessary.

3.2.3 Extended Register-Renaming Algorithm

This is an alternative method to the token-passing algorithm for measuring the DDI. Measuring the DDI can be accomplished at the rename stage by extending the dependence-check logic. As instructions are decoded, logical register operands are renamed to physical registers to remove any name dependences introduced by the compiler. When the dependence-check logic determines that an instruction's source operand depends on a previous instruction's destination operand, it renames the producer and consumer to the same physical register. The check logic can be extended so that each time a data dependence is detected for an instruction, its depth in the dependence chain is incremented by one and propagated to the next instruction in the chain. A comparator is needed to propagate the maximum dependence depth of the two source operands of the same instruction; in other words, if an instruction is the merge point of two dependence chains, only the longest branch of the chain is propagated to the consumer instruction. A final comparator determines the maximum of all dependence chain lengths to find the DDI. An average DDI is then computed over 512 clock cycles. This averaging helps smooth out any irregular DDI samples.

Complexity: Since the DDI is measured each time over the same number of instructions, e.g., over 8 instructions at dispatch, the ratio is not needed, and the longest path itself is equivalent to the DDI.
Figure 24 illustrates a simple example of obtaining the longest dependence path at rename over 4 instructions. The magnitude of the longest path measured over n instructions is at most n - 1, so the depth counters need only on the order of log2(n) bits, where n is the number of instructions. Moreover, since n, the dispatch width, is small, the complexity and overhead of obtaining the DDI at rename are minimal compared to the overall dependence-check logic. For instance, if the dispatch width is 4 instructions, the hardware needed to measure the DDI is 4 comparators and 3 increment-by-1 counters, all of which are two bits wide.

Figure 24. An example of four instructions at rename: i1: Add R1, R2, R3 (depth 0); i2: Add R4, R1, R5 (depth 1); i3: Add R7, R1, R6 (depth 1); i4: Add R9, R7, R8 (depth 2). The shaded boxes represent dependence depth counters. The maximum dependence value over all chains is 2, and the DDI = 2/4.
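A C sketch of the extended dependence-check logic for one rename group follows. The producer_of() helper, standing in for information the real dependence-check logic already computes, is an assumed hook:

    #define DISPATCH_WIDTH 4

    /* Assumed hook: in-group index of the instruction producing source
     * operand `op` of instruction i, or -1 if the value comes from
     * outside the group. */
    int producer_of(int i, int op);

    /* Dependence depth per instruction, as in Figure 24: depth 0 for
     * instructions with no in-group producer; otherwise one more than
     * the deepest producer. Returns the group's maximum depth, which
     * stands in for the DDI over a fixed group size. */
    int rename_group_depth(int n)
    {
        int depth[DISPATCH_WIDTH];
        int max_depth = 0;

        for (int i = 0; i < n; i++) {          /* program order */
            int d = -1;
            for (int op = 0; op < 2; op++) {   /* dyadic sources */
                int p = producer_of(i, op);
                if (p >= 0 && depth[p] > d)
                    d = depth[p];              /* longest branch of a merge */
            }
            depth[i] = d + 1;                  /* increment along the chain */
            if (depth[i] > max_depth)
                max_depth = depth[i];          /* final comparator */
        }
        return max_depth;
    }

For the four instructions of Figure 24, this returns 2, matching the DDI of 2/4.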
3.2.5 Downsizing Policy

Data dependences tend to change in phases during program execution because they inherently relate to program structures. Figure 25 shows DDI graphs during the execution of randomly selected periods of two SPEC CPU 2000 benchmarks. A long-running loop, for example, might have many instructions dependent on each other. Even though different iterations of the loop might be independent, the processor window may not be large enough to "see" beyond one loop iteration. A different loop, on the other hand, might have many independent instructions. Because of this phase pattern, the DDI should be measured over short intervals so that an efficient window size can be set for the duration of a phase.

[Two time-series plots; y-axis: DDI from 0 to 1; x-axis: sim cycle × 3000.]
Figure 25. During two randomly selected periods, the DDI for the 164.gzip.log benchmark (top) alternates between high and low periods. The DDI for the 176.gcc benchmark (bottom) remains fairly constant around a value of 0.36.

To adjust the instruction-window size, the architecture and the power models were segmented for the issue queue, the load-store queue, and the reorder buffer. Past studies, for example Ponomarev et al. [16], showed power saving benefits from turning off issue queue segments whenever they become empty. In this work, we proactively set a maximum number of useable segments based on the interval DDI. This cap reduces the number of in-flight instructions and forces more segments to become empty and thus be turned off. In this work, the mapping from DDI values to the maximum useable segments is called the DDI policy.

An initial DDI policy was first derived based on the two cases of Principle 1; that is, the maximum number of active segments was limited as the DDI value increased above one threshold and as it decreased below another. The threshold DDI values and the rates of increase and decrease were then empirically optimized with initial simulation tests. These values were then applied to all subsequent simulations. Figure 26 shows a graphical representation of the DDI policy.

When the maximum window size was adjusted, the issue queue, the load-store queue, and the reorder buffer were adjusted together and kept to their original proportions: the reorder buffer was set to hold twice as many entries as the issue queue or the load-store queue. However, since the reorder buffer segment size (16 entries) is twice as large as the issue queue or load-store queue segment size (8 entries), the maximum number of useable segments remains the same for all three units. So, for example, when the maximum number of useable segments is set at 4, the reorder buffer can have up to 4 × 16 = 64 entries, and the issue and load-store queues can have up to 4 × 8 = 32 entries each.

As Figure 26 shows, the DDI policy assigns the maximum useable segments to the full physical window size of 8 segments for DDI values between 0.17 and 0.55. In this range, the window contains a mix of data-dependent and independent instructions, so the window's performance is sensitive to its size. In other words, in this range a larger window can effectively find more independent instructions to issue.

[Step plot; y-axis: maximum useable segments from 0 to 8; x-axis: DDI from 0 to 1.]
Figure 26. A graphical representation of the DDI policy.

As the DDI value increases above 0.55, the window size is gradually decreased as fewer independent instructions are found and the window's performance gradually becomes less sensitive to size. Less sensitivity means that a smaller, more efficient window can perform just as well. When the DDI reaches 0.8, most instructions essentially become dependent and the window becomes completely insensitive to size. Two active segments can then perform just as well.

As the DDI value decreases below 0.17, independent instructions dominate. Because of limited microarchitectural resources (such as limited issue width), the window can find enough independent instructions in a smaller instruction pool to sustain the maximum issue rate. Thus, the window size is gradually decreased. Below a DDI value of 0.05, sufficient independent instructions are available that only two segments can sustain the maximum issue rate. Because of the nature of the DDI ratio, the DDI value goes to zero faster when independent instructions dominate than it goes to one when dependent instructions dominate, as shown in Figure 26. This explains why the DDI policy curve is steeper on the left and slightly asymmetric.

To control the window size, three counters were assigned to keep track of the number of non-empty segments in each of the reorder buffer, the issue queue, and the load-store queue. These counters increment and decrement each time a segment is turned on and off, respectively, in each unit. If incrementing one of the units' counters would exceed the maximum number of useable segments set by the DDI policy, instruction dispatch is momentarily halted; dispatch resumes when entries are freed up by instructions issuing or committing.
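A minimal sketch of the policy and the dispatch gate follows, in C. The threshold values (0.05, 0.17, 0.55, 0.8) and the endpoints (2 and 8 segments) come from the text and Figure 26; the linear ramp between them is an assumption, since the dissertation derived the intermediate steps empirically:

    #define MAX_SEGMENTS 8

    /* Map an interval-average DDI to the maximum useable segments. */
    int ddi_policy(double ddi)
    {
        if (ddi < 0.05) return 2;             /* nearly all independent */
        if (ddi < 0.17)                       /* steeper ramp on the left */
            return 2 + (int)((ddi - 0.05) / (0.17 - 0.05) * 6.0);
        if (ddi <= 0.55) return MAX_SEGMENTS; /* size-sensitive middle range */
        if (ddi < 0.80)                       /* gradual ramp down */
            return MAX_SEGMENTS - (int)((ddi - 0.55) / (0.80 - 0.55) * 6.0);
        return 2;                             /* nearly all dependent */
    }

    /* Gate dispatch when turning on one more segment would exceed the cap.
     * active_segments counts non-empty segments in the IQ, ROB, or LSQ. */
    int dispatch_allowed(int active_segments, int needs_new_segment, double ddi)
    {
        int cap = ddi_policy(ddi);
        return !(needs_new_segment && active_segments + 1 > cap);
    }

A small lookup table indexed by the interval DDI would serve equally well in hardware; the branches above merely make the thresholds explicit.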
3.3 Moderating Front-End Traffic Pressure

To find the most effective method for reducing traffic, three variations on fetch pausing aimed at moderating the front-end pressure were implemented.

3.3.1 Fetch Pause Based on Data Dependence

The first variation pauses fetching on any cycle in which the out-of-order engine has a high level of data dependences:

FetchPause-DDI:
    if (highDDI && fBufferNum > 0) { fetchPause = true; }
    else { fetchPause = false; }

Here the fBufferNum > 0 condition ensures that fetching is not paused if the fetch buffer is empty, so that the engine does not "starve". The highDDI flag is set if the interval DDI is greater than 0.4. Although this variation can slow down traffic when the engine slows down due to a high level of data dependences, it does not consider other microarchitecture latencies.

3.3.2 Fetch Pause Based on Sustained Dispatch Rate

To consider microarchitecture latencies, a second variation on fetch pausing was implemented. Fetching is paused on any cycle in which the fetch buffer occupancy is higher than the sustained dispatch rate (SDR). In this case the SDR is the dispatch rate averaged over intervals of 512 cycles:

FetchPause-SDR:
    if (fBufferNum > SustainedDispatchRate + delta) { fetchPause = true; }
    else { fetchPause = false; }

The delta parameter is a safety margin to "cushion" fast deviations in the SDR and avoid the risk of starving the engine. A delta = 2 was empirically found to be optimal. The SDR is determined by all microarchitecture latencies. The advantage of this fetch pause variation is that it can moderate traffic bursts above the SDR. This concept is illustrated in Figure 27. If the engine has many empty entries and dispatch bursts by the front-end try to fill it up, the bursts are dampened because the fetch buffer is frequently drained by the fetch pause.

3.3.3 Fetch Pause Based on Sustained Dispatch Rate and Occupancy

The third variation is the same as the second except that an issue queue occupancy condition is added. Fetching is paused only when the fetch buffer occupancy is higher than the sustained dispatch rate and, in addition, the issue queue holds at least four full segments' worth of instructions:

[Pipeline diagram: the fetch buffer and the decode and rename stages over cycles 1-4, with fetch paused in cycles 2 and 4.]
Figure 27. Fetch pause based on the fetch buffer occupancy and the sustained dispatch rate (SDR). In cycle 2, fetch is paused because the fetch buffer holds more instructions than the SDR. This can cause the fetch buffer to empty. In cycle 3, fetch is resumed, but no instructions move downstream because the fetch buffer was empty in cycle 2. In cycle 4, fetch is paused again. The behavior can repeat as long as the front-end is trying to push instructions faster than the SDR, in effect dampening traffic bursts above the SDR that can raise occupancy in the out-of-order engine.

FetchPause-SDR+IQ:
    if ((fBufferNum > SustainedDispatchRate + delta) && (IQ_num >= 4 * segSize)) { fetchPause = true; }
    else { fetchPause = false; }

IQ_num is the instruction occupancy in the issue queue and segSize is the segment size (8 for the issue queue). The rationale for adding the issue queue occupancy condition was to avoid the possibility of "starving" the out-of-order engine.
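The three predicates above, together with the interval-averaged SDR they depend on, can be folded into one per-cycle routine. The following C sketch transcribes the pseudocode of this section; the struct and function names are illustrative, not the simulator's, while the 512-cycle interval and delta = 2 follow the text:

    #define SDR_INTERVAL 512
    #define DELTA 2          /* safety margin against starving the engine */
    #define IQ_SEG_SIZE 8

    struct cpu_state {
        int fbuffer_num;     /* instructions in the fetch buffer */
        int iq_num;          /* instructions in the issue queue */
        double interval_ddi; /* DDI averaged over the last interval */
        double sdr;          /* dispatch rate averaged over 512 cycles */
        long dispatched_in_interval, cycles_in_interval;
    };

    /* Called once per cycle after dispatch to maintain the SDR. */
    void update_sdr(struct cpu_state *s, int dispatched_this_cycle)
    {
        s->dispatched_in_interval += dispatched_this_cycle;
        if (++s->cycles_in_interval == SDR_INTERVAL) {
            s->sdr = (double)s->dispatched_in_interval / SDR_INTERVAL;
            s->dispatched_in_interval = 0;
            s->cycles_in_interval = 0;
        }
    }

    enum variant { PAUSE_DDI, PAUSE_SDR, PAUSE_SDR_IQ };

    int fetch_pause(const struct cpu_state *s, enum variant v)
    {
        switch (v) {
        case PAUSE_DDI:     /* FetchPause-DDI: high dependence in the window */
            return s->interval_ddi > 0.4 && s->fbuffer_num > 0;
        case PAUSE_SDR:     /* FetchPause-SDR: buffer holds more than the SDR */
            return s->fbuffer_num > s->sdr + DELTA;
        case PAUSE_SDR_IQ:  /* FetchPause-SDR+IQ: additionally require a busy IQ */
            return s->fbuffer_num > s->sdr + DELTA
                && s->iq_num >= 4 * IQ_SEG_SIZE;
        }
        return 0;
    }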
3.4 Clustering Fetched Instructions

The FetchPause-SDR and FetchPause-SDR+IQ methods are designed to take advantage of an energy saving bonus in the i-cache. Because the energy cost of an i-cache access is the same regardless of how many instructions are fetched, clustering as many instructions as possible into a single fetch is more energy efficient. The second and third variations let the fetch buffer drain before fetching is resumed, creating ample space to cluster instructions. Instruction clustering was addressed by Buyuktosunoglu et al. [41]; however, their technique does not consider the fetch buffer occupancy for effective clustering. Other techniques proposed to reduce traffic did not consider instruction clustering [19, 21, 40].

3.5 Evaluation Framework

3.5.1 Architecture Models

The architecture models were simulated using SimpleScalar [25]. For the base model, SimpleScalar was modified to have a separate issue queue and reorder buffer. An 8-way, performance-driven model with parameters comparable to those used by similar studies, such as [38, 41, 42], was chosen to ease comparison of the results. Table 2 lists the architecture parameters. The issue queue and the reorder buffer were sized just large enough that performance was not degraded (simulations for this work showed that a smaller 32-entry issue queue and 64-entry reorder buffer degraded performance for some benchmarks by as much as 15%).

3.5.2 Power Models

Power consumption was modeled using the Wattch framework [24]. Wattch provides mathematical models for the capacitive loading of different types of processor circuits (RAM, CAM, clock, or complex logic) as a function of architectural unit sizes. The models can estimate dynamic power consumption early in the architectural definition, before committing designs to circuit floor-planning and layout.

The Wattch models were modified first to fit the base model with a separate issue queue and reorder buffer, both with appropriate read and write ports. The models were then decomposed to represent structural segmentation. This allowed only occupied segments to contribute toward total power during any given cycle. These segmentation models were then used for window adjustment and fetch clustering. We also used Wattch power models that are sensitive to data changes and that apply aggressive conditional clocking (cc3 mode). The goal was to start from a base model that is already power efficient and apply the occupancy reduction techniques to reduce power further.
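In effect, the segment-level decomposition scales each unit's Wattch estimate by its number of active segments. A minimal sketch of that accounting, with an assumed per-segment driver-overhead term (the function and parameter names are illustrative, not Wattch's own):

    /* Only occupied segments contribute dynamic power in a cycle. Wattch
     * supplies the per-segment power estimate; each active segment also
     * pays its own array-driver overhead, per the assumption that every
     * segment needs its own drivers. */
    double segmented_unit_power(int active_segments,
                                double per_segment_power,
                                double per_segment_driver_overhead)
    {
        return active_segments *
               (per_segment_power + per_segment_driver_overhead);
    }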
Table 2. Processor parameters used for simulations.

    Fetch Buffer                16 instructions, up to 2 branches
    Decode/Issue/Commit Width   8 instructions
    Issue Queue                 64 entries
    Load-Store Queue            64 entries
    Reorder Buffer              128 entries
    ALU                         8 integer, 2 integer multiply/divide, 4 floating point, 2 floating-point multiply/divide
    I-cache-L1                  2048 sets, direct-mapped, 32 bytes per block, LRU, 2-cycle hit latency
    D-cache-L1                  512 sets, 4-way, 32 bytes per block, LRU, 2-cycle hit latency
    D/I-cache-L2                16K sets, 4-way, 64 bytes per block, LRU, 16-cycle hit latency

3.5.3 Simulation Methods

For a realistic estimate of performance, occupancy, and power consumption, the SPEC CPU 2000 benchmark suite from the University of Minnesota was simulated as the workload [26]. This suite is derived from SPEC CPU 2000 and is recognized by SPEC as a valid tool for simulation-based computer architecture research. The suite uses statistical sampling to maintain the function-level execution patterns, instruction mix, and cache behaviors of the full benchmarks. The version compatible with the "pisa" SimpleScalar architecture was used. The suite contains a mix of compression, graphics, scientific, and database benchmarks [26]. For each benchmark, large input data sets were chosen for simulation, resulting in 700 million to 5 billion instructions per method per benchmark.

The following six methods, designated M1-M6, were applied to each benchmark:

M1: The base model, used to establish performance with no architectural or power model modifications.
M2: Segments were turned off whenever they became empty. No active restriction on window size or fetching was applied.
M3: The out-of-order engine was dynamically adjusted using the DDI policy. This implied simultaneously adjusting the sizes of the issue queue, the reorder buffer, and the load-store queue.
M4: Fetch pause based on the window DDI (FetchPause-DDI) was added to M3.
M5: Fetch pause based on the sustained dispatch rate (FetchPause-SDR) was added to M3.
M6: Fetch pause based on the sustained dispatch rate and the issue queue occupancy (FetchPause-SDR+IQ) was added to M3.

In addition to the performance and power statistics reported by SimpleScalar and Wattch, the simulator was instrumented to collect distribution data about the number of active segments and about instruction fetches. Simulation data was then post-processed to obtain averages over all benchmarks for each of the M1-M6 methods.

4 RESULTS

This chapter reports the simulation results for the methods M1 through M6 described in Chapter 3. Section 4.1 reports instruction occupancy statistics for the out-of-order engine's components. The occupancy averaged over all benchmarks is reported for each method. Moreover, the percentage time distribution for the number of active segments in each of the out-of-order engine components is reported for each method.

The aim of the fetch pausing methods in M4-M6 is to reduce instruction traffic as outlined in Principle 2. To compare the methods' efficacy in reducing traffic, Section 4.2 reports instruction traffic statistics for all methods. Since two of the fetch pause techniques (FetchPause-SDR in M5 and FetchPause-SDR+IQ in M6) are also designed to take advantage of instruction clustering for extra power savings in the i-cache, Section 4.3 reports statistics for instruction clustering for all methods. The clustering statistics include the average number of instructions per single fetch, the distribution of the number of instructions per fetch, and the average fetch event rate.

Section 4.4 reports power consumption statistics generated by the power models for all methods. The power statistics are reported in two formats: the first is power consumption for each component broken down by method, and the second is total processor power consumption for each method broken down by component. The first format helps identify how effective each method is in reducing power consumption for a specific component. The second format shows total processor savings for each method and the relative effect of each component on the total savings.

Finally, Section 4.5 reports the methods' effects on the benchmarks' performance. The percent change in IPC performance is tabulated for each benchmark and for each method. The average performance effect of each method is also given in that section.

4.1 Instruction Occupancy Reduction

The average instruction occupancy for each of the six methods (M1-M6) was calculated for the reorder buffer (ROB), the issue queue (IQ), the fetch buffer (FB), and the load-store queue (LSQ).
The instruction occupancy statistics are based on the architecture models and are independent of the power models. Figure 28 shows the average occupancy reduction for the M2-M6 methods. Turning off empty segments in M2 does not reduce instruction occupancy, since no size or flow restrictions were applied. When the out-of-order engine was downsized in M3 with the DDI policy, instruction occupancy dropped 16-17% in the issue queue, the reorder buffer, and the load-store queue. At the same time, the fetch buffer occupancy remained constant in M3 because no fetch pausing restriction was applied.

The fetch buffer occupancy dropped by 15% in M4 when the FetchPause-DDI condition was added (Figure 28). However, when the sustained dispatch rate condition (FetchPause-SDR) was added in M5, the occupancy dropped the most in all components: 40% in the FB, 33% in the IQ, 27% in the ROB, and 27% in the LSQ (Figure 28). The decrease in the fetch buffer occupancy is a strong indication that the fetch buffer is being drained at a much faster rate. This implies that traffic bursts are being dampened successfully, which is the aim of Principle 2. When the IQ minimum occupancy condition was added to the fetch pause condition (FetchPause-SDR+IQ) in M6, the fetch buffer and the window occupancies dropped substantially as well, but not as much as they did in M5: the FB dropped 21%, the IQ 25%, and the ROB and LSQ 22% (Figure 28).

[Bar chart; y-axis: normalized occupancy from 0.0 to 1.0; x-axis: methods M2-M6; bars for the components FB, IQ, LSQ, and ROB.]
Figure 28. Reduction in instruction occupancy for the FB, the IQ, the LSQ, and the ROB. Occupancies are averaged over all benchmarks and normalized to M1.

As instruction occupancy is reduced, the number of active segments is also reduced. Figure 29 shows the proportions of time (averaged over all benchmarks) for the number of active segments in the IQ, the ROB, and the LSQ. In M2, the IQ spends a large proportion of time with 8 active segments, but in M3 this time shifts in favor of fewer (2-5) active segments. This shift away from large numbers of active segments reaches a maximum in M5, where the IQ spends significantly more time with a small number of active segments, which helps reduce power consumption. The same trend is exhibited in the ROB and the LSQ. The number of active segments was computed based only on the instruction occupancy; that is, it derives from the architectural model and is unrelated to the power models.

[Stacked bar chart; x-axis: % time from 0% to 100%; rows: methods M2-M6 for each of the IQ, the ROB, and the LSQ; shading: number of active segments, from 0 or 1 through 8.]
Figure 29. Percentage of time for the number of active segments for all methods. M1 is omitted as it does not support segmentation.

4.2 Traffic Reduction

To illustrate the effects on front-end instruction traffic, Figure 30 shows the average instruction fetch rate (instructions per cycle) for all methods. M5, with its FetchPause-SDR, reduced the instruction fetch rate the most, from 4 instructions per cycle to 3.5, or about 13%. The next best method in reducing the fetch rate was M6, which is also based on the SDR.

[Bar chart; x-axis: instructions per cycle from 0.0 to 5.0; rows: methods M1-M6.]
Figure 30. The average instruction fetch rate, which is reduced the most in M5.

4.3 Fetch Clustering

In addition to reducing traffic, the goal of methods M4-M6 was to increase the clustering of fetched instructions.
To show the effect on clustering, Figure 31 displays the average number of instructions per single fetch, or i-cache access, for all methods. Pausing fetch based on the window DDI in M4 leads to a small increase in instruction clustering. However, when pausing fetch based on the SDR was applied in M5, the average number of instructions fetched from a single cache line almost doubled compared to M1, from about 7 to 14 instructions. This doubling is significant, especially considering that the fetch buffer size is 16 instructions; that is, the number of instructions fetched in a single access went from less than half to almost the full fetch buffer size. This increase in clustering is also shown in Figure 32, which plots the percentage of fetches by the number of instructions they carry; it shows that in M5 over 80% of fetches carry 13-16 instructions.

[Bar chart; x-axis: instructions from 0 to 16; rows: methods M1-M6.]
Figure 31. Average number of instructions per fetch, which is highest in M5.

[Stacked bar chart; x-axis: % fetches from 0% to 100%; rows: methods M1-M6; shading: instructions per fetch (1-4, 5-8, 9-12, 13-16).]
Figure 32. Percentage of fetches by the number of instructions they carry. In M5, over 80% of fetches cluster 13-16 instructions into one i-cache access.

Another important variable to consider is the fetch activity on the bus that connects to the i-cache. Figure 33 shows this activity as the fetch event rate (fetches per cycle). With effective clustering, bus traffic is reduced. M5 exhibits the lowest fetch activity, which is reduced by 56% compared to M1.

[Bar chart; x-axis: fetch events per cycle from 0.0 to 0.6; rows: methods M1-M6.]
Figure 33. Average fetch event rate, which is also lowest in M5.

4.4 Power Savings

Along with the occupancy and traffic statistics, dynamic power consumption in Watts was collected during the same simulations. Power savings are illustrated in Figure 34, which shows the average power consumption breakdown for the window, the load-store queue, and the i-cache for all methods. The window power component covers both issue queue and reorder buffer activities, which include wakeup and select-logic accesses, dispatching and issuing instructions, physical register accesses, and allocating and deallocating reorder buffer entries. The i-cache power component reflects the energy used for cache line accesses. The result bus and the clock are also included in the figure because their power consumption is related and was reduced. The result-bus component reflects the energy spent transferring instructions and data between the issue queue, the reorder buffer, and the execution units. The clock component is proportional to global chip activities. The "other" component in the figure did not show much power change and includes the power for the branch predictor, the TLB, the rename and decode logic, and the d-cache activities. Figure 35 shows the same data as Figure 34 but grouped by method in a stacked bar graph to highlight savings trends in total processor power consumption.

[Grouped bar chart; y-axis: power (Watts) from 0 to 35; x-axis: components (Window, LSQ, Clock, Result Bus, I-Cache, Other); bars: methods M1-M6.]
Figure 34. Power consumption for each component broken down by method. The window component includes the IQ and the ROB.

[Stacked bar chart; y-axis: power (Watts) from 0 to 90; x-axis: methods M1-M6; stacking: components (Window, LSQ, Clock, Result Bus, I-Cache, Other).]
Figure 35. Total processor power for each method broken down by component, showing total processor savings and the relative effect of each component's savings on the total.
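Before turning to the modeled power numbers, a rough consistency check can be made from the clustering results (this computation is ours, not the dissertation's). If, as the power models assume, an i-cache access costs approximately a fixed energy E regardless of how many instructions it returns, then the access energy charged per delivered instruction is E/k for a cluster size of k:

    M1: k ≈ 7,  so E/k ≈ 0.14 E per instruction
    M5: k ≈ 14, so E/k ≈ 0.07 E per instruction (about a 50% reduction)

This is consistent in magnitude with the 47% i-cache power reduction reported for M5 below, with the small gap attributable to fetches whose clusters fall short of the average.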
When only empty segments were turned off in M2, total processor power consumption went down from 85 W in M1 to 76 W, or 11% (Figure 35). When the out-of-order engine was downsized using the DDI policy in M3, the processor power went down to 70 W, 17% less than M1, mostly due to the window power reduction of 29% compared to M1 (Figure 34). Pausing instruction fetch based on the window DDI in M4 yielded only a slightly greater power reduction than M3. However, pausing fetch based on the SDR in M5 reduced the processor power to 62 W, or 27% less than M1. This substantial power reduction was largely due to window power reduction (40%), helped by additional i-cache power reduction (47%). M5 also reduced the result-bus power, largely due to reduced accesses, and it reduced the clock power due to reduced global instruction activity. The power reduction in M5 was also consistent with M5's reduced traffic and increased fetch clustering. Finally, M6 was second best in power reduction, with a 22% total processor power reduction compared to M1.

4.5 Performance Effect

Turning off empty segments in M2 does not affect performance compared to M1 because it applies no architectural restrictions. M3 and M4 had 1.5% and 1.8% IPC performance loss, respectively. M5, the method with the most power savings, did only slightly worse with a 2.6% IPC performance loss, and M6 performed slightly better than M5 with only a 1.9% IPC performance loss. Table 3 lists the IPC performance for all benchmarks and all methods. For each method, the average IPC over all benchmarks is used to calculate the percent change in performance compared to the M1 method.

Table 3. IPC performance for all benchmarks and for all methods. The bottom row gives the percent change of the average IPC over all benchmarks for each method compared to the M1 method.

    SPEC Benchmark        M1       M2       M3       M4       M5       M6
    164.gzip.graphic      1.9257   1.9257   1.9165   1.9069   1.8683   1.8886
    164.gzip.log          2.2931   2.2931   2.2802   2.2755   2.2623   2.2682
    164.gzip.program      2.2276   2.2276   2.2181   2.2115   2.1987   2.2033
    164.gzip.random       1.2838   1.2838   1.2783   1.2599   1.2122   1.2547
    164.gzip.source       1.7486   1.7486   1.7382   1.7314   1.7083   1.7197
    175.vpr.place         2.4632   2.4632   2.4609   2.4582   2.4194   2.4496
    175.vpr.route         2.5163   2.5163   2.4986   2.4977   2.4816   2.4946
    176.gcc               1.1231   1.1231   1.1112   1.0937   1.0671   1.0832
    177.mesa              4.1002   4.1002   3.8516   3.8621   4.0162   3.9272
    179.art               3.1234   3.1234   3.0290   3.0507   3.0020   3.0507
    181.mcf               1.9645   1.9645   1.9327   1.9351   1.9158   1.9351
    183.equake            3.0728   3.0728   3.0462   3.0460   3.0626   3.0627
    188.ammp              1.4644   1.4644   1.4557   1.4348   1.4248   1.4348
    197.parser            2.1958   2.1958   2.1954   2.1704   2.1360   2.1930
    255.vortex            2.0110   2.0110   1.9633   1.9143   1.7860   1.8300
    256.bzip2.graphic     2.8865   2.8865   2.8391   2.8410   2.8261   2.8410
    256.bzip2.program     2.9001   2.9001   2.8636   2.8663   2.8501   2.8663
    256.bzip2.source      2.8872   2.8872   2.8600   2.8621   2.8481   2.8621
    Average               2.3437   2.3438   2.3077   2.3010   2.2825   2.2980
    % Change vs. M1       ―        0.00%    -1.54%   -1.82%   -2.61%   -1.95%
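The bottom row follows directly from the averages; for M5, for example:

    % change = (2.2825 - 2.3437) / 2.3437 × 100% ≈ -2.61%

The other columns follow the same computation.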
5 DISCUSSION OF RESULTS

5.1 Validation of Principle 1

The objective of reducing instruction occupancy is to create more empty segments in the out-of-order engine's components than would normally exist. Empty segments can then be disabled to reduce power consumption. Two principles, along with a tandem principle to apply them together, were proposed to reduce instruction occupancy in the out-of-order engine. The principles were described in Section 3.1.

To validate Principle 1, the data dependence index (DDI) was conceived, and a token-passing algorithm was developed to measure it (described in Sections 3.2.1 and 3.2.2). Cases 1 and 2 of Principle 1 could then be implemented through the developed DDI policy (described in Section 3.2.5) to downsize the out-of-order engine's components.

Results from simulation method M3, which implemented the DDI measurement and policy, show that instruction occupancy compared to the base model M1 was reduced in all out-of-order engine components by 16-17%. More specifically, the M3 results also show that the out-of-order engine components spent a higher percentage of cycles with fewer active segments. For example, in the segmented base model M2, the engine spent, on average over all benchmarks, less than 15% of cycles with 2-4 active segments and over 30% of cycles with 8 active segments. In M3, the percentage with 2-4 active segments increased to about 30%, while the percentage with 8 segments decreased to about 15%. The fact that this significant decrease in instruction occupancy came at an IPC performance loss of only 1.5% indicates the following:

• Both cases 1 and 2 of Principle 1 are validated. When applied, they reduce instruction occupancy when data dependences become very high or very low.

• The proposed token-passing algorithm and the DDI policy are validated as a means of applying Principle 1 to reduce instruction occupancy.

5.2 Validation of Principle 2 and the Tandem Principle

To validate Principle 2, three techniques were simulated. The objective of each technique with respect to Principle 2 was as follows:

• FetchPause-DDI, simulated via M4, does not comply with Principle 2; its objective was to reduce occupancy based on the DDI state of the instruction window.

• FetchPause-SDR, simulated via M5, fully complies with Principle 2. Its objective was to dampen traffic bursts based on the sustained dispatch rate (SDR).

• FetchPause-SDR+IQ, simulated via M6, partially complies with Principle 2. Its objective was similar to FetchPause-SDR except that a condition for IQ occupancy is added.

All three techniques are described in Section 3.3, and their simulations are described in Section 3.5. Simulation results on standard benchmarks show that instruction occupancy was reduced the most in M5, the technique that fully complies with Principle 2: 40% in the FB, 33% in the IQ, and 27% in the ROB and the LSQ. The partially complying technique in M6 came in second in occupancy reduction: 21% in the FB, 25% in the IQ, and 22% in the ROB and LSQ. The non-complying technique in M4 showed only a modest occupancy reduction over M3. These results indicate the following:

• Since the technique fully complying with Principle 2 showed the most reduction in instruction occupancy, Principle 2 is validated. The fact that the FB occupancy dropped by 40% with this technique also indicates that the fetch buffer was being drained, or emptied, more often. As this happens, traffic bursts into the out-of-order engine cannot be sustained and are dampened, just as outlined by Principle 2.

• The FetchPause-SDR technique is validated. Conditioning the fetch pause on the fetch buffer occupancy and the sustained dispatch rate (SDR) dampened traffic bursts and reduced traffic.

• Since applying the FetchPause-SDR technique in M5 in addition to the DDI policy reduced instruction occupancy more than applying the DDI policy alone in M3, the Tandem Principle is also validated. That is, applying Principle 2 in tandem with Principle 1 significantly reduced instruction occupancy.
5.3 Significance of Clustering Fetched Instructions

Clustering of fetched instructions has a direct effect on reducing i-cache power consumption. Since each i-cache fetch consumes roughly the same amount of energy, efficiency increases if each fetch is made to carry more instructions. Moreover, as more instructions are clustered together in each fetch, fewer fetches are needed for the same number of instructions. This helps reduce traffic activity on the bus connecting the front end to the i-cache.

When the FetchPause-SDR technique pauses fetch to let the fetch buffer drain, it allows space for more instructions to be clustered together when fetching is resumed. The results show that the average number of instructions per fetch almost doubled, from 7 to 14 instructions, in M5 compared to the base model. Since the fetch buffer size is 16 instructions, this increase shows that fetch clustering approached the full fetch buffer size. The results also show that in M5 the fetch event rate, or fetch activity, was reduced by 56%.

5.4 Significance of Power Reduction Results

As expected, the power consumption results from the power models closely tracked the instruction occupancy reduction. As instruction occupancy was reduced, more empty segments were disabled and power was saved. By applying Principle 1 alone with the DDI policy, total processor power consumption in M3 went down by 17%, mostly due to power savings in the IQ and the ROB, whose power consumption was collectively reduced by 29%. However, when the FetchPause-SDR technique was included, as Principle 2 and the Tandem Principle outlined, total power consumption in M5 went down by 27%. As in M3, the IQ and the ROB were large contributors, with 40% power savings. Another contribution to the power savings in M5 came from the i-cache: as fetched instructions were clustered more effectively, the i-cache power models indicate that its power consumption was reduced by 47%. The result bus and the global clock were also important contributors. As more segments were disabled, less signaling was required, so global clocking and bus activities were reduced as well.

These reported power savings are in comparison to base models that already include some power efficiency measures. The chosen Wattch power models for this work include conditional clocking of accessed unit ports; that is, only accessed ports on a unit contribute toward its power consumption. The chosen models also include the overhead for the conditional clocking. The reported power savings also include the overhead for segmentation. Care was taken to include driver power overhead per segment; that is, the assumption was made that each segment needs its own array drivers.

Although the power savings reported here are significant, they are not by themselves a solution to the processor power problem. This work addresses only dynamic power consumption, but as the clock rate and the number of devices on a chip continue to increase with each technology generation, both dynamic and static power consumption will continue to increase. The power savings reported here at the microarchitecture level, in conjunction with other power saving measures at other design levels, will help curb the power increase.
They will help reduce cost, increase reliability, reduce environmental impact, and allow continued performance growth, at least until a more radical solution is found.

5.5 Comparative Discussion

This thesis reports a significant reduction in instruction occupancy in all pipeline stages, front to back. So far, it is the only known work that reports such statistics for power efficiency. Moreover, this thesis reports statistics for clustering fetched instructions. Only Ponomarev et al. [15] report power savings in all of the out-of-order engine's components. Although they use segmentation to reduce power, their technique is passive in occupancy reduction. They do not report instruction occupancy or total processor power savings, and they report a performance loss of 5%, nearly twice the performance loss in M5. Folegnani et al. [38] use the IPC to actively disable IQ segments; they do not report instruction occupancy statistics or total processor power consumption. Karkhanis et al. [19] also use the IPC to limit the instruction count, but instruction occupancy statistics are not given, nor are total processor power savings. To compare techniques that reduce the instruction count in the processor, at least the instruction occupancy should be reported for all major pipeline components.

Traffic reduction techniques by Manne et al. and Baniasadi et al. report traffic reduction statistics [21, 40]. However, traffic reduction does not necessarily reduce instruction occupancy, as explained in Section 3.1.2. Reducing occupancy is more effective in reducing power in segmented structures. Reporting total processor power consumption would help compare technique effectiveness, especially for techniques that share the same power models. It would also be useful if component power consumption were reported to show the effect of each component on the total power savings. The work by Buyuktosunoglu et al. [41] is the only known work based on traffic reduction that reports power statistics. Their reported total power savings of 10% and i-cache power savings of 35% are less than those reported in this thesis for a similar performance loss.

6 CONCLUSION & FUTURE RESEARCH

6.1 Conclusion

Processor power consumption has been steadily increasing with each technology generation, and this trend is constraining the performance of new designs. High power consumption results in reduced chip reliability, increased cost of chip packaging and cooling, and increased cost of operation. Moreover, with the current growth rate of personal computing, high power consumption may not be environmentally sustainable. The ITRS has declared the increasing trend in power consumption a grand challenge. In effect, this trend must be curbed for continued growth in performance.

Both the dynamic and the static components of processor power consumption are on an increasing trend. The dynamic component is increasing due to high clock rates, the increased number of devices on a chip, and increased switched capacitance. The static component is increasing due to increased leakage current and the increased number of devices on the chip. Not only is the chip's total power consumption increasing, but the power density per unit area is increasing as well. The high power density results in localized hotspots that vary spatially and temporally and can cause timing errors or even physical damage.

Reductions in power consumption are being pursued at all design levels.
At the device level, new materials are being sought that can draw less current than present CMOS-based devices. At the circuit level, techniques are proposed to reduce the switched load capacitance of devices and connections. At the microarchitecture level, static and adaptive techniques are proposed to maximize processor efficiency. And at the system and software levels, techniques are being investigated that can detect idle cycle times and transition I/O devices, the processor, and memory into low power states.

For the processor, the out-of-order engine's components consume a large proportion of the total power on the chip. To achieve high performance, large CAM and RAM arrays are needed for the out-of-order issue, and multi-ported RAM arrays are needed for the reorder buffer and load-store queue. However, the circuit complexity of such arrays grows superlinearly with size [39]. In fact, the match and tag lines of the CAM array and the bit and word lines of the RAM array all carry high capacitive loads, which contributes to their high power consumption.

Previously reported microarchitecture techniques have shown that segmenting the out-of-order engine and shutting down empty segments can reduce power consumption. However, these techniques have not reached their full potential. In one technique, segments in an out-of-order engine component can be disabled to save power whenever they become empty, but no attempt is made to limit the number of active segments that don't contribute to performance [15]. In other techniques, occupancy is actively reduced when it doesn't contribute to performance, but only in the issue queue and based on the IPC [19, 38]. Relying on the IPC requires periodic upsizing and long tuning periods that can miss power saving opportunities. Another group of techniques relies on reducing instruction traffic, but reducing traffic can still leave the engine's queues congested, especially when the queues' service rate is slow [21, 40, 41].

The work in this thesis builds on previous techniques to overcome some of their limitations. It offers the following contributions to reduce processor power consumption:

• Principle 1 is proposed to actively and gradually limit the number of active segments in the out-of-order engine whenever data dependences reduce the engine's effectiveness in issuing instructions out of order. The advantage of this principle is that it can reduce power consumption not only when data dependences are high, but also when they are low.

• A new data dependence index (DDI) is proposed to measure data dependences in a single number.

• A token-passing algorithm is proposed to dynamically measure the DDI. The advantage of this algorithm is that it does not require periodic upsizing and can be applied at short intervals to increase power savings.

• A DDI policy is proposed to apply Principle 1 to limit the number of active segments.

• Principle 2 is proposed to dampen traffic bursts and moderate the front-end pressure in order to reduce traffic. The advantage of dampening traffic bursts is that it helps queues maintain reduced occupancy even if the service rate is slow.

• A fetch pause technique is proposed to implement Principle 2 based on the sustained dispatch rate (SDR) and the fetch buffer occupancy. The advantages of this technique are its low complexity and its ability to pause fetching on a cycle-by-cycle basis.

• The proposed SDR fetch pause technique offers effective instruction clustering from the i-cache.
This clustering translates into increased efficiency in the i-cache and its connection bus.

• A Tandem Principle is proposed to apply Principles 1 and 2 at the same time to further reduce instruction occupancy and increase efficiency. The two Principles and their methods operate independently but in tandem, matching resources to thread behavior changes that are both microarchitecture-independent and microarchitecture-dependent. Principle 1 responds to data dependence changes, which are microarchitecture-independent. Principle 2 responds to changes that are microarchitecture-dependent.

• The application of the two Principles offers reduced occupancy in all pipeline stages, front to back: 40% in the fetch buffer, 33% in the issue queue, and 27% in the reorder buffer and the load-store queue. The SDR fetch pause technique offers clustering of nearly a full fetch buffer per access.

• This thesis offers statistics for the reduction in total dynamic power consumption for a single-threaded, high performance processor, and for all of its major components. This helps to show the effect of each component's dynamic power reduction on the total.

• The proposed techniques offer 27% savings in total dynamic power consumption with a performance loss of only 2.6%.

6.2 Future Research

6.2.1 Applications to Power Efficiency

Static Power Consumption: In addition to rising dynamic power consumption, which is addressed in this thesis, static power consumption has also been rising due to the reduced supply and threshold voltages of deep sub-micron technologies (Section 1.4). The DDI policy and the FetchPause-SDR were applied in this work to reduce instruction occupancy, thereby creating additional empty segments that can be disabled to reduce dynamic power consumption. Static power is still consumed by those disabled segments, however, unless a circuit technique is applied to reduce it. But first, circuit models have to be developed to accurately estimate how much static power is consumed by the out-of-order engine's CAM and RAM arrays and how much of it can theoretically be saved.

Adaptive microarchitecture techniques have been proposed to reduce static power consumption in caches. In the i-cache, only a small subset of cache lines is active and accessed during any interval. Thus, other cache lines can be turned off to stop or reduce their leakage current. Examples of these techniques are Powell et al. [62] and Flautner et al. [63]. In [62], a wide NMOS transistor is used to cut off the supply voltage, Vdd, for the SRAM cells in unused lines. In [63], unused cache lines are put into a drowsy, low-leakage state by lowering the supply voltage. The advantage of the drowsy state is that cache lines in that state maintain their data. In the gated-Vdd technique, turned-off lines lose their data; the data then has to be fetched again from the L2 caches, which can increase dynamic power consumption. For the out-of-order engine's components, a gated-Vdd technique could be used if maintaining state in disabled segments is not required. Otherwise, a drowsy-state technique could be used.

Mobile Processors: The DDI policy and FetchPause-SDR target the dynamic power consumption of a high-end, high performance processor. However, the performance of mobile and embedded processors has been increasing and steadily approaching that of high-end processors. The constraint for mobile and embedded processors is performance per Watt.
A question worthy of study is how applicable the DDI policy and fetch pausing are to a mobile processor with somewhat reduced microarchitecture parameters. The hypothesis in this case is that reducing instruction occupancy in mobile processors can lead to power efficiency and long battery life.

6.2.2 Applications to Performance Improvement

Thread Assignment in Heterogeneous Multi-Processors: To take advantage of the increasing number of on-chip devices offered by each technology generation, computer architects are exploiting thread-level parallelism by offering multicores, or chip multi-processing (CMP). A CMP with small, low-power cores can produce a higher thread throughput than a single large, high-power core. Thus, a CMP offers higher system throughput per unit power than a single core. For servers with a high thread count, a large number of low-power cores is desirable to increase the ratio of throughput per unit power. But for desktops and laptops with a low thread count, the performance of a single thread is more important; thus, fewer large, high-power cores are more desirable. A heterogeneous multi-processor (HMP) has been proposed to get the best of both worlds by having a mix of large, complex cores and small, simple cores [64]. When a thread's performance is not helped by a large, complex core, the thread can be dynamically switched to a small, low-power core, which consumes less power. Since the smaller cores take up less chip area than complex ones, an HMP can have more cores than a CMP with large cores for the same fixed chip area. Thus, an HMP can also have a higher throughput than a CMP with only large, complex cores.

One challenge in getting the most power efficiency and performance from an HMP is the assignment of threads to cores. Proposed assignment techniques so far have had limitations. A dynamic assignment policy was proposed based on the ratios of IPC for different threads running on different cores [65]. To obtain the IPC ratios, all threads have to be run on all cores for an extended period of time, which can limit this technique's usefulness: by the time all the ratios are obtained, thread behavior may already have changed, and performance can suffer from sub-optimal assignments. Another technique proposes an assignment metric based on the geometric distance between inherent thread characteristics and core configurations [23]. But this technique allows only static assignment and does not consider data dependences as an inherent characteristic.

Data dependences can play an essential role in dynamic thread assignment in an HMP. For example, if a thread's instructions line up in a long data dependence chain, the core configuration does not matter much, and a simple, in-order issue core can be sufficient for optimal performance. The sustained dispatch rate (SDR) can also help in thread-to-core assignment. For example, a low DDI (low data dependences) combined with a low SDR indicates that the core microarchitecture, such as the caches, is causing a bottleneck; a thread assignment to a core with larger caches would then be more suitable. Future research can explore the combination of the DDI and the SDR as metrics for dynamic assignment in an HMP.

Simultaneous Multi-Threading (SMT): In SMT, two or more threads can execute simultaneously on the same core, sharing the instruction window. This can improve throughput as long as the sharing threads don't interfere with each other's execution.
Thus, dynamically picking non-interfering threads from an available pool to share a core is critical for optimal throughput. Data dependences constitute an essential and inherent characteristic that should be taken into consideration for sharing threads. For instance, if two threads have high levels of data dependences, then their performance is insensitive to window size and they can share a core without interference. If they both have mid-level data dependences, such that their performance is sensitive to window size, then their sharing might interfere with their performance. In this case, throughput would benefit if one of the two threads were swapped out for another, non-interfering thread. Future research can explore how the data dependence index (DDI), along with the sustained dispatch rate (SDR), can be used to dynamically determine whether threads' execution would interfere on any given core.

Out-of-order Instruction Scheduling: Scheduling techniques have been proposed to reduce the complexity of the out-of-order issue and improve performance [49, 53-56]. However, scheduling instructions might be wasteful in two cases: (1) when data dependences are high, instructions cannot be scheduled out of order anyway; and (2) when data dependences are very low, the out-of-order issue can easily find enough instructions to keep the multiple-issue pipeline flowing without scheduling. Thus, future research can explore scheduling techniques' tradeoff between performance and power. In a heterogeneous multiprocessor (HMP) system, for example, one core can be high-power with complex scheduling, and another core can be low-power without scheduling. A metric such as the DDI can then help decide to which core a thread should be assigned for optimal performance and power efficiency.

Dynamic Voltage and Frequency Scaling (DVFS): The out-of-order engine's components have high circuit complexity that limits the maximum clock frequency on the chip [39]. The limitation stems from the large switched capacitance of the long bit-lines within the component arrays. Dynamic downsizing of the out-of-order engine by segmentation and disabling empty segments reduces the switched capacitance of the bit-lines. Future research can explore whether this reduced switched capacitance can be exploited by DVFS to increase frequency or voltage when segments are disabled, to improve performance. The DDI and the SDR can help decide when voltage and frequency should be scaled up or down. For example, low DDI and high SDR values imply that data dependences are low and the microarchitecture is not a bottleneck; in this case, the voltage and frequency can be scaled up to improve performance. But a low DDI and a low SDR imply that the microarchitecture is causing a bottleneck (perhaps due to i-cache misses), in which case the voltage and frequency should be kept low to save power.
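An illustrative decision rule combining the two metrics follows, in C. The dissertation proposes the idea but no concrete settings, so the thresholds and names here are hypothetical:

    enum dvfs_action { SCALE_UP, SCALE_DOWN, HOLD };

    /* Sketch of a DDI/SDR-driven DVFS hint, per the discussion above.
     * max_dispatch is the machine's peak dispatch rate; the 0.17, 0.75,
     * and 0.25 cutoffs are assumed values, not from the dissertation. */
    enum dvfs_action dvfs_hint(double ddi, double sdr, double max_dispatch)
    {
        int low_ddi  = ddi < 0.17;                 /* few dependences */
        int high_sdr = sdr > 0.75 * max_dispatch;  /* pipeline keeping up */
        int low_sdr  = sdr < 0.25 * max_dispatch;  /* bottlenecked elsewhere */

        if (low_ddi && high_sdr)
            return SCALE_UP;    /* parallelism available, no bottleneck:
                                   raise voltage/frequency for performance */
        if (low_ddi && low_sdr)
            return SCALE_DOWN;  /* microarchitecture (e.g., i-cache misses)
                                   is the bottleneck: save power */
        return HOLD;
    }

The same two inputs could drive a thread-to-core assignment policy in an HMP, substituting core choices for the voltage and frequency actions.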
Baez, "Reducing power in high-performance microprocessors," in (DAC 98) the 35th annual conference on Design automation. San Francisco, California, United States: ACM Press, 1998. [2] K. Skadron, M. R. Stan, H. Wei, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware computer systems: Opportunities and challenges," Micro, IEEE, vol. 23, pp. 52-61, 2003. [3] ITRS (International Technology Roadmap for Semiconductors), "Executive Summary," 2005. [4] D. Burger and J. R. Goodman, "Billion-Transistor Architectures," Computer, vol. 30, pp. 46-49, 1997. [5] D. Burger and J. R. Goodman, "Billion-Transistor Architectures: There and Back Again," Computer, vol. 37, pp. 22-28, 2004. [6] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyuktosunoglu, J. Wellman, V. Zyuban, M. Gupta, and P. W. Cook, "Power-aware microarchitecture: design and modeling challenges for next-generation microprocessors," Micro, IEEE, vol. 20, pp. 26-44, 2000. [7] Gartner Inc., "Gartner Says More than 1 Billion PCs In Use Worldwide and Headed to 2 Billion Units by 2014," Gartner Inc., 2008. [8] U.S. Department of the Interior Bureau of Reclamation, "Hoover Dam Frequently Asked Questions and Answers." [9] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, A System Perspective, 2nd ed: Addison Wesley, 1994. [10] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel, "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, Q1, vol. 5, 2001. [11] S. Borkar, "Gigascale Integration-Challenges and Opportunities", 2007, http://software.intel.com/en-us/articles/gigascale-integration-challenges-andopportunities/, Accessed 6/15/2011. [12] Intel Inc., "Intel® Xeon® Processor 5000 Sequence." 89 [13] A. Grove, "Changing Vectors of Moore's Law," presented at The International Electron Devices Meeting (IEDM), Keynote Speaker, 2002. [14] K. Skadron, M. R. Stan, W. Huang, V. Sivakumar, S. Karthik, and D. Tarjan, "Temperature-aware microarchitecture," in (ISCA 03) 30th Annual International Symposium on Computer Architecture, 2003, pp. 2-13. [15] D. Ponomarev, G. Kuck, and K. Ghose, "Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources," in Proceedings of the International Symposium on Microarchitecture 2001. [16] D. V. Ponomarev, G. Kucuk, O. Ergin, K. Ghose, and P. Kogge, "Energy-Efficient Issue Queue Design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, 2003. [17] D. H. Albonesi, R. Balasubramonian, S. G. Dropsbo, S. Dwarkadas, F. G. Friedman, M. C. Huang, V. Kursun, G. Magklis, M. L. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. W. Cook, and S. E. Schuster, "Dynamically tuning processor resources with adaptive processing," Computer, vol. 36, pp. 49-58, 2003. [18] D. Folegnani and A. Gonzlez, "Energy-Effective Issue Logic," in (ISCA 01) 28th Annual International Symposium on Computer Architecture. Göteborg, Sweden: ACM Press, 2001. [19] T. Karkhanis, J. E. Smith, and P. Bose, "Saving Engergy with Just In Time Instruction Delivery," in ISLPED. Monterey, California, USA: ACM, 2002. [20] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and Exploiting Program Phases," IEEE Micro, vol. 23, pp. 84-93, 2003. [21] S. Manne, A. Klauser, and D. Grunwald, "Pipeline gating: speculation control for energy reduction," in (ISCA 98) Proceedings of the 25th Annual International Symposium on Computer Architecture. 
[22] Z. Youssfi and M. Shanblatt, "Instruction-Window Power Reduction Using Data Dependence Metric," in Workshop on Introspective Architecture, in conjunction with HPCA-12, International Symposium on High-Performance Computer Architecture, Austin, Texas, 2006.
[23] J. Chen and L. K. John, "Efficient program scheduling for heterogeneous multi-core processors," in Proceedings of the 46th Annual Design Automation Conference, San Francisco, California: ACM, 2009.
[24] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," in (ISCA 00) the 27th Annual International Symposium on Computer Architecture, 2000.
[25] T. Austin and D. Burger, "The SimpleScalar Tool Set, ver. 3.0," Technical Report, University of Wisconsin, 1999.
[26] A. KleinOsowski and D. J. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research," Computer Architecture Letters, vol. 1, 2002.
[27] S. Borkar, "Gigascale Integration-Challenges and Opportunities," 2004, http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/itanium/reference/182440.htm.
[28] J. Markoff, "Remapping Computer Circuitry to Avert Impending Bottlenecks," The New York Times, 2011.
[29] H. Yenpo, G. M. Huang, and L. Peng, "Dynamical Properties and Design Analysis for Nonvolatile Memristor Memories," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, pp. 724-736, 2011.
[30] H. Yan, H. S. Choe, S. Nam, Y. Hu, S. Das, J. F. Klemic, J. C. Ellenbogen, and C. M. Lieber, "Programmable nanowire circuits for nanoprocessors," Nature, vol. 470, pp. 240-244, 2011.
[31] I. Daniele, L. L. Andrea, and M. Davide, "Recovery and Drift Dynamics of Resistance and Threshold Voltages in Phase-Change Memories," IEEE Transactions on Electron Devices, vol. 54, pp. 308-315, 2007.
[32] S. Schuster, P. Cook, H. Jacobson, and P. Kudva, "Low-Power, High-Performance Circuit Techniques," IEEE Micro, vol. 20, no. 6, pp. 32-33, 2000.
[33] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas, "Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures," in (MICRO 00) the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, Monterey, California: ACM Press, 2000.
[34] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, W. Youfeng, L. Jin, and D. Brooks, "Dynamic-Compiler-Driven Control for Microprocessor Energy and Performance," IEEE Micro, vol. 26, pp. 119-129, 2006.
[35] ACPI, "Advanced Configuration and Power Interface Specification," 2010, http://www.acpi.info/DOWNLOADS/ACPIspec40a.pdf, accessed 6/15/2011.
[36] D. W. Wall, "Limits of Instruction-Level Parallelism," in International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[37] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed., Morgan Kaufmann, 2007.
[38] D. Folegnani and A. González, "Energy-effective issue logic," in (ISCA 01) 28th Annual International Symposium on Computer Architecture, Göteborg, Sweden: ACM Press, 2001.
[39] S. Palacharla, N. P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors," in (ISCA 97) the 24th Annual International Symposium on Computer Architecture, 1997.
[40] A. Baniasadi and A. Moshovos, "Instruction Flow-Based Front-end Throttling for Power-Aware High-Performance Processors," in ISLPED, Huntington Beach, California: ACM, 2001.
[41] A. Buyuktosunoglu, T. Karkhanis, D. H. Albonesi, and P. Bose, "Energy Efficient Co-Adaptive Instruction Fetch and Issue," in (ISCA 03) 30th Annual International Symposium on Computer Architecture, 2003.
[42] Y. Bai and R. I. Bahar, "A Dynamically Reconfigurable Mixed In-Order/Out-of-Order Issue Queue for Power-Aware Microprocessors," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI'03), 2003.
[43] M. Huang, J. Renau, and J. Torrellas, "Energy-Efficient Hybrid Wakeup Logic," in (ISLPED'02) International Symposium on Low-Power Electronics and Design, 2002.
[44] M. Ramirez, A. Cristal, A. V. Veidenbaum, L. Villa, and M. Valero, "A New Pointer-based Instruction Queue Design and Its Power-Performance Evaluation," in International Conference on Computer Design (ICCD), IEEE, 2005.
[45] A. R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, "A Large, Fast Instruction Window for Tolerating Cache Misses," in (ISCA 02) 29th Annual International Symposium on Computer Architecture, 2002.
[46] T. Moreshet and R. I. Bahar, "Power-Aware Issue Queue Design for Speculative Instructions," in (DAC'03) Annual ACM/IEEE Design Automation Conference, 2003.
[47] E. Brekelbaum, J. Rupley II, C. Wilkerson, and B. Black, "Hierarchical Scheduling Windows," in (MICRO-35) International Symposium on Microarchitecture, 2002.
[48] D. Ernst and T. Austin, "Efficient Dynamic Scheduling Through Tag Elimination," in (ISCA 02) 29th Annual International Symposium on Computer Architecture, 2002.
[49] R. Canal and A. González, "A Low-Complexity Issue Logic," in (ICS 00) International Conference on Supercomputing, 2000.
[50] J.-L. Cruz, A. González, M. Valero, and N. P. Topham, "Multiple-Banked Register File Architectures," in (ISCA'00) the 27th Annual International Symposium on Computer Architecture, 2000.
[51] M. Goshima, K. Nishino, T. Kitamura, Y. Nakashima, S. Tomita, and S.-i. Mori, "A High-Speed Dynamic Instruction Scheduling Scheme for Superscalar Processors," in (MICRO'01) the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[52] M. D. Brown, J. Stark, and Y. N. Patt, "Select-Free Instruction Scheduling Logic," in (MICRO-34) 34th International Symposium on Microarchitecture, 2001.
[53] R. Canal and A. González, "Reducing the Complexity of the Issue Logic," in (ICS 01) International Conference on Supercomputing, 2001.
[54] B. Fields, S. Rubin, and R. Bodik, "Focusing Processor Policies via Critical-Path Prediction," in (ISCA 01) 28th International Symposium on Computer Architecture, 2001.
[55] P. Michaud and A. Seznec, "Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors," in International Symposium on High-Performance Computer Architecture, HPCA-8, 2001.
[56] D. Ernst, A. Hamel, and T. Austin, "Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay," in Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03), 2003.
[57] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye, "Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower," in (ISCA'00) the 27th Annual International Symposium on Computer Architecture, 2000.
[58] G. Cai, "Architectural Level Power/Performance Optimization and Dynamic Power Estimation," in Cool-Chips Tutorial, An Industrial Perspective on Low Power Processor Design, in conjunction with MICRO-32, 1999.
Sohi, "Microarchiectural Innovations: Boosting Microprocessor Performance Beyond Semiconductor Technology Scaling," Proceedings of the IEEE, vol. 89, pp. 1560-1576, 2001. [60] A. Roth, R. Ronen, and A. Mendelson, "Dynamic Technique for Load and Load-User Scheduling," Proceedings of the IEEE, vol. 89, pp. 1621-1638, 2001. [61] M. S. Hrishikesh, N. P. Jouppi, K. I. Farkas, D. Burger, S. W. Keckler, and P. Shivakumar, "The Optimal Logic Depth Per Pipeline State is 6 to 8 FO4 Inverter Delays," (ISCA 02) 29th Annual International Symposium on Computer Archiecture, 2002. [62] M. Powell, S.-h. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories," in ISLPED, 2000. [63] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy Caches: Simple Techniques for Reducing Leakage Power," in Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002. [64] R. Kumar, D. M. Tullsen, N. P. Jouppi, and P. Ranganathan, "Heterogeneous chip multiprocessors," Computer, vol. 38, pp. 32-38, 2005. [65] M. Becchi and P. Crowley, "Dynamic Thread Assignment on Heterogenious Multiprocessor Architectures," in Computing Frontiers, 2006. 94