This is to certify that the
dissertation entitled

ACTIVITY-AWARE MODELING AND DESIGN
OPTIMIZATION OF ON-CHIP SIGNAL INTERCONNECTS

presented by

KRISHNAN SUNDARESAN

has been accepted towards fulfillment
of the requirements for the

Ph.D. degree in Electrical Engineering

 


 

Major Professor's Signature
I Z / ‘1' /7.0 o C

 

Date

MSU is an Affirmative Action/Equal Opportunity Institution

 


ACTIVITY-AWARE MODELING AND DESIGN
OPTIMIZATION OF ON-CHIP SIGNAL
INTERCONNECTS
By

Krishnan Sundaresan

A DISSERTATION
Submitted to
Michigan State University

in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering

2006

ABSTRACT

ACTIVITY-AWARE MODELING AND DESIGN OPTIMIZATION OF
ON-CHIP SIGNAL INTERCONNECTS

By

Krishnan Sundaresan

On-chip global signal bus energy dissipation, thermal reliability, and latency
all depend on transmitted word values. Real-world microprocessor workloads
cause bus traffic that exhibits significant spatial, temporal, and value locality.
However, existing signal interconnect modeling and optimization schemes are
oblivious to the correlated nature of such traffic and were developed with random
or worst-case (highly-changing) traffic conditions in mind, which limits their
effectiveness. To address this, we present activity-aware methods to model and
optimize bus energy dissipation, thermal reliability, and latency.

In the area of modeling, we present an activity-aware bus energy and thermal
model that permits monitoring of energy dissipation and temperature, both spatially
(horizontally across wires and longitudinally along individual wires) and temporally,
during microarchitectural simulation of real programs. We find that final
temperatures of wires in global signal buses carrying data (instruction) in the
processor core increase by as much as 37 (58) degrees Celsius during a simulation
run of only a billion instructions in 130-nm (45-nm) fabrication technology. We also
find that highly-active wires in these buses attain absolute temperatures of up to
104 (123.7) degrees in 130-nm (45-nm) processors, which is higher than the
100-degree temperature typically assumed during interconnect design. In addition,
wire temperature gradients between the sending and receiving ends, with magnitudes
of 16 to 25 degrees, were also detected. These conditions were found to degrade
processor performance by at least 4% (11.92%) in 130-nm (45-nm) processors.

In bus design, we present a traffic-profile-guided approach to optimize bus
energy subject to designer-specified thermal constraints and to reduce worst-case
bus crosstalk and latency conditions. Our methodology does so by evaluating
several options for signaling individual bit values and all possible ways of mapping
bits to bus lines (bit ordering), and then choosing, based on traffic value
characteristics, an optimal encoding scheme (the combination of bit signaling and
ordering) statically at design time to support in hardware. Our energy-optimal
static encoding techniques provide bus energy reductions of 30.2% (52.1%) for
processor core data (instruction) buses, compared to existing more-complex dynamic
encoding schemes that yield only 4.19% (5.32%) reductions for the same buses. Our
static encoding technique with thermal constraints added during optimization reduces
peak wire temperatures by up to 12.26 (12.96) degrees for data (instruction) buses,
while still providing significant energy savings. Finally, we also present a static
encoding technique that reduces worst-case bus crosstalk conditions by at least
29.35% and a variable-cycle bus architecture that takes advantage of this reduced
crosstalk to improve bus performance by 17.42%.

Our work represents a significant advancement over existing approaches
that are activity-oblivious and/or consider worst-case traffic conditions. The
microarchitecture-level activity-driven spatiotemporal bus energy and thermal model
we present is the first of its kind. Our static value-aware bit reordering and
signaling techniques are also highly-novel solutions that work remarkably well in
real applications.

Dedicated to Mom, Dad, and Deepa,

for their unending love, support, and encouragement

 

ACKNOWLEDGEMENTS

The completion of this research and writing of this dissertation has been one of
the most significant academic challenges that I have ever had to face. Without the
support, guidance, and patience of many people this endeavor would not have been
possible. I owe my thanks to all of them.

I have been fortunate to learn from many excellent teachers, from grade to
graduate school, and I am indebted to all of them for helping me reach where I am
today. In particular, I thank my advisor, Dr. Nihar Mahapatra, for his technical
guidance and support over the last five years. His mentorship has instilled in me the
skill and confidence to identify, analyze, and efficiently solve research problems and
present results in a clear and lucid manner. I have also learnt much from his classes
and from our research meetings and discussions. I also thank my dissertation
committee members, Dr. Anthony Wojcik, Dr. Andrew Mason, and Dr. Peixin Zhong, for
their very insightful review and comments, which have helped me improve this work.

I have also been fortunate to be in the company of a lot of good friends, many of
them my lab-mates, and I thank them all for their support. Jiangjiang Liu helped me
get my feet wet in research and was a great colleague during the early years. Kaushal
Gandhi and Srivathsan Krishnamohan have been great friends and lab-mates and
I have benefited greatly from many technical discussions I have had with them. I
cherish their friendship and the good times we had together, and look forward to more
Friday-night pizza-and-beer get-togethers in the Bay Area, where all three of us are
starting our professional careers.

My family (Mom, Dad, and my sister) has been a great source of encouragement
through the years, and their continuing love and affection has made me what I am
today. I owe much more to them than what a few sentences can express. This
dissertation is dedicated to them.

Last but not least, I thank all members of the Greater Lansing Bhagavad Gita
group for their good thoughts and prayers. My association with them has helped
me keep up my sanity during these years and taught me to live by the Bhagavad
Gita’s motto: yogah karmasu kausalam, “Efficiency in Action leads to (the Ultimate)
Knowledge.”


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
SELECTED LIST OF SYMBOLS

1 INTRODUCTION AND OVERVIEW
1.1 Interconnect Scaling Trends: Delay, Power, Temperature, and Reliability
1.2 Material, Process, and Architectural Advances
1.3 Impact of Interconnects on Architecture and VLSI
1.3.1 Wire Delay
1.3.2 Power and Temperature
1.3.3 Computer-Aided Design Tool Requirements
1.4 Drawbacks in Existing Techniques
1.5 The Need for Activity-Aware Design
1.6 Our Contributions
1.6.1 Activity-Aware Design Methodology
1.6.2 Accurate Energy, Temperature, and Delay Modeling
1.6.3 Profile-Guided Optimization Techniques
1.6.4 Novel Thermal Optimization Methodology
1.6.5 Performance-Oriented Adaptive Bus Design
1.7 Dissertation Outline

2 PRELIMINARIES
2.1 Interconnect Analysis Methods
2.1.1 Global, Semiglobal, and Local Wires
2.1.2 Interconnect Models: RC and RLC
2.1.3 Effect of Inductance on Global Signal Lines
2.1.4 Energy Estimation
2.1.5 Delay and Performance
2.2 Interconnect Optimization Techniques
2.2.1 Data Encoding
2.2.2 Wire Spacing and Shielding
2.3 Architecture-Level Simulators and Early-Stage Design
2.4 Our Experimental Methodology
2.4.1 Interconnect Geometry and Technology Data
2.4.2 Parasitic Capacitance Extraction
2.4.3 Simulation Infrastructure and Verification of its Correctness
2.4.4 Target Systems and Benchmarks

3 ACTIVITY-DRIVEN ENERGY AND TEMPERATURE MODEL
3.1 Introduction
3.2 Related Work and Our Contributions
3.3 Bus Line Energy Dissipation Model
3.3.1 Energy Dissipated due to Line Self Capacitance
3.3.2 Energy Dissipated due to Inter-Wire Capacitance
3.3.3 Distributed-RC Line Energy Model
3.4 Thermal Model
3.4.1 Chip Thermal Structures and Heat Transfer
3.4.2 Detailed Thermal Model
3.4.3 Steady-State Thermal Model
3.5 Simulation Environment and Methodology
3.5.1 Benchmarks and Sample Sizes
3.5.2 Thermal Warmup and Initial Temperatures
3.5.3 Granularity of Thermal Simulation
3.6 Experiments and Results
3.6.1 Energy Dissipation in Processor Buses
3.6.2 Correlation between Energy and Temperature
3.6.3 Final and Peak Wire Temperatures
3.6.4 Wire Temperature Gradients
3.7 Summary

4 DATA- AND TEMPERATURE-DEPENDENT DELAY VARIABILITY MODEL
4.1 Introduction
4.2 Related Work and Our Contributions
4.3 Temperature Dependent Delay Variability Model
4.3.1 Wire Delay Considering Temperature Impact
4.3.2 Wire Delay Variability Considering Crosstalk and Temperature
4.4 Results and Discussion
4.4.1 Maximum Wire Temperatures and Gradients
4.4.2 Frequency of Timing Violations
4.4.3 Performance Impact
4.5 Summary

5 ACTIVITY-AWARE ENERGY AND TEMPERATURE OPTIMIZATION
5.1 Introduction
5.1.1 Need for Energy and Temperature Aware Bus Design
5.1.2 Key Contributions and Results
5.2 Related Work
5.3 Methodology
5.3.1 Target Scenarios
5.3.2 Bus Layout and Wire Geometry
5.4 Static Techniques for Bus Energy and Temperature Optimization
5.4.1 Choice of Signaling Modes
5.4.2 Minimum Energy Signaling (MES)
5.4.3 Minimum Energy Bit Ordering (MEBO)
5.4.4 Simultaneous Bit Ordering and Signaling (SBOS)
5.4.5 Thermal Optimization Methodology
5.4.6 Routing Overheads
5.5 Results and Discussion
5.5.1 Energy Dissipation in Processor Buses
5.5.2 Energy Reduction for General-Purpose Design
5.5.3 Energy Reduction for Workload-Specific Design
5.5.4 Energy Reduction for Program-Specific Design
5.5.5 Wire Temperature Reduction
5.6 Summary

6 ACTIVITY-AWARE PERFORMANCE OPTIMIZATION
6.1 Introduction
6.2 Related Work
6.3 Techniques for Performance Optimization
6.3.1 Variable Cycle Bus (VCB) Design
6.3.2 Minimum Crosstalk Bit Ordering (MCBO)
6.3.3 MCBO with Signaling (MCBOS)
6.4 Results and Discussion
6.4.1 Peak Crosstalk Reduction
6.4.2 Performance Improvement with VCB
6.5 Summary

7 CONCLUSION
7.1 Contributions and Key Results
7.2 Directions for Future Research

BIBLIOGRAPHY

 

LIST OF TABLES

2.1 Bus crosstalk conditions and models for a rising transition in the middle
(victim) wire.

2.2 Technology, wire geometry, and equivalent circuit parameters for top-most
layer interconnect. Values in top eight rows are from the international
technology roadmap for semiconductors (ITRS) document [1]. Values listed in the
next three rows are from Mui et al. [2]. The values for the self and coupling
capacitances were extracted using the FastCap tool and the value for $r_i$ was
calculated using the formula $r_i = \rho_{Cu}/(w_i \cdot t_i)$, where
$\rho_{Cu} = 2.2 \times 10^{-8}\ \Omega\cdot\mathrm{m}$. Values of $h$ and $k$
were found using expressions given in Section 2.1.5 and
$r = C_{i,i\pm 1}/(C_{line} + h \times C_0)$.

2.3 Configuration of our target system and benchmarks. This processor-memory
system configuration is based on the Alpha 21264 processor.

3.1 Comparison of normalized energy dissipated in wire subsegments obtained
using our model and Cadence Spectre simulations for 10 subsegments.

3.2 Maximum wire temperatures in °C recorded during a simulation of one billion
committed instructions for data and instruction buses using 130 nm and 45 nm
parameters.

4.1 Maximum wire temperatures recorded for the ALU result bus. Ambient
temperature is 318.15 K.

4.2 Performance impact expressed as percentage IPC degradation.

5.1 Optimization scenarios considered in this work.

5.2 Correlation coefficients $r_{xy}$ between test and training set data for
various signaling schemes discussed in Section 5.4.1. Since $r_{xy}$ values are
close to 1, our training and test sets are well correlated.

5.3 Number of iterations and running times for various problem types and sizes.

5.4 Optimal signaling and ordering obtained for workload-specific design of the
data bus (0 = LSB, 63 = MSB). Symbols denote the signaling modes org, inv, trs,
itr, and mm.

5.5 Thermal optimization results. Peak wire temperatures (K) in data and
instruction buses for SBOS scheme with and without thermal constraints (TC)
applied during optimization. The methodology described in Section 5.4.5 was used
to obtain the trade-off curves in Figures 5.16 to 5.20, and the wire permutations
that resulted in bus energy reduction closest to $0.5\,(1 - E_{t}/E_{on})$ were
chosen from each benchmark’s tradeoff curve. Results shown here are for detailed
thermal simulations with this permutation.

 

LIST OF FIGURES

1.1 Gate and interconnect delay scaling for current and future nanometer-scale
technologies. Local interconnects scale with gate delay whereas global
interconnect delays do not [3].

1.2 Interconnect power dissipation due to global and local wires. Global lines
are responsible for 21% of total dynamic power dissipation at 130 nm [4].

1.3 Projected wire temperature rise in multi-layer interconnects for various
technologies under worst-case conditions. Global metal lines will be the hottest,
with temperatures expected to reach as much as 209°C in 45 nm technology [5].

1.4 Pipeline stages and loops in a typical out-of-order processor. More
frequently used loops like fetch, LSQ, and bypass are affected strongly by wire
delay.

1.5 Power dissipation in Intel processors showing an exponential trend [6].
Since 2001, low-power and power management techniques have been used widely in
microprocessors and have helped slow down the trend somewhat.

2.1 Layout of wires routed in the top-most layer metal. Self and coupling
capacitances are shown. The bottom plate represents the VDD/GND plane.

2.2 Wire segment of length $l_{opt}$ between two repeaters.

2.3 Distribution of self and coupling capacitance values for the middle wire of
a 32-bit bus extracted using the FastCap tool [7]. The plotted components are the
self capacitance of the wire; Cc1, the coupling capacitance between the wire and
its adjacent neighbor; Cc2, the coupling capacitance between the wire and a
non-adjacent wire with 1 wire between them; Cc3, the coupling capacitance between
the wire and a non-adjacent wire with 2 wires between them; and Cc_rest, the sum
of coupling capacitances between the wire and other wires with 3 or more wires
between them. For current and near-future ITRS technology nodes (up to 45 nm),
non-adjacent coupling capacitances are not negligible: they contribute
approximately 8-10%.

3.1 Distributed-RC model of the wire segment divided into n subsegments.

3.2 View of the different thermal structures of a C4/CBGA chip and the primary
and secondary heat transfer paths.

3.3 Thermal model. (a) Complete equivalent thermal-RC network for a 5-wire bus.
$\mathcal{R}_{1,k} = \mathcal{R}_{2,k} = \cdots = \mathcal{R}_{5,k}$,
$R_{1,k} = R_{2,k} = \cdots = R_{5,k}$, $C_{1,k} = C_{2,k} = \cdots = C_{5,k}$,
and $P_{1,k}, P_{2,k}, \ldots, P_{5,k}$ are bus-activity dependent in the model
shown. (b) Geometry for calculating equivalent thermal resistances for a wire
based on previous work of Chiang et al. The lightly shaded regions and arrows
represent heat flow between the conductors or between layers (from a hotter to a
cooler one).

3.4 Steady state thermal equivalent circuit for three wires. Heat transfer
between wires is modeled by $R_{inter}$ and heat loss to surroundings by
$R_{th}$. $P_i$ represents power dissipated in each wire due to switching
activity and it can be found using a microarchitecture-level simulator.

3.5 Total energy dissipated in a 64-bit data bus for various benchmarks. ‘Cc1
only’ represents the existing energy models which consider only self and adjacent
coupling capacitances. ‘Cc1+Cc2+Cc3’ represents our model that considers self
capacitances, adjacent coupling capacitances (Cc1), and two non-adjacent
capacitances (Cc2 and Cc3) on each side. The % energy mismatch shown by the line
is plotted with respect to the right-hand side Y-axis.

3.6 Total energy dissipated in a 128-bit instruction bus for various benchmarks.
The % energy mismatch shown by the line is plotted with respect to the right-hand
side Y-axis.

3.7 Total energy dissipated in a 64-bit data bus with various encoding schemes.
‘Self’ denotes self energy, ‘C/D’ denotes the coupling charge/discharge energy,
and ‘Toggle’ denotes the coupling toggle energy dissipation. ‘Cc1 only’ refers to
existing energy models that consider self and adjacent coupling capacitance only
and ‘Cc1+Cc2+Cc3’ refers to our energy model that considers self, adjacent
coupling, and two non-adjacent coupling capacitances.

3.8 This plot shows average energy dissipation and wire temperature of the bus
for a simulation interval of 10 billion cycles. The continuing temperature rise
can be clearly observed.

3.9 Plots show the wire temperature rise recorded for benchmarks gcc and gzip
for the data bus in 130 nm and 45 nm technology nodes over a simulation interval
of one billion committed instructions for each benchmark.

3.10 Plots show the wire temperature rise recorded for benchmarks mcf and lucas
for the data bus in 130 nm and 45 nm technology nodes over a simulation interval
of one billion committed instructions for each benchmark.

3.11 Plots show the wire temperature rise recorded for benchmarks ammp and applu
for the data bus in 130 nm and 45 nm technology nodes over a simulation interval
of one billion committed instructions for each benchmark.

3.12 Plots show the wire temperature rise recorded for integer benchmarks gcc
and gzip for the instruction bus in 130 nm and 45 nm technology nodes over a
simulation interval of one billion committed instructions for each benchmark.

3.13 Plots show the wire temperature rise recorded for integer benchmarks mcf
and lucas for the instruction bus in 130 nm and 45 nm technology nodes over a
simulation interval of one billion committed instructions for each benchmark.

3.14 Plots show the wire temperature rise recorded for integer benchmarks ammp
and applu for the instruction bus in 130 nm and 45 nm technology nodes over a
simulation interval of one billion committed instructions for each benchmark.

3.15 A three-dimensional plot showing spatial and temporal variations in wire
temperature for the lower-order 32 bits of the load/store data bus for the gcc
benchmark.

3.16 Frequency distribution of maximum wire temperature gradients for 130 nm and
45 nm processor wires.

4.1 Distribution of maximum wire temperature gradients in result bus wires for
the 130 nm processor.

4.2 The number of temperature-induced violations per hundred bus references
occurring across ten benchmark programs in a 130 nm processor.

4.3 The number of temperature-induced violations per hundred bus references
occurring across ten benchmark programs in a 45 nm processor.

4.4 This plot shows the frequency of occurrence of five different crosstalk
conditions on the bus. See Section 4.3.2 and Table 2.1 for an explanation of
these crosstalk conditions. The crosstalk condition determines the actual
propagation delay without considering thermal effects.

4.5 Figure shows the percentage of temperature-induced delay violations that
correspond to a given crosstalk condition.

5.1 Markov model-based signaling technique. (a) A 4-bit prediction table for the
Markov model for bits 0-7 of the data bus obtained by analyzing training set
benchmarks. Depending on which bits are selected for Markov model signaling, the
corresponding row of the table can be translated to hardware using logic
minimization tools. (b) Examples of sending end hardware that would be required
for 2 bits (0 and 7) assuming these are chosen to be signaled using the m scheme.
As can be seen, the logic overhead required for m signaling is minimal.

5.2 Sample peak wire temperature versus bus energy trade-off curve. The thermal
optimization steps can be used to obtain curves similar to the one shown here.

5.3 Routing strategy and overheads for re-ordering. (a) Definition of the
routing channel. (b) Matching diagram showing ten crossing points. (c) Two-layer
routing strategy using eight horizontal tracks and ten vias.

5.4 Transition Densities for the 13 integer SPEC CPU2000 Benchmarks for 64-bit
Data Bus.

5.5 Transition Densities for the 13 floating-point SPEC CPU2000 Benchmarks for
64-bit Data Bus.

5.6 Fraction of bus energy dissipated in self and coupling
(charge/discharge+toggle) transitions for 32-bit data address bus in the Alpha
21264 target system while running SPEC CPU2000 programs.

5.7 Fraction of bus energy dissipated in self and coupling
(charge/discharge+toggle) transitions for 32-bit instruction address bus in the
Alpha 21264 target system while running SPEC CPU2000 programs.

5.8 Fraction of bus energy dissipated in self and coupling
(charge/discharge+toggle) transitions for 64-bit data bus in the Alpha 21264
target system while running SPEC CPU2000 programs.

5.9 Fraction of bus energy dissipated in self and coupling
(charge/discharge+toggle) transitions for 128-bit instruction bus in the Alpha
21264 target system while running SPEC CPU2000 programs.

5.10 Energy dissipation results for general-purpose design for the 64-bit data
bus. Statistics collected on 13 training set benchmarks were used to obtain the
optimal static encoding schemes. These were tested on 13 other (test set)
benchmarks. Average energy reductions are MES: 7.81%, MEBO: 11.91%, and SBOS:
20.04%.

5.11 Energy dissipation results for general-purpose design for the instruction
bus. Average energy reductions are MES: 10.96%, MEBO: 19.85%, and SBOS: 38.78%.

5.12 Energy dissipation results for workload-specific design of the 64-bit data
bus. Statistics collected for SimPoint samples from 13 training set benchmarks
were aggregated and used to obtain the optimal static encoding schemes. These
were then tested on a non-overlapping sample from the same set of benchmarks. The
average energy reductions are MES: 9.73%, MEBO: 15.97%, and SBOS: 22.79%.

5.13 Energy dissipation results for workload-specific design for the 128-bit
instruction bus. The average energy reductions are MES: 10.43%, MEBO: 21.25%, and
SBOS: 40.77%.

5.14 Energy reduction results for program-specific design. Statistics collected
for SimPoint samples of each benchmark were used to obtain the optimal static
encoding schemes specific to that benchmark for our schemes, MES, MEBO, and SBOS.
These were then tested on the same sample. Results for dynamic encoding schemes
BI and OEBI proposed in previous work are also shown. The average energy
reductions for the data bus are BI: 4.19%, OEBI: 1.58%, MES: 19.7%, MEBO: 23.25%,
and SBOS: 30.2%.

5.15 Energy reduction results for program-specific design. Statistics collected
for SimPoint samples of each benchmark were used to obtain the optimal static
encoding schemes specific to that benchmark for our schemes, MES, MEBO, and SBOS.
These were then tested on the same sample. Results for dynamic encoding schemes
BI and OEBI proposed in previous work are also shown. The average results for the
instruction bus are BI: 2.63%, OEBI: 5.32%, MES: 21.7%, MEBO: 32.1%, and SBOS:
52.1%.

5.16 Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for ammp and crafty. The
permutation selected for each benchmark was the one

5.17 Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for eon and gcc.

5.18 Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for gzip and lucas.

5.19 Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for mesa and mgrid.

5.20 Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for swim and twolf.

6.1 Three-bit crosstalk analyzer truth table and circuit. (a) Truth table
showing only the ON-set. “-” indicates a don’t care input. (b) Logic circuit
implementing the truth table.

6.2 Variable cycle bus. (a) Complete bus crosstalk analyzer for an n-bit bus.
(b) Sender and receiver logic for VCB.

6.3 Crosstalk reduction results for workload-specific design of the 64-bit ALU
result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO:
24.89% and MCBOS: 30.61%. (b) Average reductions in number of 1+3r cycles. For
MCBO: 19.21% and MCBOS: 23.42%.

6.4 Crosstalk reduction results for general purpose design of the 64-bit ALU
result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO:
21.22% and MCBOS: 29.35%. (b) Average reductions in number of 1+3r cycles. For
MCBO: 16.77% and MCBOS: 20.29%.

6.5 Reduction in the number of cycles taken to transmit the information with
MCBO and MCBOS applied to the result bus. (a) Workload-specific optimization.
(b) General-purpose optimization.

SELECTED LIST OF SYMBOLS

$C_{i,k}$            Thermal capacitance of the kth subsegment of the ith wire
$\mathcal{R}_{i,k}$  Thermal resistance along the heat transfer path of the kth
                     subsegment of the ith wire
$\theta_0$           Ambient temperature inside the computer box (45°C)
$\epsilon_r$         Relative permittivity of dielectric
$C_0$                Capacitance of minimum size inverter in fF
$C_{i,i\pm 1}$       Adjacent coupling capacitance per unit length in pF/m
$C_{i,j}$            Coupling capacitance between line $i$ and any other line
                     $j$, $i \neq j$
$C_{line}$           Self/area capacitance of wire per unit length in pF/m
$f_{clk}$            Clock frequency
$k_{ild}$            Thermal conductivity of dielectric
$R_0$                Resistance of minimum size inverter in kΩ
$r_i$                Resistance of wire per unit length in kΩ/m
$t_{ild}$            Thickness of the inter-layer dielectric
$t_i$                Thickness of wire $i$
$t_p$                End-to-end propagation delay of a wire
$V_{DD}$             Supply voltage
$w_i$                Width of wire $i$

CHAPTER 1
INTRODUCTION AND OVERVIEW

High-speed systems and circuits are increasingly facing the limitations posed by
shrinking physical dimensions of transistors and their interconnections [8]. As
circuits become denser, smaller transistors naturally speed up. But interconnects,
in general, do the reverse and introduce delays that reduce or even cancel the speed
gains due to smaller transistors. The problems due to interconnects are exacerbated
by the fact that parasitic resistance, inductance, and capacitance (RLC) effects
increase as wires scale to smaller dimensions, which in turn aggravates delay and
power consumption and causes signal integrity and reliability problems. Thus,
on-chip interconnect design has been recognized as one of the most important
challenges to address in nanometer-scale integrated circuits [9,10].

1.1 Interconnect Scaling Trends: Delay, Power,
Temperature, and Reliability

According to the data available from the international technology roadmap for
semiconductors (ITRS) documents, the intrinsic gate delay improved ten times, from
10 ps to 1 ps, in the 20 years between 1980 and 2000. However, in the same period
of time, the interconnect delay in a 1 mm line degraded 100 times, from 1 ps to
100 ps [1]. This growing disparity between gate and interconnect delays is also
highlighted in Figure 1.1 for current and future technologies [3]. The figure shows
that while local interconnect delays scale with gate delays, global interconnect
delays do not. Such trends have forced costly performance compromises, like the
allocation of two out of twenty pipeline stages for communication in the Pentium-4
microprocessor [11].
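
As a rough illustration of the disparity quoted above, the arithmetic can be
worked through in a few lines of Python. This is only a sketch: the variable and
function names are ours and purely illustrative, and the only inputs are the four
ITRS delay values cited in the text [1].

```python
from fractions import Fraction

# Delay values quoted in the text from ITRS data [1], in picoseconds.
# Fractions keep the ratios exact; all names here are illustrative only.
GATE_DELAY_PS = {1980: Fraction(10), 2000: Fraction(1)}
WIRE_DELAY_PS = {1980: Fraction(1), 2000: Fraction(100)}  # 1 mm line


def wire_to_gate_ratio(year):
    """Interconnect delay of a 1 mm line relative to intrinsic gate delay."""
    return WIRE_DELAY_PS[year] / GATE_DELAY_PS[year]


# Improvement factor for gates and degradation factor for wires.
gate_speedup = GATE_DELAY_PS[1980] / GATE_DELAY_PS[2000]
wire_slowdown = WIRE_DELAY_PS[2000] / WIRE_DELAY_PS[1980]

# How much the wire-to-gate delay ratio moved over the 20 years.
ratio_shift = wire_to_gate_ratio(2000) / wire_to_gate_ratio(1980)

print(f"gate delay improved by a factor of {gate_speedup}")
print(f"1 mm wire delay degraded by a factor of {wire_slowdown}")
print(f"wire/gate delay ratio in 1980: {wire_to_gate_ratio(1980)}")
print(f"wire/gate delay ratio in 2000: {wire_to_gate_ratio(2000)}")
print(f"shift in wire/gate ratio: {ratio_shift}")
```

Under these numbers, a 1 mm wire goes from one tenth of a gate delay in 1980 to
one hundred gate delays in 2000, which is the product of the two factors quoted
above.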
[Figure 1.1: plot of delay (logarithmic axis, 0.01 to 100) against feature size
(250, 180, 130, 90, 65, 45, and 32 nm) for four curves: global interconnect
without repeaters, global interconnect with repeaters, local interconnect (M1,
M2), and gate delay (FO4). Source: ITRS.]

Figure 1.1. Gate and interconnect delay scaling for current and future
nanometer-scale technologies. Local interconnects scale with gate delay whereas
global interconnect delays do not [3].

Interconnects are also responsible for about 50% of the power dissipation, as shown
by studies on a 130 nm Intel microprocessor [4]. Figure 1.2 shows the distribution
of power dissipation by type of net or wire. As can be seen, global signal lines
account for 34% of the total interconnect power dissipated and hence 21% of the
total dynamic (switching-related) power dissipation at 130 nm.

[Figure 1.2 charts: "Interconnect Power" and "Total Dynamic Power" breakdowns; the
only recoverable segment label is "Local Signals, 37%".]

Figure 1.2. Interconnect power dissipation due to global and local wires. Global lines
are responsible for 21% of total dynamic power dissipation at 130 nm [4].

Due to increased Joule heating in the global wires, their temperatures are also
increasing alarmingly. The spatial temperature distributions along the vertical
direction from the Silicon (Si) substrate, obtained using finite element models and
simulations, are shown in Figure 1.3 [5]. This analysis assumed that all wires in the
interconnect stack carried currents at the maximum rated current density for that
technology, which represents an extreme worst case. Nevertheless, the results show
how temperatures will be distributed across interconnect layers. It can be observed
that as technology scales down, the temperature gradient between the top metal lines
and the substrate becomes larger. Global metal lines were found to be the hottest in
all technologies using this worst-case analysis, with temperatures reaching as much
as 209°C in 45 nm technology [5]. For the 35 nm node, the temperature gradient is
smaller than that for the 50 nm node due to the larger fraction of metallization (Cu)
layers compared to inter-layer dielectric (ILD) layers, an artifact of the ITRS
scaling scenario that was used for this analysis. It should be noted that the total
height of the (Cu+ILD) layers decreases as scaling continues, due to the smaller
vertical dimensions of wires and insulators, despite the increase in the number of
metal layers. It can also be observed that the maximum chip temperature occurs in the
long global wires, which are most prone to electromigration failures and also give
rise to the highest RC delays. This has important implications for both reliability
and performance.

[Figure 1.3 plot: temperature versus distance from substrate (µm, 2 to 10) for the
50 nm node; global wires are marked at 209 °C.]

Figure 1.3. Projected wire temperature rise in multi-layer interconnects for various
technologies under worst-case conditions. Global metal lines will be the hottest, with
temperatures expected to reach as much as 209°C in 45 nm technology [5].

1.2 Material, Process, and Architectural
Advances

Many methods, such as Cu and low-k insulators [12-14], short-wire architectures
[15,16], on-chip networks [17], optical interconnects [18], and three-dimensional
interconnect structures [19,20], have been suggested to help alleviate the impact of
interconnect scaling on current and future nanometer-scale fabrication. The pros and
cons of these techniques are discussed next.

Material and process enhancements: Copper interconnects in high-speed
microprocessors were introduced by IBM in its 400 MHz PowerPC 750 processor.
Although the resistivity of Copper is 40% less than that of Aluminum, the
performance improvement from using Copper is limited to about 15% [14]. The
thickness and resistivity of the Tantalum (Ta) liner, used in the dual-damascene
process for Copper electrodeposition, also limit the performance advantage of Copper
interconnects. Low-k dielectrics also help improve chip performance. For example,
the performance of a metal wire improves by 25% in 0.25 µm technology using a k=2.5
dielectric material compared to conventional silicon dioxide, which has k=3.9.
However, the use of Copper and low-k dielectrics is known to aggravate thermal
issues in interconnects and cause reliability problems, during both fabrication and
chip lifetime [21].

Novel architectures: Short-wire architectures such as systolic arrays can be
employed to overcome some of the problems imposed by long global interconnects [16].
Although these architectures are not applicable to all microprocessors, they can be
useful in specific applications, such as pattern recognition, multiprocessor systems,
and arithmetic computation. On-chip networks can be used instead of global
interconnects to reduce global interconnect congestion [22]. Since most global wires
are not utilized in every clock cycle, it is more efficient to send packets over a
global network than to send signals over dedicated global wires. However, this
requires a completely new architecture, tools, and design methodology, different
from those of conventional microprocessors.

Optical interconnections and 3D integration: It has been shown that optical
interconnections have higher bandwidth and consume less power for long-distance
communication than electrical interconnections [23]. However, because of
incompatibility with standard CMOS technology, optical interconnects have not been
widely deployed in current microprocessors. Their primary application has been
restricted to clock distribution networks in some designs [24]. Three-dimensional
interconnection schemes are also expected to significantly reduce global wiring
requirements and to have a significant impact on reducing interconnect delay and
power [25]. However, vertical pitch limitations resulting from alignment tolerances
in the bonding of wafers [26] and heat removal capacity limitations [27] are some of
the problems limiting the use of three-dimensional architectures.

1.3 Impact of Interconnects on Architecture and
VLSI

Interconnect-related problems have affected chip design to such an extent that the
product roadmaps of almost all chip design companies have been drastically redrawn,
as it is becoming evident that high-speed processors with clock frequencies
exceeding 10 GHz are no longer economically viable, due to restrictions imposed by
power, temperature, and reliability [28-30]. Interconnect scaling, together with
power and performance issues, affects the very first architectural design decisions
of today's processors [31,32].

1.3.1 Wire Delay

Processor clock speeds have increased continuously, due to faster transistors and
also due to deeper pipelines. However, since global wire delays (for example, the
delay of register bypass wires) scale much more slowly than transistor delays,
deeper superscalar pipelines have experienced increased latencies and a significant
degradation in instruction throughput. Several studies have pointed out that rising
wire delays dictate that deeper pipelines will not perform better than shallower
ones in future technologies, and they also conclude that superscalars do not have
sufficient parallelism to tolerate the relative rise in wire delays [33,34]. This
explains the industry trend toward multi-core and multithreaded architectures [35].

[Figure 1.4 diagram: pipeline stages FETCH, DECODE, RENAME, ISSUE, REG. READ, EXEC,
and MEM, annotated with the fetch, rename, issue, EX bypass, LSQ, MEM-EX bypass,
WB-EX bypass, branch misprediction, and load mis-speculation loops.]

Figure 1.4. Pipeline stages and loops in a typical out-of-order processor. More
frequently used loops like fetch, LSQ, and bypass are affected strongly by wire delay.

We briefly examine next why wire delay trends affect architectural decisions. An
out-of-order superscalar pipeline is composed of two in-order half-pipelines, called
the front-end and back-end, connected by the issue queue. Figure 1.4 shows this
configuration and the various loops in the pipeline [35]. Wire delay affects many of
these loops significantly, as discussed next. The fetch loop arises because the
current program counter (PC) is used to predict the next PC. The delay of this loop
includes the instruction bus and cache delays. The rename loop is due to the
dependence between a previous instruction assigning a rename tag and a later
instruction reading the tag, and the issue loop is due to the dependence between a
producer instruction and the wakeup of a consumer instruction. The rename and issue
loop delays are sensitive to the delay on the tag lines. The load misspeculation
loop is due to the use of speculation and the need for load-miss replay. The
load/store queue (LSQ) loop is due to the dependence between a previous store and a
later load to the same address and includes the load/store bus and data cache
delays. The various bypass loops (EX-EX, EX-MEM, and Writeback-EX) are all affected
by the wire delays on the ALU result bus. Also, the more frequently a loop is used,
the higher its impact on performance. The fetch, rename, issue, and bypass loops are
all fairly frequent and hence have the highest impact. The load misspeculation and
branch misprediction loops, which are used only upon load misses and branch
mispredictions, respectively, are relatively infrequent and have less impact.

1.3.2 Power and Temperature

Power has become a first-class constraint in the design of nanometer-scale ICs.
Figure 1.5 shows the trend observed in the power dissipation of Intel
microprocessors [6]. In 2001, it was predicted that, at the scaling rates of the
time, the power density in microprocessors would reach that of the Sun by 2015,
following an almost exponential trend [6]. Since then, various steps have been taken
to reduce power dissipation in logic and memories, with techniques at various levels
of abstraction. These have reduced the trend to a linear one, as shown by the dotted
line in Figure 1.5.

[Figure 1.5 plot: logarithmic vertical axis (1000 to 100000); recoverable labels are
"Pentium Processors", "2000", and "2004".]

Figure 1.5. Power dissipation in Intel processors showing an exponential trend [6].
Since 2001, low-power and power management techniques have been used widely in
microprocessors and have helped slow down the trend somewhat.

Among the three sub-systems of a high-performance processor (computation, storage,
and communication), the communication or interconnect sub-system, which carries
address, instruction, data, and control signals, is still responsible for a large
share of the on-chip power dissipation, as discussed earlier, in part due to
interconnect scaling trends. With increasing interconnect power dissipation, wire
temperatures rise as a result of the Joule effect, wire resistance increases due to
temperature-dependent resistivity, forcing performance to degrade further, and wire
reliability decreases sharply due to electromigration-induced breakage. Even with
the advent of multi-core processors, clock frequencies and datapath widths have
continued to increase, and hence all of the above effects are bound to worsen.
Hence, interconnect power dissipation and temperature remain among the primary
issues facing microarchitects and VLSI designers.
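
To make the link between wire temperature and resistance concrete, the following minimal sketch uses the usual linear approximation R(T) = R_ref (1 + alpha (T - T_ref)). The coefficient is the textbook value for bulk copper and is an assumption here, as are the function names and the example temperature rises; this is not the thermal model developed in this dissertation.

```python
# Linear temperature dependence of wire resistance (illustrative sketch).
# ALPHA_CU is the textbook temperature coefficient of bulk copper near 20 degC.
# Real on-chip Cu wires (thin films with liners) may differ, so it is an assumption.
ALPHA_CU = 0.0039  # per kelvin, assumed


def resistance_at(r_ref: float, t_ref_c: float, t_c: float,
                  alpha: float = ALPHA_CU) -> float:
    """Return the resistance at temperature t_c, given r_ref measured at t_ref_c."""
    return r_ref * (1.0 + alpha * (t_c - t_ref_c))


def relative_rc_delay_increase(delta_t: float, alpha: float = ALPHA_CU) -> float:
    """Fractional increase of an RC-limited delay for a temperature rise delta_t,
    assuming the wire capacitance does not change with temperature."""
    return alpha * delta_t


if __name__ == "__main__":
    # Hypothetical temperature rises, in kelvin.
    for dt in (10.0, 25.0, 50.0):
        growth = 100.0 * relative_rc_delay_increase(dt)
        print(f"rise of {dt:.0f} K -> RC delay grows by about {growth:.1f}%")
```

Because delay in an RC-limited wire scales with R, even a few tens of kelvin of self-heating translate into a delay change of several percent under this simple model.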

Popular low-power and power-management techniques like fine-grained clock gating
and power gating can also significantly affect on-chip temperature profiles by
creating localized hot spots and/or temperature gradients on the chip. These
gradients cause delay variability, setup and hold time violations and, in the worst
case, failure of interconnects that are routed across regions with varying
temperatures. Designing for these issues is almost impossible because accurate
techniques to estimate temperature gradients in interconnects are currently
unavailable. Thus, the study of the thermal impact of architectural techniques is
also becoming important.

1.3.3 Computer-Aided Design Tool Requirements

In conventional ASIC design, signal and power integrity were checked in later stages
of the design cycle, and the design was modified if these checks were found to be
unsatisfactory. However, with the explosion in the number of transistors and the
highly complex designs of nanometer-scale technologies, iterating between upstream
(architecture or high-level) design changes and layout to achieve design closure has
become increasingly futile, leading to longer time-to-market schedules and higher
design costs [9]. The design-productivity gap, exemplified by the lack of proper CAD
tools to identify and correct issues at an early stage, exacerbates this problem.
While the push toward ever-higher performance still drives the semiconductor
industry, there is growing awareness that winning designs need to balance multiple
objectives: high performance, low power, low cost, robustness (noise immunity), and
reliability. As such, it is becoming imperative to: (1) model interconnect-related
effects accurately and efficiently for different system architectures (superscalar,
multi-core, and network-on-chip) and fabrication technologies (130 nm, 90 nm, etc.)
and (2) design the interconnect system, at an early stage, to alleviate or mitigate
these effects without incurring unsustainable performance, energy, and/or area/cost
overheads.

1.4 Drawbacks in Existing Techniques

Next, we discuss some drawbacks of existing models and design techniques for signal
interconnects. First, almost all existing work on signal interconnect analysis,
design, and optimization is not activity-aware, i.e., such interconnect models and
design techniques are developed without accurate knowledge of the characteristics
of the data transmitted on these interconnects. An average wire switching factor,
such as the 0.15 suggested in [36], is used to estimate energy dissipation, wire
temperature, delay impact, and/or reliability impact [37,38]. Such average estimates
lead to over-design because switching activity in interconnects is actually
information and time dependent. It depends on the type of information (address,
instruction, data, or control) being transmitted because the information type
influences switching activity factors; for example, the activity factor is expected
to be higher for data and instruction streams since they are more random in nature
than addresses. It varies with time too because, during execution of most typical
programs, there are substantial periods when a bus may remain idle; for example,
when there are no level-one (L1) cache misses, the bus connecting to the level-two
(L2) cache will remain idle. These idle cycles help bring down wire temperatures and
hence may reduce wire delay and electromigration impact. Hence, to facilitate
interconnect design that can be tuned to the requirements of different
architectures, activity-aware modeling and design optimization techniques are
necessary.
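
The difference between a fixed average switching factor and a measured one can be illustrated with a small sketch. It computes the per-wire activity factor of a bus from a word trace; the function name and the two example traces are hypothetical and chosen only to contrast sequential addresses with less regular data.

```python
from typing import Sequence


def activity_factor(words: Sequence[int], width: int) -> float:
    """Average fraction of wires that toggle per bus cycle.

    words[i] is the value held on the bus in cycle i. An idle cycle is modeled
    by repeating the previous word, so it contributes no toggles.
    """
    if len(words) < 2:
        return 0.0
    mask = (1 << width) - 1
    toggles = sum(bin((a ^ b) & mask).count("1")
                  for a, b in zip(words, words[1:]))
    return toggles / (width * (len(words) - 1))


if __name__ == "__main__":
    # Hypothetical 16-bit traces: word-aligned sequential addresses vs. irregular data.
    sequential_addresses = [0x1000 + 4 * i for i in range(8)]
    data_like = [0x3A5C, 0xC3A1, 0x0F0F, 0xF00D, 0x1234, 0xEDCB, 0x5555, 0xAAAA]
    print("address bus activity:", activity_factor(sequential_addresses, 16))
    print("data bus activity:   ", activity_factor(data_like, 16))
```

Comparing either result with a fixed assumed factor shows how far a single average can be from the activity of a specific information stream.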

Second, as mentioned earlier, increasing the number of iterations between high-level
design and physical layout to achieve design closure has become exorbitantly costly,
time-consuming, and impractical in nanometer designs. Hence, growing emphasis is
being placed on making accurate early-stage design decisions using
microarchitecture-level simulations of benchmark programs. Interconnect models that
have been built into existing execution-driven simulators lack the detail needed to
accurately estimate the impact of interconnect power dissipation, temperature, and
related effects, since many do not consider the influence of wire coupling and
thermal heat dissipation paths. For example, the amount of energy dissipated due to
the parasitic coupling capacitance between wires is much greater than the energy
dissipated due to the area capacitance. Similarly, thermal coupling, or heat
transfer through the inter-metal dielectric, occurs between adjacent wires,
affecting temperatures in both wires.
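
A rough sense of why coupling matters can be had from a widely used simplified bus energy model, sketched below. It is an assumption of this example, not the model developed later in this dissertation: each wire has a capacitance to ground and a coupling capacitance to each adjacent neighbor only, and the capacitance and supply values are hypothetical.

```python
from typing import Tuple


def transition_energy(prev: int, curr: int, width: int,
                      c_self: float, c_couple: float,
                      vdd: float) -> Tuple[float, float]:
    """Return (self_energy, coupling_energy) in joules for one bus transition.

    delta_i is +1, -1, or 0 for a rising, falling, or quiet wire. Self energy
    scales with the number of switching wires; coupling energy scales with
    (delta_i - delta_{i+1})^2, so opposite transitions on neighbors cost 4x.
    """
    delta = [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
    self_term = sum(d * d for d in delta)
    couple_term = sum((delta[i] - delta[i + 1]) ** 2 for i in range(width - 1))
    half_v2 = 0.5 * vdd * vdd
    return half_v2 * c_self * self_term, half_v2 * c_couple * couple_term


if __name__ == "__main__":
    # Hypothetical per-wire capacitances (farads) and supply voltage (volts).
    e_self, e_couple = transition_energy(0b0101, 0b1010, 4,
                                         c_self=50e-15, c_couple=150e-15, vdd=1.2)
    print(f"self: {e_self:.3e} J, coupling: {e_couple:.3e} J")
```

A model that counts only self transitions would miss the coupling term entirely, which is the gap in existing simulators that the paragraph above points out.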

1.5 The Need for Activity-Aware Design

Existing techniques that target bus energy and crosstalk reduction perform well only
when the patterns transmitted on the bus are randomly distributed in time. However,
this is rarely the case in actual microprocessor buses. Information transmitted on
these buses shows a high degree of correlation across programs as well as across
sections of the same program, due to the presence of temporal, spatial, and value
locality. Temporal locality describes the likelihood that a recently referenced item
will be referenced again soon, while spatial locality describes the likelihood that
a close neighbor of a recently referenced item will be referenced soon. Value
locality refers to the likelihood of a previously seen value recurring repeatedly in
the information stream.

Address, instruction, and data streams in microprocessor buses exhibit substantial
amounts of temporal and spatial locality for the reasons discussed next. Instruction
addresses issued by the processor to the L1 cache are typically sequential, except
when branches or jumps occur, and even then the target addresses are typically not
very far from the last address. This is why many instruction sets use PC-relative
addressing with shorter-than-full-word offsets for branch and jump instructions.
Data addresses issued by the processor also exhibit these localities, primarily
because loops scan data arrays that are placed in contiguous memory locations.

The dynamic instruction stream executed by a processor corresponds to the
instruction addresses issued by the fetch unit, and hence instructions exhibit the
same temporal and spatial locality as instruction addresses. Also, not all
instructions, instruction sequences, opcodes, register operands, and immediate
constants are present equally frequently in the dynamic instruction mix, leading to
more predictability in the instruction stream. Such redundancies are present because
all programs share certain basic characteristics: procedures and procedure calls,
branches every few instructions (typically every six instructions [39]), and loops
and if-then-else clauses that lead to repetitive instruction sequences.

Data buses in the processor, such as load/store and ALU result buses, also exhibit
temporal and spatial locality, although to a lesser extent than address and
instruction buses. There is an additional element of redundancy present in the
magnitude of values communicated by these buses and stored in registers, data
caches, and/or CAM structures in the processor core. This redundancy is due to the
fact that for any given type of data (character, integer, floating-point, etc.) not
all values are equally likely. For instance, many programs do not use the entire
range of possible integer values; rather, the values used tend to be concentrated
around certain values, especially zero. For such small-magnitude two's complement
numbers, most high-order bits of the data bus are likely to be either all zero
(positive) or all one (negative) due to sign extension. Value locality also adds to
the redundancies present in data buses. For example, the number of times each static
load (or store) instruction retrieves a value from (or writes a value to) memory
that matches a previously seen value is quite high. Studies have shown that this
fraction is around 50% for most superscalar processors running standard benchmark
applications [40].
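
The sign-extension effect described above is easy to see in a short sketch. The helper names and the 32-bit default width are assumptions made for illustration.

```python
def to_twos_complement(value: int, width: int = 32) -> int:
    """Encode a signed integer as an unsigned width-bit two's complement word."""
    return value & ((1 << width) - 1)


def redundant_sign_bits(value: int, width: int = 32) -> int:
    """Count high-order bits below the sign bit that merely repeat the sign bit."""
    word = to_twos_complement(value, width)
    sign = (word >> (width - 1)) & 1
    count = 0
    for i in range(width - 2, -1, -1):
        if (word >> i) & 1 != sign:
            break
        count += 1
    return count


def toggled_wires(a: int, b: int, width: int = 32) -> int:
    """Number of bus wires that switch when value a is followed by value b."""
    return bin(to_twos_complement(a, width) ^ to_twos_complement(b, width)).count("1")


if __name__ == "__main__":
    for v in (5, -5, 1000, -1):
        print(v, "->", redundant_sign_bits(v), "redundant high-order bits")
    print("wires toggled going from 5 to -5:", toggled_wires(5, -5))
```

Small values of the same sign leave the upper wires quiet, whereas alternating between small positive and small negative values toggles almost the whole bus, which is the kind of data-dependent behavior an activity-aware design can exploit.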

The presence of temporal, spatial, and value localities in information streams opens
opportunities for activity-aware design of high-performance buses, i.e., design that
is tailored to the unique characteristics of the different types of data transmitted
on these buses, as well as to the typical applications executed on the processor.
Such design can be achieved with the following steps: (1) profile the information
transmitted on target buses using cycle-accurate microarchitecture-level simulators
for a representative workload, (2) identify opportunities by correlating, for
example, the number of self and coupling transitions with the objective function
(bus energy, temperature, crosstalk, etc.), and (3) design techniques that minimize
the value of the objective function. Although the technique is designed using a
representative workload, it is likely to work well for any real application in the
same domain due to the similarities in program characteristics. In fact, computer
architects continue to use similar methodologies to design efficient branch,
load-value, and other prediction-based techniques to improve instruction-level
parallelism in modern superscalar processors.
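
The three steps above can be sketched on a toy scale. In this hypothetical example the "profile" is a short hand-written trace, the objective is a weighted count of self and coupling transitions, and the design space is the set of static bit orderings of a 4-bit bus, searched exhaustively. The names, the weight, and the choice of bit reordering are all assumptions for illustration and do not represent the optimization method used in this dissertation.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def permute_bits(word: int, order: Sequence[int]) -> int:
    """Place source bit order[i] on physical wire i."""
    return sum(((word >> src) & 1) << wire for wire, src in enumerate(order))


def cost(trace: Sequence[int], width: int, coupling_weight: float) -> float:
    """Step 2: objective = self transitions + weighted coupling transitions."""
    total = 0.0
    for prev, curr in zip(trace, trace[1:]):
        delta = [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
        total += sum(d * d for d in delta)
        total += coupling_weight * sum((delta[i] - delta[i + 1]) ** 2
                                       for i in range(width - 1))
    return total


def best_static_order(trace: Sequence[int], width: int,
                      coupling_weight: float = 3.0) -> Tuple[Tuple[int, ...], float]:
    """Step 3: exhaustively pick the bit order with the lowest cost (tiny widths only)."""
    best_order = tuple(range(width))
    best_cost = cost(trace, width, coupling_weight)
    for order in permutations(range(width)):
        reordered: List[int] = [permute_bits(w, order) for w in trace]
        c = cost(reordered, width, coupling_weight)
        if c < best_cost:
            best_order, best_cost = order, c
    return best_order, best_cost


if __name__ == "__main__":
    profiled_trace = [0b0101, 0b1010, 0b0101]  # Step 1: hypothetical profile data
    print("baseline cost:", cost(profiled_trace, 4, 3.0))
    print("best order and cost:", best_static_order(profiled_trace, 4))
```

Exhaustive search grows factorially with bus width, so it only serves to show the flow from profile to objective to design choice; a realistic bus needs a scalable formulation.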

1.6 Our Contributions

As presented earlier, accurate modeling and cost-effective design of global signal
interconnects are critical issues in current and future nanometer-scale design.
Since interconnect performance (wire delay) and energy dissipation depend closely on
the switching characteristics of the data stream, activity-aware modeling and design
approaches are important. Furthermore, the introduction of Cu and low-k dielectrics
exacerbates problems such as wire self-heating, which need to be modeled along with
the impact of temperature on wire delay variability. Finally, newer design
techniques are needed to deal with rising interconnect power dissipation and
temperature, since existing techniques are not effective in most real architectures,
workloads, and applications.

The objective of this research is to provide a methodology to model and design
signal interconnects in nanometer-scale ICs and to address power, temperature, and
performance concerns during early-stage design. To accomplish this goal, four
research tasks were identified and novel contributions are made in each.

1.6.1 Activity-Aware Design Methodology

Our research is perhaps the first attempt to propose and examine activity-aware
design techniques for global signal buses. Existing techniques rely on worst-case
estimates to design high-performance buses, resulting in overly pessimistic energy,
temperature, and clock cycle time estimates. Due to the lack of accurate models
suitable for early-stage design exploration, interconnect design is done late in the
design cycle, offering very limited opportunities to optimize the architecture for
performance, power, and cost. In contrast, the methodology we propose examines
typical applications, collects statistics for different types of data, and optimizes
the design of target buses, all using early-stage simulation. Thus, interconnect
design can be completed early in the design cycle and can be used as a parameter in
design space exploration.

1.6.2 Accurate Energy, Temperature, and Delay Modeling

We introduce accurate modeling techniques to help estimate the impact of
activity-dependent interconnect energy dissipation, wire temperature rise due to
Joule heating, and delay variation due to temperature, using a
microarchitecture-level simulator. In addition to self capacitance, our model
incorporates the effects of capacitive coupling between adjacent as well as
non-adjacent pairs of wires and of repeater insertion on switching energy, uses
lateral heat transfer between adjacent wires to estimate wire temperatures, and
estimates wire temperature gradients and their impact on wire delay, none of which
were available in earlier models. We estimate from simulations using our model for
the 130 nm technology node that, during the time interval taken to commit one
billion instructions in the pipeline, high-performance bus wire temperatures rise by
10-37°C for various SPEC CPU2000 benchmarks. This is solely due to Joule heat
dissipated by wire switching activity. In a future 45 nm technology node, the wire
temperature rise for the same set of benchmarks and simulation sample was found to
be 20-58°C. We observed that instruction and data bus wires attained absolute
temperatures in the ranges 80.3-104°C and 97.6-123.7°C in 130 nm and 45 nm
processors, respectively, during the course of our simulation, showing that signal
lines attain significant temperatures too. Significant wire temperature gradients of
magnitude 16-25°C were found to be most common between the sending and receiving
ends of the wires during the course of simulation. A notable correlation was found
between energy dissipation behavior and wire temperature rise in buses across time;
short, intermittent cycles of high energy-dissipating switching activity trigger
steep changes in temperature.

We also developed models that track the impact of changing wire temperature on
timing/delay violations occurring in global signal buses during
microarchitecture-level exploration. Results show that, for a 130 nm processor with
no power and thermal management, the number of temperature-induced clock cycle time
violations in an ALU result bus, which is on the critical path, is 2.27 per hundred
bus references, averaged over ten programs in the SPEC CPU2000 workload. It
increases to an average of 6.20 per hundred bus references for the same processor at
the 45 nm technology node. Our analysis also shows that conventional techniques like
bus encoding, which seek to reduce energy dissipation and potentially wire
temperatures, have limited impact on alleviating temperature-induced delay
violations.

1.6.3 Proﬁle-Guided Optimization Techniques

Efforts to reduce bus energy dissipation, particularly in long global signal buses,
are becoming increasingly important in nanometer-scale technologies as interconnects
continue to aggravate performance, power, and cost. While dynamic encoding schemes
have been proposed to reduce bus switching energy, they do not work well for
correlated traffic such as that found in typical workloads like the SPEC CPU2000
benchmarks. Hence, we develop static bus encoding techniques and present a
methodology to design such schemes in an optimal manner. Being completely static,
such schemes can be designed during early-stage microarchitectural exploration and
incur minimal run-time hardware area/cost, power, and latency compared to dynamic
encoding logic. We use a microarchitecture-level simulator, profile representative
samples of SPEC CPU2000 benchmarks to collect data, and use integer linear
programming (ILP) to design our encoding scheme. Results show that, for the SPEC
CPU2000 workload, i.e., for workstation/PC class processors, total bus energy
dissipation is reduced by as much as 22.79%/40.77% for data/instruction buses when
our best static encoding scheme is applied. In contrast, existing dynamic bus
encoding techniques yield only 4.19%/5.32% reductions for the same type of bus
traffic.

1.6.4 Novel Thermal Optimization Methodology

Apart from bus energy, rising wire temperatures are also becoming an important issue
to address in high-performance buses since they affect wire delay and reliability.
We propose a first-of-its-kind methodology to design temperature-aware encoding
schemes by trading off some of the energy gains we obtain with static encoding
techniques to achieve wire temperature reduction. In this methodology, we add
temperature constraints during energy optimization, and our integer linear program
produces a static encoding scheme that reduces maximum/hottest wire temperatures by
up to 15.23 K/16.17 K for data/instruction buses while still producing significant
total bus energy reductions.


1.6.5 Performance-Oriented Adaptive Bus Design

The rate at which signals can be transmitted on a high-speed processor bus is
decided based on the worst-case crosstalk pattern. This pessimistic estimate gives
rise to significant performance penalties since the worst case never occurs, or
occurs with very low frequency, in actual applications. Hence, we propose an
adaptive bus design, called the variable cycle bus (VCB) architecture, that examines
incoming data patterns and transmits them using a variable number of clock cycles,
improving bus performance significantly. To maximize the effectiveness of our
adaptive bus architecture, we propose a profile-guided optimization approach, like
the one described earlier in Section 1.6.3, to reorder and signal bits so as to
minimize bus crosstalk. Results on SPEC CPU2000 benchmarks, in a general-purpose
optimization scenario, show a 29.35% reduction in 1+4r cycles, a 20.29% reduction in
1+3r cycles, and a bus performance improvement of 17.42% for a VCB with a static
reordering and signaling technique targeting bus crosstalk minimization.
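
The 1+3r and 1+4r labels above can be read using the commonly used crosstalk classification, in which a switching wire sees an effective capacitance proportional to (1 + k r), where r is the ratio of coupling to ground capacitance and k depends on how its two neighbors switch. The sketch below classifies a bus transition by its worst wire. This interpretation, the function name, and the treatment of edge wires (no outer neighbor) are assumptions of the example, not a description of the VCB hardware.

```python
from collections import Counter
from typing import List, Sequence


def crosstalk_class(prev: int, curr: int, width: int) -> int:
    """Return the largest k (0..4) over all switching wires for this transition.

    For a switching wire i, k = |d_i - d_{i-1}| + |d_i - d_{i+1}|, where d is
    +1 (rising), -1 (falling), or 0 (quiet). Returns 0 if nothing switches.
    """
    delta: List[int] = [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue  # a quiet wire has no propagation delay of its own
        left = abs(d - delta[i - 1]) if i > 0 else 0
        right = abs(d - delta[i + 1]) if i < width - 1 else 0
        worst = max(worst, left + right)
    return worst


def class_histogram(trace: Sequence[int], width: int) -> Counter:
    """Count how many transitions in a trace fall into each crosstalk class."""
    return Counter(crosstalk_class(a, b, width) for a, b in zip(trace, trace[1:]))


if __name__ == "__main__":
    trace = [0b000, 0b111, 0b101, 0b010, 0b010, 0b001]  # hypothetical 3-bit trace
    print(class_histogram(trace, 3))
```

A histogram of this kind over a profiled trace shows how rarely the worst class occurs, which is the observation that motivates transmitting most words in fewer cycles.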

1.7 Dissertation Outline

The remainder of this dissertation is organized as follows. Chapter 2 presents
background on interconnect analysis and optimization for delay and power and
provides a general overview of our experimental methodology and simulation
infrastructure. In Chapter 3, we present the model for estimating activity-driven
energy and temperature in processor buses and study the energy and temperature
characteristics of data and instruction buses. In Chapter 4, we present the model
for estimating data- and temperature-dependent delay variability and examine the
impact of delay variability on the performance of a processor in current and future
fabrication technologies. In Chapter 5, we discuss novel interconnect optimization
techniques to reduce processor bus energy and temperatures. We then discuss delay
optimization techniques in Chapter 6. Finally, we conclude and present directions
for future work in Chapter 7.


CHAPTER 2
PRELIMINARIES

Integrated circuits (ICs) consist of two basic components: transistors and their
interconnections. As more and more devices are integrated on a single die, wires or
interconnections gain importance and play a major role in determining the speed,
area, reliability, and yield of VLSI circuits [41]. In this chapter, we provide a
brief introduction to some terminology used in the context of interconnect design
and discuss interconnect analysis and optimization methods. We also discuss the role
of architecture-level simulators in interconnect analysis and design. Finally, we
outline the general experimental methodology followed in our experiments.

2.1 Interconnect Analysis Methods

Interconnect analysis as applied to power and timing seeks to answer three
questions: (1) what is the effective loading due to the interconnect? (this is
necessary for driver/repeater sizing to minimize delay and to estimate power
dissipation), (2) what are the delay and slew at the receivers? and (3) what is the
effect of the switching of this and other neighboring nets on power dissipation and
propagation delay? This analysis can be performed with dynamic circuit simulation,
in which specific stimuli are applied to the circuits and interconnect in question.
Unfortunately, this technique cannot be practically applied to the millions of
transistors on a digital integrated circuit. Hence, interconnect analysis is
performed using simpler models. Interconnects in a VLSI chip can be grouped into
three categories, based on their length, as discussed next.

2.1.1 Global, Semiglobal, and Local Wires

Since it is not possible to connect millions of transistors on the die using only
one level of interconnect, multi-layer interconnect structures are commonly used.
The metal layers closest to the Silicon (Si) substrate are called local
interconnects or wires. The next few layers are called semiglobal or intermediate
interconnects, and the top layers are called global interconnects. The wires in the
global layers are wider and thicker, which yields shorter propagation (or RC) delays
since wire resistance, and hence delay, is inversely proportional to the
cross-sectional area. Consequently, these layers are used to route high-performance
buses in the core of the microprocessor. Wider and thicker wires at higher layers
are also used to provide low-resistance power/clock distribution lines to different
regions of the chip. Layer assignment, i.e., the decision to route a wire/net in the
local, semiglobal, or global layer, is performed based on stochastic wire length
estimates [42]. In our research, we are interested in power, temperature,
performance, and reliability optimization of longer wires, i.e., the semiglobal and
global wires that are used to route high-performance buses. These interconnects are
analyzed using the models discussed next.
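
The dependence of wire resistance on cross-section mentioned above follows from the standard expression for a uniform conductor. The symbols used here (rho for resistivity, L for wire length, W and T for wire width and thickness, C for total wire capacitance) are notation assumed for illustration.

```latex
\[
R = \frac{\rho\, L}{W\, T},
\qquad
t_{RC} \propto R\, C = \frac{\rho\, L}{W\, T}\, C .
\]
```

Doubling both W and T, for instance, cuts R to one quarter for the same length; the net effect on delay also depends on how C changes with the new geometry.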

2.1.2 Interconnect Models: RC and RLC

Interconnects, in general, have three important electrical characteristics:
resistance (R), capacitance (C), and inductance (L). All three depend on the
interconnect geometry and its position relative to the surrounding structures.
These parasitics affect circuit performance: capacitance adds load to driving gates;
resistance, inductance, and capacitance all add signal delay; and inductive and
capacitive coupling between interconnects adds signal noise.

The circuit parasitics of a wire are distributed along its length and are not lumped
into a single position. As long as the resistive component of the wire is small and
the switching frequencies are in the low to medium range, it is meaningful to consider
only the capacitance component of the wire and to lump the distributed capacitance
into a single capacitor. This is the simple capacitive model, and it is not very accurate:
on-chip metal interconnects of over a few millimeters in length have a significant
resistance. The π-model lumps the total wire resistance of each wire segment into a
single resistor $R$ and represents the total capacitance $C$ as two capacitances of $C/2$ each.
This model, called the lumped-RC model, is, however, pessimistic and inaccurate for
long interconnects, which are more adequately represented by a distributed-RC model.
In practice, the distributed model is represented as a π-ladder network. Like the
resistance and capacitance of an interconnect, the inductance is also distributed over
the wire. Thus, a distributed-RLC model of interconnects, also known as the
transmission line model, is the most accurate approximation of the actual behavior of
interconnects.
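To illustrate why a single lumped section overestimates delay relative to a distributed representation, the following minimal Python sketch computes the Elmore delay of an RC ladder as the number of sections grows. The use of the Elmore delay as a first-order metric, the simpler L-type section (series resistor followed by a shunt capacitor), and the function name are assumptions made for illustration; they are not taken from the text above.

```python
def elmore_delay_ladder(r_total, c_total, n_sections):
    """Elmore delay at the far end of an n-section RC ladder.

    Each section is a series resistor r_total/n followed by a shunt
    capacitor c_total/n (an L-type section, chosen here for simplicity).
    """
    r_seg = r_total / n_sections
    c_seg = c_total / n_sections
    # Capacitor i (1-based) is charged through i series resistors.
    return sum(i * r_seg * c_seg for i in range(1, n_sections + 1))


if __name__ == "__main__":
    R, C = 100.0, 50e-15  # illustrative values only: 100 ohm, 50 fF
    for n in (1, 2, 10, 1000):
        # Delay normalized to R*C: starts at 1.0 and approaches 0.5.
        print(n, elmore_delay_ladder(R, C, n) / (R * C))
```

With one section the normalized delay is RC, whereas a finely divided ladder approaches RC/2, which is the sense in which the single lumped section is pessimistic.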

2.1.3 Effect of Inductance on Global Signal Lines

In spite of shrinking dimensions and increasing clock frequencies in nanometer-scale
technologies, it has been shown that inductance can be safely ignored for global signal
lines that are longer than 10 mm [2,43]. This is due to various factors discussed next.
First, it has been shown that, for long global signal lines, the signal response to a step
input is over-damped when the line is modeled using the complex distributed-RLC
model. This response can be approximated using a distributed-RC model without
significant error [43]. Second, inductance is not as significant a problem in
minimum-width global lines as it is in clock and power/ground lines that are several
times the minimum width. It has been estimated that inductance becomes an issue in a
global line only if its width is at least eight times the minimum width [2]. Third, in the
high-performance buses that we consider in this research, designers minimize inductive
effects by ensuring that current return paths for worst-case input
patterns are kept within limits. This is normally achieved by placing power/ground
planes above and/or below the layer in which the high-performance bus is routed
and also by routing shield wires in the same layer as the bus [44]. Finally, in
recent times, architectural trends have shifted toward improving power/performance
(or Watt/MIPS) efficiency by using shorter pipelines and multi-core architectures,
rather than just improving performance by increasing clock speed. Thus, in current
and future generation microprocessors, clock frequencies are not expected to
increase exponentially, as was predicted until a few years ago. This trend also contributes
to keeping inductive effects in check for global lines.

For the reasons outlined above, we do not consider inductive effects in our
work. Using an RC model, interconnect energy can be estimated as discussed next.

2.1.4 Energy Estimation

Self transitions are defined as transitions on the self or area capacitance, which is
the parasitic capacitance between a bus line and the ground/$V_{DD}$ plane. Coupling
transitions are defined as transitions that occur on the coupling capacitance, which
is the parasitic capacitance between two wires on the same plane. Figure 2.1 shows
self and coupling capacitances for a 5-bit bus. Note that there can be two types
of coupling capacitances for a wire of length $l_{wire}$: adjacent coupling capacitance
$C_{c1} = l_{wire} \times c_{i,i\pm1}$ and non-adjacent coupling capacitance $C_{cx} = l_{wire} \times c_{i,i\pm x}$,
where $x \ge 2$. The adjacent coupling capacitance is the most dominant. Hence, it
is most often considered in energy and delay estimation, and other (non-adjacent)
capacitances are ignored.

Self transitions in a wire are of two types: charge (0 → 1) and discharge (1 → 0).
Coupling transitions in a pair of adjacent wires are of three types: coupling
charge transitions (00 → 01,¹ 00 → 10, 10 → 11, and 01 → 11), coupling discharge
transitions (01 → 00, 10 → 00, 11 → 10, and 11 → 01), and toggle transitions (01 → 10
and 10 → 01). Note that if the total number of self and coupling (charge, discharge,
and toggle) transitions is reduced, bus energy dissipation is reduced significantly.
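The transition classes listed above can be counted mechanically for a pair of consecutive bus words. The following Python sketch is one possible way to do so; the function name, the dictionary keys, and the convention that bit i of an integer represents wire i are assumptions made for illustration.

```python
from collections import Counter


def count_transitions(prev, cur, width):
    """Count self and adjacent-pair coupling transitions between two bus words.

    Bit i of each integer is taken to be wire i (an assumed convention).
    """
    counts = Counter()
    p = [(prev >> i) & 1 for i in range(width)]
    c = [(cur >> i) & 1 for i in range(width)]

    # Self transitions: one per wire that changes value.
    for i in range(width):
        if p[i] != c[i]:
            counts["self_charge" if c[i] else "self_discharge"] += 1

    # Coupling transitions: one per adjacent wire pair (i, i+1).
    for i in range(width - 1):
        a_sw = p[i] != c[i]
        b_sw = p[i + 1] != c[i + 1]
        if a_sw and b_sw:
            # Both wires switch: opposite directions form a toggle;
            # the same direction causes no coupling transition.
            if c[i] != c[i + 1]:
                counts["coupling_toggle"] += 1
        elif a_sw or b_sw:
            # Exactly one wire switches: classify by its direction.
            rising = c[i] if a_sw else c[i + 1]
            counts["coupling_charge" if rising else "coupling_discharge"] += 1
    return counts
```

Summing these counts over a trace of bus words gives totals of the kind used in activity-based energy estimates.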

The energy consumption and energy dissipation of a bus in a given time interval
$t$ are given by:

\[ E_{cons,avg} = \left[ N_s \cdot C_w + C_{c1} \cdot (N_c + 2 \cdot N_t) \right] \cdot V_{DD}^2 \cdot f_{clk} \cdot t, \tag{2.1} \]

\[ E_{diss,avg} = \left[ N_s \cdot C_w + C_{c1} \cdot \left( \frac{N_c}{2} + \frac{N_d}{2} + N_t \right) \right] \cdot V_{DD}^2 \cdot f_{clk} \cdot t, \tag{2.2} \]

where $C_w = C_{line} + C_{rep} = l_{wire} \times c_{line} + C_{rep}$ is the self capacitance of the
wire including the contribution of repeaters, $N_s$ is the total number of self-charge
transitions recorded on the bus in time interval $t$, and $N_c$, $N_d$, and $N_t$ are the numbers
of coupling-charge, coupling-discharge, and coupling-toggle transitions, respectively,
recorded in the same interval. Thus, only charging transitions that require current
flow from the power supply to charge the parasitic capacitances are used to determine
energy consumption, whereas current flow from the power supply (during charging)
and current flow into the ground (during discharging) of the parasitic capacitances
account for energy dissipation. Energy consumption and dissipation are equal on
average, though their instantaneous values may differ.

¹ For two lines $i$ and $j$, this notation represents the transition $V_i^{ini} V_j^{ini} \rightarrow V_i^{fin} V_j^{fin}$.
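As a small illustration, the consumption expression (Equation 2.1) can be transcribed term by term into Python. This is only a sketch: the function and argument names are hypothetical, SI units are assumed, and the numeric values in the usage are illustrative rather than results from this work. The usage evaluates a single clock cycle, so that the product of clock frequency and interval equals one.

```python
def energy_consumed(n_s, n_c, n_t, c_w, c_c1, v_dd, f_clk, t):
    """Equation 2.1, transcribed term by term (SI units assumed).

    n_s, n_c, n_t: self-charge, coupling-charge, and coupling-toggle counts
    c_w, c_c1:     self and adjacent coupling capacitance of the wire (F)
    """
    switched_cap = n_s * c_w + c_c1 * (n_c + 2 * n_t)
    return switched_cap * v_dd ** 2 * f_clk * t


if __name__ == "__main__":
    # Illustrative single-cycle evaluation; all numbers here are assumed.
    f_clk = 1.68e9
    e = energy_consumed(n_s=3, n_c=2, n_t=1, c_w=392.6e-15, c_c1=91.72e-15,
                        v_dd=1.1, f_clk=f_clk, t=1.0 / f_clk)
    print(f"{e:.3e} J")
```

A toggle is weighted twice as heavily as a coupling charge because the voltage across the coupling capacitance swings by twice the supply voltage.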

2.1.5 Delay and Performance

When designing circuits, it is necessary to ensure that a signal is fully transmitted
across a wire in a given time. This time should be at least the propagation delay
of the wire, which depends on wire and driver sizes and also on the interaction with
neighboring wires, which is referred to as inter-wire crosstalk. Due to crosstalk, the
propagation delay $t_p$ of a wire $k$ (called the victim), which is a function of transitions
in its neighboring wires $k-1$ and $k+1$, can be expressed as follows, including the
effect of load (receiver) capacitance [45]:

\[ t_p = 0.69 (R_D + R_w) \cdot C_r + g \cdot C_w \cdot (0.38 R_w + 0.69 R_D), \tag{2.3} \]

where $R_w$ and $R_D$ are the wire and driver resistances, respectively, $C_r$ is the input
(gate) capacitance of the receiver, and $g$ is the delay correction factor due to inter-wire
coupling between wires separated by the minimum spacing; $g$ is a function of the
capacitance ratio $r = C_{c1}/C_w$. The wire resistance $R_w$ is estimated using the resistivity
at a design temperature of 100°C. The various crosstalk conditions occurring when
the victim wire $k$ experiences a rising (0 → 1) transition (denoted as ↑) are listed in
Table 2.1. A corresponding table of delay factors can be constructed for a victim wire
experiencing a falling (1 → 0) transition (↓).

Crosstalk mode    k-1, k, k+1    Delay factor (g)
mode-0            ↑, ↑, ↑        1 + 0r
mode-1            ↑, ↑, -        1 + 1r
mode-2            ↑, ↑, ↓        1 + 2r
mode-2            -, ↑, -        1 + 2r
mode-3            -, ↑, ↓        1 + 3r
mode-4            ↓, ↑, ↓        1 + 4r

Table 2.1. Bus crosstalk conditions and models for a rising transition in the middle
(victim) wire.

In the worst case (toggle or oppositely switching transitions on both sides of the
victim), the delay is:

\[ t_{wc} = 0.69 (R_D + R_w) \cdot C_r + (1 + 4r) \cdot C_w \cdot (0.38 R_w + 0.69 R_D). \tag{2.4} \]
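The dependence of delay on the crosstalk mode can be sketched in Python by using the worst-case expression with the factor 1 + 4r replaced by the mode-dependent factor 1 + m·r from Table 2.1. The function name and the argument names are hypothetical, and any numeric inputs a reader supplies are their own assumptions.

```python
def propagation_delay(r_d, r_w, c_w, c_r, r, mode):
    """Victim-wire delay for crosstalk mode m in 0..4 (delay factor 1 + m*r).

    r_d, r_w: driver and wire resistance (ohm)
    c_w, c_r: wire self capacitance and receiver input capacitance (F)
    r:        coupling-to-self capacitance ratio
    """
    if mode not in range(5):
        raise ValueError("mode must be an integer from 0 to 4")
    g = 1 + mode * r
    return 0.69 * (r_d + r_w) * c_r + g * c_w * (0.38 * r_w + 0.69 * r_d)


def worst_case_delay(r_d, r_w, c_w, c_r, r):
    """Mode-4 delay: both neighbors switch opposite to the victim."""
    return propagation_delay(r_d, r_w, c_w, c_r, r, mode=4)
```

Because the coupling ratio multiplies the wire term only, the spread between the best and worst modes grows with the ratio r.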

It is clear that the width of the clock pulse to the circuit should be more than $t_{wc}$ to
ensure that the signal propagates completely to the destination, i.e., $t_{bus\_clk} \ge t_{wc}$.
To ensure that this does not impact performance, repeaters/buffers are used to divide
long wires into several sections and hence reduce propagation delay. Assuming that
the size of each repeater is $h$ times the size of a minimum-sized inverter (which is
technology-dependent) and $k$ is the number of repeaters needed to achieve optimum
delay on the interconnect, these can be calculated using:

\[ h = \sqrt{\frac{R_0 \cdot C_{int}}{C_0 \cdot R_{int}}} \quad \text{and} \tag{2.5} \]

\[ k = \sqrt{\frac{0.4 (R_{int} \cdot C_{int})}{0.7 (C_0 \cdot R_0)}}, \tag{2.6} \]

where $C_{int} = c_{line} + 4 \cdot c_{i,i\pm1}$ is the total per-unit-length capacitance of a
wire leading to the worst-case delay impact, $C_0$ and $R_0$ are the capacitance and
resistance of a minimum-sized inverter, and $R_{int} = r_{line}$ is the per-unit-length wire
resistance [46].


2.2 Interconnect Optimization Techniques

Several techniques have been proposed to ensure that interconnect power and
performance are not adversely affected by technology scaling. We discuss these next.

2.2.1 Data Encoding

In general, system-level encoding techniques fall under three categories, based on
whether they use redundancy in space (extra bus lines), time (extra
cycles), or voltage (additional distinct voltage levels) [47]. In particular, the use of
time redundancy has been demonstrated to be as effective as space redundancy
for decreasing the average switching activity, and issues due to extra cycle overheads
have been addressed by using compression [48-50]. Different modes of signaling (level
and transition signaling) can also be used to reduce bus switching activity.

The bus-invert (BI) code is a low-power encoding scheme designed to limit the
average power of the bus [51]. It performs well when patterns to be transmitted
are randomly distributed in time and no information about pattern correlation is
available. Therefore, this method is most appropriate for encoding the information
on data buses. A redundant control line INV is needed to signal to the receiving
end of the bus the encoding mode in the current cycle. The encoding depends on
the Hamming distance (i.e., the number of bit differences) between the value of the
encoded bus lines at time $t-1$ (also counting the redundant line at time $t-1$)
and the corresponding value at time $t$. The Hamming distance is compared to $n/2$,
where $n$ is the bus width (assumed to be even without loss of generality). If the
Hamming distance between two successive patterns is larger than $n/2$, the current
value is transmitted with inverted polarity and the control line is asserted; otherwise,
the current value is transmitted as is, and the INV line is de-asserted.
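The bus-invert decision described above can be sketched in a few lines of Python. The function names are hypothetical, and the way the INV line enters the Hamming distance (adding the previous INV value, since sending the word unmodified would drive INV to 0) is one reading of "also counting the redundant line" and should be treated as an assumption.

```python
def bus_invert_encode(data, prev_bus, prev_inv, n):
    """Return (bus, inv) for one n-bit word under bus-invert coding."""
    mask = (1 << n) - 1
    # Hamming distance to the lines currently on the bus, counting INV
    # (sending the word as is would set INV to 0).
    distance = bin((prev_bus ^ data) & mask).count("1") + prev_inv
    if distance > n // 2:
        return (~data) & mask, 1  # send inverted, assert INV
    return data & mask, 0         # send as is, de-assert INV


def bus_invert_decode(bus, inv, n):
    """Recover the original word at the receiving end."""
    mask = (1 << n) - 1
    return (~bus) & mask if inv else bus & mask


if __name__ == "__main__":
    bus, inv = 0x00, 0
    for word in (0xFF, 0x0F, 0x0E):
        bus, inv = bus_invert_encode(word, bus, inv, 8)
        assert bus_invert_decode(bus, inv, 8) == word
        print(f"word={word:#04x} bus={bus:#04x} inv={inv}")
```

The receiver needs only the INV line and one row of XOR gates, which is why the decoding cost is small compared to the encoder.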

If the words transmitted on the bus are independent and uniformly distributed, the
average number of transitions per clock cycle is reduced by less than 25% of the original
value, due to the binomial distribution of the distance between consecutive patterns
[52]. Major drawbacks of the BI technique are the required redundant bus
line and the overhead of the encoder logic that decides whether the
Hamming distance exceeds $n/2$. The encoding latency, in particular, is quite significant,
as discussed next.

In BI, encoding consists of three sequential steps. First, the Hamming distance is
computed. To do this, the current $n$-bit pattern and the $n$-bit pattern that
was transmitted on the bus in the previous cycle are bitwise XOR-ed, and the number
of 1s in the result is counted. This step requires constant time for the
bitwise XOR and $O(n)$ to $O(\log_2 n)$ time for counting, depending upon the counter
structure used. In the second step, the Hamming distance is compared with $n/2$ to check
which is greater; this can be completed in $O(n)$ to $O(\log_2 n)$ time, again depending
on the hardware structure used. Finally, the current pattern is inverted or sent as is,
and this takes constant time. Thus, BI encoding takes at least $O(\log_2 n)$ time.

More recently, odd/even bus invert (OEBI) [53] and coupling-driven bus invert
(CBI) [54] encoding schemes, designed to reduce transitions on the coupling
capacitance between adjacent bus lines, were proposed. In OEBI, even and odd bit positions
can be encoded (with bus inversion) independently, and two invert lines are used to
indicate one of four modes of transmission: 00 (none of the bits are inverted), 01 (only the
even-numbered bits are inverted), 10 (only the odd-numbered bits are inverted), and 11 (all
bits are inverted). This is based on the observation that by inverting only the odd or
even bits, a coupling toggle transition can be reduced to a coupling charge/discharge
transition [53]. The scheme assigns weights of 1 and 4 to coupling charge/discharge
and toggle transitions, respectively, to estimate coupling energy dissipation. Based
on the current and previous input patterns, the total coupling energy dissipation for
each of the four modes is estimated. Then the mode that will result in the least
coupling energy dissipation is chosen, and data is transmitted on the bus in that form.
In a similar manner, the CBI encoding technique examines pairs of adjacent bits in
the same position for the current and previous input patterns and estimates coupling
activity. The differences here are: (1) only one invert line is used to indicate whether
the transmitted data is in inverted or non-inverted form; and (2) it uses weights of
1 and 2 for coupling charge/discharge and toggle transitions, respectively. Note that
neither OEBI nor CBI considers self transitions to decide the inversion mode, while
BI considers only self transitions.
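The OEBI mode selection described above can be sketched as follows. This is a simplified illustration, not the encoder from [53]: the function names are hypothetical, bit 0 is assumed to be an even-numbered wire, ties are broken in favor of the first mode listed, and the coupling contributed by the two invert lines themselves is ignored.

```python
def coupling_cost(prev, cur, n, toggle_weight):
    """Weighted coupling activity over adjacent wire pairs (bit i = wire i)."""
    cost = 0
    for i in range(n - 1):
        a0, b0 = (prev >> i) & 1, (prev >> (i + 1)) & 1
        a1, b1 = (cur >> i) & 1, (cur >> (i + 1)) & 1
        # Change in the voltage difference across the pair:
        # 0 (none), 1 (charge or discharge), or 2 (toggle).
        swing = abs((a1 - b1) - (a0 - b0))
        if swing == 1:
            cost += 1
        elif swing == 2:
            cost += toggle_weight
    return cost


def oebi_encode(data, prev_bus, n):
    """Pick the OEBI mode with the least estimated coupling cost."""
    even = sum(1 << i for i in range(0, n, 2))
    odd = sum(1 << i for i in range(1, n, 2))
    modes = {"00": 0, "01": even, "10": odd, "11": even | odd}
    best = min(modes,
               key=lambda m: coupling_cost(prev_bus, data ^ modes[m], n, 4))
    return best, data ^ modes[best]
```

Calling `coupling_cost` with a toggle weight of 2 instead of 4, and choosing only between the non-inverted and fully inverted word, gives a corresponding sketch of the CBI decision.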

Bus encoding is also often used to reduce crosstalk. Crosstalk-aware encoding
schemes can be one of two types: those that have memory or those that are memo-
ryless. If an encoding scheme has memory then each codeword is dependent on the
word that came before it. Thus, each codeword has its own codebook of valid words
that can come after it. On the other hand, if an encoding is memoryless then any
codeword can follow any other codeword. The minimum number of wires needed to
encode 32 bits with memory is 40 and without memory is 46 [55]. Thus, the extra
wiring overhead is 25% for an encoding scheme with memory and 44% for optimal
encoding without memory.


2.2.2 Wire Spacing and Shielding

Inserting $V_{DD}$/GND wires, known as shields, is a popular method to avoid crosstalk in
high-performance buses. Signal isolation due to the presence of shields prevents both
noise and the increase in delay caused by switching of coupled lines. A dense fabric interconnect
architecture with shield lines inserted after every signal wire was proposed in [56].
Shield insertion also reduces inductive effects because it creates a shorter return path
to ground for the current flowing through signal wires. However, inserting shield
wires between every pair of signal wires results in large area costs, increases wire
congestion, and may end up requiring more metal layers, leading to higher production
costs. Alternatively, wires can simply be spaced farther apart to achieve a similar effect.
Though spacing does not eliminate coupling noise, it reduces the value of the coupling
capacitance (since capacitance is inversely proportional to the spacing) and at the
same time reduces power dissipation, since the total capacitive load of the line also
decreases. In many cases, this is a significant gain compared to shielding, which
eliminates the noise at the cost of extra power dissipation [57].

2.3 Architecture-Level Simulators and Early-Stage Design

At the very early stages of design deﬁnition, microarchitects start with analytical
cycles-per-instruction (CPI) performance models that lead to trace or execution-
driven, cycle-by-cycle simulators. Full or sampled benchmark traces are processed
through such simulators, driven by a microarchitecture parameter file. The goal of
this design space exploration phase is to optimize the choice of microarchitectural
parameters for CPI performance under design constraints known at that stage. The
performance model is typically written in a standard systems programming language
such as C or C++ and is designed to project execution times (in cycles) for input
application traces; it typically does not model the actual execution of the instruc-
tions, but only the execution timing. More recently, power dissipation models that
are based on counting the number of transitions occurring in microarchitecture blocks
have also been added to these simulators.

Several architecture-level simulators have been developed and used in
academia and industry: Wattch [58], SimplePower [59], TEM2P2EST [60], WArPE [61],
Sim-Panalyzer [62], IBM Turandot/PowerTimer [31], AccuPower [63], and HotSpot
[64]. Interconnect/bus models used in these simulators suffer from many drawbacks.
First, none of the existing simulators have models for estimating power consumption
and delay that depend on inter-wire coupling activity. For example, the SimplePower tool,
which models only memory system buses (between different levels of caches and/or
main memory), uses an interconnect model that considers only the self-capacitance of
bus lines calculated based on an empirical formula [65]. The Wattch simulator, which
models only the result bus in the microarchitecture, also does not take into account
inter-wire coupling activities when estimating power dissipation. Second, thermal models for
buses are not available in most current simulators. The HotSpot tool addresses this
need to some extent, but it contains a temperature model for the interconnect system
as a whole rather than for each bus and hence cannot track activity-dependent
temperature changes in key processor buses [66]. Temperature gradients and delay
variations cannot be estimated using this tool.


2.4 Our Experimental Methodology

2.4.1 Interconnect Geometry and Technology Data

For all the interconnects considered in this work, we assume that they are routed in the
topmost metal layer. A representation of wires in this layer is shown in Figure 2.1.

Figure 2.1. Layout of wires routed in the topmost metal layer. Self and coupling
capacitances are shown. The bottom plate represents the $V_{DD}$/GND plane.

Values for wire geometry (wire width, spacing, etc.), technology, and equivalent
circuit parameters, such as capacitance and resistance of a global line, for various
nanometer-scale technologies were obtained from the ITRS document and are listed in
Table 2.2. Note that wire spacing is assumed to be equal to wire width per ITRS [1].
In this work, we use 130 nm and 45 nm as the representative technologies for a
current-generation and a future-generation microprocessor, respectively, and compare
our results for these designs.

Figure 2.2. Wire segment of length $l_{opt}$ between two repeaters.

In current-generation microprocessors, a global signal bus is typically a few
millimeters long; we consider a bus of length 6 mm using the numbers reported in [44]
for a Pentium-4 microprocessor. Using this length ($l_{wire}$), we estimate the number
of repeaters ($k$) that need to be inserted to enable non-inverting transmission using
Equation 2.6, and then we find the inter-repeater segment length $l_{opt} = \frac{6 \times 10^{-3}}{k}$.
In the remainder of this work, all experiments and analysis focus on a single wire
segment of length $l_{opt}$, driven by a sending-end repeater of size $h$ and connected to
a receiving-end repeater of the same size, as shown in Figure 2.2. In addition to its self
capacitance, this wire segment has a capacitance, due to its sending- and receiving-end
repeaters, that can be calculated as $C_{rep} = h \times C_0$, where $C_0$ is the sum of the
input and output capacitances of a minimum-sized inverter.
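As a worked illustration of the repeater sizing step, the following Python sketch evaluates Equations 2.5 and 2.6 with the inverter and wire parameters listed in Table 2.2. Several points are assumptions rather than statements from the text: the capacitance and resistance entries are treated as per-metre values, Equation 2.6 is multiplied by the 6 mm wire length to obtain a repeater count for the whole wire, and the count is rounded up to the next even integer to keep the bus non-inverting. The dictionary and function names are hypothetical.

```python
import math

# Per-node values as listed in Table 2.2, converted to SI units.
NODES = {
    "130nm": dict(r0=6.23e3, c0=4.65e-15, c_line=44.06e-12, c_c1=91.72e-12, r_line=98.02e3),
    "90nm": dict(r0=9.04e3, c0=3.14e-15, c_line=32.77e-12, c_c1=76.84e-12, r_line=198.45e3),
    "65nm": dict(r0=9.6e3, c0=2.25e-15, c_line=25.07e-12, c_c1=68.42e-12, r_line=475.62e3),
    "45nm": dict(r0=13.2e3, c0=1.5e-15, c_line=19.05e-12, c_c1=58.12e-12, r_line=905.05e3),
}


def repeater_design(p, l_wire=6e-3):
    """Return (h, k, l_opt) for one technology node."""
    c_int = p["c_line"] + 4 * p["c_c1"]  # worst-case capacitance per metre
    r_int = p["r_line"]                  # wire resistance per metre
    h = math.sqrt(p["r0"] * c_int / (p["c0"] * r_int))  # Equation 2.5
    # Equation 2.6, scaled by the wire length (assumption).
    k_raw = l_wire * math.sqrt(0.4 * r_int * c_int / (0.7 * p["c0"] * p["r0"]))
    k = 2 * math.ceil(k_raw / 2)         # round up to an even count (assumption)
    return h, k, l_wire / k


if __name__ == "__main__":
    for name, params in NODES.items():
        h, k, l_opt = repeater_design(params)
        print(f"{name}: h={h:.2f}, k={k}, l_opt={l_opt * 1e3:.3f} mm")
```

Under these assumptions the sketch gives repeater sizes and counts in line with the h and k rows of Table 2.2, for example a 1 mm segment at the 130 nm node.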

2.4.2 Parasitic Capacitance Extraction

The ITRS roadmap provides values only for self and adjacent-wire coupling
capacitance for current and future technology nodes. Hence, to estimate the coupling
capacitances between all pairs of wires (adjacent as well as non-adjacent wire pairs),
we employed the publicly available three-dimensional capacitance extraction program
called FastCap [7]. Using the wire geometry parameters from ITRS (see Table 2.2
for values) to model a coplanar global bus layout, similar to the one shown in
Figure 2.1, we extracted values of self and all coupling capacitances for the middle wire
of a 32-bit bus. Figure 2.3 shows the percentage distribution of these capacitances for
various technologies. From the figure, we observe that, for current 130 nm and 90 nm
technologies, non-adjacent coupling capacitances are somewhat non-negligible (they
contribute approximately 10%), while even in a future 45 nm node, non-adjacent
capacitances account for about 8% of the total capacitance. Our energy model, which is
described in a later chapter, considers the effect of two non-adjacent coupling
capacitances, $C_{c2}$ and $C_{c3}$, for better accuracy.

                                                        Technology node
Parameter                                               130 nm    90 nm     65 nm     45 nm
Number of metal layers                                  8         9         10        10
Wire width, w_i (nm)                                    335       230       145       103
Wire thickness, t_i (nm)                                670       482       319       236
Relative permittivity of dielectric, ε_r                3.3       2.8       2.5       2.1
Thermal conductivity of dielectric, k_ild (W/mK)        0.6       0.19      0.12      0.07
Clock frequency, f_clk (GHz)                            1.68      3.99      6.73      11.51
Supply voltage, V_DD (V)                                1.1       1.0       0.7       0.6
Maximum current density in a wire, j_max (MA/cm^2)      0.96      1.5       2.1       2.7
Height of inter-layer dielectric, t_ild (nm)            724       498       329       243
Resistance of minimum size inverter, R_0 (kΩ)           6.23      9.04      9.6       13.2
Capacitance of minimum size inverter, C_0 (fF)          4.65      3.14      2.25      1.5
Self capacitance of wire, c_line (pF/m)                 44.06     32.77     25.07     19.05
Adjacent coupling capacitance, c_i,i±1 (pF/m)           91.72     76.84     68.42     58.12
Non-adjacent coupling capacitance, c_i,i±2 (pF/m)       6.49      4.65      3.56      2.76
Non-adjacent coupling capacitance, c_i,i±3 (pF/m)       2.53      1.76      1.29      0.98
Resistance of wire, r_line (kΩ/m)                       98.02     198.45    475.62    905.05
Optimal repeater size, h                                74.95     70.25     51.77     49.45
Optimal # of repeaters for non-inverting bus, k         6         8         12        16
Coupling ratio including effect of repeaters, r         2.065     2.329     2.716     3.039

Table 2.2. Technology, wire geometry, and equivalent circuit parameters for topmost
layer interconnect. Values in the top eight rows are from the International Technology
Roadmap for Semiconductors (ITRS) document [1]. Values listed in the next three
rows are from Mui et al. [2]. The values for the self and coupling capacitances were
extracted using the FastCap tool, and the value for $r_i$ was calculated using the formula
$r_i = \rho_{Cu}/(w_i \cdot t_i)$, where $\rho_{Cu} = 2.2 \times 10^{-8}$ Ω-m. Values of $h$ and $k$ were found using
expressions given in Section 2.1.5, and $r = c_{i,i\pm1}/(c_{line} + h \times C_0)$.
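The wire resistance row of Table 2.2 follows directly from the resistivity formula quoted in the table caption (copper resistivity divided by the wire cross-section). The short Python sketch below reproduces it from the listed wire widths and thicknesses; the function and constant names are hypothetical.

```python
RHO_CU = 2.2e-8  # ohm-metre, as quoted in the Table 2.2 caption


def wire_resistance_per_metre(width_nm, thickness_nm):
    """Resistance per unit length of a rectangular wire, in ohm/m."""
    area = (width_nm * 1e-9) * (thickness_nm * 1e-9)  # cross-section in m^2
    return RHO_CU / area


if __name__ == "__main__":
    geometry = {"130nm": (335, 670), "90nm": (230, 482),
                "65nm": (145, 319), "45nm": (103, 236)}
    for node, (w, t) in geometry.items():
        print(f"{node}: {wire_resistance_per_metre(w, t) / 1e3:.2f} kOhm/m")
```

Since both width and thickness shrink from node to node, the per-unit-length resistance rises by roughly a factor of nine between the 130 nm and 45 nm nodes.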

2.4.3 Simulation Infrastructure and Verification of its Correctness
[Figure 2.3: stacked bar chart titled "Scaling of Self and Coupling Capacitances"; x-axis: Technology Node; y-axis: Percentage of Total Capacitance (0% to 100%); legend: Cself, Cc1, Cc2, Cc3, Cc_rest]

Figure 2.3. Distribution of self and coupling capacitance values for the middle wire
of a 32-bit bus extracted using the FastCap tool [7]. Cself = self capacitance of the
wire; Cc1 = coupling capacitance between the wire and its adjacent neighbor; Cc2 =
coupling capacitance between the wire and a non-adjacent wire with 1 wire between
them; Cc3 = coupling capacitance between the wire and a non-adjacent wire with 2
wires between them; Cc_rest = sum of coupling capacitances between the wire and
other wires with 3 or more wires between them. For current and near-future ITRS
technology nodes (up to 45 nm), non-adjacent coupling capacitances are somewhat
non-negligible: they contribute approximately 8-10%.

Computer simulators have long been used to study both hardware and
software behavior. They allow the collection of information and statistics during the
execution of programs. Various types of information, such as memory profiles,
instruction profiles, and timing statistics, can be gathered from these simulators. For
this research, we use the sim-outorder out-of-order processor simulator from the
SimpleScalar microarchitecture tool set, which is very widely used in academia [67].
Many microarchitectural simulators used in industry also closely resemble and/or
are derived from SimpleScalar or its derivatives [31, 58-64]. We added several
enhancements to the sim-outorder simulator to facilitate our analysis and optimization
efforts. These are described next.

Support for analyzing bus data: We added support for tracing and analyzing
the data transmitted on high-performance processor core buses. The original
sim-outorder contains only a functional model of a superscalar processor and does
not have the ability to track the data that is transmitted between the microarchitectural
blocks in the pipeline. We modified the simulator to track and analyze, on
a cycle-accurate basis, the data transmitted on load/store address, load/store data,
instruction, and result buses in the processor core.

Wire energy, temperature, and delay models: We also added our wire
energy, temperature, and delay models to the simulator. While energy dissipation
and delay of our target buses (including the temperature impact on delay) can be
estimated on a per-cycle basis, temperature estimates are obtained at a coarser
granularity, i.e., after every 100K cycles or so. This is because temperature is a
slow-changing effect that does not warrant per-cycle estimation. More details on how we
determine the granularity of temperature simulation, depending on the fabrication
technology used, are discussed later in Section 3.5.3.
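The two-rate structure described above (per-cycle energy accounting with a much less frequent temperature update) can be sketched as follows. This is only a schematic illustration and not the simulator's actual code: the class name, the callback interface, and the choice to pass the accumulated energy to the thermal update are all assumptions, and the thermal model itself is left as a user-supplied function.

```python
class BusMonitor:
    """Accumulates per-cycle energy and triggers a periodic thermal update."""

    def __init__(self, thermal_update, interval=100_000):
        # thermal_update is a hypothetical callback taking the energy
        # accumulated since the previous update.
        self.thermal_update = thermal_update
        self.interval = interval
        self.cycle = 0
        self.energy_since_update = 0.0

    def step(self, cycle_energy):
        """Record one simulated cycle's bus energy."""
        self.cycle += 1
        self.energy_since_update += cycle_energy
        if self.cycle % self.interval == 0:
            self.thermal_update(self.energy_since_update)
            self.energy_since_update = 0.0


if __name__ == "__main__":
    updates = []
    monitor = BusMonitor(updates.append, interval=4)  # tiny interval for the demo
    for _ in range(10):
        monitor.step(1.0)
    print(updates, monitor.energy_since_update)
```

Keeping the thermal update behind a callback makes it easy to change the update interval per technology node without touching the per-cycle path.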

Integration with other thermal analysis tools: Recently, a tool called
HotSpot [64], also based on SimpleScalar, was developed to estimate substrate
(active layer) temperatures using the Wattch model for energy estimation [58]. Even
though the on-chip interconnect system is a major contributor to the power budget,
it was not modeled accurately in HotSpot. We have integrated our models with
HotSpot, thus creating a microarchitecture-level simulation tool for full-chip energy
and thermal analysis.

As a result of our enhancements to the simulator, the running time is somewhat
longer. The original sim-outorder without enhancements executes ~200K instruc-
tions per second [67] whereas our modiﬁed simulator executes ~110K instructions
per second while running detailed energy and temperature simulations at the granu-
larities described earlier in this subsection. To reduce simulation time for analyzing
a large number of programs on our simulator, we used a shared Linux cluster for our
experiments [68].

We verified the correctness of our modified simulator with regard to four aspects,
as discussed next.

Functional correctness: All the changes we made to the simulator add to
its instrumentation capabilities and do not change it functionally with regard to
the microarchitectural model it seeks to implement. We verified this in two ways.
First, we executed and compared the outputs for a suite of six
microbenchmarks, supplied along with the SimpleScalar toolset, using the original
(unmodified) simulator and our modified version. As expected, the program
outputs from both versions matched exactly. Second, we compared several performance
metrics recorded by the simulator (number of instructions executed, L1/L2 cache
misses, branch misprediction rate, etc.) and found that these matched in the original
simulator and our modified version for the six microbenchmarks we tested. These
tests show that the functional behavior of our modified simulator is unchanged
compared to the original one.

Instrumentation correctness: The original sim-outorder simulator contains a
sufficiently detailed microarchitectural model that enabled us to gather the data transmitted
on our target buses in each cycle. Thus, instruction addresses and instructions were
gathered from the program counter and the fetch stage of the simulator, respectively,
data addresses by computing the target address for load/store instructions, load/store
data by monitoring L1 cache reads/writes, and ALU result bus data by monitoring
the outputs of the functional units in the execute stage. As such, the instrumentation
capabilities we added to the simulator are correct by design.

Model correctness: We tested whether the models we constructed represent the actual
energy/thermal behavior of buses, consistent with previously known data and/or
estimates. For our energy model, discussed later in Section 3.3, results were compared
with circuit simulation of a distributed-RC wire using the Cadence Spectre simulator.
Our model yielded energy results that differed by only about 4.53% from
those from Spectre, while being faster and much less complex. Our thermal model,
discussed in Section 3.4, is based on the well-known analogy between electrical and
thermal quantities that has been used widely in earlier work to model chip thermal
structures [66, 69-71] and verified using finite element modeling (FEM)
simulations [72, 73]. The average and maximum temperatures obtained using our model,
while running SPEC CPU 2000 benchmarks on the simulator, were consistent with
previously published data in [66], although our model estimates bus energies more
accurately by considering actual bus traffic values, estimates interconnect temperatures
at a finer granularity, and tracks the spatiotemporal variation of temperature, all of which were
absent in earlier models. The worst-case temperatures that global signal lines may
potentially attain, assuming they carry currents at maximum density all the time, were
estimated using FEM-based techniques in [72, 73]. Signal lines, which are the focus
of our work, do not carry currents at maximum density all the time, and hence their
temperatures are likely to be somewhat lower than estimates obtained using worst-case
FEM analysis. We verified that results using our model were consistently lower than
worst-case estimates and remained so for the different technology nodes we tested:
130 nm, 90 nm, and 45 nm.

Implementation correctness: We also tested that modifications were implemented
correctly in the simulator and that the desired outputs were obtained. For all
six microbenchmarks, we collected tracedumps of various buses using our
simulator and verified manually that the data in the tracedump matched the expected
value for that type of data. For example, each entry in the instruction address
tracedump should match the program counter value, which is in a known range of memory
addresses, and each entry in the instruction tracedump should correspond to known
instructions in the processor's instruction set architecture. We found these to be true
in all the tracedumps we tested. We also prepared several small synthetic traces of
data streams and verified that results obtained from hand calculations matched those
obtained using equations from our energy model implemented in the simulator.

2.4.4 Target Systems and Benchmarks

The SimpleScalar platform can simulate various RISC microarchitectures. For our
work, we use the Alpha 21264 microarchitecture, representing general-purpose
superscalar processors. The Alpha 21264 architecture is modeled as a 4-issue superscalar
processor with out-of-order execution and with 32-bit address, 64-bit data, and
128-bit (fetch width = 4) instruction buses between the processor and L1 cache [74]. Other
details of the microarchitecture and memory system for our target system are presented
in Table 2.3.

For evaluation on the Alpha target system, we use the SPEC CPU2000 benchmark
suite, which consists of 26 programs drawn from real user CPU-intensive applications
[75]. The little-endian SPEC benchmark executables we used were downloaded from
the SimpleScalar website [76]. These programs were compiled for the Alpha 21264
instruction set using a Compaq Alpha compiler with SPEC peak settings and included
all linked libraries. We ran our experiments using the ref input set from the SPEC
CPU2000 suite.

Since the time taken to simulate an entire SPEC CPU2000 benchmark is very

long—typically several days on a cycle-accurate simulator—we used the 100 million

Processor Core
  Clock rate            1.68 GHz (130 nm), 11.51 GHz (45 nm) [1]
  Fetch / Issue width   4 each
  LSQ                   8 entries

Memory System
  P↔L1 bus              Non-pipelined; 64-bit data and 128-bit instruction
  L1 D-cache            Virtually-indexed physically-tagged (VIPT), 64KB, 2-way
                        set associative, 64B block size, LRU policy, 3 cycle
                        hit latency, write-through cache
  L1 I-cache            Virtually-indexed virtually-tagged (VIVT), 64KB, 2-way
                        set associative, 64B block size, LRU policy, 1 cycle
                        hit latency
  L1 MAF                8 entries
  L1↔L2 bus             Non-pipelined; 128-bit data/instruction lines and 38-bit
                        address lines (21 bits for block index and 17 bits for tag)
  L2 cache              Physically-indexed physically-tagged (PIPT), 2MB,
                        direct-mapped, 64B block size, LRU policy, 12 CPU cycles
                        hit latency, write-back policy, operating at 2x CPU
                        clock cycle
  L2↔M bus              Non-pipelined; 64-bit data/instruction lines and 38-bit
                        address lines

Table 2.3. Configuration of our target system and benchmarks. This processor-
memory system configuration is based on the Alpha 21264 processor.

single simulation points recommended by the SimPoint toolset to collect results for only
a representative slice of the program [77,78]. Although the accuracy of representative
samples from SimPoint has not been explicitly validated using energy/temperature
metrics, its use in the design and evaluation of microarchitecture-level energy reduction
techniques is widespread in the literature. Several works that use phase classification
techniques like SimPoint for microprocessor energy evaluation have been surveyed in [79].

CHAPTER 3

ACTIVITY-DRIVEN ENERGY AND
TEMPERATURE MODEL

Accurate early-stage modeling techniques for signal interconnect energy dissipation
and temperature are becoming necessary for current designs. This chapter describes a
detailed energy model and a first-of-its-kind thermal model for interconnects [80,81].

3.1 Introduction

As fabrication technologies scale down, interconnects are becoming the dominant
factor in determining performance, power, cost, and reliability characteristics of a
system. Interconnect scaling impacts performance because wire delay has continued
to increase relative to that of logic. In recent years, power density in microprocessors
has doubled every three years, primarily because feature sizes and clock frequencies
have scaled faster than operating voltages [82]; this rate is expected to increase further
in future technology generations [64]. The on—chip interconnect system is already the
most important contributor to dynamic power; in current microprocessors (130 nm
technology), interconnects are reported to contribute about 51% of the total on-chip
dynamic power dissipation and global signal lines—address, instruction, data, and
control buses routed in the top-most layer metal—about 21% [4]. As technology scales
down, dynamic power dissipation will still remain important even as leakage power

increases. It has been estimated that even in the 45 nm technology node, dynamic

power will contribute to about 46% of the total power dissipation [2]. Supply voltage
scaling and smaller sizes will reduce dynamic power dissipation due to logic in future
technologies at a faster rate than in interconnects and hence, interconnect dissipation
will contribute a larger share to total dynamic power. Rising interconnect power
dissipation will lead to localized Joule heating and temperature rise in wire metal
that can affect wire delay due to temperature-dependence of resistivity and/ or cause
wire breakage due to thermal stresses and electromigration.

As power densities continue to increase, thermal effects in wires are becoming
important due to the reasons outlined next. Signal transmission over a line/wire i is
associated with current flow, which results in $I^2R$ power dissipation, where I is the
magnitude of current and R is the resistance of the wire. This dynamic switching
power depends on: (1) the self capacitance (capacitance between the line and ground)
of the wire $C_{line}$, (2) the coupling capacitance $C_{i,j}$ between line i and any other line
j, (3) the self and coupling activity factors (which in turn depend on self transitions
on line i and coupling transitions between line i and any other line j, respectively),
(4) the supply voltage, and (5) the bus clock frequency. Advances in technology have
resulted in ever-higher values of $C_{i,j}/C_{line}$ due to higher wire aspect ratios and smaller
inter-wire spacings; among all $C_{i,j}$, the adjacent coupling capacitance ($C_{i,i\pm1}$) dominates
the other (non-adjacent) coupling capacitances. With newer technologies, bus
clock frequency has also continued to increase. The supply voltage is scaling down
but at a rate not enough to offset the rate of increase in the other two. Thus, the
net effect is that the $I^2R$ power is continuing to increase as technology scales down,
and consequently local heating in wires is becoming a concern. Further, since global

signal wires are separated by multiple layers of low-K dielectrics from the substrate

that is connected to the heat sink, and since these dielectrics have poor thermal
conductivities, heat cannot be removed from the wire efﬁciently. Energy dissipation
and/ or thermal effects in global signal lines are further aggravated due to the follow-
ing reasons: (1) increasing use of repeaters in long signal lines to reduce delay leads
to higher energy dissipation [46]; (2) a steady increase in the number of metal layers,
particularly the number of global metal layers, also increases overall energy dissipa-
tion; and (3) long via separations in upper metal layers contribute to higher average
wire temperatures—vias are normally better thermal conductors than surrounding
low-K dielectrics [83].

By virtue of their carrying smaller currents than power supply lines, energy dis-
sipation and thermal characteristics of signal (both clock and data) lines have not
been the subject of serious study. But this will need to change as clock frequencies
increase with technology scaling. Higher frequency also means that the large ﬂuc-
tuating line currents drawn by the bus driving circuitry can inﬂuence resistive and
inductive voltage drop in power supply lines, since long global signal lines present
a high load capacitance. In this work, we develop a model for activity-dependent
bus line energy dissipation and temperature rise, and apply it to different types of
microprocessor core buses. While we do not study clock lines in this work, our model
can be easily applied to thermal analysis of clock networks and estimate temperature
impact on signal delay, skew, and reliability.

The dynamic power dissipated in a bus wire, which ultimately determines its
temperature as discussed earlier, is both time and information dependent. It depends
on the type of information (address, instruction, data, or control) being carried by

the bus because the information type influences the self and coupling activity factors;

for example, the number of coupling transitions is expected to be higher for data
streams that are more random in nature than for others. The type of information
also directly inﬂuences the temperature characteristics of the wire because of the
presence of unequal numbers of idle cycles between successive transfers; address and
instruction buses typically carry new information every cycle as opposed to data
buses where more idle cycles are likely to be present between data accesses. These
idle cycles, during which no power is dissipated in the bus lines (assuming they
hold the last value that was transmitted), present opportunities for cooling. Hence,
interconnect thermal models that estimate temperature and reliability based on the
assumption that all bus lines carry the maximum RMS current density (worst-case
scenario) [83,84], and models that use switching activity factors to estimate average
self-heating power and determine temperature rise [66], may result in inaccuracies.
This may, in turn, lead to incorrect interconnect lifetime prediction, since dynamic
heating and cooling effects are not taken into account. Also, designers will be forced
to allow higher-than-required safety margins and, as a result, the system will incur
higher packaging costs. Hence, energy dissipation and thermal effects in buses are
best studied using microarchitectural simulators and real workloads; in this work, we
present models to facilitate this.

Detailed thermal models and workload-based studies for estimating temperature
distributions in substrate [64] and interconnects are essential for facilitating early-
stage design of future high-performance processors. For such designs, a pessimistic
temperature assumption will lead to costly and perhaps unrealistic guard bands and
high cooling system costs. On the other hand, an optimistic assumption will lead to

underestimation of the chip power and leakage, and may lead to shorter lifetime and
lower reliability. Higher wire temperatures can have a dramatic impact on performance
since temperature directly affects wire delay. Typically, the Elmore delay of
an on-chip wire increases approximately 5% for every 20°C rise in temperature [37].
In addition to its absolute temperature, wire delay also depends on the temperature
gradient between the sending and receiving ends. The growing popularity of chip
multiprocessing (CMP) and simultaneous multi-threading (SMT) will increase bus
switching activities, since, potentially, uncorrelated data from different streams are
transferred on the same bus, resulting in higher per-wire energy dissipation and tem-
peratures. Thus, realistic temperature models and early-stage estimates are essential
for meeting design goals and avoiding temperature-induced problems in silicon.

The organization of the rest of this chapter is as follows. Section 3.2 briefly
reviews related work. Next, in Sections 3.3 and 3.4, we present our energy dissipation
and thermal models for global signal lines. Following that, in Section 3.5, we discuss
our simulation environment and methodology. Then, in Section 3.6, we present results
from simulations by applying our models in an execution-driven simulator. Finally,
we summarize in Section 3.7.

3.2 Related Work and Our Contributions

Some methods for architecture-level interconnect power analysis have been proposed
[59,85]. Earlier modeling methods estimated bus energies based on self transitions
only [59], whereas recent models also consider adjacent inter-wire capacitances for
energy calculations [85]. Thermal effects in interconnects and their implications for

performance, current density, and reliability have been studied in [21,37]. Recently,

interconnect thermal models have been proposed in [66,83]. But these models either
perform a worst-case analysis using maximum current metrics suitable only for power
supply lines [83] or consider average switching activities [66]. Such approaches are not
suitable for analyzing signal lines since: (1) signal lines carry much less current than
power supply lines, and (2) their energy dissipation and thermal characteristics are
tied to actual traffic patterns (with intermittent idling) carried on the bus. A large
body of work exists on low-power bus encoding, much of which also uses bus energy
models similar to the ones described in [59] or [85]. Some of the older bus encoding
schemes have been surveyed in [86]. Newer schemes include odd/ even bus-invert [53],
coupling-driven bus-invert [54], transition pattern coding [87], and leakage-aware bus
encoding [88]. The contributions of this work are outlined next.

First, we present an accurate model to estimate bus line energy dissipation that
can be used in a trace-driven setup or in an execution-driven simulator. Existing bus
energy models, like the one proposed in [85], only estimate energy dissipation consid-
ering the bus as a whole, not in each line, whereas our model is capable of estimating
energy dissipated in each bus line. Also, these models do not account for the non-
uniform dissipation of energy across the wire length, which we do in our model. As
we shall see later, these factors are necessary to model dynamic temperature effects
in buses, both temporally and spatially, across wires. Our bus model is also more
accurate because it considers the effect of capacitive coupling between adjacent and
non-adjacent wire pairs on switching energy in addition to energy dissipated in the
self capacitance. Our work is the ﬁrst to show that switching transitions in parasitic
capacitances between non-adjacent wire pairs account for a signiﬁcant (7—8%) portion

of the total energy dissipation and hence this contribution should not be neglected

in bus energy models. Further, we model the effect of repeaters, which increase the
self capacitance and hence self energy dissipation. This is so because the output ca-
pacitance of a repeater adds to the self capacitance of the line segment that it drives,
and the input gate capacitance of a repeater adds to that of its input line.

Second, using our bus line energy dissipation model, we study the effectiveness of
some existing low-power bus encoding techniques when used for data and instruction
bus encoding. To our knowledge, no previous work has studied these bus encoding
techniques using realistic trafﬁc from SPEC CPU2000 benchmark programs; most of
them have used random trafﬁc patterns that do not behave like real-world instruction
and data streams. In this context too, we use realistic technology parameters from
the ITRS roadmap for current and future nanometer technology nodes.

Finally, we present a thermal model and a methodology to estimate the tempera-
tures of individual wires of a global signal bus during dynamic simulation. Our model
incorporates the effect of inter-layer heat transfer (heat conduction from the substrate
and lower metal layers through the inter-layer dielectric) and intra—layer heat transfer
between adjacent bus lines through the inter-metal dielectric. It can also estimate the
temperature gradient between the sending and receiving ends of the bus and hence,
it can be used to estimate any dynamic delay variations due to Joule heating. Our
model can also be used to estimate the effect of varying substrate temperatures on
wire self heating, although in this work, we assume a constant substrate temperature

for simplicity. Speciﬁc results we obtained are listed next.

• We estimate from simulations using our model for the 130 nm technology node
  that, during the time interval taken to commit one billion instructions in the
  pipeline, high-performance bus wire temperatures rise by 10-37°C for various
  SPEC CPU2000 benchmarks. This is solely due to Joule heat dissipated due to
  wire switching activities.

• In the future 45 nm technology node, wire temperature rise for the same set of
  benchmarks and simulation sample was found to be between 20 and 58°C.

• We observed that instruction and data bus wires attained absolute temperatures
  in the range 80.3-104°C and 97.6-123.7°C, in 130 nm and 45 nm processors,
  respectively, during the course of our simulation, showing that signal lines attain
  significant temperatures too.

• Significant wire temperature gradients of magnitude between 16 and 25°C were
  found to be most common between the sending and receiving ends of the wires
  during the course of simulation.

• Some significant correlation was found to exist between energy dissipation
  behavior and wire temperature rise in buses across time; short, intermittent cycles
  of high energy-dissipating switching activity trigger step changes in temperature.

3.3 Bus Line Energy Dissipation Model

In this section, we develop our bus line energy dissipation model that calculates energy

dissipated as a result of a switching (both self and coupling) transition. This energy

model is then used to determine change in wire temperature that occurs due to the

combined effect of self-heating in the wire and heat conduction into the surrounding

medium. Values for wire geometry (wire width, spacing, etc.) and technology and

equivalent circuit parameters, like capacitance and resistance of a global line, that we
used for various nanometer-scale technologies were listed in Table 2.2.

As described earlier, the energy drawn from the supply rails by the driving gates
of a bus line is dissipated as $I^2R$ losses in the bus line. This results in temperature
rise in wires due to the self-heating effect. Existing bus energy models, like that
in [85], only provide expressions for total energy dissipated in the bus. From the
thermal design point of view, the energy dissipated in each bus line is important
since it helps determine the temperature rise in each individual wire separately. This
can be estimated using our model described below. First, we describe how the energy
dissipated due to line self capacitance can be found; the procedure to estimate the
contribution of repeaters to this self energy is also explained. Next, we explain how
energy dissipated due to inter-wire coupling capacitances, including adjacent coupling

and non-adjacent coupling capacitances, can be estimated.

3.3.1 Energy Dissipated due to Line Self Capacitance

Define $V_i = V_i^{fin} - V_i^{ini}$, i.e., the difference between the final and initial voltages
on line i. Note that $V_i^{ini}$ and $V_i^{fin}$ can take either one of two values: 0 or $V_{DD}$.
Thus, $V_i = V_{DD}$ implies that the self capacitance of line i charges due to a rising
transition (0 → 1), whereas $V_i = -V_{DD}$ means that it discharges due to a falling
transition (1 → 0). For each transition, energy that is dissipated in wire i due to
charging or discharging of the self capacitance of the wire can be calculated as:
$E_i^S = 0.5 \times (C_{line} + C_{rep}) \cdot V_i^2$, where $C_{line}$ is the self capacitance of the wire and
$C_{rep}$ is the total capacitance of repeaters on the line. The energy $E_i^S$ is called self
energy since it involves only the self or line capacitance (including the contribution
of repeaters). Values for $C_{line}$ are obtained by multiplying the per-unit length
capacitances given in Table 2.2 with wire length and values for $C_{rep}$ are computed
using Equation 2.6.

3.3.2 Energy Dissipated due to Inter-Wire Capacitance

The second component of energy dissipation is coupling energy, which is influenced by
the charging, discharging, or toggling of the coupling capacitance $C_{i,j}$ between two
lines i and j. A coupling charge transition occurs between the two lines when $V_i = 0$
or $V_j = 0$, and $V_i + V_j = V_{DD}$; 00 → 01, 00 → 10, 10 → 11, and 01 → 11 are the
possible cases. A coupling discharge transition occurs when $V_i = 0$ or $V_j = 0$, and
$V_i + V_j = -V_{DD}$; 01 → 00, 10 → 00, 11 → 10, and 11 → 01 are the possible cases. A
coupling toggle transition occurs when $V_i, V_j \neq 0$ and $V_i = -V_j$, i.e., when a 01 → 10
or 10 → 01 transition occurs. In all three cases, the coupling energy dissipated in line
i due to $C_{i,j}$ is obtained as: $E_{i,j}^C = 0.5 \times C_{i,j} \cdot (V_i^2 - V_i \cdot V_j)$, $i \neq j$. Values of $C_{i,i\pm1}$
are given in Table 2.2. It can be seen that the toggle case dissipates an equal amount
of energy in both coupled lines ($E_{i,j}^C = E_{j,i}^C = C_{i,j} \cdot V_{DD}^2$ from the expression above,
i.e., $2 \times C_{i,j} \cdot V_{DD}^2$ for the pair), but the charge and discharge transitions result in
coupling energy dissipation equal to $0.5 \times C_{i,j} \cdot V_{DD}^2$ in the line that charges/discharges.

Thus, the total energy dissipated in a segment of the wire between two repeaters
of bus line i is the sum of the self energy and coupling energies and is given by the
following equation.

$$E_i = E_i^S + \sum_{\forall j,\, j \neq i} E_{i,j}^C \qquad (3.1)$$
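
The per-line bookkeeping of Equation 3.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `line_energies`, the list-based bus representation, and the capacitance values used in the usage note are assumptions made for the example and are not taken from the simulator described in this work.

```python
def line_energies(prev_bits, next_bits, c_self, c_couple, v_dd):
    """Energy dissipated in each bus line for one bus transition (Eq. 3.1).

    prev_bits, next_bits : 0/1 value of every line before and after the transfer
    c_self[i]            : C_line + C_rep of line i (hypothetical values)
    c_couple[i][j]       : coupling capacitance C_{i,j} between lines i and j
    v_dd                 : supply voltage
    """
    n = len(prev_bits)
    # V_i = V_i^fin - V_i^ini, which is +V_DD, 0, or -V_DD for every line
    dv = [(after - before) * v_dd for before, after in zip(prev_bits, next_bits)]
    energies = []
    for i in range(n):
        # Self energy: E_i^S = 0.5 * (C_line + C_rep) * V_i^2
        e_i = 0.5 * c_self[i] * dv[i] ** 2
        for j in range(n):
            if j != i:
                # Coupling energy: E_{i,j}^C = 0.5 * C_{i,j} * (V_i^2 - V_i * V_j)
                e_i += 0.5 * c_couple[i][j] * (dv[i] ** 2 - dv[i] * dv[j])
        energies.append(e_i)
    return energies
```

For a two-line bus with unit self capacitance, a coupling capacitance of 2, and a unit supply, the toggle 01 → 10 yields the same energy in both lines, while the charge transition 00 → 01 dissipates energy only in the line that switches.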

     

 

[Figure: sender with driver resistance $R_d$, distributed RC wire (n subsegments),
and receiver with gate capacitance $C_r$.]

Figure 3.1. Distributed-RC model of the wire segment divided into n subsegments.

3.3.3 Distributed-RC Line Energy Model

This energy is dissipated non-uniformly across the length of the segment, as we show
next. Consider the schematic of the segment of a distributed RC-wire shown in
Figure 3.1. For this segment of length $l_{opt}$, the total wire electrical resistance $R_w$
and the parasitic capacitance $C_w$, which includes the self and coupling capacitance,
can be divided equally across n subsegments. Thus, each subsegment has a resistance
$R_w/n$ and capacitance $C_w/n$. The driving repeater is represented by its resistance $R_d$.
At the end of the wire is the receiving repeater, contributing a gate capacitance $C_r$
to the load. Let the energy dissipated in the kth subsegment of wire i be represented
by $E_{i,k}$. Consider the 4-stage RC network corresponding to the wire shown in
Figure 3.1; this represents a distributed RC line. For a unit input signal u(t), the
s-domain voltages at the four nodes will be:

$$\begin{aligned}
v_1(s) &= V_{DD}\left(s^{-1} - m_0^1 + m_1^1 s + m_2^1 s^2 + \cdots\right)\\
v_2(s) &= V_{DD}\left(s^{-1} - m_0^2 + m_1^2 s + m_2^2 s^2 + \cdots\right)\\
v_3(s) &= V_{DD}\left(s^{-1} - m_0^3 + m_1^3 s + m_2^3 s^2 + \cdots\right)\\
v_4(s) &= V_{DD}\left(s^{-1} - m_0^4 + m_1^4 s + m_2^4 s^2 + \cdots\right),
\end{aligned}$$

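where $m_0^j$, $m_1^j$, $m_2^j$, etc. represent the first moment, second moment, and so on,
at node j. The corresponding currents through the capacitors are
$i_{c1}(s) = Y_1(s) \cdot v_1(s)$, $i_{c2}(s) = Y_2(s) \cdot v_2(s)$, $i_{c3}(s) = Y_3(s) \cdot v_3(s)$, and
$i_{c4}(s) = Y_4(s) \cdot v_4(s)$, where $Y_i(s) = sC_i$ is the admittance of each subsegment [89].
This gives:

$$\begin{aligned}
i_{c1}(s) &= sC_1\left(s^{-1} - m_0^1 + m_1^1 s + m_2^1 s^2 + \cdots\right)V_{DD}\\
i_{c2}(s) &= sC_2\left(s^{-1} - m_0^2 + m_1^2 s + m_2^2 s^2 + \cdots\right)V_{DD}\\
i_{c3}(s) &= sC_3\left(s^{-1} - m_0^3 + m_1^3 s + m_2^3 s^2 + \cdots\right)V_{DD}\\
i_{c4}(s) &= sC_4\left(s^{-1} - m_0^4 + m_1^4 s + m_2^4 s^2 + \cdots\right)V_{DD}.
\end{aligned}$$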

From the circuit, it is clear that $I_1 = i_{c1} + i_{c2} + i_{c3} + i_{c4}$, $I_2 = i_{c2} + i_{c3} + i_{c4}$,
$I_3 = i_{c3} + i_{c4}$, and $I_4 = i_{c4}$. In general, we can write the following equation for
current through a resistor i after discarding higher order moments:

$$I_i(s) = V_{DD}\left(\sum_{j \in D_i} C_j - s \sum_{j \in D_i} C_j m_0^j + s^2 \sum_{j \in D_i} C_j m_1^j\right), \qquad (3.2)$$

where the set $D_i$ represents all the downstream nodes of node i, $v_j$ is the voltage
at the j-th node, and $C_j = C_w/n$ is the capacitance of the j-th subsegment. The
downstream capacitance of node i, which is the sum of the capacitance of subsegments i
through n, can be evaluated as:

$$\sum_{j \in D_i} C_j = \frac{C_w}{n}(n - i). \qquad (3.3)$$

We can express the power series in Equation 3.2 in transfer function form with
poles ($p_1^i, p_2^i$) and zeros ($z_1^i, z_2^i$). However, it has been shown in [90] that for
interconnect lines, the transfer function using two-pole analysis has a special form in
which the numerator polynomial is a constant as shown in the following equation.

$$H_i(s) = \frac{1}{1 + b_1^i s + b_2^i s^2} \qquad (3.4)$$

Expanding this transfer function about s = 0, we have
$H_i(s) = 1 - b_1^i s + ((b_1^i)^2 - b_2^i)s^2$ [89]. Comparing with Equation 3.2, we get:

$$b_1^i = \frac{\sum_{j \in D_i} C_j m_0^j}{\sum_{j \in D_i} C_j} = \frac{\sum_{j \in D_i} C_j\, t_{ED}^j}{\sum_{j \in D_i} C_j}, \qquad (3.5)$$

since the Elmore delay $t_{ED}^j$ of the line until the j-th subsegment is given by the
first moment $m_0^j$ [90]. Thus we have:

 

 

 

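$$I_i(s) = V_{DD}\,\frac{C_w}{n}(n-i)\cdot\frac{1}{\left(1+\frac{s}{p_1^i}\right)\left(1+\frac{s}{p_2^i}\right)}
= V_{DD}\,\frac{C_w}{n}(n-i)\cdot\frac{p_1^i p_2^i}{p_1^i p_2^i + (p_1^i + p_2^i)s + s^2} \qquad (3.6)$$

$$I_i(t) = \mathcal{L}^{-1}[I_i(s)]
= V_{DD}\,\frac{C_w}{n}(n-i)\cdot\frac{p_1^i p_2^i}{p_2^i - p_1^i}\left(e^{-p_1^i t} - e^{-p_2^i t}\right), \qquad (3.7)$$

where $\mathcal{L}^{-1}[\cdot]$ is the inverse Laplace transform operator. In Equation 3.6, we have
used the transfer function of the form:

$$G_i(s) = \frac{p_1^i p_2^i}{p_1^i p_2^i + (p_1^i + p_2^i)s + s^2}$$

By equating $H_i(s)$ and $G_i(s)$, we obtain $b_1^i = \frac{p_1^i + p_2^i}{p_1^i p_2^i}$. Now the amount of Joule
heat dissipated in the i-th subsegment can be estimated as:

$$E_i = \int_0^{\infty} [I_i(t)]^2 \cdot \frac{R_w}{n}\,dt
= \frac{(n-i)^2 C_w^2 V_{DD}^2 R_w}{n^3} \times \frac{p_1^i p_2^i}{2(p_1^i + p_2^i)}
= \frac{(n-i)^2 C_w^2 V_{DD}^2 R_w}{n^3} \times \frac{1}{2 b_1^i}$$

Substituting for $b_1^i$ from Equation 3.5, and rearranging, we get:

$$E_i = \frac{C_w}{2n} V_{DD}^2 \times \frac{(n-i)^3 R_w C_w}{n^2 \sum_{j \in D_i} t_{ED}^j}. \qquad (3.8)$$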

 

 

We observe that the first term in Equation 3.8, i.e., $0.5 \times \frac{C_w}{n} V_{DD}^2$, corresponds to
the Joule heat dissipated in a subsegment assuming that energy is dissipated uniformly
across the wire length. The second term can be regarded as a correction factor
indicating that the energy dissipation is non-uniform across the length, i.e., higher
energy is dissipated at the subsegments near the sending end than those near the
receiving end. This is because for increasing i ($0 \le i \le n - 1$), the numerator reduces
and the denominator increases in value and hence the correction factor reduces overall.
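
To make the structure of Equation 3.8 concrete, the following Python sketch evaluates the uniform term and the correction factor for every subsegment. It is a sketch under stated assumptions, not the implementation used in our simulator: the helper names are hypothetical, the Elmore delays are computed for a plain uniform RC ladder (optionally with a driver resistance `r_d` and receiver load `c_r`), subsegments are indexed from 0 at the sending end, and the parameter values in any call are illustrative. It is not expected to reproduce the normalized values of Table 3.1, which depend on the actual technology parameters.

```python
def elmore_delays(n, r_w, c_w, r_d=0.0, c_r=0.0):
    """Elmore delay at each node of a uniform n-stage RC ladder (assumed model).

    Each subsegment has resistance r_w/n and capacitance c_w/n; r_d is the
    driver resistance and c_r the receiver gate capacitance at the far end.
    """
    r_seg, c_seg = r_w / n, c_w / n
    caps = [c_seg] * n
    caps[-1] += c_r
    delays = []
    for j in range(n):
        # Resistance shared by the paths source->node j and source->node k
        t_j = sum((r_d + (min(j, k) + 1) * r_seg) * caps[k] for k in range(n))
        delays.append(t_j)
    return delays


def subsegment_energies(n, r_w, c_w, v_dd, r_d=0.0, c_r=0.0):
    """Joule heat per subsegment following the form of Equation 3.8."""
    t_ed = elmore_delays(n, r_w, c_w, r_d, c_r)
    uniform = 0.5 * (c_w / n) * v_dd ** 2          # first term of Eq. 3.8
    energies = []
    for i in range(n):
        downstream_delay = sum(t_ed[i:])           # sum over j in D_i
        correction = (n - i) ** 3 * r_w * c_w / (n ** 2 * downstream_delay)
        energies.append(uniform * correction)
    return energies
```

With these assumptions the energies decrease monotonically from the sending end towards the receiving end, which is the qualitative behavior described above.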

We validated our model by comparing with the energy distribution obtained using the
Cadence Spectre simulator for different numbers of subsegments (n = 10, 50, and 100).
The normalized energy dissipated in each wire subsegment for the n = 10 case, obtained
using our model and Cadence Spectre simulations with 130 nm ITRS parameters,
is shown in Table 3.1. The average error of our model is 4.53% and the
maximum error is 7.75% compared to Spectre results. Note that this difference arises
because the derivation of Equation 3.8 ignored higher-order moments of the node
voltages. For n = 50 and 100 subsegments too, we found that energy values from
our model are very close to those from Spectre; the average errors in these cases were
3.94% and 3.51%, respectively. As a trade-off between model complexity (in terms of
simulation time) and accuracy, we use n = 10.

 

 

 

 

 

 

 

Subsegment #    Normalized energy             % Error
                Equation 3.8    Spectre

0               0.132565        0.123033      7.75
1               0.125624        0.117550      6.87
2               0.118420        0.112207      5.54
3               0.112894        0.107001      5.50
4               0.106988        0.101931      4.96
5               0.101150        0.096996      4.28
6               0.095827        0.092194      3.94
7               0.089974        0.087525      2.79
8               0.084548        0.082986      1.88
9               0.080009        0.078578      1.82

Average                                       4.53

 

 

 

 

 

Table 3.1. Comparison of normalized energy dissipated in wire subsegments
obtained using our model and Cadence Spectre simulations for 10 subsegments.
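
The error column of Table 3.1 can be recomputed from the two energy columns. The short Python sketch below does this, taking the relative error with respect to the Spectre value (an assumption about how the percentage was defined); the variable and function names are illustrative only.

```python
# Normalized energies from Table 3.1 (n = 10 subsegments)
MODEL = [0.132565, 0.125624, 0.118420, 0.112894, 0.106988,
         0.101150, 0.095827, 0.089974, 0.084548, 0.080009]
SPECTRE = [0.123033, 0.117550, 0.112207, 0.107001, 0.101931,
           0.096996, 0.092194, 0.087525, 0.082986, 0.078578]


def percent_errors(model, reference):
    """Relative error of the model against the reference, in percent."""
    return [100.0 * abs(m - r) / r for m, r in zip(model, reference)]


errors = percent_errors(MODEL, SPECTRE)
average_error = sum(errors) / len(errors)
max_error = max(errors)

for k, err in enumerate(errors):
    print(f"subsegment {k}: {err:.2f}%")
print(f"average: {average_error:.2f}%  maximum: {max_error:.2f}%")
```

The recomputed average and maximum agree with the 4.53% and 7.75% quoted in the text to within rounding.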

3.4 Thermal Model

In this section, we present our thermal model. This enhanced model can also estimate
the distribution of wire temperatures across the length of the wire segment, compared
to our earlier model [81]. Next, we brieﬂy introduce chip thermal structures and

discuss the heat transfer mechanism in modern chip packages.

3.4.1 Chip Thermal Structures and Heat Transfer

Figure 3.2 shows the cross sectional view of various layers in a chip package that
inﬂuence the way heat is transferred away from the active areas. The ﬁgure shows a
C4 / CBGA (ﬂip-chip) package with an attached heat sink and no forced air cooling.
For this type of packaging and cooling system, it has been found that there are two
heat transfer paths: a primary path that conducts away heat generated at the active
layer (substrate) through the heat spreader, attach material, and the heat sink, and a
secondary path that transfers heat from the substrate through the dielectric layers——
heat ﬂows from the bottommost to the topmost interconnect layer—and ﬁnally ﬂows
through C4 bumps, ceramic substrate, CBGA joints, and the printed circuit board
to the ambient air [66]. As mentioned earlier, models for estimating substrate tem-
peratures are available in tools like HotSpot [64,66] but detailed activity-dependent
models for estimating global signal wire temperatures are not. Next, we present
the model that will help estimate spatially-distributed wire temperatures in a wire

segment.

[Figure: cross-section of a C4/CBGA package showing the heat sink, thermal paste,
heat spreader, Si substrate, metal layers, and C4 pads, with the primary and
secondary heat transfer (HT) paths marked.]

Figure 3.2. Cross-sectional view of the different thermal structures of a C4/CBGA chip
package and the primary and secondary heat transfer paths.

3.4.2 Detailed Thermal Model

In the thermal model presented next, we consider any subsegment k as a point source
of Joule heat, called a thermal node. Using the well-known analogy between thermal
and electrical quantities, we can consider that the temperature difference between
two nodes corresponds to a voltage difference and the heat transfer rate to a current.
The ability of the wire segment to hold heat is modeled by its thermal capacitance and
the ability of the surrounding dielectric to conduct heat away from the wire segment
is modeled as the thermal resistance. These thermal circuit parameters are brought
together to form a thermal-RC network, shown in Figure 3.3(a) for a 5-wire bus,
across the same subsegment k in all wires.

By equating the rate of heat flowing into a node in the thermal equivalent circuit to
the rate of heat flowing out (analogous to Kirchhoff's current law in electrical circuits),
we obtain the following.

For the two edge wires:

$$P_{i,k} + P'_{i,k} = C_{i,k}\,\frac{d\theta_{i,k}}{dt} + \frac{\theta_{i,k} - \theta_0}{R_{i,k}} + \frac{\theta_{i,k} - \theta_{i\pm1,k}}{R_{inter}} \qquad (3.9)$$

and for the middle wires:

$$P_{i,k} + P'_{i,k} = C_{i,k}\,\frac{d\theta_{i,k}}{dt} + \frac{\theta_{i,k} - \theta_0}{R_{i,k}} + \frac{2\theta_{i,k} - \theta_{i-1,k} - \theta_{i+1,k}}{R_{inter}}, \qquad (3.10)$$

where $P_{i,k}$ is the instantaneous power dissipated in the kth subsegment of the
i-th wire, $P'_{i,k}$ is the equivalent power due to the effect of switching activity in lower
metal layers and the substrate, and $\theta_0$ is the ambient temperature (45 °C or 318.15 K)
inside the computer box. Note that these equations do not include heat that may
potentially flow through the vias. The reason for neglecting the via effect is given

[Figure: (a) thermal-RC network for wires (1) to (5) between a layer at substrate
temperature and a layer at ambient temperature, with adjacent wires linked by
$R_{inter}$; (b) wire cross-section geometry above a layer at ambient temperature.]

Figure 3.3. Thermal model. (a) Complete equivalent thermal-RC network for a 5-wire
bus. $P'_{1,k} = P'_{2,k} = \cdots = P'_{5,k}$, $R_{1,k} = R_{2,k} = \cdots = R_{5,k}$, $C_{1,k} = C_{2,k} = \cdots = C_{5,k}$,
and $P_{1,k}, P_{2,k}, \ldots, P_{5,k}$ are bus-activity dependent in the model shown. (b)
Geometry for calculating equivalent thermal resistances for a wire based on previous
work of Chiang et al. The lightly shaded regions and arrows represent heat flow
between the conductors or between layers (from a hotter to a cooler one).

in Section 3.4.2. The instantaneous or cycle-by-cycle power $P_{i,k}$ can be obtained
by dividing the energy $E_{i,k}$ obtained using Equation 3.8 by the clock cycle time.
However, in our microarchitectural simulations, we record the energy $E_{i,k}$ for a finite
interval and then divide it by the duration of the time interval to obtain the power
dissipated. This time duration is set as explained later in Section 3.5.3.

In the above equations, $C_{i,k}$, the thermal capacitance of the wire segment, is given
by: $C_{i,k} = C_s \cdot (t_i \cdot w_i)$, where $C_s$ is the specific heat per unit volume of the wire
metal, and $w_i$ and $t_i$ are wire dimensions as shown in Figure 3.3(b) and with values
given in Table 2.2. $R_{i,k}$ is the thermal resistance of the wire segment along the heat
transfer path as shown in Figure 3.3(b) and it can be calculated from the following
expression using wire geometry and thermal conductivity $k_{ild}$ of the inter-layer
dielectric (ILD) as described in [83]:

$$R_{i,k} = R_{spr} + R_{rect} = \frac{\ln\left(\frac{w_i + s_i}{w_i}\right)}{2 \cdot k_{ild}} + \frac{t_{ild} - 0.5\,s_i}{k_{ild}(w_i + s_i)}. \qquad (3.11)$$

The above expression is the sum of two terms: the ﬁrst is the spreading resis-
tance Rspr due to the spreading of heat from the face of the wire exposed to a
cooler layer (away from the substrate) in a trapezoidal manner, and the second is the
thermal resistance Rrect due to rectangular heat ﬂow as depicted in Figure 3.3(b).

Equations 3.9 and 3.10 can be solved to determine the wire temperature $\theta_{i,k}$.
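
One simple way to solve Equations 3.9 and 3.10 numerically is a forward-Euler time step over all wires of a subsegment, with the thermal resistance taken from Equation 3.11. The Python sketch below illustrates this; the choice of forward Euler, the function names, and the normalized parameter values in the usage note are assumptions made for illustration and do not describe the solver used in our simulator.

```python
import math


def thermal_resistance(w, s, t_ild, k_ild):
    """R_{i,k} = R_spr + R_rect from Equation 3.11."""
    r_spr = math.log((w + s) / w) / (2.0 * k_ild)      # trapezoidal spreading
    r_rect = (t_ild - 0.5 * s) / (k_ild * (w + s))     # rectangular heat flow
    return r_spr + r_rect


def euler_step(theta, p_wire, p_sub, c_th, r_th, r_inter, theta_0, dt):
    """Advance the wire temperatures of one subsegment by a time step dt.

    theta   : current temperature of each wire
    p_wire  : activity-dependent power P_{i,k} of each wire
    p_sub   : equivalent power P'_{i,k} from lower layers (same for all wires)
    c_th    : thermal capacitance C_{i,k}; r_th: thermal resistance R_{i,k}
    r_inter : lateral thermal resistance between adjacent wires
    theta_0 : ambient temperature
    """
    n = len(theta)
    updated = []
    for i in range(n):
        heat_out = (theta[i] - theta_0) / r_th
        if i > 0:                                       # left neighbour
            heat_out += (theta[i] - theta[i - 1]) / r_inter
        if i < n - 1:                                   # right neighbour
            heat_out += (theta[i] - theta[i + 1]) / r_inter
        d_theta = (p_wire[i] + p_sub - heat_out) / c_th
        updated.append(theta[i] + dt * d_theta)
    return updated
```

Edge wires have a single neighbour and therefore a single lateral term (Equation 3.9), while middle wires have two (Equation 3.10). With equal power in all wires the lateral terms cancel and each wire settles at $\theta_0 + (P_{i,k} + P'_{i,k}) \cdot R_{i,k}$; the time step must be chosen small compared to the thermal time constant for the explicit update to remain stable.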

Heat transfer from lower layers through the dielectric

Next we consider the temperature rise in global signal lines due to heat transfer
from underlying layers. This is needed because, in current C4/CBGA packages, a

secondary heat transfer path exists from the substrate through the interconnect layers.

Thus, some heat ﬂows from the substrate through the metal layers—bottommost to
the topmost interconnect—and ﬁnally ﬂows through C4 bumps, ceramic substrate,
CBGA joints, and the printed circuit board to the ambient air [66]. The temperature
increase due to this effect to each global wire can be estimated using the following

closed-form expression [83]:
M — NZ 2"“ [NZ—fr >2 - H (312)
_ ,_1kud,iS—_—a_tt -=,Jma$ p303“ '
where N is the number of layers of metal and $\rho_j$ is the resistivity of the metal line
(Copper). The values for $t_{ild,i}$, $k_{ild,i}$, $s_i$, and $t_i$, corresponding to different layers
of metal, were obtained from the ITRS roadmap.

Note that Equation 3.12 neglects the thermal capacitance of wire segments in
the lower layers. This is because wires at lower layers are usually thinner and shorter
(smaller w and t) and also have smaller lengths. Thus, $R_{inter}$, which depends on $t \cdot l$,
and $C_{th}$, which depends on $w \cdot t \cdot l$, both have negligible values, and only the dominant
$R_{th}$ terms are considered in this equation. The above equation also assumes that all
wiring tracks underneath the global bus are populated with power supply wires that
carry current at their maximum density ($j_{max}$).

The net effect of the secondary heat transfer path (from the substrate and lower
metal layers) is depicted as the constant current source Pi], k in the network shown in
Figure 3.3(a). Note that the Pi, ks are all equal since spatial variation in substrate
temperature across the width of the bus is neglected. This is valid, because in almost

all cases, the area footprint of the buses we study is well within the dimensions of the

underlying circuit block for which we know the substrate temperature.

Heat transfer from lower layers through vias

Joule heat generated in the lower metal layers can ﬂow to the global metal layer
through the ILD (as described in the previous subsubsection) and also, in parallel,
through the vias. However, heat transfer through vias occurs only within the range

of the thermal characteristic length $L_H$ of the wire [37,83]:

\[
L_H = \sqrt{\frac{t_i \cdot t_{ild} \cdot k_m}{k_{ild}\left(1 + 0.88\,\dfrac{t_{ild}}{w_i}\right)}}
\tag{3.13}
\]

where $k_m = 401$ W/mK is the thermal conductivity of copper. If a wire is
longer than $L_H$, the via heat transfer is negligible. Using parameters in Table 2.2,
$L_H$ was found to be 10.56 µm for 130 nm and 10.33 µm for 45 nm, which are much
smaller than our inter-repeater segment length $l_{opt}$. Hence, the heat transfer
through vias will always be negligible in the global buses we consider.
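
The characteristic-length calculation can be sketched numerically as follows. This is a minimal illustration that assumes Equation 3.13 has the form L_H = sqrt(t · t_ild · k_m / (k_ild · (1 + 0.88 · t_ild / w))). The function name and the wire dimensions below are assumptions chosen to be plausible for a 130 nm global wire; they are not claimed to be the Table 2.2 entries.

```python
import math


def thermal_characteristic_length(t, t_ild, w, k_m, k_ild):
    """Thermal characteristic (healing) length L_H of a wire, in metres.

    t, w   : wire thickness and width (m)
    t_ild  : inter-layer dielectric thickness (m)
    k_m    : thermal conductivity of the wire metal (W/mK)
    k_ild  : thermal conductivity of the ILD (W/mK)
    """
    spreading = 1.0 + 0.88 * t_ild / w          # quasi-2D heat spreading correction
    return math.sqrt((t * t_ild * k_m) / (k_ild * spreading))


K_COPPER = 401.0  # W/mK, value quoted in the text

# Hypothetical 130 nm-class global wire geometry (illustrative values only).
l_h = thermal_characteristic_length(
    t=0.67e-6, t_ild=0.724e-6, w=0.335e-6, k_m=K_COPPER, k_ild=0.6
)
print(f"L_H = {l_h * 1e6:.2f} um")
```

With dimensions of this order, L_H comes out at roughly ten micrometres, which is the same order as the values quoted above and far shorter than a typical inter-repeater segment.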

Lateral thermal coupling between wires

The lateral heat transfer between adjacent wires can be significant due to the
large exposed sidewall area in high aspect-ratio global lines and due to the difference
in activity rates of the neighboring lines (which creates a temperature difference and
hence lateral heat flow). It has been shown using FEM simulations that thermal
coupling is a significant phenomenon in global lines, particularly when high activity
wires are placed next to low activity ones [73]. In our model, this effect is captured
with a lateral inter-wire thermal resistance whose value depends on wire geometry
parameters, as shown in Figure 3.3(a), and the inter-metal dielectric (IMD) thermal
conductivity, $k_{imd}$, and is given by the expression:
8 .
Rinter = lopt X l‘- t. (3.14)
‘27an “l

Previous work on interconnect thermal modeling did not consider the effect of
inter-wire heat transfer [66]; our model incorporates this for better accuracy. For
simplicity, we assume that the ILD and IMD are the same material. Hence
$k_{imd} = k_{ild}$.

Thus, the temperature $\theta_{i,k}$ of the k-th subsegment of wire i is affected by the rate
of heat $P_{i,k}$ generated in it as a result of activity-dependent current flow, the thermal
capacitance $C_i$ of the wire metal, thermal resistances of surrounding inter-layer and
intra-layer dielectric $R_{i,k}$ and $R_{inter}$, respectively, and the temperature
$\theta_{i \pm 1,k}$ of the k-th subsegments of its adjacent wires, all of which are considered
in our model. A distribution of wire temperatures across the wire length can be obtained by
solving Eqs. 3.9 and 3.10 for a number of subsegments $k = 0, 1, \ldots, n$. The temperature
gradient $\Delta\theta_i$, or difference between the sending and receiving end temperatures,
can be estimated using: $\Delta\theta_i = \theta_{i,0} - \theta_{i,n}$, where n is the number
of subsegments.

3.4.3 Steady-State Thermal Model

The detailed thermal model discussed above is used to track activity-dependent tem-
perature variations in bus wires across time. However, due to its complexity, it
is somewhat difﬁcult to use in the temperature optimization methodologies that we
propose later in our research. Hence we develop an approximate version of this model,
known as the steady-state thermal model. This model is also used to estimate the
initial temperatures for the bus wires before starting detailed thermal simulations.
The steady-state model for three wires is discussed next. Consider three consecutive
wires $w_i$, $w_j$, and $w_k$ on a bus. When there is no bit reordering, data bits $b_i$,
$b_j$, and $b_k$ are carried on these lines. Let the corresponding power dissipation on these
wires be $P_i$, $P_j$, and $P_k$, respectively. We assume a steady-state temperature model
for thermal analysis of this wire set. In this model, the final temperature $T_{fin}$ of a
structure with initial temperature $T_{init}$ is: $T_{fin} = T_{init} + P \times R_t$, where P
is the power dissipated by the structure and $R_t$ is its thermal resistance. Thermal
resistances of global signal wires can be estimated based on their geometry using equations
given in [66,81] and wire power dissipation can be obtained using a microarchitecture-level
simulator. For three adjoining wires, the steady-state thermal equivalent circuit
is shown in Figure 3.4.

[Figure 3.4 diagram: nodes (i), (j), and (k) joined by $R_{inter}$, each connected through
$R_{th}$ to the ambient temperature $T_a$.]

Figure 3.4. Steady-state thermal equivalent circuit for three wires. Heat transfer
between wires is modeled by $R_{inter}$ and heat loss to surroundings by $R_{th}$. $P_i$
represents power dissipated in each wire due to switching activity and it can be found using
a microarchitecture-level simulator.

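Using Kirchhoff's law on the three nodes, we get the following equations:

\begin{align*}
P_i &= \frac{T_i - T_a}{R_{th}} + \frac{T_i - T_j}{R_{inter}}, \\
P_j &= \frac{T_j - T_a}{R_{th}} - \frac{T_i - T_j}{R_{inter}} - \frac{T_k - T_j}{R_{inter}}, \\
P_k &= \frac{T_k - T_a}{R_{th}} + \frac{T_k - T_j}{R_{inter}}.
\end{align*}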

In these equations, $R_{th}$ is the inter-layer thermal resistance, $R_{inter}$ the intra-layer
thermal resistance, and $T_a$ is the ambient temperature, assumed to be 45°C inside
the computer box. Solving this set of simultaneous equations using Mathematica, the
expression for the temperature of the middle wire is found to be:

\[
T_j = (P_i + P_k) \cdot \alpha + P_j \cdot (\alpha + \beta) + T_a,
\tag{3.15}
\]
\[
\text{where } \alpha = \frac{R_{th}^2}{3R_{th} + R_{inter}}
\text{ and } \beta = \frac{R_{th} R_{inter}}{3R_{th} + R_{inter}}.
\tag{3.16}
\]

Thus, we find that the temperature rise ($\Delta T_j = T_j - T_a$) in the middle wire is
proportional to a weighted sum of the power dissipated in itself and in its neighboring
wires.
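
The closed-form result can be checked numerically. The sketch below evaluates Equation 3.15, assuming the weights α = R_th² / (3R_th + R_inter) and β = R_th·R_inter / (3R_th + R_inter), and then substitutes the answer back into the three node equations. All function names and the power and resistance values are hypothetical and serve only to exercise the formulas.

```python
def middle_wire_temperature(p_i, p_j, p_k, r_th, r_inter, t_a):
    """Temperature of the middle wire from Eq. 3.15 and Eq. 3.16."""
    denom = 3.0 * r_th + r_inter
    alpha = r_th ** 2 / denom
    beta = r_th * r_inter / denom
    return (p_i + p_k) * alpha + p_j * (alpha + beta) + t_a


def check_node_equations(p_i, p_j, p_k, r_th, r_inter, t_a):
    """Return (T_i, T_j, T_k, residual of the middle-node equation)."""
    t_j = middle_wire_temperature(p_i, p_j, p_k, r_th, r_inter, t_a)
    g, h = 1.0 / r_th, 1.0 / r_inter
    # The outer-node equations are linear in T_i and T_k once T_j is known.
    t_i = (p_i + g * t_a + h * t_j) / (g + h)
    t_k = (p_k + g * t_a + h * t_j) / (g + h)
    residual = ((t_j - t_a) / r_th
                - (t_i - t_j) / r_inter
                - (t_k - t_j) / r_inter
                - p_j)
    return t_i, t_j, t_k, residual


# Hypothetical values: powers in W, resistances in K/W, ambient in deg C.
t_i, t_j, t_k, residual = check_node_equations(
    p_i=1.0e-3, p_j=2.0e-3, p_k=0.5e-3, r_th=5.0e3, r_inter=2.0e4, t_a=45.0
)
print(f"T_i={t_i:.3f}  T_j={t_j:.3f}  T_k={t_k:.3f}  residual={residual:.2e}")
```

A useful special case: when all three wires dissipate the same power P, no heat flows laterally and the expression collapses to T_j = T_a + P·R_th, since 3α + β = R_th.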

3.5 Simulation Environment and Methodology

We used the Alpha 21264 platform for this work. Details of the simulation infrastructure
for this platform were described earlier in Section 2.4.4.

3.5.1 Benchmarks and Sample Sizes

Previous work on temperature-aware microarchitecture design has characterized
benchmarks, mostly in the SPECint suite, as hot, medium, or cold benchmarks based
on the percentage of cycles that they are in violation of an 81.8°C threshold [64].
From the benchmarks used in that work, we chose three benchmarks that
were reported to result in extreme thermal stress (gcc, crafty, and vortex), and two
from the medium thermal stress group (gzip and mesa). We randomly chose seven
benchmarks that have not been characterized previously to complete the 12 benchmarks
in our set. Thus, our workload represents a mix of benchmarks that have been
shown to result in severe to moderate thermal violations (those listed above) and
those which operate well below the threshold of 81.8°C. Hence, with this workload,
we can also analyze the extent to which high silicon die temperatures and thermal
stress, which [64] studied, correlate with global interconnect temperatures.

We collected energy and temperature results for a simulation sample of one billion
committed instructions after a fast-forward phase of five billion instructions that skips
over the program startup phase. We did not use techniques like SimPoint [77] to
choose representative samples because our thermal simulations needed a single, large
sampling window that could cover multiple phases of benchmark execution and
capture the effects of idling of processor units and buses, which provide dynamic
opportunities for wire temperatures to cool down.

3.5.2 Thermal Warmup and Initial Temperatures

As reported in earlier work, it is computationally impractical to simulate long enough
for the heat sink temperature to reach steady state, since its thermal RC time con-
stant is signiﬁcantly larger than that of any on—chip structure [64,91]. Hence, we
followed the methodology suggested in [64] to obtain accurate results from our ther-
mal simulations. First, we used the Wattch power/ performance simulator to obtain
average power consumption values for various on-chip structures [58]. Then, we fed
these values to the HotSpot tool to obtain the steady state heat sink temperature,
and used this value to initialize the heat sink when running our simulations. Also, to
avoid “cold start” effects during the initial period of our wire temperature simulation,
we ran all simulations using our wire model twice. In the ﬁrst pass, we obtained an
approximate steady state temperature value for each wire by estimating the power

dissipated in each wire for one billion cycles using the model discussed in Section 3.4.3.
We then initialized the temperature of each wire of our target bus using its steady-state
temperature (Equation 3.15) and performed the temperature simulation as described
in the next subsection. Note that, using this approach, the initial temperatures of
the bus wires will not necessarily be equal, since they depend on the distribution
of energy across the wires.

3.5.3 Granularity of Thermal Simulation

After the fast-forward phase which skips through the unrepresentative initial section
of the benchmark program, wire temperatures were set to the steady state temper-
atures estimated as described in the previous subsection. Then, for the next one
billion instructions—our simulation window—we recorded energy and temperature
results every 100K cycles. For thermal simulations, the energy dissipated per wire
was divided by the time taken for each window ($10^5 / f_{clk}$) to obtain the average
power, and a fourth-order Runge-Kutta (RK4) method was used to solve the differential
equations for the thermal-RC network (Eqs. 3.9 and 3.10) to obtain the individual wire
temperatures at the end of the interval. The RK4 simulation loop, which was implemented
using the method described in [92], iterates a number of times that depends on the
interval size (100K cycles) and the thermal RC time constant of the wire. This ensures
that each RK4 iteration advances the solution by a time interval dt that is
substantially less than the thermal RC time constant. In this way, each step of the
temperature simulation yields sufficiently accurate temperature estimates without
the cost of cycle-by-cycle simulation, which would require huge computation time and
memory resources.

Using experimentation, we found that setting the value of dt to three (130 nm)
and two (45 nm) gave the best tradeoff between simulation time and the nature of
temperature characteristics we obtained. For example, with the clock frequency in
the 130 nm process (1.68 GHz), the time taken by the processor to execute 100K cycles is
$t_{window} = 59.52$ µs and the thermal RC time constant of the wire, calculated using
wire geometry parameters in Table 2.2, is $t_{RC} = 3.6171$ µs. For these values, the
RK4 simulation should iterate $dt \times t_{window} / t_{RC} = 3 \times 59.52 / 3.6171 \approx 50$
times to ensure the best granularity of temperature simulation.
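
The iteration count and the RK4 update can be sketched as follows. Because Eqs. 3.9 and 3.10 are not reproduced in this section, the example integrates a simplified single-node thermal RC equation, C dθ/dt = P - (θ - θ_amb)/R, as a stand-in; this is an assumption and not the full multi-wire network. Rounding the count up with `ceil`, the resistance, the power, and the ambient temperature are likewise illustrative choices.

```python
import math


def rk4_step(f, t, y, h):
    """Advance dy/dt = f(t, y) by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h * k1 / 2.0)
    k3 = f(t + h / 2.0, y + h * k2 / 2.0)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def rk4_iterations(multiplier, t_window, t_rc):
    """Number of RK4 steps per window: multiplier * t_window / t_rc, rounded up."""
    return math.ceil(multiplier * t_window / t_rc)


F_CLK = 1.68e9                 # Hz, 130 nm clock frequency from the text
T_WINDOW = 1.0e5 / F_CLK       # s, duration of one 100K-cycle window
T_RC = 3.6171e-6               # s, thermal RC time constant from the text
N_STEPS = rk4_iterations(3, T_WINDOW, T_RC)

# Hypothetical lumped parameters for a single wire node.
R_TH = 1.0e5                   # K/W (assumed)
C_TH = T_RC / R_TH             # J/K, chosen so that R_TH * C_TH equals T_RC
THETA_AMB = 318.15             # K (assumed, 45 deg C)
POWER = 1.0e-4                 # W, average power over the window (assumed)


def dtheta_dt(_t, theta):
    """Single-node thermal RC equation (simplified stand-in for Eqs. 3.9/3.10)."""
    return (POWER - (theta - THETA_AMB) / R_TH) / C_TH


theta, t = THETA_AMB, 0.0
h = T_WINDOW / N_STEPS         # each step is about one third of T_RC
for _ in range(N_STEPS):
    theta = rk4_step(dtheta_dt, t, theta, h)
    t += h

print(N_STEPS, theta - THETA_AMB)
```

Since the window spans many thermal time constants, the node temperature at the end of the window should sit very close to the steady-state rise P·R_th for this simplified model.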

3.6 Experiments and Results

In this section, we present results from simulations using our bus-line energy dissipa-

tion and thermal models and discuss their implications.

3.6.1 Energy Dissipation in Processor Buses

In this subsection we show that, in addition to adjacent wire coupling capacitances,
energy dissipated in switching transitions between non-adjacent wires also affects bus
energy dissipation significantly for current and future technologies. It is well known
that, in global signal lines, the wire aspect ratio (the ratio of wire thickness to
wire width) is increasing faster than the wire-spacing ratio (the ratio of inter-wire
spacing to inter-layer spacing). This causes the sidewall (inter-wire) coupling
capacitance to dominate the area capacitance. In sub-100 nanometer bus lines, the
reduced inter-wire distance further causes increased fringing effects with adjacent as
well as non-adjacent neighbors of a wire. With capacitance values we extracted using
FastCap (values for the 130 nm technology node are given in Table 2.2) and using our model
from Section 3.3 to estimate the coupling energy dissipation in each line, we found
that the energy dissipation is underestimated by up to 7.8% in data buses and 7.6%
in instruction buses when non-adjacent coupling capacitances are neglected, for
bus traffic in the nine benchmarks we analyzed. Results for this experiment are shown
in Figures 3.5 and 3.6. Also, we found that, although the non-adjacent coupling
capacitance values are decreasing with technology scaling, this energy estimation
error remains more or less constant in future technologies. Thus, we conclude that
accurate bus energy dissipation models must also consider the influence of non-adjacent
coupling capacitances. Previous work did not consider the effect of non-adjacent
coupling capacitances and its influence on energy; ours is the first to do so.

Non-adjacent coupling capacitances are especially important to consider when
evaluating the benefits of microarchitectural techniques for low-power buses. In
current literature, only schemes that aim to reduce energy dissipation due to self and
adjacent inter-wire coupling transitions exist. Such schemes can potentially increase
the relative contribution of energy dissipated in transitions involving non-adjacent
coupling capacitances.

Effectiveness of Low-Power Bus Encoding Schemes

[Figure 3.5 plot: "Energy Dissipated in Data Bus". Bars: Total Energy (Cc1+Cc2+Cc3) and
Total Energy (Cc1 only), left Y-axis; line: % Mismatch, right Y-axis.]

Figure 3.5. Total energy dissipated in a 64-bit data bus for various benchmarks. 'Cc1
only' represents the existing energy models which consider only self and adjacent
coupling capacitances. 'Cc1+Cc2+Cc3' represents our model that considers self
capacitances, adjacent coupling capacitances (Cc1), and two non-adjacent capacitances
(Cc2 and Cc3) on each side. The % energy mismatch shown by the line is plotted
with respect to the right-hand side Y-axis.

We evaluated the effectiveness of some popular bus encoding schemes like bus-invert
(BI) [51], odd/even bus-invert (OEBI) [53], and coupling-driven bus-invert (CBI) [54]
on wide data and instruction buses. To our knowledge, this is the first study to
report energy dissipation results for microprocessor buses using SPEC benchmarks that
represent real-world programs; most previous studies, including the ones cited above,
reported energies for random traffic patterns. Additionally, we implemented a
variant of the BI scheme called segmented bus invert, where the bus is divided into
four groups and BI encoding is applied to each group separately. This arrangement
requires four extra invert lines that are placed in the four higher order bit positions. In
our experiments, BI was implemented with the one invert line at the MSB position
(we found this to result in less energy dissipation compared to the case when the invert
line is at the LSB position) and CBI was implemented with the invert line in the LSB
position as mentioned in [54]. OEBI was implemented with two invert lines (LSB as
the odd-invert line and MSB as the even-invert line) as described in [53].
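
For readers unfamiliar with these schemes, the sketch below illustrates the basic bus-invert decision and its segmented variant. It is a simplified illustration, not the implementation used in our experiments: it ignores transitions on the invert lines themselves and their physical placement, assumes the bus width is divisible by the number of segments, and uses hypothetical function names.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def bus_invert(prev_bus, data, width):
    """Basic bus-invert: return (word_to_drive, invert_bit).

    The word is sent inverted when more than half of the lines would
    otherwise toggle relative to the previous bus state.
    """
    mask = (1 << width) - 1
    if popcount((prev_bus ^ data) & mask) > width // 2:
        return (~data) & mask, 1
    return data & mask, 0


def segmented_bus_invert(prev_bus, data, width, segments=4):
    """Apply bus-invert independently to each of `segments` equal groups.

    Returns (encoded_word, invert_bits), with invert_bits[0] belonging to
    the least significant group.
    """
    seg_width = width // segments
    seg_mask = (1 << seg_width) - 1
    encoded, invert_bits = 0, []
    for s in range(segments):
        shift = s * seg_width
        word, inv = bus_invert((prev_bus >> shift) & seg_mask,
                               (data >> shift) & seg_mask,
                               seg_width)
        encoded |= word << shift
        invert_bits.append(inv)
    return encoded, invert_bits


# All eight lines would toggle, so the word is sent inverted.
print(bus_invert(0x00, 0xFF, 8))
# Only the second 4-bit group exceeds its own threshold and is inverted.
print(segmented_bus_invert(0x0000, 0x00F1, 16))
```

The second call shows why segmentation helps when few bits change: the 16-bit word toggles only 5 lines, which is below the whole-bus threshold of 8, yet one 4-bit group still qualifies for inversion.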

[Figure 3.6 plot: "Energy Dissipated in Instruction Bus". Bars: Total Energy (Cc1+Cc2+Cc3)
and Total Energy (Cc1 only), left Y-axis; line: % Mismatch, right Y-axis.]

Figure 3.6. Total energy dissipated in a 128-bit instruction bus for various benchmarks.
The % energy mismatch shown by the line is plotted with respect to the
right-hand side Y-axis.

The total bus energy dissipated for unencoded and encoded data is shown in
Figure 3.7. The energy values reported in this plot have been averaged across the
nine benchmarks, with each benchmark being simulated for 500 million committed
instructions. From the results shown for existing bus models (Cc1 only), we find
that all four encoding schemes reduce self energy, with segmented BI being the best.
Coupling charge/discharge energy dissipation increases marginally with BI and CBI
encoding but reduces somewhat when OEBI encoding is used. Here too, segmented BI
shows the best reductions. The amount of energy dissipated due to toggle transitions
decreases when any of the four encoding schemes are used, with segmented BI again
giving the best results followed by OEBI, BI, and CBI in that order. A significant
observation from these results is that existing coupling-aware encoding schemes (like
CBI and OEBI) have limited impact for wide data buses. Furthermore, we observed
that the average number of bit transitions between consecutive cycles was very low
(much less than half the bus width) for the SPEC benchmarks we analyzed. This is
most likely the result of the higher-order 32 bits of data not being utilized. Hence the
number of inversions was small, even for CBI and OEBI, and most of the time
data was being transmitted in original (unencoded) form. Segmented BI performed
the best in these situations because, as the effective bus width for each segment was
smaller, the number of cycles during which data inversions took place was greater.
Thus, overall, while segmented BI encoding resulted in the lowest energy dissipation,
OEBI and BI were almost similar in impact, while CBI was significantly worse. Note
that none of the coupling-aware schemes we examined yielded improvements on the
order of what had been reported earlier for these schemes (36% for OEBI and 30% for
CBI with respect to unencoded random data) [53, 54].

[Figure 3.7 plot: "Energy Estimated with Different Models". Bars for Unencoded, Bus Invert,
Coupling-Driven BI, Odd/Even BI, and Segmented BI, grouped by energy component under the
'Cc1 only' and 'Cc1+Cc2+Cc3' models.]

Figure 3.7. Total energy dissipated in a 64-bit data bus with various encoding
schemes. 'Self' denotes self energy, 'C/D' denotes the coupling charge/discharge
energy and 'Toggle' denotes the coupling toggle energy dissipation. 'Cc1 only' refers
to existing energy models that consider self and adjacent coupling capacitance only
and 'Cc1+Cc2+Cc3' refers to our energy model that considers self, adjacent coupling,
and two non-adjacent coupling capacitances.

When our energy model (considering Cc1, Cc2, and Cc3) was used, all coupling
(charge, discharge, and toggle) energies increased and the trend in charge/discharge
energies remained unaffected. For toggle energies, however, we observed that OEBI
performed significantly worse than others. This is clearly the effect of toggles on
coupling capacitances between non-adjacent wire pairs. The net effect of this is that,
with our new bus energy model, OEBI and CBI both perform significantly worse
than BI and segmented BI. Based on our results, we can conclude that bus-inversion
based encoding schemes do not work well for wide buses and for realistic data streams
(from SPEC benchmark programs) where the number of bits that transition between
consecutive cycles is low.

Impact on Wire Temperature Distribution

The influence of energy dissipation due to non-adjacent coupling capacitances on
wire temperature can be illustrated with a simple example of a 5-wire bus like the
one shown in Figure 2.3. Consider transitions on the five bus lines, from the most
significant bit (MSB) line to the least significant bit (LSB) line, as follows: ↑↑↓↑↑.
The notation ↑ indicates that, in the current cycle, the line charges to $V_{DD}$ from
its previous ground state and ↓ indicates that the line discharges in the current cycle
from $V_{DD}$ held in the previous cycle. This set of transitions represents the relative
thermal worst case since most of the energy dissipation is concentrated in the center
line. Numbering the bus lines from 0 (MSB) to 4 (LSB) and noting that all inter-wire
transitions, if any, are toggles, the coupling energy dissipated in each line estimated
using our energy model, described earlier in Section 3.3, can be written as follows:

\begin{align*}
E^c_0 &= C_{0,2} V_{DD}^2 = C_{c2} V_{DD}^2 \\
E^c_1 &= C_{1,2} V_{DD}^2 = C_{c1} V_{DD}^2 \\
E^c_2 &= (C_{0,2} + C_{1,2} + C_{2,3} + C_{2,4}) V_{DD}^2 = 2(C_{c1} + C_{c2}) V_{DD}^2 \\
E^c_3 &= C_{2,3} V_{DD}^2 = C_{c1} V_{DD}^2 \\
E^c_4 &= C_{2,4} V_{DD}^2 = C_{c2} V_{DD}^2
\end{align*}

where $C_{i,j}$ represents the coupling capacitance between wires i and j. Note that
the self energy dissipated in all five wires is the same ($\frac{1}{2}(C_w + C_{rep}) V_{DD}^2$)
and hence it contributes equally to temperature rise in all five wires. The energy dissipated
in the middle wire, $E^c_2$, is the highest even if $C_{c2}$ is neglected and hence this wire
is likely to have the maximum temperature. Furthermore, if non-adjacent coupling
capacitances are non-negligible, the middle wire dissipates much higher energy and
its temperature is likely to be even higher.
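
The five-wire example can be reproduced with a few lines of code. The sketch below covers only the toggle case used in this example: it assumes each wire is charged C·V_DD² for every neighbor within range that switches in the opposite direction, and nothing for neighbors switching in the same direction. It is not the full energy model of Section 3.3, and the function name and capacitance values are hypothetical.

```python
def toggle_coupling_energy(transitions, coupling_caps, vdd):
    """Per-wire coupling energy for simultaneous toggles on a bus.

    transitions   : list of +1 (line charges) or -1 (line discharges)
    coupling_caps : [Cc1, Cc2, ...], coupling capacitance by wire distance
    vdd           : supply voltage
    """
    n = len(transitions)
    energy = [0.0] * n
    for i in range(n):
        for j in range(n):
            distance = abs(i - j)
            in_range = 0 < distance <= len(coupling_caps)
            opposite = transitions[i] * transitions[j] == -1
            if in_range and opposite:
                energy[i] += coupling_caps[distance - 1] * vdd ** 2
    return energy


# Middle wire discharges while the other four charge (wire 0 is the MSB).
pattern = [+1, +1, -1, +1, +1]
# Normalized, hypothetical capacitances: Cc1 = 2.0, Cc2 = 1.0, V_DD = 1.0.
with_cc2 = toggle_coupling_energy(pattern, [2.0, 1.0], vdd=1.0)
cc1_only = toggle_coupling_energy(pattern, [2.0], vdd=1.0)
print(with_cc2)   # middle wire carries 2 * (Cc1 + Cc2)
print(cc1_only)   # middle wire carries 2 * Cc1 when Cc2 is neglected
```

Comparing the two results shows the point made above: the middle wire is already the largest dissipator with Cc1 alone, and including Cc2 widens its lead over the neighbors.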

3.6.2 Correlation between Energy and Temperature

In this subsection, we examine the correlation between energy and temperature
characteristics obtained using our model. We report and analyze time-varying energy
and temperature profiles for only one benchmark, gcc, for a simulation interval of 10
billion cycles in the 130 nm technology node. We found that other benchmarks
exhibited similar behavior; hence these are not reported. The energy and temperature
profiles are shown in Figure 3.8. In this figure, energy and temperature, plotted on
the y-axes, have been averaged across the number of bus lines. The temperature plot
clearly shows that the average wire temperature continues to rise with time although
the rate of change is not linear; the trend line shown on the plot is only a very coarse
approximation. But the results are significant because they show that the average
wire temperature increases by about 10 degrees over six seconds of execution of a
typical program like gcc on a 130 nm microprocessor. We also observe that short,
intermittent cycles of high switching activity can trigger changes in temperature, as
evidenced by the regions marked 1 and 2 on the plot. Also, we notice that such bursts
of energy dissipation, likely caused by increased bus utilization, cause the temperature
rise to 'linger' for a short period of time, as shown by the step-like changes at
the beginning of regions 1 and 2.

3.6.3 Final and Peak Wire Temperatures

In this subsection, we present results obtained from simulations using our thermal
model. During our simulations, we recorded three types of temperature information:
(1) the temperature change in each wire between the start and end of simulation,
(2) the highest temperature reached by each wire during the simulation, and (3) the
temperature gradient of each wire between its sending and receiving ends. These
results are presented next.

[Figure 3.8 plots, "Temperature and Energy Dissipation in Data Bus for GCC": (top) average
wire temperature across 64 wires (K) versus simulation time (cycles, x 10^9), with a linear
trend line y = mx + c, m = 7.0448e-10; (bottom) average energy dissipation versus simulation
time (cycles, x 10^9).]

Figure 3.8. This plot shows average energy dissipation and wire temperature of the
bus for a simulation interval of 10 billion cycles. The continuing temperature rise can
be clearly observed.

We observed that wire temperatures increased significantly over the time interval
of simulation for most wires. Figures 3.9 through 3.11 show the wire temperature
rise that we recorded for three integer and three floating-point programs,
each for one billion committed instructions of execution, for all bits of the 64-bit data
bus. The corresponding results for the 128-bit instruction bus are in Figures 3.12
through 3.14. We show detailed results for only these six benchmarks since they
exhibit interesting behavior. The highest temperature rise recorded for any wire
during our simulation, for the 12 benchmarks we analyzed, is given in Table 3.2. We
show results for both 130 nm and 45 nm technologies in the figures and in the table.
The time taken to commit a billion instructions in the pipeline, which is typically
on the order of a few seconds, is much longer than the thermal RC time constant
of the wire, which is only a few microseconds. Thus our simulation interval is large
enough to allow temperatures to settle to their characteristic values. Furthermore, we
initialized wire and heat sink temperatures to their steady-state values, as described
earlier in Section 3.5.2, to prevent cold-start effects.

From the knowledge of characteristics of instruction and data traffic, all lines
in an instruction bus, which is 128 bits wide (fetch width: 4 instructions), can be
considered equally active, while in a load/store data bus, which is 64 bits wide, the
lower-order 32 bits are expected to be most active due to data value locality. The
results shown in Figures 3.9 through 3.14 reflect these observations to some extent. For
integer data, we observe that the hottest wires are the ones that carry lower-order bits.
One notable exception is gzip, in which all wires show significant temperature rise across
the simulation. This is expected because, when executing gzip, the data bus will
carry primarily 8-bit characters packed in the 64-bit bus. Another observation is
that, for mcf, the middle wire is the hottest at the end of the simulation interval. For
floating-point benchmarks, temperature rise is somewhat evenly distributed across
the 64 bits because the higher-order wires, which carry the exponent bits, are also
quite active. Also, lucas shows higher temperatures in some lower-order bits. Finally,
we notice that the highly active wires are likely to end up at higher temperatures
when executing integer workloads than when executing floating-point workloads.

During the course of simulation, we observed that wire temperatures rose and fell
as bus activity, the number of transitions, and the energy dissipation varied. A
three-dimensional plot showing the variation across time and across the lower-order 32 bits
of the data bus, plotted for three billion cycles of execution of the gcc benchmark,
is shown in Figure 3.15. This plot shows that there are intervals during which wire
temperatures rise to higher values due to a sudden rise in energy dissipation and then
settle at lower values. Such intervals neither occur synchronously across wires nor are
uniformly distributed among them.

[Figure 3.9 plots: "Temperature Rise in Data Bus Wires for 1B Cycles of Execution" of
(a) gcc and (b) gzip. X-axis: Wire Number (0=LSB, 63=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.9. Plots show the wire temperature rise recorded for benchmarks gcc and
gzip for the data bus in 130 nm and 45 nm technology nodes over a simulation interval
of one billion committed instructions for each benchmark.

[Figure 3.10 plots: "Temperature Rise in Data Bus Wires for 1B Cycles of Execution" of
(a) mcf and (b) lucas. X-axis: Wire Number (0=LSB, 63=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.10. Plots show the wire temperature rise recorded for benchmarks mcf and
lucas for the data bus in 130 nm and 45 nm technology nodes over a simulation
interval of one billion committed instructions for each benchmark.

[Figure 3.11 plots: "Temperature Rise in Data Bus Wires for 1B Cycles of Execution" of
(a) ammp and (b) applu. X-axis: Wire Number (0=LSB, 63=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.11. Plots show the wire temperature rise recorded for benchmarks ammp
and applu for the data bus in 130 nm and 45 nm technology nodes over a simulation
interval of one billion committed instructions for each benchmark.

[Figure 3.12 plots: "Temperature Rise in Instruction Bus Wires for 1B Cycles of Execution" of
(a) gcc and (b) gzip. X-axis: Wire Number (0=LSB, 127=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.12. Plots show the wire temperature rise recorded for integer benchmarks
gcc and gzip for the instruction bus in 130 nm and 45 nm technology nodes over a
simulation interval of one billion committed instructions for each benchmark.

[Figure 3.13 plots: "Temperature Rise in Instruction Bus Wires for 1B Cycles of Execution" of
(a) mcf and (b) lucas. X-axis: Wire Number (0=LSB, 127=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.13. Plots show the wire temperature rise recorded for benchmarks
mcf and lucas for the instruction bus in 130 nm and 45 nm technology nodes over a
simulation interval of one billion committed instructions for each benchmark.

[Figure 3.14 plots: "Temperature Rise in Instruction Bus Wires for 1B Cycles of Execution" of
(a) ammp and (b) applu. X-axis: Wire Number (0=LSB, 127=MSB); Y-axis: Temperature (K);
series: 45 nm and 130 nm.]

Figure 3.14. Plots show the wire temperature rise recorded for floating-point benchmarks
ammp and applu for the instruction bus in 130 nm and 45 nm technology nodes over
a simulation interval of one billion committed instructions for each benchmark.

[Figure 3.15 plot: "Wire Temperatures in the Lower-Order 32-bits of Data Bus for GCC".
Axes: Simulation Time (x 100K cycles), Wires (0=LSB), Wire Temperature (K).]

Figure 3.15. A three-dimensional plot showing spatial and temporal variations in
wire temperature for the lower-order 32 bits of the load/store data bus for the gcc
benchmark.

Table 3.2 lists the absolute maximum temperatures attained by any wire during
the course of simulation. Wire temperatures reached up to 104°C for data bus wires
and 89.6°C for instruction bus wires in the 130 nm technology node. For the 45 nm
node, data bus wire temperature went as high as 128.7°C and instruction bus wire
temperature as high as 104.9°C. Note that several of these values exceed 100°C,
which is the maximum temperature assumed during interconnect design. We also
observed that maximum temperature trends for data buses are very similar to those
observed earlier for temperature rise. That is, the largest temperature change over
the simulation interval occurs for the bus wires whose transient temperature also
reaches the maximum value, showing that different data bus wires experience varying
amounts of thermal stress depending on their location. For instruction buses, the
maximum temperatures observed across bus wires were roughly similar. Hence, all
instruction bus wires experience similar amounts of transient thermal stress during
the simulation interval.

3.6.4 Wire Temperature Gradients

Next, in Figures 3.16(a) and (b), we show the frequency distribution of the wire
temperature gradients that we recorded during our simulations for 130 nm and 45 nm
load/store data bus wires. These plots show that, on average across the benchmarks
we analyzed, the temperature gradient in this bus can be expected to be between
6 and 15°C for 130 nm technology. For 45 nm technology, temperature gradients
between 16 and 34°C were most commonly observed for the same set of benchmarks and
simulation sample. The maximum temperature gradient observed was 31°C for 130 nm
and 42°C for 45 nm simulations. These wire temperature gradients are the result of
two factors: (1) the non-uniform dissipation of Joule heat along the wire length,
which is modeled using Equation 3.8, and (2) the difference in temperature of the
underlying substrate blocks, which was obtained using HotSpot and applied during
our thermal simulation. Temperature gradients across the length of the wire also
affect delay. It has been reported that for a 1 mm long wire with the driver in the
hot region and receivers in a cooler region, a temperature difference of 10°C results
in a 5 ps (≈ 8%) additional delay at the receiver [93].

[Table 3.2. Maximum wire temperatures attained during simulation, listed per benchmark
(ammp, applu, crafty, mcf, gcc, gzip, lucas, mesa, mgrid, swim, twolf, vortex) for the
130 nm and 45 nm technology nodes. The table body is illegible in this copy.]

[Figure 3.16, panels (a) and (b): "Distribution of Maximum Wire Temperature Gradients
in 130 nm Wires" (a) and "Distribution of Maximum Temperature Gradients in 45 nm
Wires" (b). x-axis: benchmarks ammp, applu, crafty, mcf, gcc, gzip, lucas, mesa, mgrid,
swim, twolf, vortex, and Avg.; y-axis: Percentage Number of Cycles; bars are binned
by temperature gradient range in °C.]

Figure 3.16. Frequency distribution of maximum wire temperature gradients for
130 nm and 45 nm processor wires.


3.7 Summary

In this chapter, we presented a unified nanometer-scale bus energy dissipation and
thermal model that can help designers monitor energy dissipation and temperature
change in individual wires during trace- or execution-driven simulation. In addition
to self capacitance, our model incorporates the effects of adjacent and non-adjacent
capacitive coupling on bus energy dissipation, the effect of repeater insertion, the
effect of lateral heat transfer between adjacent wires, and the effect of inter-layer
heat transfer. Unlike existing models, which provide estimates only for total bus
energy, our model can estimate the energy dissipated in each bus line; this feature
also makes it possible to estimate individual wire temperatures. Using this integrated
model in a first-of-its-kind study, we studied energy and thermal characteristics of
instruction and data buses using an execution-driven simulation of a billion or more
instructions of nine SPEC CPU2000 benchmarks. We found that existing bus energy
models provide estimates that are about 7–8% less accurate than those of our energy
model, because they do not account for the effects of coupling between non-adjacent
wire pairs of a bus. Our model, which incorporates these effects, is the first of its
kind to do so. Our results also show that, in wide instruction and data buses used in
modern processors executing SPEC CPU2000 workloads, existing bus encoding schemes
show no significant energy benefit due to the nature of data traffic. When
non-adjacent coupling effects between wire pairs are considered, the energy savings
shrink considerably. Based on simulations using our thermal model, we found that
average wire temperatures in data and instruction buses may rise by 10–37°C during a
simulation run of only a billion cycles in a 130 nm superscalar processor executing
SPEC CPU2000 benchmark programs. This temperature rise is primarily due to heat
generated by currents flowing in the wire during bit switching. Changes in substrate
temperature may cause other effects in the temperature profile, which we did not
explore in this work.

In a future 45 nm technology node, wire temperature rise for the same set of bench-
marks and simulation sample was found to be between 20-58°C. We observed that
instruction and data bus wires attained absolute temperature in the range 80.3-104°C
and 97.6—123.7°C, in 130 nm and 45 nm processors, respectively, during the course
of our simulation, showing that signal lines attain signiﬁcant temperatures too. Sig-
niﬁcant wire temperature gradients of magnitude between 16—25°C were found to be
most common between the sending and receiving ends of the wires during the course
of simulation. Notable correlation was found to exist between energy dissipation be-
havior and wire temperature rise in buses across time; short, intermittent cycles of
high energy-dissipating switching activity trigger step changes in temperature.

The impact of these results, especially the highly fluctuating (in both time and
space) energy and temperature profiles of instruction and data buses that we
observed, is the following. Since the energy dissipation of the wire roughly represents
the square of the time-varying current, fluctuations in the energy mean that a highly
varying load is being placed on the power supply network by the driving circuits
through which the currents flowing in the wires are drawn. This varying load can
cause inductive voltage drops, or $L\,di/dt$ noise. This motivates the need to smooth
temporal variations in energy dissipation of wires with appropriate techniques. Also,
the substantial disparity in wire temperatures across the bus motivates schemes that,
based on information from interconnect thermal sensors, can migrate bus transmissions
dynamically to cooler wires.


CHAPTER 4

DATA- AND
TEMPERATURE-DEPENDENT DELAY
VARIABILITY MODEL

4.1 Introduction

Rising wire temperatures are becoming an important issue in high-performance bus
design, especially in current and future nanometer technology nodes, as the previous
chapter showed. Higher temperatures adversely impact wire delays, due to the
temperature dependence of metal resistivity, causing timing violations when the
end-to-end propagation delay exceeds the designed value. The factors that influence
the dynamic propagation delay of a signal transmitted on the wire can be classified
into two types: intrinsic factors that are related to the switching activity of the wire
and/or its neighbors, and extrinsic factors like process and voltage variations. As
shown in the earlier chapters, the temperature distribution along the wire is a
function of the switching activity in the wire and hence it is also an intrinsic factor.

In the context of global interconnect lines, temperature variations occur due to
two reasons that are both important to study. First, energy is dissipated in a
non-uniform manner across the length of the wire. In Chapter 3.4, we showed that the
temperature at the sending end of a wire will be higher than that at the receiving
end. In this chapter, we develop a model to estimate the impact of this temperature
gradient on the propagation delay of the signal. Substrate temperature gradients,
when present, will exacerbate thermal gradient-dependent delay. Second, temperature
variations are also non-uniform across time, since the characteristics of programs
dictate the amount of switching activity in signal wires and, consequently, the energy
dissipated in them and their temperature. When switching activity rises, the wire
temperature gradient also increases.

Due to the lack of detailed models, existing early-stage design exploration methods
lump together the effects of process, voltage, and temperature (PVT) variations. This
results in overly pessimistic and/or incorrect delay estimates. Even in later stages
of the design process, a constant temperature value across the chip is assumed when
analyzing the electrical characteristics of devices and interconnects. In reality,
given that on-chip power dissipation in devices as well as interconnect is
workload-dependent, the temperature distribution within the chip is far from uniform.
The constant-temperature assumption can therefore yield a design that runs into
problems during validation and necessitates costly re-spins. Using the detailed
temperature models developed previously, this chapter examines the impact of data-
and temperature-dependent delay variations for various on-chip high-performance
processor buses.

The organization of the rest of the chapter is as follows. In Section 4.2, we discuss
related work. Following that, in Section 4.3, we describe our models. Then, we
present results and discuss them in Section 4.4. Finally, we summarize in Section 4.5.

4.2 Related Work and Our Contributions

This section reviews related work. The impact of increasing interconnect temperatures
has been well studied in [21,38,69]. However, these studies do not use real data from
benchmark programs and hence their estimates are somewhat pessimistic. Also, these
models are not amenable to use in microarchitecture-level exploration tools. Recent
interest in temperature- and reliability-aware microarchitectures has led to the
development of tools [64,66,94] and techniques [58,71] for processor thermal and
reliability management. However, these tools do not address an important
temperature-related reliability issue in on-chip interconnects: transient faults or
timing violations due to temperature-dependent resistivity changes. In contrast to
these, we seek to develop activity-dependent models that estimate the distribution of
Joule heat across the length of a wire, the wire temperature gradient across it, and
finally, the actual delay due to crosstalk and temperature-induced resistivity changes.
Using these models, we analyze the number of delay violations occurring for different
benchmark programs from the SPEC CPU2000 suite in 130 nm and 45 nm processor designs.

In current design methodologies, temperature-related wire reliability problems are
identified late in the design cycle and hence their rectification involves substantial
cost and effort. But this overhead can be avoided by properly accounting for
temperature-related effects in early-stage design. To our knowledge, no early-stage
microarchitecture exploration tool currently offers the capability of estimating
temperature-induced timing violations in high-performance buses; our work is likely
the first of its kind to develop such a model. Our model can also be used in
temperature-aware delay and skew analysis in clock trees, although we do not examine
this aspect in this work.

Specific contributions and key results from this chapter are outlined next.

• Using a cycle-accurate microarchitectural simulator, we show that timing violations
  due to temperature gradients are somewhat likely in 130 nm designs (an average of
  2.27 per hundred bus references for an ALU result bus across ten SPEC CPU2K
  programs) and that this rate increases in the future 45 nm technology node to 6.20
  per hundred for the same processor design.

• We found that, by an optimistic analysis, the performance impact of overcoming
  temperature-induced timing violations by re-transmitting data will be about 4%
  in a superscalar design at 130 nm and 11.92% at 45 nm.

• We also found that conventional techniques like bus encoding that seek to reduce
  energy dissipation and potentially wire temperatures have limited impact on
  alleviating temperature-induced timing violations. Reducing the bus clock frequency
  yielded a better impact, reducing the average error rate to 1.07 per hundred
  references in the 130 nm processor, compared to encoding, which reduced it to only
  1.93 per hundred references.

4.3 Temperature Dependent Delay Variability
Model

In this section, we present analytical models for estimating the spatial distribution
of Joule heat, temperature, and temperature-dependent delay in RC interconnects.
Versions of the well-known energy model for a lumped-RC wire, discussed in
Chapter 2.1.2, have traditionally been used in interconnect analysis to estimate the
energy dissipated due to self and coupling transitions. But this model assumes that
Joule heat is dissipated uniformly across the length of a wire and hence leads to a
conservative temperature estimate for the wire. Furthermore, it does not capture the
spatial distribution of Joule heat, without which the impact of temperature on delay
cannot be estimated accurately. In Chapter 3.3.3, we derived a new expression for
energy distributed along the length of a wire and validated it using circuit
simulation. We also constructed a thermal model and found wire temperature gradients
using this model. The effect of this temperature gradient on wire delay is found as
discussed next.

4.3.1 Wire Delay Considering Temperature Impact

The propagation delay of a lumped-RC wire considering only data-dependent crosstalk
was presented in Chapter 2.1.5. For a distributed RC line of total length $L$
partitioned into $n$ subsegments each of length $l$, the Elmore delay $D$ of a signal
passing through the line is the following:

$$D = R_d \left( C_r + \int_0^L c_0(x)\,dx \right) + \int_0^L r_0(x) \left( \int_x^L c_0(\tau)\,d\tau + C_r \right) dx, \tag{4.1}$$

where $c_0(x)$ and $r_0(x)$ are the per-unit-length wire capacitance and resistance,
respectively, $R_d$ is the driver resistance, and $C_r$ is the receiver capacitance.
Since the resistance of a wire segment changes with temperature, we can write
$r_0(x) = \rho_0 (1 + \beta\,T(x))$, where $T(x)$ represents the temperature profile
along the length of the wire, $\rho_0$ is the unit-length resistance at a reference
temperature (273 K), and $\beta$ is the temperature coefficient of resistance for
copper ($\beta = 3.9 \times 10^{-3}$ per °C). Substituting in Equation 4.1 and taking
the per-unit-length capacitance to be a constant $c_0$, we get:

$$D = D_0 + (c_0 L + C_r)\,\rho_0 \beta \int_0^L T(x)\,dx - c_0 \rho_0 \beta \int_0^L x\,T(x)\,dx, \tag{4.2}$$

$$\text{where } D_0 = R_d (C_r + c_0 L) + \left( c_0 \rho_0 \frac{L^2}{2} + \rho_0 L C_r \right). \tag{4.3}$$

$D_0$ is the Elmore delay corresponding to a unit-length resistance at the reference
temperature. In Equation 4.2, $\int_0^L T(x)\,dx$ represents the area under the
temperature curve, denoted as $A$, in a plot of temperature vs. wire length. Let
$T(x)$ be a straight line with $T(x=0) = T_A$ and $T(x=L) = T_B$, $T_A \geq T_B$.
The area under $T(x)$ gives the value of $A$. Now the x-coordinate of the centroid of
this region is given by $x_c = \frac{1}{A} \int_0^L x\,T(x)\,dx$ [95]. Thus
$\int_0^L x\,T(x)\,dx = x_c \times A$. Note that both $x_c$ and $A$ can be found
easily using geometry if $T(x)$ is assumed linear.

Thus, by estimating $T_A = \theta_{i,0}$ and $T_B = \theta_{i,n}$ using the model in
Chapter 3.4, and the area under the temperature curve for any given sampling window,
we can estimate the actual delay, which includes the effect of temperature-dependent
resistance. Using this, we can determine if a timing violation has occurred, as
described next.
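
The closed form above can be checked numerically. The following is a minimal Python
sketch, assuming a linear temperature profile and a constant per-unit-length
capacitance; the function names and all parameter values are hypothetical and chosen
only for illustration, not taken from our simulator. Temperatures are in °C relative
to the reference temperature.

```python
def elmore_delay_linear(R_d, C_r, c0, rho0, L, T_A, T_B, beta=3.9e-3):
    """Return (D0, D) for a wire whose temperature falls linearly from
    T_A at x = 0 to T_B at x = L (temperatures in deg C above reference)."""
    # Equation 4.3: Elmore delay at the reference temperature.
    D0 = R_d * (C_r + c0 * L) + (c0 * rho0 * L**2 / 2 + rho0 * L * C_r)
    # Area under the linear temperature profile (a trapezoid).
    A = 0.5 * (T_A + T_B) * L
    # First moment, integral of x*T(x) dx, which equals x_c * A.
    moment = L**2 * (T_A + 2 * T_B) / 6
    # Equation 4.2: temperature-dependent correction added to D0.
    D = D0 + (c0 * L + C_r) * rho0 * beta * A - c0 * rho0 * beta * moment
    return D0, D


def elmore_delay_numeric(R_d, C_r, c0, rho0, L, T_A, T_B, beta=3.9e-3, n=10000):
    """Midpoint-rule evaluation of Equation 4.1, used only as a cross-check."""
    dx = L / n
    total = R_d * (C_r + c0 * L)
    for k in range(n):
        x = (k + 0.5) * dx
        T = T_A + (T_B - T_A) * x / L          # linear temperature profile
        r = rho0 * (1 + beta * T)              # temperature-dependent resistance
        total += r * (c0 * (L - x) + C_r) * dx  # downstream capacitance seen at x
    return total


if __name__ == "__main__":
    # Hypothetical 1 mm wire: values are illustrative assumptions only.
    wire = dict(R_d=100.0, C_r=5e-15, c0=2e-10, rho0=1e5, L=1e-3)
    D0, D = elmore_delay_linear(T_A=110.0, T_B=90.0, **wire)
    print(f"D0 = {D0:.3e} s, D = {D:.3e} s, increase = {100 * (D / D0 - 1):.1f}%")
```

Computing the first moment directly, rather than $x_c$ and $A$ separately, avoids a
division by zero when the profile sits at the reference temperature.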

4.3.2 Wire Delay Variability Considering Crosstalk and
Temperature

During early-stage design exploration, the designer's aim is to ensure that the
microarchitecture meets all its performance expectations at the target clock
frequency. The target frequency itself is decided based on knowledge of typical
operating conditions that determine parameters like temperature, and knowledge of
process variations that are used to account for deviations from expected values.
Based on estimates available from prior work, we assume that the delay can increase
by up to 20% due to back end of line (BEOL) process variations and an additional 10%
due to voltage drop and temperature variability [96,97]. Thus, we assume a guard band
of 30% for the delay of a global wire due to PVT. Hence $t_{bus\_clk} = 1.3 \times D$.

We described earlier in Chapter 2.1.5 the procedure to estimate the worst-case
data-dependent delay (Equation 2.4) and the safe clock frequency at which the bus can
be operated. From that discussion, we note that not all bus references will trigger
the worst case for delay in a bus line, resulting in varying amounts of delay slack
across lines and also across time. As such, the actual delay for a line estimated
using Equations 4.2 and 4.3 depends on the wire temperature gradient and the nature
of its crosstalk with its neighboring wires. If the neighbors both switch oppositely
with respect to the line, the delay will be $t_{wc}$ and, if the temperature gradient
is sufficiently high, the actual delay may exceed $t_{bus\_clk}$. This is a timing
violation. Note that, when this occurs, the temperature impact on delay overwhelms
the 30% guard band that we have allocated to account for worst-case PVT variations.

Given the current and previous data to be transmitted on the bus, we do the
following in our cycle-accurate simulator to determine if a temperature-induced
timing violation has occurred for the bus as a whole. First, for each wire in the bus
that changed state from the previous cycle, we compute the delay slack by examining
coupling transitions with respect to its neighbors and determining its nominal delay
$t_p$. Then, depending on the Joule heat dissipated and the thermal gradient across
its length, we determine its actual delay using Equations 4.2 and 4.3. Finally, we
consider a temperature-induced timing violation to have occurred for the bus as a
whole if the actual delay in any of the lines exceeds $t_{bus\_clk}$. We report the
number of such violations per hundred bus references in our results.
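
The per-reference check described above can be sketched as follows. This is a
minimal Python illustration under stated assumptions, not the code of our simulator:
the function names are hypothetical, the crosstalk class of a wire is taken to be the
sum over its two neighbors of the difference in transition direction (giving the
"1+0r" to "1+4r" cases), a missing neighbor at the bus edge is assumed to contribute
nothing, and the temperature-aware delay of each wire is supplied by the caller.

```python
def transition(prev, curr, i):
    """Return +1 if bit i rises, -1 if it falls, and 0 if it is unchanged."""
    return ((curr >> i) & 1) - ((prev >> i) & 1)


def crosstalk_class(prev, curr, i, width):
    """Return k in 0..4 for a '1+k*r' delay on wire i, or None if wire i is idle."""
    d = transition(prev, curr, i)
    if d == 0:
        return None  # only wires that changed state are examined
    k = 0
    for j in (i - 1, i + 1):
        if 0 <= j < width:  # assumption: edge wires have a single neighbor
            # 0: neighbor switches the same way, 1: neighbor quiet, 2: opposite
            k += abs(d - transition(prev, curr, j))
    return k


def bus_timing_violation(prev, curr, width, actual_delay, t_bus_clk):
    """Return True if any switching wire's actual delay exceeds t_bus_clk.

    actual_delay(i, k) is a caller-supplied (hypothetical) function returning
    the temperature-aware delay of wire i under crosstalk class k.
    """
    for i in range(width):
        k = crosstalk_class(prev, curr, i, width)
        if k is not None and actual_delay(i, k) > t_bus_clk:
            return True
    return False
```

A simulator loop would call `bus_timing_violation` once per bus reference, with
`t_bus_clk` set to 1.3 times the design delay, and count the references for which it
returns `True`.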

4.4 Results and Discussion

We study the delay variability of the 64-bit result bus that runs over the integer and
floating-point execution units of the processor. This bus was chosen since it is highly
capacitive and dissipates a substantial amount of energy in the processor core [45,58].
Also, it is routed over the execution unit consisting of ALUs and register files that
are highly active; hence, the substrate temperature under the result bus will also be
significantly higher than in other units. The result bus is also on the critical path
and will be impacted most by any temperature-dependent timing violations, which
may require retransmission of the data to maintain correct program execution.

4.4.1 Maximum Wire Temperatures and Gradients

The maximum wire temperatures that we recorded during the simulation of the result
bus are shown in Table 4.1 for the 130 nm and 45 nm technology nodes. It can be seen
that the wire temperature can be as high as 103°C in a 130 nm processor and about
117°C in a 45 nm processor. Note that the design temperature for global wires was
assumed to be 100°C, but significantly higher temperatures were observed during our
simulation. As mentioned earlier, higher wire temperatures increase wire delays by
about 5% for every 20°C rise in temperature [38].
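
As a rough worked example of this rule of thumb, assume (as a simplification of ours,
not a claim taken from [38]) that the delay increase scales linearly with the
temperature excess over the 100°C design temperature:

$$\frac{\Delta D}{D} \approx 5\% \times \frac{T_{max} - 100\,^{\circ}\mathrm{C}}{20\,^{\circ}\mathrm{C}}$$

For the 130 nm maximum of 103°C this gives about $5\% \times 3/20 = 0.75\%$, and for
the 45 nm maximum of about 117°C it gives about $5\% \times 17/20 = 4.25\%$. These
figures cover only the uniform temperature rise; the additional effect of the
temperature gradient along the wire is discussed next.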

Next, in Figure 4.1, we show the distribution of the maximum wire temperature
gradient that we determined using our model in Chapter 3.4. This plot shows that,
on average, the maximum temperature gradient in a wire can be expected to be
between 16 and 24°C. These temperature gradients across the length of the
wire also affect delay. It has been reported that for a 1 mm long wire with the driver
in the hot region and receivers in a cooler region, a temperature difference of 10°C
results in a 5 ps (≈ 8%) additional delay at the receiver [93].

Having shown that significant wire temperatures and gradients occur when executing
the benchmark programs in our workload, we examine next whether these result in
timing violations in the ALU result bus.

[Table 4.1. Maximum wire temperatures recorded for the ALU result bus in the 130 nm
and 45 nm technology nodes, listed per benchmark (gcc, gzip, bzip2, crafty, eon,
twolf, art, mesa, mgrid, swim). The table body is illegible in this copy.]

[Figure 4.1 plot: "Distribution of Maximum Wire Temperature Gradient in 130 nm Result
Bus Wires". x-axis: benchmarks gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid,
swim, and Avg.; y-axis: Percentage Number of Cycles; legend: <6°C, 6–15°C, 16–24°C,
>24°C.]

Figure 4.1. Distribution of maximum wire temperature gradients in result bus wires
for the 130 nm processor.

4.4.2 Frequency of Timing Violations

Figures 4.2 and 4.3 show the temperature-induced delay violations per hundred bus
references for a 130 nm and a 45 nm processor, respectively, in the ALU result bus
using our temperature-dependent delay model when running different benchmarks.
The base case (processor operating at nominal clock frequency, 1.68 GHz for 130 nm
and 11.51 GHz for 45 nm) is represented by the data series labeled “@ Nominal
Freq.” in the two plots. For this case, we observe that the average error rate across
our benchmark set was 2.27 per hundred bus references for the 130 nm design. For the
same processor in the 45 nm technology node, the error rate increased to 6.2 per
hundred references on average. Some benchmarks like gcc, gzip, bzip2, and art show
higher than average error rates because they had higher wire temperatures and/or
gradients than other benchmarks, as observed in the results shown in the previous
subsection. It should be noted that the timing violation error rates reported here
represent temperature-induced violations only; other factors like process variations
and voltage drops are not included, as mentioned earlier in Section 4.3.2. In fact,
our results show that, in many cases, the extra temperature-induced delay can easily
overwhelm the voltage drop and process variation safety margins allocated by a
designer.

[Figure 4.2 plot: "Temperature-Induced Delay Violations in 130 nm Wires". x-axis:
benchmarks gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, and Avg.;
y-axis: Delay Violations per Hundred References; series: @ Nominal Freq., @ Nominal
Freq. with OEBI Encoding, @ 0.9 x Nominal Freq.]

Figure 4.2. The number of temperature-induced violations per hundred bus references
occurring across ten benchmark programs in a 130 nm processor.

[Figure 4.3 plot: "Temperature-Induced Delay Violations in 45 nm Wires". x-axis:
benchmarks gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, and Avg.;
y-axis: Delay Violations per Hundred References; series: @ Nominal Freq., @ Nominal
Freq. with OEBI Encoding, @ 0.9 x Nominal Freq.]

Figure 4.3. The number of temperature-induced violations per hundred bus references
occurring across ten benchmark programs in a 45 nm processor.

Most superscalar processor designs adopt such overly conservative methods to
work around dynamic delay variability-related problems, for example by allocating an
extra pipeline stage to account for wire propagation delays [44].
Temperature-distribution-aware delay models, such as the one we have developed, can
help explore the extent of the timing violation problem during early-stage design.
Using this knowledge, a designer can implement schemes that address delay variability
issues and avoid over-design. For example, results presented in the next subsection
show that, by increasing the overall bus clock cycle time by only about 10%, the error
rates can be halved for a 130 nm design.

As mentioned earlier in Section 4.3.2, not all bus references, even those incurring
worst-case delays due to peak crosstalk, are likely to trigger timing violations.
Cycles in which peak crosstalk conditions occur in a wire, coupled with high Joule
heat dissipation and large temperature gradients, have a high probability of causing
a violation. Violations can occur during non-peak crosstalk conditions too, if wire
temperatures and/or gradients are large enough. The following results attempt to
characterize how temperature-induced delay variations are distributed across various
crosstalk conditions.

[Figure 4.4 plot: "Distribution of Crosstalk Conditions in ALU Result Bus". x-axis:
benchmarks gcc, gzip, bzip2, crafty, eon, twolf, art, mesa, mgrid, swim, and Avg.;
y-axis: Frequency of Occurrence; legend: 1+4r delay, 1+3r delay, 1+2r delay, 1+1r
delay, 1+0r delay.]

Figure 4.4. This plot shows the frequency of occurrence of ﬁve different crosstalk
conditions on the bus. See Section 4.3.2 and Table 2.1 for an explanation of these
crosstalk conditions. The crosstalk condition determines the actual propagation delay
without considering thermal effects.

Figure 4.4 shows the frequency with which different crosstalk conditions occur on
the ALU result bus for the programs we analyzed, at nominal temperature. It can be
seen that the peak crosstalk condition labeled “1+4r delay” occurs only about 10% of
the time on average across the benchmark set. The dominant condition is “1+2r
delay”, which occurs about 40% of the time. Next, in Figure 4.5, we show the
distribution of temperature-induced delay violations across the different crosstalk
conditions. As the figure shows, crosstalk conditions “1+1r delay” and “1+0r delay”
contribute a very low percentage (<1%) of the total number of delay violations. Other
cases have more significant contributions, suggesting that eliminating or reducing
these crosstalk conditions can potentially reduce delay variability.

[Figure 4.5 plot: "Percentage of Temperature-Induced Delay Violations Caused Under
Various Crosstalk Conditions for 130 nm". x-axis: benchmarks gcc, gzip, bzip2, crafty,
eon, twolf, art, mesa, mgrid, swim, and Avg.; y-axis: Percentage of Total Delay
Violations; legend: 1+4r delay, 1+3r delay, 1+2r delay, 1+1r delay, 1+0r delay.]

Figure 4.5. Figure shows the percentage of temperature—induced delay violations that
correspond to a given crosstalk condition.

From the above discussion, it can be argued that the impact of temperature-dependent
delay can be reduced by reducing energy dissipation and hence temperature. We
examined two methods of reducing power: (1) a static design-time technique that uses
a lower bus clock frequency and (2) a dynamic low-power bus encoding scheme called
odd/even bus-invert (OEBI) that reduces toggle transitions [53]. The former is
represented by the data series labeled “@ 0.9xNominal Freq.” and the latter by
“@ Nominal Freq. with OEBI” in Figures 4.2 and 4.3. We observe that slowing the bus
down reduces delay violation rates better than applying the encoding scheme. This is
because reducing the bus clock frequency results in two outcomes, both of which
contribute to reducing wire temperature: (1) it slows down the processor, resulting
in a lower number of bus references per unit time, and (2) it increases the clock
cycle time over which bus switching energy is dissipated. This combined effect
reduces wire power dissipation and hence lowers wire temperatures. In contrast, the
encoding scheme only reduces the total amount of bus switching energy dissipated
in a cycle and does not affect the cycle time. Hence its impact on wire temperature
is smaller. Although the OEBI encoding scheme is designed to reduce the number of
toggle transitions in wires, it has the side effect of increasing the number of
coupling charge/discharge transitions. Thus, in the context of crosstalk, an
OEBI-encoded stream will have a larger number of “1+2r delay” cases. We have observed
earlier that somewhat significant temperature-induced violations are possible for
this case, and this may have contributed additionally to the ineffectiveness of OEBI
in reducing error rates. We also observe that frequency reduction is less effective
at the 45 nm node than at the 130 nm node.
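
To make the odd/even invert idea concrete, the following is a minimal Python sketch
under stated assumptions; it is not the encoder of [53] nor the one used in our
simulator. It assumes four candidate encodings (no inversion, invert odd lines,
invert even lines, invert all lines) and picks the one with the lowest coupling cost
relative to the previously transmitted word. The cost metric (2 per toggle pair, 1
per charge/discharge pair), the first-minimum tie-breaking, and the omission of the
invert-signal lines from the cost are all illustrative choices, and the function
names are hypothetical.

```python
def coupling_cost(prev, curr, width):
    """Sum a coupling cost over adjacent wire pairs: 0 if both wires behave
    alike, 1 if exactly one switches, 2 if they switch in opposite directions."""
    cost = 0
    for i in range(width - 1):
        d_lo = ((curr >> i) & 1) - ((prev >> i) & 1)
        d_hi = ((curr >> (i + 1)) & 1) - ((prev >> (i + 1)) & 1)
        cost += abs(d_lo - d_hi)
    return cost


def oebi_masks(width):
    """Return the XOR masks for modes 0..3: none, odd lines, even lines, all."""
    even = sum(1 << i for i in range(0, width, 2))
    odd = sum(1 << i for i in range(1, width, 2))
    return (0, odd, even, odd | even)


def oebi_encode(prev_tx, data, width):
    """Return (mode, word): the candidate encoding of data with the lowest
    coupling cost relative to the previously transmitted word prev_tx."""
    masks = oebi_masks(width)
    costs = [coupling_cost(prev_tx, data ^ m, width) for m in masks]
    mode = costs.index(min(costs))  # first minimum wins on ties
    return mode, data ^ masks[mode]


def oebi_decode(mode, word, width):
    """Recover the original data from the transmitted word and its mode."""
    return word ^ oebi_masks(width)[mode]
```

Because the unencoded word is always one of the four candidates, the chosen word never
has a higher coupling cost than the unencoded one under this metric; it can still
contain more single-wire (charge/discharge) transitions than the original data.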

4.4.3 Performance Impact

Delay violations, if unchecked, will impact the performance of the processor, requiring
an extra cycle to retransmit the data on the result bus. Also, dependent instructions
may need to wait longer for dependencies to be resolved, and this may cause pipeline
stalls. Table 4.2 shows the instructions-per-cycle (IPC) degradation observed across
our benchmark set; the average performance degradation was 4.08% for the 130 nm
design. Note that this is an optimistic estimate since we have assumed that the
re-transmission is not affected by delay violations, which is strictly valid only if
the bus has cooled down compared to its state during the previous transmission. Our
focus in this work is not to implement a dynamic scheme that inserts an appropriate
number of wait cycles to cool the bus after a delay violation is detected. However,
such a scheme will only cause the data re-transmission to wait longer than what we
have assumed here. Hence, our IPC degradation estimates are lower-bound values. In
reality, since the operating clock frequency at 45 nm is much higher than at 130 nm,
the performance impact will be much higher at the smaller technology node. Our
simulations with 45 nm technology parameters found that the average performance
degradation across the ten benchmarks was 11.92% (Table 4.2).

4.5 Summary

This chapter presented models for estimating the Joule heat and wire temperature
across the length of a global wire, and for determining the resulting
temperature-dependent delay impact. We showed that temperature gradients exist
between the sending and receiving ends of a wire and that this may lead to dynamic
delay variations that can exceed design margins. We used our models to explore the
extent of temperature-induced delay violations that may occur in the ALU result bus
of a processor in the 130 nm and 45 nm technology nodes using real data from ten
SPEC CPU2000 programs. Microarchitectural simulation results show that delay
violations due to temperature gradients are somewhat likely in 130 nm designs, with
an average of 2.27 per hundred bus references for the ALU result bus. In the future
45 nm technology node, the error rate was found to increase to 6.20 per hundred for
the same processor design. Commercial 130 nm processors adopt techniques like an
extra pipeline stage to combat the influence of dynamic delay variations in wires.
However, this leads to over-design. Temperature-aware delay models like the one we
have developed can be used to explore the design space efficiently and avoid
over-design. We also found that, by an optimistic analysis, the performance impact of
overcoming temperature-induced delay violations by re-transmitting data will be about
4.1% in a superscalar design at 130 nm and about 11.9% at the 45 nm technology node.
We also found that conventional techniques like bus encoding that seek to reduce
energy dissipation and potentially wire temperatures have limited impact on
alleviating temperature-induced delay violations. Reducing the bus clock frequency had
a better impact, reducing the average error rate to 1.07 per hundred references in the
130 nm processor, compared to encoding, which reduced error rates to only 1.93 per
hundred references.

[Table 4.2. Performance impact expressed as percentage IPC degradation for the 130 nm
and 45 nm designs, listed per benchmark (gcc, gzip, bzip2, crafty, eon, twolf, art,
mesa, mgrid, swim), with averages of 4.08% and 11.92%, respectively. The remaining
entries are illegible in this copy.]

CHAPTER 5

ACTIVITY-AWARE ENERGY AND
TEMPERATURE OPTIMIZATION

With increasing energy dissipation and wire temperature in processor bus wires and
the inability of existing low-power encoding schemes to address these problems ad-
equately, novel approaches need to be examined. This chapter examines a family
of such energy-efficient techniques that rely on data statistics and a first-of-its-kind
optimization methodology to reduce bus wire temperatures [98].

5.1 Introduction

On-chip wires are a major impediment to realizing the performance gains that
motivate CMOS technology scaling in integrated circuits. At smaller technology nodes,
transistors become faster and somewhat more energy-efficient, but wires become slower
because their smaller cross-sectional area increases their resistance. To counter this,
wires are scaled less aggressively than transistors. However, this leads to
taller and thinner wires that exacerbate parasitic effects like inter-wire coupling
capacitance, leading to relatively more energy dissipation when wire switching
charges/discharges these capacitances. Global signal-carrying wires/lines already
contribute a major portion of total chip power dissipation: about 34% in an Intel
130 nm microprocessor [4]. As a result, rising wire temperatures are becoming an
important issue in high-performance processor design, especially in current and future
nanometer technology nodes, since higher temperatures can impact wire delays and
electromigration reliability [21, 66, 99].

Wires—like those that constitute address, instruction, data, and ALU result
buses—routed in global metal layers are much more susceptible to higher temper-
atures due to the following reasons: (1) with higher clock frequencies, the amount
of energy dissipated in the wire as Joule heat increases compared to the energy dis-
sipated in the repeaters [100], (2) they are furthest away from the substrate which
is attached to the heat sink and they are surrounded by low-K dielectrics that have
poor thermal conductivity, resulting in inefﬁcient heat removal, and (3) their rela-
tively large geometries result in higher thermal capacitance, i.e., the ability to retain
heat. Rising wire temperatures increase wire delays by about 5% for every 20°C rise
in temperature [38]. Wire temperature gradients across the length of the wire also
affect delay. It has been reported that for a 1 mm long wire with the driver in a hot
region and receivers in a cooler region, a temperature difference of 10°C results in a
5 ps (≈ 8%) additional delay at the receiver [93].

5.1.1 Need for Energy and Temperature Aware Bus Design

Real workloads cause bus traffic (in instruction, data, and address buses) that exhibits
substantial spatial and temporal locality and value redundancy. Switching activities are
therefore not random. Further, there is a high degree of correlation between the switching
(self and coupling) activities of traffic in different execution regions of the same
benchmark and across different benchmarks. These characteristics can be exploited using
value-aware design of encoding schemes. Previous techniques are typically
(inversion-based) dynamic encoding schemes that support a set of encoding modes, one of
which is dynamically chosen at run-time in a given cycle in an attempt to reduce
bus energy. These suffer from several drawbacks. First, the encoding modes supported
are those that are effective only for random or worst-case (highly-changing) traffic,
which is not the case in realistic workloads. Such value obliviousness limits their
effectiveness. We present results showing average energy reductions for dynamic
encoding schemes to be only 4.19% (5.32%) at best for data (instruction) traffic across
SPEC CPU2000 benchmarks. Second, being dynamic, they incur a latency overhead in
encoding and decoding and extra area for hardware and control lines. Also, as several
earlier works have demonstrated, the efficacy of inversion-based encoders falls rapidly
as bus width increases [101,102], and bus partitioning schemes have been proposed to
address this issue [103]. However, with partitioned buses, the number of extra lines
required for control signals increases, which restricts their attractiveness.

Third, previous schemes attempt to reduce either self or coupling energy, not
total bus dynamic energy. Hence, their effectiveness will change as the ratio of self to
coupling energy changes with technology scaling. Finally, energy-aware and
temperature-aware design of high-performance buses are only loosely related. Reducing
energy (switching activity) through encoding reduces only the average temperature of a
wire ($t_{avg}$), since it depends on the total energy dissipated over time, which
encoding reduces. However, existing encoding techniques do not explicitly reduce the
maximum temperature of wires ($t_{max}$), since this depends not only on the amount of
energy dissipated in the wire itself but also on that dissipated in its neighbors. For
example, a low-activity wire (victim) with highly-active neighbors (aggressors)
experiences a rise in temperature due to thermal coupling [73]. The effects of thermal
coupling can exacerbate electromigration and other related reliability problems in
high-performance bus wires. Further, due to data locality, a few bus lines are highly
active most of the time, which makes them more susceptible to temperature-induced
failures. Such problems can be remedied by combining encoding that reduces $t_{avg}$
with static bit reordering or permutation that seeks to reduce $t_{max}$ by minimizing
thermal coupling.

5.1.2 Key Contributions and Results

We evaluate several possible ways of signaling a bit value at design time, and then
choose, based on traffic value characteristics, exactly one signaling mode for each bit
statically to support in hardware to minimize total bus dynamic energy. We also
consider all possible ways of mapping bits to bus lines (bit ordering or permutation),
and then choose, again depending upon trafﬁc value characteristics, exactly one bit
ordering statically at design time to support in hardware to minimize total bus dy-
namic energy. The combination of a particular way of signaling different bits and
ordering them on the bus constitutes a static encoding scheme. We present an integer
linear program (ILP) methodology that evaluates q possible bit signaling modes
and all possible bit orderings for an n-bit bus (i.e., it evaluates a total solution space
of $q^n \times n!$ encoding modes) based on traffic value characteristics and then chooses an
optimal encoding mode that minimizes total bus (self + coupling) dynamic energy.
This selection is done at design time using data from microarchitectural simulations,
and the ILP problems are solved optimally in a matter of minutes. Since only one
encoding mode is statically supported in hardware, encoding/decoding (latency, area,
and energy) overhead is virtually non-existent and no control lines are needed.
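
To give a sense of the size of this solution space, the following minimal Python sketch evaluates $q^n \times n!$ directly. The function name `solution_space_size` is hypothetical; the values q = 5 and n = 64 are taken from the number of signaling modes and the data bus width described later in this chapter.

```python
import math

def solution_space_size(q: int, n: int) -> int:
    """Number of static encodings: q signaling choices per bit times n! bit orderings."""
    return q ** n * math.factorial(n)

# q = 5 signaling modes and a 64-bit data bus, as used later in the chapter.
size = solution_space_size(q=5, n=64)
print(f"roughly 10^{len(str(size)) - 1} candidate encodings")
```

Exhaustive enumeration is clearly out of reach at this scale, which is why an optimization formulation is needed.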

Since there is substantial correlation between switching characteristics across
benchmarks, our static encoding scheme optimized for one set of training benchmarks
works very well for a different set of test benchmarks. We refer to this as
general-purpose optimization; in this case, we obtain 20.04% (38.78%) average
total bus energy reductions with our best scheme for data (instruction) buses. With
increasing degrees of customization (suitable for particular application domains or
embedded systems), effectiveness improves: we obtain average bus energy reductions
of 22.79% (40.77%) for workload-specific and 30.2% (52.1%) for program-specific
optimization scenarios for data (instruction) buses. These average percentage bus energy
reductions for our static encoding schemes are 5 to 10 times better than those of
existing, more complex dynamic encoding schemes.

We present a new way of bit signaling based on Markov models. Markov models
have been used in a variety of situations (e.g., branch prediction, instruction
compression, etc.), but never in the context of bus encoding or low-power bus design.

We show that lowering bus energy (even significantly, as with our static
encoding schemes) does not necessarily lower peak wire temperatures (the highest
temperature attained by a bus wire during a program run); in fact, it may often
increase them slightly. To address this, we present a novel method of efficiently
exploring the trade-off space between peak wire temperature and total bus dynamic
energy using a steady-state wire temperature model. Based on this, we present a new
method of introducing thermal constraints into our energy optimization methodology that
allows a designer to trade off peak wire temperature against total bus dynamic energy
as desired. For this thermally-constrained, energy-optimal static encoding scheme,
we then perform simulations using a detailed per-wire bus thermal model to
determine the actual reductions in peak temperature, which we find to be significant.
For example, by sacrificing approximately 50% of the energy savings provided by
the thermally-unconstrained, energy-optimal version of our scheme, we obtain up to
12.26°C (12.96°C) and on average 8.03°C (9.24°C) peak wire temperature
reductions for data (instruction) buses, while at the same time providing significant
average energy savings: 14.24% (16.17%) for data (instruction) buses (still much better
than previous work). No previous work has attempted thermally-constrained energy
optimization of buses. A recently proposed spreading encoding technique, which targets
only peak wire temperature reduction and does not perform any energy optimization,
has a number of drawbacks: the latency, hardware, and energy overhead of a crossbar
switch network and the use of a counter. We also find that, for the same benchmarks, it
does not provide as much temperature reduction.

Finally, if needed, appropriate dynamic bus encoding schemes and the spreading
technique for temperature reduction can be applied after our static encoding schemes
to further optimize bus energy and temperature. In this sense, our work
is orthogonal to, although much more effective than, these previous works.

The organization of the rest of this chapter is as follows. In Section 5.2, we discuss
related work. Next, in Section 5.3, we discuss our methodology. Following that, in
Section 5.4, we present our techniques. Then, we present results in Section 5.5.

Finally, we summarize in Section 5.6.

5.2 Related Work

Prior work on low-power bus design can be classified into three categories: (1) memory
bus encoding schemes that reduce only self transitions, many of which are surveyed
in [86], (2) on-chip bus encoding schemes that target both self and inter-wire coupling
energy reduction [53,54,104], and (3) wire permutation techniques like those proposed
in [105-109] that seek to minimize coupling energy. Memory bus and on-chip bus
encoding schemes are dynamic in nature, whereas wire permutation techniques are static.

Our proposed optimization approach differs from the prior related work discussed
above in many ways. First, the wire permutation schemes discussed in [105-107]
optimize only inter-wire coupling energy, whereas our scheme combines the benefits of
signaling that reduces self transitions with permutation that seeks to minimize
coupling energy. In contrast to the optimization technique suggested in [108], our work
considers a wider array of signaling schemes and solves the combined signaling and
permutation problem optimally, whereas they use a greedy algorithm. This contributes
to better results using our optimization technique. Compared to the address bus
ordering scheme proposed in [109], which can be applied to 8-bit buses only, our scheme
can be applied to any bus regardless of bus width or transmitted data. Furthermore,
their optimization uses a simulated annealing technique, whereas we solve the problem
optimally using integer linear programming, for much larger bus sizes and with
comparable time complexity. Our optimal static encoding scheme also results in much
better energy reductions compared to well-known dynamic low-power bus encoding
schemes.

Most importantly, our technique incorporates a thermal optimization methodology
for buses, which has not been addressed by any previous work. Rising wire
temperatures are becoming an important issue in high-performance processor design,
especially in current and future nanometer technology nodes [21,66]. To analyze
temperature-related issues, microarchitecture-level thermal models like HotSpot
[64,66] have been proposed to estimate substrate (active-layer) temperatures.
Interconnect thermal models have also been proposed recently [71]. It has been shown that,
in global layer interconnects, activity-dependent Joule heat dissipation in the metal
leads to thermal coupling between adjacent wires, causing the maximum wire temperature
to shoot up beyond safe design limits [73].

A recent work proposed a methodology called thermal spreading encoding to re-
duce wire temperatures [110]. In that work, data is bit-shifted periodically before
being transmitted on the bus, in an attempt to equalize wire temperatures across the
bus by averaging out the Joule heat dissipated across all lines. This technique does
not reduce energy dissipation since the coupling energies dissipated in the bus lines
remain more or less the same after each shift. Furthermore, it does not alleviate the
problem of temperature rise due to thermal coupling between wires. In contrast, our
work addresses both these issues through the use of bit re-ordering instead of circular
shifting. Spreading encoding, as discussed in [110], is a dynamic technique and
uses an n × n crossbar for an n-bit bus and control logic (counters, etc.) to generate
periodic shift signals. Our technique is completely static, incurs negligible overhead,
and achieves much better temperature reductions.
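
The difference between circular shifting and static re-ordering can be illustrated with a minimal Python sketch. It assumes a hypothetical 6-bit bus and simply lists which bits end up as physical neighbors; the helper names `neighbor_pairs` and `rotate` and the example permutation are illustrative only and are not taken from [110] or from our optimization.

```python
def neighbor_pairs(order):
    """Unordered pairs of bits that sit on physically adjacent bus lines."""
    return {frozenset(p) for p in zip(order, order[1:])}

def rotate(order, shift):
    """Circularly shift the bit-to-line mapping by `shift` positions."""
    shift %= len(order)
    return order[-shift:] + order[:-shift]

original = [0, 1, 2, 3, 4, 5]
shifted = rotate(original, 2)      # [4, 5, 0, 1, 2, 3]
permuted = [0, 3, 1, 4, 2, 5]      # hypothetical static re-ordering

kept_by_shift = neighbor_pairs(original) & neighbor_pairs(shifted)
kept_by_permutation = neighbor_pairs(original) & neighbor_pairs(permuted)
print(len(kept_by_shift), len(kept_by_permutation))
```

In this example the shift keeps four of the five original neighbor pairs, so the pairwise coupling activity is largely unchanged, whereas the permutation keeps none and can therefore be chosen to separate strongly coupled bits.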

5.3 Methodology

We used the SimpleScalar/Alpha microarchitecture-level simulator to design and
evaluate our techniques [67]. The Alpha 21264 architecture modeled by this simulator
uses a 64-bit (load/store) data bus between the processor and L1 data cache and a
128-bit instruction bus (fetch width = 4) between the processor and L1 instruction
cache. Since we have assumed our processor implementation technology to be 130 nm,
the clock rate was taken to be 1.68 GHz. We used little-endian Alpha executables
of all 26 benchmarks from the SPEC CPU2000 suite with the ref input set and ran
our simulations on a shared Linux cluster. We selected the SPEC suite as our target
workload since pre-compiled little-endian executables for our target platform (Alpha
21264) were readily available for this suite from the SimpleScalar website [76]. However,
our optimization methodology is equally applicable to other application and
benchmark suites.

We divided the 26 SPEC benchmarks into a training set and a test set, with 13
programs in each set chosen arbitrarily. The training set comprised gzip, vpr(route),
gcc, crafty, gap, vortex, wupwise, mgrid, mesa, art, facerec, lucas, and sixtrack, and
the test set had mcf, parser, eon, perlbmk, bzip2, twolf, swim, applu, galgel, equake,
ammp, fma3d, and apsi. For these benchmarks, we used the single 100-million-instruction
simulation point (SimPoint) sample to collect data for our analysis [77, 78].

5.3.1 Target Scenarios

The three scenarios that we consider are, in the order of increasing degrees of cus-
tomization, general-purpose, workload-speciﬁc, and program-speciﬁc. We consider
these scenarios to show that our value-aware optimization techniques work well across
all scenarios. Speciﬁc details of the analysis, design, and test steps for these scenarios

are shown in Table 5.1 and are elaborated next.

Analysis Step — Data Collection and Aggregation

                          Target Scenarios
Step      General-Purpose        Workload-Specific       Program-Specific
--------  ---------------------  ----------------------  -----------------------
Analysis  Collect energy/cost matrices from SimPoint     Collect energy/cost
          samples of the 13 training set programs and    matrices from SimPoint
          aggregate them.                                samples of each program
                                                         individually.
Design    Obtain the static encoding scheme using the CPLEX ILP optimizer.
Test      Apply the static       Apply the static        Apply the static
          encoding scheme on     encoding scheme on a    encoding scheme on the
          SimPoint samples of    sample of 100M          same SimPoint sample
          the 13 test set        committed instructions  used in the analysis
          programs.              that does not overlap   step.
                                 with the SimPoint
                                 sample for the 13
                                 training set programs.

Table 5.1. Optimization scenarios considered in this work.

We consider several possible ways of signaling a bit value, with exactly one signaling
mode for each bit chosen statically at design time depending on traffic value
characteristics to minimize total bus dynamic energy. We also consider all possible
ways of mapping bits to bus lines (bit ordering or permutation) and then
choose exactly one bit ordering statically at design time, again depending on traffic
value characteristics. Hence, in the analysis step, we collect energy information
for all possible bit signalings and reorderings for all pairs of wires; these are
represented in the form of energy cost matrices whose elements are represented as
$e_{l,m}[i][j]$, $\{l, m\} \in \{0, \ldots, q-1\}$, $\{i, j\} \in \{0, \ldots, n\}$, where q is the number of
signaling mode choices that we consider. These signaling modes are discussed in detail
in the next section.

Each element $e_{l,m}[i][j]$ is obtained by adding two components, both of which
are collected using the bus line energy dissipation model [81] in the cycle-accurate
simulator for our target buses: the coupling energy $c_{l,m}[i][j]$ dissipated when bits i
and j, signaled using modes l and m, respectively, are placed next to each other on
the bus, with j being the right-adjacent neighbor of i, and one-half of the self energies
$s_{l,m}[i]$ and $s_{l,m}[j]$ of the bits, when they are signaled using the signaling modes l
and m, respectively.

When individual energy/cost matrices need to be aggregated across benchmarks
($B_0, B_1, \ldots, B_{13}$), as required in the general-purpose and workload-specific
optimization scenarios (see Table 5.1), we add the corresponding elements of the matrices
across all benchmarks:

$$e_{l,m}[i][j] = e^{B_0}_{l,m}[i][j] + e^{B_1}_{l,m}[i][j] + \cdots + e^{B_{13}}_{l,m}[i][j], \quad \forall\, i, j, l, m. \qquad (5.1)$$
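
The construction of a cost element and the aggregation across benchmarks can be sketched in Python as follows. This is a minimal illustration and not the simulator code: the function names, the nested-list data layout, and the small numeric values are assumptions made for the example, and the self-energy arguments stand for the self energies of bits i and j under the chosen signaling modes.

```python
def cost_element(coupling, self_i, self_j):
    """e = coupling energy of the adjacent pair + half of each bit's self energy."""
    return coupling + 0.5 * (self_i + self_j)

def aggregate(matrices):
    """Element-wise sum of same-shaped cost matrices, one per benchmark."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) for j in range(cols)]
            for i in range(rows)]

# Hypothetical 2 x 2 cost matrices for two benchmarks and one mode pair (l, m).
bench_a = [[0.0, 1.5], [2.0, 0.0]]
bench_b = [[0.0, 0.5], [1.0, 0.0]]
total = aggregate([bench_a, bench_b])
```

Because every interior wire belongs to two adjacent pairs, charging half of its self energy to each pair means that a sum of pair costs along the bus counts that wire's self energy exactly once.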

Design Step — Integer Linear Programming

We use ILOG CPLEX 9.0, a commercial mathematical programming optimizer, to
solve the ILP problems [111]. CPLEX provides a C++ interface and a callable library
that facilitate reading of input files (containing our energy/cost matrices), examining
candidate solutions, and re-solving the problem after adding appropriate constraints.
To improve solution times, we also added a greedy approach to find subtours at each
node and included elimination constraints for such subtours in our ILP.

Test Step — Getting Results

After the static encoding techniques are designed, results are collected for the
benchmarks/samples mentioned in Table 5.1, depending on the scenario being considered.
The effectiveness of our optimization methodology depends on the degree of similarity
between the training and test benchmarks/samples. To probe the extent of similarity,
we calculated the values of the correlation coefficient $r_{xy}$, with x representing the
test set energy matrix linearized into a vector and y representing the training set energy
matrix also linearized into a vector, using MATLAB for the various signaling schemes
listed in Section 5.4.1. These are shown in Table 5.2. The correlation of two variables
reflects the linear dependence between them, i.e., it provides an estimate of how well
the value of one variable can be predicted from the value of the other. If $r_{xy}$ is close
to unity, the variables are strongly correlated, which we find is the case with our training
and test set coupling energy values, for both general-purpose and workload-specific
optimization scenarios.
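
As a rough illustration of this similarity check, the sketch below linearizes two matrices and computes their correlation. It assumes that $r_{xy}$ is the standard Pearson correlation coefficient; the function names and the small example matrices are hypothetical, and the original analysis was done in MATLAB rather than Python.

```python
import math

def flatten(matrix):
    """Linearize a 2-D matrix (list of rows) into a single vector."""
    return [v for row in matrix for v in row]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical training and test energy matrices (arbitrary units).
train = [[1.0, 2.0], [3.0, 4.0]]
test = [[1.1, 1.9], [3.2, 3.9]]
r_xy = pearson_r(flatten(test), flatten(train))
```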

 

 

                       $r_{xy}$ for Signaling Mode
Optimization Type    org      inv      trs      itr      mm
General-purpose      0.9602   0.9602   0.9609   0.9609   0.9451
Workload-specific    0.9644   0.9644   0.9687   0.9687   0.9610

Table 5.2. Correlation coefficients $r_{xy}$ between test and training set data for various
signaling schemes discussed in Section 5.4.1. Since the $r_{xy}$ values are close to 1, our
training and test sets are well correlated.

5.3.2 Bus Layout and Wire Geometry

We assume a standard model of a bus consisting of a sequence of n + 2 parallel,
minimum-width, minimum-spaced, identically-dimensioned, co-planar wires
$(W_{n+1}, W_n, \ldots, W_1, W_0)$ from left to right, where $W_1, W_2, \ldots, W_n$ are signal lines
and $W_0$ and $W_{n+1}$ are power/ground lines that act as shields. The bus is assumed
to use static logic; therefore, it retains a previously-transmitted value until a
different one is transmitted. We assume the bus to be 6 mm long, routed in the
topmost metal layer, and buffered by identical repeaters spaced equally apart in a
microprocessor fabricated in the 130 nm technology node. This global interconnect
length is typical in many modern microprocessor floor plans [112]. A uniform repeater
insertion methodology was used in this bus to ensure that the propagation delay
did not exceed one clock cycle [46]. Several earlier works have also used this repeater
model to evaluate buses. Wire geometry parameters were obtained from ITRS [1], and
we used FastCap, a three-dimensional capacitance extraction program, to estimate
the parasitic capacitances of each wire [7].

5.4 Static Techniques for Bus Energy and Temperature Optimization

In this section, we present three optimization techniques for designing static encoding
schemes for on-chip signal buses and minimizing energy dissipation and wire
temperature based on their value characteristics.

5.4.1 Choice of Signaling Modes

We use five candidate signaling modes in our optimization technique, one of which
is selected for each bit: original (org), inverted (inv), transition signaling (trs),
inverted transition signaling (itr), and Markov model signaling (mm). In inv, the
data on the bit line is always transmitted in inverted form; in trs, the XOR of
the previous and current original values of the bit is transmitted; and in itr, the
XNOR of the previous and current original values of the bit is transmitted. We chose
candidate signaling modes based on three characteristics: (1) potential to reduce self
switching energy, (2) potential to reduce coupling energy with neighboring bits, and
(3) potential to reduce the temporal distribution of energy-causing transitions. We
evaluate our candidate schemes according to these characteristics next.
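
The four stateless or single-history modes can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function names, the string mode labels, and the assumed initial previous value of 0 are hypothetical choices for the example (the text does not specify a reset state), and mm signaling is omitted here because it needs a prediction table.

```python
def encode(bits, mode, prev=0):
    """Encode one bit line's original stream with a static signaling mode.

    `prev` is the assumed original value before the first bit (hypothetical).
    """
    out = []
    for b in bits:
        if mode == "org":
            out.append(b)
        elif mode == "inv":
            out.append(1 - b)
        elif mode == "trs":
            out.append(prev ^ b)          # XOR of previous and current value
        elif mode == "itr":
            out.append(1 - (prev ^ b))    # XNOR of previous and current value
        else:
            raise ValueError(f"unknown mode: {mode}")
        prev = b
    return out

def self_transitions(line):
    """Number of cycles in which the transmitted value changes."""
    return sum(a != b for a, b in zip(line, line[1:]))

stream = [0, 1, 0, 1, 0, 1]   # hypothetical highly-changing bit stream
for mode in ("org", "inv", "trs", "itr"):
    sent = encode(stream, mode)
    print(mode, sent, self_transitions(sent))
```

For this alternating stream, org and inv each switch on every cycle, while trs and itr settle into a run of identical values after the first cycle.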

Inverted signaling

Our optimization uses static inverted signaling (inv) as a candidate mode, i.e., the
ILP is used to decide whether data on a bit line is always to be sent in inverted form,
depending on the value characteristics that we obtain for that bit from our training
set. For any bit, this mode will be chosen if the amount of self and inter-wire coupling
activity it causes with its neighboring wires is less than that for the original mode of
transmission. Signaling a bit line with inv does not reduce the self switching activity
and alters the temporal distribution of energy-dissipating transitions only slightly,
but it can potentially reduce the coupling transitions significantly. For
example, a two-bit stream can be made completely toggle-free by inverting one of the
bits and keeping the other in original mode; a significant amount of energy can be
saved, since toggles dissipate more energy than charge/discharge and self
transitions.
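
The two-bit example can be checked with a small Python sketch. It assumes that a toggle is a cycle in which two adjacent lines switch in opposite directions; the function names and the bit streams are hypothetical and chosen so that the two bits always take opposite values.

```python
def count_toggles(line_a, line_b):
    """Count cycles in which two adjacent lines switch in opposite directions."""
    toggles = 0
    for t in range(1, len(line_a)):
        da = line_a[t] - line_a[t - 1]   # +1 rising, -1 falling, 0 no change
        db = line_b[t] - line_b[t - 1]
        if da * db == -1:
            toggles += 1
    return toggles

def invert(bits):
    """Static inverted signaling of one bit line."""
    return [1 - b for b in bits]

# Hypothetical two-bit stream whose bits always take opposite values.
bit0 = [0, 1, 0, 1, 0, 1]
bit1 = [1, 0, 1, 0, 1, 0]

before = count_toggles(bit0, bit1)          # both lines sent in org mode
after = count_toggles(bit0, invert(bit1))   # bit1 sent in inv mode
```

Here every cycle is a toggle when both bits are sent in org mode, and none is once one bit is inverted, because the two lines then always switch in the same direction.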

Transition signaling

This signaling mode (trs) and its dual (itr) affect all three characteristics listed
earlier. For bit-streams that are highly-changing, this mode can reduce self switching
activity signiﬁcantly and also reduce coupling transitions with a neighboring org-
or inv-signaled line since every toggle transition is converted to a lower-energy-
dissipating charge/discharge transition. It also reduces the temporal distribution
of energy-dissipating transitions by converting a highly-changing pattern into a run
of ones/zeros.

Markov model signaling

In this candidate signaling technique, we use a small amount of hardware at the
sending and receiving ends for only the bits that are selected to be signaled using
this scheme. To our knowledge, this work is the first to use Markov model
signaling (mm) to reduce bus energy dissipation in a value-aware framework. For bits
chosen to be signaled using mm, we maintain a history of k previous bits from the
original data stream that was to be transmitted. These k bits define the current state
of the Markov model, which is maintained at both the sending and receiving ends. At both
ends, the encoding/decoding logic uses this current state to predict the next bit to be
sent on the bus. At the sending end, if this prediction matches the actual bit value
to be sent, the bus line is held at its current value. Otherwise, we signal a transition
on the bus line, which indicates a mis-prediction to the receiver. The receiver can
retrieve the actual data by sampling the state of the bus lines (transition or no
transition) at the end of the clock cycle, since it also has information on the current
state of each bus line. The key to an efficient implementation of this signaling scheme is
the design of the encoding logic. We analyze SimPoint samples of our 13 training
benchmarks to build a prediction table. A portion of the 4-bit/16-state prediction
table, for bus lines 0 to 7 of the data bus, is shown in Figure 5.1(a). This can
be translated into hardware using standard logic synthesis tools. As an example, we
show in Figure 5.1(b) the logic circuits required for implementing the prediction table
for bits 0 through 7, obtained by logic minimization using the Espresso tool [113].
These circuits have at most two levels of logic and hence the hardware overheads they
impose will be negligible.
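
The encode and decode behavior described above can be sketched in Python for a single bit line. This is a software illustration under stated assumptions, not the hardware design: building the table by majority vote over a training stream, shifting the newest bit into the low end of the history, and starting from an all-zero state and a low bus line are all hypothetical choices, as are the function names and example streams.

```python
def build_table(train_bits, k):
    """Majority-vote next-bit prediction for each of the 2^k history states."""
    mask = (1 << k) - 1
    counts = [[0, 0] for _ in range(1 << k)]
    state = 0
    for b in train_bits:
        counts[state][b] += 1
        state = ((state << 1) | b) & mask   # shift the newest bit into the history
    return [1 if ones > zeros else 0 for zeros, ones in counts]

def mm_encode(bits, table, k):
    """Hold the line on a correct prediction; toggle it on a misprediction."""
    mask, state, line, sent = (1 << k) - 1, 0, 0, []
    for b in bits:
        if table[state] != b:
            line ^= 1
        sent.append(line)
        state = ((state << 1) | b) & mask
    return sent

def mm_decode(sent, table, k):
    """Recover the original bits from the observed transitions."""
    mask, state, prev_line, bits = (1 << k) - 1, 0, 0, []
    for line in sent:
        predicted = table[state]
        b = predicted if line == prev_line else 1 - predicted
        bits.append(b)
        prev_line = line
        state = ((state << 1) | b) & mask
    return bits

def transitions(line):
    """Number of cycles in which the line value changes."""
    return sum(a != b for a, b in zip(line, line[1:]))

k = 2
table = build_table([0, 1] * 50, k)   # hypothetical training stream
data = [0, 1] * 10                    # hypothetical stream to transmit
sent = mm_encode(data, table, k)
```

Because both ends update the same state from the same recovered bits, the receiver needs only the table and the line history to invert the encoding, and a well-predicted stream causes few transitions on the bus line.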

[Figure 5.1: (a) next-bit prediction table indexed by current state ($s_3 s_2 s_1 s_0$) for bit lines 0 to 7; (b) Markov model signaling logic for bit 0 and bit 7, where $s^x_y$ denotes bit y of the current state for the x-th bit line. The figure graphics are not recoverable from the source scan.]

Figure 5.1. Markov model-based signaling technique. (a) A 4-bit prediction table for
the Markov model for bits 0-7 of the data bus obtained by analyzing training set
benchmarks. Depending on which bits are selected for Markov model signaling, the
corresponding row of the table can be translated to hardware using logic minimization
tools. (b) Examples of sending-end hardware that would be required for 2 bits (0 and
7), assuming these are chosen to be signaled using the mm scheme. As can be seen, the
logic overhead required for mm signaling is minimal.

We tested Markov model based prediction schemes of varying depth, from 1-bit
(2 states) to 10-bit (1024 states), for their prediction accuracy. As expected, the
prediction accuracy improved as the depth of the model increased. However, we found
that beyond a depth of 4 (2) for data (instruction) buses, the rate of improvement in
prediction accuracy dropped significantly. Hence, we chose the 4-bit Markov model
for the data bus and the 2-bit Markov model for the instruction bus.

Henceforth, in this paper, we shall denote the candidate signaling schemes using
subscript numbers 0 through 4 instead of org, inv, trs, itr, and mm, respectively.
Let q represent the number of candidate signaling schemes; q = 5 in this work.
Our ILP formulations use energy/cost matrices or vectors whose individual elements
we represent as $e_{l,m}[i][j]$, $\{l, m\} \in \{0, \ldots, 4\}$, $\{i, j\} \in \{0, \ldots, n\}$. For example,
$e_{0,1}[i][j]$ represents the energy dissipated between bits i and j when they are placed
next to each other on the bus and wire i is signaled using the org scheme and wire
j using the inv scheme. Since there are five signaling schemes, we have a total of
25 energy/cost matrices collected for the training set benchmarks and/or simulation
sample, depending on the scenario that we consider.

Note that all energy/cost matrices are (n + 1) × (n + 1) matrices because we
consider the two shield wires as one node, called a dummy node. The solutions to
our ILPs (MEBO and SBOS) are obtained as Hamiltonian cycles, and we use the
location of the dummy node to break the cycle into a linear bit order. However, the
dummy node is not used in the ILP formulation for MES. The ILP formulations using
these notations are discussed next.
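
Breaking a Hamiltonian cycle at the dummy node can be sketched as follows. The representation of the cycle as a successor map and the function name `cycle_to_bit_order` are assumptions made for this illustration; the node labels in the example are arbitrary.

```python
def cycle_to_bit_order(succ, dummy):
    """Cut a Hamiltonian cycle at the dummy (shield) node.

    succ[v] is the right-adjacent neighbor of node v in the cycle. The walk
    starts just after the dummy node and stops when it returns to it.
    """
    order, v = [], succ[dummy]
    while v != dummy:
        order.append(v)
        v = succ[v]
    return order

# Hypothetical cycle dummy(0) -> 3 -> 1 -> 2 -> dummy(0).
succ = {0: 3, 3: 1, 1: 2, 2: 0}
bit_order = cycle_to_bit_order(succ, dummy=0)
```

The resulting list gives the left-to-right placement of the signal bits between the two shield wires, here bits 3, 1, and 2.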

5.4.2 Minimum Energy Signaling (MES)

In minimum energy signaling (MES) optimization, we seek to find a static signaling
scheme for each bit line of the bus, from among the five possible schemes discussed in
Section 5.4.1, with the goal of minimizing the total self and coupling energy dissipated.
In the ILP formulation, for each adjacent bit pair (i, i + 1), we associate 25 binary
variables $y_{l,m}[i]$, $\{l, m\} \in \{0, \ldots, q-1\}$, representing all combinations of signaling
two bits using five schemes. Thus, the binary variable $y_{0,0}[i] = 1$ if both the i-th and
(i + 1)-th bits are to be signaled using the original mode (i.e., the bits are transmitted
as in the original traffic). Otherwise, $y_{0,0}[i] = 0$. The formulation of MES in terms of
the y variables is given next:

$$\text{Minimize} \quad \sum_{i=0}^{n} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} \left( e_{l,m}[i] \cdot y_{l,m}[i] \right)$$

subject to:

$$y_{l,m}[i] \in \{0, 1\}, \quad \forall\, \{l, m\} \in \{0, \ldots, q-1\},\ \forall\, i \qquad (5.2)$$

$$\sum_{l=0}^{q-1} \sum_{m=0}^{q-1} y_{l,m}[i] = 1, \quad \forall\, i \qquad (5.3)$$

$$\sum_{l=0}^{q-1} y_{l,m}[i] = \sum_{l=0}^{q-1} y_{m,l}[i+1], \quad \forall\, m,\ \forall\, i \qquad (5.4)$$

Constraint 5.2 ensures that the variables take only binary values, Constraint 5.3
ensures that exactly one combination of signaling schemes is associated with each wire
pair, and Constraint 5.4 ensures that the signaling schemes chosen for adjacent wire pairs
are consistent. Solving this ILP yields an optimal (minimum energy) signaling scheme
for the bus.
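
For intuition about what the objective and the consistency constraint express, the sketch below solves a tiny instance of the same pairwise-cost problem in Python. Note that it swaps in a different technique: dynamic programming along the chain of bits, checked against brute-force enumeration, rather than the ILP solved with CPLEX in this work. The function names, the cost layout `pair_cost[i][l][m]`, and the synthetic cost values are assumptions for the example, and shield wires are ignored.

```python
from itertools import product

def mes_dp(pair_cost, q):
    """Minimum-cost mode assignment for a chain of bits.

    pair_cost[i][l][m] is the cost of adjacent pair (i, i+1) when bit i uses
    mode l and bit i+1 uses mode m. Returns (total_cost, modes_per_bit).
    """
    best = [0] * q      # best[l]: cheapest cost so far with the current bit in mode l
    back = []           # back[i][m]: best mode of bit i given bit i+1 uses mode m
    for costs in pair_cost:
        new_best, choice = [], []
        for m in range(q):
            l_star = min(range(q), key=lambda l: best[l] + costs[l][m])
            choice.append(l_star)
            new_best.append(best[l_star] + costs[l_star][m])
        best = new_best
        back.append(choice)
    last = min(range(q), key=lambda m: best[m])
    total, modes = best[last], [last]
    for choice in reversed(back):
        modes.append(choice[modes[-1]])
    modes.reverse()
    return total, modes

def mes_brute_force(pair_cost, q):
    """Reference answer by enumerating all q^n mode assignments."""
    n_bits = len(pair_cost) + 1
    return min(
        sum(pair_cost[i][a[i]][a[i + 1]] for i in range(n_bits - 1))
        for a in product(range(q), repeat=n_bits)
    )

# Hypothetical costs for a 4-bit bus (3 adjacent pairs) and q = 3 modes.
q = 3
pair_cost = [[[(7 * i + 3 * l + 5 * m + l * m) % 11 for m in range(q)]
              for l in range(q)] for i in range(3)]
total, modes = mes_dp(pair_cost, q)
```

Choosing one mode per bit and reading each pair cost from the modes of its two bits is exactly what Constraints 5.3 and 5.4 enforce on the y variables; the ILP form becomes necessary once signaling is combined with bit ordering.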

5.4.3 Minimum Energy Bit Ordering (MEBO)

In contrast to MES, the next technique, minimum energy bit ordering (MEBO), seeks to minimize inter-wire coupling energy by reordering the bits. Thus, in MEBO, all bits are signaled using the original mode. It is formulated as an instance of the traveling salesman problem (TSP), which is one of the most widely studied combinatorial optimization problems. Simply stated, in the TSP, a salesman needs to visit $n$ cities, visiting each exactly once, and return to the starting city with the minimum total trip cost. In graph theory terminology, MEBO is expressed as follows: consider a complete digraph $G = (V, A)$, where $V = \{1, \ldots, n+1\}$ is the vertex set that represents the $n+1$ bits including the dummy node, $A = \{(i, j) : i, j \in V\}$ is the arc set, and $e_{0,0}[i][j]$ is the energy/cost associated with arc $(i, j)$, i.e., the total energy dissipated if bit $j$ is placed as the right-adjacent neighbor of bit $i$ on the bus, with $e_{0,0}[i][i] = \infty, \ \forall i$. Note that we use only $e_{0,0}[i][j]$ in MEBO since all bits are signaled using the original mode only. The problem is to find a minimum energy cycle that includes every node in the graph exactly once, i.e., to find the minimum weight Hamiltonian cycle in $G$.

The MEBO formulation has one binary variable associated with each arc of $G$, represented by $x[i][j]$. In the solution, $x[i][j] = 1$ if bits $i$ and $j$ are to be placed next to each other on the bus, with bit $j$ as the right-adjacent neighbor of $i$, and $x[i][j] = 0$ if $i$ and $j$ are not to be placed next to each other. The ILP formulation in terms of the variables $x[i][j]$ is given next:

Step 1: Formulate the ILP:

$$\text{Minimize} \quad \sum_{\forall (i,j) \in A} e_{0,0}[i][j] \cdot x[i][j]$$

subject to:

$$x[i][j] \in \{0, 1\}, \quad \forall i, j \in V \tag{5.5}$$

$$\sum_{\forall j \in V} x[i][j] = 1, \ \forall i \in V \quad \text{and} \quad \sum_{\forall i \in V} x[i][j] = 1, \ \forall j \in V \tag{5.6}$$

Step 2: Solve ILP to get the solution.

Step 3: Check if the solution has subtours. If none, go to Step 6. Else, let there be $t$ subtours: $S = \{S_0(n_0), S_1(n_1), \ldots, S_t(n_t)\}$, where $S_k(n_k)$ means that subtour $S_k$ has length $n_k$.

Step 4: Add subtour elimination constraint:

$$\sum_{(i,j) \text{ in } S_k(n_k)} x[i][j] < n_k, \quad \forall k \tag{5.7}$$

Step 5: Go to Step 2.

Step 6: The desired solution (Hamiltonian cycle) has been obtained. Stop.

In the procedure described above, Constraint 5.5 ensures that the variables take only binary values. Constraint 5.6 ensures that the in- and out-degrees of every vertex are one, i.e., every bit occurs exactly once in the ordering. Eliminating all possible subtours in the beginning will increase the number of constraints substantially and may lead to a huge time overhead when solving the problem. Hence, we adopt an iterative approach to solve the problem in shorter time. First, we solve the problem with constraints eliminating all possible subtours of two nodes only. Then, we search the solution for the presence of subtours, and if any are found, we add constraints to eliminate those subtours, and then re-solve. We found that almost all problems converge to a feasible solution (i.e., a Hamiltonian cycle) within a few hundred iterations using this iterative method and in a matter of minutes (see Table 5.3).
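The subtour check and the constraint added on each iteration can be sketched as follows. This is only an illustration of Steps 3 and 4, not the CPLEX automation used here. The dictionary `successor` (with `successor[i] = j` meaning $x[i][j] = 1$) and both function names are assumptions, and the ILP solve itself is left out.

```python
def find_subtours(successor):
    """Split a degree-one solution into its cycles.

    successor[i] = j means arc (i, j) is selected, i.e. x[i][j] = 1.
    Constraint 5.6 guarantees every node has exactly one successor and one
    predecessor, so following successors always returns to the start node.
    """
    unvisited = set(successor)
    tours = []
    while unvisited:
        node = min(unvisited)  # deterministic starting point
        tour = []
        while node in unvisited:
            unvisited.remove(node)
            tour.append(node)
            node = successor[node]
        tours.append(tour)
    return tours


def elimination_constraints(tours):
    """Build one constraint per subtour S_k of length n_k.

    Each entry is (arcs, bound), read as: sum of x[i][j] over arcs <= bound,
    with bound = n_k - 1 (the integer form of the strict '< n_k' in Eq. 5.7).
    """
    constraints = []
    for tour in tours:
        n_k = len(tour)
        arcs = [(tour[p], tour[(p + 1) % n_k]) for p in range(n_k)]
        constraints.append((arcs, n_k - 1))
    return constraints


# Hypothetical solution on five nodes that splits into two subtours.
solution = {1: 2, 2: 1, 3: 4, 4: 5, 5: 3}
tours = find_subtours(solution)
is_hamiltonian = len(tours) == 1
new_constraints = [] if is_hamiltonian else elimination_constraints(tours)
```

In the iterative procedure, `new_constraints` would be appended to the ILP before re-solving, and the loop would stop once `is_hamiltonian` is true.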

5.4.4 Simultaneous Bit Ordering and Signaling (SBOS)

In simultaneous bit ordering and signaling (SBOS), we seek to combine the MES and MEBO optimizations described above. Thus, for each bit, the best signaling scheme (one of the five schemes listed in Section 5.4.2) and the appropriate position of the bits on the bus lines are to be determined simultaneously. Note that combining MES and MEBO does not mean that the energy reductions with SBOS (the combined technique) will be exactly equal to the sum of savings obtained separately with MES and MEBO. In fact, the motivation for combining these problems is to enable the optimizer to select the optimal solution from a richer set of possibilities. Thus, we can view the problem as similar to MEBO but consisting of $n+1$ supernodes corresponding to the $n$ bits of the bus and the dummy wire. A supernode contains five nodes, each representing a signaling scheme choice for a bit. By adding constraints that ensure that only one of these nodes is selected for each supernode and that the incoming and outgoing nodes for each supernode are the same, the ILP for SBOS is formulated as described next:

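$$\text{Minimize} \quad \sum_{\forall (i,j) \in A} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} e_{l,m}[i][j] \cdot x_{l,m}[i][j]$$

subject to:

$$x_{l,m}[i][j] \in \{0, 1\}, \quad \forall \{i, j\} \in V \tag{5.8}$$

$$\sum_{\forall j \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[i][j] = 1, \quad \forall i \in V \tag{5.9}$$

$$\sum_{\forall k \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[k][i] = 1, \quad \forall i \in V \tag{5.10}$$

$$\sum_{\forall i \in V} \sum_{l=0}^{q-1} x_{l,m}[i][j] = \sum_{\forall k \in V} \sum_{l=0}^{q-1} x_{m,l}[j][k], \quad \forall j \in V, \ \forall m \tag{5.11}$$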

Constraint 5.8 ensures that all $x_{l,m}[i][j]$ take only binary values. Constraints 5.9 and 5.10 ensure that there is exactly one outgoing and one incoming node selected, respectively, for each of the $n+1$ supernodes. Constraint 5.11 ensures that the optimal tour enters and exits through the same node in a supernode (i.e., the signaling schemes chosen for adjacent pairs of bits in the final ordering are consistent). Costs $e_{l,m}[i][i], \ \forall i \in V$, are set to $\infty$ (a very large integer value). Finally, in SBOS too, constraints for eliminating all subtours with two nodes are added initially, and the problem is iteratively solved as described earlier in Section 5.4.3 until a Hamiltonian cycle that visits all supernodes exactly once is found.
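The remark that SBOS selects from a richer set of possibilities can be illustrated on a toy instance by exhaustive search. The sketch below is an assumption-laden illustration, not the ILP above: the costs are random made-up numbers, `cost[l][m][i][j]` stands in for $e_{l,m}[i][j]$, every node (including the dummy) is treated alike, and the MES-like case is evaluated on the same cyclic tour so that the three searches are directly comparable.

```python
import random
from itertools import permutations, product


def tour_cost(order, schemes, cost):
    """Energy of a cyclic ordering; node order[p] is followed by order[p+1]."""
    total = 0.0
    for p in range(len(order)):
        i, j = order[p], order[(p + 1) % len(order)]
        total += cost[schemes[i]][schemes[j]][i][j]
    return total


def brute_force(num_nodes, q, cost, fix_order=False, fix_schemes=False):
    """Exhaustive minimum over orderings and/or scheme assignments."""
    if fix_order:
        orders = [tuple(range(num_nodes))]
    else:
        # Node 0 is pinned first because rotations of a cycle are equivalent.
        orders = [(0,) + rest for rest in permutations(range(1, num_nodes))]
    if fix_schemes:
        scheme_sets = [(0,) * num_nodes]  # scheme 0 plays the "original mode"
    else:
        scheme_sets = list(product(range(q), repeat=num_nodes))
    return min(tour_cost(o, s, cost) for o in orders for s in scheme_sets)


# Hypothetical toy instance: 4 nodes, 3 schemes, random costs.
rng = random.Random(0)
NUM_NODES, Q = 4, 3
cost = [[[[rng.uniform(1.0, 10.0) for _ in range(NUM_NODES)]
          for _ in range(NUM_NODES)]
         for _ in range(Q)]
        for _ in range(Q)]

mes_like = brute_force(NUM_NODES, Q, cost, fix_order=True)     # signaling only
mebo_like = brute_force(NUM_NODES, Q, cost, fix_schemes=True)  # ordering only
sbos_like = brute_force(NUM_NODES, Q, cost)                    # both together
```

Since the joint search space contains both restricted ones, `sbos_like` can never exceed `mes_like` or `mebo_like` in this sketch, although the combined saving need not equal the sum of the separate savings.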

5.4.5 Thermal Optimization Methodology

As described earlier, two adjacent high-activity wires are likely to cause a hot-spot on the bus due to intra-layer heat transfer or thermal coupling between the wires. The peak temperature on the bus occurs at such hot-spots. In the energy-optimal bit orderings obtained using MEBO or SBOS, a special class of constraints called thermal constraints can be added to prevent high-activity wires from being placed next to each other. Similarly, in MES, signaling schemes can be chosen to prevent hot-spots in a cluster of wires by adding such constraints. Note that although adding thermal constraints may decrease the energy saving potential of the energy-optimal bit ordering to some extent, it gives a designer the flexibility to trade off energy optimization against peak wire temperature reduction. We use the steady state model, described earlier in Section 3.4.3, to determine, approximately, the thermal impact of various orderings and prune thermally-inefficient orderings by adding these constraints. We do this since it is virtually impossible to perform detailed thermal simulations, using the model and methodology described in Section 3.4.2, for every candidate solution that we encounter during MEBO/SBOS optimization and then select the thermally-superior solution. Using the steady state model, the procedure to effect a trade-off between energy and temperature reductions is discussed next.

Steps for thermal optimization

The switching activities of buses vary widely across bits due to the characteristics of
data carried on them and hence, the solution space of energy-efficient bit orderings—
that are found using MEBO and SBOS—also contains bit orderings in which the wire
temperatures are reduced. The steps listed next enable us to ﬁnd these thermally-
superior orderings without affecting the energy optimality by much. Note that all

temperature estimates used in the steps listed below are from the steady-state model.

1. Find the energy dissipated $E_{\text{orig}}$ and peak wire temperature $T_{\text{peak-orig}}$ of the unmodified bus.

2. Find energy-optimal bit ordering and/or signaling without any temperature constraints using MEBO/SBOS. Let the total energy dissipated in the bus with this (energy-optimal) ordering/signaling be $E_{\text{opt}}$ and the ordering/signaling be represented by $\beta_0$. Let $T_{\text{peak-opt}}$ represent the peak wire temperature corresponding to the permutation $\beta_0$ obtained using the steady state model.

3. Next, we target to reduce the peak wire temperature by a fixed fraction (say $\eta$) from its original value in a step-by-step manner. Our target peak wire temperature in the $p$th step is $T_p = (1 - p \cdot \eta) \cdot T_{\text{peak-opt}}$, where $p = 1, 2, \ldots$. To find a permutation that achieves this peak temperature, we eliminate arcs to/from bit pairs $(i, k)$ for any wire $j$ that has $T(j) \ge (1 - p \cdot \eta) \cdot T_{\text{peak-opt}}$. Such a constraint will take the following form in the ILP:

$$x[i][j] + x[j][k] \le 1 \ \text{ and } \ x[j][i] + x[k][j] \le 1, \quad \forall j \ni T(j) \ge (1 - p \cdot \eta) \cdot T_{\text{peak-opt}} \tag{5.12}$$

Adding this set of constraints and solving the ILP, we obtain a wire permutation $\beta_p$ that has peak temperature of $T_p \le (1 - p \cdot \eta) \times T_{\text{peak-opt}}$. Note that since $T_p$ is estimated using the steady state model after obtaining the wire permutation, it can be less than the target temperature. Further, the energy dissipated by this permuted bus, $E_p$, will be somewhat worse than $E_{\text{opt}}$. The iterative process of adding the thermal constraints and re-solving continues until one of two conditions occurs: (1) the ILP becomes infeasible to solve, or (2) the energy of the bit-ordering/permutation $E_p$ becomes worse than that of the original bus ($E_p > E_{\text{orig}}$). Figure 5.2 shows a sample temperature vs. energy trade-off curve that will be obtained by following the steps listed above. The curve shows points $(T_1, E_1), (T_2, E_2), \ldots, (T_{10}, E_{10})$, corresponding to target temperatures $0.95 \times T_{\text{peak-opt}}$, $0.90 \times T_{\text{peak-opt}}$, ..., $0.50 \times T_{\text{peak-opt}}$.
[Figure 5.2 plot: peak wire temperature (vertical axis) versus bus energy (horizontal axis), marking the unmodified bus at T_peak-orig, the energy-optimal bus at T_peak-opt, and the target temperatures 0.95, 0.9, ..., 0.5 x T_peak-opt.]

Figure 5.2. Sample peak wire temperature versus bus energy trade-off curve. The
thermal optimization steps can be used to obtain curves similar to the one shown
here.

The thermal constraint presented in Eq. 5.12 allows only one arc, among $x[i][j]$, $x[j][k]$, $x[j][i]$, and $x[k][j]$, to be present in the solution if the presence of both bits $i$ and $k$ as neighbors causes the temperature in bit $j$ to equal or exceed the target temperature. In the CPLEX ILP optimizer, the inclusion of thermal constraints using the methodology outlined above can be fully automated. In our experiments, we used $\eta = 0.05$ and succeeded in reducing peak wire temperatures significantly across several benchmarks, as shown by the results in Section 5.5.5. Furthermore, the extra time taken for temperature optimization did not increase the overall solution time significantly compared to energy-only optimization. The running times are compared later in Section 5.5.
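As a rough illustration of how the target schedule and the thermal constraints of Eq. 5.12 could be generated programmatically, consider the sketch below. The callable `temp_if_neighbors(i, j, k)` is a hypothetical stand-in for the steady state model of Section 3.4.3 (which is not reproduced here), and the function names and toy numbers are assumptions.

```python
def target_temperatures(t_peak_opt, eta=0.05, steps=10):
    """Targets T_p = (1 - p * eta) * T_peak-opt for p = 1, ..., steps."""
    return [(1 - p * eta) * t_peak_opt for p in range(1, steps + 1)]


def thermal_triples(num_wires, temp_if_neighbors, target):
    """Find neighbor pairs that would push a wire to or past the target.

    Returns triples (i, j, k): wire j reaches the target when i and k are its
    two neighbors. Each triple corresponds to the two constraints of Eq. 5.12,
    x[i][j] + x[j][k] <= 1 and x[j][i] + x[k][j] <= 1, which together rule out
    both orientations (i, j, k) and (k, j, i).
    """
    triples = []
    for j in range(num_wires):
        for i in range(num_wires):
            for k in range(i + 1, num_wires):
                if j == i or j == k:
                    continue
                if temp_if_neighbors(i, j, k) >= target:
                    triples.append((i, j, k))
    return triples


def toy_model(i, j, k):
    """Made-up model: any wire flanked by wires 0 and 2 becomes a hot-spot."""
    return 100.0 if {i, k} == {0, 2} else 0.0


targets = target_temperatures(100.0)          # eta = 0.05, ten steps
triples = thermal_triples(4, toy_model, targets[1])
```

With $\eta = 0.05$ and ten steps, the schedule runs from 95% down to 50% of the energy-optimal peak temperature, matching the points described for Figure 5.2.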

5.4.6 Routing Overheads

In this subsection, we analyze the overheads for the wire ordering network required to implement MEBO and SBOS. We draw from previous work on efficient techniques for solving the crossing distribution problem [114-116] and use these principles to estimate the area/cost of any ordering network.

Consider two rows, called lower and upper rows (see Figure 5.3(a)), of points called terminals and a collection of two-terminal nets $N = \{N_1, N_2, \ldots, N_n\}$, with each net $N_k$ connecting the terminal numbered $k$ on the lower row to the correspondingly numbered terminal on the upper row. The terminals in the lower row are numbered in order as $1, 2, \ldots, n$ from left to right. The left-to-right ordering on the upper row defines the final re-ordered bus. Let this new ordering be represented by $\Pi = (\pi_1, \pi_2, \ldots, \pi_n)$, with $\pi_k$, $1 \le k \le n$, denoting the $k$-th entry from the left. For example, for the figure shown, $\pi_1 = 5, \pi_2 = 3, \ldots, \pi_8 = 7$.

DEFINITION: Two nets $N_i$ and $N_j$ are defined as crossing if $i > j$ and $\Pi(i) < \Pi(j)$ or vice versa. Else, they are non-crossing.

DEFINITION: A matching diagram is a straight line drawing of the nets for a given permutation $\Pi$, as shown in Figure 5.3(b), and the straight line representing a net $N_i$ is called a chord. The intersection of two chords $N_i$ and $N_j$ defines a crossing point $C_{ij}$. There are ten crossing points shown in Figure 5.3(b).

[Figure 5.3 panels: (a) routing channel between the lower row (terminals 1 2 3 4 5 6 7 8) and the upper row (5 3 6 1 4 8 2 7), with channel height and channel width marked; (b) matching diagram; (c) two-layer routing drawn with Metal-1, Metal-2, and vias.]

Figure 5.3. Routing strategy and overheads for re-ordering. (a) Definition of the routing channel. (b) Matching diagram showing ten crossing points. (c) Two-layer routing strategy using eight horizontal tracks and ten vias.

The notion of inversions can be used to calculate the minimal total number of crossing points $\xi$ for any given $\Pi$ in the upper row [116]. An inversion is any pair $(\pi_i, \pi_j)$ such that $i < j$ and $\pi_i > \pi_j$ [117]. Accordingly, $\xi = 10$ can be calculated for the example. The total number of crossing points $\xi$ determines the area/cost overhead of the sorting network in two ways. Intuitively, the number of horizontal wiring tracks in the channel will not exceed $\xi$, since each crossing point can be taken care of by assigning it a separate track and by using a two-layer wiring strategy. Also, the total number of vias required will not exceed $2\xi$ in the worst case. However, in practice, the number of horizontal tracks and vias required will be less than $\xi$ and $2\xi$, respectively. Figure 5.3(c) shows that the routing for this example can be achieved using two metal layers, eight horizontal tracks, and ten vias. Hence, the number of crossing points $\xi$, which is the number of inversions of the MEBO/SBOS order that we obtain, can be used as a metric to select the re-ordering solution with the best energy-cost tradeoff.
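Counting inversions for a candidate ordering is straightforward. The sketch below is a minimal illustration; the function names and the four-wire ordering are hypothetical and are not taken from Figure 5.3. It counts inversions directly from the definition and derives the worst-case track and via bounds described above.

```python
def count_inversions(perm):
    """Number of pairs (a, b) with a < b and perm[a] > perm[b]."""
    n = len(perm)
    return sum(
        1
        for a in range(n)
        for b in range(a + 1, n)
        if perm[a] > perm[b]
    )


def routing_bounds(perm):
    """Worst-case overhead from the crossing count: (tracks, vias)."""
    crossings = count_inversions(perm)
    return crossings, 2 * crossings


# Hypothetical re-ordered bus of four wires (not the ordering of Figure 5.3).
example_order = (3, 1, 4, 2)
crossings = count_inversions(example_order)
max_tracks, max_vias = routing_bounds(example_order)
```

The quadratic loop is adequate for bus widths such as 64 or 128; a merge-sort based count would only matter for much longer permutations.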

5.5 Results and Discussion

In this section, we present results for energy and wire temperature reductions obtained using our optimal static encoding schemes. In all results, percentage energy reductions are reported with respect to the energy dissipated in an unmodified bus. Table 5.3 lists the running times and number of iterations for problems of different sizes that we solved using CPLEX on a SunFire-880 server with two 750-MHz UltraSparc-III CPUs and 8 GB of RAM. The running times for MES optimization were negligible compared to those of MEBO and SBOS and hence they are not shown. As can be seen, these problems can be solved to optimality in a reasonable amount of time.

[Table 5.3. Running times and number of iterations for MEBO and SBOS problems of different sizes.]

5.5.1 Energy Dissipation in Processor Buses

We profiled 100M SimPoint samples of all benchmarks in the SPEC CPU2000 suite and recorded their self and coupling activity characteristics. The results of our analysis are shown in Figures 5.4 and 5.5. For the data bus, we observed that the transition density per bit did not exceed 0.45 for any benchmark. As expected, the higher order bits (32-63) of the data bus exhibited significantly lower switching activities in integer programs compared to floating-point programs, due to small values being predominant in integer traffic. For instruction buses, switching activities were spread more or less equally in the higher and lower order portions and, here too, the transition density did not exceed 0.5 for any benchmark, with the exception of vpr, which caused transition densities in the range 0.5-0.8 in a few bit lines.
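For readers who want to reproduce this kind of profile on their own traces, the sketch below computes a per-bit transition density from a sequence of bus words. It assumes that transition density means the fraction of consecutive transfers in which a bit line toggles; the function name and the four-word trace are hypothetical.

```python
def transition_density(words, width):
    """Per-bit toggle rate over consecutive words (index 0 is the LSB)."""
    if len(words) < 2:
        raise ValueError("need at least two words to observe a transition")
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur  # bits that changed between the two transfers
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    transfers = len(words) - 1
    return [c / transfers for c in counts]


# Hypothetical 4-bit trace: only the two low-order bits ever toggle.
trace = [0b0000, 0b0001, 0b0011, 0b0001]
density = transition_density(trace, width=4)
```

In this toy trace the high-order bits stay idle, which mirrors the small-value behavior noted above for integer traffic.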

Next, we present results showing the ratio of self, coupling charge/discharge, and coupling toggle energy dissipated for four kinds of buses: data address, instruction address, data, and instruction. To our knowledge, no previous work has profiled such an extensive set of benchmarks and reported their energy dissipation behavior. Such results help designers quantify the important contributors to bus energy dissipation, like self, charge/discharge, or toggle transitions, and explore appropriate static, dynamic, or hybrid encoding techniques to reduce energy dissipation. Figures 5.6-5.9 show the fraction of energy dissipated in self, charge/discharge, and toggle transitions for various benchmarks from the SPEC CPU2000 suite on the Alpha 21264 target system.

As can be seen, coupling (charge/discharge + toggle) energy forms a substantial portion of the total bus energy dissipation: it contributed 70-75% in the processor buses we analyzed. Among coupling transitions, charge/discharge transitions dominate. Energy dissipated in toggle transitions is responsible for less than 20% of total energy; in data buses, it is responsible for only 10% or less. We observed no significant difference between integer programs, shown in the first 14 bars in Figures 5.6-5.9, and floating-point programs in the SPEC workload.

[Figures 5.4 and 5.5. Transition densities of SPEC CPU2000 benchmarks for the data and instruction buses; transition density (vertical axis) versus bit position, LSB to MSB (horizontal axis).]

[Figures 5.6-5.9. Fraction of bus energy dissipated in self and coupling (charge/discharge + toggle) transitions in the Alpha 21264 target system while running SPEC CPU2000 programs, for the data address, instruction address, data, and instruction buses.]

Next, we present results for energy reductions obtained with our static encoding schemes.

5.5.2 Energy Reduction for General-Purpose Design

For the general-purpose design scenario, our static bus encoding schemes were de-
signed using data collected from SimPoint samples for the training benchmarks and
then evaluated on test benchmarks. Results are shown in Figures 5.10 and 5.11. They
show that the average bus energy reductions obtained are as follows. MES: 7.81%
and 10.96%, MEBO: 11.91% and 19.85%, and SBOS: 20.04% and 38.78% for data
and instruction buses, respectively. On the average, we ﬁnd that optimizations on
the instruction bus yield better results than on the data bus. We also observe that

SBOS is easily the best scheme for both data and instruction buses.

[Figure 5.10 chart: "General-Purpose Design: Energy Reductions for Data Bus"; percentage bus energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.10. Energy dissipation results for general-purpose design for the 64-bit data bus. Statistics collected on 13 training set benchmarks were used to obtain the optimal static encoding schemes. These were tested on 13 other (test set) benchmarks. Average energy reductions are MES: 7.81%, MEBO: 11.91%, and SBOS: 20.04%.

[Figure 5.11 chart: "General-Purpose Design: Energy Reductions for Instruction Bus"; percentage bus energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.11. Energy dissipation results for general-purpose design for the instruction bus. Average energy reductions are MES: 10.96%, MEBO: 19.85%, and SBOS: 38.78%.

5.5.3 Energy Reduction for Workload-Specific Design

To evaluate the effectiveness of our techniques in the workload-specific design scenario, statistics collected for SimPoint samples from 13 training set benchmarks were aggregated and used to obtain the optimal static encoding schemes. The scheme was then tested on non-overlapping samples from the same set of benchmarks. This non-overlapping sample was arbitrarily selected as a block of 100M committed instructions after the first 10 billion instructions of program execution. From the results shown in Figures 5.12 and 5.13, we observe that the average energy reductions across the benchmarks for the three schemes are as follows. MES: 9.73% and 10.43%, MEBO: 15.97% and 21.25%, and SBOS: 22.79% and 40.77% for data and instruction buses, respectively. Our results indicate that workload-specific energy optimizations on the instruction bus are likely to yield better results than on the data bus. Among the three different schemes we proposed, SBOS gives the best results. This is expected because it combines the benefits of signaling as well as bit ordering. Table 5.4 shows the actual bit ordering and signaling for the data bus that was obtained using the training set. The corresponding table for the instruction bus is not shown due to space constraints. For both data and instruction buses, all five signaling schemes were chosen. In particular, the original mode of signaling was retained for 36 (38) lines, inversion was chosen for 12 (45) lines, and Markov model signaling for 11 (40) lines in the data (instruction) bus. In comparison, transition and inverted transition signaling were chosen for fewer wires, a total of 5 (5) lines in the data (instruction) bus.

[Table 5.4. Optimal bit ordering and signaling for workload-specific design of the data bus.]

[Figure 5.12 chart: "Workload-Specific Design: Energy Reductions for Data Bus"; percentage bus energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.12. Energy dissipation results for workload-specific design of the 64-bit data bus. Statistics collected for SimPoint samples from 13 training set benchmarks were aggregated and used to obtain the optimal static encoding schemes. These were then tested on a non-overlapping sample from the same set of benchmarks. The average energy reductions are MES: 9.73%, MEBO: 15.97%, and SBOS: 22.79%.

[Figure 5.13 chart: "Workload-Specific Design: Energy Reductions for Instruction Bus"; percentage bus energy reduction per benchmark for MES, MEBO, and SBOS.]

Figure 5.13. Energy dissipation results for workload-specific design for the 128-bit instruction bus. The average energy reductions are MES: 10.43%, MEBO: 21.25%, and SBOS: 40.77%.

5.5.4 Energy Reduction for Program-Specific Design

In program-specific design, coupling energy/cost matrices collected for the SimPoint samples of each benchmark are used to design a signaling/encoding scheme, which is then tested on the same benchmark and sample. This is expected to yield the best results, as the static encoding schemes are specific to that sample and benchmark. Results for 26 benchmarks are shown in Figures 5.14 and 5.15 for data and instruction buses, respectively. For custom optimization of the data bus, energy reductions in the range of 50-60% can be obtained for some benchmarks like art, bzip2, and fma3d with SBOS. In comparison, dynamic bus encoding schemes BI and OEBI provide only up to about 10% energy reduction for a few of the benchmarks, for data and instruction buses. For a majority of the programs, reductions with BI and OEBI are less than 5% for data as well as instruction buses. The average energy reductions were BI: 4.19%, OEBI: 1.58% for the data bus and BI: 2.63%, OEBI: 5.32% for the instruction bus. For the data bus, where self switching activities are dominant, OEBI results in an energy increase for some benchmarks since many higher order lines remain inactive. This is because it does not take into account both self and coupling activities when deciding on the inversion mode. As a result, self switching activities increase significantly in the encoded data stream, since the mode chosen to reduce coupling energy does not necessarily reduce total (self + coupling) energy. The switching activity in the instruction stream is coupling dominant. Hence, OEBI performs better on this type of data. However, the energy reductions are only marginally better compared to BI. Our static encoding schemes, which optimize for both self and coupling energy by considering signaling and reordering, show much better energy reductions than previous dynamic encoding schemes for all benchmarks. The average energy reductions are MES: 19.7% and 21.7%, MEBO: 23.25% and 32.1%, and SBOS: 30.2% and 52.1% for data and instruction buses, respectively.

[Two rotated full-page figures, apparently Figures 5.14 and 5.15. The plot content
and captions are not recoverable from the scan.]
[Table 5.5 (rotated full-page table): peak wire temperatures per benchmark for the
data and instruction buses, with and without the thermal optimization methodology.
The cell values are not recoverable from the scan.]

5.5.5 Wire Temperature Reduction

Our work is the first of its kind to design static encoding schemes that seek to reduce
peak wire temperatures in addition to reducing bus energy. The thermal optimization
methodology was explained earlier in Section 5.4.5, and the thermal models used to
estimate activity-dependent wire temperatures were described in Sections 3.4.2 and
3.4.3. Table 5.5 shows the reductions in peak temperature that we obtained for
different benchmarks with and without the thermal optimization methodology. In this
table, we show the peak wire temperature observed for the unoptimized (original) bus
and the wire temperatures after SBOS with thermal constraints was applied. We show
results for temperature-optimized SBOS only, since the best results were obtained
using this technique; temperature reductions for MEBO were consistently lower. This
is expected because the SBOS optimization technique has a larger solution space from
which it can choose the best solution.

From Table 5.5, we note that applying SBOS without thermal constraints, which
reduces energy of the bus by 20% or more for data buses (Figure 5.13), does not always
reduce the peak wire temperature observed in the simulation window. In fact, for the
data bus, the average peak temperature across the ten benchmarks studied actually
rises slightly above that of the original bus, by 0.35°C, and for the instruction bus it
falls only slightly, by 0.49°C, which is small considering the significant energy
reductions we obtained for these buses. This can be attributed to the fact that the
energy optimization does not explicitly consider thermal coupling when deciding on
the bit ordering and signaling. However, by adding explicit thermal constraints using
the methodology in Section 5.4.5, the temperature of the hottest wire can be reduced.
Recall that our thermal optimization methodology trades off some

[Plot (a): Trade-Off Curve for ammp. Horizontal axis: Normalized Energy; vertical
axis: Temperature. Labeled points (normalized energy, temperature): Original Bus
(1, 324.28); Energy-Optimal Bus (0.8377, 324.21); (0.9072, 322.49); (0.9341, 320.17);
(0.9631, 318.21). An arrow marks "Permutation at this point selected".]

[Plot (b): Trade-Off Curve for crafty. Same axes. Labeled points: Original Bus
(1, 325.76); Energy-Optimal Bus (0.7692, 325.22); (0.8267, 323.92); (0.8711, 321.89);
(0.8987, 319.97); (0.9501, 318.82). An arrow marks "Permutation at this point
selected".]

Figure 5.16. Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for ammp and crafty. The
permutation selected for each benchmark was the one that resulted in bus energy
reduction closest to $0.5(1 - E_{opt}/E_{orig})$ compared to the original bus.

[Plot (a): Trade-Off Curve for eon. Horizontal axis: Normalized Energy; vertical
axis: Temperature. Labeled points (normalized energy, temperature): Original Bus
(1, 330.04); Energy-Optimal Bus (0.8013, 329.82); (0.8087, 328.23); (0.8243, 326.12);
(0.8407, 324.91); (0.8436, 323.33); (0.8602, 321.12); (0.9013, 320.11); and a final
point at normalized energy 0.9721 whose temperature label is illegible in the scan.
An arrow marks "Permutation at this point selected".]

[Plot (b): Trade-Off Curve for gcc. Same axes. Labeled points: Original Bus
(1, 324.82); Energy-Optimal Bus (0.7259, 325.81); (0.7445, 323.18); (0.8576, 320.92);
(0.9281, 319.81); (0.9579, 319.21). An arrow marks "Permutation at this point
selected".]

Figure 5.17. Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for eon and gcc.


[Plot (a): Trade-Off Curve for gzip. Horizontal axis: Normalized Energy; vertical
axis: Temperature. Labeled points (normalized energy, temperature): Original Bus
(1, 330.56); Energy-Optimal Bus (0.7223, 328.76); (0.7498, 326.29); (0.7674, 324.77);
(0.8245, 323.05); (0.8477, 321.19); (0.8839, 320.54); (0.9071, 318.75). An arrow marks
"Permutation at this point selected".]

[Plot (b): Trade-Off Curve for lucas. Same axes. Labeled points: Original Bus
(1, 331.71); Energy-Optimal Bus (0.781, 329.28); (0.7892, 327.85); (0.7994, 326.7);
(0.808, 324.12); (0.8278, 323.07); (0.8649, 319.76); (0.9245, 319.59); (0.9756, 318.29).
An arrow marks "Permutation at this point selected".]

Figure 5.18. Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for gzip and lucas.


  
 
   
 

[Plot (a): Trade-Off Curve for mesa. Horizontal axis: Normalized Energy; vertical
axis: Temperature. Labeled points (normalized energy, temperature): Original Bus
(1, 325.54); Energy-Optimal Bus (0.8179, 326.42); (0.822, 323.77); (0.8574, 321.08);
(0.8991, 319.94); (0.9571, 318.21). An arrow marks "Permutation at this point
selected".]

[Plot (b): Trade-Off Curve for mgrid. Same axes. Labeled points: Original Bus
(1, 327.49); Energy-Optimal Bus (0.7985, 329.11); (0.8344, 324.67); (0.8574, 323.89);
(0.8713, 322.08); (0.9043, 319.78); (0.9309, 319.02). An arrow marks "Permutation at
this point selected".]

Figure 5.19. Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for mesa and mgrid.


[Plot (a): Trade-Off Curve for swim. Horizontal axis: Normalized Energy; vertical
axis: Temperature. Labeled points (normalized energy, temperature): Original Bus
(1, 326.34); Energy-Optimal Bus (0.8509, 325.97); (0.8536, 324.33); (0.8597, 322.78);
(0.8642, 320.6); (0.8711, 319.65); (0.8821, 318.17). An arrow marks "Permutation at
this point selected".]

[Plot (b): this panel is also titled "Trade-Off Curve for swim" in the scan and shows
the same labeled points as panel (a).]

Figure 5.20. Energy vs. temperature trade-off curves. Plots show the energy vs.
temperature tradeoff curves obtained for the data bus for swim and twolf.

of the energy savings for more thermally-efficient orderings at each step. The steady-
state temperature vs. energy tradeoff curves for nine benchmark programs are shown
in Figures 5.16-5.20. For each point shown in these graphs, thermal constraints were
added and the ILPs were re-solved to get a new wire ordering and permutation.
As can be seen, in all cases the ILP became infeasible before the energy of the
reordered bus approached $E_{orig}$, and hence the optimization terminated. Using
these curves, we selected the wire permutation (marked by the arrow in the plots)
that resulted in the bus energy reduction closest to $0.5(1 - E_{opt}/E_{orig})$, since this
represents the midway point for trading off temperature with energy savings. The peak
wire temperature obtained for this selected thermally-efficient permutation is shown in
the third row of Table 5.5. Note that the temperatures reported in this row are derived
from detailed thermal simulations using the model in Section 3.4 and not the
steady-state model.
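
The midway selection rule described above can be illustrated with a short sketch.
This is only an illustration under stated assumptions: the function name
`select_midway_point`, the reading of the energy-optimal ratio as E_opt/E_orig, and
the curve values are all hypothetical and are not taken from Table 5.5 or from the
trade-off plots.

```python
def select_midway_point(curve, e_opt_ratio):
    """Pick the trade-off point whose energy reduction is closest to the midway target.

    curve:       list of (normalized_energy, temperature) tuples, with energy
                 normalized to the original bus (original bus = 1.0).
    e_opt_ratio: assumed meaning E_opt / E_orig, i.e. the normalized energy of
                 the energy-optimal bus.
    """
    # Midway target: half of the reduction achieved by the energy-optimal bus.
    target_reduction = 0.5 * (1.0 - e_opt_ratio)
    # The energy reduction of a point is 1 - normalized_energy.
    return min(curve, key=lambda p: abs((1.0 - p[0]) - target_reduction))


# Hypothetical trade-off curve (not data from the dissertation).
curve = [(0.80, 330.0), (0.85, 326.0), (0.90, 322.0), (0.95, 319.0)]
chosen = select_midway_point(curve, e_opt_ratio=0.80)
```

With these made-up values the target reduction is 0.10, so the point at normalized
energy 0.90 is the one returned.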

Temperature reductions we obtained with temperature-optimized SBOS range
from 3.55 to 12.26 degrees for the data bus and from 5.69 to 12.96 degrees for the
instruction bus, while still resulting in total energy reductions of 6.59 to 15.23%
and 11.67 to 16.17% for data and instruction buses, respectively. Compared to the
dynamic spreading encoding technique proposed in [110], our temperature-optimized
SBOS provides much better temperature reductions. We compare results for three
benchmarks that are common in their work and ours. The temperature reductions
they report for the instruction bus are, gzip: 6.5 K, mesa: 6.25 K, and ammp: 4.75 K.
Our results shown in Table 5.5 are much better, gzip: 15.89 K, mesa: 11.67 K, and

ammp: 12.29 K, for these benchmarks. Note that our techniques are static and incur

negligible overhead compared to the overheads for the crossbar switch and control


logic used in the spreading encoding technique.

5.6 Summary

In this chapter, we presented a value-aware optimization methodology to design static
encoding schemes to reduce energy dissipation and temperature of global signal buses.
Our methodology examines two aspects: (1) several possible ways of signaling a bit
value, with exactly one signaling mode for each bit chosen, and (2) all possible ways of
mapping bits to bus lines (bit ordering or permutation), with exactly one bit ordering
chosen. Both choices are made statically at design time, depending upon traffic value
characteristics, to minimize total bus dynamic energy. We present an integer linear
program (ILP) methodology that evaluates several possible bit signaling modes and all
possible bit orderings for an n-bit bus based on traffic value characteristics and then
chooses an optimal encoding mode that minimizes total bus (self + coupling) dynamic
energy. We use the SimpleScalar/Alpha simulator, profile SimPoint samples of SPEC
CPU 2000 benchmarks to collect data, and use the CPLEX ILP optimizer to design our
encoding scheme. Results for three degrees of customization show increasingly better
average bus energy reduction: general-purpose optimization: 20.04% (38.78%),
workload-specific optimization: 22.79% (40.77%), and program-specific optimization:
30.2% (52.1%), for 64-bit data (128-bit instruction) buses, respectively. In contrast,
existing dynamic bus encoding techniques yield only 4.19% (5.32%) reductions at best
for data (instruction) buses for the same set of programs.

We show that lowering bus energy, even significantly as with our static encoding
schemes, does not necessarily lower peak wire temperatures. To address this, we
present a novel method of efficiently exploring the trade-off space between peak
(hottest) wire temperature and total bus dynamic energy using a steady-state wire
temperature model. Based on this, we present a new method of introducing thermal
constraints into our energy optimization methodology that allows a designer to trade
off peak wire temperature against total bus dynamic energy as desired. For this
thermally-constrained, energy-optimal encoding scheme, we then perform simulations
using a detailed per-wire bus thermal model to determine the actual reductions in peak
temperature, which we find to be significant, up to 12.26°C (12.96°C) for data
(instruction) buses, while at the same time providing significant average energy
savings of 14.24% (16.17%) for data (instruction) buses, which are still much better
than previous work.


CHAPTER 6

ACTIVITY-AWARE PERFORMANCE
OPTIMIZATION

The data-dependent nature of inter-wire crosstalk necessitates bus cycle time to be
designed for the worst case. This pessimistic approach incurs a significant performance
penalty since the worst case arises least frequently in actual applications. In this
chapter, we examine an activity-aware technique that substantially reduces the
frequency of worst-case crosstalk and improves bus performance by using a variable
cycle bus architecture.

6.1 Introduction

Inter-wire capacitive crosstalk is the primary factor that affects the propagation delay
of interconnects. In high-performance processor buses, crosstalk on a victim wire
depends on the nature of transitions on its two adjacent wires, known as aggressors.
Designers estimate the worst case crosstalk condition for a wire and set the bus clock
cycle time greater than this value, ensuring that the signal transmission occurs in the
correct manner. However, this is a pessimistic approach since worst case crosstalk
conditions do not occur across all wires very frequently.

An introduction to interconnect analysis and the impact of crosstalk on bus design
was presented earlier in Section 2.1.5. Table 2.1 listed ﬁve different crosstalk condi-

tions based on transitions in the victim and aggressor wires: 1 + 0r (mode-0), 1 + 1r


(mode-1), 1 + 2r (mode-2), 1 + 3r (mode-3), and 1 + 4r (mode-4), where the cou-
pling ratio r is the ratio of the adjacent coupling capacitance and the line capacitance
including the contribution of repeaters. The coupling ratio is greater than unity for
nanometer-scale technologies as can be seen from Table 2.2.

We address two aspects of the bus crosstalk problem to improve the overall
performance of global processor buses in the presence of crosstalk. First, we reduce the
frequency of various crosstalk conditions by using a proﬁle-guided wire reordering
and signaling approach. Second, we propose a bus clocking approach that eliminates
the need to use a pessimistic cycle time. Instead, our approach dynamically controls
the number of cycles required for transmission of the data depending on its crosstalk
mode. By doing so, we can use the average or most frequent crosstalk pattern to
design the cycle time of the bus.

This chapter is organized as follows. Next, Section 6.2 brieﬂy reviews related
work. Then, we present our techniques in Section 6.3. Following that, in Section 6.4

we present results. Finally, we summarize in Section 6.5.

6.2 Related Work

Many crosstalk reduction techniques have been proposed in the literature. These are
reviewed briefly next. Several techniques such as dense wire fabrics [56] and net
ordering and shield insertion techniques [118,119] have been proposed to reduce
crosstalk noise in signal interconnects. The effectiveness of shielding and spacing
techniques has also been explored [57]. Many coding techniques to reduce crosstalk
have also been proposed, all of which rely on using a significant number of extra wires
to eliminate worst-case crosstalk conditions: crosstalk protection code (CPC) [55],
transition pattern code (TPC) [120], crosstalk avoidance code (CAC) [121], and the
codes proposed in [122]. A technique that uses variable cycle transmission to improve
bus performance has also been suggested, but it does not address crosstalk
reduction [123].

6.3 Techniques for Performance Optimization

In this section, we describe techniques to optimize bus performance by reducing

crosstalk and using a non-pessimistic approach to bus clocking.

6.3.1 Variable Cycle Bus (VCB) Design

We propose an adaptive bus architecture called a variable cycle bus (VCB) that
uses a faster bus clock and dynamically controls the number of cycles required for
transmission based on the estimated delay of the data pattern to be transmitted.
This removes the need to design the bus clock cycle in a pessimistic manner based
on the worst-case crosstalk pattern. The VCB works as follows. The data to be
transmitted in the current cycle is compared to the data that was transmitted in the
previous cycle, and the crosstalk group that it belongs to is determined. There are
two groups: a Group-I data word is one that has only mode-2, mode-1, or mode-0
crosstalk patterns and none higher, and a Group-II data word is one that has at least
one mode-3 or mode-4 pattern. The crosstalk group is determined using the crosstalk
analyzer (CA) circuit described next. In the VCB, we transmit Group-I data in one
clock cycle and Group-II data using two clock cycles. A DATA_READY line indicates
to the receiver when to latch the current value being transmitted on the bus. The
DATA_READY control line is completely shielded, i.e., it is routed with VDD/GND
lines on each side so that it is completely unaffected by crosstalk.
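
The grouping rule can be sketched behaviorally in a few lines. This is a software
illustration only, not the CA circuit itself. It assumes the commonly used
convention for the delay factor 1 + k·r (Table 2.1 is not reproduced in this
section): each neighbor of a switching victim contributes 0 if it switches in the
same direction, 1 if it is quiet, and 2 if it switches in the opposite direction.
Treating a missing neighbor at the bus edge as a quiet shield is also an
assumption, and the function names are hypothetical.

```python
def wire_modes(prev, curr):
    """Return the crosstalk mode k (delay factor 1 + k*r) for each wire.

    prev, curr: lists of 0/1 bits indexed by physical wire position.
    Wires that do not switch get None. A missing neighbor at the bus edge is
    treated as a quiet wire (assumption).
    """
    delta = [c - p for p, c in zip(prev, curr)]  # +1 rising, -1 falling, 0 quiet
    modes = []
    for j, dv in enumerate(delta):
        if dv == 0:
            modes.append(None)
            continue
        k = 0
        for nb in (j - 1, j + 1):
            dn = delta[nb] if 0 <= nb < len(delta) else 0
            k += abs(dv - dn)  # 0: same direction, 1: quiet, 2: opposite
        modes.append(k)
    return modes


def vcb_cycles(prev, curr):
    """Group-I words (worst mode <= 2) take one cycle; Group-II words take two."""
    worst = max((m for m in wire_modes(prev, curr) if m is not None), default=0)
    return 2 if worst >= 3 else 1


# 010 -> 101: the middle wire switches against both neighbors (mode-4).
cycles_worst = vcb_cycles([0, 1, 0], [1, 0, 1])
# 000 -> 111: all wires switch together, so no wire exceeds mode-1.
cycles_best = vcb_cycles([0, 0, 0], [1, 1, 1])
```

In this sketch the first transition is classified as Group-II and the second as
Group-I.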

 

 

 

 

 

 

 

 

Inputs:  U   W   S0  S1  S2     Output: f
         0   0   1   1   1      1
         -   -   1   1   0      1
         -   -   0   1   1      1
(a)

(b) [Logic circuit diagram]

Figure 6.1. Three-bit crosstalk analyzer truth table and circuit. (a) Truth table show-
ing only the ON-set. "-" indicates a don't-care input. (b) Logic circuit implementing
the truth table.

Our crosstalk analyzer (CA) circuit identifies the crosstalk mode for each
transmission in an efficient manner. It compares the current information, three bits at
a time, with corresponding bits in the pattern transmitted in the previous clock cycle
and determines if the current pattern falls under one of two crosstalk groups. The
way to determine the crosstalk group for a three-bit case is shown next. Consider
two three-bit vectors, $X^{t-1} = (X_0^{t-1}, X_1^{t-1}, X_2^{t-1})$ representing the data
transmitted in the previous cycle and $X^{t} = (X_0^{t}, X_1^{t}, X_2^{t})$ representing data to be
transmitted in the current cycle. At the first level of the CA circuit, the following
logic outputs are evaluated in parallel:

$S_0 = X_0^{t-1} \oplus X_0^{t}$, (6.1)
$S_1 = X_1^{t-1} \oplus X_1^{t}$, (6.2)
$S_2 = X_2^{t-1} \oplus X_2^{t}$, (6.3)
$U = X_0^{t-1} \cdot X_1^{t-1} + X_1^{t-1} \cdot X_2^{t-1}$, and (6.4)
$W = X_0^{t} \cdot X_1^{t} + X_1^{t} \cdot X_2^{t}$. (6.5)

Using these signals, the truth table and a gate-level representation of the three-bit
CA circuit can be constructed as shown in Figure 6.1. The truth table in Figure 6.1(a)
shows only the ON-set of the Boolean function, i.e., the inputs for which the output
evaluates to logic "1". The corresponding two-level realization of this table is obtained
using Espresso [113]:

$f = S_0 \cdot \bar{S_1} \cdot S_2 + S_0 \cdot S_1 \cdot \bar{S_2} + \bar{U} \cdot \bar{W} \cdot S_0 \cdot S_2$ (6.6)
$\;\;\, = S_0 \cdot (S_1 \oplus S_2 + \bar{U} \cdot \bar{W} \cdot S_2)$. (6.7)

The CA circuit outputs a logic "1" if the three bits it examined result in a Group-II
pattern and logic "0" if not. Thus, for an n-bit bus there are n - 2 three-bit CA
circuits working in parallel to determine the crosstalk group. At the second level, these
n - 2 outputs can be combined using the wired-OR logic style in which outputs from
the three-bit CA circuits are simply connected together, as shown in Figure 6.2(a).
Thus, the final wired-OR output, F, is high if the output of at least one of the three-
bit CA circuits is high. The wired-OR connection is used to simplify the hardware
required at the sending end. The signal DATA_READY obtained from the bus
crosstalk analyzer synchronizes the sender and receiver. When F = 0, the data can
be transmitted in one cycle and hence DATA_READY is taken high. Otherwise, the
data is transmitted in two cycles; in this case, DATA_READY is kept low for the
first cycle and taken high in the second. The receiver uses a clock signal gated by
DATA_READY, which ensures that the data is latched and read correctly.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[Diagrams: (a) three-bit CA blocks, each comparing bus bits (B2, B3, B4, ..., Bn-1)
with the data from the previous cycle and producing an output f; the outputs are
wired together to form F. (b) Sender and receiver connected by the VCB bus, with
DATA IN, DATA OUT, CLK, and DATA_READY signals.]

Figure 6.2. Variable cycle bus. (a) Complete bus crosstalk analyzer for an n-bit bus.
(b) Sender and receiver logic for VCB.

6.3.2 Minimum Crosstalk Bit Ordering (MCBO)

Our basic technique for profile-guided optimization was discussed earlier in Sec-
tion 5.4. It may be noted that the objective function that we minimized earlier
was the total energy of the bus. In the current problem, we minimize the combined
probability of occurrence of the worst-case crosstalk condition for the bus as a whole.
Let $\Psi_{2r}$, $\Psi_{1r}$, and $\Psi_{0r}$ be three $n \times n$ bit-pair crosstalk probability matrices which
record the probability of occurrence of the three crosstalk conditions possible for the
bit pair $(i, j)$, $\forall (i, j) \in \{0, \ldots, n-1\}$, $i \neq j$: mode-2, mode-1, and mode-0. Note that
$\Psi_{2r} + \Psi_{1r} + \Psi_{0r} = J_n$, where $J_n$ is the $n \times n$ unity (all-ones) matrix, since all the
probabilities sum up to unity. These matrices are collected by aggregating data
obtained by analyzing information patterns transmitted on the target bus when running
the training set benchmarks, similar to the procedure outlined in Section 5.3.1.
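
As a rough illustration of how such matrices could be collected from a trace, the
sketch below counts bit-pair transition patterns between consecutive words. It is
not the profiling tool used in this work. It assumes that the bit-pair mode for
bits i and j is the magnitude of the difference of their transitions (0 when both
behave alike, 1 when exactly one switches, 2 when they switch in opposite
directions), and the function name and sample trace are hypothetical.

```python
def bit_pair_matrices(trace, n):
    """Estimate [psi0, psi1, psi2] from consecutive n-bit words in `trace`.

    trace: list of integers, each holding one n-bit word.
    Returns three n x n matrices of relative frequencies (diagonals stay 0).
    """
    if len(trace) < 2:
        raise ValueError("need at least two words to observe a transition")
    counts = [[[0] * n for _ in range(n)] for _ in range(3)]
    for prev, curr in zip(trace, trace[1:]):
        # Per-bit transition: +1 rising, -1 falling, 0 unchanged.
        delta = [((curr >> b) & 1) - ((prev >> b) & 1) for b in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    counts[abs(delta[i] - delta[j])][i][j] += 1
    transitions = len(trace) - 1
    return [[[c / transitions for c in row] for row in mode] for mode in counts]


# Hypothetical 3-bit trace, for illustration only.
trace = [0b000, 0b111, 0b010, 0b101]
psi0, psi1, psi2 = bit_pair_matrices(trace, 3)
# Combined off-diagonal cost 1 - psi0, i.e. the probability of mode-1 or mode-2.
psi = [[0.0 if i == j else 1.0 - psi0[i][j] for j in range(3)] for i in range(3)]
```

For every off-diagonal pair the three estimated probabilities sum to one, which
mirrors the all-ones identity stated above.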

For three neighboring wires i, j, and k, the worst case (1 + 4r, or mode-4) crosstalk
on the victim wire j occurs when both bit-pairs (i, j) and (j, k) have a mode-2
crosstalk pattern. Similarly, the next worst case (1 + 3r, or mode-3) crosstalk oc-
curs when one bit pair has a mode-2 and the other has a mode-1 pattern. Both of
these situations necessitate transmission in two cycles with our VCB. Let event "A"
represent the occurrence of mode-1 or mode-2 pattern in the first bit-pair (i, j) and
event "B" the occurrence of mode-2 or mode-1 pattern in the second bit-pair (j, k),
i.e., $P(A) = 1 - \psi_{0r}[i][j]$ and $P(B) = 1 - \psi_{0r}[j][k]$. Note that we use lower-case
symbols ($\psi$) to represent individual elements of the crosstalk matrix $\Psi$. Since events A
and B are mutually exclusive, we have P(A or B) = P(A) + P(B). We are interested
in obtaining P(A or B) since this represents the probability of a mode-3 or a mode-4
crosstalk on the bus. Thus, we have: $P(A \text{ or } B) = (1 - \psi_{0r}[i][j]) + (1 - \psi_{0r}[j][k])$.

Following the example above, we combine the bit-pair crosstalk matrices $\Psi_{2r}$,
$\Psi_{1r}$, and $\Psi_{0r}$ to get one matrix $\Psi = J_n - \Psi_{0r}$. As noted earlier, our VCB design
transmits mode-4 and mode-3 patterns in two clock cycles and mode-2, mode-1, and
mode-0 patterns in one clock cycle. Hence, we seek to minimize the total probability
of occurrence of mode-4 and mode-3 patterns across all bit-pairs through wire
reordering and signaling using integer linear programming. Thus the objective function
is the sum of all these probabilities since the events are mutually exclusive and the
occurrence of a mode-4 or mode-3 event in any one bit-pair means that the transmis-
sion takes two cycles instead of one. The simple wire reordering formulation, called
minimum crosstalk bit ordering (MCBO), using this objective function is discussed
next.

As before, the MCBO problem is formulated as an ILP by considering binary
variables $x[i][j]$ associated with each bit pair $(i, j)$. In the solution, $x[i][j] = 1$ if bits
i and j are to be placed next to each other on the bus and $x[i][j] = 0$ otherwise.
Let $V = \{1, \ldots, n\}$ be the vertex set that represents the bits, $A = \{(i, j) : i, j \in V\}$
represent the set of possible pairs of bits, and $\psi[i][j]$ be the bit-pair crosstalk matrix.
The ILP formulation in terms of the variables $x[i][j]$ and the iterative procedure used
to solve the ILP are given next:

Step 1: Minimize $\sum_{\forall (i,j) \in A} \psi[i][j] \cdot x[i][j]$
        subject to:
        $x[i][j] \in \{0, 1\}, \; \forall i, j \in V$, (6.8)
        $\sum_{\forall j \in V} x[i][j] = 1, \; \forall i \in V$ and $\sum_{\forall i \in V} x[i][j] = 1, \; \forall j \in V$. (6.9)
Step 2: Solve ILP to get the solution.
Step 3: Check if the solution has subtours. If none, go to Step 6.
Step 4: Else, let there be t subtours:
        $S = \{S_0(n_0), S_1(n_1), \ldots, S_t(n_t)\}$,
        where $S_k(n_k)$ means that subtour $S_k$ has length $n_k$.
        Add subtour elimination constraint:
        $\sum (x[i][j] : (i, j) \text{ are in } S_k(n_k)) < n_k, \; \forall S$. (6.10)
Step 5: Go to Step 2.
Step 6: The desired solution (Hamiltonian cycle) has been obtained. Stop.

In the above procedure, Constraint 6.8 ensures that the variables take only binary
values and Constraint 6.9 ensures that the in- and out-degrees of every vertex are
one, i.e., every bit occurs exactly once in the ordering. As explained in Section 5.4.3,
we add subtour eliminations iteratively and solve the ILP efficiently with the CPLEX
optimizer tool.
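
Two pieces of the procedure above can be sketched in plain Python for small buses.
The first function performs the subtour check of Step 3 on a solution that satisfies
the degree constraints. The second is an exhaustive search that stands in for the
ILP solver; it is not the CPLEX-based method used in this work and is only practical
for very small n. The function names and the integer cost matrix are hypothetical.

```python
from itertools import permutations


def find_subtours(succ):
    """Split a degree-feasible solution into its cycles.

    succ[i] = j means x[i][j] = 1, i.e. bit j follows bit i. Because every
    vertex has in- and out-degree one, succ is a permutation of the vertices.
    """
    seen, tours = set(), []
    for start in range(len(succ)):
        if start in seen:
            continue
        tour, node = [], start
        while node not in seen:
            seen.add(node)
            tour.append(node)
            node = succ[node]
        tours.append(tour)
    return tours


def brute_force_ordering(psi):
    """Exhaustively find the cyclic ordering with minimum summed adjacent-pair cost."""
    n = len(psi)
    best_cost, best_order = float("inf"), None
    for rest in permutations(range(1, n)):  # fix bit 0 to skip rotations
        order = (0,) + rest
        cost = sum(psi[order[k]][order[(k + 1) % n]] for k in range(n))
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_order, best_cost


# A solution with two subtours (0 <-> 1 and 2 <-> 3) needs elimination constraints.
tours = find_subtours([1, 0, 3, 2])

# Hypothetical symmetric costs, scaled to integers for illustration.
psi = [[0, 1, 9, 2],
       [1, 0, 3, 8],
       [9, 3, 0, 1],
       [2, 8, 1, 0]]
order, cost = brute_force_ordering(psi)
```

A single tour covering all vertices corresponds to reaching Step 6; more than one
tour means Step 4 applies.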

6.3.3 MCBO with Signaling (MCBOS)

In MCBO with signaling (MCBOS), the best signaling scheme (one of the five
schemes listed in Section 5.4.1) and the appropriate position of the bits on the bus
lines are determined simultaneously. As in the case of energy optimization, the motiva-
tion for using signaling is to enable the optimizer to select the optimal solution from
a richer set of possibilities. Thus, we can view the problem as similar to MCBO but
consisting of n supernodes corresponding to the n bits of the bus. Each supernode
contains five nodes, each representing a signaling scheme choice for a bit. By adding
constraints that ensure that only one of these nodes is selected for each supernode
and that the incoming and outgoing nodes for each supernode are the same, the ILP
for MCBOS is formulated as given next:

Minimize $\sum_{\forall (i,j) \in A} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} (\psi_{l,m}[i][j] \cdot x_{l,m}[i][j])$

subject to:

$x_{l,m}[i][j] \in \{0, 1\}, \; \forall \{i, j\} \in V$, (6.11)

$\sum_{\forall j \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[i][j] = 1, \; \forall i \in V$, (6.12)

$\sum_{\forall k \in V} \sum_{l=0}^{q-1} \sum_{m=0}^{q-1} x_{l,m}[k][i] = 1, \; \forall i \in V$, (6.13)

$\sum_{l=0}^{q-1} x_{l,m}[i][j] = \sum_{l=0}^{q-1} x_{m,l}[j][k], \; \forall \{i, j, k\} \in V, \; \forall m$. (6.14)

1}, each of which represents a choice of signaling schemes for a pair of bits, take

171

only binary values. Constraints 6.12 and 5.10 ensure that there is only one outgoing
and one incoming node, respectively, for each of the n supernodes. Constraints 6.14
ensures that the optimal tour enters and exits through the same node in a supernode
(i.e., the signaling schemes chosen for adjacent pairs of bits in the ﬁnal ordering are
consistent). Crosstalk probabilities \I’l, m[2][2], V2 6 V are set to 00 (a very large inte-
ger value). Finally, constraints for eliminating all subtours with two nodes are added
initially, and the problem is iteratively solved as described earlier in Section 5.4.3

until a Hamiltonian cycle that visits all supernodes exactly once is found.

6.4 Results and Discussion

We study the effect of MCBO and MCBOS on the 64-bit ALU result bus of our
superscalar processor architecture. As explained earlier in Section 5.5, the result bus
is on the critical path and is sensitive to delay variations due to crosstalk. Also, the
performance of the processor can be improved if faster transmissions are enabled on
this bus. We present two results for this bus next: crosstalk reduction using MCBO
and MCBOS and performance improvement when VCB is used with MCBO and

MCBOS.

6.4.1 Peak Crosstalk Reduction

In workload-speciﬁc design, statistics collected for SimPoint samples from 13 train-
ing set benchmarks were aggregated and used to obtain the optimal static encoding
schemes. The scheme was then tested on non-overlapping samples from the same

set of benchmarks. The non-overlapping sample was selected as explained in Sec-


tion 5.3.1. As explained earlier, our crosstalk optimization techniques MCBO and
MCBOS seek to reduce the number of cycles that carry mode-4 and mode-3 pat-
terns. From the results shown in Figures 6.3(a) and (b), we observe that both MCBO
and MCBOS reduce mode-4 and mode-3 patterns signiﬁcantly. The average reduc-
tions in number of 1+4r delay cycles were MCBO: 24.89% and MCBOS: 30.61% and
the average reductions in number of 1+3r cycles were MCBO: 19.21% and MCBOS:
23.42%.

For the general-purpose design scenario, our static schemes were designed using
data collected from SimPoint samples for the training benchmarks and then evaluated
on test benchmarks. Results are shown in Figures 6.4(a) and (b), for reductions
in the number of mode-4 and mode-3 cycles, respectively. We observe that the
average reductions in number of 1+4r delay cycles were MCBO: 21.22% and MCBOS:
29.35% and the average reductions in number of 1+3r cycles were MCBO: 16.77%

and MCBOS: 20.29%.

6.4.2 Performance Improvement with VCB

The reduction in the number of cycles required to transmit the information with our
techniques applied is shown in Figure 6.5(a) and (b). On the average, MCBOS which
is our best technique reduces the number of cycles by 17.68% for workload-speciﬁc
optimization and by 18.30% for general purpose optimization while MCBO reduces
the number of cycles by 13.89% and 14.44% for workload-speciﬁc and general-purpose

optimizations, respectively.
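
The link between the fraction of Group-II words and the cycle count can be made
explicit with a small calculation. Under the VCB rule (one cycle for Group-I, two
for Group-II), N words with a Group-II fraction p take N(1 + p) cycles. The sketch
below assumes that the baseline is the same VCB without reordering, which this
section does not state explicitly; the function names and the fractions used are
hypothetical and are not measured values.

```python
def vcb_total_cycles(num_words, group2_fraction):
    """Cycles needed on the VCB: one per word plus one extra per Group-II word."""
    return num_words * (1.0 + group2_fraction)


def cycle_reduction(p_before, p_after):
    """Relative reduction in cycles when the Group-II fraction drops.

    Derived from N(1 + p): (N(1 + p_before) - N(1 + p_after)) / (N(1 + p_before)).
    """
    return (p_before - p_after) / (1.0 + p_before)


# Hypothetical fractions, for illustration only.
baseline = vcb_total_cycles(1000, 0.50)
optimized = vcb_total_cycles(1000, 0.25)
reduction = cycle_reduction(0.50, 0.25)
```

Halving a Group-II fraction of 0.5 in this made-up case removes one sixth of the
cycles, which shows why the cycle savings are smaller than the corresponding
reduction in mode-3 and mode-4 patterns.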

[Bar charts (a) and (b): Workload-Specific Design: Crosstalk Reduction in ALU Result
Bus. Series: MCBO, MCBOS; vertical axis ticks from 0% to 70%. Benchmark names on
the horizontal axis are not recoverable from the scan.]


Figure 6.3. Crosstalk reduction results for workload-speciﬁc design of the 64-bit ALU
result bus. (a) Average reductions in number of 1+4r delay cycles. For MCBO:
24.89% and MCBOS: 30.61%. (b) Average reductions in number of 1+3r cycles. For
MCBO: 19.21% and MCBOS: 23.42%.

[Bar charts (a) and (b): General-Purpose Design: Crosstalk Reduction in ALU Result
Bus. Vertical axis ticks from 0% to 70%. Benchmark names on the horizontal axis are
mostly not recoverable from the scan.]

Figure 6.4. Crosstalk reduction results for general-purpose design of the 64-bit ALU
result bus. (a) Reductions in the number of 1+4r delay cycles; averages are 21.22% for
MCBO and 29.35% for MCBOS. (b) Reductions in the number of 1+3r cycles; averages are
16.77% for MCBO and 20.29% for MCBOS.


[Figure 6.5 plot: panel (a) titled "Workload-Specific Design: Performance Improvement
with VCB" and panel (b) titled "General-Purpose Design: Performance Improvement with
VCB"; bars for MCBO and MCBOS per benchmark; vertical axis 0% to 40%.]

Figure 6.5. Reduction in the number of cycles taken to transmit the information with
MCBO and MCBOS applied to the result bus. (a) Workload-speciﬁc optimization.
(b) General-purpose optimization.


6.5 Summary

This chapter presented a performance-oriented adaptive bus design technique that
reduces the frequency of high-delay crosstalk patterns and adapts the transfer time
to the pattern being transmitted. We presented a variable cycle bus (VCB) architecture
and a crosstalk analyzer circuit that together transmit data in either one or two
clock cycles depending on the type of crosstalk pattern. Consequently, the bus clock
cycle time no longer has to accommodate the worst-case (1+4r) crosstalk pattern;
it can instead be set by the average-case or most frequent (1+2r) crosstalk
pattern. We also presented a profile-guided optimization that reduces the frequency
of occurrence of 1+4r and 1+3r crosstalk patterns and thus improves the performance
of the VCB significantly. Results on SPEC CPU2000 benchmarks, in a general-purpose
optimization scenario, show a 29.35% reduction in 1+4r cycles, a 20.29% reduction
in 1+3r cycles, and a bus performance improvement of 17.42% for a VCB with a static
reordering and signaling technique targeting bus crosstalk minimization.
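
The one-or-two-cycle decision summarized above can be illustrated with a minimal Python sketch. It assumes the common crosstalk classification in which a switching wire sees a delay of (1 + k*r), where each neighbor contributes 0 to k if it switches in the same direction, 1 if it is quiet, and 2 if it switches in the opposite direction. The function names, the treatment of edge wires (one neighbor only), and the threshold `one_cycle_limit = 2` (patterns up to 1+2r take one cycle) are assumptions made for illustration; this is not the analyzer circuit described in this chapter.

```python
from typing import List, Sequence


def transitions(prev: int, curr: int, width: int) -> List[int]:
    """Per-wire change: +1 rising, -1 falling, 0 quiet (bit i drives wire i)."""
    return [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]


def bus_class(prev: int, curr: int, width: int) -> int:
    """Largest k such that some switching wire sees a (1 + k*r) pattern."""
    delta = transitions(prev, curr, width)
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue  # a quiet wire has no transition to delay
        k = 0
        for j in (i - 1, i + 1):
            if 0 <= j < width:
                # same direction adds 0, quiet neighbor 1, opposite direction 2
                k += abs(d - delta[j])
        worst = max(worst, k)
    return worst


def vcb_cycles(prev: int, curr: int, width: int, one_cycle_limit: int = 2) -> int:
    """One cycle if the worst pattern is at most (1 + one_cycle_limit*r), else two."""
    return 1 if bus_class(prev, curr, width) <= one_cycle_limit else 2


def trace_cycles(words: Sequence[int], width: int) -> int:
    """Total bus cycles needed to send words[1:] after words[0]."""
    return sum(vcb_cycles(p, c, width) for p, c in zip(words, words[1:]))
```

For the 3-wire trace `[0b000, 0b111, 0b010, 0b101]`, only the last transfer produces a 1+4r pattern, so the sketch charges two cycles for it and one cycle for each of the other two transfers.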


CHAPTER 7
CONCLUSION

In this dissertation, we presented our research on activity-aware modeling and design
optimization for on-chip interconnects in current and future nanometer-scale
technologies. We addressed three important issues in high-performance bus design for
nanometer-scale microprocessors: accurate energy and thermal modeling, energy
optimization techniques, and crosstalk reduction. Key contributions and results from
our research are summarized next.

7.1 Contributions and Key Results

In Chapter 3, we presented a unified nanometer-scale bus energy dissipation and
thermal model that helps designers monitor energy dissipation and temperature
change in individual wires during trace- or execution-driven simulation. In addition
to self capacitance, our model incorporates the effects on switching energy of
repeater insertion and of capacitive coupling between adjacent as well as
non-adjacent pairs of wires. It accounts for lateral heat transfer between adjacent
wires when estimating wire temperatures, and it estimates wire temperature gradients
and their impact on wire delay. None of these capabilities was available in earlier
models.
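
To make the role of non-adjacent coupling concrete, the following Python sketch evaluates the textbook expression for the energy dissipated in one bus transition, E = (1/2) Vdd^2 [ sum_i C_s d_i^2 + sum_{i<j} C_ij (d_i - d_j)^2 ], where d_i is +1, -1, or 0 for a rising, falling, or quiet wire. It is a simplified illustration rather than the model of Chapter 3: repeaters are ignored, every wire is given the same self capacitance, and coupling is assumed to depend only on the distance between wires. The function name and all parameter names are hypothetical.

```python
from typing import Sequence


def transition_energy(prev: int, curr: int, width: int, vdd: float,
                      c_self: float, c_couple: Sequence[float]) -> float:
    """Energy dissipated when the bus goes from word prev to word curr.

    c_couple[d - 1] is the coupling capacitance between wires d positions
    apart, so c_couple[0] is adjacent coupling and later entries are
    non-adjacent coupling.
    """
    delta = [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
    total = sum(c_self * d * d for d in delta)  # self-capacitance term
    for dist, cap in enumerate(c_couple, start=1):
        for i in range(width - dist):
            # the coupling capacitor charges by the difference of the two swings
            total += cap * (delta[i] - delta[i + dist]) ** 2
    return 0.5 * vdd ** 2 * total
```

With the arbitrary values `c_self = 1.0` and `c_couple = [2.0, 0.5]`, a single rising wire at the edge of a 3-wire bus dissipates 1.75 units, against 1.5 units when the non-adjacent entry is dropped (`c_couple = [2.0]`), which shows how an adjacent-only model can underestimate energy.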

Using this model, we studied energy and thermal characteristics of instruction
and data buses using an execution-driven simulation of a billion or more instructions
of nine SPEC CPU2000 benchmarks. We found that existing bus energy models
provide estimates that are about 7-8% less accurate than those of our energy model,
because they do not account for the effects of coupling between non-adjacent
wire pairs of a bus. Our model is the first of its kind to incorporate these effects.
Our results also showed that, in wide instruction and data buses used in modern
processors executing SPEC CPU2000 workloads, existing bus encoding schemes show
no significant energy benefit due to the nature of data traffic. When non-adjacent
coupling effects between wire pairs are considered, their energy savings shrink
considerably. Based on simulations using our thermal model, we found that average
wire temperatures in data and instruction buses may rise 10-37 °C during
a simulation run of only a billion cycles for a 130 nm superscalar processor running
SPEC benchmarks. This temperature rise is primarily due to heat generated by the
currents flowing in the wire during bit switching.

In a future 45 nm technology node, the wire temperature rise for the same set of
benchmarks and simulation sample was found to be 20-58 °C. We observed that
instruction and data bus wires attained absolute temperatures in the ranges
80.3-104 °C and 97.6-123.7 °C in 130 nm and 45 nm processors, respectively, during
the course of our simulation, showing that signal lines also reach significant
temperatures. Wire temperature gradients of 16-25 °C between the sending and
receiving ends of the wires were the most common during the course of simulation.
Energy dissipation behavior and wire temperature rise in buses were notably
correlated across time: short, intermittent cycles of high energy-dissipating
switching activity trigger step changes in temperature.
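
The link between bursts of switching energy and wire temperature, including lateral heat transfer between adjacent wires, can be sketched with a generic lumped thermal RC network advanced by explicit Euler steps. This is a hedged illustration under stated assumptions, not the thermal model of Chapter 3: each wire is a single thermal node, and `r_vert` (thermal resistance to a reference at temperature `t_ref`), `r_lat` (thermal resistance between neighboring wires), and `c_th` (thermal capacitance) are hypothetical parameters in arbitrary consistent units.

```python
from typing import List, Sequence


def step_temperatures(temps: Sequence[float], power: Sequence[float], dt: float,
                      t_ref: float, r_vert: float, r_lat: float,
                      c_th: float) -> List[float]:
    """Advance every wire temperature by one explicit Euler step of length dt."""
    n = len(temps)
    new = []
    for i, t in enumerate(temps):
        # heat generated in the wire minus heat lost to the reference
        flow = power[i] - (t - t_ref) / r_vert
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                flow -= (t - temps[j]) / r_lat  # lateral exchange with a neighbor
        new.append(t + dt * flow / c_th)
    return new


def simulate(power_trace: Sequence[Sequence[float]], n_wires: int, dt: float,
             t_ref: float, r_vert: float, r_lat: float,
             c_th: float) -> List[List[float]]:
    """Return the temperature history, starting with all wires at t_ref."""
    temps = [t_ref] * n_wires
    history = [temps]
    for power in power_trace:
        temps = step_temperatures(temps, power, dt, t_ref, r_vert, r_lat, c_th)
        history.append(temps)
    return history
```

Feeding a power trace in which only the middle wire dissipates energy raises that wire first and then, through `r_lat`, its neighbors. The time step must be small relative to the thermal time constants for the explicit update to remain stable.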

In Chapter 4, we developed models that track the impact of changing wire
temperature on timing/delay violations occurring in global signal buses during
microarchitecture-level exploration. Results show that, for a 130 nm processor with
no power and thermal management, temperature-induced clock cycle time violations
in an ALU result bus, which is on the critical path, occurred at a rate of 2.27 per
hundred bus references, averaged over ten programs in the SPEC CPU2000 workload.
The rate increased to an average of 6.20 per hundred bus references for the same
processor at the 45 nm technology node. We found that wire delay variability degraded
overall performance by about 4.1% in 130 nm processors and about 11.9% in 45 nm
processors. Our analysis also showed that conventional techniques like bus encoding,
which seek to reduce energy dissipation and potentially wire temperatures, have
limited impact on alleviating temperature-induced delay violations.
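
The mechanism behind temperature-induced delay violations can be sketched as follows: wire resistance rises roughly linearly with temperature, wire delay scales with resistance, and a violation occurs when the delay exceeds the clock period. The Python sketch below is a minimal illustration under stated assumptions, not the delay model of Chapter 4: it uses the standard 0.38*R*C approximation for the 50% delay of an unrepeated distributed RC line, a linear resistance model with a temperature coefficient of 0.0039 per °C (a typical bulk-copper value, assumed here), and a uniform temperature along the wire. All function and parameter names are hypothetical.

```python
from typing import Iterable


def wire_delay(r0: float, c: float, temp: float,
               t0: float = 25.0, tcr: float = 0.0039) -> float:
    """Approximate 50% delay (seconds) of an unrepeated distributed RC wire.

    r0 is the wire resistance (ohms) at temperature t0, c its capacitance
    (farads), and tcr the assumed temperature coefficient of resistance.
    """
    r = r0 * (1.0 + tcr * (temp - t0))  # linear resistance-temperature model
    return 0.38 * r * c


def violations_per_hundred(temps: Iterable[float], r0: float, c: float,
                           clock_period: float) -> float:
    """Bus references (one temperature sample each) that miss the clock, per 100."""
    samples = list(temps)
    late = sum(1 for t in samples if wire_delay(r0, c, t) > clock_period)
    return 100.0 * late / len(samples)
```

For example, with `r0 = 1000.0`, `c = 1e-12`, and a 450 ps clock period, references made at 25 °C and 60 °C meet timing while those at 110 °C and 125 °C do not, giving 50 violations per hundred references for that four-sample trace.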

In Chapter 5, we formulated an optimization methodology to design energy- and
temperature-optimized static bus encoding schemes through early-stage
microarchitecture-level exploration, exploiting value characteristics of a target
workload. Binary integer linear programs (ILPs) were formulated and solved optimally
to determine the signaling, bit ordering, or a combination of both that minimizes
bus energy dissipation. For the SPEC CPU2000 workload, our static bit ordering and
signaling (SBOS) technique reduced total bus energy dissipation by 22.79%/40.77%
for data/instruction buses in an application-specific scenario, where the technique
was designed individually using statistics collected for each benchmark and tested
on the same benchmark. In a much more general scenario, where the scheme was
designed using statistics collected from 13 out of 26 benchmarks and tested on the
remaining 13, the corresponding reductions were 20.04%/38.78%. These reductions
are significantly higher than those obtained from dynamic encoding schemes
for the same benchmarks. We also proposed a first-of-its-kind methodology to design
temperature-aware encoding schemes by trading off some of the energy gains
we obtain with static encoding techniques to achieve wire temperature reduction. In
this methodology, we add temperature constraints during energy optimization, and
our ILP produces a static encoding scheme that reduces maximum (hottest) wire
temperatures by up to 15.23 K/16.17 K for data/instruction buses while still
producing significant total bus energy reductions.
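
The idea of choosing a static bit ordering and signaling from workload statistics can be illustrated on a very narrow bus by exhaustive search. The Python sketch below is a brute-force stand-in for the ILP formulation, not a reproduction of it: it assumes a cost that counts only adjacent-wire coupling activity, sum of (d_a - d_b)^2 over neighboring wires and over all transitions of a sample trace, and it searches every wire permutation and every choice of inverted bits. It is practical only for a handful of wires. The function names are hypothetical.

```python
from itertools import permutations, product
from typing import List, Sequence, Tuple


def deltas(trace: Sequence[int], width: int) -> List[List[int]]:
    """Per-bit changes (+1, -1, 0) for each consecutive pair of words."""
    return [[((c >> i) & 1) - ((p >> i) & 1) for i in range(width)]
            for p, c in zip(trace, trace[1:])]


def coupling_cost(ds: Sequence[Sequence[int]], order: Sequence[int],
                  invert: Sequence[bool]) -> int:
    """Adjacent-coupling activity when logical bit order[k] drives wire k.

    invert[b] is True when logical bit b is sent with inverted polarity,
    which flips the sign of its transitions.
    """
    cost = 0
    for d in ds:
        swings = [-d[b] if invert[b] else d[b] for b in order]
        cost += sum((x - y) ** 2 for x, y in zip(swings, swings[1:]))
    return cost


def best_static_encoding(trace: Sequence[int],
                         width: int) -> Tuple[int, Tuple[int, ...], Tuple[bool, ...]]:
    """Exhaustively find the (cost, order, invert) triple with minimum cost."""
    ds = deltas(trace, width)
    best = None
    for order in permutations(range(width)):
        for invert in product((False, True), repeat=width):
            cost = coupling_cost(ds, order, invert)
            if best is None or cost < best[0]:
                best = (cost, order, invert)
    return best
```

On the 3-bit trace `[0b000, 0b101, 0b000, 0b101]`, bits 0 and 2 always switch together, so placing them on neighboring wires halves the coupling cost from 6 (original order) to 3.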

Finally, in Chapter 6, we examined techniques to reduce bus crosstalk and improve
overall bus performance. We presented a variable cycle bus (VCB) architecture and
a crosstalk analyzer circuit that together transmit data in either one or two clock
cycles depending on the type of crosstalk pattern. Consequently, the bus clock cycle
time no longer has to accommodate the worst-case crosstalk pattern; it can instead
be set by the average-case or most frequent crosstalk pattern, which roughly
doubles the bus clock frequency. We also presented a profile-guided
optimization that reduces the frequency of occurrence of worst-case crosstalk patterns
and thus improves the performance of the VCB significantly. Results on
SPEC CPU2000 benchmarks show at least a 29.35% reduction in the number of worst-case
crosstalk cycles and a bus performance improvement of 17.42% for a VCB with a static
reordering and signaling technique targeting bus crosstalk minimization.

Our work represents a significant advancement over existing approaches
that are activity-oblivious and/or consider worst-case traffic conditions. The
microarchitecture-level, activity-driven spatiotemporal bus energy and thermal model
we present is the first of its kind. Our static value-aware bit reordering and
signaling techniques are also novel solutions that work well on real applications.


7.2 Directions for Future Research

Some potential research directions for the future are outlined next.

• A methodology to dynamically select between different static wire orderings and
signaling strategies for energy and/or thermal optimization can be investigated.
In such a scheme, a controller will select a particular strategy based on input
or hints from the compiler through data stored in the program's executable.

• The wire ordering and signaling strategies can be used to create configurable
interconnect intellectual property (IIP) blocks similar to configurable IP blocks
available today for logic circuits. Such an IIP block will contain routing
specifications for all on-chip high-performance signals between logic blocks,
suitably optimized for power, temperature, crosstalk, or a combination of the
three, and automatically synthesized by a CAD tool that analyzes the user-supplied
workload.

• The thermal model can be enhanced to investigate thermal issues in clock trees,
and a temperature-aware clock-tree synthesis approach can be developed. The
thermal model can also be used as a starting point for analyzing issues related
to three-dimensional interconnects. In such systems, the presence of multiple
vertically connected interconnect stacks emphasizes the need to investigate
thermal issues, since heat dissipation paths from interconnect layers may be
several times longer than in conventional designs.


[1]

[2]

l3]

[4]

l5]

l9]

BIBLIOGRAPHY

Semiconductor Industry Association, “International Technology Roadmap for
Semiconductors (ITRS), 2005 edition,” URL: http://public.itrs.net.

M. Mui, K. Banerjee, and A. Mehrotra, “A Global Interconnect Optimization
Scheme for Nanometer Scale VLSI with Implications for Latency, Bandwidth,
and Power Dissipation,” IEEE Transactions on Electron Devices, vol. 51, no. 3,
pp. 195—203, Feb. 2004.

S. Rusu, “Circuit Technologies for Multi-Core Design,” Talk at the IEEE Santa
Clara Valley Solid-State Circuits Society, slides at: http://www.ewh.ieee.org/
r6/scv/ssc/Apri106.pdf, Apr. 2006.

N. Magen, A. Kolodny, U.,Weiser, and N. Shamir, “Interconnect-Power Dissi—
pation in a Microprocessor,” in Proceedings of the 2004 International Workshop
on System level Interconnect Prediction ( SLIP ’04 ) New York, NY, USA: ACM
Press, 2004, pp. 7—13.

S. Im and K. Banerjee, “F1111 Chip Thermal Analysis of Planar (2—D) and Ver-
tically Integrated (3—D) High Performance ICs,” in Proceedings of the IEEE
International Electron Devices Meeting (IEDM). Piscataway, NJ, USA: IEEE
Press, Dec. 2000, pp. 727—730.

P. Gelsinger, “Microprocessors for the New Millennium: Challenges, Opportu-
nities and New Frontiers,” in Proceedings of the IEEE Solid-State and Circuits
Conference (ISSCC). Piscataway, NJ, USA: IEEE Press, Dec. 2001, pp. 2225.

K. Nabors, S. Kim, J. White, and S. Senturia, “Fast Capacitance Extraction of
General Three-Dimensional Structures,” in Proceedings of International Con-
ference on Computer Design (ICCD). Washington DC, USA: IEEE Computer
Society, Oct. 1991, pp. 479-484.

M. Bohr, “Interconnect Scaling: The Real Limiter to High Performance ULSI,”
in Proceedings of the International Electron Devices Meeting (IEDM). Piscat-
away, NJ, USA: IEEE Press, Dec. 1995, pp. 241—244.

L. Lev and P. Chao, “Down to the Wire: Requirements for N anometer Design
Implementation,” White Paper, Cadence Design Systems Inc., 2002.

183

[10] W. Li, B. Mbouombouo, and L. Tsai, “Needed: High-Level Interconnect
Methodology for N anometer ICs,” EE Times, http://www.eetimes.com/story/
OEG2003062380039, June 2003.

[11] P. Green, “A GHz IA-32 Architecture Microprocessor Implemented on 0.18am
Technology with Aluminum Interconnect,” in Proceedings of the IEEE Solid-
State and Circuits Conference (ISSCC). Piscataway, NJ, USA: IEEE Press,
2000, pp. 98-99.

[12] J. Heidenreich, D. Edelstein, R. Goldblatt, W. Cote, C. Uzoh, N. Lustig,
T. McDevitt, A. Stamper, A. Simon, J. Dukovic, P. Andriacacos, R. Wash-
nik, H. Rathore, T. Katsetos, P. McLaughlin, S. Luce, and J. Slattery, “Copper
Dual Damascene for sub—0.25am CMOS,” in Proceedings of the IEEE Interna-
tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press,
June 1998, pp. 151—153.

[13] B. Zhao, D. Feiler, V. Ramanathan, Q. Liu, M. Brongoa, J. Wu, H. Zhang,
J. Kuei, and D. Young, “A Cu Low-k Dual Damascene Interconnect for High-
Performance and Low Cost Integrated Circuits,” in Proceedings of the IEEE
Symposium on VLSI Technology. Piscataway, NJ, USA: IEEE Press, June
1998, pp. 28—29.

[14] P. Zarkesh—Ha, J. Davis, and J. Meindl, “The Impact of Cu/Low-k on Chip
Performance,” in Proceedings of the IEEE International ASIC/SOC Conference.
Piscataway, NJ, USA: IEEE Press, 1999, pp. 257—261.

[15] H. Feng, F. Ercal, and F. Bunyak, “Systolic Algorithm for Processing RLE
Images,” in IEEE Southwest Symposium on Image Analysis and Interpretation.
Piscataway, NJ, USA: IEEE Press, 1998, pp. 127-131.

[16] S. Chai, A. Gentile, W. Lugo—Beauchamp, J. Fonseca, J. Cruz-Rivera, and
D. Wills, “Focal Plane Processing Architectures for Real-Time Hyperspectral

Image Processing,” Applied Optics: Special Issue on Optics in Computing,
vol. 39, pp. 835—849, Feb. 2000.

[17] W. Dally, “Interconnect-limited VLSI architecture,” in Proceedings of the IEEE
International Interconnect Technology Conference. Piscataway, NJ, USA: IEEE
Press, June 1999, pp. 15-17.

[18] J. Goodman, R. Kostuk, and B. Clymer, “Optical Interconnects: An Overview,”
in Proceedings of the 2"" International IEEE VLSI Multilevel Interconnection
Conference. Piscataway, NJ, USA: IEEE Press, 1985, pp. 219—224.

184

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

A. Rahman, A. Fan, J. Chung, and R. Reif, “Wire-Length Distribution of
Three-Dimensional Integrated Circuits,” in Proceedings of the IEEE Interna-
tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press,
June 1999, pp. 233—235.

S. Souri and K. Saraswat, “Interconnect Performance Modeling for 3D Inte-
grated Circuits with Multiple Si Layers,” in Proceedings of the IEEE Interna-
tional Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press,
June 1999, pp. 24—26.

K. Banerjee, “Trends for ULSI Interconnections and Their Implications for
Thermal, Reliability and Performance Issues (Invited Paper),” in Proceedings
of the Seventh International Dielectrics and Conductors for ULSI Multilevel
Interconnection Conference (DCMIC). Tampa, FL, USA: IMIC, Mar. 2001,
pp. 38—50.

W. Dally and J. Poulton, Digital Systems Engineering. Cambridge University
Press, 1998.

A. Krishnamoorthy and D. Miller, “Scaling Optoelectronic-VLSI Circuits into
the 2lst century: A Technology Roadmap,” IEEE Journal on Selected Topics
in Quantum Electronics, vol. 2, no. 1, pp. 55—76, 1996.

T. Mule, S. Schultz, T. Gaylord, and J. Meindl, “An Optical Clock Distribution
Network for Gigascale Integration,” in Proceedings of the IEEE International

Interconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June
2000, pp. 176—179.

J. Joyner and J. Meindl, “Opportunities for Reduced Power Dissipation Using
Three-Dimensional Integration,” in Proceedings of the IEEE International In-
terconnect Technology Conference. Piscataway, NJ, USA: IEEE Press, June
2002, pp. 148-150.

J. Joyner, P. Zarkesh-Ha, J. Davis, and J. Meindl, “Vertical Pitch Limitations
on Performance Enhancement in Bonded Three-Dimensional Interconnect Ar-

9

chitectures,’ in Proceedings of the International Workshop on System-Level In-
terconnect Prediction. New York, NY, USA: ACM Press, 2000, pp. 123—127.

K. Saraswat, S. Souri, K. Banerjee, and P. Kapur, “Performance Analysis
and Technology of 3-D ICs,” in Proceedings of the International Workshop on
System-Level Interconnect Prediction. New York, NY, USA: ACM Press, Apr.
2000, pp. 85—90.

185

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[37]

[38]

A. Shilov, “Intel to Cancel NetBurst, Pentium 4, Xeon Evolution: Tejas,
Jayhawk Reportedly Shelved,” X—Bit Laboratories, http://www.xbitlabs.com/
news / cpu / display / 20040507000306.htm1, May 2004.

J. Kovar, “Sun Cancels UltraSPARC V, Gemini, But Not Future Processor De-
velopment,” CMP Media’s CRN, http://www.crn.com/sections/breakingnews/
dailyarchivesjhtml?articleId=18841521, Apr. 2004.

K. Krewell, “Multicore Mania is Here to Stay,” Electronic Design News
(EDN), http: / / www.edn.com / article / CA6302 185.html?partner=eb&pubdate=
2%2F1%2F2006, Feb. 2006.

D. Brooks, M. Martonosi, J. Wellman, and P. Bose, “Power-Performance Model-
ing and Tfadeoff Analysis for a High End Microprocessor,” in Proceedings of the
First International Workshop on Power-Aware Computer Systems (PACS’OO)
held with ASPLOS-IX, Nov. 2000.

J. Cong, “An Interconnect-Centric Design Flow for Nanometer Technologies,”
Proceedings of the IEEE, vol. 89, no. 4, pp. 505—528, April 2001.

V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger, “Clock Rate versus IPC:
The End of the Road for Conventional Microarchitectures,” in Proceedings of
the Annual Symposium on Computer Architecture (ISCA). New York, NY,
USA: ACM Press, July 2000, pp. 248—259.

R. Ho, K. Mai, and M. Horowitz, “The Future of Wires,” Proceedings of the
IEEE, vol. 89, no. 4, pp. 490-504, Apr. 2001.

T. N. Vijaykumar and Z. Chishti, “Wire Delay is Not a Problem for SMT
(In the Near Future),” in Proceedings of the Annual Symposium on Computer
Architecture (ISCA). Washington, DC, USA: IEEE Computer Society, July
2004, pp. 40—50.

J. Meindl, J. Davis, P. Zarkesh—Ha, C. Patel, K. Martin, and P. Kohl, “Inter-
connect Opportunities for Gigascale Integration,” IBM Journal of Research and
Development, vol. 46, no. 2, pp. 245-263, Mar. 2002.

K. Banerjee and A. Mehrotra, “Global Interconnect Warming,” IEEE Circuits
and Devices, vol. 17, pp. 16—32, Sept. 2001.

A. Ajami, K. Banerjee, and M. Pedram, “Modeling and Analysis of Nonuniform
Substrate Temperature Effects on Global ULSI Interconnects,” IEEE Transac-
tions on Computer Aided Design of Integrated Circuits and systems, vol. 24,
no. 6, pp. 849—860, June 2005.

186

[39] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Ap-
proach, Third Edition. Morgan Kaufmann Publishers, 2003.

[40] J. Shen and M. Lipasti, Modern Processor Design: Fundamentals of Superscalar
Processors. McGraw Hill, 2004.

[41] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison-
Wesley, 1990.

[42] J. Davis, V. De, and J. Meindl, “A Stochastic Wire-Length Distribution for
Gigascale Integration—Part I: Derivation and Validation,” IEEE Transactions
on Electron Devices, vol. 45, no. 3, pp. 580—589, Mar. 1998.

[43] K. Banerjee and A. Mehrotra, “Analysis of On—Chip Inductive Effects for Dis-
tributed RLC Interconnects,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 21, no. 8, pp. 904—915, Aug. 2002.

[44] R. Kumar, “Interconnect and Noise Immunity Design for the Pentium 4 Proces—
sor,” in Proceedings of the Annual ACM/IEEE Design Automation Conference
(DAC). New York, NY, USA: ACM Press, 2003, pp. 938—943.

[45] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, Second
Edition. Prentice-Hall, Dec. 2002.

[46] A. Naeemi, R. Venkatesan, and J. D. Meindl, “Optimal Global Interconnects
for GSI,” IEEE Transactions on Electron Devices, vol. 50, no. 4, pp. 980—987,
Apr. 2003.

[47] M. Stan and W. Burleson, “Low—Power Encodings for Global Communication in
CMOS VLSI,” IEEE Transactions on VLSI Systems, vol. 5, no. 4, pp. 444—455,
Dec. 1997.

[48] J. Liu, N. Mahapatra, and K. Sundaresan, “Hardware—Only Compression to
Reduce Cost and Improve Utilization of Address Buses,” in Proceedings of the
IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Los Alamitos,
CA, USA: IEEE Computer Society, Feb. 2003, pp. 220—221.

[49] J. Liu, K. Sundaresan, and N. Mahapatra, “Energy-Efficient Compressed Ad-
dress Transmission,” in Proceedings of the 18‘” International Conference on
VLSI Design (VLSID). Washington, DC, USA: IEEE Computer Society, Jan.
2005, pp. 592—597.

[50] ————, “Fast Perfomance—Optimized Partial Match Compression for Low-Latency
On-Chip Address Buses,” in Proceedings of International Conference on Cam-

puter Design (ICCD). Piscataway, NJ, USA: IEEE Press, Oct. 2006.

187

[51]

[52]

[53]

[54]

[55]

[56]

[57]

[58]

[60]

M. Stan and W. Burleson, “Bus-Invert Coding for Low-Power I/O,” IEEE
Transactions on VLSI Systems, vol. 3, no. 1, pp. 49—58, Mar. 1995.

L. Benini, G. D. Micheli, E. Macii, D. Sciuto, and C. Silvano, “Address Bus
Encoding Techniques for System-Level Power Optimization,” in Proceedings of
Conference on Design Automation and Test in Europe (DATE). Washington,

DC, USA: IEEE Computer Society, Feb. 1998.

Y. Zhang, J. Lach, K. Skadron, and M. Stan, “Odd/Even Bus Invert with
Two-Phase Transfer for Buses with Coupling,” in Proceedings of International
Symposium on Low Power Electronics and Design (ISLPED). New York, NY,
USA: ACM Press, Aug. 2002, pp. 80—83.

K. Kim, K. Back, N. Shanbhag, C. Liu, and S. Kang, “Coupling-Driven Signal
Encoding Scheme for Low-Power Interface Design,” in Proceedings of the Inter-
national Conference on Computer-Aided Design (ICCAD). Washington, DC,
USA: IEEE Computer Society, Nov. 2000, pp. 318—321.

B. Victor and K. Keutzer, “Bus Encoding to Prevent Crosstalk Delay,” in Pro—
ceedings of the International Conference on Computer-Aided Design (I CCAD).
Piscataway, NJ, USA: IEEE Press, Nov. 2001, pp. 57—63.

S. Khatri, A. Mehrotra, R. Brayton, R. Otten, and A. Sangiovanni-Vincentelli,
“A Novel VLSI Layout Fabric for Deep Sub-Micron Applications,” in Proceed—
ings of the Annual ACM/IEEE Design Automation Conference (DA C). New
York, NY, USA: ACM Press, 1999, pp. 491—496.

R. Arunachalam, E. Acar, and S. Nassif, “Optimal Shielding/Spacing Metrics
for Low Power Design,” in Proceedings of the IEEE Computer Society Annual
Symposium on VLSI. Washington DC, USA: IEEE Computer Society, Feb.
2003, pp. 167—171.

D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for
Architectural-Level Power Analysis and Optimizations,” in Proceedings of the
Annual Symposium on Computer Architecture (ISCA). New York, NY, USA:
ACM Press, 2000, pp. 83—94.

W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “The Design and Use
of SimplePower: A Cycle-Accurate Energy Estimation Tool,” in Proceedings of
the Annual ACM/IEEE Design Automation Conference (DAC). New York,
NY, USA: ACM Press, June 2000, pp. 340—345.

A. Dhodapkar, C. Lim, G. Cai, and W. Daasch, “TEM2P2EST: A Thermal
Enabled Multi-model Power/Performance ESTimator,” in Lecture Notes In

188

[61]

[62]

[63]

[64]

[65]

[56]

[67]

[68]

[69l

[70]

Computer Science, Proceedings of the First International Workshop on Power-
Aware Computer Systems (PACS’OO) held with ASPLOS—IX, November, 2000.
Springer-Verlag, 2001, pp. 112—125.

J. Smith, L. He, A. Dhodapkar, and N. Nidhi, “WArPE: Wisconsin Architecture
Power Estimator,” URL: http://eda.ee.ucla.edu/ntool/.

The Sim-Panalyzer Team, “SimpleScalar-ARM Power Modeling Project,” URL:
http: //www.eecs.umich.edu/~panalyzer/ .

D. Ponomarev, G. Kukuk, and K. Ghose, “AccuPower: An Accurate Power
Estimation Tool for Superscalar Microprocessors,” in Proceedings of Conference
on Design Automation and Test in Europe (DATE). Washington, DC, USA:
IEEE Computer Society, Mar. 2002, pp. 124—128.

K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and
D. Tarjan, “Temperature-Aware Microarchitecture,” in Proceedings of the An-
nual Symposium on Computer Architecture (ISCA). New York, NY, USA:
ACM Press, June 2003, pp. 2-13.

Y. Zhang, R. Chen, W. Ye, and M. Irwin, “System Level Interconnect Power
Modeling,” in Proceedings of the IEEE ASIC/SOC Conference. Piscataway,
NY, USA: IEEE Press, Sept. 1998, pp. 289-293.

W. Huang, M. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and
S. Velusamy, “Compact Thermal Modeling for Temperature-Aware Design,” in
Proceedings of the Annual ACM/IEEE Design Automation Conference (DA C).
New York, NY, USA: ACM Press, June 2004, pp. 878—883.

D. Burger and T. Austin, “The SimpleScalar Tool Set, version 2.0,” Computer
Architecture News, vol. 25, no. 5, pp. 13—25, June 1997.

Michigan State University High Performance Computing Center, “128 Node
Opteron Cluster from Western Scientiﬁc,” https://hpc.msu.edu/twiki/bin/
view/Main/WesternScientiﬁcCluster.

T.-Y. Chiang, K. Banerjee, and K. Saraswat, “Compact Modeling and SPICE-
Based Simulation for Electrothermal Analysis of Multilevel ULSI Inteconnects,”
in Proceedings of the International Conference on Computer-Aided Design (IC-
CAD). Washington, DC, USA: IEEE Computer Society, Nov. 2001, pp. 165—
172.

T.-Y. Wang and C.-P. Chen, “SPICE-Compatible Thermal Simulation with
Lumped Circuit Modeling for Thermal Reliability Analysis based on Modeling

189

[71]

[72]

[73]

[74l

[75l

[76]

[77]

[78]

[79]

[80]

Order Reduction,” in Proceedings of International Symposium on Quality of
Electronics Design (ISQED), 2004.

L. Shang, L.-S. Peh, A. Kumar, and N. K. Jha, “Thermal Modeling, Character-
ization and Management of On—Chip Networks,” in Proceedings of the Annual
ACM/IEEE International Symposium on Microarchitecture (MICRO). Los
Alamitos, CA, USA: IEEE Computer Society, Dec. 2004, pp. 67—78.

K. Banerjee, A. Mehrotra, A. Sangiovanni—Vincentelli, and C. Hu, “On Thermal
Effects in Deep Sub-Micron VLSI Interconnects,” in Proceedings of the Annual
ACM/IEEE Design Automation Conference (DAC). New York, NY, USA:
ACM Press, 1999, pp. 885—891.

D. Chen, E. Li, B. Rosenbaum, and S. Kang, “Interconnect Thermal Modeling
for Accurate Simulation of Circuit Timing and Reliability,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 2,

pp.197-205,be.2000.

R. Desikan, D. Burger, S. Keckler, and T. Austin, “Sim-alpha: A Validated,
Execution-Driven Alpha 21264 Simulator,” The University of Texas at Austin,
Department of Computer Sciences, Tech. Rep. TR—01-23, 2001.

Standard Performance Evaluation Council, “CPU2000 Version 1.2,” http://
www.spec.org/cpu2000, 2001.

SimpleScalar LLC, http://www.simplescalar.com.

T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Char-
acterizing Large Scale Program Behavior,” in Proceedings of the International

Conference on Architectural Support for Programming Languages and Systems
(ASPLOS). New York, NY, USA: ACM Press, Oct. 2002, pp. 45—57.

——-, “SimPoint Single Simulation Points for SPEC CPU 2000,” URL: http:
/ /www.cse. ucsd.edu/~calder/simpoint/single— sim— pionts.htm.

G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster and
More Flexible Program Analysis,” The Journal of Instruction-Level Parallelism,
vol. 7, 2005, http://www.jilp.org/vol7/v7paper14.pdf.

K. Sundaresan and N. Mahapatra, “An Accurate Energy and Thermal Model
for Global Signal Buses,” in Proceedings of the 18th International Conference
on VLSI Design. Washington DC, USA: IEEE Computer Society, Jan. 2005,
pp. 685—690.

190

[81]

[82]

[83]

[84]

[85]

[86]

[87]

[88]

[89]

[90]

—, “Accurate Energy Dissipation and Thermal Modeling for Nanometer-
Scale Signal Buses,” in Proceedings of International Symposium on High Per-
formance Computer Architecture {HPCA). Washington DC, USA: IEEE Com-
puter Society, Feb. 2005, pp. 51-60.

S. Borkar, “Design Challenges of Technology Scaling,” IEEE Micro, vol. 19,
no. 4, pp. 23—29, Jul—Aug. 1999.

T.-Y. Chiang and K. Saraswat, “Closed-Form Analytical Thermal Model for
Accurate Temperature Estimation of Multilevel ULSI Interconnects,” in 2003
Symposium on VLSI Circuits Digest of Papers. Piscataway, NJ, USA: IEEE
Press, June 2003, pp. 275-279.

K. Banerjee and A. Mehrotra, “Coupled Analysis of Electromigration Reliabil-
ity and Performance in ULSI Signal Nets,” in Proceedings of the International
Conference on Computer-Aided Design (ICCAD). Washington, DC, USA:
IEEE Computer Society, Nov. 2001, pp. 158—164.

P. Sotiriadis and A. Chandrakasan, “A Bus Energy Model for Deep Submicron
Technology,” IEEE Transactions on VLSI Systems, vol. 10, no. 3, pp. 341—350,
June 2002.

W.-C. Cheng and M. Pedram, “Memory Bus Encoding for Low-Power: A Th-
torial,” in Proceedings of International Symposium on Quality of Electronics
Design (ISQED). Washington, DC, USA: IEEE Computer Society, Mar. 2001.

P. Sotiriadis and A. Chandrakasan, “Low Power Bus Coding Techniques Con-
sidering Inter-wire Capacitances,” in Proceedings of Custom Integrated Circuits
Conference (CICC). Washington DC, USA: IEEE Computer Society, May
2000, pp. 414—419.

H. Deogun, R. Rao, D. Sylvester, and D. Blaauw, “Leakage- and Crosstalk-
Aware Bus Encoding for Total Power Reduction,” in Proceedings of the Annual
ACM/IEEE Design Automation Conference (DAC). New York, NY, USA:
ACM Press, June 2004, pp. 779—782.

N. Menezes and L. Pillegi, “Analyzing On-Chip Interconnect Effects,” in Design
of High-Performance Microprocessor Circuits, A. Chandrakasan, W. Bowhill,
and F. Fox, Eds. Piscataway, NJ, USA: IEEE Press, 2000, pp. 331—351.

A. Kahng, K. Masuko, and S. Muddu, “Analytical Delay Models for VLSI Inter-
connects under Ramp Input,” in Proceedings of the International Conference on
Computer-Aided Design (ICCAD). Washington, DC, USA: IEEE Computer
Society, Nov. 1996, pp. 30—36.

191

[91] J. Srinivasan and S. Adve, “The Importance of Heat-Sink Modeling for DTM
and a Correction to Predictive DTM for Multimedia Applications,” In Pro-

ceedings 0f the Fourth Annual Workshop on Duplicating, Deconstructing, and
Debunking, Madison, WI, USA, June 2005.

[92] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes in C:
The Art of Scientiﬁc Computing. New York, NY, USA: Cambridge University
Press, 1992.

[93] R. Chandra, “Impact of Thermal Analysis on Large Chip Design,” Elec-
tronic Design Process Symposium (EPDS 2005) talk slides, URL: http://www.
gradient-da.com/pdf/ EDP_for.website.pdf , 2005.

[94] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, “The Case for Lifetime
Reliability-Aware Microprocessors,” in Proceedings of the Annual Symposium
on Computer Architecture (ISCA). Washington, DC, USA: IEEE Computer
Society, June 2004, pp. 276—286.

[95] E. W. Weisstein, “Geometric Centroid,” From MathWorld--A Wolfram Web
Resource, http://mathworld.wolfram.com/GeometricCentroid.html.

[96] K. Agarwal, D. Sylvester, D. Blaauw, F. Liu, S. Nassif, and S. Vrudhula, “Vari-
ational Delay Metrics for Interconnect Timing Analysis,” in Proceedings of the
Annual ACM/IEEE Design Automation Conference (DAC). New York, NY,
USA: ACM Press, 2004, pp. 381–384.

[97] P. Bose, “Power- and Reliability-Aware (Integrated) Design: Challenges and
Opportunities,” Talk slides URL: ee.usc.edu/news/dls/talks/bose_presentation.pdf,
Oct. 2005.

[98] K. Sundaresan and N. Mahapatra, “Value-Based Bit Ordering for Energy Op-
timization of On-Chip Global Signal Buses,” in Proceedings of Conference on
Design Automation and Test in Europe (DATE). Leuven, Belgium: European
Design and Automation Association, Mar. 2006, pp. 624–625.

[99] Z. Lu, W. Huang, J. Lach, M. Stan, and K. Skadron, “Interconnect Lifetime
Prediction under Dynamic Stress for Reliability-Aware Design,” in Proceedings
of the International Conference on Computer-Aided Design (ICCAD). Wash-
ington, DC, USA: IEEE Computer Society, Nov. 2004, pp. 327–334.

[100] Q. Zhou and K. Mohanram, “Elmore Model for Energy Estimation in RC
Trees,” in Proceedings of the Annual ACM/IEEE Design Automation Confer-
ence (DAC). New York, NY, USA: ACM Press, July 2006, pp. 965–970.

[101] S. Ramprasad, N. Shanbhag, and I. Hajj, “Information-Theoretic Bounds on
Average Signal Transition Activity,” IEEE Transactions on VLSI Systems,
vol. 7, no. 3, pp. 359–368, Sept. 1999.

[102] R.-B. Lin and C.-M. Tsai, “Theoretical Analysis of Bus-Invert Coding,” IEEE
Transactions on VLSI Systems, vol. 10, no. 6, pp. 929–935, Dec. 2002.

[103] Y. Shin and K. Choi, “Narrow Bus Encoding for Low Power Systems,” in
Proceedings of Asia and South Pacific Design Automation Conference (ASPDAC).
New York, NY, USA: ACM Press, Jan. 2000, pp. 217–220.

[104] P. Sotiriadis and A. Chandrakasan, “Bus Energy Minimization by Transition
Pattern Coding (TPC) Using a Detailed Deep Sub-Micron Bus Model,” in
Proceedings of the International Conference on Computer-Aided Design (ICCAD).
Washington, DC, USA: IEEE Computer Society, Nov. 2001, pp. 322–328.

[105] L. Macchiarulo, E. Macii, and M. Poncino, “Low-Energy Encoding for
Deep-Submicron Address Buses,” in Proceedings of International Symposium on Low
Power Electronics and Design (ISLPED). New York, NY, USA: ACM Press,
2001, pp. 176–181.

[106] L. Macchiarulo, E. Macii, and M. Poncino, “Wire Placement for Crosstalk
Energy Minimization in Address Buses,” in Proceedings of Conference on Design
Automation and Test in Europe (DATE). Washington, DC, USA: IEEE Computer
Society, Mar. 2002, pp. 158–162.

[107] E. Macii, M. Poncino, and S. Salerno, “Combining Wire Swapping and Spacing
for Low-Power Deep-Submicron Buses,” in Proceedings of Great Lakes Symposium
on VLSI (GLSVLSI). New York, NY, USA: ACM Press, 2003, pp. 198–202.

[108] E. Naroska, S.-J. Ruan, and U. Schwiegelshohn, “An Efficient Algorithm for
Simultaneous Wire Permutation, Inversion, and Spacing,” in Proceedings of
International Symposium on Circuits and Systems (ISCAS). Piscataway, NJ,
USA: IEEE Press, May 2005, pp. 109–112.

[109] L. Deng and M. Wong, “Energy Optimization in Memory Address Bus Structure
for Application-Specific Systems,” in Proceedings of Great Lakes Symposium on
VLSI (GLSVLSI). New York, NY, USA: ACM Press, Apr. 2005, pp. 232–237.

[110] F. Wang, Y. Xie, N. Vijaykrishnan, and M. Irwin, “On-Chip Bus Analysis and
Optimization,” in Proceedings of Conference on Design Automation and Test
in Europe (DATE). Leuven, Belgium: European Design and Automation
Association, Mar. 2006, pp. 850–855.

[111] ILOG, Inc., “CPLEX 9.0,” http://www.ilog.com/products/cplex, 2003.

[112] R. Kumar, “Interconnect and Noise Immunity Design for the Pentium 4
Processor,” Intel Technology Journal, 1st Quarter, vol. Q1, 2001.

[113] Berkeley Espresso minimization tool, “Web version,”
http://embsys.technikum-wien.at/espresso/html/espressohtml.

[114] P. Groeneveld, “Wire Ordering for Detailed Routing,” IEEE Design and Test,
vol. 6, no. 6, pp. 6–17, 1989.

[115] M. Marek-Sadowska and M. Sarrafzadeh, “The Crossing Distribution
Problem,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 14, no. 4, pp. 423–433, Apr. 1995.

[116] X. Song and Y. Wang, “On the Crossing Distribution Problem,” ACM Trans-
actions on Design Automation of Electronic Systems, vol. 4, no. 1, pp. 39–51,
1999.

[117] D. Knuth, The Art of Computer Programming. Reading, MA: Addison-Wesley
Longman, 1973.

[118] L. He and K. Lepak, “Simultaneous Shield Insertion and Net Ordering for
Capacitive and Inductive Coupling Minimization,” in Proceedings of the Inter-
national Conference on Computer-Aided Design (ICCAD). Los Alamitos, CA,
USA: IEEE Computer Society Press, 2000, pp. 55–60.

[119] J. Ma and L. He, “Formulae and Applications of Interconnect Estimation Con-
sidering Shield Insertion and Net Ordering,” in Proceedings of the International
Conference on Computer-Aided Design (ICCAD). Piscataway, NJ, USA: IEEE
Press, 2001, pp. 327–332.

[120] P. Sotiriadis and A. Chandrakasan, “Reducing Bus Delay in Sub-Micron Tech-
nology Using Coding,” in Proceedings of Asia and South Pacific Design Au-
tomation Conference (ASPDAC). New York, NY, USA: ACM Press, Jan.
2001, pp. 109–114.

[121] S. Sridhara, A. Ahmed, and N. Shanbhag, “Area and Energy-Efficient Crosstalk
Avoidance Codes for On-Chip Buses,” in Proceedings of International Confer-
ence on Computer Design (ICCD). Washington, DC, USA: IEEE Computer
Society, Oct. 2004, pp. 12–17.

[122] C. Duan and S. Khatri, “Exploiting Crosstalk to Speed up On-Chip Buses,” in
Proceedings of Conference on Design Automation and Test in Europe (DATE).
Washington, DC, USA: IEEE Computer Society, 2004, pp. 20778–20782.

[123] L. Li, N. Vijaykrishnan, M. Kandemir, and M. Irwin, “A Crosstalk Aware
Interconnect with Variable Cycle Transmission,” in Proceedings of Conference
on Design Automation and Test in Europe (DATE). Washington, DC, USA:
IEEE Computer Society, 2004, pp. 10102–10106.
