ARTIFICIAL NEURAL NETWORKS FOR BRANCH PREDICTION

By

Brian Adam Dazsi

AN ABSTRACT OF A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Electrical and Computer Engineering

2001

Dr. Richard Enbody

ABSTRACT

ARTIFICIAL NEURAL NETWORKS FOR BRANCH PREDICTION

By Brian Adam Dazsi

Current microprocessor technology is being enhanced at an amazing rate. According to "Moore's Law", microprocessor transistor count and clock speed double every 18 months. With the speed at which superscalar microprocessors can execute multiple instructions out-of-order, it is imperative that an extremely efficient branch predictor is implemented on the microprocessor, to keep the frequency of recovering from mispredicted instructions low. With current transistor counts in microprocessors being so high, more complex microprocessor components can now be considered, such as multiple pipelines, larger caches, and more complex branch predictors. Artificial Neural Networks have recently been showing amazing usefulness in the areas of pattern recognition. They have also been used rather well for prediction applications. Through the use of C programming, the SPEC95 benchmarks, and a microprocessor simulator called SimpleScalar, this thesis explores the possibility of using artificial neural networks for branch prediction. The feed-forward, back-propagation artificial neural network coded for this thesis did not perform as well as expected; however, the area of Artificial Neural Networks is a rapidly growing field, and this thesis is just the beginning of the possibilities for using Artificial Neural Networks in branch prediction.

For my parents, who have helped me become the man I am today, and my wife, who will continue life's journey with me.

ACKNOWLEDGEMENTS

There are many people throughout the years who have helped me get to where I am today. All of my professors and teachers over the years, and all of my family and friends have made this educational journey a bearable trip. To them I am eternally grateful. There are also a few special people who have helped me with this thesis - the toughest obstacle I had to overcome in all my education. I would like to thank them here.

My thanks go to Dr. Richard Enbody for his extreme patience. There were times that I doubted that I would get this thesis completed, but he never let me think that was an option. I would also like to thank him for his enthusiasm and teaching ability. His Advanced Computer Architecture class was my favorite graduate class - it was the most fun and most rewarding challenge. It instilled the desire to learn and explore further - hence this thesis.

Mark Brehob assisted me with using the SimpleScalar simulator.
His few minutes of help here and there saved me from hours of wild goose chasing when I ran into problems that weren't really problems at all, or had simple solutions. His help is priceless.

I would like to thank Jackie Carlson for her endless support, encouragement and great C reference. Without her help I would not have had all the resources I needed to complete this project. Her time to sit and talk put my mind at ease and helped me get through the day-to-day struggles of this thesis and other things that distracted me from the thesis.

Dr. Fathi Salam introduced me to Artificial Neural Networks. His Neural Network classes were a wonderful learning experience, and helped me develop the idea for this thesis. For that I thank him.

My undying love and thanks goes to my wife, Sara, for putting up with me, and helping me through it all. I could not have done it without her support and love. Together we can accomplish anything.

I am thankful to my parents for absolutely everything. For obvious reasons, none of this would be possible without them; but they have also always been a guiding light for me. They have always been there for me - they have never let me down. They have always been tremendous role models. They have instilled in me the personal characteristics and responsibility that have guided me through my education and this thesis. Without those traits, I could never have overcome all of life's struggles to get this far.

I would like to thank God, for the opportunity to share my talents.

TABLE OF CONTENTS

LIST OF TABLES .................................................. ix
LIST OF FIGURES ................................................. x
INTRODUCTION .................................................... 1
Chapter 1: Branch Prediction .................................... 4
  1.1 Microprocessor Architecture ............................... 4
  1.2 Branch Prediction ......................................... 7
    1.2.1 Two-bit Predictor ..................................... 7
    1.2.2 Two-level Adaptive Branch Prediction .................. 8
    1.2.3 The Branch-Target Buffer .............................. 16
    1.2.4 Hybrid or combinational predictors .................... 17
  1.3 Current Branch Prediction Methods ......................... 18
    1.3.1 AMD K-6 2 ............................................. 18
    1.3.2 Pentium III ........................................... 18
    1.3.3 Power PC 630 .......................................... 18
Chapter 2: Artificial Neural Networks ........................... 19
  2.1 The Neuron ................................................ 19
    2.1.1 Biological Model ...................................... 19
    2.1.2 Mathematical Model .................................... 21
    2.1.3 Activation functions .................................. 22
    2.1.4 Directed Graphs ....................................... 25
  2.2 Learning .................................................. 26
    2.2.1 Description ........................................... 26
    2.2.2 Hebbian Learning ...................................... 27
  2.3 The Perceptron and Multilayer Perceptrons ................. 28
  2.4 Feedforward and Backpropagation ........................... 30
Chapter 3: SimpleScalar ......................................... 33
  3.1 SimpleScalar .............................................. 33
    3.1.1 Software Architecture ................................. 33
    3.1.2 Hardware Architecture ................................. 34
    3.1.3 Instruction Set Architecture .......................... 34
    3.1.4 Running SimpleScalar .................................. 35
    3.1.5 Branch Prediction ..................................... 35
  3.2 Using the SPEC95 benchmarks ............................... 37
Chapter 4: Methodology .......................................... 39
  4.1 Programming the Neural Network ............................ 39
  4.2 Verifying the FFBPANN Code ................................ 41
    4.2.1 The XOR Problem ....................................... 41
    4.2.2 Predicting Sunspots ................................... 43
  4.3 Adding a new predictor to SimpleScalar .................... 44
  4.4 The new predictor ......................................... 45
    4.4.1 Design ................................................ 45
    4.4.2 Training the FFBPANN .................................. 46
Chapter 5: Results and Discussion ............................... 49
  5.1 Training .................................................. 49
  5.2 Discussion ................................................ 50
  5.3 Branch Predictor Results .................................. 59
Chapter 6: Conclusions and Future Work .......................... 64
  6.1 Conclusions ............................................... 64
  6.2 Future Work ............................................... 64
    6.2.1 Other inputs .......................................... 64
    6.2.2 Other Neural Networks and Training methods ............ 65
    6.2.3 BTB ................................................... 69
    6.2.4 State Output .......................................... 69
    6.2.5 Hardware Feasibility .................................. 70
BIBLIOGRAPHY .................................................... 72

LIST OF TABLES

Table 1.1: Two-Level Branch Predictor Variations ................ 9
Table 3.1: Branch Prediction Types .............................. 36
Table 3.2: Branch Prediction Options ............................ 37
Table 3.3: Two-Level Branch Prediction .......................... 37
Table 4.1: Sunspots Test Program Output ......................... 44
Table 5.1: Limitations of Branches Recorded ..................... 51
Table 5.2: Branch Data .......................................... 54
Table 5.3: Branch Predictor Results - 2-level ................... 59
Table 5.4: Branch Predictor Results - Bimodal ................... 60
Table 5.5: Branch Prediction Results - Hybrid ................... 60

LIST OF FIGURES

Figure 1.1: The Post-RISC Architecture .......................... 5
Figure 1.2: Two-bit prediction method states .................... 8
Figure 1.3: Generic Two-Level Branch Prediction ................. 9
Figure 1.4: Global Adaptive Branch Prediction Methods ........... 11
Figure 1.5: Per-Address Adaptive Branch Prediction Methods ...... 13
Figure 1.6: Per-Set Adaptive Branch Prediction Methods .......... 15
Figure 1.7: Branch Target Buffer ................................ 17
Figure 2.1: A Neuron ............................................ 19
Figure 2.2: Model of a Neuron ................................... 22
Figure 2.3: Activation Functions ................................ 24
Figure 2.4: Signal-Flow Graph of a Neuron ....................... 25
Figure 2.5: Architectural Graph of a Neuron ..................... 26
Figure 2.6: The Taxonomy of Learning ............................ 27
Figure 2.7: A Single Layer Perceptron ........................... 29
Figure 2.8: Multilayer Perceptron ............................... 30
Figure 2.9: Signals in a Multilayer Perceptron .................. 31
Figure 3.1: SimpleScalar Software Architecture .................. 33
Figure 3.2: Out-of-Order Issue Architecture ..................... 34
Figure 3.3: Instruction Format .................................. 35
Figure 3.4: Two-level Predictor Layout .......................... 36
Figure 4.1: XOR Classifications ................................. 42
Figure 4.2: Multi-layer network for solving the XOR Problem ..... 43
Figure 4.3: FFBPANN Structure ................................... 46
Figure 5.1: Mgrid Training Progress ............................. 52
Figure 5.2: Mgrid Training Error ................................ 53
Figure 5.3: Mgrid Training Progress Zoom ........................ 54
Figure 5.4: Gcc Training Progress ............................... 56
Figure 5.5: Go Training Progress ................................ 57
Figure 5.6: "Opposite" Gcc Training Progress .................... 58
Figure 5.7: Prediction Rate by Predictor ........................ 61
Figure 5.8: Prediction Rate by Simulation ....................... 62
Figure 5.9: Overall Predictor Performance ....................... 63
Figure 6.1: The taxonomy of learning ............................ 65
Figure 6.2: Recurrent network ................................... 66
Figure 6.3: Boltzmann Machine ................................... 68
Figure 6.4: Competitive learning network ........................ 69

INTRODUCTION

Background

The number of transistors in a microprocessor is growing at an enormous rate, and how to use all those transistors is a topic of much debate. The number of, and specialization of, the execution units in the microprocessor pipeline are increasing. As a result, out-of-order instruction processing is standard practice.
Since so many instructions are being fetched and executed out-of-order, when a branch instruction is encountered it is important that the next instruction fetched is really the instruction that would be fetched if the correct answer to the branch decision were already known (at least two clock cycles are needed in order to process a branch instruction and calculate the branch decision). If the wrong path is chosen, then instructions start being executed that should not be executed, and the microprocessor will have to recover from executing those invalid instructions. Therefore, it is imperative to have a good branch predictor.

Artificial Neural Networks are becoming more useful in the areas of pattern recognition and prediction. ANNs are starting to be used alongside the standard statistical prediction models used for years in the fields of finance and marketing. The ANNs are performing as well as, if not better than, the statistical models [West]. Because of their success in these fields, an Artificial Neural Network might perform well as a microprocessor branch predictor.

Goals

When this thesis was started, there was no readily available documentation published about using a neural network for branch prediction. The goal of this thesis was to obtain a working simulation to compare a neural network branch predictor with current branch prediction technology. In order to achieve that goal, four tasks were established:

• Develop an Artificial Neural Network
  Before an ANN could be used for branch prediction, a set of tools and structures needed to be defined. The ANN was to be programmed in C, so that it could be incorporated into SimpleScalar.

• Modify the SimpleScalar simulator to accept an ANN branch predictor
  To gather branch predictor performance data and add another predictor to the simulator, the coding for SimpleScalar needed to be explored.

• Train the Neural Network
  Training data was to be gathered from a normal run of SimpleScalar and used to train the neural network. In order to obtain the working simulation in a reasonable amount of time, a feed-forward, back-propagation neural network using Hebbian learning would be used.

• Evaluate the ANN branch predictor
  Finally, information about branch predictor performance was to be gathered by running SimpleScalar for each branch predictor. The branch predictor performance could then be plotted and examined.

Thesis Organization

The rest of the thesis is organized as follows. Chapter 1 discusses the background behind Branch Prediction. The Post-RISC architecture is discussed to show the importance of branch prediction in the microprocessor pipeline. Artificial Neural Networks are discussed in Chapter 2. A brief introduction to neural networks from a biological and mathematical perspective is given, and the Hebbian learning algorithm is described. Chapter 3 briefly discusses the SimpleScalar microprocessor simulator, its components and how it was utilized in this project. Chapter 4 discusses the methodologies used to add a new branch predictor to SimpleScalar and to program, verify and train the neural network. Chapter 5 examines the results of the neural network training and the branch predictors' performance. Finally, the possibilities of future work using Artificial Neural Networks for Branch Prediction are examined in Chapter 6.

Chapter 1: Branch Prediction

1.1 Microprocessor Architecture

To understand the importance of branch prediction, an examination of the overall microprocessor must be done.
Most microprocessors today are of a superscalar design, meaning that they execute multiple instructions at once. The Post-RISC architecture [Brehob] is worth a brief examination since most current microprocessors share much in common with this superscalar architecture [Hsieh]. Figure 1.1 [Brehob] shows the generic layout of a Post-RISC processor pipeline.

Figure 1.1: The Post-RISC Architecture

From examining Figure 1.1, there are six important steps in the processor pipeline. First, instructions are read from memory in the Predecode stage and stored into the Instruction Cache (I-Cache). In the superscalar architecture, multiple instructions are read into the cache at one time (for most current processors, four instructions are read [Hsieh]). As a part of predecoding, extra bits are appended to the instructions in order to assist in decoding in a later stage. During the Fetch/Flow stage of the pipeline, which instructions are fetched from the I-Cache is decided. However, it is not until the third pipeline stage, the Decode/Branch stage, that a prediction on a branch instruction is actually made. At this point, "not taken" branches are discarded from the pipeline, and decoded instructions are passed on to the Instruction Dispatch and Reorder Buffer stage.

In the Instruction Dispatch and Reorder Buffer stage, instructions are queued up and wait to move on to an available execution unit in stage five. Examples of execution units are load/store units, branch units, floating point units and arithmetic logic units. Which types, and how many of each, of these execution units are provided is decided by the designers of the microprocessor. At this point in stage five, after the appropriate calculations are done for a branch instruction, the Fetch/Flow stage of the pipeline is informed of branch mispredictions so that it can start recovering from mispredictions. The Branch/Decode stage is also informed of mispredictions so that it can update its prediction methodology (typically, updating of tables or registers). After an instruction is successfully executed it is sent on to the Completed Instruction Buffer (stage six), and the Retire Unit updates the state of registers based on the completed instruction (usually at a rate equal to that of Predecode stage instruction fetching). Instructions could be waiting in the Completed Instruction Buffer until information about a branch instruction becomes available so that they can be retired or erased. Figure 1.1 shows that branch prediction affects multiple stages of the pipeline.

1.2 Branch Prediction

An examination of the superscalar structure presented in Section 1.1 shows the importance of a good branch predictor. With multiple instructions being executed out-of-order in parallel, recovering from a misprediction can be extremely difficult and can cost valuable processor time and resources. Instructions that have been placed in the pipeline that should not have been, have to be nullified, and the correct instructions have to start being fetched. This recovery period is referred to as a stall. In order to prevent misprediction stalls, a good branch predictor is needed.

A few easy methods exist to provide branch prediction. First, a branch could always be assumed "taken" or always be assumed "not taken".
In these static cases, the logic for the prediction is extremely simple, but it also allows for an average 50% misprediction rate. This is just not acceptable. A more dynamic approach is to use a branch prediction buffer, which can be used to keep track of prediction history. In the simplest form, this history table is referenced by the lower portion of the branch instruction address and uses a 1-bit counter to record whether the branch was recently "taken" or not. This simple method also does not provide for very accurate predictions [Hennessy]. However, a slight modification by adding another bit helps overcome the one-bit method's shortcoming.

1.2.1 Two-bit Predictor

In this method, two bits are used to keep track of the prediction history. Only after two consecutive mispredictions is the prediction state changed from predict taken to predict not taken. Figure 1.2 [Hennessy] shows how this is achieved in a state diagram.

Figure 1.2: Two-bit prediction method states

While this method provides fairly good prediction accuracy, it is still not as ideal as current technology needs, nor as good as current technology can provide.

1.2.2 Two-level Adaptive Branch Prediction

This prediction method, introduced by Tse-Yu Yeh and Yale N. Patt, is one of the most successful prediction methods used today. It provides the most accurate predictions [Yeh93]. In the two-level prediction scheme, two levels of branch history are stored. In the first level, a record of the last n branches is stored in the Branch History Register (BHR). In the second level, the branch behavior of the last x occurrences of a specific pattern in the BHR is recorded. The second-level table is called the Pattern History Table (PHT). Figure 1.3 [Driesen] shows the general layout of the two-level method.

Figure 1.3: Generic Two-Level Branch Prediction

There are three manners in which information is kept in these tables: globally, per-address and per-subset (which was introduced by Pan, So and Rahmeh [Yeh93]). When referring to the first-level branch history, these methods are labeled as GA, PA, SA, respectively. When referring to the Pattern History Table, these methods are labeled as g, p, and s, respectively. This gives rise to nine different variations of the Two-Level Adaptive Branch Predictor, as shown in Table 1.1 [Yeh93].

Table 1.1: Two-Level Branch Predictor Variations

Variation   Description
GAg         Global Adaptive Branch Prediction using one global pattern history table.
GAp         Global Adaptive Branch Prediction using per-address pattern history tables.
GAs         Global Adaptive Branch Prediction using per-set pattern history tables.
PAg         Per-Address Adaptive Branch Prediction using one global pattern history table.
PAp         Per-Address Adaptive Branch Prediction using per-address pattern history tables.
PAs         Per-Address Adaptive Branch Prediction using per-set pattern history tables.
SAg         Per-Set Adaptive Branch Prediction using one global pattern history table.
SAp         Per-Set Adaptive Branch Prediction using per-address pattern history tables.
SAs         Per-Set Adaptive Branch Prediction using per-set pattern history tables.

Figure 1.4, Figure 1.5, and Figure 1.6 show how the pattern history tables are referenced in the variations of each type of Two-Level Adaptive Branch Prediction.
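Before examining each variation, a minimal sketch of the simplest case may help make the mechanism concrete. The C fragment below implements a GAg-style predictor: a single global history register indexes a pattern history table of two-bit saturating counters like those of Section 1.2.1. It is purely illustrative and does not correspond to the code of any particular processor or of this thesis; a per-address variant (PAg or PAp) would replace the single history register with a table of history registers selected by the branch address.

    /* Illustrative GAg-style two-level predictor sketch (not from the thesis).
     * A k-bit global history register indexes a table of 2-bit counters. */
    #include <stdint.h>

    #define HIST_BITS 12
    #define PHT_SIZE  (1u << HIST_BITS)

    static uint32_t ghr;               /* global branch history register    */
    static uint8_t  pht[PHT_SIZE];     /* pattern history table (2-bit each) */

    /* Predict: counter values 2 and 3 mean "taken", 0 and 1 mean "not taken". */
    int predict(void)
    {
        return pht[ghr & (PHT_SIZE - 1)] >= 2;
    }

    /* Update once the real outcome is known: saturate the counter toward the
     * outcome, then shift the outcome into the global history register. */
    void update(int taken)
    {
        uint8_t *ctr = &pht[ghr & (PHT_SIZE - 1)];

        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;

        ghr = (ghr << 1) | (taken ? 1 : 0);
    }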
Figure 1.4: Global Adaptive Branch Prediction Methods

In the Global Adaptive Branch Prediction methods, the global history of the last k branches is recorded. Therefore, the history of all branches influences each prediction, since each prediction uses the same history register [Yeh93].

Figure 1.5: Per-Address Adaptive Branch Prediction Methods

In the Per-Address Adaptive Branch Prediction methods, the first-level branch history accesses the pattern history of the last k occurrences of the same branch address [Yeh93]. Each prediction is, therefore, only influenced by the history of its branch address, and not by other branches.

Figure 1.6: Per-Set Adaptive Branch Prediction Methods

For per-subset references, the set attribute can be any of the following: the branch opcode, the branch class (which is assigned by the compiler), or the branch address. In this way, the behavior of a branch affects other branches of the same type, or subset. The first level of Per-Set Adaptive Branch Prediction refers to the last k occurrences of a branch with the same set attribute. The second level can be accessed either globally, by a subset of a set, or by branch address [Yeh93].

1.2.3 The Branch-Target Buffer

To further improve prediction methods, the branch-target buffer (or cache) is introduced. In the BTB, the branch address and the respective predicted address (or branch target) are stored. The branch address is used as a reference into the BTB to predict what the branch target should be. If the branch address is not in the table, then the branch is predicted "not taken". Using a BTB in combination with the Two-bit predictor increases prediction accuracy significantly [Driesen].

Figure 1.7: Branch Target Buffer

1.2.4 Hybrid or combinational predictors

Another solution, when using more resources (transistors) on a microprocessor chip for branch prediction, is to combine multiple dynamic branch predictors [Driesen]. When using two predictors at the same time, a special Branch Predictor Selection Table, or metapredictor, must be used to choose which predictor to use for a given situation. This metapredictor is similar to the two-bit predictor discussed above. It uses a two-bit counter to keep track of which predictor is more accurate for a specific branch address. There are a couple of variations of this metapredictor. One variation uses run-time or compile-time information to choose which predictor outcome to use. Another variation is to use an n-bit confidence counter to keep track of the success rate of a predictor over the last 2^(n-1) prediction requests [Driesen].

1.3 Current Branch Prediction Methods

1.3.1 AMD K-6 2

AMD's K-6 2 microprocessor contains a 2-level branch predictor with an 8192-entry Branch History Table, a 16-entry BTC (branch target cache), and a 16-entry RAS (return address stack).

1.3.2 Pentium III

The branch predictor methodology in Intel's Pentium III microprocessor is referred to as "deep" branch prediction. It contains a 512-entry Branch-Target Buffer.

1.3.3 Power PC 630

Motorola's Power PC 630 uses a Branch-Target Buffer and a two-bit predictor.

Chapter 2: Artificial Neural Networks

2.1 The Neuron

2.1.1 Biological Model

The basic model of the neuron is founded upon the functionality of a biological neuron. "Neurons are the basic signaling units of the nervous system" and "each neuron is a discrete cell whose several processes arise from its cell body" [Kandel].
Figure 2.1: A Neuron (cell body, dendrites, and axon)

The neuron has four main regions to its structure. The cell body, or soma, has two offshoots from it: the dendrites, and the axon, which ends in presynaptic terminals. The cell body is the heart of the cell, containing the nucleus and maintaining protein synthesis. A neuron may have many dendrites, which branch out in a treelike structure and receive signals from other neurons. A neuron usually has only one axon, which grows out from a part of the cell body called the axon hillock. The axon conducts electric signals generated at the axon hillock down its length. These electric signals are called action potentials. The other end of the axon may split into several branches, which end in a presynaptic terminal.

Action potentials are the electric signals that neurons use to convey information to the brain. All these signals are identical. Therefore, the brain determines what type of information is being received based on the path that the signal took. The brain analyzes the patterns of signals being sent, and from that information it can interpret the type of information being received.

Myelin is the fatty tissue that surrounds and insulates the axon. Often short axons do not need this insulation. There are uninsulated parts of the axon. These areas are called Nodes of Ranvier. At these nodes, the signal traveling down the axon is regenerated. This ensures that the signal traveling down the axon travels fast and remains constant (i.e. very short propagation delay and no weakening of the signal).

The synapse is the area of contact between two neurons. The neurons do not actually physically touch. They are separated by the synaptic cleft, and electric signals are sent through chemical interaction. The neuron sending the signal is called the presynaptic cell and the neuron receiving the signal is called the postsynaptic cell. The signals are generated by the membrane potential, which is based on the differences in concentration of sodium and potassium ions inside and outside the cell membrane.

Neurons can be classified by their number of processes (or appendages), or by their function. If they are classified by the number of processes, they fall into three categories. Unipolar neurons have a single process (dendrites and axon are located on the same stem), and are most common in invertebrates. In bipolar neurons, the dendrite and axon are the neuron's two separate processes. Bipolar neurons have a subclass called pseudo-bipolar neurons, which are used to send sensory information to the spinal cord. Finally, multipolar neurons are most common in mammals. Examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells (in the cerebellum). If classified by function, neurons again fall into three separate categories. The first group is sensory, or afferent, neurons, which provide information for perception and motor coordination. The second group provides information (or instructions) to muscles and glands and is therefore called motor neurons. The last group, interneuronal, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, have long axons and connect different parts of the brain. The other group, called local interneurons, are only used in local circuits.

2.1.2 Mathematical Model

When creating a functional model of the biological neuron, there are three basic components of importance. First, the synapses of the neuron are modeled as weights.
The strength of the connection between an input and a neuron is noted by the value of the weight. Negative weight values reflect inhibitory connections, while positive values designate excitatory connections [Haykin]. The next two components model the actual activity within the neuron cell. An adder sums up all the inputs modified by their respective weights. This activity is referred to as linear combination. Finally, an activation function controls the amplitude of the output of the neuron. An acceptable range of output is usually between 0 and 1, or -1 and 1. Mathematically, this process is described in Figure 2.2 [Haykin].

Figure 2.2: Model of a Neuron (input signals x_1, ..., x_p weighted by the synaptic weights w_k1, ..., w_kp, a bias b_k = w_k0 on a fixed input x_0, a summing junction producing v_k, and an activation function with threshold producing the output y_k)

From this model, the internal activity of the neuron can be shown to be:

    v_k = \sum_{j=1}^{p} w_{kj} x_j                                              (2.1)

The output of the neuron, y_k, would therefore be the outcome of some activation function applied to the value of v_k.

2.1.3 Activation functions

As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron is between certain values (usually 0 and 1, or -1 and 1). In general, there are three types of activation functions, denoted by \varphi(\cdot) in Figure 2.2. First, there is the Threshold Function, which takes on a value of 0 if the summed input is less than a certain threshold value and the value 1 if the summed input is greater than or equal to the threshold value.

    \varphi(v) = \begin{cases} 1, & v \ge 0 \\ 0, & v < 0 \end{cases}            (2.2)

Secondly, there is the Piecewise-Linear Function, which again can take on the values 0 or 1, but can also take on values in between, depending on the amplification factor in a region of linear operation.

    \varphi(v) = \begin{cases} 1, & v \ge \tfrac{1}{2} \\ v, & -\tfrac{1}{2} < v < \tfrac{1}{2} \\ 0, & v \le -\tfrac{1}{2} \end{cases}            (2.3)

Thirdly, there is the sigmoid function. This function can range between 0 and 1, but it is also sometimes useful to use the -1 to 1 range. An example of the sigmoid function is the hyperbolic tangent function.

    \varphi(v) = \tanh\left(\tfrac{v}{2}\right) = \frac{1 - \exp(-v)}{1 + \exp(-v)}            (2.4)

Figure 2.3 shows these activation functions plotted [Haykin].

Figure 2.3: Activation Functions (threshold, piecewise-linear, and sigmoid functions)

In Figure 2.2, two special values should be noted: the bias and the threshold. These are optional values that can be used, determined by the specific situation for which the neural network will be used. The bias can be eliminated from the network by making the bias weight, w_k0, equal to zero. The threshold value can also be set equal to zero to eliminate its effect on the network. For obvious reasons, if a Threshold Activation Function is not being used, the threshold value is not needed.

2.1.4 Directed Graphs

The neuron model described in Section 2.1.2 can be described in two other manners in order to simplify discussions. First, there is the signal-flow graph of a neuron, shown in Figure 2.4 [Haykin], where the symbols used are simplified.

Figure 2.4: Signal-Flow Graph of a Neuron

An architectural graph can also be used to show the general layout of a neuron, as displayed in Figure 2.5 [Haykin]. This is useful when describing multilayer networks.

Figure 2.5: Architectural Graph of a Neuron

2.2 Learning

2.2.1 Description

One of the most important aspects of Artificial Neural Networks is the ability to learn and therefore update and improve performance.
The definition of this neural network learning according to Haykin is:

    Learning is a process by which the free parameters of a neural network are adapted through a continuing process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Haykin goes on to state three important events that occur during learning:

1. The neural network is stimulated by an environment.
2. The neural network undergoes changes as a result of this stimulation.
3. The neural network responds in a new way to the environment, because of the changes that have occurred in its internal structure.

Overall, the updating of the neural network is done by changing (or updating) the interconnected weights between neurons. Mathematically, this is described as follows:

    w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)                                    (2.5)

How the update, or \Delta w, is calculated is determined by a set of rules applied within a certain paradigm. A general overview of the Artificial Neural Network learning process is shown in Figure 2.6 [Haykin].

Figure 2.6: The Taxonomy of Learning (learning algorithms: error-correction, Boltzmann, Thorndike's law of effects, Hebbian, and competitive learning; learning paradigms: supervised, reinforcement, and self-organized (unsupervised) learning)

Learning algorithms are a set of rules by which the network is updated. The learning paradigms are the styles or methodologies in which the learning algorithm is performed. There are many different styles of learning, and new algorithms are continuously being explored and developed. For this thesis, Hebbian learning will be discussed.

2.2.2 Hebbian Learning

Hebbian learning is one of the oldest and most popular learning methods today [Haykin]. It derives its methodology from the idea that the more often a synapse between two neurons is fired, the stronger the connection between the two neurons becomes. Expanding that statement, it can be said that when the connection between two neurons is excited simultaneously (or synchronously), the connection is strengthened, and when the connection is excited asynchronously, the connection is weakened [Haykin].

Mathematically, the change of weight in Hebbian learning is a function of both presynaptic (input) and postsynaptic (output) activities. In defining the weight update, a parameter called the learning rate is introduced. This learning rate, a positive value, defines the speed at which the weight changes. In order to keep the weight from growing too fast (which may lead to saturation), another parameter called the forgetting factor is introduced (this is also a positive value). Finally, a formula for the weight update for Hebbian learning can be defined as:

    \Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n) - \alpha\, y_k(n)\, w_{kj}(n)       (2.6)

2.3 The Perceptron and Multilayer Perceptrons

The perceptron is the simplest form of a neural network, used for the classification of linearly separable patterns [Haykin]. Its structure can be easily shown using the signal-flow graph described above, since a single layer perceptron operates as a single neuron. Figure 2.7 shows this [Haykin].

Figure 2.7: A Single Layer Perceptron

Multilayer perceptrons are an expansion of the perceptron idea, and can be used to solve much more difficult problems. They consist of an input layer, one or more hidden layers and an output layer.
The hidden layers give the network its power and allow it to extract extra features from the input. Figure 2.8 shows a multilayer perceptron with two hidden layers [Haykin].

Figure 2.8: Multilayer Perceptron (input layer, first and second hidden layers, and output layer)

One of the most popular methods used in training a multilayer perceptron is the error back-propagation method, which includes two passes through each layer of the network: a forward pass and a backward pass [Haykin].

2.4 Feedforward and Backpropagation

In error back-propagation, there are two types of signal flow. First, signals are calculated in a forward manner, from left to right, through each layer. These are called function signals. Second, the error at each neuron is calculated, from right to left, in a backward manner through each layer. These are called error signals.

Figure 2.9: Signals in a Multilayer Perceptron (function signals flow forward; error signals flow backward)

The goal of the error back-propagation method is to make the average squared error of the entire network as small as possible. Once the function signals have been calculated, the error can be back-propagated through the network. The error is calculated using the local gradient, \delta_j(n) [Haykin]. At the output layer, the local gradient is simply defined as:

    \delta_j(n) = e_j(n)\, \varphi_j'(v_j(n))                                     (2.7)

When calculating the error for a neuron in a hidden layer, the local gradient is defined as:

    \delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)               (2.8)

[...]

The simulator options include the branch prediction configuration.

3.1.5 Branch Prediction

SimpleScalar supports a variety of branch prediction methods. It allows for the static always "taken" and always "not taken" prediction methods. It also has options for the Two-Level Adaptive Branch Predictors. It supports a Two-bit Predictor, referred to as a bimodal predictor. SimpleScalar even supports a hybrid predictor, a combination of the two-level and bimodal predictors, using a metapredictor to choose between the two predictors. SimpleScalar also utilizes a branch address predictor through a Branch Target Buffer. The command line option for specifying the branch predictor is -bpred <type>. The types are as described in Table 3.1.

Table 3.1: Branch Prediction Types

Predictor option   Description
nottaken           always predict not taken
taken              always predict taken
perfect            perfect predictor
bimod              bimodal predictor: BTB with 2-bit counters
2lev               two-level adaptive predictor
comb               hybrid predictor: combination of 2-level and bimodal

Figure 3.4 shows the two-level adaptive branch predictor layout.

Figure 3.4: Two-level Predictor Layout

The configuration options for each predictor are described in Table 3.2.

Table 3.2: Branch Prediction Options

Configuration parameter   Description
-bpred:bimod              the size of the direct-mapped BTB
-bpred:2lev               l1size - the size of the first-level table (number of shift registers); l2size - the size of the second-level table (number of counters); hist_size - the history pattern width (width of the shift registers)
-bpred:comb               the number of entries in the metapredictor table

Table 3.3 shows how specific 2-level design options may be implemented in the SimpleScalar simulator. Note that each counter has one BTB entry.

Table 3.3: Two-Level Branch Prediction

Type            l1size   l2size            hist_size
counter based   1        m                 0
GAg             1        2^w               w
GAp             1        m (m > 2^w)       w
PAg             n        2^w               w
PAp             n        m (m = 2^(N+w))   w

When selecting the hybrid predictor, both the bimod and 2lev configuration options may be specified.
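As a concrete illustration of Tables 3.1 through 3.3, the two hypothetical invocations below select a bimodal predictor and a GAg-style two-level predictor (l1size = 1, l2size = 2^8 = 256, hist_size = 8). The table sizes shown are illustrative values rather than settings used in this thesis, and the exact option syntax may differ between SimpleScalar releases:

    sim-outorder -bpred bimod -bpred:bimod 2048 <benchmark> <benchmark arguments>
    sim-outorder -bpred 2lev -bpred:2lev 1 256 8 <benchmark> <benchmark arguments>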
3.2 Using the SPEC95 benchmarks

Available with the SimpleScalar simulator are SimpleScalar binaries of the SPEC95 benchmark suite. These binaries are useless without the input files, which are available only by purchasing the license to use the benchmarks. In conjunction with a PERL script, Run.pl, written by Dirk Grunwald and Artur Klauser, the SPEC95 benchmarks can be run on the SimpleScalar simulator. This thesis uses the following SPECint95 benchmarks to test the branch predictors: 099.go, 126.gcc, 130.li, 134.perl; and the following SPECfp95 benchmarks: 101.tomcatv, 107.mgrid, 145.fpppp. The command-line implementation used by Run.pl for running the SPEC95 benchmarks on the SimpleScalar architecture is as follows: sim-outorder < > .

Chapter 4: Methodology

4.1 Programming the Neural Network

To more easily integrate the ANN code into SimpleScalar, the neural network was programmed in C. Also, the error subroutines used in the SimpleScalar package were used in programming the neural network (specifically the fatal subroutine for run-time error messages, which is included in the misc.h and misc.c files). The two files that define the feed-forward back-propagation artificial neural network code are a header file and a C code file, ffbpann.h and ffbpann.c.

Three structures were created to define the overall design of the FFBPANN: bounds, layer and ffbpann. Bounds is a simple template of two integers that define the minimum and maximum number that will be presented as input to the FFBPANN. These values were used to normalize the inputs. Layer is a more complicated structure that is used in defining the overall FFBPANN structure. A layer contains a variable for the number of nodes in the layer, an array for the error calculations of each node, an array for the output calculations of each node, a multi-dimensional array for the weights, a multi-dimensional array for saved weights (used during training of the FFBPANN), and a multi-dimensional array of deltas (used during back-propagation). The ffbpann structure itself holds the number of layers in the FFBPANN, the momentum, learning rate and sigmoid function gain (all used for training of the FFBPANN), and the net error calculated for the ANN. It also has a pointer to the array of input bounds, a pointer to the input layer, a pointer to the output layer, and a pointer to an array of layers (described above).

Several functions were defined so that the FFBPANN would behave properly. Create_ffbpann allocates memory for the entire FFBPANN, and ffbpann_free frees up the memory allocated for the FFBPANN. Init_ffbpann initializes the FFBPANN weights either to random numbers or to predefined weights (pre-trained values saved in a text file). Nnet_random is used to calculate the random values used when initializing a weight matrix. Calc_ffbpann calculates the output of the FFBPANN given the current weight matrices. Backprop performs the back-propagation algorithm of the FFBPANN, and is usually followed by the updateweights function, which calculates the new weight values based on the recent back-propagation. Restore_weights and save_weights are used during training to either revert to the last saved weight matrices or save the currently calculated matrices, depending on how successful the current epoch of training was. Dump_weights is the function used to write the current weight matrix to a file.
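The sketch below shows one way the three structures described above might be declared in C. The field names and types are illustrative guesses based on the descriptions in this section, not the actual declarations from ffbpann.h (see Appendix B of MSU-CSE-01-22 for those); dump_weights would serialize the weight arrays of such a structure into the weights file whose format is described next.

    /* Illustrative sketch only: field names and types are assumptions based on
     * Section 4.1, not the real ffbpann.h declarations. */
    typedef struct {
        int min;                  /* smallest raw value expected on this input */
        int max;                  /* largest raw value expected on this input  */
    } bounds;

    typedef struct {
        int      nodes;           /* number of nodes in this layer                */
        double  *error;           /* per-node error terms from back-propagation   */
        double  *output;          /* per-node outputs from the forward pass       */
        double **weight;          /* weight[i][j]: weight from input j to node i  */
        double **saveweight;      /* saved copy of the weights, used in training  */
        double **delta;           /* weight deltas used by back-propagation       */
    } layer;

    typedef struct {
        int     nlayers;          /* number of layers in the network              */
        double  alpha, eta, gain; /* momentum, learning rate, sigmoid gain        */
        double  error;            /* net error of the most recent pass            */
        bounds *inbounds;         /* per-input normalization bounds               */
        layer  *layers;           /* array of layers                              */
        layer  *input, *output;   /* pointers to the first and last layers        */
    } ffbpann;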
The format of this weights file is as follows: the bounds for each input node are at the beginning of the file (one pair per line); each subsequent line contains a weight value, starting with the first node of the first layer, then the second node of the first layer, and continuing until the last node of the last (output) layer. Train_ffbpann and learn_ffbpann are short sample functions for setting up training and testing routines.

To accommodate testing and training, small C programs were used as "wrappers" around the four main files used for the FFBPANN. The layer and node parameters are defined in the "wrapper" program so that various FFBPANN configurations could be tested. Examples of these programs can be found in Appendix B of technical report MSU-CSE-01-22, and are discussed below.

4.2 Verifying the FFBPANN Code

4.2.1 The XOR Problem

To verify that the FFBPANN code was working properly, the standard XOR (exclusive OR) test was used. The XOR test is a very specific application of a common problem of a single-layer perceptron. A single-layer perceptron is incapable of classifying input patterns that are not linearly separable, as is the case in the XOR problem. To overcome this limitation, a multi-layer neural network is used to classify the XOR inputs, as shown in Figure 4.1 and Figure 4.2 [Haykin].

Figure 4.1: XOR Classifications ((a) decision boundary constructed by hidden neuron 1; (b) decision boundary constructed by hidden neuron 2; (c) decision boundaries constructed by the complete network)

Figure 4.2: Multi-layer network for solving the XOR Problem

A 2x1 FFBPANN was trained using the very small data set of the XOR function. The weights were saved in a file, and another short program was used to verify that the saved weights were correct to solve the XOR problem. The training and test programs can be found in Appendix B of technical report MSU-CSE-01-22. The output from the testing program is as follows:

    inputs: 0.000000 0.000000    output: 0.000068
    inputs: 0.000000 1.000000    output: 0.999886
    inputs: 1.000000 0.000000    output: 0.999923
    inputs: 1.000000 1.000000    output: 0.000093

4.2.2 Predicting Sunspots

A second test was used to verify that the FFBPANN used for this experiment could indeed do prediction. A data set of sunspot levels was discovered [Kutza] and was used to train and test a 10x1 FFBPANN. The inputs to the FFBPANN were the sunspot levels from the previous 10 years, and the output is the prediction of the sunspot level for the current year. The code used can be found in Appendix B of technical report MSU-CSE-01-22. The output of the test program follows in Table 4.1.

Table 4.1: Sunspots Test Program Output

Year   Sunspots   Prediction
1960   0.587      0.547
1961   0.282      0.306
1962   0.196      0.048
1963   0.146      0.082
1964   0.053      0.183
1965   0.079      0.153
1966   0.246      0.174
1967   0.491      0.454
1968   0.554      0.620
1969   0.552      0.533
1970   0.546      0.431
1971   0.348      0.473
1972   0.360      0.157
1973   0.199      0.209
1974   0.180      0.121
1975   0.081      0.099
1976   0.066      0.047
1977   0.143      0.076
1978   0.484      0.313
1979   0.813      0.615

4.3 Adding a new predictor to SimpleScalar

To add a new predictor to SimpleScalar, the following files in the package had to be modified: bpred.c, bpred.h, sim-outorder.c, sim-bpred.c and the Makefile.
The Makefile was modified so that the extra FFBPANN code was included during compilation. The modifications to sim-outorder.c and sim-bpred.c accomplished three tasks: firstly, to send additional information (variables) to the branch predictor functions specifically for use by the FFBPANN predictor; secondly, to define the FFBPANN predictor command line options and how the predictor is called; and lastly, to display the usage information for the command line options. The majority of the modifications were made to the bpred.h and bpred.c files. The predictor structure was defined in bpred.h, and the predictor functions were coded in detail in bpred.c. For more detailed information, see Appendix C of technical report MSU-CSE-01-22 for the code listing.

The procedure for adding the new predictor was to add another hybrid predictor to the code, so that there existed two hybrid predictors, but the second one possessed a different name (the FFBPANN predictor). Then, slowly, the code for the 2-level component of the new hybrid predictor was modified to accommodate the FFBPANN predictor code. See the code listing in Appendix C of technical report MSU-CSE-01-22 for more detailed information, and Section 4.4.1 for a discussion of the predictor design.

4.4 The new predictor

4.4.1 Design

The overall structure of the new predictor (the FFBPANN predictor) is similar to the Combination (or Hybrid) predictor. In the FFBPANN predictor, a two-bit predictor and a purely artificial neural network predictor are used in combination, with a metapredictor used to choose between the outputs of the two predictors.

The purely artificial neural network predictor is the feed-forward back-propagating artificial neural network described in Section 4.1. There are four inputs to the network and one output. The inputs are the following variables in the SimpleScalar program: the branch address, the opcode portion of the instruction, and the RS and RT registers. The single output is the branch decision ("taken" or "not taken"). The number of neurons in each layer can be configured at runtime, and there may be from 1 to 4 hidden layers (the input and output layers are, of course, required). The only prerequisite for selecting a certain configuration at runtime is that a weights file must exist for the ANN to function. The file name of the weights file is comprised of the layer configuration followed by the string ".weights.dump". For example, a 4x3x2x1 FFBPANN would have a weights matrix file named 4x3x2x1.weights.dump. This weights file can be a set of randomly generated numbers or a set of trained data. It is recommended that a set of trained data be used. This file can be created using the dump_weights function. The FFBPANN will adjust, or update, its weights as SimpleScalar runs, and therefore improve its predictions. Figure 4.3 shows a simple network diagram of the ANN used in this scenario.

Figure 4.3: FFBPANN Structure (four input nodes - branch address, opcode, RS and RT - feeding the hidden layers and a single branch-decision output)

4.4.2 Training the FFBPANN

To obtain a data set to train the FFBPANN for branch prediction, the SimpleScalar code was modified to output branch information to a file while it ran a simulation. The branch information consists of the data available to the branch predictor function that could be used as input for the FFBPANN (which was the aforementioned branch address, the opcode portion of the instruction, and the RS and RT registers), and the correct prediction.
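To make the data flow concrete, the fragment below sketches how the four inputs and the correct outcome might be recorded during a simulation run, and how a trained network's single output could be turned into a "taken"/"not taken" decision. The record layout, the text format and the 0.5 threshold are assumptions for illustration; the actual modifications are those listed in Appendix C of MSU-CSE-01-22.

    /* Illustrative sketch only: the field names, trace format and decision
     * threshold are assumptions, not the thesis code. */
    #include <stdio.h>

    struct branch_record {
        unsigned long addr;     /* branch instruction address        */
        unsigned long opcode;   /* opcode portion of the instruction */
        unsigned long rs, rt;   /* RS and RT register fields         */
        int taken;              /* correct outcome: 1 taken, 0 not   */
    };

    /* Append one training example (four inputs plus the correct outcome). */
    static void log_branch(FILE *f, const struct branch_record *r)
    {
        fprintf(f, "%lu %lu %lu %lu %d\n",
                r->addr, r->opcode, r->rs, r->rt, r->taken);
    }

    /* Map the network's continuous output (0..1) to a branch decision. */
    static int decide(double net_output)
    {
        return net_output >= 0.5;   /* 1 = predict "taken" */
    }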
Wrapper programs, discussed in Section 4.1, were written to train and test a weight matrix using this branch information. The training data was divided into three portions for training and testing. The first 60% of the data was used for training, the next 30% was used for testing the progress of the training routines (used to calculate the stopping criteria for training), and the remaining 10% was reserved exclusively for testing a trained FFBPANN.

Several different techniques were used for training. This was done to find the quickest and most efficient way to train the neural network. This was also done because, early on, the training was not producing a quality branch predictor. Therefore, modifications to the training data set, varying the training parameters, and modifying the ANN structure were tried in the hope of improving the training procedure. The following procedures were used to vary the attempts at finding a weight matrix that would classify the inputs:

• The whole of the raw training data set was used.

• Modifications to the raw data set were made to reduce the size of the data set. The most common entries in the data set - lines that appeared more than 10 times - were extracted. This was attempted again for lines that appeared more than 50 times. This left only single entries in the resulting data set for the most common entries.

• Another attempt to reduce the data set was to pull the same frequently occurring entries, described above, out of the raw data set; however, this time the total number of occurrences was reduced by a fraction instead of being reduced to one entry. Now an entry occurring 10000 times in the raw data set would only occur 10 times in the reduced data set.

• The momentum (alpha) and the learning rate (eta) parameters of the FFBPANN were varied. These values were changed only slightly, as the values initially chosen were suitable for most cases.

• Different layer structures were tested. The following four FFBPANN configurations were tried: 4x3x2x1, 4x4x3x1, 4x4x2x1 and 4x3x3x1, though the 4x3x2x1 structure was the one most commonly used for training and testing.

Chapter 5: Results and Discussion

5.1 Training

The training data was gathered, as described in the previous chapter, by running a version of SimpleScalar modified to output the branch information (inputs and output) to a file. Several experiments to train a FFBPANN branch predictor were done using this data by varying the training slightly with each training attempt. None of the training techniques used produced a neural network branch predictor that performed more effectively than the other branch predictors already used in the SimpleScalar simulator. The results of these training attempts are as follows.

The modifications to the raw data sets did not produce better results than the attempts on the raw data sets described below. The attempt to reduce the data set by extracting only unique lines from the data would not help, since frequently occurring data (i.e. a pattern for the ANN to pick out) would not be represented. The other reductions of the data set just pulled out infrequently occurring lines, but did not help in the training attempts either. The reason for reducing the size of the data sets was to speed up training so that the following two modifications to the training procedures could be tried numerous times.

The attempts to vary the momentum (alpha) and the learning rate (eta) of the FFBPANN also did not produce satisfactory results.
Again, these values were varied only slightly from the standard momentum and learning rate that were used. Changing these parameters should speed up the training procedure (or give it a small boost). However, since the training was already performing poorly, all that these modifications did was help the training perform poorly faster: changing the values had little or no effect on the outcome, and at worst caused the training to finish more quickly but with poorly calculated weights. Testing these weights never produced satisfactory predictions.

Surprisingly, changing the layer structure of the FFBPANN also had no effect on its training. With the reduced data sets, multiple layer changes could be tested more frequently; since training with the full raw data sets took so long, the only practical way to test multiple layer options was to use the reduced data sets. None of the four FFBPANN architectures tested (4x3x2x1, 4x4x3x1, 4x4x2x1, 4x3x3x1) produced improved training performance or more accurate prediction results.

5.2 Discussion

One oddity to note occurred while gathering the training data. The SimpleScalar code was modified to write the selected inputs for the FFBPANN to a file. However, the branch data for every branch was not captured, due to a file size limitation of Solaris 2.6. The maximum file size turned out to be about 2 gigabytes (2^31 bytes); when this maximum was reached, subsequent branch prediction information was no longer gathered. The gcc SPEC95 benchmark was the only benchmark for which the full branch prediction data was gathered. Capturing the full gcc branch information is significant, since gcc performance is used by industry to gauge improvements to components such as compilers. The limitation on the amount of branch information that could be gathered could be a contributing factor in the poor FFBPANN training. Table 5.1 shows the amount of branch prediction information that could be gathered.

Table 5.1: Limitations of Branches Recorded
1 13401 1 127421811 16040941 128871 1 1 176241 1314661

Another possible reason for non-convergence of the FFBPANN weights is that the training data is too chaotic. An ANN is well suited to picking out patterns in its input, but in this situation no consistent patterns exist for the FFBPANN to classify effectively. Even though an ANN is far better suited than the human eye to finding structure across this many inputs, the FFBPANN was still unable to recognize any consistency in the data using the four inputs available to the predictor. Table 5.1 shows that almost every branch examined for the gcc, fpppp, and go benchmarks had a unique pattern - all four inputs and the branch outcome were different. However, this does not necessarily prove that the data is too chaotic for the FFBPANN weights to converge, since for the mgrid benchmark only about 12% of the branches were unique. Nonetheless, during training with the full raw data set from the mgrid benchmark, something very interesting was noticed.

A closer look at the training progress using the mgrid data set shows that the FFBPANN was behaving almost exclusively as a "branch-always-taken" predictor. This is shown in Figure 5.1. While the output never fully reached the value of "1," it came extremely close, with a maximum value of 0.969877.
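This behavior is consistent with a logistic (sigmoid) output unit, which can approach but never reach 1 for any finite net input. The sketch below assumes the FFBPANN uses the standard logistic activation; that assumption matches the observed output range, but the exact activation function used in the implementation is not restated in this section.

    #include <math.h>

    /* Standard logistic activation: the output lies strictly between 0 and 1,
     * so a strongly "taken"-biased network saturates near, but never at, 1
     * (for example, sigmoid(3.47) is roughly 0.970). */
    double sigmoid(double net_input)
    {
        return 1.0 / (1.0 + exp(-net_input));
    }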
Figure 5.1: Mgrid Training Progress (network output vs. expected output)

Figure 5.2: Mgrid Training Error

The training error shown in Figure 5.2 more clearly shows how often the "taken" prediction should have been "not taken". Another item to note about the mgrid training data is that the majority of the branches were "taken": about 95% of the mgrid branch decisions (127569741 of 134019707 recorded branches) were "taken". This could definitely account for why the FFBPANN appeared to be acting as a "branch-always-taken" predictor. Taking a closer look at the training output shows how quickly the ANN starts behaving this way. Figure 5.3 zooms in on the beginning of the mgrid training progress.

Figure 5.3: Mgrid Training Progress Zoom (output vs. expected output near the start of training)

Table 5.2 examines how frequently the other SPEC95 benchmarks' branches were "taken" or "not taken".

Table 5.2: Branch Data

                          gcc          fpppp         mgrid         go
Branches "taken"          30558126     77594080      127569741     84216932
Branches "not taken"      19820305     50662550      6449966       46225472
Total                     50378431     128256630     134019707     130442404
Percent "taken"           60.66%       60.50%        95.19%        64.56%

Interestingly enough, while training with the gcc and go benchmark data, which have a more even distribution of "taken" and "not taken" branches (about 61% and 65% "taken", respectively), the same training error occurred. The ANNs produced by those training attempts also performed very much like a "branch-always-taken" predictor. Figure 5.4 and Figure 5.5 show the training progress using the gcc and go branch prediction data, respectively.

Figure 5.4: Gcc Training Progress (output vs. expected output)

Figure 5.5: Go Training Progress (output vs. expected output)

Overall, it appears that the FFBPANN predictor defaults to a purely "branch-always-taken" predictor because it cannot pick out any patterns in the training data set. One question that cannot be fully resolved from this data is whether the predictor behaves as a branch-always-taken predictor simply because the majority of the branches were "taken". Since none of the benchmarks had predominantly "not taken" branches, it cannot be determined whether the predictor would instead behave as a "branch-always-not-taken" predictor if "not taken" branches outnumbered "taken" branches in the training data.

However, using the Unix command sed, the gcc training data was modified to create a data set with more "not taken" branches than "taken" branches. The lines with a branch decision of "not taken" were changed to "taken," and the lines with "taken" branches were changed to "not taken," effectively giving the opposite training data set. The training procedure was then run again. The expectation was that the neural network would very quickly start functioning as a "branch-always-not-taken" predictor. Figure 5.6 shows the actual training progress of this "opposite" run.

Figure 5.6: "Opposite" Gcc Training Progress (output vs. expected output)

The output of the neural network in this case did trend toward behaving as a "branch-always-not-taken" predictor: the values started high (around 0.7) and tapered off toward 0.3.
However, it did not reach those values as quickly as the network trained on the real gcc data approached "branch-always-taken" behavior. Also, to truly be classified as a "branch-always-not-taken" predictor, the output should be much closer to 0 (for example, around 0.12), just as the "branch-always-taken" predictor trained with the true gcc training data had output values much closer to 1 (about 0.97).

5.3 Branch Predictor Results

For completeness, the other branch predictor results gathered are included here. Table 5.3, Table 5.4, and Table 5.5 show the results for the 2-level, bimodal, and hybrid predictors, respectively. The number of instructions and branches are, of course, the same in each case. The prediction rate is the total number of correctly predicted branches (total branches minus the mispredictions) divided by the total number of branches:

\text{prediction rate} = \frac{\text{branches} - \text{mispredictions}}{\text{branches}} \qquad (5.1)

Table 5.3: Branch Predictor Results - 2-level

Simulation   Instructions    Branches       Mispredictions   Prediction Rate
fpppp        174687565856    2774271287     177200663        0.9361
gcc          253018428       50378431       5476501          0.8920
go           32718301958     4828119436     941659343        0.8050
li           55389884875     13200007732    519133700        0.9607
mgrid        110557152489    1444135667     35077426         0.9757
perl         14237817453     2713714020     19806753         0.9927

Table 5.4: Branch Predictor Results - Bimodal

Simulation   Instructions    Branches       Mispredictions   Prediction Rate
fpppp        174687565856    2774271287     216288715        0.9220
gcc          253018428       50378431       5476501          0.8913
go           32718301958     4828119436     880996257        0.8175
li           55389884875     13200007732    1056731094       0.9199
mgrid        110557152489    1444135667     36025665         0.9751
perl         14237817476     2713714028     106292585        0.9608

Table 5.5: Branch Prediction Results - Hybrid

Simulation   Instructions    Branches       Mispredictions   Prediction Rate
fpppp        174687565856    2774271287     166465777        0.9400
gcc          253018428       50378431       4699856          0.9067
go           32718301958     4828119436     848553348        0.8242
li           55389884875     13200007732    513342336        0.9611
mgrid        110557152489    1444135667     34751699         0.9759
perl         14237817453     2713714020     19742856         0.9927

Unfortunately, results for the FFBPANN predictor are not included, because the training did not produce a better branch predictor. Since the FFBPANN appears to have been behaving as a "branch-always-taken" predictor, its prediction rate would equal the percentage of "taken" branches in the benchmark. Except for the mgrid benchmark, in which 95% of the training data were "taken" branches, the prediction rate would be significantly poorer than that of the other predictors (i.e., about 60%). Even though the FFBPANN predictor was actually a hybrid predictor - a combination of a bimodal and a purely neural network predictor - it would have performed at best only as well as the bimodal predictor. Effectively, the FFBPANN predictor was a combination of a "branch-always-taken" predictor and the bimodal predictor.

The more common predictors (2-level and bimodal) used in SimpleScalar employ much simpler techniques for branch prediction - from a software perspective, shift registers and tables. They perform very well under most conditions; however, performance sometimes drops drastically, as in the case of the go benchmark. This can easily be observed in Figure 5.7 and Figure 5.8, which graph the predictor data presented in the preceding tables. Figure 5.7 shows the prediction rate per predictor, and Figure 5.8 shows how the predictors performed in each of the six SPEC95 simulations.
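Before turning to the figures, note that the rates in the preceding tables can be reproduced directly from the branch and misprediction counts using Equation 5.1; for example, the fpppp row of Table 5.3 gives (2774271287 - 177200663) / 2774271287, which is approximately 0.9361. A minimal sketch of that calculation in plain C (not SimpleScalar code) follows.

    #include <stdio.h>

    /* Prediction rate per Equation 5.1:
     * (branches - mispredictions) / branches. */
    double prediction_rate(unsigned long long branches,
                           unsigned long long mispredictions)
    {
        return (double)(branches - mispredictions) / (double)branches;
    }

    int main(void)
    {
        /* fpppp with the 2-level predictor (Table 5.3):
         * 2774271287 branches, 177200663 mispredictions -> about 0.9361. */
        printf("%.4f\n", prediction_rate(2774271287ULL, 177200663ULL));
        return 0;
    }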
Figure 5.7: Prediction Rate by Predictor (rates for the fpppp, gcc, go, li, mgrid, and perl simulations, grouped by predictor)

Figure 5.8: Prediction Rate by Simulation (hybrid, bimodal, and 2-level predictors compared for each simulation)

Figure 5.9 shows the overall predictor performance, obtained by dividing the total number of correct predictions by the total number of branches across all the simulations.

Figure 5.9: Overall Predictor Performance

Chapter 6: Conclusions and Future Work

6.1 Conclusions

From the results, the hybrid, or combination, predictor performs the best. For branch prediction to benefit from an artificial neural network, it can safely be stated that the proper solution is a combination of a purely ANN predictor and another, more common predictor - one that has shown consistently good results, such as the bimodal or 2-level predictors. The fact remains that the characteristics of artificial neural networks lend themselves well to prediction; however, in this case a purely feed-forward back-propagation ANN did not perform well at all.

Two tasks remain in order to continue testing the use of artificial neural networks for branch prediction. In-depth statistical analysis of branch instruction behavior is needed to determine whether another ANN design is better suited to predicting microprocessor branches. Also, the inputs used for the ANN should be examined more closely to determine which inputs should and should not be used to successfully train the ANN.

6.2 Future Work

6.2.1 Other Inputs

Further examination of the variables available in SimpleScalar for use as inputs should be done. Certain inputs may not be desirable, and others may have been overlooked. Specifically, the branch address could probably not be used. This prospect should also be tied to the following area of exploration: the selection of better training methods and the use of other ANN designs. By changing or adding new inputs, a different learning method could be used. For example, if previous branch decisions are taken as inputs, a recurrent learning method could be utilized. Also, variables from previous branch instructions could be examined and taken as inputs.

6.2.2 Other Neural Networks and Training Methods

The methodology used for training in this thesis was fairly simple, as the focus was on obtaining a working simulation. Other methods of training the ANN should be explored. New training methodologies for Artificial Neural Networks are constantly a topic of research, and methods may be found that allow an ANN for branch prediction to be successful. Other artificial neural network styles should also be explored. Statistical analysis of branch prediction may hint toward another neural network structure that is better suited to this task. Figure 6.1 shows other learning processes that could be explored.

Figure 6.1: The taxonomy of learning (learning algorithms: error-correction learning, Boltzmann learning, Thorndike's law of effects, Hebbian learning, and competitive learning; learning paradigms: supervised, reinforcement, and self-organized (unsupervised) learning)

A recurrent network should probably be explored. By choosing a recurrent network, previous branch prediction performance could be taken into account. An example of a recurrent network is shown in Figure 6.2.

Figure 6.2: Recurrent network (inputs and outputs connected through unit-delay operators)

One such recurrent network is the Hopfield network.
However, the Hopfield network is equivalent to an associative memory or content-addressable memory [Haykin]. Because of this, the Hopfield network might not be the best choice for branch prediction: since it is a form of memory - good at retrieving a pattern from memory - and it has been determined that there are not very many recurring patterns in the branch prediction training data, the Hopfield network may not improve the neural network as a branch predictor. One limitation of a Hopfield network is that it does not have any hidden neurons. Also, the Hopfield network cannot be operated in a supervised manner, which could impede training of the network.

If the use of a recurrent network is expanded upon (i.e., lifting the two limitations discussed for Hopfield networks), one learning method that looks promising is Boltzmann learning. Taking a stochastic approach to branch prediction seems reasonable, since the question this would answer is: "what is the probability that this branch instruction will be 'taken'?" Boltzmann learning should probably operate in the clamped condition, since the visible neurons should be used to supply the network with inputs from the environment. Boltzmann learning is a very different learning method from Hebbian learning, since there is no error-correction learning [Haykin]. Boltzmann learning has symmetric connections, which allow for two-way communication between neurons. The weight update function therefore becomes a function of correlations:

\Delta w_{ji} = \eta \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right), \qquad i, j = 1, 2, \ldots, N, \quad j \neq i

In this gradient descent rule, the two averages \rho_{ji}^{+} and \rho_{ji}^{-} are the conditional and unconditional correlations, respectively, between the states of neuron j and neuron i [Haykin]. Figure 6.3 shows a basic Boltzmann machine.

Figure 6.3: Boltzmann Machine (hidden neurons connected to the visible input and output neurons)

Another option could be to use a competitive learning technique. In this manner, two output neurons - one representing a "taken" decision and the other representing a "not taken" decision - would compete for the active, or "on," state. The weight update for a competitive network is determined by which neuron wins the competition:

\Delta w_{ji} = \begin{cases} \eta \left( x_i - w_{ji} \right) & \text{if neuron } j \text{ wins the competition} \\ 0 & \text{if neuron } j \text{ loses the competition} \end{cases}

Also, it is assumed that the sum of the synaptic weights for a given neuron is 1:

\sum_i w_{ji} = 1 \qquad \text{for all } j

A simple possibility for a competitive learning network is shown in Figure 6.4.

Figure 6.4: Competitive learning network

6.2.3 BTB

Another way an ANN could be used to improve branch prediction within the SimpleScalar architecture would be to use an ANN for address prediction. The ANN in this thesis was used only for direction prediction (take or do not take the branch). SimpleScalar has the capability for branch address prediction through the use of a Branch-Target Buffer, and an ANN could be used to enhance the branch address prediction.

6.2.4 State Output

One possibility that was not explored in this thesis is to train the ANN to be a branch state predictor - using the state of the bimodal or 2-level predictor as the output for training. While re-examining the SimpleScalar code, it was determined that the code for the bimodal and 2-level predictors tracks which state the predictor is in (0, 1, 2, or 3): when the state is 2 or 3 the prediction is "taken", and states 0 and 1 are "not taken". Instead of recording the 0 or 1 ("not taken" or "taken") outcome, the state output could be gathered with the appropriate inputs, as sketched below.
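A minimal sketch of the state handling described above follows; it uses the standard two-bit saturating-counter scheme implied by states 0 through 3, and the function names are illustrative rather than SimpleScalar's actual identifiers.

    /* Two-bit predictor states 0..3: states 2 and 3 predict "taken",
     * states 0 and 1 predict "not taken".  For the state-output idea,
     * the raw state (0..3) would be recorded as the training target
     * instead of the one-bit taken/not-taken outcome. */
    int predict_from_state(int state)
    {
        return (state >= 2) ? 1 /* taken */ : 0 /* not taken */;
    }

    /* Standard saturating update once the branch outcome is known. */
    int update_state(int state, int taken)
    {
        if (taken)
            return (state < 3) ? state + 1 : 3;
        return (state > 0) ? state - 1 : 0;
    }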
Then, instead of training toward the "taken" or "not taken" output, the ANN could be trained toward the predictor state. In essence, this would be training the ANN to be a better bimodal or 2-level predictor.

6.2.5 Hardware Feasibility

This thesis did not explore the hardware implementation of ANNs for branch prediction. However, both Haykin (in his book, Neural Networks: A Comprehensive Foundation) and Wang (in his dissertation, CMOS VLSI Implementations of a New Feedback Neural Network Architecture) state that VLSI implementations of ANNs are possible and that they are well suited to real-time operations such as control, signal processing, and pattern recognition. A 6-neuron ANN would occupy 2.2 mm x 2.2 mm using 2 μm n-well technology, and a 50-neuron CMOS analog chip uses 63,025 transistors and occupies 6.8 mm x 4.6 mm using 2 μm CMOS n-well technology [Wang]. The 4x3x2x1 ANN discussed in this thesis uses 10 neurons. In comparison, Intel's Pentium 4 microprocessor contains 42 million transistors [Intel]. So, if an ANN branch predictor can be successfully simulated, a hardware design could be attainable.

BIBLIOGRAPHY

[Anderson] Anderson, Paul and Gail Anderson. Advanced C Tips and Techniques. Indianapolis: Hayden Books, 1988.

[Austin] Austin, Todd and Doug Burger. SimpleScalar Tutorial. [Online] Available http://www.cs.wisc.edu/~mscalar/ss/tutorial.html, January 1998.

[Austin97] Austin, Todd. A User's and Hacker's Guide to the SimpleScalar Architectural Research Toolset. January 1997.

[Brehob] Brehob, Mark, Travis Doom, Richard Enbody, William H. Moore, Sherry Q. Moore, Ron Sass, and Charles Severance. Beyond RISC - The Post-RISC Architecture. [Online] Available http://www.egr.msu.edu/~crs/papers/postrisc2l, August 2, 1999.

[Dazsi] Dazsi, Brian and Richard Enbody. Artificial Neural Networks for Branch Prediction. Technical report MSU-CSE-01-22, Computer Science and Engineering, Michigan State University, 2001.

[Demuth] Demuth, Howard and Mark Beale. Neural Network Toolbox User's Guide. Natick, Massachusetts: The MathWorks, Inc., 1994.

[Driesen] Driesen, Karel and Urs Holzle. Accurate Indirect Branch Prediction. The 25th International Symposium on Computer Architecture. IEEE, Inc., 1998.

[Emer] Emer, Joel and Nikolas Gloy. A Language for Describing Predictors and its Application to Automatic Synthesis. The 24th Annual International Symposium on Computer Architecture. Association for Computing Machinery Press, 1997.

[Haykin] Haykin, Simon. Neural Networks: A Comprehensive Foundation. New Jersey: Prentice-Hall, Inc., 1994.

[Hsieh] Hsieh, Paul. Sixth Generation CPU Comparisons. [Online] Available http://www.azillionmonkeys.com/qed/cpuwar.html, August 2, 1999.

[Intel] Intel Corporation. Microprocessor Quick Reference Guide. [Online] Available http://www.intel.com/pressroom/kits/quickrefyr.htm, August 8, 2001.

[Kandel] Kandel, Eric; Schwartz, James; and Jessell, Thomas. Essentials of Neural Science and Behavior. Norwalk, Connecticut: Appleton & Lange, 1995.

[Kutza] Kutza, Karsten. Neural Networks at Your Fingertips. [Online] Available http://www.geocities.com/CapeCanaveral/1624/, July 2, 1999.

[Patterson] Patterson, David A. and John L. Hennessy. Computer Architecture: A Quantitative Approach. San Francisco: Morgan Kaufmann Publishers, Inc., 1996.

[SimpleScalar] The SimpleScalar Architectural Research Tool Set, Version 2.0. [Software] http://www.cs.wisc.edu/~mscalar/simplescalar.html

[Wang] Wang, Yiwen. CMOS VLSI Implementations of a New Feedback Neural Network Architecture. Diss. Michigan State University, 1991.
[West] West, Patricia M., Patrick L. Brockett, and Linda L. Golden. "A Comparative Analysis of Neural Networks and Statistical Methods for Predicting Consumer Choice." Marketing Science, vol. 16, no. 4, 1997, pp. 370-391.

[Yeh93] Yeh, Tse-Yu and Yale N. Patt. A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History. The 20th Annual International Symposium on Computer Architecture. Los Alamitos, CA: IEEE Computer Society Press, 1993.

[Yeh92] Yeh, Tse-Yu and Yale N. Patt. Alternative Implementations of Two-Level Adaptive Branch Prediction. The 19th Annual International Symposium on Computer Architecture. Association for Computing Machinery, 1992.