ON THE MATHEMATICS OF CACHING

By

Mark W. Brehob

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2003

ABSTRACT

ON THE MATHEMATICS OF CACHING

By

Mark W. Brehob

Computer circuits get faster at a much more rapid rate than memory does. In order for historical computer performance gains to continue, memory latency and bandwidth need to continue to improve. The most common way to provide this performance is to use small but very fast memory devices, called caches, to store the most commonly used data. Cache performance relies upon locality: the predictable way in which processors tend to access memory elements they have accessed in the past, or elements near those previous accesses.

Of fundamental importance to improving a complex device is understanding how that device functions. This work contributes to that understanding in two ways. First, it expands upon an old and mostly unused model of locality called 'stack distance' and shows how that locality interacts with various types of caches, including victim and skew caches. Second, a negative result is provided showing that it is computationally intractable to find an optimal schedule for most non-standard caches. This result means that a very useful tool for cache evaluation is not available when working with these non-standard caches. It also means that those working on finding a tractable optimal algorithm for these caches can stop; they will not succeed.

© Copyright 2003 by Mark W. Brehob

All Rights Reserved

Dedicated to my daughter, Kira Nerys Brehob, and her mother, Katrina Marie Brehob.

ACKNOWLEDGMENTS

This dissertation has been many years in the coming, and many, many people have been more supportive than I could possibly deserve. A short list of those people includes Suresh Chittor, Wen-Hann Wang, and Nick Wade of Intel, as well as Eric Torng, Stephen Wagner, Uthaisombut Patchrawat, Barbara Bircher Gannod, Travis Doom, Jennifer White Doom, Hugh Smith, and Charles Severance of Michigan State University. My in-laws, parents, and sister have stood behind me the whole way. My adviser, Richard Enbody, has been remarkable in his support and encouragement. My wife has sacrificed her time and energy beyond anything I could ever repay.

TABLE OF CONTENTS

LIST OF TABLES viii LIST OF FIGURES ix 1 Introduction and overview 1 1.1 Overview ....................................... 3 1.2 On Terminology used in this document ....................... 4 2 On Caching and Non-standard Caches 5 2.1 Introduction to Processor Caching ......................... 6 2.1.1 The structure and organization of a standard cache .............. 7 2.1.2 Address mapping ..................................
9 2.1.3 Replacement policies ................................ 11 2.1.4 Modern implementations ............................. 11 2.1.5 Caching metrics .................................. 13 2.1.6 Locality of reference ................................ 14 2.2 Non-standard caches ................................. 15 2.2.1 Hash caches ..................................... 16 2.2.2 Hash-rehash caches ................................. 18 2.2.3 Victim caches .................................... 21 2.2.4 Skew caches ..................................... 22 2.2.5 Smart caches .................................... 23 2.2.6 Further reading ................................... 29 3 On the Measurement of Locality 30 3.1 Locality measures .................................. 30 3.1.1 What is locality? .................................. 31 3.1.2 Locality measures in the literature ........................ 32 3.2 Macroscopic locality measures ............................ 32 3.2.1 Chow’s empirical power law ............................ 33 3.2.2 The Fractal Model of Locality .......................... 34 3.2.3 Agarwal’s analytic cache model .......................... 35 3.2.4 Grimsrud’s locality metric ............................. 36 3.2.5 Cache based locality ................................ 37 3.3 Microscopic locality measures ............................ 40 3.3.1 Instantaneous locality ............................... 40 3.3.2 Loop—based locality metrics ............................ 42 3.4 Stack distance .................................... 44 3.4.1 Stack-distance distribution functions ....................... 45 3.4.2 Stack distance in the literature .......................... 46 3.4.3 Algorithmic use of the stack model ........................ 48 vi 3.4.4 Trace compression ................................. 49 3.4.5 Conclusion ..................................... 50 4 Stack distance as a locality measure 51 4.1 The stack distance model of locality revisited ................... 53 4.1.1 LRU stack distance revisited ........................... 54 4.1.2 Derivation of the cache model for standard caches ............... 56 4.2 Validation of the cache model ............................ 61 4.2.1 A comparison of actual hit rates to predicted values .............. 65 4.3 Expansion of the model to non-standard caches .................. 69 4.3.1 An analytic LRU skew-cache model ....................... 69 4.3.2 An analytic LRU victim cache model ...................... 75 4.3.3 Verification and the 100% miss-rate model .‘ .................. 80 4.4 Applications of the Models ............................. 83 4.4.1 Applications of the locality model ........................ 85 4.4.2 Applications of the cache model ......................... 90 4.4.3 Examining non-standard caches ......................... 95 4.5 Future work ...................................... 100 4.6 Conclusion ...................................... 102 5 Optimal replacement is NP-hard for non-standard caches. 104 5.1 Introduction and Background ............................ 104 5.1.1 Companion caches ................................. 106 5.1.2 Related terms ................................... 107 5.1.3 Chapter outline ................................... 108 5.1.4 Problem definition and goal ............................ 109 5.2 Proofs ......................................... 109 5.2.1 Interval scheduling and caching .......................... 110 5.2.2 Formal NP-hard proof for FCCS ......................... 
112 5.3 Extending FCCS ................................... 117 5.4 Conclusion ...................................... 128 BIBLIOGRAPHY 129

LIST OF TABLES

3.1 A classification of misses using the optimal replacement policy. The arguments to the function M() are line size, cache size, associativity and replacement policy, in that order. L is the line size of the base cache, C is its cache size, and k is the degree of associativity. 39
4.1 Symbols used in describing how caches interact with locality. 57
4.2 Hit rates and predicted hit rates for the SPECfp2000 benchmarks. 66
4.3 Actual and predicted hit rates for SPECint2000 benchmarks. 67
4.4 Prediction errors grouped by cache configuration for the SPECfp2000 benchmarks. 67
4.5 Prediction errors grouped by cache configuration for the SPECint2000 benchmarks. 67
4.6 Prediction errors grouped by cache configuration for the SPECint2000 benchmarks. 68
4.7 Transition probabilities for a two-way associative skew cache. 73
4.8 Measurement of excessive conflict. 94
4.9 A comparison of replacement schemes for skew caches. Hit rate improvement is in terms of absolute improvement over the one-bit replacement scheme. 98
4.10 A comparison of replacement schemes for victim caches. Hit rate improvement is in terms of absolute improvement. 100
5.1 Ratio of the number of sets in the L2 cache over the number in the L1 data cache for selected processors. 128

LIST OF FIGURES

2.1 A simple caching structure.
2.2 A cache with data and tags.
2.3 The three types of standard caches. If X is some address, this figure shows the different locations its cache block could be placed.
2.4 How an address is used by the cache. B is block size and S is set size. Both are assumed to be greater than 1.
2.5 The Harvard architecture.
2.6 An example of how the set number might be determined in a hash cache.
2.7 A model of a 2-way set-associative cache. [60]
2.8 A 2-way skewed-associative cache. [60]
3.1 Illustration of the stack distance of a reference stream.
3.2 A graph of the locality found in the reference stream of Figure 3.1.
4.1 Cumulative stack distance for three floating-point benchmark traces.
4.2 Average of the locality found in the SpecInt and SpecFP benchmarks.
4.3 Illustration of the theoretic relationship between hit rates, associativity, and the locality of a given memory reference.
4.4 Stack distance hit-rates for SpecInt on a 4KB direct-mapped cache.
4.5 Stack distance hit-rates for SpecFP on a 4KB direct-mapped cache.
4.6 Stack distance hit-rates for SpecInt on a 4KB two-way associative cache.
4.7 Stack distance hit-rates for SpecFP on a 4KB two-way associative cache.
4.8 Example showing how locality interacts differently with a skew cache than a standard cache.
4.9 Predicted hit rate for a skew cache and a standard 2-way associative cache. Both are 128 lines in size.
4.10 A magnified look at the 'flaws' in the skew cache analytic model.
4.11 Predicted hit rate at each stack distance for the LRU victim cache and its components.
4.12 A magnified version of the upper-left hand corner of Figure 4.11.
4.13 Error rate of the 100% miss rate model when predicting the hit rates for a victim cache.
4.14 Difference between the skew cache analytic model and the same cache using the 100% miss-rate model. Two different analytic models are compared.
4.15 Impact of cache filtering on the gcc benchmark and the impact of writebacks.
4.16 Impact of associativity on cache filtering.
4.17 Percent of accesses in a streak of size N or greater.
4.18 Impact of associativity on cache filtering.
4.19 Impact of associativity on cache filtering.
4.20 Illustration of the impact of the replacement scheme on performance for skew caches. The direct-mapped cache is provided as a reference and all caches consist of 128 lines.
4.21 Illustration of the impact of the replacement scheme on performance for victim caches.
4.22 Illustration of the impact of locality on the hit rate when using a 'separate LRU' replacement scheme. This graph also shows the impact of using 5-bit time stamps.
5.1 Relation between memory accesses and intervals.
5.2 Example reduction from 3-Occ-Max-2SAT.
5.3 Example construction for a 2-way set-associative main cache.
5.4 Example construction for a companion buffer of size 3.

Chapter 1

Introduction and overview

Modern society has come to rely upon the computer and its ever-increasing performance gains. In order to meet future expectations, both of society in general and of the computer industry, faster and better computers need to be designed and built. The computer industry, citing a variant of Moore's Law [39], has traditionally expected that processor performance will improve by 60% per year (this is equivalent to the more commonly stated "double in performance every 18 months"). Of course, there are many roadblocks on the path to continuing at this rate of exponential performance improvement. These roadblocks include physical limitations of the devices being built, the difficulty of designing extremely large and complex processors, and the amount of heat which must be removed from the processor.

One of the more immediate roadblocks is the relatively slow rate of improvement in the speed of memory devices such as DRAM. This looming "memory wall" [80] stands squarely in the path of continued processor improvement. DRAM process improvements have historically generated a 7% improvement in latency per year. One professor at Berkeley jokingly contrasted this much slower rate of improvement with Moore's Law by calling it "Less' Law" [22].

The most common way to address this disparity is to use small but very fast memory devices to store the most commonly used data. Those devices are called caches. Of course, not all of the needed data can be kept in these small caches, so some of the memory accesses must go to the DRAM. Those DRAM accesses are said to have been cache misses. Traditional methods for caching data are reasonably effective, but the widening gap between memory performance and expected processor performance has made further decreasing the number of cache misses an important goal. As such, many novel caching techniques have been proposed in the last ten years.
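To give a rough sense of scale, the short sketch below (not from the original dissertation) simply compounds the growth rates quoted above, 60% per year for processors and 7% per year for DRAM latency; the function name and the ten-year horizon are illustrative assumptions.

# A minimal sketch, assuming the 60%/year processor and 7%/year DRAM
# improvement rates quoted in the text, compounded to show how quickly
# the processor-memory gap widens.
def relative_gap(years, cpu_rate=0.60, dram_rate=0.07):
    cpu = (1 + cpu_rate) ** years     # relative processor performance
    dram = (1 + dram_rate) ** years   # relative DRAM (latency) improvement
    return cpu / dram                 # how much wider the gap has become

for years in (1, 5, 10):
    print(years, round(relative_gap(years), 1))
# Prints roughly 1.5, 7.5, and 55.9: about a 50% widening every year.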
One difficulty with this plethora of ideas is that they are very difficult to evaluate. Certainly one can compare them in an environment where the workload is held fixed and "similar" sized caches are compared. (In fact we have done such a study for a number of these techniques [14].) However, such studies give very little insight into why these caches perform well, something very important to anyone who proposes to use them in a commercial processor. The focus of this dissertation is to examine what can and cannot be done to gain that needed insight. The work in this dissertation provides some insight and results about how a cache interacts with the pattern of memory requests made by the processor. The two main contributions are:

- An old technique, LRU stack distance, is greatly expanded upon. It is used to evaluate locality and to understand how locality and caches interact.

- An important evaluation technique, the optimal replacement algorithm, is shown to be NP-hard to compute for a large and important class of caches. This means that optimal replacement is a tool that designers of those caches, which include the "newest" cache designs, will not be able to utilize.

1.1 Overview

As with most scientific endeavors, this work would be meaningless without the context of others' efforts. As such, Chapters 2 and 3 summarize that context. In Chapter 2 a short tutorial on caching is presented along with a discussion concerning a number of non-standard caching techniques. The tutorial mostly consists of material one would find in a standard undergraduate class on the topic. However, the non-standard cache discussion should be novel to most readers.

Chapter 3 is a discussion of the memory reference stream of a processor. The patterns in that stream (called locality of reference) are what allow small caches to perform as well as they do. The various techniques in the literature for quantifying and taking advantage of the nature and patterns of these reference streams are examined. At the end of that chapter the stack distance model, which is used throughout Chapter 4, is introduced.

While bits and pieces of original work are scattered throughout the early chapters, it is Chapters 4 and 5 that contain the original work found in this thesis. Chapter 4 discusses how the stack distance model can be used to visualize and measure the interaction of locality and caches. Specifically, that model is validated more extensively than before, new applications are found, and the model is expanded to work with non-standard caches.

Chapter 5 presents a formal proof showing that it is NP-hard to compute the optimal replacement policy of certain non-standard caches. This result holds for skew caches, generalized victim and assist caches, and in fact nearly all multi-lateral cache schemes. Further, various results are given which show that it is also NP-hard to closely estimate the number of misses an optimal replacement algorithm would cause on these caches.

1.2 On Terminology used in this document

In the computer profession, as in most professions, terminology is not always as standardized as its practitioners might like. In some cases this extends beyond terminology and into differing standards and assumptions. Below is an attempt to address those terms that might be unclear.

- Throughout this dissertation, the function lg(x) will be used in place of log₂(x), as is common in many Computer Science texts.

- A consistent big-endian [39] notation shall be used throughout this work.
- The SPEC CPU2000 benchmark suite [21] is referred to throughout this document as the 'SPEC' benchmark suite. The integer and floating-point parts of the benchmark are referred to as SPECint and SPECfp.

Chapter 2

On Caching and Non-standard Caches

This chapter serves two purposes and as such is broken into two sections. The purpose of the first section is to introduce the fundamentals of processor caching. This section does not, and cannot, provide a complete overview of the topic. Rather, its purpose is to provide a basic understanding of terms that will be used in later chapters. The experienced computer architect will find the vast majority of the material in that first section to be review, while others may find the need to refer to computer architecture books such as [39, 56, 66] in order to provide needed context.

The second section of this chapter focuses on non-standard caching schemes. This material is significantly more obscure than the material of the first section1. Because Chapters 4 and 5 assume a working knowledge of this material, it is included in this chapter along with the more basic material of section 2.1.

1 In fact, only one paper has been found which provides an overview of even a few of these caching schemes [82].

2.1 Introduction to Processor Caching

As a processor executes instructions it needs information from the memory system, where both data and the instructions themselves are stored. However, the main memory is usually very slow. In the time it takes to fetch one piece of information, a modern processor could have executed many hundreds of instructions. Clearly this makes the memory system a significant performance bottleneck.

While there have been a large number of methods used to cope with this bottleneck, the most common and pervasive is caching. As shown in Figure 2.1, a small, fast memory subsystem, the cache, is placed between the processor and the main memory. The purpose of caching is to keep the most commonly accessed information in a closer (and much faster) location, the cache.

Figure 2.1: A simple caching structure.

The cache generally keeps a copy of the most recently referenced information found in the main memory system. While this cache is usually orders of magnitude smaller than main memory, it is common to find that well over 90% of all memory references are in the cache [39]. The time it takes to fetch a piece of information from the cache is usually similar to the time it takes to execute a single instruction [17]. Obviously, this greatly speeds up the average time required to perform a memory access, and reduces the "memory bottleneck" mentioned above.

It is instructive to include more detail at this point. Each byte2 of memory generally has a certain numeric address. Normally the processor will request one to eight bytes of information at a time. If the information is not in the cache (this event is called a cache miss), the cache requests it from main memory. Once retrieved from main memory, the cache provides the information to the processor and keeps a copy. If the cache does have the information (a cache hit), it simply supplies its copy of the information.

2 A byte generally consists of 8 bits and can be thought of as a single number in the range of 0 to 255.

The remainder of section 2.1 goes into greater detail about caching, but not beyond what
can be found in any standard computer architecture book. Books by Hayes [38] and by Hennessy and Patterson [39, 56] in fact. contain this information in significantly more detail. Interested readers may find that Smith's seminal work [64] on caching provides a complete, if dated, overview on caching. Also, books by Przybylski [57] and Handy [37] are outstanding. Przybylski’s book, while published in 1990, provides an architectural viewpoint similar to that of this author. Handy's book is written from a circuit designer’s viewpoint. 2.1.1 The structure and organization of a standard cache In general a cache consists of two parts. the data part, which is the information stored in the cache, and the tag part, which indicates which address the data came from. Each tag consists of enough information to identify from which address the data came. When a memory access occurs one or more tags are searched and compared to the memory address of the memory access. If a tag matches the address of the memory access, the data associated with that tag is provided to the processor by the cache rather than requiring an access to the (much slower) main memory to get the same data. The one-to-one relationship between tags and data is illustrated in Figure 2.2. Ideally the cache could keep the data is the most likely to be reused. In practice, however, this is usually not viable. Imagine what would happen when the processor requests data ............................................................... ............................................................... Figure 2.2: A cache with data and tags. from the cache. The cache would need to search though each of the tags to find where the data is in the cache (if it is there at all). Since the whole point of the cache is to speedup access time, the time required to search all of the tags is usually not acceptable. A common solution to this problem is to assign each memory location a single place in the cache where it can reside. Since the main memory is much larger than the cache, many memory locations share the same assigned place in the cache. This means that only one of those memory locations could be in the cache at any given time. The net result is that flexibility is traded away for speed. This type of cache is called a direct-mapped cache. A cache where the data may be kept anywhere is called a fully-associative cache. A compromise between the direct-mapped cache and the fully-associative cache is the set-associative cache. A set-associative cache is broken into sets of locations, all of equal size. Rather than mapping each main memory location to a specific line in the cache, the memory location is mapped to a set of lines. A given main memory location is now restricted to a very small number of locations in the cache. This increases the flexibility of placing data into the cache while still greatly reducing the number of locations which need to be searched. Caches where. the set size is of size n are referred to as “n-way associative caches.” Figure 2.3 illustrates the organization of these three caches. One can think of both a direct-mapped cache and a fully-associative cache as extremes of a set-associative cache. In the remainder of section 2.1 we use the terminology associated with set-associative caches when referring to any of these three standard cache types. Direct-mapped Set-associative F ully-associative Figure 2.3: The three types of standard caches. 
If X is some address, this figure shows the different locations its cache block could be placed. 2.1 .2 Address mapping A standard cache of C bytes consists of K cache; lines. each B bytes long. Further, there are S sets of cache lines. where each set consists of N cache lines. It follows that C = K a: B and K = S * N. In this section it is formally described how the various memory addresses are mapped into a cache. The next few paragraphs are summarized by Figure 2.4, and referring to that figure may be helpful. Because of a property called spatial locality, (discussed in detail in section 2.1.6) it is likely that one of the next memory references will be to a location near the current reference. Thus, it makes sense to fetch the data near the current reference into the cache. To accomplish this task, the entire memory space is broken into non-overlapping blocks3 of size B, and the entire block where the current memory reference resides is brought into 31h this document the term cache line (or line) is used to describe the location in the cache while the data that goes into the cache line is called a cache block (or block). Some others use the term “block frame” for line while other, mainly older works, make no attempt to distinguish between the two concepts. the cache. If all the address bits other than bits 32 - lg(B) to bit 31 are the same, the two memory locations are in the same cache block. Mapping each cache block into a set of the cache is the next task. Because of spatial locality it is best to insure that two cache blocks that have addresses near each other do not get mapped into the same set. To insure this, the blocks are mapped into the cache in a round-robin like scheme, so that a given set has every Sth block mapped to it. If two memory locations have the same values in bits 32 — (lg(B) + lg(S)) to 31 — lg(B), the two addresses are mapped into the same set. These bits are often called the index bits as they are used as an index to select a set in the cache. Finally the remainder of the address, that is bits 0 to 31 — (19(5) + lg(B)), make up the tag. Because of the mapping into the set. and block. the remainder of the address is trivial to determine, and thus the tag is a unique identifier of the memory address from which the block came. Figure 2.4 summarizes the way a cache uses the address of a memory reference to generate. the set number and tag. The offset. simply states where the data is inside of the cache line. .................................................................................................................................................................... High-order bits Low—order bits Address: blto t0 bit31_([g(3)+lg(3)) bzt32—(19(B)+19(S)) t0 bit32_lg(3) t0 bit31 bit3l—lg(B) Offset: ' ’ lg(B) bits Set number (index): L; . ‘ lg(S) bits Tag: 329I1g(B)+1g(S)J bits .................................................................................................................................................................... Figure 2.4: How an address is used by the cache. B is block size and S is set size. Both are assumed to be greater than 1. 10 An example is likely useful at this point. Consider a 2-way associative cache which has 512 cache lines. each 16 bytes in size. Thus the data portion of the cache is 8192 bytes (or 8 Kilobytes) in size and there are 256 sets. Say the one byte of data at address F 012345616 is requested by the processor. Notice that the offset of this address is 615, the set number is 4516 and the tag is F012316. 
Thus, the two tags associated with set number 4515 (6910) are checked to see if they match the tag of the address (F 0123416). If either does, the data is found in the sixth byte (the offset) of the data part of the that cache line. 2.1.3 Replacement policies If the cache is full and a new block needs to be put into the cache, some other block must be evicted. Clearly the block that is to be evicted nmst be from the same set as the new block. In a direct-mapped cache, there is no decision which needs to be made—there is only one block to chose from. However, in a set-associative or fully-associative cache, some decision must be made. By far the most common scheme is to evict the block which has been unused the longest. This block is called the least-recently-used (LRU) block and the algorithm is called the LRU replacement algorithm. The reason the LRU algorithm is so commonly used is that programs tend to access data they have accessed recently. This property of programs is called temporal locality and is discussed further in 2.1.6. 2.1.4 Modern implementations Modern caches are actually more complex than have been described above. III this section we outline some of the techniques used in modern memory system design. Modern processors generally have multiple levels of cache. If the first and smallest cache 11 (called the L1 cache) does not have the requested block, it checks in the next level of the cache (called the L2 cache). In theory this would continue until the last level of cache has missed, when the data would be retrieved from the main memory. Today, most general- purpose processors have only two levels of cache, although processors with three levels, and even four levels of cache are starting to show up in the marketplace[35]. Another quirk of modern caching is that the L1 cache is often split into two parts, one for data and one for instructions. This is done to improve bandwidth, improve miss rates, and for several other reasons[39]. Usually the data and instruction request stream are combined at the L2 cache. The caching scheme where the data and instructions are split is often called the Harvard Architecture[39], and is illustrated in Figure 2.5. ............................................................................ Proccessor L1 L1 Instruction Data Cache Cache ............................................................................ Figure 2.5: The Harvard architecture. Lastly the use of virtual addresses has an impact on caching, particularly non-standard caching schemes. The details of virtual addressing can be found in most computer archi- tecture and operating system texts[39, 63]. The net effect is that the address the processor 12 uses (the virtual address) is different than the address the main memory system uses (the physical address). 111 general, the 71 least significant bits will not change when the virtual address is translated into the physical address. The value of 2" is then called the page size and is often around 212 bytes in size. The point at which the virtual address is translated into a physical address has a sig- nificant impact on the cache. This translation can happen before, after or during the cache lookup. While this decision has a Significant impact on cache performance, the only option which impacts this discussion is if the translation happens during the cache lookup. In that case, the index bits of the virtual address are used to select a set. 
2.1.3 Replacement policies

If the cache is full and a new block needs to be put into the cache, some other block must be evicted. Clearly the block that is to be evicted must be from the same set as the new block. In a direct-mapped cache there is no decision to be made; there is only one block to choose from. However, in a set-associative or fully-associative cache, some decision must be made. By far the most common scheme is to evict the block which has been unused the longest. This block is called the least-recently-used (LRU) block and the algorithm is called the LRU replacement algorithm. The reason the LRU algorithm is so commonly used is that programs tend to access data they have accessed recently. This property of programs is called temporal locality and is discussed further in section 2.1.6.

2.1.4 Modern implementations

Modern caches are actually more complex than has been described above. In this section we outline some of the techniques used in modern memory system design.

Modern processors generally have multiple levels of cache. If the first and smallest cache (called the L1 cache) does not have the requested block, it checks in the next level of the cache (called the L2 cache). In theory this would continue until the last level of cache has missed, when the data would be retrieved from the main memory. Today, most general-purpose processors have only two levels of cache, although processors with three levels, and even four levels, of cache are starting to show up in the marketplace [35].

Another quirk of modern caching is that the L1 cache is often split into two parts, one for data and one for instructions. This is done to improve bandwidth, improve miss rates, and for several other reasons [39]. Usually the data and instruction request streams are combined at the L2 cache. The caching scheme where the data and instructions are split is often called the Harvard architecture [39], and is illustrated in Figure 2.5.

Figure 2.5: The Harvard architecture.

Lastly, the use of virtual addresses has an impact on caching, particularly on non-standard caching schemes. The details of virtual addressing can be found in most computer architecture and operating system texts [39, 63]. The net effect is that the address the processor uses (the virtual address) is different than the address the main memory system uses (the physical address). In general, the n least significant bits will not change when the virtual address is translated into the physical address. The value of 2^n is then called the page size and is often around 2^12 bytes in size.

The point at which the virtual address is translated into a physical address has a significant impact on the cache. This translation can happen before, after, or during the cache lookup. While this decision has a significant impact on cache performance, the only option which impacts this discussion is if the translation happens during the cache lookup. In that case, the index bits of the virtual address are used to select a set. The virtual address is translated as the cache is searched, and the tag(s) are compared against the resulting physical address. This scheme has a number of names, including "virtually addressed, physically tagged," "virtually indexed, physically tagged," or simply a V/R cache [79]. Notice that it relies upon the index bits being the same for both the physical and virtual addresses4. This observation implies that the number of sets multiplied by the line size cannot be larger than the page size. The quick translation of virtual to physical addresses is done by a device called a translation lookaside buffer, or TLB.

4 Although Wu, among others, proposes ways around this restriction [79].

2.1.5 Caching metrics

The previous sections have hinted at the two basic metrics for measuring the goodness of a cache: hit rate and access time. Hit rate (HR) is simply the percent of accesses which are cache hits. The access time is the amount of time it takes to service a hit (T_hit) or a miss (T_miss). From these metrics the average access time of a memory access can be computed. It is simply HR × T_hit + (1 − HR) × T_miss. Recall that the advantage of direct-mapped caches is a faster access time (a lower T_hit), while they tend to have a lower hit rate (HR). Fully-associative caches usually have longer access times but a higher hit rate. A detailed, if somewhat out of date, analysis of these trade-offs can be found in section 5.3.3 of Przybylski's book [57].

T_miss is the time it takes for an access to be fulfilled if the cache does not have a copy of the data, so it has very little to do with the cache itself. Rather, it is dependent upon the memory system behind the cache. In fact, T_miss can be an important parameter when designing a cache. If T_miss is high, increasing the hit rate at the cost of T_hit will be more beneficial than if T_miss were lower. The short sketch below plugs illustrative numbers into this trade-off.
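As a concrete illustration of the trade-off just described, the sketch below plugs hypothetical cycle counts and hit rates (assumptions chosen only for illustration, not measurements from this dissertation) into the average-access-time formula HR × T_hit + (1 − HR) × T_miss.

# A minimal sketch of the average-access-time formula HR*T_hit + (1-HR)*T_miss.
# The cycle counts and hit rates below are illustrative assumptions only.
def average_access_time(hit_rate, t_hit, t_miss):
    return hit_rate * t_hit + (1 - hit_rate) * t_miss

# A hypothetical direct-mapped cache: faster hit time, lower hit rate.
print(round(average_access_time(0.90, t_hit=1, t_miss=50), 2))   # 5.9 cycles
# A hypothetical 2-way associative cache: slower hit time, higher hit rate.
print(round(average_access_time(0.93, t_hit=2, t_miss=50), 2))   # 5.36 cycles
# With a slower memory system (t_miss = 200) the higher hit rate matters more.
print(round(average_access_time(0.90, t_hit=1, t_miss=200), 2))  # 20.9 cycles
print(round(average_access_time(0.93, t_hit=2, t_miss=200), 2))  # 15.86 cycles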
2.1.6 Locality of reference

Locality of reference is the fundamental reason a small cache can achieve a high hit rate. A processor tends to access memory locations it has referenced recently (temporal locality) as well as locations near those it has referenced recently (spatial locality). A cache takes advantage of spatial locality by fetching a larger chunk of data than the processor requests (a cache block). As discussed in section 2.1.3, the temporal locality of a program is utilized by the replacement algorithm, and by the very concept of a cache.

The above definition of locality is fairly terse and certainly qualitative rather than quantitative. Unfortunately, that definition is literally as detailed as any standard computer architecture text gets. In Chapter 3 a number of locality metrics from the literature are discussed. In Chapter 4 a method for quantifying and visualizing locality is introduced in great detail.

2.2 Non-standard caches

While section 2.1 details how the most common caching schemes work, there are a large number of variations. Those variations include changes in replacement policy, cache structure, and address mapping. These variations usually can be evaluated in terms of the two cache metrics mentioned above: hit rate and access time. Each scheme attempts to improve one of the metrics, usually sacrificing the other to some small extent.

Evaluating these non-standard caching schemes can be difficult. First, it is difficult to prove that a given scheme will really improve the hit rate; even increasing associativity can hurt in certain cases [72]. Thus anyone proposing a new scheme has to show that their scheme works well on most real programs, quite a tall order. In general the authors of these schemes use the SPEC CPU benchmark suite5 as a standard method of evaluation. Even using this rather large suite of programs as a standard for measuring cache performance does not always generate reliable results. For one thing, the benchmark suite is biased toward certain types of programs [23]. Another problem is that when evaluating a novel caching scheme, one generally needs to write the cache simulator from scratch. This can be error-prone and has resulted in papers full of errors. For example, in an e-mail exchange with D. Rhoades, he agreed with this author that at least parts of their published work [59] suffer from significant simulator error.

5 Information about SPEC CPU can be found at http://www.spec.org.

The other difficulty is evaluating the access times of a novel caching scheme. The problem is that such an evaluation is an engineering decision where the quality of the design, the processor layout, and the properties of the process technology used will have significant impacts on the actual access time. There have been high-level models proposed to estimate the impact of certain changes at this level [25]. However, even if those models are accurate, many non-standard caching schemes do not easily fit into the parameters of these models.

Keeping these problems in evaluation firmly in mind, it is worthwhile to point out that just because these schemes are difficult to evaluate does not mean that they are not useful. Rather, it means that when considering them one needs to keep a degree of healthy skepticism about the reported results. Independent measures of the hit rates of these caches have been reported [14, 60] and are probably somewhat more reliable.

2.2.1 Hash caches

Perhaps the simplest and most common modification to a standard cache is to modify the way addresses are mapped into the sets of the cache. Rather than using the lg(S) bits, where S is the number of sets, to select a set of the cache (as described in section 2.1.2), a larger group of address bits is used as input to a function which selects the set. This function is called the hashing function, and a cache which uses it is called a hash cache. For example, if there are 2^m sets, one might use 2m different address bits and then perform a bit-wise XOR of those values to get an output of m bits. Figure 2.6 illustrates this hashing function. A more generic hashing cache was conceptually proposed by Smith in a 1978 work [65], while at least one real processor, Hewlett-Packard's PA7200, uses a hashed L1 cache [17].

Figure 2.6: An example of how the set number might be determined in a hash cache.

The benefit of the hash cache is that it can reduce set conflicts in certain situations. Imagine that a program is striding [39] through memory 1024 bytes at a time. That is, it might reference location 0, then 1024, then 2048, etc.6 Now imagine that we are using a direct-mapped cache with 32-byte cache lines and a total size of 4096 bytes. That cache has 4096/32 = 128 cache lines. As the program strides through memory it will only utilize four of those cache lines, specifically lines 0, 32, 64, and 96. The hashing function described above, by contrast, will allow the cache to utilize all 128 cache lines; the short sketch below illustrates the difference.

6 This type of behavior is quite common and usually associated with arrays.
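Here is a minimal sketch (illustrative code written for this discussion, with hypothetical helper names) contrasting ordinary bit-selection indexing with the XOR hashing function described above, using the 1024-byte stride example.

# A minimal sketch contrasting bit-selection indexing with an XOR hash,
# using the stride example above: 4096-byte direct-mapped cache, 32-byte lines.
LINE = 32
LINES = 128            # 4096 / 32

def bit_select_index(addr):
    return (addr // LINE) % LINES            # the normal lg(S) index bits

def xor_hash_index(addr):
    block = addr // LINE
    low = block % LINES                      # low 7 index bits
    high = (block // LINES) % LINES          # next 7 address bits
    return low ^ high                        # bit-wise XOR of the two groups

addrs = [k * 1024 for k in range(128)]       # striding 1024 bytes at a time
print(len({bit_select_index(a) for a in addrs}))  # 4 distinct lines (0, 32, 64, 96)
print(len({xor_hash_index(a) for a in addrs}))    # 128 distinct lines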
A hash cache also has a number of restrictions and drawbacks associated with it. Performing the hashing function can take a significant amount of time. While a single XOR gate delay may seem trivial, that extra time may not be available7. Also, as described in section 2.1.4, if the cache is virtually addressed but physically tagged, the extra bits used by the cache to perform the hashing function may not be available until after the translation from the virtual to the physical address [60].

7 In modern processes wire delays are starting to dominate processor latencies. This means that the delay introduced by an XOR may become less significant in the future. [4]

One's complement cache

A one's complement cache [81] is a hash cache with a specific hashing function. Specifically, in the case of a cache with 2^m sets, the address (other than the block offset bits) is broken into groups of m bits. These fields are then XORed together. That creates a mathematical function similar to what would occur if the cache were of size 2^m − 1 rather than 2^m, which can greatly reduce striding problems.

The authors of the paper which introduced the one's complement cache claim that miss rates tend to drop 10% to 20%. Other studies indicate that the improvement is much lower for most traces, usually on the order of 1% to 3% [14]. However, we have found that on certain data sets a one's complement cache can have a significant impact, sometimes reducing miss rates by a factor of 10. While this may be due to a horrible compiler, it is certain that horrible compilers exist in the real world.

2.2.2 Hash-rehash caches

While the hash cache tries to reduce the miss rate of the cache, a hash-rehash cache [1] tries to decrease the time it takes for the cache to supply the data in the case of a hit (T_hit). This reduction is accomplished by addressing the cache as a direct-mapped cache. If this access results in a miss, one of the index bits8 is toggled, and the resulting value is used as the index to search the cache again. Notice that the toggling of a single bit means that two cache lines are always paired, although the search order will vary. In this section we refer to X1 as the first line searched for a given memory reference and X2 as the second line searched.

8 Usually the most significant index bit. In any case, a given hash-rehash cache will always use the same toggle bit.

If a hit occurs when searching X1, the data is supplied to the processor and the cache is unchanged. If a hit occurs when searching X2, the cache block there is swapped with
The index into the cache is usually done by bit selection, that is by using the index bits as described in section 2.1.3, rather than using a more complex hashing function. The reason for this indexing scheme is that even a short XOR delay might negate the access time advantages of a. direct-mapped cache over a two-way associative cache. While papers on hash-rehash caches and their variants [1, 3, 16] have been common, we are unaware of one being used in a commercial productg. It is difficult to say exactly why this is, but there are some obvious possibilities. Clearly the lack of index bits in a virtually addressed, physically tagged cache is an issue, in part because an X -way set-associative cache requires lg(X) fewer index bits than a hash-rehash cache of the same size. Further, the access time advantage of a hash-rehash cache is going to be very dependent upon the exact technology used as well as other clock-limiting. paths in the chip. One might. be trading a lower hit rate for no actual improvement in access time. 9Although the IBM CMOS System/360’s cache organization is somewhat similar to a hash-rehash cache and is described below. 19 Column-associative caches One variant on hash-rehash caches is the colunm-associative cache[3]. A “rehash” bit is added to each cache line, which both improves the overall hit. rate of the cache and reduces the average number of times the cache needs to be probed. As above, two paired cache lines are given labels, but this time the labels A and B are used and no order of access is implied. The rehash bit associated with a given cache line is set if it is accessed as a rehash location, either during a search or during replacement. If a cache line with its rehash bit set is searched. no attempt is made to search the rehash cache line. Say that lines A and B both hold ‘hash blocks”, that is blocks that would be found during the first pass of a search. In that case neither rehash bit will be set. so if a search of A fails, B will be searched. During the search of B (which will miss) its rehash bit is set. During replacement the block in A will move to B and the block in B will be evicted to make room for the new block. Now if B gets searched first, it will result in a miss (it holds data associated with A) but A will not be searched (because the rehash bit in B was set). The new data. when fetched, will be placed in B and its rehash bit will be reset. The net effect is two-fold. First, a more LRU-like replacement scheme is implemented, resulting in a better hit rate than a hash-rehash cache [3,82]. Second, in some cases a miss does not require two searches through the cache. slightly reducing both the average access time and the number of requests to the cache. Other hash-rehash schemes There have been a number of other proposed schemes related to the hash-rehash cache. Detailing each is outside the scope of this document, but short descriptions of a few are provided in this section. 20 The predictive sequential associative cache [16] is very similar to the column associative cache. However, instead of swapping blocks to keep the MRU block in the first location, a steering bit table determines which location to search first. It also maintains true LRU ordering with a single LRU bit. A number of schemes for predicting the steering bit were also proposed. There are also two caches called the MRU cache. both of which have similarities to hash-rehash caches. The first was used by IBM’s CMOS System/ 370 [18]. 
Here they kept the MRU data of each set on chip, and had the rest of the cache off-chip. They then speculatively used the MRU data whenever the cache was accessed. Because the time to detect the mistake was always the same, the logic required to back out of a misprediction was fairly straight—forward. The other MRU cache [47] simply did a sequential search through a set of the cache. using MRU information to guide the search order. 2.2.3 Victim caches The victim cache is perhaps the most exciting non-standard cache thus far proposed. As originally proposed [45] it has no impact on the access time of an L1 hit (Thu) or on the hit rate (H R). Rather it reduces the average access time for an L1 miss (Tmiss). The only price is a small bit of additional die area. A victim cache is simply a, very small (usually 2 to 16 cache lines) fully-associative cache. It acts just like an L2 cache managed with an exclusive policy [9]. only rather than the normal latency associated with an L2 cache. the small size allows the access to occur in about 1 cycle. So when the cache evicts a block, rather than moving that block to the L2 cache (or just throwing it away) it is moved into the victim cache, and a block from the victim cache is evicted to the L2. Thus, the main L1 cache and the victim cache never store 21 the same block. Even very small victim caches can expect to recover 30% or more of the conflict misses that occur in a direct-mapped L1 [45]. The write buffers in Intel’s Pentium Pro as well as the “victim buffers” of the Alpha 21264 both perform some of the functions of a victim cache. It is worth noting that due to the small size of the victim cache it is quite reasonable to search the victim cache in parallel with the main L1 cache. While this will slightly increase the total latency of the cache, we believe that the access time will be similar to a two-way associative cache, assuming that the victim cache size is kept quite small. 2.2.4 Skew caches Only a small modification to a set-associative cache is required to create a skew cache. As illustrated in Figure 2.7 each bank in a set-associative cache can be thought of as a direct-mapped cache with each of the banks indexed by the same function. In a skew cache a different function is used for each bank. A 2-way skew-associative cache is illustrated in Figure 2.8. In this case two different functions, f0 and f1, are used and a block may be placed in different locations in different banks. Bank 0 Bank 1 address Figure 2.7: A model of a 2-way set—associative cache. [60] Such a simple modification may not seem significant, but studies have shown that a two- 22 Bank 0 Bank 1 address Figure 2.8: A 2-way skewed-associative cache. [60] way skew-associative cache produces hit rates similar to that of a four-way set associative cache [14, 60, 61]. This performance can be attributed to three effects: data dispersion, the effect of hashing, and self-data-reorganization [14,61]. The first effect creates an actual increase in associativity [14]. That is, temporal locality is more helpful to a skew cache than a standard cache of the same associativity. The second effect is an identical effect to that found in the above section on hash caches. The third effect is more of a side-effect of locality. A cache block can go into two different lines. If one line is an area of high—conflict, the block will likely get evicted. Next time it comes in it may go to the other line and, due to the lower conflict, stay there. 
Thus, on relatively slow changing data sets, skew caches can “learn” where to place blocks. 2.2.5 Smart caches All of the caches thus far discussed, both standard and non-standard, achieve their high hit rates purely by taking advantage of locality. While high hit rates are achieved by these caches, there are limits to how successful a cache can be by purely relying upon locality. As these limits are discussed in section 2.1.6 of this document, they will not be dwelled upon here. However, the basic observation is that even a fully-associative cache has non- 23 compulsory misses, and no LRU cache can be expected to perform better than that. Smart caches10 are those caches whose caching schemes rely upon something other than just locality of reference. Many of these schemes rely upon the observed fact that certain accesses are unlikely to be re-referenced in the near future [51,78], and having those refer- ences bypass the cache may be the best solution. By bypassing these low locality references other, ideally more-likely-to—be-used, cache blocks can be kept in the cache. Each of the smart cache techniques utilizes different ways of identifying the low-locality accesses. Additionally, many of these techniques provide methods for reducing the impact of bypassing accesses which have a high degree of locality. We are unaware of any studies which compare these smart caching schemes. The techniques used by these schemes are so diverse that such a study would be extremely difficult to conduct, and even more difficult to conduct fairly. Many of these techniques have not made it past the “proof of concept” stage— comparing them to more refined methods would not. show that one idea was inherently better than the other. While few quantitative comparisons seem to exist, it is possible to detail how some of these schemes work. Assist caches Hewlett-Packard's PA7200 utilizes a smart caching scheme which can be referred to as an assist cache [17]. Those familiar with HP’s recent endeavors will not be surprised that the identification of the low locality memory references is left to the compiler. In fact the assist cache, for which this scheme is named, is little more than the backup in case a high locality access was misidentified. 10The term “active management” is perhaps a more common term than “smart caching.” However, that term seems to imply a dynamic hardware caching scheme, and we Wish to also include compiler based management. 24 The PA72OO has two L1 data caches. The first, called the assist cache, is a physically- indexed, 64—entry, fully-associative cache using a FIFO replacement scheme. The second is a virtually—indexed, physically-tagged, hashed, direct-mapped cache and can be as large as one megabyte. In general, data is first put in the assist cache and evictions from the assist cache are placed in the main cache. The compiler can identify a certain load or store as being. in HP’S terms, “spatially local.” What they mean is that the compiler believes that the line may be accessed again due to spatial locality, but not due to temporal locality. Lines so identified are still put into the assist cache, but upon eviction the main cache is bypassed. This backup by the assist cache allows the compiler to be aggressive in its attempts to identify low locality accesses. Stream buffers Another method of identifying low-locality accesses is the hardware stream buffer [45, 51]. A stream buffer is basically pipe, that is a FIFO queue, where information can only be taken off the top of the pipe. 
By some method, either with information from the compiler or observation of the data stream or both, streams of references are discovered. Streams of references are generally individual load instructions walking though memory with a. known stride. The data can be prefeteched into the FIFOs and the caches can be bypassed entirely. There are many advantages to stream buffers. For one, traditional prefetches into a cache can result in data being evicted before it is used due to conflict in a certain set. Secondly, streaming data usually has very limited temporal locality, and steam buffers reduce the pollution of the cache by this low-locality data [51]. Also, modern DRAM components haven’t been truly random access since the introduction of prRAM [42]. For modern DRAMs, spatial locality can be very important indeed. Reordering the memory accesses 25 so that large chunks of each stream are fetched at once, rather than the program order of interleaved access, can greatly reduce the total latency to fetch data from the DRAM [52]. There are some potential problems with stream buffers. W'hile stream buffers do show significant performance increases on certain programs, non-scientific programs tend to see more limited benefits. This characteristic makes their use questionable in general-purpose processors. Also, stream buffers, unlike most of the. caching schemes discussed so far, generally require compiler help to achieve any significant performance gain. That means that effort by the chip designers is not enough; effort is required by the compiler writers too. It is worth noting that while we are unaware of a commercial chip that uses stream buffers, an experimental chip with stream buffers has been built at the University of Virginia [50]. Cacheable / N on-Allocatable Cache It can be argued that the true beginning of smart caching algorithms was a paper by Tyson et al. [74]. In that paper it is shown that not only do certain load instructions fetch the vast majority of the data, but some load instructions are responsible for a disproportionate amount of the cache misses. This observation means that identifying those instructions can provide large hints about which memory references should bypass the cache. They propose a number of techniques for identifying load instructions which should be bypassed, including compiler / profiling methods and dynamic two-bit schemes that resemble branch predictors. While their proposed schemes do not improve the average hit rate, they do show that memory system bandwidth can be reduced. More importantly this paper provided the fundamental basis for investigation of instruction-based load bypassing. This scheme is called C/ NA or CNA as the bypassed loads are marked as “cacheable but non— allocatable,” meaning that a given cache block may be in the cache but. if it is not, do not 26 allocate space for it in the cache. Other smart caching schemes A number of other smart-caching schemes have also been proposed. Some of the more interesting are briefly described in this section. A rather novel hardware scheme for smart caching was proposed by Johnson and ku [44]. The underlying idea is that certain data structures are, by their very nature, accessed in a low-locality manner. They propose to group the entire memory space into macroblocks and observe the reuse behavior of the various macroblocks. If a given macroblock has low reuse, the data from that block bypasses the cache. 
Under one version of their proposal they add a small buffer to act as a back-up in the event of a misprediction. The initial results of their study were encouraging, but covered a very limited set of programs”. In the literature this scheme is often called the MAT cache [69]. A research group at the University of Michigan has proposed a number of smart caching schemes. Their NTS cache is is similar to the MAT model in that the data addresses are used to predict the locality of a memory location. However, in the NTS cache the metric of locality considers whether the cache block had been used more than once in its previous “tour” of the L1 cache, rather than looking at a macroblock over a much longer time [70]. That some group has also improved the CNA scheme. Wong and Baer proposed a scheme to improve the hit rate of a highly associative L2 cache [78]. They tag each load instruction as either generating “temporal” cache lines or “nontemporal” cache lines. A standard L1 replacement policy is used, but the L2 gives preference to temporal cache lines. They propose both a compiler-based profiling scheme 11For example, programs like TPC-B have a very different access pattern than this scheme assumes. 27 and a hardware scheme to distinguish the two types of load instructions. Their results look both complete and promising, showing as much as a 12% improvement in instructions per clock for some programs, although 1 to 3% seems to be the norm. Anderson et al. study the impact of bypassing and superloading on a miss. A superload moves both the missed line and the surrounding lines into the cache [5]. They propose a hardware scheme that decides to load or superload based upon past behavior of that part of memory. They compare their on-line scheme to the off-line optimal algorithm, both with and without bypassing. The net result is that their scheme shows marked reduction in both the miss rate and bandwidth used when compared to a more traditional LRU cache. However, their scheme also utilizes an unlimited amount of memory to maintain information about previous loads. The net effect is that they show superloading and bypassing to have significant potential in an on-line environment. However, their exact imtflementation appears infeasible at this time. Hallnor and Reinhardt have recently proposed a much more complex smart caching scheme [36]. The fundamental argument they make is that memory is getting so slow compared to the processor speed that it is wise to take extra time on an L2 access in” exchange for a higher L2 hit rate. While this premise may be flawed”, their results are quite promising. Just as others have done, they apply a replacement scheme that uses past history. However, they also use techniques from operating systems to make the L2 cache effectively fully-associative. This associativity comes at higher system latency, but allows them to use their replacement scheme more effectively. The number of misses is greatly reduced, sometimes by as much as a factor of two. 12For example, L2 misses, if rare enough. might be dealt with in other ways, including changing threads or processes 28 2.2.6 Further reading While citations for each of the above non-standard caches have been provided. some of the papers provide outstanding explanations of a wide variety of schemes. In particular [70, 82] provide very nice overviews of hash-rehash caching and smart. caching respectively. 
Chapter 3

On the Measurement of Locality

The purpose of this chapter is to provide an overview of research in the area of locality measures. In the first section of this chapter, the meaning of the term "locality measure" is discussed. In Sections 3.2 and 3.3 various locality measures are classified and explained. Section 3.4 contains a discussion of LRU stack distance, both as a locality measure and in terms of some of its other applications in the literature.

3.1 Locality measures

Measuring the locality of a data stream is an imprecise science. In general, it is an attempt to reduce massive amounts of data, sometimes millions or billions of data points, into something which can be easily understood and used by a human. Each researcher has different goals and hopes for their measure, so each researcher proposes a different scheme. In fact, the motivation for each measure is perhaps as interesting as the measures themselves.

3.1.1 What is locality?

The exact definition of locality is a bit vague, which, as one would expect, contributes to the difficulty of measuring it. However, the term locality, as was briefly mentioned in Chapter 2, is used to describe the fact that there is a dependence between the addresses of previous memory accesses and those of future accesses. A processor tends to access memory locations it has referenced recently (temporal locality) as well as locations near those it has referenced recently (spatial locality). This phenomenon is to be expected—as a program moves through a data structure it will do so in a certain pattern; perhaps "walking" memory at a fixed stride size, or perhaps moving through a linked list or tree.

In reality, a program will generally interleave its accesses to various data structures. As an example, imagine a standard matrix multiply operation. One input array is accessed in row order, one in column order, and the output array is usually accessed in either row or column order. In addition, loop control variables and temporary variables are heavily utilized (although many of those may be allocated to registers and not manifest themselves as memory accesses). Now imagine that the matrix multiply operation is just one of a number of subroutines used by the program. The net effect is that a number of access patterns start, get interleaved with other access patterns, and then end.

These access patterns are much like threads in a tapestry. When examining that tapestry, what should be the goal? One could evaluate the individual threads, the patterns formed by those threads, or perhaps the tapestry as a whole. Each level of granularity is important, and each requires a different perspective.

3.1.2 Locality measures in the literature

The remainder of this chapter summarizes much of the existing literature about locality measures. Section 3.2 provides an overview of the most common way of examining locality: at the macroscopic level. These metrics are generally concerned with capturing the locality of an entire program, and generally rely upon the law of large numbers to function correctly. In that section a brief overview is also provided for a closely related topic, cache-centric locality models. Those models view locality as it is reflected in a cache. Section 3.3 looks at those locality metrics which are generally concerned with a much smaller number of references. These microscopic metrics tend to be most useful when examining certain program constructs or data structures. Many of them are loop-centric.
That is, they assume the existence of a single large looping structure with relatively regular behavior, something generally found in scientific computing. Finally, in Section 3.4 the idea of LRU stack distance is introduced. In addition to other applications, this scheme is clearly a macroscopic locality measure. It was one of the first locality measures of any type, and has a large number of applications. In Chapter 4 this measure is discussed in great detail and novel applications for it are introduced.

3.2 Macroscopic locality measures

Macroscopic locality measures are generally concerned with characterizing the locality of a program or a large portion of a program. Most of the papers in which these models are discussed focus upon predicting the cache miss rate. Their goal is to allow the quick characterization of a large number of programs on a large number of caches in a short period of time. The huge increases in disk space and processing power, as well as the high speed of cache simulators such as Cheetah [68], have made this less important, at least for standard caches, than it was in the past. However, by attempting to predict a miss rate these studies had to generate a model for locality. It is that part of their work, the locality model, on which we concentrate in this section.

3.2.1 Chow's empirical power law

Perhaps the simplest useful model of locality was proposed by Chow in the mid 1970s [19, 20]. He proposed that the miss rate of a cache could be matched to the equation

MR(C) = C^{-\alpha}

where α would be fit to the available data and where C is the capacity of the cache. Chow claimed that the miss rate is "most sensitive to the capacity" and that in his analysis issues such as "block size, the management algorithms and other factors" would be ignored. If the cache is considered to be a fully-associative cache using an LRU replacement algorithm (something his analysis would certainly allow), MR(C) is precisely a prediction of temporal locality as it is defined in Chapter 5.

While outdated and lacking a significant quantity of empirical data to support it, Chow's power law is very important to the study of locality. First, it forms the basis for one of the classic rules of thumb in cache design: increasing the cache size by a factor of 4 results in one half the miss rate. This would be the case if α were 0.5. This general rule, if not the 0.5 value, is experimentally observed in [64] among others.

Chow used his power law as the basis for a mathematical approach to the design of multiple levels of the memory hierarchy [20], a research topic that has continued well into the 1990's [43]. While it is his power law that is relevant to our work, the primary result of [19, 20] was to show that the optimal number of cache levels scales logarithmically with the capacity of the cache hierarchy.

3.2.2 The Fractal Model of Locality

Thiebaut's work [71] on the fractal model of locality is something of a variation on Chow's [20] power-law model from the decade before. How he arrives at his model is very different than Chow. Thiebaut claims that the access patterns of many programs are fractal in nature: having an access pattern in one area of memory and then jumping to a different area of memory and repeating that same pattern. He claims that if this is true, a two-piece power function, very similar to Chow's, should model the ratio of the number of unique memory locations over the total number of memory references.
His function, f(x), takes as an argument a number of successive memory references x and predicts how many different locations those x memory references will access. The fundamental argument is that memory accesses are in fact a random walk through memory. Also, the "jump size" of that random walk tends to be low, but has the occasional long jump in it. Thiebaut states that such a random walk is a special case of a fractal, and that other work in fractals can be used to model this access pattern, and thus the locality of the trace. He admits that other than in the instruction stream, much of the random walk theory is not supported by his data. He believes that this is due to memory accesses actually consisting of many different random walks. We suspect that examining the data stream generated by each individual load and store instruction might provide a way to observe these random walks, if they do indeed exist.

What Thiebaut does demonstrate is that, assuming the random walks are occurring, he should be able to predict the miss rate of a fully-associative LRU cache. His model has two parts: one for the first n_0 memory references, and one for the remainder of the accesses. In effect he produces a formula of the form MR(C) = A C^{\theta} with different values for A and θ for accesses before and after access number n_0.

The results can be criticized for only addressing a very small number of traces. We believe his model is of marginal value, but the underlying idea of data streams being modeled by a random walk is of interest, and may have applications to the stream detection and prediction found in [51] and related works.

3.2.3 Agarwal's analytic cache model

In 1989 Agarwal et al. presented a novel analytic cache model [2]. Although this work is presented as a model of cache behavior, it can be viewed primarily as modeling the locality of a reference stream. Their goal is to describe, with as few parameters as possible, the various properties of the reference stream. They attempt to quantify temporal locality, spatial locality, and the degree of conflict in the reference stream. The reference stream is broken into "time granules," and values are found for each parameter in each time interval as well as for the entire reference stream.

Ignoring multi-programming related issues, this analytic model has four locality parameters. The first is the number of unique blocks that are accessed. This provides a measure of the degree of temporal locality very similar to that used by Thiebaut. The second parameter is the collision rate: an attempt to measure how prone a given reference stream is to evicting live blocks¹. The final two parameters are used as inputs into a two-stage Markov model. The Markov model is used to describe the probability of memory accesses being to successive locations. It is this measure that is used to account for spatial locality.

1. That is, blocks that will be reused.

Their scheme does have its flaws. First, for computational reasons, it relies upon the cache using random replacement rather than LRU replacement. While they do claim that the miss rate for random replacement is a good indicator of a cache with LRU replacement, neither their data nor this author's experiences support this. They report that the model does a reasonable job estimating the miss rate for direct-mapped caches, with an average error of 4% to 15%, depending on the block size. However, two-way set-associative caches have an average error of 14% to 23%.
They also note that their model always predicts too high a miss rate for the set-associative cache, and attribute that behavior to the random replacement policy assumed by their model. While their error rate might seem unacceptably high, it should be noted that the space needed to store this analytic model is at least three orders of magnitude smaller than the trace itself. Further, the results from this model can be computed much faster than actual simulation. Thus, this model is a reasonable alternative for fast design space explorations. The results from such an exploration could be used to determine which cache configurations and traces to study in greater detail.

3.2.4 Grimsrud's locality metric

In the early 1990's Grimsrud, in some cases with others, wrote a number of papers about quantifying locality [32-34]. Their work describes a method of visualizing locality with a three-dimensional graph. The x dimension is spatial locality, the y dimension temporal locality, and the z dimension indicates the probability of an access having this locality value. Specifically, the locality of a memory reference trace T is defined to be L(T, s, d). This is the probability that an address s away from the currently referenced address is, in the remainder of the trace, first accessed d references in the future. More rigorously,

L(T, s, d) = \Pr\left[\, T[x] + s = T[x + d] \;\wedge\; (T[x] + s \neq T[x + 1], \ldots, T[x + d - 1]) \,\right]

To make the graph readable, Grimsrud groups the values of s and d into bins. That is, each value of d and s is rounded up to the nearest power of two. This rounding not only gives a logarithmic view, it also makes the surface of the graph much cleaner and more meaningful. This "binned" version of L is renamed L' and is also called the locality function. By graphing the three-dimensional locality function for a given trace, one can gain an intuitive feel for the type and degree of locality that exists in the trace.

Applications

Grimsrud provides a wide variety of applications for his locality metric. One very useful application is in evaluating artificial trace-generation techniques². While he states that the locality function cannot easily be used to generate an artificial trace, it is possible to check the correspondence between the locality functions of an artificially generated trace and the original trace it was based upon. Grimsrud evaluates a number of techniques, including a stack-based model [6, 8] and a Markov model, concluding that each of the models has certain flaws, and demonstrates the weaknesses of each.

2. Briefly, artificial trace-generation is an attempt to summarize the locality properties of a trace in a set of parameters. This is mainly used to avoid the need to store long traces on a computer [8, 33].

There are a number of other proposed applications [34] including trace characterization, benchmark evaluation, and the ability to visualize locality. The last is the most important in the context of locality measures, and Grimsrud explains how this feature can be used to help one identify loop characteristics including stride and the amount and nature of temporal locality in the reference stream.

3.2.5 Cache based locality

While Chow's, Thiebaut's, and Agarwal's models are cache models in a sense, in each case their results hinge upon their characterization of the reference stream in terms of a few parameters. The interaction of caches with their locality model can be viewed strictly as an application of the locality model.
However, there are also "pure" cache models that have some relevance to the study of locality. These models generally use "ideal" caches, which are unrealistic to actually implement, as a baseline for use in gauging the performance of a more realistic cache. In doing so they also provide a glimmer of insight into the locality of the reference stream being examined.

Perhaps the most widely used model of this type is the "three C's" model [23, 41], developed by Hill and popularized by Hennessy and Patterson in their two computer architecture texts [39, 56]. In this rather simplistic model, cache misses are broken down into three categories: compulsory, capacity, and conflict. Compulsory misses are misses to never-before referenced cache blocks. Capacity misses are those misses on which a fully-associative LRU cache of the same size would also have missed. Conflict misses are misses on which the fully-associative cache would not have missed. Clearly the compulsory misses tell us about the number of unique blocks in the reference stream, and the capacity and conflict misses describe the degree of temporal locality available.

A model closely related to the three C's model was proposed by Sugumar and Abraham in 1993 [68]. They take issue with using a fully-associative LRU cache as an "ideal" cache, and consequently as the basis for comparison. Rather, they suggest that more information could be gathered by using optimal placement algorithms. They instead divide cache misses into four categories: compulsory, capacity, mapping and replacement. Using their notation, let M(L, C, k, repl) represent the number of cache misses, on a given trace, in a cache where L, C, k and repl are the line size, cache size, associativity, and replacement strategy respectively. Table 3.1 shows how their method categorizes those misses. C/L represents a fully associative cache.

Miss type    | Number of misses
-------------|--------------------------------------------------
Compulsory   | M(L, any, any, any)
Capacity     | M(L, C, C/L, OPT) − M(L, any, any, any)
Mapping      | M(L, C, k, OPT) − M(L, C, C/L, OPT)
Replacement  | M(L, C, k, LRU) − M(L, C, k, OPT)

Table 3.1: A classification of misses using the optimal replacement policy. The arguments to the function M() are line size, cache size, associativity and replacement policy, in that order. L is the line size of the base cache, C is its cache size, and k is the degree of associativity.

Suppose we are evaluating a C-byte, k-way associative LRU cache. Compulsory misses would be those misses which occur because the cache block had never before been referenced, a definition identical to that of the three C's model. Capacity misses, however, are defined to be those non-compulsory misses which would also occur in a C-byte optimal, fully-associative cache. Mapping misses are those misses which do not occur in a C-byte fully-associative optimal cache, but do occur in a C-byte k-way associative optimal cache³. A replacement miss is one which occurs in a k-way set associative LRU cache but not in a k-way associative optimal cache. Ideally then, capacity misses are those accesses on which a cache is truly too small to expect to achieve a hit. Mapping misses represent misses due to structural issues (conflict), while replacement misses are those which a better replacement policy would have avoided.

3. Using our terminology from Chapter 2, we would call this a structural miss rather than a mapping miss.

Sugumar and Abraham's scheme has many advantages over the three C's model. For one thing, the three C's model assumes that an access which misses in the fully-associative cache also will miss in a less associative cache.
As is shown by [58, 72] and others, this need not be the case. In fact, as we show in Chapter 4, an access which just barely qualifies as a capacity miss has over a 40% chance of actually being a hit in a direct-mapped cache. Because Abraham and Sugumar use OPT rather than LRU as their basis for comparison, this issue does not arise in their work. In fact, the aggregate definitions in Table 3.1 can never be negative, something not true of the equivalent three C's definitions [68]. Additionally, this OPT-based scheme provides a method of evaluating the replacement policy, something not addressed by the three C's model.

There are other cache studies and metrics which can provide information about the locality of the trace. Liveness, used by [15, 49, 54] among others, describes the percent of blocks in the cache which will be re-referenced sometime in the future (or perhaps within some fixed time window). Additionally, many of the smart caching studies have attempted to find the relationship between locality and the referencing instruction [74] or locality and the referenced location [44]. These were briefly described in Chapter 2.

3.3 Microscopic locality measures

As noted above, microscopic locality measures are concerned with relatively small numbers of memory references, often on the order of hundreds to tens of thousands. This section provides a brief overview of the various microscopic locality studies in the literature.

3.3.1 Instantaneous locality

Instantaneous locality, as proposed by Weikle et al., is perhaps the most recent attempt in the literature to quantify locality [76]. They argue that caches filter locality, just as a lens can filter light or electronic filters can filter signal frequencies. Their goal is to quantify locality in such a way that this filtering can be observed and have a "high correlation to intuition."

An electronic signal is simply a changing voltage which has a single value at any given time. As such, they propose that the locality of the reference stream should also have some value at any given time. The measure they tentatively propose for the instantaneous locality, l_i, of a reference a_i in a stream of references (a_0, a_1, a_2, ...) is:

l_i = \sum_{j=0}^{i-1} \frac{1}{(|a_i - a_j| + 1)\,|j - i|}

The first term, 1/(|a_i − a_j| + 1), corresponds to spatial locality. If the two addresses a_i and a_j are the same, this term will be unity; otherwise it will be (significantly) less than one. The second term, 1/|j − i|, corresponds to temporal locality. If the two accesses are adjacent (that is, i = j ± 1), this term is unity; otherwise it is less than one. When these two terms are multiplied together, a value ranging from nearly zero to one (for two adjacent accesses to the same location) is generated. The summation of these terms determines the value of the instantaneous locality of a_i.

Once a definition of instantaneous locality has been created, the next step is using it to see how well locality is filtered out of the reference stream. In order to visualize this, one other term, the instantaneous hit rate for reference i, is defined as

h_i = \delta_i + \alpha h_{i-1}

where α = 0.5 and δ_i is one if access i was a hit and zero otherwise. This gives h_i a range from zero to two, where higher values indicate a higher recent hit rate.
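The two definitions above translate directly into a short computation. The sketch below evaluates them on a made-up toy trace; it is only an illustration of the formulas as stated, not the authors' own tool.

```python
# Minimal sketch of the instantaneous-locality and instantaneous-hit-rate
# formulas given above. The toy trace and hit/miss pattern are made up
# purely for illustration.

def instantaneous_locality(trace):
    """l_i = sum over j < i of 1 / ((|a_i - a_j| + 1) * |j - i|)."""
    values = []
    for i, a_i in enumerate(trace):
        l_i = sum(1.0 / ((abs(a_i - a_j) + 1) * (i - j))
                  for j, a_j in enumerate(trace[:i]))
        values.append(l_i)
    return values

def instantaneous_hit_rate(hits, alpha=0.5):
    """h_i = delta_i + alpha * h_{i-1}; ranges from 0 to 2 when alpha = 0.5."""
    h, out = 0.0, []
    for delta in hits:
        h = delta + alpha * h
        out.append(h)
    return out

trace = [100, 104, 100, 512, 104, 516]   # toy addresses
hits  = [0, 0, 1, 0, 1, 1]               # toy hit/miss outcomes
print(instantaneous_locality(trace))
print(instantaneous_hit_rate(hits))
```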
With this definition firmly in hand, they show that accesses with high instantaneous locality values tend to have high hit rates, and that the output locality stream (the requests made by the cache) generally has less locality than the input stream. This metric is also used to compare different cache types and to try to recognize program constructs from their instantaneous locality graphs [76]. Another application is that the "bursty" nature of real programs—real reference streams tend to have periods of high locality separated by periods of low locality [10]—can be easily seen and measured.

The authors of this work readily admit that the exact formulas they have selected may not be the best. Rather, they defend the notion that instantaneous locality, as a concept, is both useful and important, and better formulas will only improve the concept. This author would certainly claim that their formula for instantaneous locality greatly underestimates the importance of spatial locality. For example, two memory accesses that are only 8 bytes apart are given a spatial locality measure of 1/9 on a scale of 0 to 1, too low by far in a time when cache lines are often 64 bytes or larger. Another limitation of this work, and why we characterize it as a microscopic locality measure, is that viewing large reference streams is difficult. The largest example in their paper dealt with fewer than 4000 memory references. Of course, with the aid of a computer and some additional techniques, visualizing much larger traces may be possible⁴.

4. They have written a program which aids the analysis of instantaneous locality, but we have not had the opportunity to examine it in any detail.

3.3.2 Loop-based locality metrics

A common form of locality analysis has been based on the study of loop nests. While it is interesting, loop-based study is only peripherally related to our own locality studies (found in Chapter 4) and as such will only be briefly discussed. Two papers have been chosen as representative of the field. McKinley and Temam's work [53] evaluates the applicability of common assertions about program locality in the context of loop locality, while Ghosh et al. [29] generate "cache miss-equations" which can be used to guide code optimizations in the compiler. In the remainder of this section some of the basic jargon of this field is introduced, followed by a brief overview of the two selected papers.

In general, loop-based locality studies are targeted at compiler writers. Usually their purpose is to propose an algorithm for analyzing the memory reference patterns and find a way for the compiler to reduce the number of data cache misses. In doing so, these studies have adopted interesting terminology. Memory references are described in two ways. The first is the three C's model discussed in Section 3.2.5. The other method describes a non-compulsory reference with a three-word phrase consisting of: hit or miss; temporal or spatial; and self or group. So a given memory access might be a "spatial group miss" or a "temporal self hit" or any combination of the three terms. An access to address X by instruction I is called temporal if the last time (which may be now) the block in which X resides was in memory was due to a reference to X itself. Otherwise, the access is labeled as spatial.
If the access was temporal and instruction I was the instruction which had last referenced X, the access is labeled as "self." If the access was spatial and the last instruction that referenced the block was I, the access is also labeled as "self." Otherwise, it is labeled as "group."

This jargon is useful for a couple of reasons. First, it is the only locality terminology of which we are aware that indicates something about the referencing instruction, an important concept for some smart caching schemes [74]. Second, it allows a high-level understanding of the degree of temporal and spatial locality, something which otherwise has required a more complex scheme such as [33]. It also considers the access patterns of hits as well as misses, something the three C's model does not.

Examples of loop-locality from the literature

McKinley and Temam [53] point out that most locality studies examine program locality, while compiler writers tend to concentrate on cache optimizations restricted to looping structures. They find that conflict misses are much more common in looping structures than capacity misses. Also, they observe that loops see more spatial reuse than the overall program. While neither of these results comes as any surprise, they do provide further evidence that most of the cache misses occur between looping structures. This observation means that rather than focusing on the "low-hanging fruit" of intra-loop cache placement, inter-loop caching issues need to be examined more carefully by compiler writers.

Ghosh et al. [29] provide a much more traditional view of loop locality. They generate a set of mathematical equations for certain highly-restricted looping structures. These equations can be used to determine the optimal tiling and array padding required to minimize the miss rate. In creating these equations they have managed to quantify locality in a very different way than any other study considered in this thesis. While it is unclear if their methods would be useful to a human attempting to understand loop locality, they are clearly useful for an optimizing compiler.

3.4 Stack distance

A program trace consists of the ordered set of addresses referenced by a program. Imagine pushing each memory reference onto a stack. If the reference had occurred earlier, it would be removed from its current position in the stack and then pushed onto the top of the stack. We define the stack distance of a reference R to be the depth in the stack from which it was fetched. If R is not on the stack (because it had never before been referenced) we define its stack distance to be ∞. This concept has been used in [8, 13, 33, 41, 49, 58, 64]. An equivalent definition of stack distance is to count the number of unique references which have occurred since the last reference to R.

The earliest paper which used stack distance in this context was by Mattson et al. [49], where they defined the top of the stack to be the 'one' location rather than the more standard convention of the top of the stack being the 'zero' location. Others followed suit, and for the sake of consistency with previous work we adopt the same notation in the remainder of this chapter. Thus, the stack distance of a reference is the number of unique references which have occurred since the last reference to that location, plus one. Figure 3.1 illustrates the computation of the stack distance value.
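The same computation can be written as a short sketch. The code below tracks the LRU stack and reports each reference's stack distance under the convention just adopted (top of stack is one, first-time references are infinite); the example trace is made up.

```python
# Minimal sketch of the stack-distance computation just described
# (Mattson convention: top of stack has depth 1; first-time references
# get a distance of infinity). The example trace is made up.

import math

def stack_distances(trace):
    stack = []                             # most recent reference at index 0
    dists = []
    for ref in trace:
        if ref in stack:
            depth = stack.index(ref) + 1   # top of stack counts as 1
            stack.remove(ref)
        else:
            depth = math.inf               # never referenced before
        stack.insert(0, ref)               # push (back) onto the top
        dists.append(depth)
    return dists

print(stack_distances(['A', 'B', 'C', 'A', 'B', 'B', 'C']))
# -> [inf, inf, inf, 3, 3, 1, 3]
```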
[Figure 3.1: Illustration of the stack distance of a reference stream.]

[Figure 3.2: A graph of the locality (the sd and SD values) found in the reference stream of Figure 3.1.]

3.4.1 Stack-distance distribution functions

Given a memory-reference trace T consisting of memory references (t_1, t_2, ..., t_L), define f(T, i, d) to be 1 if the i-th reference in T has a stack distance of d, and 0 otherwise. Define the functions sd(T, d) and SD(T, d) as follows:

sd(T, d) = \frac{1}{L} \sum_{i=1}^{L} f(T, i, d)

SD(T, d) = \begin{cases} 1 & \text{if } d = \infty \\ \sum_{i=1}^{d} sd(T, i) & \text{otherwise} \end{cases}

That is, sd(T, d) is the probability that in the trace T a randomly selected reference will have a stack distance of d, while SD(T, d) is the probability that such a reference will have a stack distance of d or less. Figure 3.2 is a graph of the values of sd(T, d) and SD(T, d) for the trace used in Figure 3.1.

3.4.2 Stack distance in the literature

Stack distance, as a means of examining reference streams, has been around for over three decades [49]. In the literature one can find it used as an analytic model of locality [26, 41, 58, 65], as a method to generate artificial traces [6, 8, 33], and as part of an algorithm for hardware simulation [49, 68]. The remainder of this section addresses these applications.

Stack distance as an analytic model of locality

It is fair to say that the Stack Distance Model (SDM) was one of the most popular models of cache behavior during the late 1970s [58, 64], slowly replacing the Independent Reference Model (IRM) and the Working Set Model (WSM) [24, 58]. The major difference between the WSM and the SDM is that the WSM counts the number of intervening references, rather than the number of unique intervening references, between successive accesses to the same location. The WSM and the SDM are fairly similar and, for the most part, represent the same information. However, the WSM is both larger to store (its maximum size is bounded by the number of accesses rather than the size of the working set) and is generally more difficult to utilize. A detailed explanation of each of these models, and other related models, can be found in [26].

In 1977, a Stanford technical report was written by Rau on the properties and applications of the stack model [58]. Rau's highly mathematical paper develops a number of important results for the stack distance model when applied to fully-associative caches. Rau also shows that given sd(T, d), one can derive a number of values including the distribution governing how long an item will stay in the cache, some occupancy results for multi-level caches, and how stack distance values relate to set-associative caches. Most relevant to our work in Chapter 5, Rau finds that if ms(q, a) is the miss rate for a set-associative cache with q sets of size a, its value can be computed as:

ms(q, a) = 1 - \sum_{i=1}^{N} sd(i) \, B_a(i)

These results assume that the cache blocks are randomly mapped into the cache. That is, any given cache block has an equal chance of conflicting with another block. This assumption is empirically demonstrated in [65] as well as in Chapter 5. Also, the realization that hash caches (discussed in Section 2.2.1) generally contribute very little in the way of performance gain hints that this assumption is reasonable.
Rau briefly discusses the function B_a(i) and concludes that a more highly set-associative cache will result in a lower miss rate if (but not only if) sd(i) ≥ sd(i + 1) for all i, that is, if sd is a non-increasing function. In Chapter 5 we re-derive B_a(i) and examine it in much greater detail.

A year after Rau's report, Smith proposed a variation on this caching model [65] and then refined it slightly in 1982 [64]. His and Rau's equations are equivalent, but Smith changes the order of the summations and makes a few other cosmetic changes. Smith is mainly concerned with generating an analytic model for p(i) using the stack distance model, combined with common sense as a reality check. Smith also provides empirical evidence that the actual miss ratio and predicted miss ratio are very similar. He provides five traces and simply graphs the actual and predicted miss rates and shows that they are very similar. While this evidence is intriguing, [2], among others, criticize this work as not being validated on a wide variety of traces.

Hill and Smith again describe the same analytic model of locality in their 1989 work [41]. This time a much larger set of programs is used to validate the model. They show that the vast majority of programs they test achieve a miss rate within 5% to 10% of the value predicted by the model. These results are more satisfying, but no attempt is made to explain the error or to reduce it.

3.4.3 Algorithmic use of the stack model

The first application of stack distances we are aware of came in 1970. Mattson et al. proposed using stack-based algorithms to quickly simulate the hit rate of caches over a wide range of values and replacement policies [49]. Their algorithms apply to only a limited number of replacement algorithms, but those include the LRU, FIFO, optimal, and certain pseudo-random replacement algorithms. Mattson et al. focus on fully-associative caches⁵ for nearly all of the paper, but do address maintaining separate stacks for each set in the cache for less-associative caches. The majority of the paper is dedicated to a stack-based optimal placement algorithm [11] and its associated proof of correctness. This paper is interesting because it clearly sparked the idea of using a stack-based model as a locality metric as well as a means of artificial trace generation. Further, this paper is the first to propose an efficient method of computing the optimal placement policy of a cache, and would not be surpassed in that respect until work by Sugumar and Abraham [67, 68].

5. It is interesting to note that this paper never actually uses the terms "associativity" or "cache." Rather, they use the terms "congruence mapping" and "buffer" respectively. We assume that this is because the terminology used today did not exist or was not standard at that time.

3.4.4 Trace compression

Trace-driven simulation is a commonly used method for evaluating the behavior of a memory hierarchy [75]. The trace is gathered by running the program, either in a simulator or on real hardware, and recording every memory access that occurs. This address trace can then be used to drive a variety of memory hierarchies. Pure simulation, where the processor and memory hierarchy are simulated in tandem, will provide more accurate data than trace-driven simulation. However, trace-driven simulation is considerably faster and, because the future accesses are already known, allows for off-line replacement algorithms such as OPT to be studied.
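As a concrete illustration of trace-driven simulation, the sketch below drives a small set-associative LRU cache with a recorded address trace. The cache parameters and the toy trace are illustrative assumptions, not the configurations studied in this dissertation.

```python
# Minimal sketch of a trace-driven cache simulation of the kind described
# above: a set-associative, LRU-replacement cache driven by a recorded
# address trace. Sizes and the toy trace are illustrative assumptions.

from collections import OrderedDict

def simulate(trace, line_size=32, num_sets=64, assoc=2):
    sets = [OrderedDict() for _ in range(num_sets)]   # one LRU list per set
    hits = 0
    for addr in trace:
        block = addr // line_size
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            s.move_to_end(block)          # mark as most recently used
        else:
            if len(s) >= assoc:
                s.popitem(last=False)     # evict the LRU block
            s[block] = True
    return hits / len(trace)

toy_trace = [i * 32 for i in range(128)] * 4   # stride through 4 KB, four times
print(simulate(toy_trace))
```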
In 1981 Babaoglu proposed keeping the stack distances found in each trace rather than storing the trace itself [8]. He then proposed generating an 'artificial trace' from these values using a Monte-Carlo method. While his technique requires that some number of accesses be wasted warming up the cache, it did allow outstanding savings in terms of disk space. The size of the information stored depends solely upon the largest stack distance considered—it is independent of the size of both the original and the generated trace. It would be expected that the same errors encountered by Hill's analytic model [41] would also occur when using Babaoglu's technique.

There exist other stack-distance techniques which allow for perfectly accurate simulations of higher levels of the memory hierarchy. The most recent and probably best of these techniques is used in the area of virtual memory simulations [46]. The basic idea is that if all memory components have at least an associativity of k, some portion of the actual memory trace is redundant—mainly those references with a stack distance of less than k. They show that for values of k = 5 the page-level traces are reduced by a factor of about 20 on average. While a minimum associativity of 5 is probably unrealistic for standard cache simulations, it is quite realistic when considering a TLB as the first level cache of the page table.

3.4.5 Conclusion

In this chapter a large number of locality measures were discussed, concluding with an overview of the stack distance measure of locality. Most of these techniques are used to address a specific issue: for example, compiler loop optimizations, understanding how locality is filtered by a cache, or compressing memory reference streams. None of these techniques is intended to be a tool for gaining insight into whole-program locality. The models of locality proposed by Weikle and Grimsrud come the closest in terms of intent and ability, but both of these work best on the small scale: on a single loop or a small segment of code respectively. In the following chapter it is argued that the stack distance model is a fine general technique for understanding locality and how caches interact with that locality. Further, it can be used to identify excessive conflict, explain the behavior of non-standard caches, and show how caches filter locality out of a reference stream.

Chapter 4

Stack distance as a locality measure

As described in Chapter 1, there is a widening gap between DRAM speeds and processor performance. This widening gap has simultaneously made the interaction between locality and caches more complex as well as more important. The additional complexity is, in part, due to the fairly modern addition of the non-standard cache schemes discussed in Chapter 2. Those caches, while similar to previous cache schemes, none-the-less interact with locality very differently than standard caches. Further, locality itself is modified by the filtering effect of caches. This filtering means that a reference stream to an L1 cache will have significantly different locality properties than the reference stream to an L2 cache [77]. As discussed in Chapter 3, simple rules of thumb such as the 2:1 Cache Rule and the three C's model [23, 39, 41] over-simplify the situation by making assumptions about the locality of reference streams and even about how that locality interacts with the caches.
As also discussed in Chapter 3, other models of locality are either too complex [34], too narrow in scope [29, 53], or otherwise inappropriate [76] for considering the locality of an entire program. This whole-program locality is important mainly for cache designers and compiler writers.

The difficulty of understanding whole-program locality makes it very difficult for a cache designer to propose appropriate cache schemes or to understand how and when non-standard cache schemes will provide benefit. Without this understanding the only real options are to rely upon flawed rules of thumb or upon hit-rate results on a large set of simulations. Neither is satisfactory and both can lead to poor designs. Relying upon hit-rate results is problematic because the designer needs to worry about a huge set of programs, including programs which do not yet exist. This need results in huge amounts of simulation activity as well as significant effort spent on prognostication. A model of locality would, at the least, allow the cache designer to narrow the set of programs used by eliminating those with similar locality properties. The model also could be used to find trends and make more informed predictions about future workloads. That said, the most important use of the model would not be in simply improving current techniques but in allowing the cache designer to better understand locality and how that locality interacts with the various possible cache designs.

It is less obvious that compiler writers also have a need to understand locality at the program level. It is clear and well accepted that "microscopic" locality, as discussed in Chapter 3, is extremely important to compiler writers. However, whole-program locality is mostly a function of the program being compiled rather than the compiler itself. Where whole-program locality is important is in the context of data layout. If indeed the whole-program locality is determined by the program and not the compiler, the only whole-program issue that can be addressed by the compiler is conflict between the various memory references in the cache. The locality and cache models can be used to predict hit rates given random conflict between references. If, through profiling, it can be determined that a given data layout has conflict that is worse than random, it may be worth having the compiler try a different data layout. That is, the cache and locality models can provide guidance as to expected hit rates, and this can be tested against the actual hit rates achieved. If the actual hit rate is worse than the predicted value, a new data layout can be proposed and tested. It may be that the program itself has led to this conflict (for example by striding through a one-dimensional array), but it may be that the arbitrary layout chosen by the compiler has led to 'excessive' conflict.

This chapter provides a unified method for understanding and quantifying locality and how caches interact with that locality. The chapter addresses general concepts of locality (Sections 4.1 and 4.2), interactions with non-standard caches (Section 4.3), as well as applications (Section 4.4) and future work (Section 4.5). Even though there are many details found in this chapter, our primary goal is to provide insight into caching and locality.

4.1 The stack distance model of locality revisited

In this section we describe two models which are based upon the stack distance metric.
The first is the locality model; that is, describing the locality of a trace by using the terms sd(T, d) and SD(T, d) as defined in Chapter 3. This model of locality allows for a concise description of the locality of a trace. As is shown later in this chapter, the stack-based model of locality can be used to help understand the degree and type of locality in a trace. Such understanding can otherwise be difficult to achieve when dealing with millions or billions of data references.

The second model is a cache model. It describes how a cache interacts with the locality of a reference stream. Specifically, it predicts the expected hit rate of memory references of a given stack distance. It is important to note that this model is arrived at independent of a specific reference stream: the model describes how the cache can be expected to interact with any reference stream. As is shown later in this chapter, this model can greatly aid in the understanding of caches as well as understanding the real impact of cache structure, associativity and size.

These two models can be used together to produce a number of interesting results. These results include the ability to predict the hit rate a reference stream will achieve on a given cache. None-the-less, it is important to keep in mind that these two models are distinct. The locality model is simply a way of describing the locality of a memory reference stream, while the cache model describes how a cache interacts with the locality of a memory reference stream.

4.1.1 LRU stack distance revisited

In Chapter 3 the idea of stack distance was introduced. In this section that introduction is refined and used to describe the characteristics of the various SPEC CPU2000 benchmarks, hereafter referred to by the name "Spec". These benchmarks are targeted at measuring CPU performance, but are also the standard for measuring L1-cache performance. These benchmarks are designed to mostly fit in the cache hierarchy, and it is well known that these benchmarks do not generally put a significant load on any cache other than the L1 [21].

The Spec benchmarks run for minutes to hours on real hardware and generate well over a hundred billion memory references. Gathering data on such large runs in simulated hardware is extremely time consuming, and storing such large traces, even on modern disk drives, is impossible. As such, we limit our studies to samples of one billion memory references for each benchmark. A reasonable period of time is spent allowing the program to finish its initialization and begin doing actual work (at least 100 million memory references) before this sample is gathered. Unless otherwise noted, all traces are gathered from optimized binaries for the DEC Alpha and run on the SimpleScalar simulator. The SimpleScalar website, www.simplescalar.com, has details about this simulator as well as the executables used. These traces were then run through a modified version of the Cheetah simulator [67] to generate the data found in this section. These results were also cross-checked against other, much less efficient, simulators written by this author and others.

Recall from Section 3.4 that SD(T, d) is the percent of memory accesses in a memory reference trace T that have a stack distance of d or less. Equivalently, SD(T, d) is the hit rate achieved by a fully-associative cache consisting of d + 1 cache lines on that same trace. This function is termed the cumulative stack distance of the trace.
Figure 4.1 provides a graph of the cumulative stack distance for the first three benchmarks in the floating-point portion of the Spec benchmark. For purposes of readability the X-axis has a logarithmic scale. As will be true throughout this chapter, the cache line size is 32 bytes. Notice that were the cache fully associative, art and ammp both would see very little improvement in hit rate from a size of 50 lines (about 2KB) to 2000 lines (about 65KB). However, ammp would see a huge reduction in its miss rate from 2000 lines to 16000 lines (256KB) while art would see only a minor improvement.

At this time it is worthwhile to emphasize the fact that characterizations can be made about the relative locality of the various traces. For example, Figure 4.1 clearly shows that art has less locality than the other two traces. Historically the characterization of a trace was more ad hoc. One might examine hit rates over a number of different caches, varying size and associativity. However, in general only "reasonable" caches would be considered: generally powers of two and in the range of 4-64KB in terms of size, and usually of small associativity.

P_{HIT}(B, A, D) = \sum_{i=0}^{A-1} \binom{D}{i} \left(\frac{A}{B}\right)^{i} \left(1 - \frac{A}{B}\right)^{D-i}     (4.4)

Unless otherwise noted, the formulation of Equation 4.4, as well as the order of arguments, will be used when computing P_HIT for standard caches. Figure 4.3 provides a graphical representation of these values for three caches of different associativity, each with 128 cache lines. Notice that the more associative caches do better on references with a low stack distance while the less associative caches do better on data with a higher stack distance. Also note that the crossover point is approximately the same as the number of cache lines.

[Figure 4.3: Illustration of the theoretical relationship between hit rates, associativity, and the locality of a given memory reference. Predicted hit rate versus stack distance (0 to 500) for direct-mapped, 2-way associative, and 4-way associative caches of 128 lines.]

A few interesting properties of these formulas are worth mentioning, although formal proofs are not provided. The first is that

\sum_{D=0}^{\infty} P_{HIT}(B, A, D) = B

That is, no matter the associativity, the area under the curve is constant; it is simply distributed over different points. The intuitive argument for this result is that at any given time B blocks are stored in the cache. Say that the status of the cache was sampled N times and the stack distance of each block in the cache was recorded. After a large number of samples, taking the count of each stack distance and dividing it by N should yield the curve described by Equation 4.4. Clearly, as B values are recorded at each step, the summation over all of the stack distances must be B.

Another interesting fact is that for positive integer values of B and D, and where the associativity X is greater than Y, it is the case that:

\sum_{i=0}^{D} P_{HIT}(B, X, i) \;\geq\; \sum_{i=0}^{D} P_{HIT}(B, Y, i)

That is, if you consider two caches of different associativity but the same number of blocks and sum the area under the curve from zero to some point D, the less associative cache will never have a greater sum than the more associative cache. One can observe this phenomenon in Figure 4.3: the less associative cache has more area to the right of any given stack distance.
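To make the formulation concrete, the sketch below evaluates the random-conflict form of Equation 4.4 as reconstructed above, with arguments (B cache lines, associativity A, stack distance D), and numerically spot-checks the two properties just discussed. The parameter choices are illustrative only.

```python
# Sketch of the P_HIT formulation of Equation 4.4 as given above
# (B = cache lines, A = associativity, D = stack distance, random conflict),
# plus numeric spot-checks of the two properties discussed in the text.
# The parameter choices are illustrative.

from math import comb

def p_hit(B, A, D):
    p = A / B                       # chance an intervening block maps to our set
    return sum(comb(D, i) * p**i * (1 - p)**(D - i) for i in range(A))

B = 128
# Property 1: the area under each curve is (approximately) B.
for A in (1, 2, 4):
    area = sum(p_hit(B, A, D) for D in range(20 * B))
    print(A, round(area, 2))        # each should be close to 128

# Figure 4.3 behavior: higher associativity wins at small stack distances,
# lower associativity wins in the tail, with the crossover near B.
for D in (32, 128, 256):
    print(D, round(p_hit(B, 1, D), 3), round(p_hit(B, 4, D), 3))
```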
As mentioned in Chapter 3, Rau formally derived a corollary to the above result: if sd(T, d) > sd(T, d + 1) for all values of d, then it is the case that a more associative cache can expect to achieve a higher hit rate.

4.2 Validation of the cache model

The cache model relies upon one fundamental assumption: conflict in a cache is random. This assumption may seem somewhat counter-intuitive for two reasons. First, array-based programs often stride through memory at even intervals [39]. This striding can result in excessive conflict in some sets while others are virtually unused. The second reason to cast doubt on the assumption of random conflict is more significant: spatial locality. Recall from Chapter 2 that standard caches are organized so that two accesses cannot conflict if they are within S − 1 blocks of each other. For most modern computer systems this means that two cache blocks will never conflict if they are placed into the same page. Because of spatial locality it can reasonably be expected that memory references near each other in time are less likely to conflict with each other than random selection might indicate.

In this section it is demonstrated that while the assumption of random conflict is not exact, it is fairly close. Specifically, we do the following:

- Choose two cache configurations, a 4KB direct-mapped cache and a 4KB two-way associative cache, and show that the predicted hit rates at each stack distance map fairly well to the actual hit rates.
- Show that over a wide range of cache configurations the model generally predicts the hit rate within 2% of actual values.
- Discuss what the error in predicted hit rates means.

The last point is expanded upon in Section 4.4.2.

Graphing actual vs. predicted hit rates as a function of stack distance

Figure 4.4 is a graphical representation of the accuracy of our cache model over all the integer benchmarks. The cache in question is a 4KB direct-mapped cache. Obviously the integer benchmarks perform almost exactly as the theory predicts. Accesses with low stack distances seem to do slightly better than predicted. This small difference is actually quite important, as three-fourths of all accesses from the SpecInt suite have a stack distance below 10. (You may recall this from Figure 4.2.) None-the-less, the integer benchmarks match theory remarkably well.

[Figure 4.4: Stack distance hit-rates for SpecInt on a 4KB direct-mapped cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

On the other hand, Figure 4.5 shows that the same data for the floating-point benchmarks is significantly more noisy. While the general trend does track the predicted values (other than perhaps the tail end of the curve), it is nowhere near as close as the integer benchmarks. This huge difference in locality properties is due to the very nature of the programs. The floating-point benchmarks tend to be fairly small kernels of code working over large data sets. As such, the access patterns tend to repeat, which reduces the random nature of conflict. The integer programs tend to have larger kernels and do not generally operate on the same data sets over and over again.

Simply to show that the same general results hold for an associative cache, Figures 4.6 and 4.7 provide the same information for two-way associative caches.
The integer results remain very close to the prediction, while the floating-point values may be slightly less noisy than they were on the direct-mapped cache.

[Figure 4.5: Stack distance hit-rates for SpecFP on a 4KB direct-mapped cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

[Figure 4.6: Stack distance hit-rates for SpecInt on a 4KB two-way associative cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

[Figure 4.7: Stack distance hit-rates for SpecFP on a 4KB two-way associative cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

4.2.1 A comparison of actual hit rates to predicted values

Where Section 4.2 illustrated the accuracy of the caching model by showing how well it predicted the hit rate at each stack distance over a set of benchmarks, here we measure how well the model predicts actual hit rates. The results are provided by benchmark, cache size, and cache structure. While general trends in accuracy are discussed, the primary goal of this section is to give additional validation for the cache model as well as to provide some insight into what errors should be expected.

Nineteen traces were run on 21 different cache configurations. Those configurations are the cross-product of three cache schemes (direct-mapped, direct-mapped hash, and 2-way set-associative) with seven cache sizes (1KB, 4KB, 8KB, 16KB, 32KB, 64KB and 128KB). All caches have a 32-byte line size. In order to summarize these 399 data points into a useful set of information, tables have been provided which show the average results when two of the axes (benchmark, cache scheme, cache size) are held fixed. The results provided are the average predicted hit rate, the average actual hit rate, the average of the signed errors, and the average absolute error. The floating-point and integer benchmarks are considered separately.

Benchmark | Predicted hit rate | Actual hit rate | Average signed error | Average absolute error
----------|--------------------|-----------------|----------------------|-----------------------
ammp      | 71.00%             | 71.68%          | -0.68%               | 0.68%
applu     | 82.40%             | 83.26%          | -0.86%               | 1.27%
art       | 57.49%             | 58.48%          | -0.99%               | 1.24%
equake    | 95.76%             | 97.21%          | -1.45%               | 1.45%
fma3d     | 96.63%             | 96.50%          | 0.13%                | 0.87%
galgel    | 81.41%             | 82.87%          | -1.47%               | 2.10%
lucas     | 81.43%             | 82.58%          | -1.15%               | 1.85%
mesa      | 94.24%             | 95.15%          | -0.91%               | 0.99%
mgrid     | 84.27%             | 80.31%          | 3.97%                | 4.90%
swim      | 76.35%             | 79.28%          | -2.93%               | 2.93%
wupwise   | 94.07%             | 94.94%          | -0.88%               | 1.00%
Average   | 83.19%             | 83.84%          | -0.66%               | 1.75%

Table 4.2: Hit rates and predicted hit rates for the SPECfp2000 benchmarks.

As can be seen from Table 4.2, the quality of prediction varies quite a bit depending on the benchmark. Three of the benchmarks have average absolute errors of 2% or more. Recall that if the actual hit rates are better than the predicted values, the conflict in the memory reference stream is lower than it would be if the conflict were purely random.
This lower-than-random conflict is generally thought to be a result of spatial locality and the fact that two lines sufficiently close together will not conflict with each other. As discussed at the start of this chapter, poorer hit rates than predicted are due to heavy conflict, usually attributable to striding. Notice that mgrid has the largest error and is one of only three benchmarks which does worse than predicted. This information shows that the data layout generated by the compiler for mgrid has significant room for improvement. More information about the detection of excessive conflict can be found in Section 4.4.2.

Benchmark | Predicted hit rate | Actual hit rate | Average signed error | Average absolute error
----------|--------------------|-----------------|----------------------|-----------------------
bzip2     | 94.80%             | 95.28%          | -0.47%               | 0.57%
crafty    | 88.21%             | 88.87%          | -0.66%               | 0.73%
gap       | 97.93%             | 98.42%          | -0.49%               | 0.49%
gcc       | 89.36%             | 89.96%          | -0.60%               | 0.60%
gzip      | 94.30%             | 95.25%          | -0.95%               | 0.95%
mcf       | 81.09%             | 82.36%          | -1.27%               | 1.27%
parser    | 91.61%             | 92.29%          | -0.68%               | 0.70%
vpr       | 89.21%             | 88.60%          | 0.61%                | 1.40%
Average   | 90.81%             | 91.38%          | -0.56%               | 0.84%

Table 4.3: Actual and predicted hit rates for the SPECint2000 benchmarks.

Examining Table 4.3, the error in hit rate for the integer benchmarks is significantly less than what was found in the floating-point benchmarks. Once again the predicted hit rates are generally a bit lower than reality, while only vpr suffers from more conflict than random.

Configuration  | Predicted | Actual | Average signed error | Average absolute error
---------------|-----------|--------|----------------------|-----------------------
Direct-mapped  | 82.52%    | 83.43% | -0.99%               | 2.00%
Hash           | 82.52%    | 83.14% | -0.61%               | 0.97%
Two-way        | 84.77%    | 84.96% | -0.19%               | 0.47%

Table 4.4: Prediction errors grouped by cache configuration for the SPECfp2000 benchmarks.

Configuration  | Predicted | Actual | Average signed error | Average absolute error
---------------|-----------|--------|----------------------|-----------------------
Direct-mapped  | 90.20%    | 90.65% | -0.45%               | 1.07%
Hash           | 90.20%    | 91.06% | -0.85%               | 0.97%
Two-way        | 92.04%    | 92.42% | -0.39%               | 0.47%

Table 4.5: Prediction errors grouped by cache configuration for the SPECint2000 benchmarks.

One way to think of the cache model is that it takes hit rates for a fully-associative cache and attempts to predict how less associative caches will behave. Unsurprisingly, therefore, Tables 4.4 and 4.5 show that the 2-way associative cache has significantly less error associated with it than the two non-associative caches. What is unusual is that hash caches are supposed to reduce the amount of conflict in a cache without reducing the value of page-level locality. However, on the floating-point benchmarks the hash cache actually harms performance. In any case, the model has the most problems predicting the behavior of the direct-mapped cache. While that error is not trivial, it is plain that the model tracks actual performance fairly well.

Cache Size | Predicted | Actual | Average signed error | Average absolute error
-----------|-----------|--------|----------------------|-----------------------
1k         | 73.35%    | 74.28% | -0.92%               | 3.22%
4k         | 84.03%    | 84.80% | -0.77%               | 2.15%
8k         | 86.94%    | 87.41% | -0.47%               | 1.25%
16k        | 88.90%    | 89.21% | -0.32%               | 0.81%
32k        | 90.09%    | 90.57% | -0.48%               | 0.60%
64k        | 90.75%    | 91.21% | -0.46%               | 0.55%
128k       | 91.23%    | 91.62% | -0.39%               | 0.56%

Table 4.6: Prediction errors grouped by cache size.

Finally, Table 4.6 combines the integer and floating-point results as they are impacted by cache size. Clearly the model is much better at predicting the hit rates of larger caches than smaller ones.

From this section a number of things can be concluded. First, the cache model is more accurate when modeling large, associative caches than it is when modeling smaller, direct-mapped caches. Secondly, and as noted in [40], the cache model only does a moderate job of actually predicting hit rates.
Lastly, and perhaps most importantly, the error in prediction is due to the assumption that conflict is random, and this is not entirely a bad thing. Later in this chapter these errors will be exploited in a novel way to gain additional understanding about caching and conflict.

4.3 Expansion of the model to non-standard caches

It is appropriate at this time to reflect on the stack model as discussed thus far. Recall that in Section 4.1 PHIT for a given distance was shown to be independent of the locality of the trace. In this section it is shown that this property does not always hold for non-standard caches. Two non-standard caches are examined: an LRU skew cache and a true-LRU victim cache. We show that the skew cache cannot be correctly modeled under the assumption that PHIT is independent of sd(d), while the true-LRU victim cache can be so modeled.

4.3.1 An analytic LRU skew-cache model

The skew cache was described in Section 2.2.4. An LRU skew cache is simply one where replacement is always to the least-recently used location of those in which the new line can be placed. Building such a cache can be difficult [60], but near approximations are possible [14].

As previously mentioned, a stack-distance based skew-cache model which is independent of sd(d) is not possible. The fundamental reason is that the model proposed in Section 4.1 relies upon the fact that only the number of unique accesses between successive references to the same cache line is important. That is, it did not matter if those intervening lines were accessed once or 100 times; either way the effect would be the same. With a skew cache, that is not the case. The reason is that if a line is evicted from the skew cache and then re-referenced, it might occupy a different location than it did the first time. This concept, which to our knowledge no one has ever noticed or even had the language to discuss, is best explained by way of an example.

Impact of sd(d) on PHIT

In order to make this example readable, some terminology is introduced. Assume a two-way associative skew cache. The two ways of the skew cache are named A and B. Line i in way A will be called A[i], with a similar notation used for B. Say a given block, Y, can be placed into either the ith line in A or the jth line in B. We use the terminology Y -> {A[i], B[j]} to represent that situation.

Assume that between two successive references to cache line X -> {A[1], B[1]} there are references to three other lines, L -> {A[1], B[2]}, M -> {A[2], B[2]}, and N -> {A[2], B[2]}, as shown in Figure 4.8. Assume that X is initially placed into block A[1]. For the moment, assume that M, N, and L are each referenced exactly once. Notice that if L is referenced after M and N have both been referenced, X will be evicted. Further, notice that if L is the first of the three lines (of L, M, and N) referenced, X will not be evicted. If L is the second of these three lines to be referenced, there is a 50% chance X will be evicted (if the previous reference was placed in B[2]). Thus, averaging over the six possible orderings of the references to these three lines, there is a 50% chance of X being evicted before it is re-referenced. In this exact situation, we observe that for the second reference to X, PHIT = 0.5 and its stack distance is three.
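That 50% figure can be checked by brute force. The sketch below enumerates the six orderings of L, M, and N, together with both resolutions of the tie that arises when the first of M or N finds its two candidate slots equally old; the slot names and mappings are the hypothetical ones from the example, and every slot other than A[1] is assumed to hold data older than X.

```python
from itertools import permutations

# Candidate slots for each block, taken from the example above.
CANDIDATES = {
    'L': [('A', 1), ('B', 2)],
    'M': [('A', 2), ('B', 2)],
    'N': [('A', 2), ('B', 2)],
}

def x_evicted(order, tie_to_b):
    """Simulate LRU skew placement for one ordering of L, M, N.
    tie_to_b resolves the tie when both candidate slots are equally old."""
    stamp = {('A', 1): 0}                 # X placed at time 0; absent slots are older
    for t, block in enumerate(order, start=1):
        s1, s2 = CANDIDATES[block]
        a1, a2 = stamp.get(s1, -1), stamp.get(s2, -1)
        if a1 == a2:                      # both untouched: break the tie
            target = s2 if tie_to_b else s1
        else:                             # replace the least recently used slot
            target = s1 if a1 < a2 else s2
        stamp[target] = t
    return stamp[('A', 1)] != 0           # True if X's slot was overwritten

cases = [(o, tb) for o in permutations('LMN') for tb in (False, True)]
p_evict = sum(x_evicted(o, tb) for o, tb in cases) / len(cases)
print(p_evict)   # 0.5, matching the PHIT = 0.5 observation above
```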
Now, assume that each of these three lines is referenced a large number of times before X is re-referenced. The stack distance for that second reference to X remains the same: three. However, as the number of repetitions goes to infinity, the probability of X being evicted goes to one. The reason for this is that all that has to happen for X to be evicted is that there is some reference to L after both M and N have been referenced. In that case L will be assigned to A[1], evicting X. In this situation, for the second reference to X, PHIT is approximately 0 although its stack distance remains three.

Figure 4.8: Example showing how locality interacts differently with a skew cache than a standard cache.

Thus, without attempting to generalize the effect, it is clear that repeating a reference to the same memory location may not change the stack distance of some other reference, but it can change PHIT. Notice that the distribution of sd(d) plays a role in this. For our example, if SD(2) = 0 there will be no repetitions of L, M, or N between the two references to X. If SD(2) = 0.999, the probability of repeats will be quite high. Therefore, sd(d) has an impact on PHIT.

Moving forward with the model

The previous example is enough to show that for a skew cache, sd(d) can have an impact on PHIT. While acknowledging that observation, we nonetheless create a model which assumes that PHIT is independent of sd(d). It is then shown that for real memory reference traces such an assumption introduces only minimal error. For purposes of this derivation, it is assumed that each intervening memory location is referenced exactly once.

Our goal is to find the probability that a memory access with stack distance D will achieve a hit. This is equivalent to determining the probability that a given memory reference, R, will remain in the cache after D unique accesses have occurred. For this derivation, assume a two-way associative skew cache where each way of the cache consists of X cache lines. Label the first way A and the second way B. Assume without loss of generality that the reference in question, R, is to a block placed in A.

Define a cache block to have been "touched" if one of the D references that occur after R is placed in that cache block. Define a to be the number of lines in bank A that have been touched since immediately before R was referenced. Define b similarly to a, but for bank B. Notice that by these definitions, when D = 0 it is the case that a = 1 and b = 0.

Treat the triple (a, b, D) as the state of a cache where R remains in the cache. That is, we are only keeping track of the states of the cache where R would still be found in the cache. For a given state, there are at most three states which could have preceded it. The previous state may have been (a - 1, b, D - 1), in which case the Dth reference mapped to a block in A that had not previously been touched. Similarly, the previous state may have been (a, b - 1, D - 1), in which case the most recent reference mapped to a block in B that had not previously been touched. Going from state (a, b, D - 1) to (a, b, D) occurs if the most recent reference mapped to blocks already touched in both A and B, but not to the block in A holding R.

The probability of moving from state (a, b, D - 1) to state (a, b, D) is ((a - 1)/X)(b/X), which is exactly the probability of mapping to a touched line in A that is not R multiplied by the chance of mapping to a touched line in B. Changing state from (a, b - 1, D - 1) to (a, b, D) requires that the most recent reference is to an untouched block in B and a touched block in A, or that it was to an untouched block in both A and B but B was the LRU block of the two (which is assumed to happen 50% of the time). The probability of the reference mapping to an untouched block in B is (X - (b - 1))/X, while the probability of it mapping to
a touched block in A (including the one holding R) is a/X. The probability of the reference mapping to blocks which are untouched in both A and B is ((X - (b - 1))/X)((X - a)/X). In Table 4.7 the probability of each transition is provided.

Transition from     Probability
(a, b, D - 1)       ((a - 1)/X)(b/X)
(a - 1, b, D - 1)   ((X - (a - 1))/X)(b/X) + (1/2)((X - (a - 1))/X)((X - b)/X)
(a, b - 1, D - 1)   ((X - (b - 1))/X)(a/X) + (1/2)((X - (b - 1))/X)((X - a)/X)

Table 4.7: Transition probabilities for a two-way associative skew cache.

Let the term P(a, b, D) be the probability of getting to state (a, b, D). Recall that reaching state (a, b, D) implies that R is still in the cache. Using Table 4.7, it is trivial to derive Equation 4.5.

P(a, b, D) = P(a, b, D - 1) ((a - 1)/X)(b/X)
           + P(a - 1, b, D - 1) [((X - (a - 1))/X)(b/X) + (1/2)((X - (a - 1))/X)((X - b)/X)]
           + P(a, b - 1, D - 1) [((X - (b - 1))/X)(a/X) + (1/2)((X - (b - 1))/X)((X - a)/X)]      (4.5)

The initial condition is that when D = 0, a = 1 and b = 0. So P(a, b, 0) = 1 if a = 1 and b = 0; otherwise P(a, b, 0) = 0. Now define P_HIT^skew(X, D) to be the probability that a skew cache with two banks of size X achieves a hit on an access with a stack distance of D.

P_HIT^skew(X, D) = sum over i = 0..X and j = 0..X of P(i, j, D)      (4.6)

As Equation 4.6 indicates, the probability of a hit occurring at a stack distance D is simply the probability of reaching a state (-, -, D), where the "-" indicates any value. Figure 4.9 shows the values of P_HIT^skew(64, D), that is, the chance of a hit occurring in a skew cache of 128 cache lines at a given stack distance. PHIT for a two-way associative standard cache of the same size is also displayed to provide some context. Notice that the skew cache is slightly "more associative" than the 2-way associative cache.

Figure 4.9: Predicted hit rate for a skew cache and a standard 2-way associative cache. Both are 128 lines in size.

A more real situation

As noted in Section 4.3.1, P_HIT^skew does depend upon sd(d). Using simulation, one can show the relationship between our predicted value of P_HIT^skew and how it is impacted by sd(d). For most realistic sd(d) distributions, our prediction is fairly close to reality. In Figure 4.10, PHIT is graphed for the analytic model as well as for two different sd(d) distributions: that of the integer Spec benchmarks (as seen in Figure 4.2) and a uniform (flat) distribution where the first 300 stack distances each have a 0.3% chance of occurring. It is worth emphasizing that these graphs are not of actual traces. Rather, they are traces which have been artificially generated to have the appropriate sd(d) distributions (see Section 4.4.1 for details). For a standard cache there would be no significant variation between these three lines.

Figure 4.10: A magnified look at the 'flaws' in the skew cache analytic model. (Curves: analytical prediction; SpecInt distribution; flat distribution.)
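Both the analytic curve of Figure 4.9 and the analytical-prediction line of Figure 4.10 come from evaluating Equations 4.5 and 4.6. A minimal dynamic-programming evaluation is sketched below; it assumes, as the derivation does, that each intervening block is referenced exactly once.

```python
def skew_hit_curve(X, max_d):
    """P_HIT^skew(X, D) for D = 0..max_d, via Equations 4.5 and 4.6.
    X is the number of lines in each of the two banks. P[a][b] holds the
    probability of being in state (a, b, D) with R still resident; any
    probability that 'leaks' corresponds to R having been evicted."""
    P = [[0.0] * (X + 1) for _ in range(X + 1)]
    P[1][0] = 1.0                       # initial state: a = 1, b = 0 at D = 0
    curve = [1.0]                       # a re-reference at distance 0 always hits
    for _ in range(max_d):
        nxt = [[0.0] * (X + 1) for _ in range(X + 1)]
        for a in range(1, X + 1):
            for b in range(0, X + 1):
                p = P[a][b] * ((a - 1) / X) * (b / X)
                if a >= 2:              # arrived from (a - 1, b, D - 1)
                    ua = (X - (a - 1)) / X
                    p += P[a - 1][b] * (ua * (b / X) + 0.5 * ua * ((X - b) / X))
                if b >= 1:              # arrived from (a, b - 1, D - 1)
                    ub = (X - (b - 1)) / X
                    p += P[a][b - 1] * (ub * (a / X) + 0.5 * ub * ((X - a) / X))
                nxt[a][b] = p
        P = nxt
        curve.append(sum(map(sum, P)))  # Equation 4.6
    return curve

# For example, skew_hit_curve(64, 500) gives the skew-cache line of Figure 4.9.
```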
Another issue with respect to the skew cache model involves the assumption that if an access maps to untouched blocks in A and B, it has a 50% chance of going to either of them. In fact, knowing that the first block, R, was placed into A tells us a little about A and B: at the time R was placed, there was at least one location in B newer than the line R replaced in A. This leads to a small bias. In Section 4.3.3 we show empirically that providing a 0.501 chance of the line being mapped to B in a 128-line skew cache significantly reduces this already very small effect.

4.3.2 An analytic LRU victim cache model

The victim cache, described in Section 2.2.3, is another type of non-standard cache. Briefly, it consists of two parts: a large main cache and a much smaller fully-associative component.¹ It is usually the case that the large main cache is direct-mapped to reduce access time. Lines evicted from the direct-mapped component are placed in the fully-associative component, which is managed like a separate cache.

¹It is more common to refer to the smaller fully-associative component as the victim cache. However, for reasons of convenience and clarity, the term 'victim cache' is used here to describe the two components together, functioning as a whole.

A 'true-LRU' victim cache is identical to the usual victim cache except that the fully-associative component is managed using global LRU information. That is, when a block gets evicted from the direct-mapped component, its "time stamp" is compared to those of the blocks in the fully-associative component. The LRU block is evicted from the victim cache. If the evicted line was in the fully-associative component, the line currently in the direct-mapped component is moved into that location. In any case the direct-mapped component is now unoccupied, leaving room for the block which caused the eviction. In the remainder of this section we provide an analytic model which describes how a true-LRU victim cache interacts with the locality of a reference stream.

Derivation of the model

To facilitate this derivation, the following terminology is introduced. The direct-mapped and fully-associative components are said to contain a total of S1 and S2 cache lines respectively. Let the direct-mapped component consist of cache lines labeled l1, l2, ... lS1. Further, if X is a cache block which maps to location li in the component, let this be stated as X is in li. As before, the goal is to compute the probability that a given cache block will still be in the cache after D unique accesses have occurred. The fundamental observation is that an LRU victim cache will evict a block, X, in cache line lx from the combined components if, and only if, the following two conditions are met:

- Some other block is accessed after X and is also mapped into lx.

- At least S2 other blocks have been accessed and then evicted from the direct-mapped component since X was last accessed.

The first condition must be true, otherwise X will still be in the direct-mapped component. The second must be true, otherwise X can be in the fully-associative component. Notice that the order in which these two conditions occur is not important; once both have occurred, X will be evicted from the victim cache.

Say that there are exactly D unique memory blocks accessed between two accesses to block X. In order to meet the first condition, at least one of those D blocks must be mapped into lx. Meeting the second condition requires that no more than D - S2 different cache lines be utilized by those D accesses. That is because only those lines accessed after X can contribute to the S2 blocks needed to evict X from the fully-associative component of the LRU victim cache.
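Before constructing the analytic expression, it is worth noting that this structure is easy to simulate directly. The sketch below is a minimal model of the true-LRU victim cache just described; the modulo index function and the treatment of a buffer hit as a swap back into the direct-mapped slot are illustrative assumptions rather than details taken from the text.

```python
class TrueLRUVictimCache:
    """Direct-mapped component of s1 lines plus a fully-associative buffer
    of s2 lines, managed with a single global LRU ordering."""

    def __init__(self, s1, s2):
        self.s1, self.s2 = s1, s2
        self.dm = {}        # set index -> (block, last-use time)
        self.fa = {}        # block -> last-use time (at most s2 entries)
        self.clock = 0

    def access(self, block):
        self.clock += 1
        idx = block % self.s1                     # assumed: modulo placement
        hit = (idx in self.dm and self.dm[idx][0] == block) or (block in self.fa)
        if block in self.fa:
            del self.fa[block]                    # the block returns to its DM slot
        if idx in self.dm and self.dm[idx][0] != block:
            victim, vstamp = self.dm[idx]         # line displaced from the DM component
            if len(self.fa) < self.s2:
                self.fa[victim] = vstamp
            else:
                lru = min(self.fa, key=self.fa.get)
                if self.fa[lru] < vstamp:         # a buffer line is the global LRU:
                    del self.fa[lru]              # evict it, keep the displaced line
                    self.fa[victim] = vstamp
                # otherwise the displaced line is itself the LRU and is dropped
        self.dm[idx] = (block, self.clock)
        return hit
```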
Given the above result, it is possible to construct a relationship which computes the probability of a given set of accesses touching no more than D - S2 different cache lines. The same assumption of "random conflict" used before is used once again.

The number of times an access to a given block repeats has no influence upon the probability of X being evicted. Consider the possibilities if one of the D blocks is accessed repeatedly. Label the block which is repeatedly accessed as R. Upon the repeated access, R is currently in one of three places: the direct-mapped component, the fully-associative component, or it has been evicted from the cache.

- If R is in the direct-mapped component, the only change is to update the access time of R. As it was already more recently used than X, this action has no effect on X.

- If R is in the fully-associative component, some other of the D accesses must have evicted it from the direct-mapped component. R will be placed in the direct-mapped component while the other block will either go into the fully-associative component or be evicted. Both of these blocks are more recently used than X, so the swapping has no impact on whether or not X will be evicted.

- If R has been evicted, and X is still in the LRU victim cache, it must be the case that X is in the direct-mapped component; otherwise X would have been evicted from the fully-associative component before R. At this point X will be evicted from the entire victim cache once an access occurs that uses lx. As R clearly does not map to lx (otherwise X could not be in the direct-mapped component), repeated accesses to R have no impact.

Thus, the number of times any of the D accesses repeat has no influence upon the probability of X being evicted.

The analytic model

Define P_HIT^victim(S1, S2, D) to be the probability that an LRU victim cache with a direct-mapped component of size S1 and a fully-associative component of size S2 will achieve a hit on an access with a stack distance of D. For fixed values of S1 and S2, let F(D, U) be the probability that at a stack distance of D exactly U different cache lines have been accessed, none of which is lx. For the same fixed values of S1 and S2, let E(D, U) be the probability that at a stack distance of D exactly U different cache lines have been accessed, one of which is lx. It then follows that:

P_HIT^victim(S1, S2, D) = sum over U = 0..D of F(D, U)  +  sum over U = D - S2 + 1..D of E(D, U)

Figure 4.16: Impact of associativity on cache filtering

Generation of artificial traces

Another application of the locality model is its ability to generate artificial traces using a simple Monte-Carlo technique. This was briefly described in Section 3.4.4, and is described in detail in a paper by Babaoglu [8]. The net effect is that it is trivial to generate a memory reference trace which has any stack-distance properties desired. Babaoglu's concern was with the lack of storage space needed to keep long traces. While this is still an issue, disk-drive space has been outpacing CPU power for many years, and the bound on trace-driven simulation is now mainly CPU power and the network bandwidth needed to move these large traces around. Further, these artificially generated traces are not perfect replicas of the traces they are replacing. For one thing, they have truly random conflict. On these traces a hash cache will, on average, perform exactly as well as a standard cache.
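A minimal version of that Monte-Carlo generator is sketched below. It assumes the stack-distance distribution is supplied as a mapping from distance to probability, with the leftover probability mass producing never-before-seen blocks (compulsory references); the exact bookkeeping Babaoglu uses may differ.

```python
import random

def generate_trace(sd_dist, length, seed=0):
    """Generate a synthetic trace whose LRU stack-distance distribution
    approximates sd_dist (distance -> probability). Distance d means the
    referenced block sits at depth d of the LRU stack (depth 0 = most
    recently used); unassigned probability produces new blocks."""
    rng = random.Random(seed)
    dists = sorted(sd_dist)
    weights = [sd_dist[d] for d in dists]
    reuse_mass = sum(weights)
    stack, next_block, trace = [], 0, []
    for _ in range(length):
        d = rng.choices(dists, weights)[0] if rng.random() < reuse_mass else None
        if d is None or d >= len(stack):
            block = next_block              # a compulsory (first-touch) reference
            next_block += 1
        else:
            block = stack.pop(d)            # re-reference the block at depth d
        stack.insert(0, block)              # move it to the top of the LRU stack
        trace.append(block)
    return trace
```

Because each reuse distance is drawn independently, any conflict in the generated trace really is random, which is exactly the limitation noted above.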
Even given the above limitations, these artificially generated traces have many interesting uses. For one, it is possible to generate a reference trace that has certain specific locality properties. This feature can be useful when, for example, you want to know how a change in the locality stream presented to the L1 cache will impact the reference stream seen by the L2 cache. Mathematically analyzing this problem is very difficult, and searching for a real trace that has the locality properties in question will be time consuming and may not yield a match.

Another use of an artificially generated trace is as a point of comparison to a real trace. For example, Hallnor and Reinhardt [36] have proposed a replacement scheme which relies on the idea that past heavy utilization of a cache line will likely result in future heavy utilization. For a trace generated by the Monte-Carlo method, the degree of past utilization has nothing to do with the future utilization. The characteristics of the actual trace can then be compared with the characteristics of the artificial trace. We define a streak of memory accesses to be the list of accesses in a given trace which fit in a given cache line. The length of a given streak is defined to be the number of memory accesses in that streak. Figure 4.17 provides a graphical representation of what percent of memory accesses are in a streak of length N or more, for both an actual trace (the first 10 million accesses in the gcc trace) and a similarly sized artificial trace generated using the same stack distances as the original trace. As can be seen, accesses in the actual trace are much more likely to be part of a long streak than are those of the artificial trace. Certainly an empirical model of expected streak length could be generated and used instead of the artificial trace, though doing so seems fairly difficult. Instead we showed that Hallnor and Reinhardt's assumption about heavy utilization being predictive of future utilization is quite accurate for this trace.

Figure 4.17: Percent of accesses in a streak of size N or greater. (Curves: real trace; artificial trace.)

Further, artificial traces can often be used as a point of comparison when there is a desire to separate out the impact of locality from other trace characteristics. For example, they can be used to eliminate the impact of hashing (and thus the reduction in non-random conflict) while keeping the locality of a given reference stream. This would allow for a measurement of the reorganization ability of a skew cache [60].

4.4.2 Applications of the cache model

The cache model can also be used to gain insight into the behavior of caches and reference streams. In this section some of those insights are explored.

Associativity vs. cache size

The inside cover of the most widely used computer architecture text provides the following under "Rules of Thumb": "2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as a two-way set-associative cache of size N/2." Figure 4.18 shows the expected hit rates for N = 128.
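The two curves in Figure 4.18, and the "areas" discussed below, can be reproduced with the set-associative hit-probability model used throughout this chapter. The sketch assumes the standard binomial form of that model (a reference at stack distance D hits if fewer than k of the D intervening blocks fall into its set); it is consistent with, but not copied from, Section 4.1.2.

```python
from math import comb

def p_hit(D, sets, ways):
    """Probability that a reference with stack distance D hits in a
    'ways'-way set-associative cache with 'sets' sets, assuming the D
    intervening blocks map to sets uniformly at random."""
    q = 1.0 / sets
    return sum(comb(D, i) * q**i * (1 - q)**(D - i) for i in range(ways))

# Direct-mapped, 128 lines vs. two-way associative, 64 lines (Figure 4.18).
dm = [p_hit(d, 128, 1) for d in range(2000)]
tw = [p_hit(d, 32, 2) for d in range(2000)]

two_way_edge = sum(t - m for m, t in zip(dm, tw) if t > m)   # roughly 0.6
dm_edge      = sum(m - t for m, t in zip(dm, tw) if m > t)   # roughly 64.6
```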
If the rule of thumb is true and one ignores excessive conflict (and, as was seen in Section 4.2, there is actually less conflict than random for most traces, so this is very reasonable), it must be that there are so many memory references in the range of 1 to 23 that the small advantage the two-way associative cache has in that range is enough to compensate for the huge difference in hit rates after stack distance 23. Figure 4.19 magnifies this portion of the graph. The "area" of that small region is slightly larger than 0.6, while the area between the two curves from stack distance 24 onwards is slightly more than 64.6.

Figure 4.18: Impact of associativity on cache filtering. (Curves: direct-mapped, 128 lines; 2-way, 64 lines.)

None of the above analysis should be too surprising; after all, Figure 4.2 showed that the vast majority of accesses have a very low stack distance. What is interesting is that the 2:1 Cache Rule is really a statement about the locality of most reference streams (a large percentage of memory references have very low stack distances) rather than about associativity or cache size. The point of this discussion is not the validity of the rule of thumb. Rather, it is that using our model of cache behavior the underpinnings of the rule of thumb can be understood.

Figure 4.19: Impact of associativity on cache filtering. (Curves: direct-mapped, 128 lines; 2-way, 64 lines.)

For example, consider the results from Section 4.4.1 on cache filtering. It was shown that there are very few low-locality accesses in an L2 reference stream. Combining this fact with Figure 4.18 would lead us to believe that the 2:1 Cache Rule is unlikely to apply to L2 caches; the needed locality just is not there.

Measuring conflict in a reference stream

One important issue in caching is the measurement of conflict. The caching models described in this chapter assume that conflict is random, knowing full well that spatial locality will cause the conflict to be less than random while striding and related issues will cause conflict which is greater than random. In this section the issue of quantifying conflict is addressed.

First, the fact that a hit rate can be predicted assuming random conflict allows a trivial measure of the non-random conflict found in the trace. Clearly the difference between the actual hit rate and the predicted hit rate is due to non-random conflict. If the predicted hit rate is greater than the measured hit rate, there is clearly some source of excessive conflict due to the data layout. The problem is that sources of excessive conflict can be masked by sources of scant conflict, so that even if the predicted hit rate exactly matches the actual hit rate there is still quite a bit that could be achieved by changing the data layout. Our assumption is that the sources of excessive conflict involve striding or other heavy use of a single set, while scant conflict is due to a high degree of page-level spatial locality. The actual hit rate reflects the impact of both excessive and scant non-random conflict, while the predicted hit rate assumes that neither type of non-random conflict exists.
As mentioned in Chapter 2, a "one's complement" cache (a rather complex hash cache) has the effect of reducing many of the sources of excessive conflict while having very little impact on the scant conflict caused by page-level locality [81]. While it is not exactly true that a one's complement cache removes all of the bad while leaving the good, it does come fairly close. A stride can still cause excessive conflict in a one's complement cache (though only one whose length is not a power of two), and two highly utilized variables could now map to the same set where they did not previously. Thus, while a one's complement cache does not perfectly keep the sources of scant conflict while removing the sources of excessive conflict, it does the best of any scheme of which we are aware.

The three hit rates (direct-mapped, one's complement, and predicted) can be used to provide a rough quantification of conflict. Scant conflict can be measured as the difference between the hit rate of the one's complement cache (assumed to contain only sources of scant conflict) and the predicted hit rate. Excessive conflict is measured as the difference between the hit rate of the one's complement cache and the actual hit rate achieved by a direct-mapped cache. Table 4.8 shows the results over the Spec benchmarks. Notice that the excessive conflict is occasionally negative, which is likely due to the random chance that two highly used variables which did not conflict in the direct-mapped cache do conflict in the one's complement cache. More rarely the scant conflict is negative. This result is due to the one's complement cache not doing a perfect job of removing excessive conflict.

benchmark   direct-mapped   one's complement   predicted   excessive conflict   scant conflict
ammp        92.28%          91.80%             90.07%      -0.48%               1.73%
applu       79.27%          79.33%             76.09%       0.05%               3.24%
apsi        75.00%          75.00%             75.01%       0.00%              -0.01%
art         77.04%          77.33%             75.49%       0.30%               1.84%
bzip2       95.20%          95.18%             95.19%      -0.02%               0.00%
crafty      82.17%          83.17%             83.05%       1.00%               0.12%
eon         90.77%          92.59%             90.10%       1.83%               2.49%
equake      94.46%          92.21%             91.99%      -2.26%               0.22%
fma3d       95.40%          95.87%             95.25%       0.47%               0.62%
galgel      98.41%          98.46%             98.28%       0.05%               0.18%
gap         90.83%          90.83%             90.83%       0.00%               0.00%
gcc         89.28%          88.58%             87.22%      -0.70%               1.35%
gzip        95.14%          95.13%             95.13%      -0.02%               0.00%
lucas       97.57%          97.54%             93.29%      -0.03%               4.25%
mcf         90.83%          90.83%             90.83%       0.00%               0.00%
mesa        94.07%          97.21%             93.27%       3.14%               3.94%
mgrid       69.57%          75.44%             75.91%       5.87%              -0.47%
parser      94.10%          94.08%             93.14%      -0.02%               0.94%
perlbmk     85.79%          85.85%             85.65%       0.06%               0.19%
swim        96.81%          96.67%             92.98%      -0.14%               3.68%
twolf       92.79%          95.25%             94.04%       2.47%               1.21%
vortex      92.92%          92.90%             91.99%      -0.01%               0.92%
vpr         98.26%          97.48%             95.55%      -0.78%               1.93%
wupwise     97.31%          97.32%             96.04%       0.00%               1.28%
Average     -               -                  -            0.45%               1.24%

Table 4.8: Measurement of excessive conflict

While the above measures are not perfect, we know of no previous work which can determine if a low hit rate is due to poor data layout as opposed to a low degree of locality.³ Looking again at the data in Table 4.8, it is obvious that eon, mesa, mgrid and twolf have significant room for improvement in their data layout. For a compiler writer this information can provide focus to a study of what is wrong with the data layout, and could even be used as feedback in a profiling tool.

³The 'three C's model' of caching [23, 41], discussed in Chapter 2, does make a rough attempt to distinguish these two concepts, but it is so coarse in its granularity as to be nearly worthless as a quantitative measure [13].
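The arithmetic behind Table 4.8 is simple enough to state directly; the sketch below encodes the two differences just described and checks them against the mgrid row of the table.

```python
def conflict_components(direct_mapped, ones_complement, predicted):
    """Split non-random conflict into 'excessive' and 'scant' parts from
    three hit rates (all in percent), as described above."""
    excessive = ones_complement - direct_mapped   # conflict a hash mapping removes
    scant = ones_complement - predicted           # benefit of page-level locality
    return excessive, scant

# mgrid, from Table 4.8: direct-mapped 69.57, one's complement 75.44, predicted 75.91
print(conflict_components(69.57, 75.44, 75.91))   # about (5.87, -0.47)
```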
For a cache designer, the scant conflict values provide insight into the importance of page-level locality, while the excessive conflict values provide insight into the importance of using a hash cache or other technique to remove conflict.

4.4.3 Examining non-standard caches

The last area of application considered is understanding non-standard caches. The ability to see how a cache interacts with locality can make comparisons and analysis much easier.

Skew cache replacement policies

In his various papers on skew caches, Seznec has proposed a number of replacement schemes, including a time-stamp LRU scheme and a single bit per set [60-62, 73]. In this section those two schemes are compared to a novel scheme which uses partial time stamps. This novel scheme is evaluated by using the cache modeling techniques presented in this chapter as well as by looking at traditional hit-rate numbers.

Figure 4.20 demonstrates how the skew-cache performance is impacted by changing from the single-bit replacement scheme to a true-LRU scheme under the 100% miss-rate model described in Section 4.3.3. Clearly the LRU scheme has significantly better performance. The problem with the LRU scheme is that it has fairly high overhead. In a skew cache, unlike a standard cache, any given line might need to be compared to any of the lines in the other way of the cache. To the best of this author's knowledge, no one has formally proposed a method to perform true-LRU replacement in a skew cache. In a conversation with Seznec in 1997, he indicated that the only way his group saw to perform this replacement scheme was using a time stamp. This would be implemented with a counter that is incremented upon every cache access. The line accessed, whether a hit or a miss, would have the current value of that counter stored along with the tag for that data (in the same way LRU information is usually stored). When a replacement decision is required, the 'time stamps' stored with the two potential candidates for eviction would be compared and the line with the smaller value would be replaced. When the counter overflows and returns to zero, the cache will either need to be purged or the time stamp values will somehow need to be updated. For a large enough counter this overflow will be very rare (say a few times a minute for a 32-bit counter) and the performance impact small.

Figure 4.20: Illustration of the impact of the replacement scheme on performance for skew caches. The direct-mapped cache is provided as a reference and all caches consist of 128 lines.

Another option is to use a counter that counts up to a much smaller value but handles the overflow of the counter correctly. In order to get LRU right the vast majority of the time (arbitrarily say 199 times out of 200), the scheme needs to be able to handle about 4x misses, where x is equal to the number of lines in the cache.⁴ As an example, consider a cache with 128 lines (4KB) that has an expected hit rate of 90%. Every tenth access will cause a miss. Thus, the counter needs to be able to count up to 10 * 4 * 128, which requires at least a 13-bit counter. The 13-bit time stamp would require 208 bytes of additional storage, which increases the amount of non-data storage needed by nearly 50%.
(For a 32-bit processor, 26 bits would be needed for each tag in addition to state bits consisting of at least a dirty bit and a valid bit.) As the cache gets larger the relative overhead goes up considerably, since as the cache size doubles one more bit is required for LRU information while one less is required for the tag. In any case, the LRU information is well less than 10% of the total size of the cache for any reasonably sized cache.

No matter how large the cache, finding which cache line holds the LRU block requires that the time stamps associated with both cache lines be subtracted from the current counter and then a comparison performed.⁵ For the 128-line example this would be a 13-bit subtractor followed by a 13-bit comparator. That should likely be fairly similar in speed to the 32-bit comparator proposed above. Including routing to and from the comparator, the replacement decision may take more than one cycle, which should not pose a significant problem as the cache will be waiting for the block to be fetched from the L2 cache (or memory) in any case. The complexity of a multi-cycle replacement policy would be minor but potentially troublesome.

⁴The 4x value comes from empirical results. It can also be deduced by graphing the 100% miss-rate model for a skew cache and noting that only about 0.5% of the accesses have a stack distance that is greater than 4x.

⁵While there may appear to be problems with overflow in the subtraction, as the current counter value could be smaller than either of the cache-line time stamps, in reality this is not a problem: the overflow can be ignored and the correct result will occur when using a standard subtractor.

Reducing the number of bits required for the time stamps would have many obvious benefits; the two most significant would be a reduction in the size of the tags and a reduction in the time required to make the replacement decision. The easiest way to perform that reduction would be to store only some of the bits of the time-stamp counter in each of the tags. Experimentation using artificial traces showed that, over the Spec benchmarks, storing only bits 6 through 10 of the time stamp (where 0 is the least significant bit) provided performance that was fairly close to that of a 32-bit time stamp. More time-consuming simulation over real traces of length 10 million confirmed this. Table 4.9 shows the results over these Spec traces. While the improvement for either scheme is fairly small⁶, the much shorter 5-bit time stamps get 90% of the improvement achieved by the full 32-bit time stamps. Table 4.9 also shows how well this 5-bit scheme compares to the 32-bit time stamp and the 1-bit replacement scheme over all the Spec benchmarks. Clearly the 5-bit scheme does nearly as well on average as the 32-bit time stamp but suffers somewhat in the extreme cases.

Replacement scheme   Average hit rate   Improvement over 1-bit replacement
                                        Average    Min       Max
32-bit time stamp    92.72%             0.21%      0.14%     1.77%
5-bit time stamp     92.70%             0.19%     -0.31%     0.99%

Table 4.9: A comparison of replacement schemes for skew caches. Hit rate improvement is in terms of absolute improvement over the one-bit replacement scheme.

Victim caches

In Section 4.3.2 an analytic model for "true-LRU" victim caches was presented. Much like skew caches, time stamps or the equivalent are required to maintain this perfect LRU ordering between the two structures. The other, more standard, scheme is to have the two structures maintain LRU information separately. We call these schemes "true-LRU" and "separate-LRU" respectively.

⁶Seznec saw much more significant differences on the memory reference traces he used. This difference is very much a function of the compiler.
Figure 4.21 shows that the difference between the two schemes is quite large when using the 100% miss-rate model. However, the separate scheme performs much better when using real data. The reason is that an n-entry fully-associative component of the cache contains the last n evictions. Under the 100% miss-rate model these evictions are simply the last n accesses, resulting in the stack distance vs. hit rate graph being the same as that of the base cache shifted to the right by n.

Figure 4.21: Illustration of the impact of the replacement scheme on performance for victim caches. (Curves: victim, separate-LRU replacement; victim, true-LRU replacement.)

Under a reference stream with a greater degree of locality, the separate-LRU replacement policy improves significantly while the true-LRU policy sees no improvement at all. Figure 4.22 shows the improvement that occurs when a higher-locality reference stream is used, in this case a memory reference stream that has the same locality as gcc. Recall that the true-LRU replacement policy is unaffected by the input locality. That same graph shows the impact of using the same 5-bit time stamp replacement policy discussed in Section 4.4.3 for the skew cache. As was done in the skew cache section, Table 4.10 shows the impact of the 5-bit time stamp and the 32-bit time stamps when compared to the more standard replacement policy.

Replacement scheme   Average hit rate   Improvement over separate-LRU
                                        Average    Min       Max
32-bit time stamp    91.95%             0.11%     -0.17%     0.83%
5-bit time stamp     91.91%             0.08%     -0.41%     0.74%

Table 4.10: A comparison of replacement schemes for victim caches. Hit rate improvement is in terms of absolute improvement.

Figure 4.22: Illustration of the impact of locality on the hit rate when using a 'separate-LRU' replacement scheme. This graph also shows the impact of using 5-bit time stamps.

4.5 Future work

There is certainly a large amount of work that can and is being done to expand on the work in this section.

- The ability to identify excessive conflict can be very valuable. It can be used to evaluate the quality of the data layout of a compiler. This can help in evaluating compilers in addition to being useful as a profiling technique. Certainly some reference traces, like our mgrid trace, had excessive conflict. Informing the compiler of this excessive conflict so that it can reorder the data layout could be a source of significant performance improvement. Further, comparing the conflict in these reference traces to that found in other work [13], it is obvious that certain compilers simply do a better job laying out data. This can be a helpful piece of information to a compiler writer. Preliminary work has started on this, comparing the gcc, MIRV, and lcc compilers. We are currently looking to get the optimized SPEC binaries compiled by the Intel Reference Compiler.

- Finding an analytic model for multi-level caches has been an insurmountable hurdle.
Modeling the interaction of locality and the cache characteristics is extremely complex. While there may be little real value to such a model, this work seems somewhat incomplete without it.

- A "better than LRU" replacement policy may be possible using information about the locality of the memory reference stream. While this is almost certainly not realistic to implement in hardware, it may be able to provide something of an upper bound on possible performance under different assumptions.

- Spatial locality is currently only addressed with respect to a single cache line size. Providing a means of understanding spatial locality would be useful when constructing caches that use more than one line size.

- The primary use of the work in this chapter is as a tool to help understand caches and how they interact with locality. The next step in that direction is to make the tools widely available and otherwise be evangelical about the techniques found here. To that end a website is being built as part of an NSF grant. It will provide traces, simulators, graphs, and data to help get the ideas out into the world.

4.6 Conclusion

Every student of computer architecture knows that caches perform well because of spatial and temporal locality. However, few attempts have been made at quantifying this fairly abstract concept of locality. In this chapter a measurement of temporal locality was proposed, with spatial locality being implicitly dealt with through cache line size. This quantification of locality leads to a much fuller understanding than previous models or rules of thumb have provided.

In this chapter, an analytic model of how standard caches interact with locality was presented. While very similar models had been proposed in the past by Rau, Smith, Hill and others, we do not know of any attempt to use these models as a method of understanding locality. In addition, novel analytic models of two non-standard caches, LRU skew and true-LRU victim caches, were derived. Next, the 100% miss-rate model of caching was introduced, which makes it possible to find how locality and caching interact using an empirical rather than an analytical model. This 100% miss-rate model allows any caching structure to be modeled. That empirical method was also used to help verify that the analytical models were correct. Other than where noted, this material has appeared previously only in our own work.

While the primary application of these models is directed toward the understanding of locality and caching, a number of applications were also presented. These applications of the model include being able to measure non-random conflict as well as the ability to generate artificial traces which can be tuned to have whatever locality properties are desired. It is our hope that the novel insight and applications produced in this work prove valuable to those who are involved in cache design.

Chapter 5

Optimal replacement is NP-hard for non-standard caches.

When examining a new cache structure or replacement policy, it is often helpful to use the optimal policy as a baseline. In this chapter it is shown that it is NP-hard to find the optimal schedule for any but the simplest of caching schemes. This is proven by a reduction from 3-Occ-Max-2SAT, a known NP-hard problem [12]. It is also shown that there is no polynomial-time algorithm which is guaranteed to find a "good" approximation to this problem unless P = NP.
These results apply to cache structures that include a victim cache or an assist cache, as well as skew caches and nearly all "multi-lateral" caches [70]. This result also applies to the optimal scheduling of many multi-level cache structures.

5.1 Introduction and Background

One of the greatest challenges of modern computer architecture is the disparity between processor and memory speeds. This disparity has resulted in attempts to eke the last bit of performance out of the entire memory hierarchy, especially caches. As discussed in Chapter 2, victim caches [45], assist caches [17], skew caches [60] and "smart" caches [31, 44, 70, 74, 78] have all been proposed, and some have also been implemented. However, a cache architect's toolbox has not kept up with these changes, making these caches more difficult to evaluate, understand and improve. In this chapter one of those tools, the optimal replacement algorithm, is examined. It is shown that for many caching structures, it is a tool cache designers and researchers will have to live without.

In 1966 Belady published a heavily-cited paper describing the optimal replacement policy for a cache [11]. Other, faster algorithms have been developed which also achieve the minimal miss rate [49, 67]. As these algorithms perform the same task and generate equivalent results, we refer to this class of optimal replacement algorithms as a single algorithm named MIN. MIN is important because it is used in comparative architectural studies. As was discussed in Chapter 4, MIN has been used in many performance studies [15, 68, 70, 73, 74, 78] and as part of a caching algorithm [78].

MIN is only applicable to "standard caches" such as direct-mapped, set-associative, and fully-associative caches. For many other, more complex, caching schemes the algorithm fails. In those caches it not only matters which block is evicted but also where the new block is placed, something MIN does not address. However, until now it has not been clear whether a computationally tractable optimal algorithm could be found for these more complex caches. The search for a tractable algorithm for optimal placement/replacement in a non-standard cache has existed for a number of years. Most recently, Tam et al. [70] published a paper where they could not find a computationally tractable algorithm for optimal placement for their cache design. Instead they made use of a tractable, but sub-optimal, offline algorithm to make the desired comparisons.

5.1.1 Companion caches

Every cache we are aware of has two major components: its structure and its replacement policy. The structure of a cache describes the physical properties of the cache (such as cache line size, number of cache lines, etc.) as well as the legal locations for a given block in the cache. The replacement policy describes when a block is placed into the cache, where that block is placed, and which block, if any, is removed to make room for it. For example, a standard "set-associative cache" uses a set-associative structure and an LRU replacement policy.

The term Companion Cache Structure (CCS) is now introduced. A CCS consists of two components: a set-associative or direct-mapped cache and a fully-associative cache. We refer to the direct-mapped/set-associative component as the "main cache" and the fully-associative component as the "companion buffer." Both components must be able to store any arbitrary cache block.
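For what follows, the only structural fact that matters is where a block is allowed to reside: in the one set of the main cache determined by its address, or in any slot of the companion buffer. The sketch below records just that placement rule; the index function and slot naming are illustrative assumptions rather than part of the definition.

```python
class CompanionCacheStructure:
    """A CCS: a set-associative (or direct-mapped) main cache plus a small
    fully-associative companion buffer. Either component may hold any block;
    the main cache restricts a block to a single set."""

    def __init__(self, num_sets, ways, buffer_size):
        self.num_sets, self.ways, self.buffer_size = num_sets, ways, buffer_size

    def legal_locations(self, block):
        """All slots in which 'block' may be stored: its set in the main
        cache, or any slot of the companion buffer."""
        s = block % self.num_sets                      # illustrative index function
        main = [('main', s, w) for w in range(self.ways)]
        buf = [('buffer', i) for i in range(self.buffer_size)]
        return main + buf

# A victim- or assist-cache style structure: direct-mapped main cache plus
# a small fully-associative buffer.
ccs = CompanionCacheStructure(num_sets=128, ways=1, buffer_size=4)
```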
A degenerate CCS is one in which the main cache is fully-associative or the companion buffer is of size zero. In the first case the CCS reduces to a fully-associative cache. In the second case the CCS is just a standard direct-mapped or set-associative cache.

Many of the non-standard caching schemes discussed in Chapter 2 have a CCS underlying them. Cache structures which include a victim cache or an assist cache with its associated main cache are exactly CCSs. Many of the smart caches discussed in Chapter 2 are also CCSs. In addition, skew caches and certain multi-level caches can also be thought of as CCSs. The latter two cases will be addressed in Theorems 5.3.7 and 5.3.8 respectively.

In this chapter it is proven that optimally scheduling a non-degenerate CCS is NP-hard with respect to the number of sets in the main cache. We assume that there are no placement restrictions other than those that are part of the CCS. For example, the victim caching scheme, as originally defined [45], consisted of both a structure (a direct-mapped cache and the victim cache itself) as well as a replacement policy.

5.1.2 Related terms

While CCSs encompass a large number of useful cache organizations, it is important to put CCSs in the context of other terms and concepts. Perhaps the most closely related structure to the CCS is the multi-lateral cache [70]. A multi-lateral cache consists of multiple cache structures which are conceptually on the same level of the memory hierarchy and which can store the same data. In general, multi-lateral caches contain an underlying CCS, although it is not always obvious.

In a similar way, a multi-level cache can also have an underlying CCS. Suppose the L1 cache is fully-associative and the L2 cache is direct-mapped. While these two caches may have very different access times, they can be thought of as a single CCS. Similarly, imagine that rather than being fully-associative, the L1 cache were instead set-associative. As explained in Theorem 5.3.8, this can be thought of as scheduling a number of smaller CCSs. This relies upon having a caching scheme where the L2 is not required to be inclusive of the L1 [9].

Two other terms which are important to the discussion are bypassing and internal reorganization. Standard caching schemes require that each block be placed in the cache when it is referenced. If this requirement is not in force, the cache is said to allow bypassing. Notice that disallowing bypassing places a restriction on the optimal algorithm and cannot improve the miss rate. Another variation is to allow a block to move from location to location without evicting it. We call this reorganization. Many modern caching schemes allow for this to some extent; in fact it is at the heart of the replacement schemes for both victim and assist caches as well as certain hash-rehash cache variants.¹ Notice that allowing reorganization cannot increase the miss rate of an optimal algorithm. Further note that reorganization does not affect the miss rate in a standard cache.

¹See Chapter 2 for details about these cache types.

5.1.3 Chapter outline

In Section 5.2 it is formally proven that the problem of managing a CCS in order to minimize the number of misses is NP-hard for a very specific instance. That proof is used as a starting point to generalize and expand the results. The specific instance used is a CCS where the companion buffer is of size one, the main cache is direct-mapped, bypassing is allowed, and reorganization is not.
This result is proven using a reduction from a known NP-hard problem, 3-Occ-Max-2SAT [12]. The expanded results, found in Section 5.3, include:

- The problem is still NP-hard if the companion buffer is of a size greater than one and/or the main cache is set-associative. (Theorems 5.3.1 and 5.3.2)

- The problem is NP-hard whether or not bypassing is allowed. (Theorem 5.3.3)

- Allowing the cache to freely reorganize does not bring the problem into the class P. (Theorem 5.3.4)

- A "good" approximation of OPT for a CCS cache is also NP-hard. (Theorems 5.3.5 and 5.3.6)

- Skew and certain multi-level caches are NP-hard to schedule optimally. (Theorems 5.3.7 and 5.3.8)

The net result of all of these proofs is that using optimal scheduling to evaluate replacement algorithms or cache structural changes is not viable for any but the simplest cache structures. This result is important so that cache researchers and designers do not "spin their wheels" looking for a polynomial-time optimal replacement algorithm. If their cache structure has an underlying CCS, they will have to live without that evaluation tool.

5.1.4 Problem definition and goal

Now the problem of offline cache scheduling is described more formally. An ordered list of memory references is sent to the cache. Each reference is to a particular block, and each block can be placed in some subset of the cache. In the offline environment it is known exactly in which order the references will occur. The goal is to minimize the number of misses which occur. A given instance of a caching problem consists of the ordered list of references and a mapping which describes where each reference can be placed in the cache.

5.2 Proofs

In this section the results are proven in three parts, each corresponding to a subsection. In subsection 5.2.1 the relationship between interval scheduling and cache scheduling is described. A formal proof is provided in subsection 5.2.2 for a very restricted problem: when the main cache is direct-mapped, the companion buffer is of size one, no internal reorganization is allowed, and bypassing is allowed. We call this restricted case the Fundamental Companion Cache Scheduling (FCCS) problem. In Section 5.3 the FCCS proof is used as the foundation for several broader conclusions. These results include showing that the general CCS problem is NP-hard and APX-complete.² Also, specific bounds are given for how well an algorithm can approximate the FCCS problem.

²A problem is APX-complete if it can be approximated within some bound in polynomial time but cannot be approximated to within some arbitrary constant in polynomial time.

The FCCS problem is formally defined as follows.

Definition 5.2.1. Fundamental Companion Cache Scheduling (FCCS):
INSTANCE: A direct-mapped cache, D, consisting of a set of locations D1, ..., Dn; a companion cache consisting of a single location C1; a set of memory items (cache lines) M; a mapping g : M -> D indicating in which location in the direct-mapped cache each memory item can be stored; a request sequence s over M; and an integer K.
QUESTION: Each item of sequence s is selected in order. If that item is currently found in D or C1, it is counted as a hit. The memory item is then either placed into the location in D indicated by g, placed into C1, or not placed in the cache at all. Can the memory items be legally placed in the cache such that at least K requests in the sequence s are hits?
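Note that a proposed schedule for Definition 5.2.1 is easy to check: walk the request sequence, look each item up in D and C1, apply the chosen placement, and count hits. The sketch below is a minimal verifier of that kind (the encoding of decisions as 'main', 'companion', or 'bypass' is an illustrative assumption); the fact that such a check runs in time linear in the sequence length is what later puts FCCS in NP.

```python
def verify_fccs_schedule(g, requests, decisions, K):
    """Check whether a proposed FCCS placement schedule achieves >= K hits.
    g:         dict mapping each memory item to its single legal slot in D
    requests:  the request sequence s
    decisions: 'main', 'companion', or 'bypass' for each request
    """
    D = {}        # slot -> item currently stored there
    C1 = None     # contents of the single companion location
    hits = 0
    for item, where in zip(requests, decisions):
        if D.get(g[item]) == item or C1 == item:
            hits += 1
        # keep at most one copy of the item in the cache before re-placing it
        if C1 == item:
            C1 = None
        if D.get(g[item]) == item:
            del D[g[item]]
        if where == 'main':
            D[g[item]] = item       # evicts the slot's previous occupant
        elif where == 'companion':
            C1 = item               # evicts the companion's previous occupant
        # 'bypass': the item is not placed in the cache at all
    return hits >= K
```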
5.2.1 Interval scheduling and caching

First, the relationship between cache scheduling and interval scheduling is described. Initially assume that the cache allows bypassing and does not allow internal reorganization (as defined above). When bypassing is allowed, it is clear that an item need only be put in the cache if it will remain in the cache until its next access. With this in mind, the cache scheduling problem on a cache of size k can be described as a restricted interval scheduling problem on k parallel machines.

In short, the interval scheduling problem is the following. There is a set of parallel machines and a set of intervals (jobs) that have a fixed starting time and a fixed ending time. A machine can only work on one interval at a time. An interval must be worked on continuously from its start time to its end time to be successfully processed. Each interval can only be processed by a given subset of the machines. The goal is to maximize the number of intervals processed.

The interval scheduling problem relates to caching as follows. The cache locations correspond to machines. Each memory access defines the start time of an interval. The end time of that interval is defined by the next access to that same memory item. This relationship is illustrated in Figure 5.1. Notice that the last access to a given memory item does not start a new interval. Also, notice that each interval corresponds to a possible cache hit (that is, the access is not a compulsory miss [39]). The subset of machines that can process an interval is determined by the set of cache locations in which the corresponding memory item can be placed. As stated earlier, when bypassing is allowed, placing an item in the cache is only profitable if the item will still be in the cache the next time it is accessed. Likewise, scheduling an interval on a machine is only profitable if the machine is not otherwise busy during that interval.

Figure 5.1: Relation between memory accesses and intervals. (Accesses shown: 101 102 101 103 102 103 101 103 102 101.)

If intervals can be placed on any machine, then interval scheduling is solvable in polynomial time [30]. In the restricted interval scheduling problem, each interval can only be placed on a subset of the machines. If the machines can be partitioned into subsets such that if an interval can be scheduled on a machine in one subset, it can be scheduled on all machines in that subset, and it cannot be scheduled on any machine not in that subset, the problem corresponds to a set-associative cache and is still solvable in polynomial time. However, the general case of restricted interval scheduling is NP-complete [7, 48].

In order to show that the FCCS problem is NP-complete, it is first shown that the Fundamental Companion Interval Scheduling problem (FCIS), a special case of restricted interval scheduling, is NP-complete. Then FCIS is reduced to FCCS.

5.2.2 Formal NP-hard proof for FCCS

Definition 5.2.2. Fundamental Companion Interval Scheduling (FCIS):
INSTANCE: A set G of m machines G1, ..., Gm; a "companion" machine C1; an integer K; and a set I of n intervals (si, fi, oi) where si in Z+ is the starting time of interval i, fi in Z+ is the finishing time of interval i, and oi in G is the machine on which interval i can be scheduled. Additionally, any interval can be scheduled on the companion machine.
QUESTION: Can at least K of the intervals be legally scheduled?

Theorem 5.2.1. FCIS is NP-complete.

Proof. 3-Occ-Max-2SAT is reduced to FCIS.
3-Occ-Max-2SAT is a restricted form of Max-2SAT where each variable occurs at most three times. More formally, 3-Occ-Max-2SAT consists of a set of variables, U, a set of clauses, C, and a value K. Each of the clauses contains exactly two variables, possibly in negated form. No variable occurs in more than three clauses. As noted above, this problem has been shown to be NP-complete [12].

Let U = {u1, u2, ..., ua}, C = {c1, c2, ..., cb} and K be an instance of 3-Occ-Max-2SAT. For each variable u in U create two machines, Mu and Mū, both elements of G. For notational reasons, the functions f(x) and h(x) are introduced. They are defined as f(ui) = i and h(ck) = k.

For each variable u in U create the following four intervals: (f(u), f(u)+1, Mu), (f(u), f(u)+1, Mū), (0, a+b, Mu), and (0, a+b, Mū).³ This results in 4a intervals. Label the first two intervals "short variable" intervals and the second two intervals "long variable" intervals. For each clause c in C create the following two intervals. If variable u appears in clause c in its positive form, create the interval (a+h(c), a+h(c)+1, Mu). If it appears in the negative form, create the interval (a+h(c), a+h(c)+1, Mū). Since there are two variables per clause, this results in 2b "clause" intervals. Thus, a total of 4a+2b intervals are created and 2a machines are created. Both are clearly polynomial in the original input size. An example instance is shown in Figure 5.2.

³An alternative notation would be to examine each variable ui and create intervals (i, i+1, Mui), (i, i+1, Mūi), (0, a+b, Mui), and (0, a+b, Mūi).

The intervals form an instance of FCIS. The intervals can be broken into three groups: the long variable intervals, the short variable intervals, and the clause intervals. The key observations about this reduction are that the variable intervals (both long and short) are used to enforce a variable assignment, and that pairs of schedulable clause intervals correspond to the satisfiable clauses.

Figure 5.2: Example reduction from 3-Occ-Max-2SAT. Notice that the short intervals on the left are the "short variable intervals" and the ones on the right are the "clause intervals."

Suppose one can satisfy K of the clauses in the 3-Occ-Max-2SAT instance. It will now be shown that 3a+b+K of the intervals can be scheduled. Consider each variable u in U. If u is true, machine Mū is used to schedule one of the long intervals corresponding to u, and machines Mu and C1 are used to schedule the two short intervals corresponding to u. If u is false, the roles of machines Mū and Mu are reversed. Exactly 3a of the variable intervals will be scheduled. Note that it is not possible to schedule more than 3a of the variable intervals, because for each variable there are 4 overlapping jobs and only 3 available machines.

Now consider each pair of clause intervals. If the corresponding clause c is satisfied in the 3-Occ-Max-2SAT problem, then one of the literals in c must be true. Without loss of generality, it can be assumed that this literal is l. Because l is true, the long interval that could be placed on Ml was not scheduled, and machine Ml is free. Therefore the pair of intervals that correspond to clause c can be scheduled, using machines Ml and C1.
Suppose one can satisfy K of the clauses in the 3-Occ-Max-2SAT instance. It will now be shown that 3a + b + K of the intervals can be scheduled. Consider each variable u ∈ U. If u is true, machine M_ū is used to schedule one of the long intervals corresponding to u, and machines M_u and C_1 are used to schedule the two short intervals corresponding to u. If u is false, the roles of machines M_u and M_ū are reversed. Exactly 3a of the variable intervals will be scheduled. Note that it is not possible to schedule more than 3a of the variable intervals, because for each variable there are 4 overlapping jobs and only 3 available machines.

Now consider each pair of clause intervals. If the corresponding clause c is satisfied in the 3-Occ-Max-2SAT problem, then one of the literals in c must be true. Without loss of generality, assume this literal is l. Because l is true, the long interval that could be placed on M_l was not scheduled, and machine M_l is free. Therefore the pair of intervals that correspond to clause c can be scheduled using machines M_l and C_1. If the clause is not satisfied, only one interval can be scheduled (on C_1). It is important to note that, for each u, the clause intervals scheduled on a machine in G use either M_u or M_ū but not both. A total of b + K of the clause intervals will be scheduled. Thus exactly 3a + b + K of the intervals have been scheduled.

The proof is only half done at this point. It has been shown that if K of the clauses in the 3-Occ-Max-2SAT problem can be satisfied, then 3a + b + K of the intervals can be scheduled. It now needs to be shown that if 3a + b + K intervals can be scheduled, that schedule can be used to find a solution to the corresponding 3-Occ-Max-2SAT problem in which K of the clauses are satisfied.

Suppose there exists a schedule S that schedules 3a + b + K intervals. We will create a modified schedule S′ that schedules at least 3a + b + K intervals and satisfies two additional constraints: no long variable interval is scheduled on the companion machine, and for each variable u, exactly one of its long variable intervals is scheduled.

S′ is constructed as follows. First modify S so that C_1 is not used to schedule any long variable interval. Suppose in schedule S that machine C_1 is used to schedule one of the long intervals for some variable u. In this case, either the corresponding short interval is not currently scheduled, or it is scheduled on its corresponding machine in G, M_u or M_ū. In the first case, modify S by having C_1 schedule the unscheduled short interval. In the second case, modify S by taking all intervals scheduled on the corresponding machine, M_u or M_ū, scheduling them on C_1, and scheduling the long interval on that machine. In either case, the modified schedule completes the same number of intervals as S does, while C_1 no longer schedules a long variable interval.

Now suppose that, for a variable u, the modified schedule uses both M_u and M_ū to schedule the corresponding long intervals. This must mean that at least one of the short intervals corresponding to u is not scheduled. Modify the schedule by scheduling one of the unscheduled short intervals on its corresponding machine in G, M_u or M_ū, and dropping the corresponding long interval. At this step there is now a schedule in which, for each variable u, at most one of its two long intervals is scheduled on its corresponding machine in G, M_u or M_ū.

The schedule is further modified so that, for each variable u, exactly one of the two long intervals for u is scheduled on its corresponding machine in G. Let u be a variable for which neither of its long intervals is scheduled. Consider the clause intervals. Because the variable u appears at most three times, it either appears in its positive form at most once, or it appears in its negative form at most once. It follows that there is at most one clause interval that can be scheduled on machine M_u, or there is at most one clause interval that can be scheduled on machine M_ū. Drop that single clause interval and use the freed machine to schedule the corresponding long interval for u. If this machine was currently used to schedule the corresponding short interval, modify the schedule so that C_1 schedules that short interval and the other machine, M_u or M_ū, schedules the other short interval. Call this final schedule S′. The schedule S′ schedules at least 3a + b + K intervals, since none of the modifications performed decreased the number of intervals scheduled.
For each variable, exactly one long variable interval is scheduled, and it is not on C_1. This defines a valid truth assignment for 3-Occ-Max-2SAT: if the long interval on M_u is not scheduled, then variable u is true; otherwise u is false. Since at most 3a variable intervals can be scheduled, at least b + K clause intervals are scheduled. Each clause has only two intervals, and at most one of them can be scheduled on C_1, so at least K of the clauses must have both of their intervals scheduled; each such clause therefore has an interval scheduled on a machine in G. A clause interval can be scheduled on its machine in G only because the corresponding long variable interval is unscheduled, which, under the above truth assignment, means its literal is true. Hence at least K clauses of the 3-Occ-Max-2SAT instance are satisfied by this truth assignment. □

Theorem 5.2.2. The Fundamental Companion Cache Scheduling (FCCS) problem is NP-complete.

Proof. FCIS is reduced to FCCS. For each machine in FCIS create a cache location in D. For each interval (s_i, f_i, o_i) create a memory item x_i and let g(x_i) = o_i. In other words, memory item x_i can be placed in the cache location that corresponds to the machine on which the interval (s_i, f_i) can be scheduled.

Sort the end points s_1, f_1, s_2, f_2, ..., s_n, f_n of the intervals. If two end points are equal, ties are broken in the following manner: finishing points come before starting points; otherwise, the end point from the lower-indexed interval comes first. A sequence of memory references is generated as follows: for each end point in the sorted list, reference the memory item that corresponds to that end point's interval. The intervals defined by the resulting sequence of memory references are equivalent to the original set of intervals. For every interval that can be scheduled, there will be one cache hit. Therefore, a schedule that schedules more than N intervals will also result in more than N cache hits.

It is trivial to show that FCCS ∈ NP, and thus NP-complete, simply by showing that a given proposed solution can be verified to legally schedule K jobs in polynomial time. □
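The reference-sequence generation used in the proof above can be written out directly. The sketch below assumes intervals are given as (start, finish, location) triples indexed from 1; it only encodes the two tie-breaking rules (finishing points before starting points, then lower interval index first) and emits one reference to memory item x_i per end point of interval i.

```python
def intervals_to_trace(intervals):
    """Turn FCIS intervals into the memory-reference sequence of the reduction."""
    endpoints = []
    for i, (s, f, _location) in enumerate(intervals, start=1):
        endpoints.append((f, 0, i))   # finishing point: kind 0 sorts first on ties
        endpoints.append((s, 1, i))   # starting point:  kind 1 sorts after a finish
    endpoints.sort()                  # by time, then finish-before-start, then index
    return [i for (_time, _kind, i) in endpoints]   # reference x_i at each end point
```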
5.3 Extending FCCS

The results of Section 5.2.2 apply only to the FCCS problem, a highly restricted instance of the general problem of optimally scheduling a CCS. This section broadens those results considerably. While it should not be surprising that instances with set-associative main caches and larger companion buffers are still NP-hard, outlines of the proofs are provided for the sake of completeness. It is also formally proven that allowing internal reorganization and/or disallowing bypassing does not affect the difficulty of the problem. Further, it is proven that these problems are all difficult to approximate closely, and some bounds for the FCCS problem are provided. Lastly, it is proven that skew caches and certain multi-level caches are also NP-hard to schedule optimally.

Theorem 5.3.1. If the main cache is k-way set-associative, the problem is still NP-complete.

Proof. The problem statement for FCIS in Definition 5.2.2 is used with a slight modification: each element of G is redefined to be a set of k machines rather than a single machine. An interval which can be scheduled on G_j may now use any of the (otherwise unscheduled) machines in G_j, or the companion machine. The construction is also modified: each of the short variable intervals, as well as each of the clause intervals, is replicated so that there are k copies of each. Also, to reduce confusion, the set of machines corresponding to the variable u_i is referred to as S_{u_i} rather than M_{u_i}.

Figure 5.3: Example construction for a 2-way set-associative main cache.

Figure 5.3 shows an example construction similar to that of Figure 5.2. Notice that exactly one of the variable intervals will still be unschedulable, since for each variable there are now 2(k + 1) variable intervals and only 2k + 1 machines on which they can be run. Also notice that if the long variable jobs corresponding to the literals of a given clause are both scheduled, one of the clause jobs corresponding to that clause will be unschedulable. Thus the same basic arguments of Theorem 5.2.1 still apply, and the conversion from an interval scheduling problem to a set-associative caching problem is only a simple variation of Theorem 5.2.2. □

Theorem 5.3.2. If the companion buffer is of size z > 1, the problem is still NP-complete.

Proof. The problem statement for FCIS in Definition 5.2.2 is used with a slight modification: the single companion machine C_1 becomes a set C of companion machines C_1, ..., C_z. As before, any job can be run on any of the machines in C.

Figure 5.4: Example construction for a companion buffer of size 3.

The construction used in Theorem 5.2.1 is used with slight modifications. A new machine, M_B ∈ G, is added. For each pair of short intervals, add z − 1 intervals with the same start and end times as that pair and which can run on machine M_B. Figure 5.4 illustrates the modified construction for a companion buffer of size 3. If all of these "B" intervals are scheduled, they will use the z − 1 extra machines of the companion buffer.

When creating the schedule S′ as described in Theorem 5.2.1, one additional step is added: after the long variable intervals have been fixed, any unscheduled B interval is scheduled if a free machine is available. If no such machine is available, it must be the case that both of the non-B short intervals are using machines from the set C; unschedule one of those intervals and let the unscheduled B interval use that machine. The remainder of the construction is identical. □

Theorem 5.3.3. If bypassing is not allowed in the cache, the problem is still NP-complete.

Proof. This proof uses nearly the same construction as Theorem 5.2.1. The only change is that the long intervals are lengthened so that they start at time −1 rather than time 0. Notice that this change has no impact on either of the theorems used to show that FCCS is NP-hard. All that is left is to prove that disallowing bypassing does not decrease the number of intervals which can be scheduled by an optimal algorithm. Clearly, disallowing bypassing cannot increase the optimal number of intervals which can be scheduled in any instance: any schedule that is legal in the no-bypassing case is also a legal schedule with bypassing.

Let us now define the following terms. I is an instance of the FCIS problem, and S^OPT_bypass(I) is an optimal schedule for I when bypassing is allowed. Further, S^OPT_bypass(I) is taken to be of the form of the schedule S′ described in Theorem 5.2.1.
That is, all of the short variable jobs are scheduled, one long variable job is left unscheduled for each variable, and some of the clause intervals may be left unscheduled. That theorem showed that such a schedule exists which is still optimal.

Notice that under the no-bypassing restriction, all of the short intervals must be scheduled. That is because the short intervals must be started, and no other interval starts while they run, so no interval can preempt them. The short intervals may, however, preempt the long intervals. In fact, the only impact of disallowing bypassing for any I is that all of the short intervals must be scheduled; as long as that rule is followed, the schedule is a legal non-bypassing schedule. The long intervals must all initially be scheduled, but may be preempted later, and thus count as unscheduled. This means that if S^OPT_bypass can be converted to a form where the long jobs are the only jobs left unscheduled, and the total number of jobs scheduled does not decrease, then there exists a schedule that does not require bypassing and still schedules exactly as many jobs.

A schedule S^OPT_nobypass is now constructed from the schedule S^OPT_bypass. Label the k unscheduled clause intervals in S^OPT_bypass as UCI_1, ..., UCI_k, and label the long variable interval which uses the same machine in G as UCI_j as LVI_j. UCI_j could be scheduled if LVI_j were unscheduled, because LVI_j is the only interval which can possibly use its machine in G at that time. Since LVI_j overlaps with at least all of the intervals that UCI_j overlaps with, it cannot decrease the number of schedulable jobs to schedule UCI_j and unschedule LVI_j. This can be done for each of the k unscheduled clause intervals and results in no fewer intervals being scheduled.

It has been shown that S^OPT_nobypass schedules exactly as many intervals as S^OPT_bypass. As finding S^OPT_bypass is NP-hard, finding S^OPT_nobypass must be NP-hard also. □

Theorem 5.3.4. If reorganization is freely allowed in the cache, the problem is still NP-complete.

Proof. The same construction as Theorem 5.2.1 is used once again. Notice that this problem is less restricted than the FCIS problem, so the optimal schedule for the reorganization variation can be no worse than the optimal non-reorganization schedule for the same instance. It will be shown that allowing reorganization does not improve the number of intervals which can be scheduled, and therefore FCIS where reorganization is allowed must also be NP-hard.

The fundamental observation is that given m intervals, all of which need to run during some given time, and only m − 1 machines on which to run them, one of the intervals must be left unscheduled; reordering does not help in that case. Another important observation is that once an interval has been left unscheduled, even for a moment, there is no benefit to scheduling it in the future. Using these facts, it will be shown that the proof of Theorem 5.2.1 holds for the reorganization case as well. First it is shown that no extra variable intervals can be scheduled due to reorganization; next it is shown that no extra clause intervals can be scheduled due to reorganization. As that covers all of the intervals in the instance, it is enough to show that reorganization never allows extra intervals to be scheduled in the optimal schedule for any FCIS instance.

As before, for each variable at least one of its variable intervals must be left unscheduled.
Again, leaving a short variable interval unscheduled can be no better than leaving a long variable interval unscheduled, so it can be assumed that one long variable interval is left unscheduled per variable. It is therefore clearly the case that, just before the first clause interval is scheduled, exactly as many intervals can be scheduled in the standard FCIS version as in the FCIS version where reorganization is allowed.

It is now shown that no extra clause intervals can be scheduled due to reorganization. Label each pair of clause intervals as (CI_l1, CI_l2), where CI_l1 can be assigned to machine M_l1 ∈ G and CI_l2 is defined similarly. Further, the term LVI_x will be used to mean the long variable interval which can be run on machine x ∈ G. Examine each pair of clause intervals in turn. As before, if LVI_l1 and/or LVI_l2 was left unscheduled, both CI_l1 and CI_l2 can be scheduled. If LVI_l1 and LVI_l2 are both scheduled, there exist four simultaneous intervals (CI_l1, CI_l2, LVI_l1 and LVI_l2) which can only be run on three machines (M_l1, M_l2, and C_1). Thus, as in Theorem 5.2.1, one of those four intervals cannot be scheduled.

The only issue which remains is to show that there is no advantage to leaving both long variable intervals associated with the same variable unscheduled. The argument is identical to the one presented in Theorem 5.2.1: if both long variable intervals are left unscheduled, the one for which there exists only one clause interval using its machine in G can be scheduled while that clause interval is unscheduled. Recall that this is due to the 3-occurrence restriction of 3-Occ-Max-2SAT. Therefore, the same number of intervals are scheduled by the optimal solution to FCIS with reorganization as are scheduled by standard FCIS. As FCIS is NP-complete, FCIS with reorganization must also be NP-complete. □

Theorem 5.3.5. Any algorithm which can guarantee to approximate the miss rate within 0.0124% of the optimal for FCIS must have an exponential running time if P ≠ NP.

Proof. Berman and Karpinski proved that it is NP-hard to approximate 3-Occ-Max-2SAT within 0.05% [12]. Specifically, they showed that there exist instances of 2016n clauses where it is NP-hard to decide whether the number of satisfiable clauses is (2011 + ε)n or (2012 + ε)n, for any ε in the range (0, 0.5). Utilizing the construction in Theorem 5.2.1, such an instance of 2016n clauses can be converted into an FCIS instance. Recall from the construction in Theorem 5.2.1 that 4a + 2b intervals are generated when the 3-Occ-Max-2SAT problem consists of a variables and b clauses. Because the instances generated in [12] use each variable exactly three times, there are exactly 1344n distinct variables. Therefore, there are exactly (4 · 1344 + 2 · 2016)n = 9408n intervals in the FCIS instance.

Again, recall from Theorem 5.2.1 that if K clauses in the 3-Occ-Max-2SAT problem can be satisfied, 3a + b + K intervals in the FCIS instance can be scheduled. For K = (2012 + ε)n and K = (2011 + ε)n, this means that (8060 + ε)n and (8059 + ε)n intervals can be scheduled, respectively. Clearly, as n gets large and ε can therefore approach 0, it is NP-hard to distinguish between 8060n and 8059n schedulable intervals. It follows that approximating FCIS within a factor of 8060/8059 is NP-hard, which is the same as being unable to approximate within 0.0124% of the optimal. □
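The arithmetic behind these counts is easy to check; the snippet below simply restates the numbers above (per unit of n, with the ε terms dropped).

```python
# Per unit of n: 2016 clauses, two literals per clause, each variable used exactly 3 times.
a = 2 * 2016 // 3              # 1344 variables
b = 2016
print(4 * a + 2 * b)           # 9408 intervals in the FCIS instance
print(3 * a + b + 2012)        # 8060 schedulable intervals when (2012+e)n clauses hold
print(3 * a + b + 2011)        # 8059 schedulable intervals when (2011+e)n clauses hold
print(8060 / 8059 - 1)         # ~0.000124, i.e. the 0.0124% gap
```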
It should be noted that the technique used in Theorem 5.3.6 could easily be used to compute a similar result; however, that generic technique yields a looser bound of 0.0079%.

Theorem 5.3.6. An algorithm which can guarantee to approximate the miss rate within 1.01% of the optimal for FCCS is non-trivial to find.

Proof. According to Berman and Karpinski [12], the best known polynomial-time approximation algorithm for 3-Occ-Max-2SAT is the same as that for Max-2SAT found by Feige and Goemans [27]: 1.0741. Clearly, as FCIS can be used to solve 3-Occ-Max-2SAT, this result could be improved if there were a sufficiently good approximation algorithm for FCIS. It is now shown that if there exists a 1.0101-approximation algorithm for FCIS (that is, one which can approximate FCIS within 1.01%), it would result in an improvement in the best known approximation algorithm for 3-Occ-Max-2SAT. As that problem has been fairly well studied, we are comfortable labeling such an improvement as non-trivial.

An L-reduction, as explained by Papadimitriou in [55], can be used to show how an inapproximability result for one problem can be carried over to another. To quote from various parts of pages 309-311 (Papadimitriou's non-standard use of ε has been replaced with the symbol δ; otherwise the quotes are verbatim):

  Suppose that A and B are optimization problems (maximization or minimization). An L-reduction from A to B is a pair of functions R and S, both computable in logarithmic space, with the following two additional properties: First, if x is an instance of A with optimum cost OPT(x), then R(x) is an instance of B with optimum cost that satisfies

    OPT(R(x)) ≤ α · OPT(x)    (5.1)

  where α is a positive constant. Second, if s is any feasible solution of R(x), then S(s) is a feasible solution of x such that

    |OPT(x) − c(S(s))| ≤ β |OPT(R(x)) − c(s)|    (5.2)

  where β is another positive constant particular to the reduction (and we use c to denote the cost in both instances).

Papadimitriou goes on to state:

  If there is an L-reduction (R, S) from A to B with constants α and β, and there is a polynomial-time δ-approximation algorithm for B, then there is a polynomial-time

    αβδ / (1 − δ)    (5.3)

  -approximation algorithm for A.

It should be noted that Papadimitriou uses the term x-approximation algorithm in a non-standard way. He defines an x-approximation algorithm as one for which

  x = |c(M(x)) − OPT(x)| / OPT(x),

rather than the more standard x = c(M(x)) / OPT(x). For the remainder of this section the term "x-approximation algorithm," where x is a real number, refers to the standard definition. The symbols δ and ε will be used for the inapproximability bound under Papadimitriou's scheme and the standard scheme, respectively. Simple algebra shows that

  δ = (ε − 1) / ε    (5.4)

and, equivalently,

  ε = 1 / (1 − δ).    (5.5)

Thus, if A is 3-Occ-Max-2SAT and B is FCIS, there is an ε-approximation for A of 1.0741. That ε-approximation is the same as a δ-approximation of about 0.069. It is now shown that if an ε-approximation algorithm of better than 1.0101 exists for B, an ε of less than 1.0741 also exists for A; this would be an improvement in the best-known algorithm for 3-Occ-Max-2SAT.

The value of β from Equation 5.2 is now found via a two-part argument. Let x be the 3-Occ-Max-2SAT instance, consisting of a variables and b clauses, let R(x) be the corresponding FCIS instance, and let s be a feasible solution of R(x). Let m be the optimal number of satisfiable clauses in x, so that OPT(x) = m and, by Theorem 5.2.1, OPT(R(x)) = 3a + b + m. First assume that s schedules at least 3a + b intervals, say 3a + b + c of them. Recall from Theorem 5.2.1 that if 3a + b + c intervals of the FCIS instance can be scheduled, at least c clauses of the 3-Occ-Max-2SAT instance can be satisfied. Thus |OPT(x) − c(S(s))| ≤ m − c and |OPT(R(x)) − c(s)| = (3a + b + m) − (3a + b + c) = m − c, and from Equation 5.2, β = 1. Now assume instead that s schedules fewer than 3a + b intervals. In that case, even if S(s) satisfies no clauses at all, Equation 5.2 still holds with β = 1, because |OPT(x) − c(S(s))| ≤ m < |OPT(R(x)) − c(s)|.

Computing α requires one additional fact: at least 3/4 of the clauses of any Max-2SAT instance can always be satisfied. Given that, if m is the optimal number of satisfiable clauses of the 3-Occ-Max-2SAT instance, there can be no more than 4m/3 clauses in the instance. Further, if a variable occurs only once, that literal can safely be set to true and the problem reduced in size by one variable and one clause, so it can be assumed that each variable occurs at least twice; since each clause contains two literals, this gives a ≤ b. If, as before, a is the number of variables and b the number of clauses, then a ≤ 4m/3 and b ≤ 4m/3. Thus the FCIS instance can schedule at most 3a + b + m ≤ 12m/3 + 4m/3 + 3m/3 = 19m/3 intervals. Plugging the appropriate values into Equation 5.1, α = 19/3.

Having α and β available, and using Equations 5.3, 5.4, and 5.5, it can be shown that if a better than 1.0101-approximation algorithm for FCIS were found, it would lead to an improvement in the best known approximation algorithm for 3-Occ-Max-2SAT. (This claim is easily verified; however, finding the value 1.0101 itself required some additional work.)
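As a rough numerical check of that claim (using the α = 19/3 and β = 1 derived above together with Equations 5.3-5.5; the rounding is mine, and the threshold 1.0101 is taken from the text):

```python
alpha, beta = 19 / 3, 1.0
eps_B = 1.0101                                    # hypothesised FCIS approximation ratio
delta_B = (eps_B - 1) / eps_B                     # Papadimitriou-style error, ~0.01   (5.4)
delta_A = alpha * beta * delta_B / (1 - delta_B)  # error carried to 3-Occ-Max-2SAT    (5.3)
eps_A = 1 / (1 - delta_A)                         # back to a standard ratio           (5.5)
print(delta_B, delta_A, eps_A)                    # ~0.01, ~0.064, ~1.068 < 1.0741
```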
This does not mean that there is no polynomial-time algorithm which can guarantee a result better than 1.01%. However, such a result would constitute an improvement over the best known algorithm for a fairly well-studied problem. We are comfortable labeling such an approximation algorithm "non-trivial." □

Theorem 5.3.7. Finding the optimal schedule for a skew cache is NP-hard.

Proof. This is proven by restriction [28]. Notice that it is possible to have an access sequence in which all of the accesses to one of the banks of the skew cache map to the same location of that bank. In that restricted case the problem is exactly the same as FCCS. Thus, if an algorithm existed which could optimally schedule a skew cache, that same algorithm could be used to schedule an FCCS instance. Since FCCS is NP-hard, skew-cache scheduling is also NP-hard. □

Theorem 5.3.8. It is NP-hard to optimally schedule a multi-level cache as a single unit where exclusion is allowed.

Proof. This is also a proof by restriction [28]. Assume a two-level cache hierarchy where the L1 cache holds k1 blocks grouped into s1 sets, and define k2 and s2 in the same way for the L2 cache. Without loss of generality, assume that s2 > s1 (with only slight modifications, the same argument can be used if s1 > s2). Table 5.1 lists the values of s2/s1 for a few selected processors; values near 8 or 16 are most common.

Table 5.1: Ratio of the number of sets in the L2 cache to the number of sets in the L1 data cache for selected processors.

  Processor                      s2/s1
  Intel P6 w/ 512KB L2 cache       64
  Intel Celeron (Coppermine)        8
  Compaq 21364                      8
  AMD 1GHz Athlon                 0.5
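For a concrete sense of where such ratios come from, consider a hypothetical two-level configuration (not one of the processors in Table 5.1): a 32 KB, 4-way L1 and a 1 MB, 8-way L2, both with 64-byte blocks.

```python
block = 64
s1 = (32 * 1024) // (block * 4)     # 32 KB, 4-way L1  -> 128 sets
s2 = (1024 * 1024) // (block * 8)   # 1 MB, 8-way L2   -> 2048 sets
print(s1, s2, s2 // s1)             # 128 2048 16, in line with "values near 8 or 16"
```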
Consider the set of blocks, B_i, which map into the i-th set of the L1 cache. Under all reasonable indexing functions (see Chapter 2 for a discussion of indexing functions), the elements of B_i will also map into at least s2/s1 different sets of the L2 cache. Now notice that optimally scheduling the accesses to the blocks in B_i is exactly the same as scheduling a companion cache in which the companion buffer has size k1/s1 and the main cache consists of at least s2/s1 sets, each of size k2/s2.

It is worth noting that for small values of s2/s1 the problem may be computationally tractable. In fact, for a fixed value of s2/s1 the problem technically requires only polynomial running time. However, the large values of s2/s1 found in real processors, combined with very long address traces (usually more than 1,000,000 addresses), would seem to make the problem intractable in practice. □

5.4 Conclusion

It has been shown that an important tool for cache studies, the optimal replacement policy, is NP-hard to compute for many non-standard caches. It has also been shown that this result applies to the optimal scheduling of multi-level caches where the caches are kept exclusive. Additionally, it was proven that a good approximation to these problems is also NP-hard to compute. While these results will not help anyone build a faster cache, researchers who are aware of them will not waste their time in a futile attempt to find efficient algorithms for these problems; unless P = NP, such algorithms simply do not exist.

Bibliography

[1] A. Agarwal, J. Hennessy, and M. Horowitz. Cache performance of operating systems and multiprogramming. ACM Transactions on Computer Systems, 6(4):393-431, November 1988.
[2] A. Agarwal, J. Hennessy, and M. Horowitz. An analytical cache model. ACM Transactions on Computer Systems, 7(2), May 1989.
[3] A. Agarwal and S. D. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proc. of the 20th Int'l Symposium on Computer Architecture, pages 179-190, 1993.
[4] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. of the 27th Int'l Symposium on Computer Architecture, June 2000.
[5] E. Anderson, P. V. Vleet, L. Brown, J-L. Baer, and A. R. Karlin. On the performance potential of dynamic cache line sizes. Technical Report UW-CSE-99-02-01, Dept. of Computer Science and Engineering, University of Washington, February 1999.
[6] J. Archibald and J-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, November 1986.
[7] E. M. Arkin and E. B. Silverberg. Scheduling jobs with fixed start and end times. Discrete Applied Mathematics, 18:1-8, 1987.
[8] O. Babaoglu. Efficient generation of memory reference strings based on the LRU stack model of program behaviour. In PERFORMANCE '81, pages 373-383, 1981.
[9] J-L. Baer and W-H. Wang. On the inclusion properties of multi-level cache hierarchies. In Proc. of the 15th Int'l Symposium on Computer Architecture, June 1988.
[10] A. P. Batson and A. W. Madison. Characteristics of program locality. Communications of the ACM, 19(5):285-294, May 1976.
[11] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):282-288, 1966.
[12] P. Berman and M. Karpinski. On some tighter inapproximability results. Technical Report 99-23, DIMACS, 1999.
[13] M. Brehob and R. J. Enbody. An analytical model of locality and caching. Technical Report MSUCPS:TR99-31, Michigan State University, Department of Computer Science and Engineering, 1999.
[14] M. Brehob, R. J. Enbody, and N. Wade. Analysis and replacement for skew-associative caches. Technical Report MSUCPS:TR97-32, Michigan State University, Department of Computer Science and Engineering, 1997.
[15] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of future microprocessors. In Proc. of the 23rd Annual Int'l Symposium on Computer Architecture, pages 78-89, 1996.
[16] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proc. of the 2nd Int'l Symposium on High Performance Computer Architecture, pages 244-253, 1996.
[17] K. K. Chan, C. C. Hay, J. R. Keller, G. P. Kurpanek, F. X. Schumacher, and J. Zheng. Design of the HP PA 7200. Hewlett-Packard Journal, February 1996.
[18] J. H. Chang, H. Chao, and K. So. Cache design of a sub-micron CMOS System/370. In Proc. of the 14th Int'l Symposium on Computer Architecture, pages 179-190, 1987.
[19] C. K. Chow. An optimization of storage hierarchies. IBM Journal of Research and Development, 18(3):194-203, May 1974.
[20] C. K. Chow. Determination of cache's capacity and its matching storage hierarchy. IEEE Transactions on Computers, C-25(2):157-164, February 1976.
[21] Standard Performance Evaluation Corporation. SPEC benchmarks. World Wide Web.
[22] D. E. Culler. Caches. World Wide Web, February 2002.
[23] J. D. Gee, M. D. Hill, D. N. Pnevmatikatos, and A. J. Smith. Cache performance of the SPEC92 benchmark suite. IEEE Micro, 13(4):17-27, August 1993.
[24] P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5), May 1968.
[25] S. J. E. Wilton and N. P. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677-688, May 1996.
[26] E. G. Coffman and P. J. Denning. Operating Systems Theory. Prentice-Hall, 1973.
[27] U. Feige and M. Goemans. Approximating the value of two prover proof systems, with applications to MAX 2SAT and MAX DICUT. In Proc. of the 3rd Israel Symposium on the Theory of Computing and Systems, pages 182-189, 1995.
[28] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[29] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proc. of the 11th ACM Int'l Conference on Supercomputing, July 1997.
[30] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York, 1980.
[31] A. Gonzalez, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In Proc. of the 1995 Conference on Supercomputing, pages 338-347, 1995.
[32] K. Grimsrud. Quantifying Locality. Doctoral dissertation, Brigham Young University, 1993.
[33] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson. On the accuracy of memory reference models. Lecture Notes in Computer Science, 794:369-388, 1994.
[34] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson. Locality as a visualization tool. IEEE Transactions on Computers, 45:1319-1325, November 1996.
[35] M. Hachman. IBM and Intel chip vendors showcase road maps at Microprocessor Forum. BYTE, October 1999.
[36] E. Hallnor and S. K. Reinhardt. A fully associative software-managed cache design. In Proc. of the 27th Int'l Symposium on Computer Architecture, June 2000.
[37] J. Handy. The Cache Memory Book. Academic Press, second edition, 1998.
[38] J. P. Hayes. Computer Architecture and Organization. McGraw-Hill, 1988.
[39] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, third edition, 2002.
[40] M. D. Hill. A case for direct-mapped caches. IEEE Computer, pages 25-40, December 1988.
[41] M. D. Hill. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1630, December 1989.
[42] Micron Technology Inc. 1997 DRAM Data Book. Micron Technology Inc., Boise, Idaho, 1997.
[43] B. Jacob, P. Chen, S. Silverman, and T. Mudge. An analytical model for designing memory hierarchies. IEEE Transactions on Computers, 45(10), October 1996.
[44] T. L. Johnson and W-M. Hwu. Run-time adaptive cache hierarchy management via reference analysis. In Proc. of the 24th Annual Int'l Symposium on Computer Architecture, volume 25(2) of Computer Architecture News, pages 315-326, New York, June 1997. ACM Press.
[45] N. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of the 17th Annual Int'l Symposium on Computer Architecture, pages 364-373, May 1990.
[46] S. F. Kaplan, Y. Smaragdakis, and P. R. Wilson. Trace reduction for virtual memory simulations. In Proc. of ACM SIGMETRICS, May 1999.
[47] R. R. Kessler et al. Inexpensive implementations of set-associativity. In Proc. of the 16th Int'l Symposium on Computer Architecture, pages 131-139, 1989.
[48] A. W. J. Kolen and J. G. Kroon. On the computational complexity of (maximum) class scheduling. European Journal of Operational Research, 54:23-38, 1991.
[49] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2), 1970.
[50] S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and evaluation of dynamic access ordering hardware. In Proc. of the 10th ACM Int'l Conference on Supercomputing, May 1996.
[51] S. A. McKee, R. H. Klenke, K. L. Wright, W. A. Wulf, M. H. Salinas, J. H. Aylor, and A. P. Batson. Smarter memory: Improving bandwidth for streamed references. IEEE Computer, July 1998.
[52] S. A. McKee, W. A. Wulf, J. H. Aylor, R. H. Klenke, S. I. Salinas, M. H. Hong, and D. A. B. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, to appear.
[53] K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proc. of the 7th Int'l Conference on Architectural Support for Programming Languages and Operating Systems, pages 94-104, Cambridge, Massachusetts, October 1996. ACM Press.
[54] G. D. McNiven and E. S. Davidson. Analysis of memory referencing behavior for design of local memories. In H. J. Siegel, editor, Proc. of the 15th Int'l Symposium on Computer Architecture, pages 56-63, Honolulu, Hawaii, May 1988. IEEE Computer Society Press.
[55] C. H. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity classes. In Proc. of the 20th ACM Symposium on Theory of Computing, pages 229-234, 1988.
[56] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Mateo, CA, second edition, 1997.
[57] S. A. Przybylski. Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann, 1990.
[58] R. Rau. Properties and applications of the least-recently-used stack model. Technical Report 139, Dept. of Electrical Engineering, Stanford University, May 1977.
[59] U. L. Rhodes and W. Wolf. Unbalanced cache systems. In Proc. of the IEEE Int'l Workshop on Memory Technology, Design and Testing, pages 16-23, 1999.
[60] A. Seznec. A case for two-way skewed-associative caches. In Proc. of the 20th Int'l Symposium on Computer Architecture, pages 169-178, 1993.
[61] A. Seznec and F. Bodin. Skewed-associative caches. In Proc. of PARLE '93, 1993.
[62] A. Seznec and F. Bodin. Skewed associativity enhances performance predictability. In Proc. of the 22nd Int'l Symposium on Computer Architecture, pages 256-275, 1995.
[63] A. Silberschatz, J. Peterson, and P. Galvin. Operating System Concepts. Addison-Wesley, third edition, 1991.
[64] A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, September 1982.
[65] A. J. Smith. A comparative study of set associative memory mapping algorithms and their use for cache and main memory. IEEE Transactions on Software Engineering, 4(2):121-130, March 1978.
[66] H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1990.
[67] R. A. Sugumar. Multi-Configuration Simulation Algorithms for the Evaluation of Computer Architecture Designs. Doctoral dissertation, University of Michigan, 1993.
[68] R. A. Sugumar and S. G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. In Proc. of ACM SIGMETRICS, May 1993.
[69] E. S. Tam, J. A. Rivers, V. Srinivasan, G. S. Tyson, and E. S. Davidson. Evaluating the performance of active cache management schemes. In Proc. of the 1998 IEEE Int'l Conference on Computer Design, pages 368-375, October 1998.
[70] E. S. Tam, J. A. Rivers, V. Srinivasan, G. S. Tyson, and E. S. Davidson. Active management of data caches by exploiting reuse information. IEEE Transactions on Computers, 48(11), November 1999.
[71] D. Thiebaut. On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio. IEEE Transactions on Computers, 38(7), July 1989.
[72] E. Torng. A unified analysis of paging and caching. Algorithmica, 20:175-200, 1998.
[73] D. Truong, F. Bodin, and A. Seznec. Accurate data layout into blocks may boost cache performance. In Interact-2, February 1997.
[74] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In Proc. of the 28th Annual Int'l Symposium on Microarchitecture, pages 93-103, Ann Arbor, Michigan, 1995. IEEE Computer Society TC-MICRO and ACM SIGMICRO.
[75] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: A survey. ACM Computing Surveys, 29(2):128-170, June 1997.
[76] D. A. B. Weikle, S. A. McKee, and W. A. Wulf. Caches as filters: A new approach to cache analysis. In Proc. of the 6th Int'l Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, July 1998.
[77] D. A. B. Weikle, K. Skadron, S. A. McKee, and W. A. Wulf. Caches as filters: A unifying model for memory hierarchy analysis. Technical Report CS-2000-16, Dept. of Computer Science, University of Virginia, June 2000.
[78] W. A. Wong and J-L. Baer. Modified LRU policies for improving second-level cache behavior. In Proc. of the 6th Int'l Symposium on High-Performance Computer Architecture, January 2000.
[79] C. E. Wu, Y. Hsu, and Y-H. Liu. A quantitative evaluation of cache types for high-performance computer systems. IEEE Transactions on Computers, 42(10), October 1993.
[80] W. A. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23:20-24, 1995.
[81] Q. Yang and S. Adina. A one's complement cache memory. In Proc. of the 23rd Int'l Conference on Parallel Processing, Volume 1: Architecture, pages 250-257, Boca Raton, FL, USA, August 1994. CRC Press.
[82] C. Zhang, X. Zhang, and Y. Yan. Two fast and high-associativity cache schemes. IEEE Micro, 17(5):40-49, September 1997.