ON THE MATHEMATICS OF CACHING

By

Mark W. Brehob

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2003

ABSTRACT

ON THE MATHEMATICS OF CACHING

By

Mark W. Brehob

Computer circuits get faster at a much more rapid rate than memory does. In order for historical computer performance gains to continue, memory latency and bandwidth need to continue to improve. The most common way to provide this performance is to use small but very fast memory devices, called caches, to store the most commonly used data. Cache performance relies upon locality: the predictable way in which processors tend to access memory elements they have accessed in the past, or elements near those previous accesses.

Of fundamental importance to improving a complex device is understanding how that device functions. This work contributes to that understanding in two ways. First, it expands upon an old and mostly unused model of locality called 'stack distance' and shows how that locality interacts with various types of caches, including victim and skew caches. Second, a negative result is provided showing that it is computationally intractable to find an optimal schedule for most non-standard caches. This result means that a very useful tool for cache evaluation is not available when working with these non-standard caches. It also means that those working on finding a tractable optimal algorithm for these caches can stop; they will not succeed.

© Copyright 2003 by Mark W. Brehob

All Rights Reserved

Dedicated to my daughter, Kira Nerys Brehob, and her mother, Katrina Marie Brehob.

ACKNOWLEDGMENTS

This dissertation has been many years in the coming, and many, many people have been more supportive than I could possibly deserve. A short list of those people includes Suresh Chittor, Wen-Hann Wang, and Nick Wade of Intel, as well as Eric Torng, Stephen Wagner, Uthaisombut Patchrawat, Barbara Bircher Gannod, Travis Doom, Jennifer White Doom, Hugh Smith, and Charles Severance of Michigan State University. My in-laws, parents, and sister have stood behind me the whole way. My adviser, Richard Enbody, has been remarkable in his support and encouragement. My wife has sacrificed her time and energy beyond anything I could ever repay.

TABLE OF CONTENTS

LIST OF TABLES viii LIST OF FIGURES ix 1 Introduction and overview 1 1.1 Overview ....................................... 3 1.2 On Terminology used in this document ....................... 4 2 On Caching and Non-standard Caches 5 2.1 Introduction to Processor Caching ......................... 6 2.1.1 The structure and organization of a standard cache .............. 7 2.1.2 Address mapping ..................................
9 2.1.3 Replacement policies ................................ 11 2.1.4 Modern implementations ............................. 11 2.1.5 Caching metrics .................................. 13 2.1.6 Locality of reference ................................ 14 2.2 Non-standard caches ................................. 15 2.2.1 Hash caches ..................................... 16 2.2.2 Hash-rehash caches ................................. 18 2.2.3 Victim caches .................................... 21 2.2.4 Skew caches ..................................... 22 2.2.5 Smart caches .................................... 23 2.2.6 Further reading ................................... 29 3 On the Measurement of Locality 30 3.1 Locality measures .................................. 30 3.1.1 What is locality? .................................. 31 3.1.2 Locality measures in the literature ........................ 32 3.2 Macroscopic locality measures ............................ 32 3.2.1 Chow’s empirical power law ............................ 33 3.2.2 The Fractal Model of Locality .......................... 34 3.2.3 Agarwal’s analytic cache model .......................... 35 3.2.4 Grimsrud’s locality metric ............................. 36 3.2.5 Cache based locality ................................ 37 3.3 Microscopic locality measures ............................ 40 3.3.1 Instantaneous locality ............................... 40 3.3.2 Loop—based locality metrics ............................ 42 3.4 Stack distance .................................... 44 3.4.1 Stack-distance distribution functions ....................... 45 3.4.2 Stack distance in the literature .......................... 46 3.4.3 Algorithmic use of the stack model ........................ 48 vi 3.4.4 Trace compression ................................. 49 3.4.5 Conclusion ..................................... 50 4 Stack distance as a locality measure 51 4.1 The stack distance model of locality revisited ................... 53 4.1.1 LRU stack distance revisited ........................... 54 4.1.2 Derivation of the cache model for standard caches ............... 56 4.2 Validation of the cache model ............................ 61 4.2.1 A comparison of actual hit rates to predicted values .............. 65 4.3 Expansion of the model to non-standard caches .................. 69 4.3.1 An analytic LRU skew-cache model ....................... 69 4.3.2 An analytic LRU victim cache model ...................... 75 4.3.3 Verification and the 100% miss-rate model .‘ .................. 80 4.4 Applications of the Models ............................. 83 4.4.1 Applications of the locality model ........................ 85 4.4.2 Applications of the cache model ......................... 90 4.4.3 Examining non-standard caches ......................... 95 4.5 Future work ...................................... 100 4.6 Conclusion ...................................... 102 5 Optimal replacement is NP-hard for non-standard caches. 104 5.1 Introduction and Background ............................ 104 5.1.1 Companion caches ................................. 106 5.1.2 Related terms ................................... 107 5.1.3 Chapter outline ................................... 108 5.1.4 Problem definition and goal ............................ 109 5.2 Proofs ......................................... 109 5.2.1 Interval scheduling and caching .......................... 110 5.2.2 Formal NP-hard proof for FCCS ......................... 
112 5.3 Extending FCCS ................................... 117 5.4 Conclusion ...................................... 128 BIBLIOGRAPHY 129

LIST OF TABLES

3.1 A classification of misses using the optimal replacement policy. The arguments to the function M() are line size, cache size, associativity and replacement policy, in that order. L is the line size of the base cache, C is its cache size, and k is the degree of associativity. 39
4.1 Symbols used in describing how caches interact with locality. 57
4.2 Hit rates and predicted hit rates for the SPECfp2000 benchmarks. 66
4.3 Actual and predicted hit rates for SPECint2000 benchmarks. 67
4.4 Prediction errors grouped by cache configuration for the SPECfp2000 benchmarks. 67
4.5 Prediction errors grouped by cache configuration for the SPECint2000 benchmarks. 67
4.6 Prediction errors grouped by cache configuration for the SPECint2000 benchmarks. 68
4.7 Transition probabilities for a two-way associative skew cache. 73
4.8 Measurement of excessive conflict. 94
4.9 A comparison of replacement schemes for skew caches. Hit rate improvement is in terms of absolute improvement over the one-bit replacement scheme. 98
4.10 A comparison of replacement schemes for victim caches. Hit rate improvement is in terms of absolute improvement. 100
5.1 Ratio of the number of sets in the L2 cache over the number in the L1 data cache for selected processors. 128

LIST OF FIGURES

2.1 A simple caching structure.
2.2 A cache with data and tags.
2.3 The three types of standard caches. If X is some address, this figure shows the different locations its cache block could be placed.
2.4 How an address is used by the cache. B is block size and S is set size. Both are assumed to be greater than 1.
2.5 The Harvard architecture.
2.6 An example of how the set number might be determined in a hash cache.
2.7 A model of a 2-way set-associative cache. [60]
2.8 A 2-way skewed-associative cache. [60]
3.1 Illustration of the stack distance of a reference stream.
3.2 A graph of the locality found in the reference stream of Figure 3.1.
4.1 Cumulative stack distance for three floating-point benchmark traces.
4.2 Average of the locality found in the SpecInt and SpecFP benchmarks.
4.3 Illustration of the theoretic relationship between hit rates, associativity, and the locality of a given memory reference.
4.4 Stack distance hit-rates for SpecInt on a 4KB direct-mapped cache.
4.5 Stack distance hit-rates for SpecFP on a 4KB direct-mapped cache.
4.6 Stack distance hit-rates for SpecInt on a 4KB two-way associative cache.
4.7 Stack distance hit-rates for SpecFP on a 4KB two-way associative cache.
4.8 Example showing how locality interacts differently with a skew cache than a standard cache.
4.9 Predicted hit rate for a skew cache and a standard 2-way associative cache. Both are 128 lines in size.
4.10 A magnified look at the 'flaws' in the skew cache analytic model.
4.11 Predicted hit rate at each stack distance for the LRU victim cache and its components.
4.12 A magnified version of the upper-left hand corner of Figure 4.11.
4.13 Error rate of the 100% miss rate model when predicting the hit rates for a victim cache.
4.14 Difference between the skew cache analytic model and the same cache using the 100% miss-rate model. Two different analytic models are compared.
4.15 Impact of cache filtering on the gcc benchmark and the impact of writebacks.
4.16 Impact of associativity on cache filtering.
4.17 Percent of accesses in a streak of size N or greater.
4.18 Impact of associativity on cache filtering.
4.19 Impact of associativity on cache filtering.
4.20 Illustration of the impact of the replacement scheme on performance for skew caches. The direct-mapped cache is provided as a reference and all caches consist of 128 lines.
4.21 Illustration of the impact of the replacement scheme on performance for victim caches.
4.22 Illustration of the impact of locality on the hit rate when using a 'separate LRU' replacement scheme. This graph also shows the impact of using 5-bit time stamps.
5.1 Relation between memory accesses and intervals.
5.2 Example reduction from 3-Occ-Max-2SAT.
5.3 Example construction for a 2-way set-associative main cache.
5.4 Example construction for a companion buffer of size 3.

Chapter 1

Introduction and overview

Modern society has come to rely upon the computer and its ever-increasing performance gains. In order to meet future expectations, both of society in general and of the computer industry, faster and better computers need to be designed and built. The computer industry, citing a variant of Moore's Law [39], has traditionally expected that processor performance will improve by 60% per year (this is equivalent to the more commonly stated "double in performance every 18 months"). Of course, there are many roadblocks on the path to continuing at this rate of exponential performance improvement. These roadblocks include physical limitations of the devices being built, the difficulty of designing extremely large and complex processors, and the amount of heat which must be removed from the processor.

One of the more immediate roadblocks is the relatively slow rate of improvement in the speed of memory devices such as DRAM. This looming "memory wall" [80] stands squarely in the path of continued processor improvement. DRAM process improvements have historically generated a 7% improvement in latency per year. One professor at Berkeley jokingly contrasted this much slower rate of improvement with Moore's Law by calling it "Less' Law" [22].

The most common way to address this disparity is to use small but very fast memory devices to store the most commonly used data. Those devices are called caches. Of course, not all of the needed data can be kept in these small caches, so some of the memory accesses must go to the DRAM. Those DRAM accesses are said to have been cache misses. Traditional methods for caching data are reasonably effective, but the widening gap between memory performance and expected processor performance has made further decreasing the number of cache misses an important goal. As such, many novel caching techniques have been proposed in the last ten years.
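To give a rough sense of scale, the short sketch below (not from the original dissertation) simply compounds the growth rates quoted above, 60% per year for processors and 7% per year for DRAM latency; the function name and the ten-year horizon are illustrative assumptions.

# A minimal sketch, assuming the 60%/year processor and 7%/year DRAM
# improvement rates quoted in the text, compounded to show how quickly
# the processor-memory gap widens.
def relative_gap(years, cpu_rate=0.60, dram_rate=0.07):
    cpu = (1 + cpu_rate) ** years     # relative processor performance
    dram = (1 + dram_rate) ** years   # relative DRAM (latency) improvement
    return cpu / dram                 # how much wider the gap has become

for years in (1, 5, 10):
    print(years, round(relative_gap(years), 1))
# Prints roughly 1.5, 7.5, and 55.9: about a 50% widening every year.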
One difficulty with this plethora of ideas is that they are very difficult to evaluate. Certainly one can compare them in an environment where the workload is held fixed and "similar" sized caches are compared. (In fact we have done such a study for a number of these techniques [14].) However, such studies give very little insight into why these caches perform well, something very important to anyone who proposes to use them in a commercial processor. The focus of this dissertation is to examine what can and cannot be done to gain that needed insight. The work in this dissertation provides some insight and results about how a cache interacts with the pattern of memory requests made by the processor. The two main contributions are:

- An old technique, LRU stack distance, is greatly expanded upon. It is used to evaluate locality and to understand how locality and caches interact.

- An important evaluation technique, the optimal replacement algorithm, is shown to be NP-hard to compute for a large and important class of caches. This means that optimal replacement is a tool that designers of those caches, which include the "newest" cache designs, will not be able to utilize.

1.1 Overview

As with most scientific endeavors, this work would be meaningless without the context of others' efforts. As such, Chapters 2 and 3 summarize that context. In Chapter 2 a short tutorial on caching is presented along with a discussion concerning a number of non-standard caching techniques. The tutorial mostly consists of material one would find in a standard undergraduate class on the topic. However, the non-standard cache discussion should be novel to most readers.

Chapter 3 is a discussion of the memory reference stream of a processor. The patterns in that stream (called locality of reference) are what allow small caches to perform as well as they do. The various techniques in the literature for quantifying and taking advantage of the nature and patterns of these reference streams are examined. At the end of that chapter the stack distance model, which is used throughout Chapter 4, is introduced.

While bits and pieces of original work are scattered throughout the early chapters, it is Chapters 4 and 5 that contain the original work found in this thesis. Chapter 4 discusses how the stack distance model can be used to visualize and measure the interaction of locality and caches. Specifically, that model is validated more extensively than before, new applications are found, and the model is expanded to work with non-standard caches.

Chapter 5 presents a formal proof showing that it is NP-hard to compute the optimal replacement policy of certain non-standard caches. This result holds for skew caches, generalized victim and assist caches, and in fact nearly all multi-lateral cache schemes. Further, various results are given which show that it is also NP-hard to closely estimate the number of misses an optimal replacement algorithm would cause on these caches.

1.2 On Terminology used in this document

In the computer profession, as in most professions, terminology is not always as standardized as its practitioners might like. In some cases this extends beyond terminology and into differing standards and assumptions. Below is an attempt to address those terms that might be unclear.

- Throughout this dissertation, the function lg(x) will be used in place of log₂(x), as is common in many Computer Science texts.

- A consistent big-endian [39] notation shall be used throughout this work.
- The SPEC CPU2000 benchmark suite [21] is referred to throughout this document as the 'SPEC' benchmark suite. The integer and floating-point parts of the benchmark are referred to as SPECint and SPECfp.

Chapter 2

On Caching and Non-standard Caches

This chapter serves two purposes and as such is broken into two sections. The purpose of the first section is to introduce the fundamentals of processor caching. This section does not, and cannot, provide a complete overview of the topic. Rather, its purpose is to provide a basic understanding of terms that will be used in later chapters. The experienced computer architect will find the vast majority of the material in that first section to be review, while others may find the need to refer to computer architecture books such as [39, 56, 66] in order to provide needed context.

The second section of this chapter focuses on non-standard caching schemes. This material is significantly more obscure than the material of the first section1. Because Chapters 4 and 5 assume a working knowledge of this material, it is included in this chapter along with the more basic material of section 2.1.

1 In fact, only one paper has been found which provides an overview of even a few of these caching schemes [82].

2.1 Introduction to Processor Caching

As a processor executes instructions it needs information from the memory system, where both data and the instructions themselves are stored. However, the main memory is usually very slow. In the time it takes to fetch one piece of information, a modern processor could have executed many hundreds of instructions. Clearly this makes the memory system a significant performance bottleneck.

While there have been a large number of methods used to cope with this bottleneck, the most common and pervasive is caching. As shown in Figure 2.1, a small, fast memory subsystem, the cache, is placed between the processor and the main memory. The purpose of caching is to keep the most commonly accessed information in a closer (and much faster) location, the cache.

Figure 2.1: A simple caching structure.

The cache generally keeps a copy of the most recently referenced information found in the main memory system. While this cache is usually orders of magnitude smaller than main memory, it is common to find that well over 90% of all memory references are in the cache [39]. The time it takes to fetch a piece of information from the cache is usually similar to the time it takes to execute a single instruction [17]. Obviously, this greatly speeds up the average time required to perform a memory access, and reduces the "memory bottleneck" mentioned above.

It is instructive to include more detail at this point. Each byte2 of memory generally has a certain numeric address. Normally the processor will request one to eight bytes of information at a time. If the information is not in the cache (this event is called a cache miss), the cache requests it from main memory. Once retrieved from main memory, the cache provides the information to the processor and keeps a copy. If the cache does have the information (a cache hit), it simply supplies its copy of the information.

2 A byte generally consists of 8 bits and can be thought of as a single number in the range of 0 to 255.

The remainder of section 2.1 goes into greater detail about caching, but not beyond what
can be found in any standard computer architecture book. Books by Hayes [38] and by Hennessy and Patterson [39, 56] in fact. contain this information in significantly more detail. Interested readers may find that Smith's seminal work [64] on caching provides a complete, if dated, overview on caching. Also, books by Przybylski [57] and Handy [37] are outstanding. Przybylski’s book, while published in 1990, provides an architectural viewpoint similar to that of this author. Handy's book is written from a circuit designer’s viewpoint. 2.1.1 The structure and organization of a standard cache In general a cache consists of two parts. the data part, which is the information stored in the cache, and the tag part, which indicates which address the data came from. Each tag consists of enough information to identify from which address the data came. When a memory access occurs one or more tags are searched and compared to the memory address of the memory access. If a tag matches the address of the memory access, the data associated with that tag is provided to the processor by the cache rather than requiring an access to the (much slower) main memory to get the same data. The one-to-one relationship between tags and data is illustrated in Figure 2.2. Ideally the cache could keep the data is the most likely to be reused. In practice, however, this is usually not viable. Imagine what would happen when the processor requests data ............................................................... ............................................................... Figure 2.2: A cache with data and tags. from the cache. The cache would need to search though each of the tags to find where the data is in the cache (if it is there at all). Since the whole point of the cache is to speedup access time, the time required to search all of the tags is usually not acceptable. A common solution to this problem is to assign each memory location a single place in the cache where it can reside. Since the main memory is much larger than the cache, many memory locations share the same assigned place in the cache. This means that only one of those memory locations could be in the cache at any given time. The net result is that flexibility is traded away for speed. This type of cache is called a direct-mapped cache. A cache where the data may be kept anywhere is called a fully-associative cache. A compromise between the direct-mapped cache and the fully-associative cache is the set-associative cache. A set-associative cache is broken into sets of locations, all of equal size. Rather than mapping each main memory location to a specific line in the cache, the memory location is mapped to a set of lines. A given main memory location is now restricted to a very small number of locations in the cache. This increases the flexibility of placing data into the cache while still greatly reducing the number of locations which need to be searched. Caches where. the set size is of size n are referred to as “n-way associative caches.” Figure 2.3 illustrates the organization of these three caches. One can think of both a direct-mapped cache and a fully-associative cache as extremes of a set-associative cache. In the remainder of section 2.1 we use the terminology associated with set-associative caches when referring to any of these three standard cache types. Direct-mapped Set-associative F ully-associative Figure 2.3: The three types of standard caches. 
If X is some address, this figure shows the different locations its cache block could be placed. 2.1 .2 Address mapping A standard cache of C bytes consists of K cache; lines. each B bytes long. Further, there are S sets of cache lines. where each set consists of N cache lines. It follows that C = K a: B and K = S * N. In this section it is formally described how the various memory addresses are mapped into a cache. The next few paragraphs are summarized by Figure 2.4, and referring to that figure may be helpful. Because of a property called spatial locality, (discussed in detail in section 2.1.6) it is likely that one of the next memory references will be to a location near the current reference. Thus, it makes sense to fetch the data near the current reference into the cache. To accomplish this task, the entire memory space is broken into non-overlapping blocks3 of size B, and the entire block where the current memory reference resides is brought into 31h this document the term cache line (or line) is used to describe the location in the cache while the data that goes into the cache line is called a cache block (or block). Some others use the term “block frame” for line while other, mainly older works, make no attempt to distinguish between the two concepts. the cache. If all the address bits other than bits 32 - lg(B) to bit 31 are the same, the two memory locations are in the same cache block. Mapping each cache block into a set of the cache is the next task. Because of spatial locality it is best to insure that two cache blocks that have addresses near each other do not get mapped into the same set. To insure this, the blocks are mapped into the cache in a round-robin like scheme, so that a given set has every Sth block mapped to it. If two memory locations have the same values in bits 32 — (lg(B) + lg(S)) to 31 — lg(B), the two addresses are mapped into the same set. These bits are often called the index bits as they are used as an index to select a set in the cache. Finally the remainder of the address, that is bits 0 to 31 — (19(5) + lg(B)), make up the tag. Because of the mapping into the set. and block. the remainder of the address is trivial to determine, and thus the tag is a unique identifier of the memory address from which the block came. Figure 2.4 summarizes the way a cache uses the address of a memory reference to generate. the set number and tag. The offset. simply states where the data is inside of the cache line. .................................................................................................................................................................... High-order bits Low—order bits Address: blto t0 bit31_([g(3)+lg(3)) bzt32—(19(B)+19(S)) t0 bit32_lg(3) t0 bit31 bit3l—lg(B) Offset: ' ’ lg(B) bits Set number (index): L; . ‘ lg(S) bits Tag: 329I1g(B)+1g(S)J bits .................................................................................................................................................................... Figure 2.4: How an address is used by the cache. B is block size and S is set size. Both are assumed to be greater than 1. 10 An example is likely useful at this point. Consider a 2-way associative cache which has 512 cache lines. each 16 bytes in size. Thus the data portion of the cache is 8192 bytes (or 8 Kilobytes) in size and there are 256 sets. Say the one byte of data at address F 012345616 is requested by the processor. Notice that the offset of this address is 615, the set number is 4516 and the tag is F012316. 
Thus, the two tags associated with set number 4515 (6910) are checked to see if they match the tag of the address (F 0123416). If either does, the data is found in the sixth byte (the offset) of the data part of the that cache line. 2.1.3 Replacement policies If the cache is full and a new block needs to be put into the cache, some other block must be evicted. Clearly the block that is to be evicted nmst be from the same set as the new block. In a direct-mapped cache, there is no decision which needs to be made—there is only one block to chose from. However, in a set-associative or fully-associative cache, some decision must be made. By far the most common scheme is to evict the block which has been unused the longest. This block is called the least-recently-used (LRU) block and the algorithm is called the LRU replacement algorithm. The reason the LRU algorithm is so commonly used is that programs tend to access data they have accessed recently. This property of programs is called temporal locality and is discussed further in 2.1.6. 2.1.4 Modern implementations Modern caches are actually more complex than have been described above. III this section we outline some of the techniques used in modern memory system design. Modern processors generally have multiple levels of cache. If the first and smallest cache 11 (called the L1 cache) does not have the requested block, it checks in the next level of the cache (called the L2 cache). In theory this would continue until the last level of cache has missed, when the data would be retrieved from the main memory. Today, most general- purpose processors have only two levels of cache, although processors with three levels, and even four levels of cache are starting to show up in the marketplace[35]. Another quirk of modern caching is that the L1 cache is often split into two parts, one for data and one for instructions. This is done to improve bandwidth, improve miss rates, and for several other reasons[39]. Usually the data and instruction request stream are combined at the L2 cache. The caching scheme where the data and instructions are split is often called the Harvard Architecture[39], and is illustrated in Figure 2.5. ............................................................................ Proccessor L1 L1 Instruction Data Cache Cache ............................................................................ Figure 2.5: The Harvard architecture. Lastly the use of virtual addresses has an impact on caching, particularly non-standard caching schemes. The details of virtual addressing can be found in most computer archi- tecture and operating system texts[39, 63]. The net effect is that the address the processor 12 uses (the virtual address) is different than the address the main memory system uses (the physical address). 111 general, the 71 least significant bits will not change when the virtual address is translated into the physical address. The value of 2" is then called the page size and is often around 212 bytes in size. The point at which the virtual address is translated into a physical address has a sig- nificant impact on the cache. This translation can happen before, after or during the cache lookup. While this decision has a Significant impact on cache performance, the only option which impacts this discussion is if the translation happens during the cache lookup. In that case, the index bits of the virtual address are used to select a set. 
2.1.3 Replacement policies

If the cache is full and a new block needs to be put into the cache, some other block must be evicted. Clearly the block that is to be evicted must be from the same set as the new block. In a direct-mapped cache there is no decision to be made; there is only one block to choose from. However, in a set-associative or fully-associative cache, some decision must be made. By far the most common scheme is to evict the block which has been unused the longest. This block is called the least-recently-used (LRU) block and the algorithm is called the LRU replacement algorithm. The reason the LRU algorithm is so commonly used is that programs tend to access data they have accessed recently. This property of programs is called temporal locality and is discussed further in section 2.1.6.

2.1.4 Modern implementations

Modern caches are actually more complex than has been described above. In this section we outline some of the techniques used in modern memory system design.

Modern processors generally have multiple levels of cache. If the first and smallest cache (called the L1 cache) does not have the requested block, it checks in the next level of the cache (called the L2 cache). In theory this would continue until the last level of cache has missed, when the data would be retrieved from the main memory. Today, most general-purpose processors have only two levels of cache, although processors with three levels, and even four levels, of cache are starting to show up in the marketplace [35].

Another quirk of modern caching is that the L1 cache is often split into two parts, one for data and one for instructions. This is done to improve bandwidth, improve miss rates, and for several other reasons [39]. Usually the data and instruction request streams are combined at the L2 cache. The caching scheme where the data and instructions are split is often called the Harvard architecture [39], and is illustrated in Figure 2.5.

Figure 2.5: The Harvard architecture.

Lastly, the use of virtual addresses has an impact on caching, particularly on non-standard caching schemes. The details of virtual addressing can be found in most computer architecture and operating system texts [39, 63]. The net effect is that the address the processor uses (the virtual address) is different than the address the main memory system uses (the physical address). In general, the n least significant bits will not change when the virtual address is translated into the physical address. The value of 2^n is then called the page size and is often around 2^12 bytes in size.

The point at which the virtual address is translated into a physical address has a significant impact on the cache. This translation can happen before, after, or during the cache lookup. While this decision has a significant impact on cache performance, the only option which impacts this discussion is if the translation happens during the cache lookup. In that case, the index bits of the virtual address are used to select a set. The virtual address is translated as the cache is searched, and the tag(s) are compared against the resulting physical address. This scheme has a number of names, including "virtually addressed, physically tagged," "virtually indexed, physically tagged," or simply a V/R cache [79]. Notice that it relies upon the index bits being the same for both the physical and virtual addresses4. This observation implies that the number of sets multiplied by the line size cannot be larger than the page size. The quick translation of virtual to physical addresses is done by a device called a translation lookaside buffer, or TLB.

4 Although Wu, among others, proposes ways around this restriction [79].

2.1.5 Caching metrics

The previous sections have hinted at the two basic metrics for measuring the goodness of a cache: hit rate and access time. Hit rate (HR) is simply the percent of accesses which are cache hits. The access time is the amount of time it takes to service a hit (T_hit) or a miss (T_miss). From these metrics the average access time of a memory access can be computed. It is simply HR × T_hit + (1 − HR) × T_miss. Recall that the advantage of direct-mapped caches is a faster access time (a lower T_hit), while they tend to have a lower hit rate (HR). Fully-associative caches usually have longer access times but a higher hit rate. A detailed, if somewhat out of date, analysis of these trade-offs can be found in section 5.3.3 of Przybylski's book [57].

T_miss is the time it takes for an access to be fulfilled if the cache does not have a copy of the data, so it has very little to do with the cache itself. Rather, it is dependent upon the memory system behind the cache. In fact, T_miss can be an important parameter when designing a cache. If T_miss is high, increasing the hit rate at the cost of T_hit will be more beneficial than if T_miss were lower. The short sketch below plugs illustrative numbers into this trade-off.
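As a concrete illustration of the trade-off just described, the sketch below plugs hypothetical cycle counts and hit rates (assumptions chosen only for illustration, not measurements from this dissertation) into the average-access-time formula HR × T_hit + (1 − HR) × T_miss.

# A minimal sketch of the average-access-time formula HR*T_hit + (1-HR)*T_miss.
# The cycle counts and hit rates below are illustrative assumptions only.
def average_access_time(hit_rate, t_hit, t_miss):
    return hit_rate * t_hit + (1 - hit_rate) * t_miss

# A hypothetical direct-mapped cache: faster hit time, lower hit rate.
print(round(average_access_time(0.90, t_hit=1, t_miss=50), 2))   # 5.9 cycles
# A hypothetical 2-way associative cache: slower hit time, higher hit rate.
print(round(average_access_time(0.93, t_hit=2, t_miss=50), 2))   # 5.36 cycles
# With a slower memory system (t_miss = 200) the higher hit rate matters more.
print(round(average_access_time(0.90, t_hit=1, t_miss=200), 2))  # 20.9 cycles
print(round(average_access_time(0.93, t_hit=2, t_miss=200), 2))  # 15.86 cycles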
2.1.6 Locality of reference

Locality of reference is the fundamental reason a small cache can achieve a high hit rate. A processor tends to access memory locations it has referenced recently (temporal locality) as well as locations near those it has referenced recently (spatial locality). A cache takes advantage of spatial locality by fetching a larger chunk of data than the processor requests (a cache block). As discussed in section 2.1.3, the temporal locality of a program is utilized by the replacement algorithm, and by the very concept of a cache.

The above definition of locality is fairly terse and certainly qualitative rather than quantitative. Unfortunately, that definition is literally as detailed as any standard computer architecture text gets. In Chapter 3 a number of locality metrics from the literature are discussed. In Chapter 4 a method for quantifying and visualizing locality is introduced in great detail.

2.2 Non-standard caches

While section 2.1 details how the most common caching schemes work, there are a large number of variations. Those variations include changes in replacement policy, cache structure, and address mapping. These variations usually can be evaluated in terms of the two cache metrics mentioned above: hit rate and access time. Each scheme attempts to improve one of the metrics, usually sacrificing the other to some small extent.

Evaluating these non-standard caching schemes can be difficult. First, it is difficult to prove that a given scheme will really improve the hit rate; even increasing associativity can hurt in certain cases [72]. Thus anyone proposing a new scheme has to show that their scheme works well on most real programs, quite a tall order. In general the authors of these schemes use the SPEC CPU benchmark suite5 as a standard method of evaluation. Even using this rather large suite of programs as a standard for measuring cache performance does not always generate reliable results. For one thing, the benchmark suite is biased toward certain types of programs [23]. Another problem is that when evaluating a novel caching scheme, one generally needs to write the cache simulator from scratch. This can be error-prone and has resulted in papers full of errors. For example, in an e-mail exchange with D. Rhoades, he agreed with this author that at least parts of their published work [59] suffer from significant simulator error.

5 Information about SPEC CPU can be found at http://www.spec.org.

The other difficulty is evaluating the access times of a novel caching scheme. The problem is that such an evaluation is an engineering decision where the quality of the design, the processor layout, and the properties of the process technology used will have significant impacts on the actual access time. There have been high-level models proposed to estimate the impact of certain changes at this level [25]. However, even if those models are accurate, many non-standard caching schemes do not easily fit into the parameters of these models.

Keeping these problems in evaluation firmly in mind, it is worthwhile to point out that just because these schemes are difficult to evaluate does not mean that they are not useful. Rather, it means that when considering them one needs to keep a degree of healthy skepticism about the reported results. Independent measures of the hit rates of these caches have been reported [14, 60] and are probably somewhat more reliable.

2.2.1 Hash caches

Perhaps the simplest and most common modification to a standard cache is to modify the way addresses are mapped into the sets of the cache. Rather than using the lg(S) bits, where S is the number of sets, to select a set of the cache (as described in section 2.1.2), a larger group of address bits is used as input to a function which selects the set. This function is called the hashing function, and a cache which uses it is called a hash cache. For example, if there are 2^m sets, one might use 2m different address bits and then perform a bit-wise XOR of those values to get an output of m bits. Figure 2.6 illustrates this hashing function. A more generic hashing cache was conceptually proposed by Smith in a 1978 work [65], while at least one real processor, Hewlett-Packard's PA7200, uses a hashed L1 cache [17].

Figure 2.6: An example of how the set number might be determined in a hash cache.

The benefit of the hash cache is that it can reduce set conflicts in certain situations. Imagine that a program is striding [39] through memory 1024 bytes at a time. That is, it might reference location 0, then 1024, then 2048, etc.6 Now imagine that we are using a direct-mapped cache with 32-byte cache lines and a total size of 4096 bytes. That cache has 4096/32 = 128 cache lines. As the program strides through memory it will only utilize four of those cache lines, specifically lines 0, 32, 64, and 96. The hashing function described above, by contrast, will allow the cache to utilize all 128 cache lines; the short sketch below illustrates the difference.

6 This type of behavior is quite common and usually associated with arrays.
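Here is a minimal sketch (illustrative code written for this discussion, with hypothetical helper names) contrasting ordinary bit-selection indexing with the XOR hashing function described above, using the 1024-byte stride example.

# A minimal sketch contrasting bit-selection indexing with an XOR hash,
# using the stride example above: 4096-byte direct-mapped cache, 32-byte lines.
LINE = 32
LINES = 128            # 4096 / 32

def bit_select_index(addr):
    return (addr // LINE) % LINES            # the normal lg(S) index bits

def xor_hash_index(addr):
    block = addr // LINE
    low = block % LINES                      # low 7 index bits
    high = (block // LINES) % LINES          # next 7 address bits
    return low ^ high                        # bit-wise XOR of the two groups

addrs = [k * 1024 for k in range(128)]       # striding 1024 bytes at a time
print(len({bit_select_index(a) for a in addrs}))  # 4 distinct lines (0, 32, 64, 96)
print(len({xor_hash_index(a) for a in addrs}))    # 128 distinct lines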
A hash cache also has a number of restrictions and drawbacks associated with it. Performing the hashing function can take a significant amount of time. While a single XOR gate delay may seem trivial, that extra time may not be available7. Also, as described in section 2.1.4, if the cache is virtually addressed but physically tagged, the extra bits used by the cache to perform the hashing function may not be available until after the translation from the virtual to the physical address [60].

7 In modern processes wire delays are starting to dominate processor latencies. This means that the delay introduced by an XOR may become less significant in the future. [4]

One's complement cache

A one's complement cache [81] is a hash cache with a specific hashing function. Specifically, in the case of a cache with 2^m sets, the address (other than the block offset bits) is broken into groups of m bits. These fields are then XORed together. That creates a mathematical function similar to what would occur if the cache were of size 2^m − 1 rather than 2^m, which can greatly reduce striding problems.

The authors of the paper which introduced the one's complement cache claim that miss rates tend to drop 10% to 20%. Other studies indicate that the improvement is much lower for most traces, usually on the order of 1% to 3% [14]. However, we have found that on certain data sets a one's complement cache can have a significant impact, sometimes reducing miss rates by a factor of 10. While this may be due to a horrible compiler, it is certain that horrible compilers exist in the real world.

2.2.2 Hash-rehash caches

While the hash cache tries to reduce the miss rate of the cache, a hash-rehash cache [1] tries to decrease the time it takes for the cache to supply the data in the case of a hit (T_hit). This reduction is accomplished by addressing the cache as a direct-mapped cache. If this access results in a miss, one of the index bits8 is toggled, and the resulting value is used as the index to search the cache again. Notice that the toggling of a single bit means that two cache lines are always paired, although the search order will vary. In this section we refer to X1 as the first line searched for a given memory reference and X2 as the second line searched.

8 Usually the most significant index bit. In any case, a given hash-rehash cache will always use the same toggle bit.

If a hit occurs when searching X1, the data is supplied to the processor and the cache is unchanged. If a hit occurs when searching X2, the cache block there is swapped with
The index into the cache is usually done by bit selection, that is by using the index bits as described in section 2.1.3, rather than using a more complex hashing function. The reason for this indexing scheme is that even a short XOR delay might negate the access time advantages of a. direct-mapped cache over a two-way associative cache. While papers on hash-rehash caches and their variants [1, 3, 16] have been common, we are unaware of one being used in a commercial productg. It is difficult to say exactly why this is, but there are some obvious possibilities. Clearly the lack of index bits in a virtually addressed, physically tagged cache is an issue, in part because an X -way set-associative cache requires lg(X) fewer index bits than a hash-rehash cache of the same size. Further, the access time advantage of a hash-rehash cache is going to be very dependent upon the exact technology used as well as other clock-limiting. paths in the chip. One might. be trading a lower hit rate for no actual improvement in access time. 9Although the IBM CMOS System/360’s cache organization is somewhat similar to a hash-rehash cache and is described below. 19 Column-associative caches One variant on hash-rehash caches is the colunm-associative cache[3]. A “rehash” bit is added to each cache line, which both improves the overall hit. rate of the cache and reduces the average number of times the cache needs to be probed. As above, two paired cache lines are given labels, but this time the labels A and B are used and no order of access is implied. The rehash bit associated with a given cache line is set if it is accessed as a rehash location, either during a search or during replacement. If a cache line with its rehash bit set is searched. no attempt is made to search the rehash cache line. Say that lines A and B both hold ‘hash blocks”, that is blocks that would be found during the first pass of a search. In that case neither rehash bit will be set. so if a search of A fails, B will be searched. During the search of B (which will miss) its rehash bit is set. During replacement the block in A will move to B and the block in B will be evicted to make room for the new block. Now if B gets searched first, it will result in a miss (it holds data associated with A) but A will not be searched (because the rehash bit in B was set). The new data. when fetched, will be placed in B and its rehash bit will be reset. The net effect is two-fold. First, a more LRU-like replacement scheme is implemented, resulting in a better hit rate than a hash-rehash cache [3,82]. Second, in some cases a miss does not require two searches through the cache. slightly reducing both the average access time and the number of requests to the cache. Other hash-rehash schemes There have been a number of other proposed schemes related to the hash-rehash cache. Detailing each is outside the scope of this document, but short descriptions of a few are provided in this section. 20 The predictive sequential associative cache [16] is very similar to the column associative cache. However, instead of swapping blocks to keep the MRU block in the first location, a steering bit table determines which location to search first. It also maintains true LRU ordering with a single LRU bit. A number of schemes for predicting the steering bit were also proposed. There are also two caches called the MRU cache. both of which have similarities to hash-rehash caches. The first was used by IBM’s CMOS System/ 370 [18]. 
Here they kept the MRU data of each set on chip, and had the rest of the cache off-chip. They then speculatively used the MRU data whenever the cache was accessed. Because the time to detect the mistake was always the same, the logic required to back out of a misprediction was fairly straight—forward. The other MRU cache [47] simply did a sequential search through a set of the cache. using MRU information to guide the search order. 2.2.3 Victim caches The victim cache is perhaps the most exciting non-standard cache thus far proposed. As originally proposed [45] it has no impact on the access time of an L1 hit (Thu) or on the hit rate (H R). Rather it reduces the average access time for an L1 miss (Tmiss). The only price is a small bit of additional die area. A victim cache is simply a, very small (usually 2 to 16 cache lines) fully-associative cache. It acts just like an L2 cache managed with an exclusive policy [9]. only rather than the normal latency associated with an L2 cache. the small size allows the access to occur in about 1 cycle. So when the cache evicts a block, rather than moving that block to the L2 cache (or just throwing it away) it is moved into the victim cache, and a block from the victim cache is evicted to the L2. Thus, the main L1 cache and the victim cache never store 21 the same block. Even very small victim caches can expect to recover 30% or more of the conflict misses that occur in a direct-mapped L1 [45]. The write buffers in Intel’s Pentium Pro as well as the “victim buffers” of the Alpha 21264 both perform some of the functions of a victim cache. It is worth noting that due to the small size of the victim cache it is quite reasonable to search the victim cache in parallel with the main L1 cache. While this will slightly increase the total latency of the cache, we believe that the access time will be similar to a two-way associative cache, assuming that the victim cache size is kept quite small. 2.2.4 Skew caches Only a small modification to a set-associative cache is required to create a skew cache. As illustrated in Figure 2.7 each bank in a set-associative cache can be thought of as a direct-mapped cache with each of the banks indexed by the same function. In a skew cache a different function is used for each bank. A 2-way skew-associative cache is illustrated in Figure 2.8. In this case two different functions, f0 and f1, are used and a block may be placed in different locations in different banks. Bank 0 Bank 1 address Figure 2.7: A model of a 2-way set—associative cache. [60] Such a simple modification may not seem significant, but studies have shown that a two- 22 Bank 0 Bank 1 address Figure 2.8: A 2-way skewed-associative cache. [60] way skew-associative cache produces hit rates similar to that of a four-way set associative cache [14, 60, 61]. This performance can be attributed to three effects: data dispersion, the effect of hashing, and self-data-reorganization [14,61]. The first effect creates an actual increase in associativity [14]. That is, temporal locality is more helpful to a skew cache than a standard cache of the same associativity. The second effect is an identical effect to that found in the above section on hash caches. The third effect is more of a side-effect of locality. A cache block can go into two different lines. If one line is an area of high—conflict, the block will likely get evicted. Next time it comes in it may go to the other line and, due to the lower conflict, stay there. 
Thus, on relatively slow changing data sets, skew caches can “learn” where to place blocks. 2.2.5 Smart caches All of the caches thus far discussed, both standard and non-standard, achieve their high hit rates purely by taking advantage of locality. While high hit rates are achieved by these caches, there are limits to how successful a cache can be by purely relying upon locality. As these limits are discussed in section 2.1.6 of this document, they will not be dwelled upon here. However, the basic observation is that even a fully-associative cache has non- 23 compulsory misses, and no LRU cache can be expected to perform better than that. Smart caches10 are those caches whose caching schemes rely upon something other than just locality of reference. Many of these schemes rely upon the observed fact that certain accesses are unlikely to be re-referenced in the near future [51,78], and having those refer- ences bypass the cache may be the best solution. By bypassing these low locality references other, ideally more-likely-to—be-used, cache blocks can be kept in the cache. Each of the smart cache techniques utilizes different ways of identifying the low-locality accesses. Additionally, many of these techniques provide methods for reducing the impact of bypassing accesses which have a high degree of locality. We are unaware of any studies which compare these smart caching schemes. The techniques used by these schemes are so diverse that such a study would be extremely difficult to conduct, and even more difficult to conduct fairly. Many of these techniques have not made it past the “proof of concept” stage— comparing them to more refined methods would not. show that one idea was inherently better than the other. While few quantitative comparisons seem to exist, it is possible to detail how some of these schemes work. Assist caches Hewlett-Packard's PA7200 utilizes a smart caching scheme which can be referred to as an assist cache [17]. Those familiar with HP’s recent endeavors will not be surprised that the identification of the low locality memory references is left to the compiler. In fact the assist cache, for which this scheme is named, is little more than the backup in case a high locality access was misidentified. 10The term “active management” is perhaps a more common term than “smart caching.” However, that term seems to imply a dynamic hardware caching scheme, and we Wish to also include compiler based management. 24 The PA72OO has two L1 data caches. The first, called the assist cache, is a physically- indexed, 64—entry, fully-associative cache using a FIFO replacement scheme. The second is a virtually—indexed, physically-tagged, hashed, direct-mapped cache and can be as large as one megabyte. In general, data is first put in the assist cache and evictions from the assist cache are placed in the main cache. The compiler can identify a certain load or store as being. in HP’S terms, “spatially local.” What they mean is that the compiler believes that the line may be accessed again due to spatial locality, but not due to temporal locality. Lines so identified are still put into the assist cache, but upon eviction the main cache is bypassed. This backup by the assist cache allows the compiler to be aggressive in its attempts to identify low locality accesses. Stream buffers Another method of identifying low-locality accesses is the hardware stream buffer [45, 51]. A stream buffer is basically pipe, that is a FIFO queue, where information can only be taken off the top of the pipe. 
By some method, either with information from the compiler or observation of the data stream or both, streams of references are discovered. Streams of references are generally individual load instructions walking though memory with a. known stride. The data can be prefeteched into the FIFOs and the caches can be bypassed entirely. There are many advantages to stream buffers. For one, traditional prefetches into a cache can result in data being evicted before it is used due to conflict in a certain set. Secondly, streaming data usually has very limited temporal locality, and steam buffers reduce the pollution of the cache by this low-locality data [51]. Also, modern DRAM components haven’t been truly random access since the introduction of prRAM [42]. For modern DRAMs, spatial locality can be very important indeed. Reordering the memory accesses 25 so that large chunks of each stream are fetched at once, rather than the program order of interleaved access, can greatly reduce the total latency to fetch data from the DRAM [52]. There are some potential problems with stream buffers. W'hile stream buffers do show significant performance increases on certain programs, non-scientific programs tend to see more limited benefits. This characteristic makes their use questionable in general-purpose processors. Also, stream buffers, unlike most of the. caching schemes discussed so far, generally require compiler help to achieve any significant performance gain. That means that effort by the chip designers is not enough; effort is required by the compiler writers too. It is worth noting that while we are unaware of a commercial chip that uses stream buffers, an experimental chip with stream buffers has been built at the University of Virginia [50]. Cacheable / N on-Allocatable Cache It can be argued that the true beginning of smart caching algorithms was a paper by Tyson et al. [74]. In that paper it is shown that not only do certain load instructions fetch the vast majority of the data, but some load instructions are responsible for a disproportionate amount of the cache misses. This observation means that identifying those instructions can provide large hints about which memory references should bypass the cache. They propose a number of techniques for identifying load instructions which should be bypassed, including compiler / profiling methods and dynamic two-bit schemes that resemble branch predictors. While their proposed schemes do not improve the average hit rate, they do show that memory system bandwidth can be reduced. More importantly this paper provided the fundamental basis for investigation of instruction-based load bypassing. This scheme is called C/ NA or CNA as the bypassed loads are marked as “cacheable but non— allocatable,” meaning that a given cache block may be in the cache but. if it is not, do not 26 allocate space for it in the cache. Other smart caching schemes A number of other smart-caching schemes have also been proposed. Some of the more interesting are briefly described in this section. A rather novel hardware scheme for smart caching was proposed by Johnson and ku [44]. The underlying idea is that certain data structures are, by their very nature, accessed in a low-locality manner. They propose to group the entire memory space into macroblocks and observe the reuse behavior of the various macroblocks. If a given macroblock has low reuse, the data from that block bypasses the cache. 
Under one version of their proposal they add a small buffer to act as a back-up in the event of a misprediction. The initial results of their study were encouraging, but covered a very limited set of programs”. In the literature this scheme is often called the MAT cache [69]. A research group at the University of Michigan has proposed a number of smart caching schemes. Their NTS cache is is similar to the MAT model in that the data addresses are used to predict the locality of a memory location. However, in the NTS cache the metric of locality considers whether the cache block had been used more than once in its previous “tour” of the L1 cache, rather than looking at a macroblock over a much longer time [70]. That some group has also improved the CNA scheme. Wong and Baer proposed a scheme to improve the hit rate of a highly associative L2 cache [78]. They tag each load instruction as either generating “temporal” cache lines or “nontemporal” cache lines. A standard L1 replacement policy is used, but the L2 gives preference to temporal cache lines. They propose both a compiler-based profiling scheme 11For example, programs like TPC-B have a very different access pattern than this scheme assumes. 27 and a hardware scheme to distinguish the two types of load instructions. Their results look both complete and promising, showing as much as a 12% improvement in instructions per clock for some programs, although 1 to 3% seems to be the norm. Anderson et al. study the impact of bypassing and superloading on a miss. A superload moves both the missed line and the surrounding lines into the cache [5]. They propose a hardware scheme that decides to load or superload based upon past behavior of that part of memory. They compare their on-line scheme to the off-line optimal algorithm, both with and without bypassing. The net result is that their scheme shows marked reduction in both the miss rate and bandwidth used when compared to a more traditional LRU cache. However, their scheme also utilizes an unlimited amount of memory to maintain information about previous loads. The net effect is that they show superloading and bypassing to have significant potential in an on-line environment. However, their exact imtflementation appears infeasible at this time. Hallnor and Reinhardt have recently proposed a much more complex smart caching scheme [36]. The fundamental argument they make is that memory is getting so slow compared to the processor speed that it is wise to take extra time on an L2 access in” exchange for a higher L2 hit rate. While this premise may be flawed”, their results are quite promising. Just as others have done, they apply a replacement scheme that uses past history. However, they also use techniques from operating systems to make the L2 cache effectively fully-associative. This associativity comes at higher system latency, but allows them to use their replacement scheme more effectively. The number of misses is greatly reduced, sometimes by as much as a factor of two. 12For example, L2 misses, if rare enough. might be dealt with in other ways, including changing threads or processes 28 2.2.6 Further reading While citations for each of the above non-standard caches have been provided. some of the papers provide outstanding explanations of a wide variety of schemes. In particular [70, 82] provide very nice overviews of hash-rehash caching and smart. caching respectively. 
Chapter 3

On the Measurement of Locality

The purpose of this chapter is to provide an overview of research in the area of locality measures. In the first section of this chapter, the meaning of the term "locality measure" is discussed. In Sections 3.2 and 3.3 various locality measures are classified and explained. Section 3.4 contains a discussion of LRU stack distance, both as a locality measure and in terms of some of its other applications in the literature.

3.1 Locality measures

Measuring the locality of a data stream is an imprecise science. In general, it is an attempt to reduce massive amounts of data, sometimes millions or billions of data points, into something which can be easily understood and used by a human. Each researcher has different goals and hopes for their measure, so each researcher proposes a different scheme. In fact, the motivation for each measure is perhaps as interesting as the measures themselves.

3.1.1 What is locality?

The exact definition of locality is a bit vague, which, as one would expect, contributes to the difficulty of measuring it. However, the term locality, as was briefly mentioned in Chapter 2, is used to describe the fact that there is a dependence between the addresses of previous memory accesses and those of future accesses. A processor tends to access memory locations it has referenced recently (temporal locality) as well as locations near those it has referenced recently (spatial locality). This phenomenon is to be expected—as a program moves through a data structure it will do so in a certain pattern; perhaps "walking" memory at a fixed stride size, or perhaps moving through a linked list or tree.

In reality, a program will generally interleave its accesses to various data structures. As an example, imagine a standard matrix multiply operation. One input array is accessed in row order, one in column order, and the output array is usually accessed in either row or column order. In addition, loop control variables and temporary variables are heavily utilized (although many of those may be allocated to registers and not manifest themselves as memory accesses). Now imagine that the matrix multiply operation is just one of a number of subroutines used by the program. The net effect is that a number of access patterns start, get interleaved with other access patterns, and then end.

These access patterns are much like threads in a tapestry. When examining that tapestry, what should be the goal? One could evaluate the individual threads, the patterns formed by those threads, or perhaps the tapestry as a whole. Each level of granularity is important, and each requires a different perspective.

3.1.2 Locality measures in the literature

The remainder of this chapter summarizes much of the existing literature about locality measures. Section 3.2 provides an overview of the most common way of examining locality: at the macroscopic level. These metrics are generally concerned with capturing the locality of an entire program, and generally rely upon the law of large numbers to function correctly. In that section a brief overview is also provided for a closely related topic, cache-centric locality models. Those models view locality as it is reflected in a cache. Section 3.3 looks at those locality metrics which are generally concerned with a much smaller number of references. These microscopic metrics tend to be most useful when examining certain program constructs or data structures. Many of them are loop-centric.
That is, they assume the existence of a single large looping structure with relatively regular behavior, something generally found in scientific computing. Finally, in Section 3.4 the idea of LRU stack distance is introduced. In addition to other applications, this scheme is clearly a macroscopic locality measure. It was one of the first locality measures of any type, and has a large number of applications. In Chapter 4 this measure is discussed in great detail and novel applications for it are introduced.

3.2 Macroscopic locality measures

Macroscopic locality measures are generally concerned with characterizing the locality of a program or a large portion of a program. Most of the papers in which these models are discussed focus upon predicting the cache miss rate. Their goal is to allow the quick characterization of a large number of programs on a large number of caches in a short period of time. The huge increases in disk space and processing power, as well as the high speed of cache simulators such as Cheetah [68], have made this less important, at least for standard caches, than it was in the past. However, by attempting to predict a miss rate these studies had to generate a model for locality. It is that part of their work, the locality model, on which we concentrate in this section.

3.2.1 Chow's empirical power law

Perhaps the simplest useful model of locality was proposed by Chow in the mid 1970s [19, 20]. He proposed that the miss rate of a cache could be matched to the equation

MR(C) = C^{-\alpha}

where α would be fit to the available data and where C is the capacity of the cache. Chow claimed that the miss rate is "most sensitive to the capacity" and that in his analysis issues such as "block size, the management algorithms and other factors" would be ignored. If the cache is considered to be a fully-associative cache using an LRU replacement algorithm (something his analysis would certainly allow), MR(C) is precisely a prediction of temporal locality as it is defined in Chapter 5.

While outdated and lacking a significant quantity of empirical data to support it, Chow's power law is very important to the study of locality. First, it forms the basis for one of the classic rules of thumb in cache design: increasing the cache size by a factor of 4 results in one half the miss rate. This would be the case if α were 0.5. This general rule, if not the 0.5 value, is experimentally observed in [64] among others.

Chow used his power law as the basis for a mathematical approach to the design of multiple levels of the memory hierarchy [20], a research topic that has continued well into the 1990's [43]. While it is his power law that is relevant to our work, the primary result of [19, 20] was to show that the optimal number of cache levels scales logarithmically with the capacity of the cache hierarchy.

3.2.2 The Fractal Model of Locality

Thiebaut's work [71] on the fractal model of locality is something of a variation on Chow's [20] power-law model from the decade before. How he arrives at his model is very different than Chow. Thiebaut claims that the access patterns of many programs are fractal in nature: having an access pattern in one area of memory and then jumping to a different area of memory and repeating that same pattern. He claims that if this is true, a two-piece power function, very similar to Chow's, should model the ratio of the number of unique memory locations over the total number of memory references.
His function, f(x), takes as an argument a number of successive memory references x and predicts how many different locations those x memory references will access. The fundamental argument is that memory accesses are in fact a random walk through memory. Also, the "jump size" of that random walk tends to be low, but has the occasional long jump in it. Thiebaut states that such a random walk is a special case of a fractal, and that other work in fractals can be used to model this access pattern, and thus the locality of the trace. He admits that other than in the instruction stream, much of the random walk theory is not supported by his data. He believes that this is due to memory accesses actually consisting of many different random walks. We suspect that examining the data stream generated by each individual load and store instruction might provide a way to observe these random walks, if they do indeed exist.

What Thiebaut does demonstrate is that, assuming the random walks are occurring, he should be able to predict the miss rate of a fully-associative LRU cache. His model has two parts: one for the first n_0 memory references, and one for the remainder of the accesses. In effect he produces a formula of the form MR(C) = A C^{\theta} with different values for A and θ for accesses before and after access number n_0.

The results can be criticized for only addressing a very small number of traces. We believe his model is of marginal value, but the underlying idea of data streams being modeled by a random walk is of interest, and may have applications to the stream detection and prediction found in [51] and related works.

3.2.3 Agarwal's analytic cache model

In 1989 Agarwal et al. presented a novel analytic cache model [2]. Although this work is presented as a model of cache behavior, it can be viewed primarily as modeling the locality of a reference stream. Their goal is to describe, with as few parameters as possible, the various properties of the reference stream. They attempt to quantify temporal locality, spatial locality, and the degree of conflict in the reference stream. The reference stream is broken into "time granules," and values are found for each parameter in each time interval as well as for the entire reference stream.

Ignoring multi-programming related issues, this analytic model has four locality parameters. The first is the number of unique blocks that are accessed. This provides a measure of the degree of temporal locality very similar to that used by Thiebaut. The second parameter is the collision rate: an attempt to measure how prone a given reference stream is to evicting live blocks¹. The final two parameters are used as inputs into a two-stage Markov model. The Markov model is used to describe the probability of memory accesses being to successive locations. It is this measure that is used to account for spatial locality.

1. That is, blocks that will be reused.

Their scheme does have its flaws. First, for computational reasons, it relies upon the cache using random replacement rather than LRU replacement. While they do claim that the miss rate for random replacement is a good indicator of a cache with LRU replacement, neither their data nor this author's experiences support this. They report that the model does a reasonable job estimating the miss rate for direct-mapped caches, with an average error of 4% to 15%, depending on the block size. However, two-way set-associative caches have an average error of 14% to 23%.
They also note that their model always predicts too high a miss rate for the set-associative cache, and attribute that behavior to the random replacement policy assumed by their model. While their error rate might seem unacceptably high, it should be noted that the space needed to store this analytic model is at least three orders of magnitude smaller than the trace itself. Further, the results from this model can be computed much faster than actual simulation. Thus, this model is a reasonable alternative for fast design space explorations. The results from such an exploration could be used to determine which cache configurations and traces to study in greater detail.

3.2.4 Grimsrud's locality metric

In the early 1990's Grimsrud, in some cases with others, wrote a number of papers about quantifying locality [32-34]. Their work describes a method of visualizing locality with a three-dimensional graph. The x dimension is spatial locality, the y dimension temporal locality, and the z dimension indicates the probability of an access having this locality value. Specifically, the locality of a memory reference trace T is defined to be L(T, s, d). This is the probability that an address s away from the currently referenced address is, in the remainder of the trace, first accessed d references in the future. More rigorously,

L(T, s, d) = \Pr\left[\, T[x] + s = T[x + d] \;\wedge\; (T[x] + s \neq T[x + 1], \ldots, T[x + d - 1]) \,\right]

To make the graph readable, Grimsrud groups the values of s and d into bins. That is, each value of d and s is rounded up to the nearest power of two. This rounding not only gives a logarithmic view, it also makes the surface of the graph much cleaner and more meaningful. This "binned" version of L is renamed L' and is also called the locality function. By graphing the three-dimensional locality function for a given trace, one can gain an intuitive feel for the type and degree of locality that exists in the trace.

Applications

Grimsrud provides a wide variety of applications for his locality metric. One very useful application is in evaluating artificial trace-generation techniques². While he states that the locality function cannot easily be used to generate an artificial trace, it is possible to check the correspondence between the locality functions of an artificially generated trace and the original trace it was based upon. Grimsrud evaluates a number of techniques, including a stack-based model [6, 8] and a Markov model, concluding that each of the models has certain flaws, and demonstrates the weaknesses of each.

2. Briefly, artificial trace-generation is an attempt to summarize the locality properties of a trace in a set of parameters. This is mainly used to avoid the need to store long traces on a computer [8, 33].

There are a number of other proposed applications [34] including trace characterization, benchmark evaluation, and the ability to visualize locality. The last is the most important in the context of locality measures, and Grimsrud explains how this feature can be used to help one identify loop characteristics including stride and the amount and nature of temporal locality in the reference stream.

3.2.5 Cache based locality

While Chow's, Thiebaut's, and Agarwal's models are cache models in a sense, in each case their results hinge upon their characterization of the reference stream in terms of a few parameters. The interaction of caches with their locality model can be viewed strictly as an application of the locality model.
However, there are also "pure" cache models that have some relevance to the study of locality. These models generally use "ideal" caches, which are unrealistic to actually implement, as a baseline for use in gauging the performance of a more realistic cache. In doing so they also provide a glimmer of insight into the locality of the reference stream being examined.

Perhaps the most widely used model of this type is the "three C's" model [23, 41], developed by Hill and popularized by Hennessy and Patterson in their two computer architecture texts [39, 56]. In this rather simplistic model, cache misses are broken down into three categories: compulsory, capacity, and conflict. Compulsory misses are misses to never-before referenced cache blocks. Capacity misses are those misses on which a fully-associative LRU cache of the same size would also have missed. Conflict misses are misses on which the fully-associative cache would not have missed. Clearly the compulsory misses tell us about the number of unique blocks in the reference stream, and the capacity and conflict misses describe the degree of temporal locality available.

A model closely related to the three C's model was proposed by Sugumar and Abraham in 1993 [68]. They take issue with using a fully-associative LRU cache as an "ideal" cache, and consequently as the basis for comparison. Rather, they suggest that more information could be gathered by using optimal placement algorithms. They instead divide cache misses into four categories: compulsory, capacity, mapping and replacement. Using their notation, let M(L, C, k, repl) represent the number of cache misses, on a given trace, in a cache where L, C, k and repl are the line size, cache size, associativity, and replacement strategy respectively. Table 3.1 shows how their method categorizes those misses. C/L represents a fully associative cache.

Miss type    | Number of misses
-------------|--------------------------------------------------
Compulsory   | M(L, any, any, any)
Capacity     | M(L, C, C/L, OPT) − M(L, any, any, any)
Mapping      | M(L, C, k, OPT) − M(L, C, C/L, OPT)
Replacement  | M(L, C, k, LRU) − M(L, C, k, OPT)

Table 3.1: A classification of misses using the optimal replacement policy. The arguments to the function M() are line size, cache size, associativity and replacement policy, in that order. L is the line size of the base cache, C is its cache size, and k is the degree of associativity.

Suppose we are evaluating a C-byte, k-way associative LRU cache. Compulsory misses would be those misses which occur because the cache block had never before been referenced, a definition identical to that of the three C's model. Capacity misses, however, are defined to be those non-compulsory misses which would also occur in a C-byte optimal, fully-associative cache. Mapping misses are those misses which do not occur in a C-byte fully-associative optimal cache, but do occur in a C-byte k-way associative optimal cache³. A replacement miss is one which occurs in a k-way set associative LRU cache but not in a k-way associative optimal cache. Ideally then, capacity misses are those accesses on which a cache is truly too small to expect to achieve a hit. Mapping misses represent misses due to structural issues (conflict), while replacement misses are those which a better replacement policy would have avoided.

3. Using our terminology from Chapter 2, we would call this a structural miss rather than a mapping miss.

Sugumar and Abraham's scheme has many advantages over the three C's model. For one thing, the three C's model assumes that an access which misses in the fully-associative cache also will miss in a less associative cache.
As is shown by [58, 72] and others, this need not be the case. In fact, as we show in Chapter 4, an access which just barely qualifies as a capacity miss has over a 40% chance of actually being a hit in a direct-mapped cache. Because Abraham and Sugumar use OPT rather than LRU as their basis for comparison, this issue does not arise in their work. In fact, the aggregate definitions in Table 3.1 can never be negative, something not true of the equivalent three C's definitions [68]. Additionally, this OPT-based scheme provides a method of evaluating the replacement policy, something not addressed by the three C's model.

There are other cache studies and metrics which can provide information about the locality of the trace. Liveness, used by [15, 49, 54] among others, describes the percent of blocks in the cache which will be re-referenced sometime in the future (or perhaps within some fixed time window). Additionally, many of the smart caching studies have attempted to find the relationship between locality and the referencing instruction [74] or locality and the referenced location [44]. These were briefly described in Chapter 2.

3.3 Microscopic locality measures

As noted above, microscopic locality measures are concerned with relatively small numbers of memory references, often on the order of hundreds to tens of thousands. This section provides a brief overview of the various microscopic locality studies in the literature.

3.3.1 Instantaneous locality

Instantaneous locality, as proposed by Weikle et al., is perhaps the most recent attempt in the literature to quantify locality [76]. They argue that caches filter locality, just as a lens can filter light or electronic filters can filter signal frequencies. Their goal is to quantify locality in such a way that this filtering can be observed and have a "high correlation to intuition."

An electronic signal is simply a changing voltage which has a single value at any given time. As such, they propose that the locality of the reference stream should also have some value at any given time. The measure they tentatively propose for the instantaneous locality, l_i, of a reference a_i in a stream of references (a_0, a_1, a_2, ...) is:

l_i = \sum_{j=0}^{i-1} \frac{1}{(|a_i - a_j| + 1)\,|j - i|}

The first term, 1/(|a_i − a_j| + 1), corresponds to spatial locality. If the two addresses a_i and a_j are the same, this term will be unity; otherwise it will be (significantly) less than one. The second term, 1/|j − i|, corresponds to temporal locality. If the two accesses are adjacent (that is, i = j ± 1), this term is unity; otherwise it is less than one. When these two terms are multiplied together, a value ranging from nearly zero to one (for two adjacent accesses to the same location) is generated. The summation of these terms determines the value of the instantaneous locality of a_i.

Once a definition of instantaneous locality has been created, the next step is using it to see how well locality is filtered out of the reference stream. In order to visualize this, one other term, the instantaneous hit rate for reference i, is defined as

h_i = \delta_i + \alpha h_{i-1}

where α = 0.5 and δ_i is one if access i was a hit and zero otherwise. This gives h_i a range from zero to two, where higher values indicate a higher recent hit rate.
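The two definitions above translate directly into a short computation. The sketch below evaluates them on a made-up toy trace; it is only an illustration of the formulas as stated, not the authors' own tool.

```python
# Minimal sketch of the instantaneous-locality and instantaneous-hit-rate
# formulas given above. The toy trace and hit/miss pattern are made up
# purely for illustration.

def instantaneous_locality(trace):
    """l_i = sum over j < i of 1 / ((|a_i - a_j| + 1) * |j - i|)."""
    values = []
    for i, a_i in enumerate(trace):
        l_i = sum(1.0 / ((abs(a_i - a_j) + 1) * (i - j))
                  for j, a_j in enumerate(trace[:i]))
        values.append(l_i)
    return values

def instantaneous_hit_rate(hits, alpha=0.5):
    """h_i = delta_i + alpha * h_{i-1}; ranges from 0 to 2 when alpha = 0.5."""
    h, out = 0.0, []
    for delta in hits:
        h = delta + alpha * h
        out.append(h)
    return out

trace = [100, 104, 100, 512, 104, 516]   # toy addresses
hits  = [0, 0, 1, 0, 1, 1]               # toy hit/miss outcomes
print(instantaneous_locality(trace))
print(instantaneous_hit_rate(hits))
```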
With this definition firmly in hand, they show that accesses with high instantaneous locality values tend to have high hit rates, and that the output locality stream (the requests made by the cache) generally has less locality than the input stream. This metric is also used to compare different cache types and to try to recognize program constructs from their instantaneous locality graphs [76]. Another application is that the "bursty" nature of real programs—real reference streams tend to have periods of high locality separated by periods of low locality [10]—can be easily seen and measured.

The authors of this work readily admit that the exact formulas they have selected may not be the best. Rather, they defend the notion that instantaneous locality, as a concept, is both useful and important, and better formulas will only improve the concept. This author would certainly claim that their formula for instantaneous locality greatly underestimates the importance of spatial locality. For example, two memory accesses that are only 8 bytes apart are given a spatial locality measure of 1/9 on a scale of 0 to 1, too low by far in a time when cache lines are often 64 bytes or larger. Another limitation of this work, and why we characterize it as a microscopic locality measure, is that viewing large reference streams is difficult. The largest example in their paper dealt with fewer than 4000 memory references. Of course, with the aid of a computer and some additional techniques, visualizing much larger traces may be possible⁴.

4. They have written a program which aids the analysis of instantaneous locality, but we have not had the opportunity to examine it in any detail.

3.3.2 Loop-based locality metrics

A common form of locality analysis has been based on the study of loop nests. While it is interesting, loop-based study is only peripherally related to our own locality studies (found in Chapter 4) and as such will only be briefly discussed. Two papers have been chosen as representative of the field. McKinley and Temam's work [53] evaluates the applicability of common assertions about program locality in the context of loop locality, while Ghosh et al. [29] generate "cache miss-equations" which can be used to guide code optimizations in the compiler. In the remainder of this section some of the basic jargon of this field is introduced, followed by a brief overview of the two selected papers.

In general, loop-based locality studies are targeted at compiler writers. Usually their purpose is to propose an algorithm for analyzing the memory reference patterns and find a way for the compiler to reduce the number of data cache misses. In doing so, these studies have adopted interesting terminology. Memory references are described in two ways. The first is the three C's model discussed in Section 3.2.5. The other method describes a non-compulsory reference with a three-word phrase consisting of: hit or miss; temporal or spatial; and self or group. So a given memory access might be a "spatial group miss" or a "temporal self hit" or any combination of the three terms. An access to address X by instruction I is called temporal if the last time (which may be now) the block in which X resides was in memory was due to a reference to X itself. Otherwise, the access is labeled as spatial.
If the access was temporal and instruction I was the instruction which had last referenced X, the access is labeled as "self." If the access was spatial and the last instruction that referenced the block was I, the access is also labeled as "self." Otherwise, it is labeled as "group."

This jargon is useful for a couple of reasons. First, it is the only locality terminology of which we are aware that indicates something about the referencing instruction, an important concept for some smart caching schemes [74]. Second, it allows a high-level understanding of the degree of temporal and spatial locality, something which otherwise has required a more complex scheme such as [33]. It also considers the access patterns of hits as well as misses, something the three C's model does not.

Examples of loop-locality from the literature

McKinley and Temam [53] point out that most locality studies examine program locality, while compiler writers tend to concentrate on cache optimizations restricted to looping structures. They find that conflict misses are much more common in looping structures than capacity misses. Also, they observe that loops see more spatial reuse than the overall program. While neither of these results comes as any surprise, they do provide further evidence that most of the cache misses occur between looping structures. This observation means that rather than focusing on the "low-hanging fruit" of intra-loop cache placement, inter-loop caching issues need to be examined more carefully by compiler writers.

Ghosh et al. [29] provide a much more traditional view of loop locality. They generate a set of mathematical equations for certain highly-restricted looping structures. These equations can be used to determine the optimal tiling and array padding required to minimize the miss rate. In creating these equations they have managed to quantify locality in a very different way than any other study considered in this thesis. While it is unclear if their methods would be useful to a human attempting to understand loop locality, they are clearly useful for an optimizing compiler.

3.4 Stack distance

A program trace consists of the ordered set of addresses referenced by a program. Imagine pushing each memory reference onto a stack. If the reference had occurred earlier, it would be removed from its current position in the stack and then pushed onto the top of the stack. We define the stack distance of a reference R to be the depth in the stack from which it was fetched. If R is not on the stack (because it had never before been referenced) we define its stack distance to be ∞. This concept has been used in [8, 13, 33, 41, 49, 58, 64]. An equivalent definition of stack distance is to count the number of unique references which have occurred since the last reference to R.

The earliest paper which used stack distance in this context was by Mattson et al. [49], where they defined the top of the stack to be the 'one' location rather than the more standard convention of the top of the stack being the 'zero' location. Others followed suit, and for the sake of consistency with previous work we adopt the same notation in the remainder of this chapter. Thus, the stack distance of a reference is the number of unique references which have occurred since the last reference to that location, plus one. Figure 3.1 illustrates the computation of the stack distance value.
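The same computation can be written as a short sketch. The code below tracks the LRU stack and reports each reference's stack distance under the convention just adopted (top of stack is one, first-time references are infinite); the example trace is made up.

```python
# Minimal sketch of the stack-distance computation just described
# (Mattson convention: top of stack has depth 1; first-time references
# get a distance of infinity). The example trace is made up.

import math

def stack_distances(trace):
    stack = []                             # most recent reference at index 0
    dists = []
    for ref in trace:
        if ref in stack:
            depth = stack.index(ref) + 1   # top of stack counts as 1
            stack.remove(ref)
        else:
            depth = math.inf               # never referenced before
        stack.insert(0, ref)               # push (back) onto the top
        dists.append(depth)
    return dists

print(stack_distances(['A', 'B', 'C', 'A', 'B', 'B', 'C']))
# -> [inf, inf, inf, 3, 3, 1, 3]
```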
[Figure 3.1: Illustration of the stack distance of a reference stream.]

[Figure 3.2: A graph of the locality (the sd and SD values) found in the reference stream of Figure 3.1.]

3.4.1 Stack-distance distribution functions

Given a memory-reference trace T consisting of memory references (t_1, t_2, ..., t_L), define f(T, i, d) to be 1 if the i-th reference in T has a stack distance of d, and 0 otherwise. Define the functions sd(T, d) and SD(T, d) as follows:

sd(T, d) = \frac{1}{L} \sum_{i=1}^{L} f(T, i, d)

SD(T, d) = \begin{cases} 1 & \text{if } d = \infty \\ \sum_{i=1}^{d} sd(T, i) & \text{otherwise} \end{cases}

That is, sd(T, d) is the probability that in the trace T a randomly selected reference will have a stack distance of d, while SD(T, d) is the probability that such a reference will have a stack distance of d or less. Figure 3.2 is a graph of the values of sd(T, d) and SD(T, d) for the trace used in Figure 3.1.

3.4.2 Stack distance in the literature

Stack distance, as a means of examining reference streams, has been around for over three decades [49]. In the literature one can find it used as an analytic model of locality [26, 41, 58, 65], as a method to generate artificial traces [6, 8, 33], and as part of an algorithm for hardware simulation [49, 68]. The remainder of this section addresses these applications.

Stack distance as an analytic model of locality

It is fair to say that the Stack Distance Model (SDM) was one of the most popular models of cache behavior during the late 1970s [58, 64], slowly replacing the Independent Reference Model (IRM) and the Working Set Model (WSM) [24, 58]. The major difference between the WSM and the SDM is that the WSM counts the number of intervening references, rather than the number of unique intervening references, between successive accesses to the same location. The WSM and the SDM are fairly similar and, for the most part, represent the same information. However, the WSM is both larger to store (its maximum size is bounded by the number of accesses rather than the size of the working set) and is generally more difficult to utilize. A detailed explanation of each of these models, and other related models, can be found in [26].

In 1977, a Stanford technical report was written by Rau on the properties and applications of the stack model [58]. Rau's highly mathematical paper develops a number of important results for the stack distance model when applied to fully-associative caches. Rau also shows that given sd(T, d), one can derive a number of values including the distribution governing how long an item will stay in the cache, some occupancy results for multi-level caches, and how stack distance values relate to set-associative caches. Most relevant to our work in Chapter 5, Rau finds that if ms(q, a) is the miss rate for a set-associative cache with q sets of size a, its value can be computed as:

ms(q, a) = 1 - \sum_{i=1}^{N} sd(i) \, B_a(i)

These results assume that the cache blocks are randomly mapped into the cache. That is, any given cache block has an equal chance of conflicting with another block. This assumption is empirically demonstrated in [65] as well as in Chapter 5. Also, the realization that hash caches (discussed in Section 2.2.1) generally contribute very little in the way of performance gain hints that this assumption is reasonable.
Rau briefly discusses the function B_a(i) and concludes that a more highly set-associative cache will result in a lower miss rate if (but not only if) sd(i) ≥ sd(i + 1) for all i, that is, if sd is a non-increasing function. In Chapter 5 we re-derive B_a(i) and examine it in much greater detail.

A year after Rau's report, Smith proposed a variation on this caching model [65] and then refined it slightly in 1982 [64]. His and Rau's equations are equivalent, but Smith changes the order of the summations and makes a few other cosmetic changes. Smith is mainly concerned with generating an analytic model for p(i) using the stack distance model, combined with common sense as a reality check. Smith also provides empirical evidence that the actual miss ratio and predicted miss ratio are very similar. He provides five traces and simply graphs the actual and predicted miss rates and shows that they are very similar. While this evidence is intriguing, [2], among others, criticize this work as not being validated on a wide variety of traces.

Hill and Smith again describe the same analytic model of locality in their 1989 work [41]. This time a much larger set of programs is used to validate the model. They show that the vast majority of programs they test achieve a miss rate within 5% to 10% of the value predicted by the model. These results are more satisfying, but no attempt is made to explain the error or to reduce it.

3.4.3 Algorithmic use of the stack model

The first application of stack distances we are aware of came in 1970. Mattson et al. proposed using stack-based algorithms to quickly simulate the hit rate of caches over a wide range of values and replacement policies [49]. Their algorithms apply to only a limited number of replacement algorithms, but those include the LRU, FIFO, optimal, and certain pseudo-random replacement algorithms. Mattson et al. focus on fully-associative caches⁵ for nearly all of the paper, but do address maintaining separate stacks for each set in the cache for less-associative caches. The majority of the paper is dedicated to a stack-based optimal placement algorithm [11] and its associated proof of correctness. This paper is interesting because it clearly sparked the idea of using a stack-based model as a locality metric as well as a means of artificial trace generation. Further, this paper is the first to propose an efficient method of computing the optimal placement policy of a cache, and would not be surpassed in that respect until work by Sugumar and Abraham [67, 68].

5. It is interesting to note that this paper never actually uses the terms "associativity" or "cache." Rather, they use the terms "congruence mapping" and "buffer" respectively. We assume that this is because the terminology used today did not exist or was not standard at that time.

3.4.4 Trace compression

Trace-driven simulation is a commonly used method for evaluating the behavior of a memory hierarchy [75]. The trace is gathered by running the program, either in a simulator or on real hardware, and recording every memory access that occurs. This address trace can then be used to drive a variety of memory hierarchies. Pure simulation, where the processor and memory hierarchy are simulated in tandem, will provide more accurate data than trace-driven simulation. However, trace-driven simulation is considerably faster and, because the future accesses are already known, allows for off-line replacement algorithms such as OPT to be studied.
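As a concrete illustration of trace-driven simulation, the sketch below drives a small set-associative LRU cache with a recorded address trace. The cache parameters and the toy trace are illustrative assumptions, not the configurations studied in this dissertation.

```python
# Minimal sketch of a trace-driven cache simulation of the kind described
# above: a set-associative, LRU-replacement cache driven by a recorded
# address trace. Sizes and the toy trace are illustrative assumptions.

from collections import OrderedDict

def simulate(trace, line_size=32, num_sets=64, assoc=2):
    sets = [OrderedDict() for _ in range(num_sets)]   # one LRU list per set
    hits = 0
    for addr in trace:
        block = addr // line_size
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            s.move_to_end(block)          # mark as most recently used
        else:
            if len(s) >= assoc:
                s.popitem(last=False)     # evict the LRU block
            s[block] = True
    return hits / len(trace)

toy_trace = [i * 32 for i in range(128)] * 4   # stride through 4 KB, four times
print(simulate(toy_trace))
```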
In 1981 Babaoglu proposed keeping the stack distances found in each trace rather than storing the trace itself [8]. He then proposed generating an 'artificial trace' from these values using a Monte-Carlo method. While his technique requires that some number of accesses be wasted warming up the cache, it did allow outstanding savings in terms of disk space. The size of the information stored depends solely upon the largest stack distance considered—it is independent of the size of both the original and the generated trace. It would be expected that the same errors encountered by Hill's analytic model [41] would also occur when using Babaoglu's technique.

There exist other stack-distance techniques which allow for perfectly accurate simulations of higher levels of the memory hierarchy. The most recent and probably best of these techniques is used in the area of virtual memory simulations [46]. The basic idea is that if all memory components have at least an associativity of k, some portion of the actual memory trace is redundant—mainly those references with a stack distance of less than k. They show that for values of k = 5 the page-level traces are reduced by a factor of about 20 on average. While a minimum associativity of 5 is probably unrealistic for standard cache simulations, it is quite realistic when considering a TLB as the first level cache of the page table.

3.4.5 Conclusion

In this chapter a large number of locality measures were discussed, concluding with an overview of the stack distance measure of locality. Most of these techniques are used to address a specific issue: for example, compiler loop optimizations, understanding how locality is filtered by a cache, or compressing memory reference streams. None of these techniques is intended to be a tool for gaining insight into whole-program locality. The models of locality proposed by Weikle and Grimsrud come the closest in terms of intent and ability, but both of these work best on the small scale: on a single loop or a small segment of code respectively. In the following chapter it is argued that the stack distance model is a fine general technique for understanding locality and how caches interact with that locality. Further, it can be used to identify excessive conflict, explain the behavior of non-standard caches, and show how caches filter locality out of a reference stream.

Chapter 4

Stack distance as a locality measure

As described in Chapter 1, there is a widening gap between DRAM speeds and processor performance. This widening gap has simultaneously made the interaction between locality and caches more complex as well as more important. The additional complexity is, in part, due to the fairly modern addition of the non-standard cache schemes discussed in Chapter 2. Those caches, while similar to previous cache schemes, none-the-less interact with locality very differently than standard caches. Further, locality itself is modified by the filtering effect of caches. This filtering means that a reference stream to an L1 cache will have significantly different locality properties than the reference stream to an L2 cache [77]. As discussed in Chapter 3, simple rules of thumb such as the 2:1 Cache Rule and the three C's model [23, 39, 41] over-simplify the situation by making assumptions about the locality of reference streams and even about how that locality interacts with the caches.
As also discussed in Chapter 3, other models of locality are either too complex [34], too narrow in scope [29, 53], or otherwise inappropriate [76] for considering the locality of an entire program. This whole-program locality is important mainly for cache designers and compiler writers.

The difficulty of understanding whole-program locality makes it very difficult for a cache designer to propose appropriate cache schemes or to understand how and when non-standard cache schemes will provide benefit. Without this understanding the only real options are to rely upon flawed rules of thumb or upon hit-rate results on a large set of simulations. Neither is satisfactory and both can lead to poor designs. Relying upon hit-rate results is problematic because the designer needs to worry about a huge set of programs, including programs which do not yet exist. This need results in huge amounts of simulation activity as well as significant effort spent on prognostication. A model of locality would, at the least, allow the cache designer to narrow the set of programs used by eliminating those with similar locality properties. The model also could be used to find trends and make more informed predictions about future workloads. That said, the most important use of the model would not be in simply improving current techniques but in allowing the cache designer to better understand locality and how that locality interacts with the various possible cache designs.

It is less obvious that compiler writers also have a need to understand locality at the program level. It is clear and well accepted that "microscopic" locality, as discussed in Chapter 3, is extremely important to compiler writers. However, whole-program locality is mostly a function of the program being compiled rather than the compiler itself. Where whole-program locality is important is in the context of data layout. If indeed the whole-program locality is determined by the program and not the compiler, the only whole-program issue that can be addressed by the compiler is conflict between the various memory references in the cache. The locality and cache models can be used to predict hit rates given random conflict between references. If, through profiling, it can be determined that a given data layout has conflict that is worse than random, it may be worth having the compiler try a different data layout. That is, the cache and locality models can provide guidance as to expected hit rates, and this can be tested against the actual hit rates achieved. If the actual hit rate is worse than the predicted value, a new data layout can be proposed and tested. It may be that the program itself has led to this conflict (for example by striding through a one-dimensional array), but it may be that the arbitrary layout chosen by the compiler has led to 'excessive' conflict.

This chapter provides a unified method for understanding and quantifying locality and how caches interact with that locality. The chapter addresses general concepts of locality (Sections 4.1 and 4.2), interactions with non-standard caches (Section 4.3), as well as applications (Section 4.4) and future work (Section 4.5). Even though there are many details found in this chapter, our primary goal is to provide insight into caching and locality.

4.1 The stack distance model of locality revisited

In this section we describe two models which are based upon the stack distance metric.
The first is the locality model; that is, describing the locality of a trace by using the terms sd(T, d) and SD(T, d) as defined in Chapter 3. This model of locality allows for a concise description of the locality of a trace. As is shown later in this chapter, the stack-based model of locality can be used to help understand the degree and type of locality in a trace. Such understanding can otherwise be difficult to achieve when dealing with millions or billions of data references.

The second model is a cache model. It describes how a cache interacts with the locality of a reference stream. Specifically, it predicts the expected hit rate of memory references of a given stack distance. It is important to note that this model is arrived at independent of a specific reference stream: the model describes how the cache can be expected to interact with any reference stream. As is shown later in this chapter, this model can greatly aid in the understanding of caches as well as understanding the real impact of cache structure, associativity and size.

These two models can be used together to produce a number of interesting results. These results include the ability to predict the hit rate a reference stream will achieve on a given cache. None-the-less, it is important to keep in mind that these two models are distinct. The locality model is simply a way of describing the locality of a memory reference stream, while the cache model describes how a cache interacts with the locality of a memory reference stream.

4.1.1 LRU stack distance revisited

In Chapter 3 the idea of stack distance was introduced. In this section that introduction is refined and used to describe the characteristics of the various SPEC CPU2000 benchmarks, hereafter referred to by the name "Spec". These benchmarks are targeted at measuring CPU performance, but are also the standard for measuring L1-cache performance. These benchmarks are designed to mostly fit in the cache hierarchy, and it is well known that these benchmarks do not generally put a significant load on any cache other than the L1 [21].

The Spec benchmarks run for minutes to hours on real hardware and generate well over a hundred billion memory references. Gathering data on such large runs in simulated hardware is extremely time consuming, and storing such large traces, even on modern disk drives, is impossible. As such, we limit our studies to samples of one billion memory references for each benchmark. A reasonable period of time is spent allowing the program to finish its initialization and begin doing actual work (at least 100 million memory references) before this sample is gathered. Unless otherwise noted, all traces are gathered from optimized binaries for the DEC Alpha and run on the SimpleScalar simulator. The SimpleScalar website, www.simplescalar.com, has details about this simulator as well as the executables used. These traces were then run through a modified version of the Cheetah simulator [67] to generate the data found in this section. These results were also cross-checked against other, much less efficient, simulators written by this author and others.

Recall from Section 3.4 that SD(T, d) is the percent of memory accesses in a memory reference trace T that have a stack distance of d or less. Equivalently, SD(T, d) is the hit rate achieved by a fully-associative cache consisting of d + 1 cache lines on that same trace. This function is termed the cumulative stack distance of the trace.
Figure 4.1 provides a graph of the cumulative stack distance for the first three benchmarks in the floating-point portion of the Spec benchmark. For purposes of readability the X-axis has a logarithmic scale. As will be true throughout this chapter, the cache line size is 32 bytes. Notice that were the cache fully associative, art and ammp both would see very little improvement in hit rate from a size of 50 lines (about 2KB) to 2000 lines (about 65KB). However, ammp would see a huge reduction in its miss rate from 2000 lines to 16000 lines (256KB) while art would see only a minor improvement.

At this time it is worthwhile to emphasize the fact that characterizations can be made about the relative locality of the various traces. For example, Figure 4.1 clearly shows that art has less locality than the other two traces. Historically the characterization of a trace was more ad hoc. One might examine hit rates over a number of different caches, varying size and associativity. However, in general only "reasonable" caches would be considered: generally powers of two and in the range of 4-64KB in terms of size, and usually of small associativity.

P_{HIT}(B, A, D) = \sum_{i=0}^{A-1} \binom{D}{i} \left(\frac{A}{B}\right)^{i} \left(1 - \frac{A}{B}\right)^{D-i}     (4.4)

Unless otherwise noted, the formulation of Equation 4.4, as well as the order of arguments, will be used when computing P_HIT for standard caches. Figure 4.3 provides a graphical representation of these values for three caches of different associativity, each with 128 cache lines. Notice that the more associative caches do better on references with a low stack distance while the less associative caches do better on data with a higher stack distance. Also note that the crossover point is approximately the same as the number of cache lines.

[Figure 4.3: Illustration of the theoretical relationship between hit rates, associativity, and the locality of a given memory reference. Predicted hit rate versus stack distance (0 to 500) for direct-mapped, 2-way associative, and 4-way associative caches of 128 lines.]

A few interesting properties of these formulas are worth mentioning, although formal proofs are not provided. The first is that

\sum_{D=0}^{\infty} P_{HIT}(B, A, D) = B

That is, no matter the associativity, the area under the curve is constant; it is simply distributed over different points. The intuitive argument for this result is that at any given time B blocks are stored in the cache. Say that the status of the cache was sampled N times and the stack distance of each block in the cache was recorded. After a large number of samples, taking the count of each stack distance and dividing it by N should yield the curve described by Equation 4.4. Clearly, as B values are recorded at each step, the summation over all of the stack distances must be B.

Another interesting fact is that for positive integer values of B and D, and where the associativity X is greater than Y, it is the case that:

\sum_{i=0}^{D} P_{HIT}(B, X, i) \;\geq\; \sum_{i=0}^{D} P_{HIT}(B, Y, i)

That is, if you consider two caches of different associativity but the same number of blocks and sum the area under the curve from zero to some point D, the less associative cache will never have a greater sum than the more associative cache. One can observe this phenomenon in Figure 4.3: the less associative cache has more area to the right of any given stack distance.
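To make the formulation concrete, the sketch below evaluates the random-conflict form of Equation 4.4 as reconstructed above, with arguments (B cache lines, associativity A, stack distance D), and numerically spot-checks the two properties just discussed. The parameter choices are illustrative only.

```python
# Sketch of the P_HIT formulation of Equation 4.4 as given above
# (B = cache lines, A = associativity, D = stack distance, random conflict),
# plus numeric spot-checks of the two properties discussed in the text.
# The parameter choices are illustrative.

from math import comb

def p_hit(B, A, D):
    p = A / B                       # chance an intervening block maps to our set
    return sum(comb(D, i) * p**i * (1 - p)**(D - i) for i in range(A))

B = 128
# Property 1: the area under each curve is (approximately) B.
for A in (1, 2, 4):
    area = sum(p_hit(B, A, D) for D in range(20 * B))
    print(A, round(area, 2))        # each should be close to 128

# Figure 4.3 behavior: higher associativity wins at small stack distances,
# lower associativity wins in the tail, with the crossover near B.
for D in (32, 128, 256):
    print(D, round(p_hit(B, 1, D), 3), round(p_hit(B, 4, D), 3))
```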
As mentioned in Chapter 3, Rau formally derived a corollary to the above result: if sd(T, d) > sd(T, d + 1) for all values of d, then it is the case that a more associative cache can expect to achieve a higher hit rate.

4.2 Validation of the cache model

The cache model relies upon one fundamental assumption: conflict in a cache is random. This assumption may seem somewhat counter-intuitive for two reasons. First, array-based programs often stride through memory at even intervals [39]. This striding can result in excessive conflict in some sets while others are virtually unused. The second reason to cast doubt on the assumption of random conflict is more significant: spatial locality. Recall from Chapter 2 that standard caches are organized so that two accesses cannot conflict if they are within S − 1 blocks of each other. For most modern computer systems this means that two cache blocks will never conflict if they are placed into the same page. Because of spatial locality it can reasonably be expected that memory references near each other in time are less likely to conflict with each other than random selection might indicate.

In this section it is demonstrated that while the assumption of random conflict is not exact, it is fairly close. Specifically, we do the following:

- Choose two cache configurations, a 4KB direct-mapped cache and a 4KB two-way associative cache, and show that the predicted hit rates at each stack distance map fairly well to the actual hit rates.
- Show that over a wide range of cache configurations the model generally predicts the hit rate within 2% of actual values.
- Discuss what the error in predicted hit rates means.

The last point is expanded upon in Section 4.4.2.

Graphing actual vs. predicted hit rates as a function of stack distance

Figure 4.4 is a graphical representation of the accuracy of our cache model over all the integer benchmarks. The cache in question is a 4KB direct-mapped cache. Obviously the integer benchmarks perform almost exactly as the theory predicts. Accesses with low stack distances seem to do slightly better than predicted. This small difference is actually quite important, as three-fourths of all accesses from the SpecInt suite have a stack distance below 10. (You may recall this from Figure 4.2.) None-the-less, the integer benchmarks match theory remarkably well.

[Figure 4.4: Stack distance hit-rates for SpecInt on a 4KB direct-mapped cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

On the other hand, Figure 4.5 shows that the same data for the floating-point benchmarks is significantly more noisy. While the general trend does track the predicted values (other than perhaps the tail end of the curve), it is nowhere near as close as the integer benchmarks. This huge difference in locality properties is due to the very nature of the programs. The floating-point benchmarks tend to be fairly small kernels of code working over large data sets. As such, the access patterns tend to repeat, which reduces the random nature of conflict. The integer programs tend to have larger kernels and do not generally operate on the same data sets over and over again.

Simply to show that the same general results hold for an associative cache, Figures 4.6 and 4.7 provide the same information for two-way associative caches.
The integer results remain very close to the prediction, while the floating-point values may be slightly less noisy than they were on the direct-mapped cache.

[Figure 4.5: Stack distance hit-rates for SpecFP on a 4KB direct-mapped cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

[Figure 4.6: Stack distance hit-rates for SpecInt on a 4KB two-way associative cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

[Figure 4.7: Stack distance hit-rates for SpecFP on a 4KB two-way associative cache. Measured versus predicted hit rate as a function of stack distance (0 to 500).]

4.2.1 A comparison of actual hit rates to predicted values

Where Section 4.2 illustrated the accuracy of the caching model by showing how well it predicted the hit rate at each stack distance over a set of benchmarks, here we measure how well the model predicts actual hit rates. The results are provided by benchmark, cache size, and cache structure. While general trends in accuracy are discussed, the primary goal of this section is to give additional validation for the cache model as well as to provide some insight into what errors should be expected.

Nineteen traces were run on 21 different cache configurations. Those configurations are the cross-product of three cache schemes (direct-mapped, direct-mapped hash, and 2-way set-associative) with seven cache sizes (1KB, 4KB, 8KB, 16KB, 32KB, 64KB and 128KB). All caches have a 32-byte line size. In order to summarize these 399 data points into a useful set of information, tables have been provided which show the average results when two of the axes (benchmark, cache scheme, cache size) are held fixed. The results provided are the average predicted hit rate, the average actual hit rate, the average of the signed errors, and the average absolute error. The floating-point and integer benchmarks are considered separately.

Benchmark | Predicted hit rate | Actual hit rate | Average signed error | Average absolute error
----------|--------------------|-----------------|----------------------|-----------------------
ammp      | 71.00%             | 71.68%          | -0.68%               | 0.68%
applu     | 82.40%             | 83.26%          | -0.86%               | 1.27%
art       | 57.49%             | 58.48%          | -0.99%               | 1.24%
equake    | 95.76%             | 97.21%          | -1.45%               | 1.45%
fma3d     | 96.63%             | 96.50%          | 0.13%                | 0.87%
galgel    | 81.41%             | 82.87%          | -1.47%               | 2.10%
lucas     | 81.43%             | 82.58%          | -1.15%               | 1.85%
mesa      | 94.24%             | 95.15%          | -0.91%               | 0.99%
mgrid     | 84.27%             | 80.31%          | 3.97%                | 4.90%
swim      | 76.35%             | 79.28%          | -2.93%               | 2.93%
wupwise   | 94.07%             | 94.94%          | -0.88%               | 1.00%
Average   | 83.19%             | 83.84%          | -0.66%               | 1.75%

Table 4.2: Hit rates and predicted hit rates for the SPECfp2000 benchmarks.

As can be seen from Table 4.2, the quality of prediction varies quite a bit depending on the benchmark. Three of the benchmarks have average absolute errors of 2% or more. Recall that if the actual hit rates are better than the predicted values, the conflict in the memory reference stream is lower than it would be if the conflict were purely random.
This lower-than-random conflict is generally thought to be a result of spatial locality and the fact that two lines sufficiently close together will not conflict with each other. As discussed at the start of this chapter, poorer hit rates than predicted are due to heavy conflict, usually attributable to striding. Notice that mgrid has the largest error and is one of only three benchmarks which does worse than predicted. This information shows that the data layout generated by the compiler for mgrid has significant room for improvement. More information about the detection of excessive conflict can be found in Section 4.4.2.

Benchmark | Predicted hit rate | Actual hit rate | Average signed error | Average absolute error
----------|--------------------|-----------------|----------------------|-----------------------
bzip2     | 94.80%             | 95.28%          | -0.47%               | 0.57%
crafty    | 88.21%             | 88.87%          | -0.66%               | 0.73%
gap       | 97.93%             | 98.42%          | -0.49%               | 0.49%
gcc       | 89.36%             | 89.96%          | -0.60%               | 0.60%
gzip      | 94.30%             | 95.25%          | -0.95%               | 0.95%
mcf       | 81.09%             | 82.36%          | -1.27%               | 1.27%
parser    | 91.61%             | 92.29%          | -0.68%               | 0.70%
vpr       | 89.21%             | 88.60%          | 0.61%                | 1.40%
Average   | 90.81%             | 91.38%          | -0.56%               | 0.84%

Table 4.3: Actual and predicted hit rates for the SPECint2000 benchmarks.

Examining Table 4.3, the error in hit rate for the integer benchmarks is significantly less than what was found in the floating-point benchmarks. Once again the predicted hit rates are generally a bit lower than reality, while only vpr suffers from more conflict than random.

Configuration  | Predicted | Actual | Average signed error | Average absolute error
---------------|-----------|--------|----------------------|-----------------------
Direct-mapped  | 82.52%    | 83.43% | -0.99%               | 2.00%
Hash           | 82.52%    | 83.14% | -0.61%               | 0.97%
Two-way        | 84.77%    | 84.96% | -0.19%               | 0.47%

Table 4.4: Prediction errors grouped by cache configuration for the SPECfp2000 benchmarks.

Configuration  | Predicted | Actual | Average signed error | Average absolute error
---------------|-----------|--------|----------------------|-----------------------
Direct-mapped  | 90.20%    | 90.65% | -0.45%               | 1.07%
Hash           | 90.20%    | 91.06% | -0.85%               | 0.97%
Two-way        | 92.04%    | 92.42% | -0.39%               | 0.47%

Table 4.5: Prediction errors grouped by cache configuration for the SPECint2000 benchmarks.

One way to think of the cache model is that it takes hit rates for a fully-associative cache and attempts to predict how less associative caches will behave. Unsurprisingly, therefore, Tables 4.4 and 4.5 show that the 2-way associative cache has significantly less error associated with it than the two non-associative caches. What is unusual is that hash caches are supposed to reduce the amount of conflict in a cache without reducing the value of page-level locality. However, on the floating-point benchmarks the hash cache actually harms performance. In any case, the model has the most problems predicting the behavior of the direct-mapped cache. While that error is not trivial, it is plain that the model tracks actual performance fairly well.

Cache Size | Predicted | Actual | Average signed error | Average absolute error
-----------|-----------|--------|----------------------|-----------------------
1k         | 73.35%    | 74.28% | -0.92%               | 3.22%
4k         | 84.03%    | 84.80% | -0.77%               | 2.15%
8k         | 86.94%    | 87.41% | -0.47%               | 1.25%
16k        | 88.90%    | 89.21% | -0.32%               | 0.81%
32k        | 90.09%    | 90.57% | -0.48%               | 0.60%
64k        | 90.75%    | 91.21% | -0.46%               | 0.55%
128k       | 91.23%    | 91.62% | -0.39%               | 0.56%

Table 4.6: Prediction errors grouped by cache size.

Finally, Table 4.6 combines the integer and floating-point results as they are impacted by cache size. Clearly the model is much better at predicting the hit rates of larger caches than smaller ones.

From this section a number of things can be concluded. First, the cache model is more accurate when modeling large, associative caches than it is when modeling smaller, direct-mapped caches. Secondly, and as noted in [40], the cache model only does a moderate job of actually predicting hit rates.
Lastly, and perhaps most importantly, the error in prediction is due to the assumption that conflict is random, and this is not entirely a bad thing. Later in this chapter these errors will be exploited in a novel way to gain additional understanding about caching and conflict.

4.3 Expansion of the model to non-standard caches

It is appropriate at this time to reflect on the stack model as discussed thus far. Recall that in Section 4.1 PHIT for a given distance was shown to be independent of the locality of the trace. In this section it is shown that this property does not always hold for non-standard caches. Two non-standard caches are examined: an LRU skew cache and a true-LRU victim cache. We show that the skew cache cannot be correctly modeled under the assumption that PHIT is independent of sd(d), while the true-LRU victim cache can be so modeled.

4.3.1 An analytic LRU skew-cache model

The skew cache was described in Section 2.2.4. An LRU skew cache is simply one where replacement is always to the least-recently used location of those in which the new line can be placed. Building such a cache can be difficult [60], but near approximations are possible [14].

As previously mentioned, a stack-distance based skew-cache model which is independent of sd(d) is not possible. The fundamental reason is that the model proposed in Section 4.1 relies upon the fact that only the number of unique accesses between successive references to the same cache line is important. That is, it did not matter if those intervening lines were accessed once or 100 times; either way the effect would be the same. With a skew cache, that is not the case. The reason is that if a line is evicted from the skew cache and then re-referenced, it might occupy a different location than it did the first time. This concept, which to our knowledge no one has ever noticed or even had the language to discuss, is best explained by way of an example.

Impact of sd(d) on PHIT

In order to make this example readable, some terminology is introduced. Assume a two-way associative skew cache. The two ways of the skew cache are named A and B. Line i in way A will be called A[i], with a similar notation used for B. Say a given block, Y, can be placed into either the ith line in A or the jth line in B. We use the terminology Y -> {A[i], B[j]} to represent that situation.

Assume that between two successive references to cache line X -> {A[1], B[1]} there are references to three other lines, L -> {A[1], B[2]}, M -> {A[2], B[2]}, and N -> {A[2], B[2]}, as shown in Figure 4.8. Assume that X is initially placed into block A[1]. For the moment, assume that M, N, and L are each referenced exactly once. Notice that if L is referenced after M and N have both been referenced, X will be evicted. Further, notice that if L is the first of the three lines (of L, M, and N) referenced, X will not be evicted. If L is the second of these three lines to be referenced, there is a 50% chance X will be evicted (if the previous reference was placed in B[2]). Thus, averaging over the six possible orderings of the references to these three lines, there is a 50% chance of X being evicted before it is re-referenced. In this exact situation, we observe that for the second reference to X, PHIT = 0.5 and its stack distance is three.
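That 50% figure can be checked by brute force. The sketch below enumerates the six orderings of L, M, and N, together with both resolutions of the tie that arises when the first of M or N finds its two candidate slots equally old; the slot names and mappings are the hypothetical ones from the example, and every slot other than A[1] is assumed to hold data older than X.

```python
from itertools import permutations

# Candidate slots for each block, taken from the example above.
CANDIDATES = {
    'L': [('A', 1), ('B', 2)],
    'M': [('A', 2), ('B', 2)],
    'N': [('A', 2), ('B', 2)],
}

def x_evicted(order, tie_to_b):
    """Simulate LRU skew placement for one ordering of L, M, N.
    tie_to_b resolves the tie when both candidate slots are equally old."""
    stamp = {('A', 1): 0}                 # X placed at time 0; absent slots are older
    for t, block in enumerate(order, start=1):
        s1, s2 = CANDIDATES[block]
        a1, a2 = stamp.get(s1, -1), stamp.get(s2, -1)
        if a1 == a2:                      # both untouched: break the tie
            target = s2 if tie_to_b else s1
        else:                             # replace the least recently used slot
            target = s1 if a1 < a2 else s2
        stamp[target] = t
    return stamp[('A', 1)] != 0           # True if X's slot was overwritten

cases = [(o, tb) for o in permutations('LMN') for tb in (False, True)]
p_evict = sum(x_evicted(o, tb) for o, tb in cases) / len(cases)
print(p_evict)   # 0.5, matching the PHIT = 0.5 observation above
```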
Now, assume that each of these three lines is referenced a large number of times before X is re-referenced. The stack distance for that second reference to X remains the same: three. However, as the number of repetitions goes to infinity, the probability of X being evicted goes to one. The reason for this is that all that has to happen for X to be evicted is that there is some reference to L after both M and N have been referenced. In that case L will be assigned to A[1], evicting X. In this situation, for the second reference to X, PHIT is approximately 0 although its stack distance remains three.

Figure 4.8: Example showing how locality interacts differently with a skew cache than a standard cache.

Thus, without attempting to generalize the effect, it is clear that repeating a reference to the same memory location may not change the stack distance of some other reference, but it can change PHIT. Notice that the distribution of sd(d) plays a role in this. For our example, if SD(2) = 0 there will be no repetitions of L, M, or N between the two references to X. If SD(2) = 0.999, the probability of repeats will be quite high. Therefore, sd(d) has an impact on PHIT.

Moving forward with the model

The previous example is enough to show that for a skew cache, sd(d) can have an impact on PHIT. While acknowledging that observation, we nonetheless create a model which assumes that PHIT is independent of sd(d). It is then shown that for real memory reference traces such an assumption introduces only minimal error. For purposes of this derivation, it is assumed that each intervening memory location is referenced exactly once.

Our goal is to find the probability that a memory access with stack distance D will achieve a hit. This is equivalent to determining the probability that a given memory reference, R, will remain in the cache after D unique accesses have occurred. For this derivation, assume a two-way associative skew cache where each way of the cache consists of X cache lines. Label the first way A and the second way B. Assume without loss of generality that the reference in question, R, is to a block placed in A.

Define a cache block to have been "touched" if one of the D references that occur after R is placed in that cache block. Define a to be the number of lines in bank A that have been touched since immediately before R was referenced. Define b similarly to a, but for bank B. Notice that by these definitions, when D = 0 it is the case that a = 1 and b = 0.

Treat the triple (a, b, D) as the state of a cache where R remains in the cache. That is, we are only keeping track of the states of the cache where R would still be found in the cache. For a given state, there are at most three states which could have preceded it. The previous state may have been (a - 1, b, D - 1), in which case the Dth reference mapped to a block in A that had not previously been touched. Similarly, the previous state may have been (a, b - 1, D - 1), in which case the most recent reference mapped to a block in B that had not previously been touched. Going from state (a, b, D - 1) to (a, b, D) occurs if the most recent reference mapped to blocks already touched in both A and B, but not to the block in A holding R.

The probability of moving from state (a, b, D - 1) to state (a, b, D) is ((a - 1)/X)(b/X), which is exactly the probability of mapping to a touched line in A that is not R multiplied by the chance of mapping to a touched line in B. Changing state from (a, b - 1, D - 1) to (a, b, D) requires that the most recent reference is to an untouched block in B and a touched block in A, or that it was to an untouched block in both A and B but B was the LRU block of the two (which is assumed to happen 50% of the time). The probability of the reference mapping to an untouched block in B is (X - (b - 1))/X, while the probability of it mapping to
a touched block in A (including the one holding R) is a/X. The probability of the reference mapping to blocks which are untouched in both A and B is ((X - (b - 1))/X)((X - a)/X). In Table 4.7 the probability of each transition is provided.

Transition from     Probability
(a, b, D - 1)       ((a - 1)/X)(b/X)
(a - 1, b, D - 1)   ((X - (a - 1))/X)(b/X) + (1/2)((X - (a - 1))/X)((X - b)/X)
(a, b - 1, D - 1)   ((X - (b - 1))/X)(a/X) + (1/2)((X - (b - 1))/X)((X - a)/X)

Table 4.7: Transition probabilities for a two-way associative skew cache.

Let the term P(a, b, D) be the probability of getting to state (a, b, D). Recall that reaching state (a, b, D) implies that R is still in the cache. Using Table 4.7, it is trivial to derive Equation 4.5.

P(a, b, D) = P(a, b, D - 1) ((a - 1)/X)(b/X)
           + P(a - 1, b, D - 1) [((X - (a - 1))/X)(b/X) + (1/2)((X - (a - 1))/X)((X - b)/X)]
           + P(a, b - 1, D - 1) [((X - (b - 1))/X)(a/X) + (1/2)((X - (b - 1))/X)((X - a)/X)]      (4.5)

The initial condition is that when D = 0, a = 1 and b = 0. So P(a, b, 0) = 1 if a = 1 and b = 0; otherwise P(a, b, 0) = 0. Now define P_HIT^skew(X, D) to be the probability that a skew cache with two banks of size X achieves a hit on an access with a stack distance of D.

P_HIT^skew(X, D) = sum over i = 0..X and j = 0..X of P(i, j, D)      (4.6)

As Equation 4.6 indicates, the probability of a hit occurring at a stack distance D is simply the probability of reaching a state (-, -, D), where the "-" indicates any value. Figure 4.9 shows the values of P_HIT^skew(64, D), that is, the chance of a hit occurring in a skew cache of 128 cache lines at a given stack distance. PHIT for a two-way associative standard cache of the same size is also displayed to provide some context. Notice that the skew cache is slightly "more associative" than the 2-way associative cache.

Figure 4.9: Predicted hit rate for a skew cache and a standard 2-way associative cache. Both are 128 lines in size.

A more real situation

As noted in Section 4.3.1, P_HIT^skew does depend upon sd(d). Using simulation, one can show the relationship between our predicted value of P_HIT^skew and how it is impacted by sd(d). For most realistic sd(d) distributions, our prediction is fairly close to reality. In Figure 4.10, PHIT is graphed for the analytic model as well as for two different sd(d) distributions: that of the integer Spec benchmarks (as seen in Figure 4.2) and a uniform (flat) distribution where the first 300 stack distances each have a 0.3% chance of occurring. It is worth emphasizing that these graphs are not of actual traces. Rather, they are traces which have been artificially generated to have the appropriate sd(d) distributions (see Section 4.4.1 for details). For a standard cache there would be no significant variation between these three lines.

Figure 4.10: A magnified look at the 'flaws' in the skew cache analytic model. (Curves: analytical prediction; SpecInt distribution; flat distribution.)
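Both the analytic curve of Figure 4.9 and the analytical-prediction line of Figure 4.10 come from evaluating Equations 4.5 and 4.6. A minimal dynamic-programming evaluation is sketched below; it assumes, as the derivation does, that each intervening block is referenced exactly once.

```python
def skew_hit_curve(X, max_d):
    """P_HIT^skew(X, D) for D = 0..max_d, via Equations 4.5 and 4.6.
    X is the number of lines in each of the two banks. P[a][b] holds the
    probability of being in state (a, b, D) with R still resident; any
    probability that 'leaks' corresponds to R having been evicted."""
    P = [[0.0] * (X + 1) for _ in range(X + 1)]
    P[1][0] = 1.0                       # initial state: a = 1, b = 0 at D = 0
    curve = [1.0]                       # a re-reference at distance 0 always hits
    for _ in range(max_d):
        nxt = [[0.0] * (X + 1) for _ in range(X + 1)]
        for a in range(1, X + 1):
            for b in range(0, X + 1):
                p = P[a][b] * ((a - 1) / X) * (b / X)
                if a >= 2:              # arrived from (a - 1, b, D - 1)
                    ua = (X - (a - 1)) / X
                    p += P[a - 1][b] * (ua * (b / X) + 0.5 * ua * ((X - b) / X))
                if b >= 1:              # arrived from (a, b - 1, D - 1)
                    ub = (X - (b - 1)) / X
                    p += P[a][b - 1] * (ub * (a / X) + 0.5 * ub * ((X - a) / X))
                nxt[a][b] = p
        P = nxt
        curve.append(sum(map(sum, P)))  # Equation 4.6
    return curve

# For example, skew_hit_curve(64, 500) gives the skew-cache line of Figure 4.9.
```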
Another issue with respect to the skew cache model involves the assumption that if an access maps to untouched blocks in A and B, it has a 50% chance of going to either of them. In fact, knowing that the first block, R, was placed into A tells us a little about A and B: at the time R was placed, there was at least one location in B newer than the line R replaced in A. This leads to a small bias. In Section 4.3.3 we show empirically that providing a 0.501 chance of the line being mapped to B in a 128-line skew cache significantly reduces this already very small effect.

4.3.2 An analytic LRU victim cache model

The victim cache, described in Section 2.2.3, is another type of non-standard cache. Briefly, it consists of two parts: a large main cache and a much smaller fully-associative component.¹ It is usually the case that the large main cache is direct-mapped to reduce access time. Lines evicted from the direct-mapped component are placed in the fully-associative component, which is managed like a separate cache.

¹It is more common to refer to the smaller fully-associative component as the victim cache. However, for reasons of convenience and clarity, the term 'victim cache' is used here to describe the two components together, functioning as a whole.

A 'true-LRU' victim cache is identical to the usual victim cache except that the fully-associative component is managed using global LRU information. That is, when a block gets evicted from the direct-mapped component, its "time stamp" is compared to those of the blocks in the fully-associative component. The LRU block is evicted from the victim cache. If the evicted line was in the fully-associative component, the line currently in the direct-mapped component is moved into that location. In any case the direct-mapped component is now unoccupied, leaving room for the block which caused the eviction. In the remainder of this section we provide an analytic model which describes how a true-LRU victim cache interacts with the locality of a reference stream.

Derivation of the model

To facilitate this derivation, the following terminology is introduced. The direct-mapped and fully-associative components are said to contain a total of S1 and S2 cache lines respectively. Let the direct-mapped component consist of cache lines labeled l1, l2, ... lS1. Further, if X is a cache block which maps to location li in the component, let this be stated as X is in li. As before, the goal is to compute the probability that a given cache block will still be in the cache after D unique accesses have occurred. The fundamental observation is that an LRU victim cache will evict a block, X, in cache line lx from the combined components if, and only if, the following two conditions are met:

- Some other block is accessed after X and is also mapped into lx.

- At least S2 other blocks have been accessed and then evicted from the direct-mapped component since X was last accessed.

The first condition must be true, otherwise X will still be in the direct-mapped component. The second must be true, otherwise X can be in the fully-associative component. Notice that the order in which these two conditions occur is not important; once both have occurred, X will be evicted from the victim cache.

Say that there are exactly D unique memory blocks accessed between two accesses to block X. In order to meet the first condition, at least one of those D blocks must be mapped into lx. Meeting the second condition requires that no more than D - S2 different cache lines be utilized by those D accesses. That is because only those lines accessed after X can contribute to the S2 blocks needed to evict X from the fully-associative component of the LRU victim cache.
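Before constructing the analytic expression, it is worth noting that this structure is easy to simulate directly. The sketch below is a minimal model of the true-LRU victim cache just described; the modulo index function and the treatment of a buffer hit as a swap back into the direct-mapped slot are illustrative assumptions rather than details taken from the text.

```python
class TrueLRUVictimCache:
    """Direct-mapped component of s1 lines plus a fully-associative buffer
    of s2 lines, managed with a single global LRU ordering."""

    def __init__(self, s1, s2):
        self.s1, self.s2 = s1, s2
        self.dm = {}        # set index -> (block, last-use time)
        self.fa = {}        # block -> last-use time (at most s2 entries)
        self.clock = 0

    def access(self, block):
        self.clock += 1
        idx = block % self.s1                     # assumed: modulo placement
        hit = (idx in self.dm and self.dm[idx][0] == block) or (block in self.fa)
        if block in self.fa:
            del self.fa[block]                    # the block returns to its DM slot
        if idx in self.dm and self.dm[idx][0] != block:
            victim, vstamp = self.dm[idx]         # line displaced from the DM component
            if len(self.fa) < self.s2:
                self.fa[victim] = vstamp
            else:
                lru = min(self.fa, key=self.fa.get)
                if self.fa[lru] < vstamp:         # a buffer line is the global LRU:
                    del self.fa[lru]              # evict it, keep the displaced line
                    self.fa[victim] = vstamp
                # otherwise the displaced line is itself the LRU and is dropped
        self.dm[idx] = (block, self.clock)
        return hit
```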
Given the above result, it is possible to construct a relationship which computes the probability of a given set of accesses touching no more than D - S2 different cache lines. The same assumption of "random conflict" used before is used once again.

The number of times an access to a given block repeats has no influence upon the probability of X being evicted. Consider the possibilities if one of the D blocks is accessed repeatedly. Label the block which is repeatedly accessed as R. Upon the repeated access, R is currently in one of three places: the direct-mapped component, the fully-associative component, or it has been evicted from the cache.

- If R is in the direct-mapped component, the only change is to update the access time of R. As it was already more recently used than X, this action has no effect on X.

- If R is in the fully-associative component, some other of the D accesses must have evicted it from the direct-mapped component. R will be placed in the direct-mapped component while the other block will either go into the fully-associative component or be evicted. Both of these blocks are more recently used than X, so the swapping has no impact on whether or not X will be evicted.

- If R has been evicted, and X is still in the LRU victim cache, it must be the case that X is in the direct-mapped component; otherwise X would have been evicted from the fully-associative component before R. At this point X will be evicted from the entire victim cache once an access occurs that uses lx. As R clearly does not map to lx (otherwise X could not be in the direct-mapped component), repeated accesses to R have no impact.

Thus, the number of times any of the D accesses repeat has no influence upon the probability of X being evicted.

The analytic model

Define P_HIT^victim(S1, S2, D) to be the probability that an LRU victim cache with a direct-mapped component of size S1 and a fully-associative component of size S2 will achieve a hit on an access with a stack distance of D. For fixed values of S1 and S2, let F(D, U) be the probability that at a stack distance of D exactly U different cache lines have been accessed, none of which is lx. For the same fixed values of S1 and S2, let E(D, U) be the probability that at a stack distance of D exactly U different cache lines have been accessed, one of which is lx. It then follows that:

P_HIT^victim(S1, S2, D) = sum over U = 0..D of F(D, U)  +  sum over U = D - S2 + 1..D of E(D, U)

Figure 4.16: Impact of associativity on cache filtering

Generation of artificial traces

Another application of the locality model is its ability to generate artificial traces using a simple Monte-Carlo technique. This was briefly described in Section 3.4.4, and is described in detail in a paper by Babaoglu [8]. The net effect is that it is trivial to generate a memory reference trace which has any stack-distance properties desired. Babaoglu's concern was with the lack of storage space needed to keep long traces. While this is still an issue, disk-drive space has been outpacing CPU power for many years, and the bound on trace-driven simulation is now mainly CPU power and the network bandwidth needed to move these large traces around. Further, these artificially generated traces are not perfect replicas of the traces they are replacing. For one thing, they have truly random conflict. On these traces a hash cache will, on average, perform exactly as well as a standard cache.
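A minimal version of that Monte-Carlo generator is sketched below. It assumes the stack-distance distribution is supplied as a mapping from distance to probability, with the leftover probability mass producing never-before-seen blocks (compulsory references); the exact bookkeeping Babaoglu uses may differ.

```python
import random

def generate_trace(sd_dist, length, seed=0):
    """Generate a synthetic trace whose LRU stack-distance distribution
    approximates sd_dist (distance -> probability). Distance d means the
    referenced block sits at depth d of the LRU stack (depth 0 = most
    recently used); unassigned probability produces new blocks."""
    rng = random.Random(seed)
    dists = sorted(sd_dist)
    weights = [sd_dist[d] for d in dists]
    reuse_mass = sum(weights)
    stack, next_block, trace = [], 0, []
    for _ in range(length):
        d = rng.choices(dists, weights)[0] if rng.random() < reuse_mass else None
        if d is None or d >= len(stack):
            block = next_block              # a compulsory (first-touch) reference
            next_block += 1
        else:
            block = stack.pop(d)            # re-reference the block at depth d
        stack.insert(0, block)              # move it to the top of the LRU stack
        trace.append(block)
    return trace
```

Because each reuse distance is drawn independently, any conflict in the generated trace really is random, which is exactly the limitation noted above.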
Even given the above limitations, these artificially generated traces have many interesting uses. For one, it is possible to generate a reference trace that has certain specific locality properties. This feature can be useful when, for example, you want to know how a change in the locality stream presented to the L1 cache will impact the reference stream seen by the L2 cache. Mathematically analyzing this problem is very difficult, and searching for a real trace that has the locality properties in question will be time consuming and may not yield a match.

Another use of an artificially generated trace is as a point of comparison to a real trace. For example, Hallnor and Reinhardt [36] have proposed a replacement scheme which relies on the idea that past heavy utilization of a cache line will likely result in future heavy utilization. For a trace generated by the Monte-Carlo method, the degree of past utilization has nothing to do with the future utilization. The characteristics of the actual trace can then be compared with the characteristics of the artificial trace. We define a streak of memory accesses to be the list of accesses in a given trace which fit in a given cache line. The length of a given streak is defined to be the number of memory accesses in that streak. Figure 4.17 provides a graphical representation of what percent of memory accesses are in a streak of length N or more, for both an actual trace (the first 10 million accesses in the gcc trace) and a similarly sized artificial trace generated using the same stack distances as the original trace. As can be seen, accesses in the actual trace are much more likely to be part of a long streak than are those of the artificial trace. Certainly an empirical model of expected streak length could be generated and used instead of the artificial trace, though doing so seems fairly difficult. Instead we showed that Hallnor and Reinhardt's assumption about heavy utilization being predictive of future utilization is quite accurate for this trace.

Figure 4.17: Percent of accesses in a streak of size N or greater. (Curves: real trace; artificial trace.)

Further, artificial traces can often be used as a point of comparison when there is a desire to separate out the impact of locality from other trace characteristics. For example, they can be used to eliminate the impact of hashing (and thus the reduction in non-random conflict) while keeping the locality of a given reference stream. This would allow for a measurement of the reorganization ability of a skew cache [60].

4.4.2 Applications of the cache model

The cache model can also be used to gain insight into the behavior of caches and reference streams. In this section some of those insights are explored.

Associativity vs. cache size

The inside cover of the most widely used computer architecture text provides the following under "Rules of Thumb": "2:1 Cache Rule: The miss rate of a direct-mapped cache of size N is about the same as a two-way set-associative cache of size N/2." Figure 4.18 shows the expected hit rates for N = 128.
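The two curves in Figure 4.18, and the "areas" discussed below, can be reproduced with the set-associative hit-probability model used throughout this chapter. The sketch assumes the standard binomial form of that model (a reference at stack distance D hits if fewer than k of the D intervening blocks fall into its set); it is consistent with, but not copied from, Section 4.1.2.

```python
from math import comb

def p_hit(D, sets, ways):
    """Probability that a reference with stack distance D hits in a
    'ways'-way set-associative cache with 'sets' sets, assuming the D
    intervening blocks map to sets uniformly at random."""
    q = 1.0 / sets
    return sum(comb(D, i) * q**i * (1 - q)**(D - i) for i in range(ways))

# Direct-mapped, 128 lines vs. two-way associative, 64 lines (Figure 4.18).
dm = [p_hit(d, 128, 1) for d in range(2000)]
tw = [p_hit(d, 32, 2) for d in range(2000)]

two_way_edge = sum(t - m for m, t in zip(dm, tw) if t > m)   # roughly 0.6
dm_edge      = sum(m - t for m, t in zip(dm, tw) if m > t)   # roughly 64.6
```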
If the rule of thumb is true and one ignores excessive conflict (and, as was seen in Section 4.2, there is actually less conflict than random for most traces, so this is very reasonable), it must be that there are so many memory references in the range of 1 to 23 that the small advantage the two-way associative cache has in that range is enough to compensate for the huge difference in hit rates after stack distance 23. Figure 4.19 magnifies this portion of the graph. The "area" of that small region is slightly larger than 0.6, while the area between the two curves from stack distance 24 onwards is slightly more than 64.6.

Figure 4.18: Impact of associativity on cache filtering. (Curves: direct-mapped, 128 lines; 2-way, 64 lines.)

None of the above analysis should be too surprising; after all, Figure 4.2 showed that the vast majority of accesses have a very low stack distance. What is interesting is that the 2:1 Cache Rule is really a statement about the locality of most reference streams (a large percentage of memory references have very low stack distances) rather than about associativity or cache size. The point of this discussion is not the validity of the rule of thumb. Rather, it is that using our model of cache behavior the underpinnings of the rule of thumb can be understood.

Figure 4.19: Impact of associativity on cache filtering. (Curves: direct-mapped, 128 lines; 2-way, 64 lines.)

For example, consider the results from Section 4.4.1 on cache filtering. It was shown that there are very few low-locality accesses in an L2 reference stream. Combining this fact with Figure 4.18 would lead us to believe that the 2:1 Cache Rule is unlikely to apply to L2 caches; the needed locality just is not there.

Measuring conflict in a reference stream

One important issue in caching is the measurement of conflict. The caching models described in this chapter assume that conflict is random, knowing full well that spatial locality will cause the conflict to be less than random while striding and related issues will cause conflict which is greater than random. In this section the issue of quantifying conflict is addressed.

First, the fact that a hit rate can be predicted assuming random conflict allows a trivial measure of the non-random conflict found in the trace. Clearly the difference between the actual hit rate and the predicted hit rate is due to non-random conflict. If the predicted hit rate is greater than the measured hit rate, there is clearly some source of excessive conflict due to the data layout. The problem is that sources of excessive conflict can be masked by sources of scant conflict, so that even if the predicted hit rate exactly matches the actual hit rate there is still quite a bit that could be achieved by changing the data layout. Our assumption is that the sources of excessive conflict involve striding or other heavy use of a single set, while scant conflict is due to a high degree of page-level spatial locality. The actual hit rate reflects the impact of both excessive and scant non-random conflict, while the predicted hit rate assumes that neither type of non-random conflict exists.
As mentioned in Chapter 2, a "one's complement" cache (a rather complex hash cache) has the effect of reducing many of the sources of excessive conflict while having very little impact on the scant conflict caused by page-level locality [81]. While it is not exactly true that a one's complement cache removes all of the bad while leaving the good, it does come fairly close. A stride can still cause excessive conflict in a one's complement cache (though only one whose length is not a power of two), and two highly utilized variables could now map to the same set where they did not previously. Thus, while a one's complement cache does not perfectly keep the sources of scant conflict while removing the sources of excessive conflict, it does the best of any scheme of which we are aware.

The three hit rates (direct-mapped, one's complement, and predicted) can be used to provide a rough quantification of conflict. Scant conflict can be measured as the difference between the hit rate of the one's complement cache (assumed to contain only sources of scant conflict) and the predicted hit rate. Excessive conflict is measured as the difference between the hit rate of the one's complement cache and the actual hit rate achieved by a direct-mapped cache. Table 4.8 shows the results over the Spec benchmarks. Notice that the excessive conflict is occasionally negative, which is likely due to the random chance that two highly used variables which did not conflict in the direct-mapped cache do conflict in the one's complement cache. More rarely the scant conflict is negative. This result is due to the one's complement cache not doing a perfect job of removing excessive conflict.

benchmark   direct-mapped   one's complement   predicted   excessive conflict   scant conflict
ammp        92.28%          91.80%             90.07%      -0.48%               1.73%
applu       79.27%          79.33%             76.09%       0.05%               3.24%
apsi        75.00%          75.00%             75.01%       0.00%              -0.01%
art         77.04%          77.33%             75.49%       0.30%               1.84%
bzip2       95.20%          95.18%             95.19%      -0.02%               0.00%
crafty      82.17%          83.17%             83.05%       1.00%               0.12%
eon         90.77%          92.59%             90.10%       1.83%               2.49%
equake      94.46%          92.21%             91.99%      -2.26%               0.22%
fma3d       95.40%          95.87%             95.25%       0.47%               0.62%
galgel      98.41%          98.46%             98.28%       0.05%               0.18%
gap         90.83%          90.83%             90.83%       0.00%               0.00%
gcc         89.28%          88.58%             87.22%      -0.70%               1.35%
gzip        95.14%          95.13%             95.13%      -0.02%               0.00%
lucas       97.57%          97.54%             93.29%      -0.03%               4.25%
mcf         90.83%          90.83%             90.83%       0.00%               0.00%
mesa        94.07%          97.21%             93.27%       3.14%               3.94%
mgrid       69.57%          75.44%             75.91%       5.87%              -0.47%
parser      94.10%          94.08%             93.14%      -0.02%               0.94%
perlbmk     85.79%          85.85%             85.65%       0.06%               0.19%
swim        96.81%          96.67%             92.98%      -0.14%               3.68%
twolf       92.79%          95.25%             94.04%       2.47%               1.21%
vortex      92.92%          92.90%             91.99%      -0.01%               0.92%
vpr         98.26%          97.48%             95.55%      -0.78%               1.93%
wupwise     97.31%          97.32%             96.04%       0.00%               1.28%
Average     -               -                  -            0.45%               1.24%

Table 4.8: Measurement of excessive conflict

While the above measures are not perfect, we know of no previous work which can determine if a low hit rate is due to poor data layout as opposed to a low degree of locality.³ Looking again at the data in Table 4.8, it is obvious that eon, mesa, mgrid and twolf have significant room for improvement in their data layout. For a compiler writer this information can provide focus to a study of what is wrong with the data layout, and could even be used as feedback in a profiling tool.

³The 'three C's model' of caching [23, 41], discussed in Chapter 2, does make a rough attempt to distinguish these two concepts, but it is so coarse in its granularity as to be nearly worthless as a quantitative measure [13].
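The arithmetic behind Table 4.8 is simple enough to state directly; the sketch below encodes the two differences just described and checks them against the mgrid row of the table.

```python
def conflict_components(direct_mapped, ones_complement, predicted):
    """Split non-random conflict into 'excessive' and 'scant' parts from
    three hit rates (all in percent), as described above."""
    excessive = ones_complement - direct_mapped   # conflict a hash mapping removes
    scant = ones_complement - predicted           # benefit of page-level locality
    return excessive, scant

# mgrid, from Table 4.8: direct-mapped 69.57, one's complement 75.44, predicted 75.91
print(conflict_components(69.57, 75.44, 75.91))   # about (5.87, -0.47)
```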
For a cache designer, the scant conflict values provide insight into the importance of page-level locality, while the excessive conflict values provide insight into the importance of using a hash cache or other technique to remove conflict.

4.4.3 Examining non-standard caches

The last area of application considered is understanding non-standard caches. The ability to see how a cache interacts with locality can make comparisons and analysis much easier.

Skew cache replacement policies

In his various papers on skew caches, Seznec has proposed a number of replacement schemes, including a time-stamp LRU scheme and a single bit per set [60-62, 73]. In this section those two schemes are compared to a novel scheme which uses partial time stamps. This novel scheme is evaluated by using the cache modeling techniques presented in this chapter as well as by looking at traditional hit-rate numbers.

Figure 4.20 demonstrates how the skew-cache performance is impacted by changing from the single-bit replacement scheme to a true-LRU scheme under the 100% miss-rate model described in Section 4.3.3. Clearly the LRU scheme has significantly better performance. The problem with the LRU scheme is that it has fairly high overhead. In a skew cache, unlike a standard cache, any given line might need to be compared to any of the lines in the other way of the cache. To the best of this author's knowledge, no one has formally proposed a method to perform true-LRU replacement in a skew cache. In a conversation with Seznec in 1997, he indicated that the only way his group saw to perform this replacement scheme was using a time stamp. This would be implemented with a counter that is incremented upon every cache access. The line accessed, whether a hit or a miss, would have the current value of that counter stored along with the tag for that data (in the same way LRU information is usually stored). When a replacement decision is required, the 'time stamps' stored with the two potential candidates for eviction would be compared and the line with the smaller value would be replaced. When the counter overflows and returns to zero, the cache will either need to be purged or the time stamp values will somehow need to be updated. For a large enough counter this overflow will be very rare (say a few times a minute for a 32-bit counter) and the performance impact small.

Figure 4.20: Illustration of the impact of the replacement scheme on performance for skew caches. The direct-mapped cache is provided as a reference and all caches consist of 128 lines.

Another option is to use a counter that counts up to a much smaller value but handles the overflow of the counter correctly. In order to get LRU right the vast majority of the time (arbitrarily say 199 times out of 200), the scheme needs to be able to handle about 4x misses, where x is equal to the number of lines in the cache.⁴ As an example, consider a cache with 128 lines (4KB) that has an expected hit rate of 90%. Every tenth access will cause a miss. Thus, the counter needs to be able to count up to 10 * 4 * 128, which requires at least a 13-bit counter. The 13-bit time stamp would require 208 bytes of additional storage, which increases the amount of non-data storage needed by nearly 50%.
(For a 32-bit processor, 26 bits would be needed for each tag in addition to state bits consisting of at least a dirty bit and a valid bit.) As the cache gets larger the relative overhead goes up considerably, since as the cache size doubles one more bit is required for LRU information while one less is required for the tag. In any case, the LRU information is well less than 10% of the total size of the cache for any reasonably sized cache.

No matter how large the cache, finding which cache line holds the LRU block requires that the time stamps associated with both cache lines be subtracted from the current counter and then a comparison performed.⁵ For the 128-line example this would be a 13-bit subtractor followed by a 13-bit comparator. That should likely be fairly similar in speed to the 32-bit comparator proposed above. Including routing to and from the comparator, the replacement decision may take more than one cycle, which should not pose a significant problem as the cache will be waiting for the block to be fetched from the L2 cache (or memory) in any case. The complexity of a multi-cycle replacement policy would be minor but potentially troublesome.

⁴The 4x value comes from empirical results. It can also be deduced by graphing the 100% miss-rate model for a skew cache and noting that only about 0.5% of the accesses have a stack distance that is greater than 4x.

⁵While there may appear to be problems with overflow in the subtraction, as the current counter value could be smaller than either of the cache-line time stamps, in reality this is not a problem: the overflow can be ignored and the correct result will occur when using a standard subtractor.

Reducing the number of bits required for the time stamps would have many obvious benefits; the two most significant would be a reduction in the size of the tags and a reduction in the time required to make the replacement decision. The easiest way to perform that reduction would be to store only some of the bits of the time-stamp counter in each of the tags. Experimentation using artificial traces showed that, over the Spec benchmarks, storing only bits 6 through 10 of the time stamp (where 0 is the least significant bit) provided performance that was fairly close to that of a 32-bit time stamp. More time-consuming simulation over real traces of length 10 million confirmed this. Table 4.9 shows the results over these Spec traces. While the improvement for either scheme is fairly small⁶, the much shorter 5-bit time stamps get 90% of the improvement achieved by the full 32-bit time stamps. Table 4.9 also shows how well this 5-bit scheme compares to the 32-bit time stamp and the 1-bit replacement scheme over all the Spec benchmarks. Clearly the 5-bit scheme does nearly as well on average as the 32-bit time stamp but suffers somewhat in the extreme cases.

Replacement scheme   Average hit rate   Improvement over 1-bit replacement
                                        Average    Min       Max
32-bit time stamp    92.72%             0.21%      0.14%     1.77%
5-bit time stamp     92.70%             0.19%     -0.31%     0.99%

Table 4.9: A comparison of replacement schemes for skew caches. Hit rate improvement is in terms of absolute improvement over the one-bit replacement scheme.

Victim caches

In Section 4.3.2 an analytic model for "true-LRU" victim caches was presented. Much like skew caches, time stamps or the equivalent are required to maintain this perfect LRU ordering between the two structures. The other, more standard, scheme is to have the two structures maintain LRU information separately. We call these schemes "true-LRU" and "separate-LRU" respectively.

⁶Seznec saw much more significant differences on the memory reference traces he used. This difference is very much a function of the compiler.
Figure 4.21 shows that the difference between the two schemes is quite large when using the 100% miss-rate model. However, the separate scheme performs much better when using real data. The reason is that an n-entry fully-associative component of the cache contains the last n evictions. Under the 100% miss-rate model these evictions are simply the last n accesses, resulting in the stack distance vs. hit rate graph being the same as that of the base cache shifted to the right by n.

Figure 4.21: Illustration of the impact of the replacement scheme on performance for victim caches. (Curves: victim, separate-LRU replacement; victim, true-LRU replacement.)

Under a reference stream with a greater degree of locality, the separate-LRU replacement policy improves significantly while the true-LRU policy sees no improvement at all. Figure 4.22 shows the improvement that occurs when a higher-locality reference stream is used, in this case a memory reference stream that has the same locality as gcc. Recall that the true-LRU replacement policy is unaffected by the input locality. That same graph shows the impact of using the same 5-bit time stamp replacement policy discussed in Section 4.4.3 for the skew cache. As was done in the skew cache section, Table 4.10 shows the impact of the 5-bit time stamp and the 32-bit time stamps when compared to the more standard replacement policy.

Replacement scheme   Average hit rate   Improvement over separate-LRU
                                        Average    Min       Max
32-bit time stamp    91.95%             0.11%     -0.17%     0.83%
5-bit time stamp     91.91%             0.08%     -0.41%     0.74%

Table 4.10: A comparison of replacement schemes for victim caches. Hit rate improvement is in terms of absolute improvement.

Figure 4.22: Illustration of the impact of locality on the hit rate when using a 'separate-LRU' replacement scheme. This graph also shows the impact of using 5-bit time stamps.

4.5 Future work

There is certainly a large amount of work that can and is being done to expand on the work in this section.

- The ability to identify excessive conflict can be very valuable. It can be used to evaluate the quality of the data layout of a compiler. This can help in evaluating compilers in addition to being useful as a profiling technique. Certainly some reference traces, like our mgrid trace, had excessive conflict. Informing the compiler of this excessive conflict so that it can reorder the data layout could be a source of significant performance improvement. Further, comparing the conflict in these reference traces to that found in other work [13], it is obvious that certain compilers simply do a better job laying out data. This can be a helpful piece of information to a compiler writer. Preliminary work has started on this, comparing the gcc, MIRV, and lcc compilers. We are currently looking to get the optimized SPEC binaries compiled by the Intel Reference Compiler.

- Finding an analytic model for multi-level caches has been an insurmountable hurdle.
Modeling the interaction of locality and the cache characteristics is extremely complex. While there may be little real value to such a model, this work seems somewhat incomplete without it.

- A "better than LRU" replacement policy may be possible using information about the locality of the memory reference stream. While this is almost certainly not realistic to implement in hardware, it may be able to provide something of an upper bound on possible performance under different assumptions.

- Spatial locality is currently only addressed with respect to a single cache line size. Providing a means of understanding spatial locality would be useful when constructing caches that use more than one line size.

- The primary use of the work in this chapter is as a tool to help understand caches and how they interact with locality. The next step in that direction is to make the tools widely available and otherwise be evangelical about the techniques found here. To that end a website is being built as part of an NSF grant. It will provide traces, simulators, graphs, and data to help get the ideas out into the world.

4.6 Conclusion

Every student of computer architecture knows that caches perform well because of spatial and temporal locality. However, few attempts have been made at quantifying this fairly abstract concept of locality. In this chapter a measurement of temporal locality was proposed, with spatial locality being implicitly dealt with through cache line size. This quantification of locality leads to a much fuller understanding than previous models or rules of thumb have provided.

In this chapter, an analytic model of how standard caches interact with locality was presented. While very similar models had been proposed in the past by Rau, Smith, Hill and others, we do not know of any attempt to use these models as a method of understanding locality. In addition, novel analytic models of two non-standard caches, LRU skew and true-LRU victim caches, were derived. Next, the 100% miss-rate model of caching was introduced, which makes it possible to find how locality and caching interact using an empirical rather than an analytical model. This 100% miss-rate model allows any caching structure to be modeled. That empirical method was also used to help verify that the analytical models were correct. Other than where noted, this material has appeared previously only in our own work.

While the primary application of these models is directed toward the understanding of locality and caching, a number of applications were also presented. These applications of the model include being able to measure non-random conflict as well as the ability to generate artificial traces which can be tuned to have whatever locality properties are desired. It is our hope that the novel insight and applications produced in this work prove valuable to those who are involved in cache design.

Chapter 5

Optimal replacement is NP-hard for non-standard caches.

When examining a new cache structure or replacement policy, it is often helpful to use the optimal policy as a baseline. In this chapter it is shown that it is NP-hard to find the optimal schedule for any but the simplest of caching schemes. This is proven by a reduction from 3-Occ-Max-2SAT, a known NP-hard problem [12]. It is also shown that there is no polynomial-time algorithm which is guaranteed to find a "good" approximation to this problem unless P = NP.
These results apply to cache structures that include a victim cache or an assist cache, as well as skew caches and nearly all "multi-lateral" caches [70]. This result also applies to the optimal scheduling of many multi-level cache structures.

5.1 Introduction and Background

One of the greatest challenges of modern computer architecture is the disparity between processor and memory speeds. This disparity has resulted in attempts to eke the last bit of performance out of the entire memory hierarchy, especially caches. As discussed in Chapter 2, victim caches [45], assist caches [17], skew caches [60] and "smart" caches [31, 44, 70, 74, 78] have all been proposed, and some have also been implemented. However, a cache architect's toolbox has not kept up with these changes, making these caches more difficult to evaluate, understand and improve. In this chapter one of those tools, the optimal replacement algorithm, is examined. It is shown that for many caching structures, it is a tool cache designers and researchers will have to live without.

In 1966 Belady published a heavily-cited paper describing the optimal replacement policy for a cache [11]. Other, faster algorithms have been developed which also achieve the minimal miss rate [49, 67]. As these algorithms perform the same task and generate equivalent results, we refer to this class of optimal replacement algorithms as a single algorithm named MIN. MIN is important because it is used in comparative architectural studies. As was discussed in Chapter 4, MIN has been used in many performance studies [15, 68, 70, 73, 74, 78] and as part of a caching algorithm [78].

MIN is only applicable to "standard caches" such as direct-mapped, set-associative, and fully-associative caches. For many other, more complex, caching schemes the algorithm fails. In those caches it not only matters which block is evicted but also where the new block is placed, something MIN does not address. However, until now it has not been clear whether a computationally tractable optimal algorithm could be found for these more complex caches. The search for a tractable algorithm for optimal placement/replacement in a non-standard cache has existed for a number of years. Most recently, Tam et al. [70] published a paper where they could not find a computationally tractable algorithm for optimal placement for their cache design. Instead they made use of a tractable, but sub-optimal, offline algorithm to make the desired comparisons.

5.1.1 Companion caches

Every cache we are aware of has two major components: its structure and its replacement policy. The structure of a cache describes the physical properties of the cache (such as cache line size, number of cache lines, etc.) as well as the legal locations for a given block in the cache. The replacement policy describes when a block is placed into the cache, where that block is placed, and which block, if any, is removed to make room for it. For example, a standard "set-associative cache" uses a set-associative structure and an LRU replacement policy.

The term Companion Cache Structure (CCS) is now introduced. A CCS consists of two components: a set-associative or direct-mapped cache and a fully-associative cache. We refer to the direct-mapped/set-associative component as the "main cache" and the fully-associative component as the "companion buffer." Both components must be able to store any arbitrary cache block.
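For what follows, the only structural fact that matters is where a block is allowed to reside: in the one set of the main cache determined by its address, or in any slot of the companion buffer. The sketch below records just that placement rule; the index function and slot naming are illustrative assumptions rather than part of the definition.

```python
class CompanionCacheStructure:
    """A CCS: a set-associative (or direct-mapped) main cache plus a small
    fully-associative companion buffer. Either component may hold any block;
    the main cache restricts a block to a single set."""

    def __init__(self, num_sets, ways, buffer_size):
        self.num_sets, self.ways, self.buffer_size = num_sets, ways, buffer_size

    def legal_locations(self, block):
        """All slots in which 'block' may be stored: its set in the main
        cache, or any slot of the companion buffer."""
        s = block % self.num_sets                      # illustrative index function
        main = [('main', s, w) for w in range(self.ways)]
        buf = [('buffer', i) for i in range(self.buffer_size)]
        return main + buf

# A victim- or assist-cache style structure: direct-mapped main cache plus
# a small fully-associative buffer.
ccs = CompanionCacheStructure(num_sets=128, ways=1, buffer_size=4)
```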
A degenerate CCS is one in which the main cache is fully-associative or the companion buffer is of size zero. In the first case the CCS reduces to a fully-associative cache. In the second case the CCS is just a standard direct-mapped or set-associative cache.

Many of the non-standard caching schemes discussed in Chapter 2 have a CCS underlying them. Cache structures which include a victim cache or an assist cache with its associated main cache are exactly CCSs. Many of the smart caches discussed in Chapter 2 are also CCSs. In addition, skew caches and certain multi-level caches can also be thought of as CCSs. The latter two cases will be addressed in Theorems 5.3.7 and 5.3.8 respectively.

In this chapter it is proven that optimally scheduling a non-degenerate CCS is NP-hard with respect to the number of sets in the main cache. We assume that there are no placement restrictions other than those that are part of the CCS. For example, the victim caching scheme, as originally defined [45], consisted of both a structure (a direct-mapped cache and the victim cache itself) as well as a replacement policy.

5.1.2 Related terms

While CCSs encompass a large number of useful cache organizations, it is important to put CCSs in the context of other terms and concepts. Perhaps the most closely related structure to the CCS is the multi-lateral cache [70]. A multi-lateral cache consists of multiple cache structures which are conceptually on the same level of the memory hierarchy and which can store the same data. In general, multi-lateral caches contain an underlying CCS, although it is not always obvious.

In a similar way, a multi-level cache can also have an underlying CCS. Suppose the L1 cache is fully-associative and the L2 cache is direct-mapped. While these two caches may have very different access times, they can be thought of as a single CCS. Similarly, imagine that rather than being fully-associative, the L1 cache were instead set-associative. As explained in Theorem 5.3.8, this can be thought of as scheduling a number of smaller CCSs. This relies upon having a caching scheme where the L2 is not required to be inclusive of the L1 [9].

Two other terms which are important to the discussion are bypassing and internal reorganization. Standard caching schemes require that each block be placed in the cache when it is referenced. If this requirement is not in force, the cache is said to allow bypassing. Notice that disallowing bypassing places a restriction on the optimal algorithm and cannot improve the miss rate. Another variation is to allow a block to move from location to location without evicting it. We call this reorganization. Many modern caching schemes allow for this to some extent; in fact it is at the heart of the replacement schemes for both victim and assist caches as well as certain hash-rehash cache variants.¹ Notice that allowing reorganization cannot increase the miss rate of an optimal algorithm. Further note that reorganization does not affect the miss rate in a standard cache.

¹See Chapter 2 for details about these cache types.

5.1.3 Chapter outline

In Section 5.2 it is formally proven that the problem of managing a CCS in order to minimize the number of misses is NP-hard for a very specific instance. That proof is used as a starting point to generalize and expand the results. The specific instance used is a CCS where the companion buffer is of size one, the main cache is direct-mapped, bypassing is allowed, and reorganization is not.
This result is proven using a reduction from a known NP-hard problem, 3-Occ-Max-2SAT [12]. The expanded results, found in Section 5.3, include:

- The problem is still NP-hard if the companion buffer is of a size greater than one and/or the main cache is set-associative. (Theorems 5.3.1 and 5.3.2)

- The problem is NP-hard whether or not bypassing is allowed. (Theorem 5.3.3)

- Allowing the cache to freely reorganize does not bring the problem into the class P. (Theorem 5.3.4)

- A "good" approximation of OPT for a CCS cache is also NP-hard. (Theorems 5.3.5 and 5.3.6)

- Skew and certain multi-level caches are NP-hard to schedule optimally. (Theorems 5.3.7 and 5.3.8)

The net result of all of these proofs is that using optimal scheduling to evaluate replacement algorithms or cache structural changes is not viable for any but the simplest cache structures. This result is important so that cache researchers and designers do not "spin their wheels" looking for a polynomial-time optimal replacement algorithm. If their cache structure has an underlying CCS, they will have to live without that evaluation tool.

5.1.4 Problem definition and goal

Now the problem of offline cache scheduling is described more formally. An ordered list of memory references is sent to the cache. Each reference is to a particular block, and each block can be placed in some subset of the cache. In the offline environment it is known exactly in which order the references will occur. The goal is to minimize the number of misses which occur. A given instance of a caching problem consists of the ordered list of references and a mapping which describes where each reference can be placed in the cache.

5.2 Proofs

In this section the results are proven in three parts, each corresponding to a subsection. In subsection 5.2.1 the relationship between interval scheduling and cache scheduling is described. A formal proof is provided in subsection 5.2.2 for a very restricted problem: when the main cache is direct-mapped, the companion buffer is of size one, no internal reorganization is allowed, and bypassing is allowed. We call this restricted case the Fundamental Companion Cache Scheduling (FCCS) problem. In Section 5.3 the FCCS proof is used as the foundation for several broader conclusions. These results include showing that the general CCS problem is NP-hard and APX-complete.² Also, specific bounds are given for how well an algorithm can approximate the FCCS problem.

²A problem is APX-complete if it can be approximated within some bound in polynomial time but cannot be approximated to within some arbitrary constant in polynomial time.

The FCCS problem is formally defined as follows.

Definition 5.2.1. Fundamental Companion Cache Scheduling (FCCS):
INSTANCE: A direct-mapped cache, D, consisting of a set of locations D1, ..., Dn; a companion cache consisting of a single location C1; a set of memory items (cache lines) M; a mapping g : M -> D indicating in which location in the direct-mapped cache each memory item can be stored; a request sequence s over M; and an integer K.
QUESTION: Each item of sequence s is selected in order. If that item is currently found in D or C1, it is counted as a hit. The memory item is then either placed into the location in D indicated by g, placed into C1, or not placed in the cache at all. Can the memory items be legally placed in the cache such that at least K requests in the sequence s are hits?
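Note that a proposed schedule for Definition 5.2.1 is easy to check: walk the request sequence, look each item up in D and C1, apply the chosen placement, and count hits. The sketch below is a minimal verifier of that kind (the encoding of decisions as 'main', 'companion', or 'bypass' is an illustrative assumption); the fact that such a check runs in time linear in the sequence length is what later puts FCCS in NP.

```python
def verify_fccs_schedule(g, requests, decisions, K):
    """Check whether a proposed FCCS placement schedule achieves >= K hits.
    g:         dict mapping each memory item to its single legal slot in D
    requests:  the request sequence s
    decisions: 'main', 'companion', or 'bypass' for each request
    """
    D = {}        # slot -> item currently stored there
    C1 = None     # contents of the single companion location
    hits = 0
    for item, where in zip(requests, decisions):
        if D.get(g[item]) == item or C1 == item:
            hits += 1
        # keep at most one copy of the item in the cache before re-placing it
        if C1 == item:
            C1 = None
        if D.get(g[item]) == item:
            del D[g[item]]
        if where == 'main':
            D[g[item]] = item       # evicts the slot's previous occupant
        elif where == 'companion':
            C1 = item               # evicts the companion's previous occupant
        # 'bypass': the item is not placed in the cache at all
    return hits >= K
```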
5.2.1 Interval scheduling and caching

First, the relationship between cache scheduling and interval scheduling is described. Initially assume that the cache allows bypassing and does not allow internal reorganization (as defined above). When bypassing is allowed, it is clear that an item need only be put in the cache if it will remain in the cache until its next access. With this in mind, the cache scheduling problem on a cache of size k can be described as a restricted interval scheduling problem on k parallel machines.

In short, the interval scheduling problem is the following. There is a set of parallel machines and a set of intervals (jobs) that have a fixed starting time and a fixed ending time. A machine can only work on one interval at a time. An interval must be worked on continuously from its start time to its end time to be successfully processed. Each interval can only be processed by a given subset of the machines. The goal is to maximize the number of intervals processed.

The interval scheduling problem relates to caching as follows. The cache locations correspond to machines. Each memory access defines the start time of an interval. The end time of that interval is defined by the next access to that same memory item. This relationship is illustrated in Figure 5.1. Notice that the last access to a given memory item does not start a new interval. Also, notice that each interval corresponds to a possible cache hit (that is, the access is not a compulsory miss [39]). The subset of machines that can process an interval is determined by the set of cache locations in which the corresponding memory item can be placed. As stated earlier, when bypassing is allowed, placing an item in the cache is only profitable if the item will still be in the cache the next time it is accessed. Likewise, scheduling an interval on a machine is only profitable if the machine is not otherwise busy during that interval.

Figure 5.1: Relation between memory accesses and intervals. (Accesses shown: 101 102 101 103 102 103 101 103 102 101.)

If intervals can be placed on any machine, then interval scheduling is solvable in polynomial time [30]. In the restricted interval scheduling problem, each interval can only be placed on a subset of the machines. If the machines can be partitioned into subsets such that if an interval can be scheduled on a machine in one subset, it can be scheduled on all machines in that subset, and it cannot be scheduled on any machine not in that subset, the problem corresponds to a set-associative cache and is still solvable in polynomial time. However, the general case of restricted interval scheduling is NP-complete [7, 48].

In order to show that the FCCS problem is NP-complete, it is first shown that the Fundamental Companion Interval Scheduling problem (FCIS), a special case of restricted interval scheduling, is NP-complete. Then FCIS is reduced to FCCS.

5.2.2 Formal NP-hard proof for FCCS

Definition 5.2.2. Fundamental Companion Interval Scheduling (FCIS):
INSTANCE: A set G of m machines G1, ..., Gm; a "companion" machine C1; an integer K; and a set I of n intervals (si, fi, oi) where si in Z+ is the starting time of interval i, fi in Z+ is the finishing time of interval i, and oi in G is the machine on which interval i can be scheduled. Additionally, any interval can be scheduled on the companion machine.
QUESTION: Can at least K of the intervals be legally scheduled?

Theorem 5.2.1. FCIS is NP-complete.

Proof. 3-Occ-Max-2SAT is reduced to FCIS.
3-Occ-Max-2SAT is a restricted form of Max-2SAT where each variable occurs at most three times. More formally, 3-Occ-Max-2SAT consists of a set of variables, U, a set of clauses, C, and a value K. Each of the clauses contains exactly two variables, possibly in negated form. No variable occurs in more than three clauses. As noted above, this problem has been shown to be NP-complete [12].

Let U = {u1, u2, ..., ua}, C = {c1, c2, ..., cb} and K be an instance of 3-Occ-Max-2SAT. For each variable u in U create two machines, Mu and Mū, both elements of G. For notational reasons, the functions f(x) and h(x) are introduced. They are defined as f(ui) = i and h(ck) = k.

For each variable u in U create the following four intervals: (f(u), f(u)+1, Mu), (f(u), f(u)+1, Mū), (0, a+b, Mu), and (0, a+b, Mū).³ This results in 4a intervals. Label the first two intervals "short variable" intervals and the second two intervals "long variable" intervals. For each clause c in C create the following two intervals. If variable u appears in clause c in its positive form, create the interval (a+h(c), a+h(c)+1, Mu). If it appears in the negative form, create the interval (a+h(c), a+h(c)+1, Mū). Since there are two variables per clause, this results in 2b "clause" intervals. Thus, a total of 4a+2b intervals are created and 2a machines are created. Both are clearly polynomial in the original input size. An example instance is shown in Figure 5.2.

³An alternative notation would be to examine each variable ui and create intervals (i, i+1, Mui), (i, i+1, Mūi), (0, a+b, Mui), and (0, a+b, Mūi).

The intervals form an instance of FCIS. The intervals can be broken into three groups: the long variable intervals, the short variable intervals, and the clause intervals. The key observations about this reduction are that the variable intervals (both long and short) are used to enforce a variable assignment, and that pairs of schedulable clause intervals correspond to the satisfiable clauses.

Figure 5.2: Example reduction from 3-Occ-Max-2SAT. Notice that the short intervals on the left are the "short variable intervals" and the ones on the right are the "clause intervals."

Suppose one can satisfy K of the clauses in the 3-Occ-Max-2SAT instance. It will now be shown that 3a+b+K of the intervals can be scheduled. Consider each variable u in U. If u is true, machine Mū is used to schedule one of the long intervals corresponding to u, and machines Mu and C1 are used to schedule the two short intervals corresponding to u. If u is false, the roles of machines Mū and Mu are reversed. Exactly 3a of the variable intervals will be scheduled. Note that it is not possible to schedule more than 3a of the variable intervals, because for each variable there are 4 overlapping jobs and only 3 available machines.

Now consider each pair of clause intervals. If the corresponding clause c is satisfied in the 3-Occ-Max-2SAT problem, then one of the literals in c must be true. Without loss of generality, it can be assumed that this literal is l. Because l is true, the long interval that could be placed on Ml was not scheduled, and machine Ml is free. Therefore the pair of intervals that correspond to clause c can be scheduled, using machines Ml and C1.
Suppose one can satisfy K of the clauses in the 3-Occ-Max-2SAT instance. It will now be shown that 3a + b + K of the intervals can be scheduled. Consider each variable u ∈ U. If u is true, machine M_ū is used to schedule one of the long intervals corresponding to u, and machines M_u and C_1 are used to schedule the two short intervals corresponding to u. If u is false, the roles of machines M_u and M_ū are reversed. Exactly 3a of the variable intervals will be scheduled. Note that it is not possible to schedule more than 3a of the variable intervals, because for each variable there are 4 overlapping jobs and only 3 available machines.

Now consider each pair of clause intervals. If the corresponding clause c is satisfied in the 3-Occ-Max-2SAT problem, then one of the literals in c must be true. Without loss of generality, assume this literal is l. Because l is true, the long interval that could be placed on M_l was not scheduled, and machine M_l is free. Therefore the pair of intervals that correspond to clause c can be scheduled using machines M_l and C_1. If the clause is not satisfied, only one interval can be scheduled (on C_1). It is important to note that, for each u, the clause intervals scheduled on a machine in G use either M_u or M_ū but not both. A total of b + K of the clause intervals will be scheduled. Thus exactly 3a + b + K of the intervals have been scheduled.

The proof is only half done at this point. It has been shown that if K of the clauses in the 3-Occ-Max-2SAT problem can be satisfied, then 3a + b + K of the intervals can be scheduled. It now needs to be shown that if 3a + b + K intervals can be scheduled, that schedule can be used to find a solution to the corresponding 3-Occ-Max-2SAT problem in which K of the clauses are satisfied.

Suppose there exists a schedule S that schedules 3a + b + K intervals. We will create a modified schedule S′ that schedules at least 3a + b + K intervals and satisfies two additional constraints: no long variable interval is scheduled on the companion machine, and for each variable u, exactly one of its long variable intervals is scheduled.

S′ is constructed as follows. First modify S so that C_1 is not used to schedule any long variable interval. Suppose in schedule S that machine C_1 is used to schedule one of the long intervals for some variable u. In this case, either the corresponding short interval is not currently scheduled, or it is scheduled on its corresponding machine in G, M_u or M_ū. In the first case, modify S by having C_1 schedule the unscheduled short interval. In the second case, modify S by taking all intervals scheduled on the corresponding machine, M_u or M_ū, scheduling them on C_1, and scheduling the long interval on that machine. In either case, the modified schedule completes the same number of intervals as S does, while C_1 no longer schedules a long variable interval.

Now suppose that, for a variable u, the modified schedule uses both M_u and M_ū to schedule the corresponding long intervals. This must mean that at least one of the short intervals corresponding to u is not scheduled. Modify the schedule by scheduling one of the unscheduled short intervals on its corresponding machine in G, M_u or M_ū, and dropping the corresponding long interval. At this step there is now a schedule in which, for each variable u, at most one of its two long intervals is scheduled on its corresponding machine in G, M_u or M_ū.

The schedule is further modified so that, for each variable u, exactly one of the two long intervals for u is scheduled on its corresponding machine in G. Let u be a variable for which neither of its long intervals is scheduled. Consider the clause intervals. Because the variable u appears at most three times, it either appears in its positive form at most once, or it appears in its negative form at most once. It follows that there is at most one clause interval that can be scheduled on machine M_u, or there is at most one clause interval that can be scheduled on machine M_ū. Drop that single clause interval and use the freed machine to schedule the corresponding long interval for u. If this machine was currently used to schedule the corresponding short interval, modify the schedule so that C_1 schedules that short interval and the other machine, M_u or M_ū, schedules the other short interval. Call this final schedule S′. The schedule S′ schedules at least 3a + b + K intervals, since none of the modifications performed decreased the number of intervals scheduled.
For each variable, exactly one long variable interval is scheduled, and it is not on C_1. This defines a valid truth assignment for 3-Occ-Max-2SAT: if the long interval on M_u is not scheduled, then variable u is true; otherwise u is false. Since at most 3a variable intervals can be scheduled, at least b + K clause intervals are scheduled. Each clause has only two intervals, and at most one of them can be scheduled on C_1, so at least K of the clauses must have both of their intervals scheduled; each such clause therefore has an interval scheduled on a machine in G. A clause interval can be scheduled on its machine in G only because the corresponding long variable interval is unscheduled, which, under the above truth assignment, means its literal is true. Hence at least K clauses of the 3-Occ-Max-2SAT instance are satisfied by this truth assignment. □

Theorem 5.2.2. The Fundamental Companion Cache Scheduling (FCCS) problem is NP-complete.

Proof. FCIS is reduced to FCCS. For each machine in FCIS create a cache location in D. For each interval (s_i, f_i, o_i) create a memory item x_i and let g(x_i) = o_i. In other words, memory item x_i can be placed in the cache location that corresponds to the machine on which the interval (s_i, f_i) can be scheduled.

Sort the end points s_1, f_1, s_2, f_2, ..., s_n, f_n of the intervals. If two end points are equal, ties are broken in the following manner: finishing points come before starting points; otherwise, the end point from the lower-indexed interval comes first. A sequence of memory references is generated as follows: for each end point in the sorted list, reference the memory item that corresponds to that end point's interval. The intervals defined by the resulting sequence of memory references are equivalent to the original set of intervals. For every interval that can be scheduled, there will be one cache hit. Therefore, a schedule that schedules more than N intervals will also result in more than N cache hits.

It is trivial to show that FCCS ∈ NP, and thus NP-complete, simply by showing that a given proposed solution can be verified to legally schedule K jobs in polynomial time. □
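The reference-sequence generation used in the proof above can be written out directly. The sketch below assumes intervals are given as (start, finish, location) triples indexed from 1; it only encodes the two tie-breaking rules (finishing points before starting points, then lower interval index first) and emits one reference to memory item x_i per end point of interval i.

```python
def intervals_to_trace(intervals):
    """Turn FCIS intervals into the memory-reference sequence of the reduction."""
    endpoints = []
    for i, (s, f, _location) in enumerate(intervals, start=1):
        endpoints.append((f, 0, i))   # finishing point: kind 0 sorts first on ties
        endpoints.append((s, 1, i))   # starting point:  kind 1 sorts after a finish
    endpoints.sort()                  # by time, then finish-before-start, then index
    return [i for (_time, _kind, i) in endpoints]   # reference x_i at each end point
```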
5.3 Extending FCCS

The results of Section 5.2.2 apply only to the FCCS problem, a highly restricted instance of the general problem of optimally scheduling a CCS. This section broadens those results considerably. While it should not be surprising that instances with set-associative main caches and larger companion buffers are still NP-hard, outlines of the proofs are provided for the sake of completeness. It is also formally proven that allowing internal reorganization and/or disallowing bypassing does not affect the difficulty of the problem. Further, it is proven that these problems are all difficult to approximate closely, and some bounds for the FCCS problem are provided. Lastly, it is proven that skew caches and certain multi-level caches are also NP-hard to schedule optimally.

Theorem 5.3.1. If the main cache is k-way set-associative, the problem is still NP-complete.

Proof. The problem statement for FCIS in Definition 5.2.2 is used with a slight modification: each element of G is redefined to be a set of k machines rather than a single machine. An interval which can be scheduled on G_j may now use any of the (otherwise unscheduled) machines in G_j, or the companion machine. The construction is also modified: each of the short variable intervals, as well as each of the clause intervals, is replicated so that there are k copies of each. Also, to reduce confusion, the set of machines corresponding to the variable u_i is referred to as S_{u_i} rather than M_{u_i}.

Figure 5.3: Example construction for a 2-way set-associative main cache.

Figure 5.3 shows an example construction similar to that of Figure 5.2. Notice that exactly one of the variable intervals will still be unschedulable, since for each variable there are now 2(k + 1) variable intervals and only 2k + 1 machines on which they can be run. Also notice that if the long variable jobs corresponding to the literals of a given clause are both scheduled, one of the clause jobs corresponding to that clause will be unschedulable. Thus the same basic arguments of Theorem 5.2.1 still apply, and the conversion from an interval scheduling problem to a set-associative caching problem is only a simple variation of Theorem 5.2.2. □

Theorem 5.3.2. If the companion buffer is of size z > 1, the problem is still NP-complete.

Proof. The problem statement for FCIS in Definition 5.2.2 is used with a slight modification: the single companion machine C_1 becomes a set C of companion machines C_1, ..., C_z. As before, any job can be run on any of the machines in C.

Figure 5.4: Example construction for a companion buffer of size 3.

The construction used in Theorem 5.2.1 is used with slight modifications. A new machine, M_B ∈ G, is added. For each pair of short intervals, add z − 1 intervals with the same start and end times as that pair and which can run on machine M_B. Figure 5.4 illustrates the modified construction for a companion buffer of size 3. If all of these "B" intervals are scheduled, they will use the z − 1 extra machines of the companion buffer.

When creating the schedule S′ as described in Theorem 5.2.1, one additional step is added: after the long variable intervals have been fixed, any unscheduled B interval is scheduled if a free machine is available. If no such machine is available, it must be the case that both of the non-B short intervals are using machines from the set C; unschedule one of those intervals and let the unscheduled B interval use that machine. The remainder of the construction is identical. □

Theorem 5.3.3. If bypassing is not allowed in the cache, the problem is still NP-complete.

Proof. This proof uses nearly the same construction as Theorem 5.2.1. The only change is that the long intervals are lengthened so that they start at time −1 rather than time 0. Notice that this change has no impact on either of the theorems used to show that FCCS is NP-hard. All that is left is to prove that disallowing bypassing does not decrease the number of intervals which can be scheduled by an optimal algorithm. Clearly, disallowing bypassing cannot increase the optimal number of intervals which can be scheduled in any instance: any schedule that is legal in the no-bypassing case is also a legal schedule with bypassing.

Let us now define the following terms. I is an instance of the FCIS problem, and S^OPT_bypass(I) is an optimal schedule for I when bypassing is allowed. Further, S^OPT_bypass(I) is taken to be of the form of the schedule S′ described in Theorem 5.2.1.
That is, all of the short variable jobs are scheduled, one long variable job is left unscheduled for each variable, and some of the clause intervals may be left unscheduled. That theorem showed that such a schedule exists which is still optimal.

Notice that under the no-bypassing restriction, all of the short intervals must be scheduled. That is because the short intervals must be started, and no other interval starts while they run, so no interval can preempt them. The short intervals may, however, preempt the long intervals. In fact, the only impact of disallowing bypassing for any I is that all of the short intervals must be scheduled; as long as that rule is followed, the schedule is a legal non-bypassing schedule. The long intervals must all initially be scheduled, but may be preempted later, and thus count as unscheduled. This means that if S^OPT_bypass can be converted to a form where the long jobs are the only jobs left unscheduled, and the total number of jobs scheduled does not decrease, then there exists a schedule that does not require bypassing and still schedules exactly as many jobs.

A schedule S^OPT_nobypass is now constructed from the schedule S^OPT_bypass. Label the k unscheduled clause intervals in S^OPT_bypass as UCI_1, ..., UCI_k, and label the long variable interval which uses the same machine in G as UCI_j as LVI_j. UCI_j could be scheduled if LVI_j were unscheduled, because LVI_j is the only interval which can possibly use its machine in G at that time. Since LVI_j overlaps with at least all of the intervals that UCI_j overlaps with, it cannot decrease the number of schedulable jobs to schedule UCI_j and unschedule LVI_j. This can be done for each of the k unscheduled clause intervals and results in no fewer intervals being scheduled.

It has been shown that S^OPT_nobypass schedules exactly as many intervals as S^OPT_bypass. As finding S^OPT_bypass is NP-hard, finding S^OPT_nobypass must be NP-hard also. □

Theorem 5.3.4. If reorganization is freely allowed in the cache, the problem is still NP-complete.

Proof. The same construction as Theorem 5.2.1 is used once again. Notice that this problem is less restricted than the FCIS problem, so the optimal schedule for the reorganization variation can be no worse than the optimal non-reorganization schedule for the same instance. It will be shown that allowing reorganization does not improve the number of intervals which can be scheduled, and therefore FCIS where reorganization is allowed must also be NP-hard.

The fundamental observation is that given m intervals, all of which need to run during some given time, and only m − 1 machines on which to run them, one of the intervals must be left unscheduled; reordering does not help in that case. Another important observation is that once an interval has been left unscheduled, even for a moment, there is no benefit to scheduling it in the future. Using these facts, it will be shown that the proof of Theorem 5.2.1 holds for the reorganization case as well. First it is shown that no extra variable intervals can be scheduled due to reorganization; next it is shown that no extra clause intervals can be scheduled due to reorganization. As that covers all of the intervals in the instance, it is enough to show that reorganization never allows extra intervals to be scheduled in the optimal schedule for any FCIS instance.

As before, for each variable at least one of its variable intervals must be left unscheduled.
Again, leaving a short variable interval unscheduled can be no better than leaving a long variable interval unscheduled, so it can be assumed that one long variable interval is left unscheduled per variable. It is therefore clearly the case that, just before the first clause interval is scheduled, exactly as many intervals can be scheduled in the standard FCIS version as in the FCIS version where reorganization is allowed.

It is now shown that no extra clause intervals can be scheduled due to reorganization. Label each pair of clause intervals as (CI_l1, CI_l2), where CI_l1 can be assigned to machine M_l1 ∈ G and CI_l2 is defined similarly. Further, the term LVI_x will be used to mean the long variable interval which can be run on machine x ∈ G. Examine each pair of clause intervals in turn. As before, if LVI_l1 and/or LVI_l2 was left unscheduled, both CI_l1 and CI_l2 can be scheduled. If LVI_l1 and LVI_l2 are both scheduled, there exist four simultaneous intervals (CI_l1, CI_l2, LVI_l1 and LVI_l2) which can only be run on three machines (M_l1, M_l2, and C_1). Thus, as in Theorem 5.2.1, one of those four intervals cannot be scheduled.

The only issue which remains is to show that there is no advantage to leaving both long variable intervals associated with the same variable unscheduled. The argument is identical to the one presented in Theorem 5.2.1: if both long variable intervals are left unscheduled, the one for which there exists only one clause interval using its machine in G can be scheduled while that clause interval is unscheduled. Recall that this is due to the 3-occurrence restriction of 3-Occ-Max-2SAT. Therefore, the same number of intervals are scheduled by the optimal solution to FCIS with reorganization as are scheduled by standard FCIS. As FCIS is NP-complete, FCIS with reorganization must also be NP-complete. □

Theorem 5.3.5. Any algorithm which can guarantee to approximate the miss rate within 0.0124% of the optimal for FCIS must have an exponential running time if P ≠ NP.

Proof. Berman and Karpinski proved that it is NP-hard to approximate 3-Occ-Max-2SAT within 0.05% [12]. Specifically, they showed that there exist instances of 2016n clauses where it is NP-hard to decide whether the number of satisfiable clauses is (2011 + ε)n or (2012 + ε)n, for any ε in the range (0, 0.5). Utilizing the construction in Theorem 5.2.1, such an instance of 2016n clauses can be converted into an FCIS instance. Recall from the construction in Theorem 5.2.1 that 4a + 2b intervals are generated when the 3-Occ-Max-2SAT problem consists of a variables and b clauses. Because the instances generated in [12] use each variable exactly three times, there are exactly 1344n distinct variables. Therefore, there are exactly (4 · 1344 + 2 · 2016)n = 9408n intervals in the FCIS instance.

Again, recall from Theorem 5.2.1 that if K clauses in the 3-Occ-Max-2SAT problem can be satisfied, 3a + b + K intervals in the FCIS instance can be scheduled. For K = (2012 + ε)n and K = (2011 + ε)n, this means that (8060 + ε)n and (8059 + ε)n intervals can be scheduled, respectively. Clearly, as n gets large and ε can therefore approach 0, it is NP-hard to distinguish between 8060n and 8059n schedulable intervals. It follows that approximating FCIS within a factor of 8060/8059 is NP-hard, which is the same as being unable to approximate within 0.0124% of the optimal. □
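The arithmetic behind these counts is easy to check; the snippet below simply restates the numbers above (per unit of n, with the ε terms dropped).

```python
# Per unit of n: 2016 clauses, two literals per clause, each variable used exactly 3 times.
a = 2 * 2016 // 3              # 1344 variables
b = 2016
print(4 * a + 2 * b)           # 9408 intervals in the FCIS instance
print(3 * a + b + 2012)        # 8060 schedulable intervals when (2012+e)n clauses hold
print(3 * a + b + 2011)        # 8059 schedulable intervals when (2011+e)n clauses hold
print(8060 / 8059 - 1)         # ~0.000124, i.e. the 0.0124% gap
```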
It should be noted that the technique used in Theorem 5.3.6 could easily be used to compute a similar result; however, that generic technique yields a looser bound of 0.0079%.

Theorem 5.3.6. An algorithm which can guarantee to approximate the miss rate within 1.01% of the optimal for FCCS is non-trivial to find.

Proof. According to Berman and Karpinski [12], the best known polynomial-time approximation algorithm for 3-Occ-Max-2SAT is the same as that for Max-2SAT found by Feige and Goemans [27]: 1.0741. Clearly, as FCIS can be used to solve 3-Occ-Max-2SAT, this result could be improved if there were a sufficiently good approximation algorithm for FCIS. It is now shown that if there exists a 1.0101-approximation algorithm for FCIS (that is, one which can approximate FCIS within 1.01%), it would result in an improvement in the best known approximation algorithm for 3-Occ-Max-2SAT. As that problem has been fairly well studied, we are comfortable labeling such an improvement as non-trivial.

An L-reduction, as explained by Papadimitriou in [55], can be used to show how an inapproximability result for one problem can be carried over to another. To quote from various parts of pages 309-311 (Papadimitriou's non-standard use of ε has been replaced with the symbol δ; otherwise the quotes are verbatim):

  Suppose that A and B are optimization problems (maximization or minimization). An L-reduction from A to B is a pair of functions R and S, both computable in logarithmic space, with the following two additional properties: First, if x is an instance of A with optimum cost OPT(x), then R(x) is an instance of B with optimum cost that satisfies

    OPT(R(x)) ≤ α · OPT(x)    (5.1)

  where α is a positive constant. Second, if s is any feasible solution of R(x), then S(s) is a feasible solution of x such that

    |OPT(x) − c(S(s))| ≤ β |OPT(R(x)) − c(s)|    (5.2)

  where β is another positive constant particular to the reduction (and we use c to denote the cost in both instances).

Papadimitriou goes on to state:

  If there is an L-reduction (R, S) from A to B with constants α and β, and there is a polynomial-time δ-approximation algorithm for B, then there is a polynomial-time

    αβδ / (1 − δ)    (5.3)

  -approximation algorithm for A.

It should be noted that Papadimitriou uses the term x-approximation algorithm in a non-standard way. He defines an x-approximation algorithm as one for which

  x = |c(M(x)) − OPT(x)| / OPT(x),

rather than the more standard x = c(M(x)) / OPT(x). For the remainder of this section the term "x-approximation algorithm," where x is a real number, refers to the standard definition. The symbols δ and ε will be used for the inapproximability bound under Papadimitriou's scheme and the standard scheme, respectively. Simple algebra shows that

  δ = (ε − 1) / ε    (5.4)

and, equivalently,

  ε = 1 / (1 − δ).    (5.5)

Thus, if A is 3-Occ-Max-2SAT and B is FCIS, there is an ε-approximation for A of 1.0741. That ε-approximation is the same as a δ-approximation of about 0.069. It is now shown that if an ε-approximation algorithm of better than 1.0101 exists for B, an ε of less than 1.0741 also exists for A; this would be an improvement in the best-known algorithm for 3-Occ-Max-2SAT.

The value of β from Equation 5.2 is now found via a two-part argument. Let x be the 3-Occ-Max-2SAT instance, consisting of a variables and b clauses, let R(x) be the corresponding FCIS instance, and let s be a feasible solution of R(x). Let m be the optimal number of satisfiable clauses in x, so that OPT(x) = m and, by Theorem 5.2.1, OPT(R(x)) = 3a + b + m. First assume that s schedules at least 3a + b intervals, say 3a + b + c of them. Recall from Theorem 5.2.1 that if 3a + b + c intervals of the FCIS instance can be scheduled, at least c clauses of the 3-Occ-Max-2SAT instance can be satisfied. Thus |OPT(x) − c(S(s))| ≤ m − c and |OPT(R(x)) − c(s)| = (3a + b + m) − (3a + b + c) = m − c, and from Equation 5.2, β = 1. Now assume instead that s schedules fewer than 3a + b intervals. In that case, even if S(s) satisfies no clauses at all, Equation 5.2 still holds with β = 1, because |OPT(x) − c(S(s))| ≤ m < |OPT(R(x)) − c(s)|.

Computing α requires one additional fact: at least 3/4 of the clauses of any Max-2SAT instance can always be satisfied. Given that, if m is the optimal number of satisfiable clauses of the 3-Occ-Max-2SAT instance, there can be no more than 4m/3 clauses in the instance. Further, if a variable occurs only once, that literal can safely be set to true and the problem reduced in size by one variable and one clause, so it can be assumed that each variable occurs at least twice; since each clause contains two literals, this gives a ≤ b. If, as before, a is the number of variables and b the number of clauses, then a ≤ 4m/3 and b ≤ 4m/3. Thus the FCIS instance can schedule at most 3a + b + m ≤ 12m/3 + 4m/3 + 3m/3 = 19m/3 intervals. Plugging the appropriate values into Equation 5.1, α = 19/3.

Having α and β available, and using Equations 5.3, 5.4, and 5.5, it can be shown that if a better than 1.0101-approximation algorithm for FCIS were found, it would lead to an improvement in the best known approximation algorithm for 3-Occ-Max-2SAT. (This claim is easily verified; however, finding the value 1.0101 itself required some additional work.)
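As a rough numerical check of that claim (using the α = 19/3 and β = 1 derived above together with Equations 5.3-5.5; the rounding is mine, and the threshold 1.0101 is taken from the text):

```python
alpha, beta = 19 / 3, 1.0
eps_B = 1.0101                                    # hypothesised FCIS approximation ratio
delta_B = (eps_B - 1) / eps_B                     # Papadimitriou-style error, ~0.01   (5.4)
delta_A = alpha * beta * delta_B / (1 - delta_B)  # error carried to 3-Occ-Max-2SAT    (5.3)
eps_A = 1 / (1 - delta_A)                         # back to a standard ratio           (5.5)
print(delta_B, delta_A, eps_A)                    # ~0.01, ~0.064, ~1.068 < 1.0741
```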
This does not mean that there is no polynomial-time algorithm which can guarantee a result better than 1.01%. However, such a result would constitute an improvement over the best known algorithm for a fairly well-studied problem. We are comfortable labeling such an approximation algorithm "non-trivial." □

Theorem 5.3.7. Finding the optimal schedule for a skew cache is NP-hard.

Proof. This is proven by restriction [28]. Notice that it is possible to have an access sequence in which all of the accesses to one of the banks of the skew cache map to the same location of that bank. In that restricted case the problem is exactly the same as FCCS. Thus, if an algorithm existed which could optimally schedule a skew cache, that same algorithm could be used to schedule an FCCS instance. Since FCCS is NP-hard, skew-cache scheduling is also NP-hard. □

Theorem 5.3.8. It is NP-hard to optimally schedule a multi-level cache as a single unit where exclusion is allowed.

Proof. This is also a proof by restriction [28]. Assume a two-level cache hierarchy where the L1 cache holds k1 blocks grouped into s1 sets, and define k2 and s2 in the same way for the L2 cache. Without loss of generality, assume that s2 > s1 (with only slight modifications, the same argument can be used if s1 > s2). Table 5.1 lists the values of s2/s1 for a few selected processors; values near 8 or 16 are most common.

Table 5.1: Ratio of the number of sets in the L2 cache to the number of sets in the L1 data cache for selected processors.

  Processor                      s2/s1
  Intel P6 w/ 512KB L2 cache       64
  Intel Celeron (Coppermine)        8
  Compaq 21364                      8
  AMD 1GHz Athlon                 0.5
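For a concrete sense of where such ratios come from, consider a hypothetical two-level configuration (not one of the processors in Table 5.1): a 32 KB, 4-way L1 and a 1 MB, 8-way L2, both with 64-byte blocks.

```python
block = 64
s1 = (32 * 1024) // (block * 4)     # 32 KB, 4-way L1  -> 128 sets
s2 = (1024 * 1024) // (block * 8)   # 1 MB, 8-way L2   -> 2048 sets
print(s1, s2, s2 // s1)             # 128 2048 16, in line with "values near 8 or 16"
```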
Consider the set of blocks, B_i, which map into the i-th set of the L1 cache. Under all reasonable indexing functions (see Chapter 2 for a discussion of indexing functions), the elements of B_i will also map into at least s2/s1 different sets of the L2 cache. Now notice that optimally scheduling the accesses to the blocks in B_i is exactly the same as scheduling a companion cache in which the companion buffer has size k1/s1 and the main cache consists of at least s2/s1 sets, each of size k2/s2.

It is worth noting that for small values of s2/s1 the problem may be computationally tractable. In fact, for a fixed value of s2/s1 the problem technically requires only polynomial running time. However, the large values of s2/s1 found in real processors, combined with very long address traces (usually more than 1,000,000 addresses), would seem to make the problem intractable in practice. □

5.4 Conclusion

It has been shown that an important tool for cache studies, the optimal replacement policy, is NP-hard to compute for many non-standard caches. It has also been shown that this result applies to the optimal scheduling of multi-level caches where the caches are kept exclusive. Additionally, it was proven that a good approximation to these problems is also NP-hard to compute. While these results will not help anyone build a faster cache, researchers who are aware of them will not waste their time in a futile attempt to find efficient algorithms for these problems; unless P = NP, such algorithms simply do not exist.

Bibliography

[1] A. Agarwal, J. Hennessy, and M. Horowitz. Cache performance of operating systems and multiprogramming. ACM Transactions on Computer Systems, 6(4):393-431, November 1988.
[2] A. Agarwal, J. Hennessy, and M. Horowitz. An analytical cache model. ACM Transactions on Computer Systems, 7(2), May 1989.
[3] A. Agarwal and S. D. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proc. of the 20th Int'l Symposium on Computer Architecture, pages 179-190, 1993.
[4] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. of the 27th Int'l Symposium on Computer Architecture, June 2000.
[5] E. Anderson, P. V. Vleet, L. Brown, J-L. Baer, and A. R. Karlin. On the performance potential of dynamic cache line sizes. Technical Report UW-CSE-99-02-01, Dept. of Computer Science and Engineering, University of Washington, February 1999.
[6] J. Archibald and J-L. Baer. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, November 1986.
[7] E. M. Arkin and E. B. Silverberg. Scheduling jobs with fixed start and end times. Discrete Applied Mathematics, 18:1-8, 1987.
[8] O. Babaoglu. Efficient generation of memory reference strings based on the LRU stack model of program behaviour. In PERFORMANCE '81, pages 373-383, 1981.
[9] J-L. Baer and W-H. Wang. On the inclusion properties of multi-level cache hierarchies. In Proc. of the 15th Int'l Symposium on Computer Architecture, June 1988.
[10] A. P. Batson and A. W. Madison. Characteristics of program locality. Communications of the ACM, 19(5):285-294, May 1976.
[11] L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):282-288, 1966.
[12] P. Berman and M. Karpinski. On some tighter inapproximability results. Technical Report 99-23, DIMACS, 1999.
[13] M. Brehob and R. J. Enbody. An analytical model of locality and caching. Technical Report MSUCPS:TR99-31, Michigan State University, Department of Computer Science and Engineering, 1999.
[14] M. Brehob, R. J. Enbody, and N. Wade. Analysis and replacement for skew-associative caches. Technical Report MSUCPS:TR97-32, Michigan State University, Department of Computer Science and Engineering, 1997.
[15] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of future microprocessors. In Proc. of the 23rd Annual Int'l Symposium on Computer Architecture, pages 78-89, 1996.
[16] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proc. of the 2nd Int'l Symposium on High Performance Computer Architecture, pages 244-253, 1996.
[17] K. K. Chan, C. C. Hay, J. R. Keller, G. P. Kurpanek, F. X. Schumacher, and J. Zheng. Design of the HP PA 7200. Hewlett-Packard Journal, February 1996.
[18] J. H. Chang, H. Chao, and K. So. Cache design of a sub-micron CMOS System/370. In Proc. of the 14th Int'l Symposium on Computer Architecture, pages 179-190, 1987.
[19] C. K. Chow. An optimization of storage hierarchies. IBM Journal of Research and Development, 18(3):194-203, May 1974.
[20] C. K. Chow. Determination of cache's capacity and its matching storage hierarchy. IEEE Transactions on Computers, C-25(2):157-164, February 1976.
[21] Standard Performance Evaluation Corporation. SPEC benchmarks. World Wide Web.
[22] D. E. Culler. Caches. World Wide Web, February 2002.
[23] J. D. Gee, M. D. Hill, D. N. Pnevmatikatos, and A. J. Smith. Cache performance of the SPEC92 benchmark suite. IEEE Micro, 13(4):17-27, August 1993.
[24] P. J. Denning. The working set model for program behavior. Communications of the ACM, 11(5), May 1968.
[25] S. J. E. Wilton and N. P. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31(5):677-688, May 1996.
[26] E. G. Coffman and P. J. Denning. Operating Systems Theory. Prentice-Hall, 1973.
[27] U. Feige and M. Goemans. Approximating the value of two prover proof systems, with applications to MAX 2SAT and MAX DICUT. In Proc. of the 3rd Israel Symposium on the Theory of Computing and Systems, pages 182-189, 1995.
[28] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[29] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proc. of the 11th ACM Int'l Conference on Supercomputing, July 1997.
[30] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York, 1980.
[31] A. Gonzalez, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In Proc. of the 1995 Conference on Supercomputing, pages 338-347, 1995.
[32] K. Grimsrud. Quantifying Locality. Doctoral dissertation, Brigham Young University, 1993.
[33] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson. On the accuracy of memory reference models. Lecture Notes in Computer Science, 794:369-388, 1994.
[34] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson. Locality as a visualization tool. IEEE Transactions on Computers, 45:1319-1325, November 1996.
[35] M. Hachman. IBM and Intel chip vendors showcase road maps at Microprocessor Forum. BYTE, October 1999.
[36] E. Hallnor and S. K. Reinhardt. A fully associative software-managed cache design. In Proc. of the 27th Int'l Symposium on Computer Architecture, June 2000.
[37] J. Handy. The Cache Memory Book. Academic Press, second edition, 1998.
[38] J. P. Hayes. Computer Architecture and Organization. McGraw-Hill, 1988.
[39] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, third edition, 2002.
[40] M. D. Hill. A case for direct-mapped caches. IEEE Computer, pages 25-40, December 1988.
[41] M. D. Hill. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1630, December 1989.
[42] Micron Technology Inc. 1997 DRAM Data Book. Micron Technology Inc., Boise, Idaho, 1997.
[43] B. Jacob, P. Chen, S. Silverman, and T. Mudge. An analytical model for designing memory hierarchies. IEEE Transactions on Computers, 45(10), October 1996.
[44] T. L. Johnson and W-M. Hwu. Run-time adaptive cache hierarchy management via reference analysis. In Proc. of the 24th Annual Int'l Symposium on Computer Architecture, volume 25(2) of Computer Architecture News, pages 315-326, New York, June 1997. ACM Press.
[45] N. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of the 17th Annual Int'l Symposium on Computer Architecture, pages 364-373, May 1990.
[46] S. F. Kaplan, Y. Smaragdakis, and P. R. Wilson. Trace reduction for virtual memory simulations. In Proc. of ACM SIGMETRICS, May 1999.
[47] R. R. Kessler et al. Inexpensive implementations of set-associativity. In Proc. of the 16th Int'l Symposium on Computer Architecture, pages 131-139, 1989.
[48] A. W. J. Kolen and J. G. Kroon. On the computational complexity of (maximum) class scheduling. European Journal of Operational Research, 54:23-38, 1991.
[49] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2), 1970.
[50] S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and evaluation of dynamic access ordering hardware. In Proc. of the 10th ACM Int'l Conference on Supercomputing, May 1996.
[51] S. A. McKee, R. H. Klenke, K. L. Wright, W. A. Wulf, M. H. Salinas, J. H. Aylor, and A. P. Batson. Smarter memory: Improving bandwidth for streamed references. IEEE Computer, July 1998.
[52] S. A. McKee, W. A. Wulf, J. H. Aylor, R. H. Klenke, S. I. Salinas, M. H. Hong, and D. A. B. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, to appear.
[53] K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proc. of the 7th Int'l Conference on Architectural Support for Programming Languages and Operating Systems, pages 94-104, Cambridge, Massachusetts, October 1996. ACM Press.
[54] G. D. McNiven and E. S. Davidson. Analysis of memory referencing behavior for design of local memories. In H. J. Siegel, editor, Proc. of the 15th Int'l Symposium on Computer Architecture, pages 56-63, Honolulu, Hawaii, May 1988. IEEE Computer Society Press.
[55] C. H. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity classes. In Proc. of the 20th ACM Symposium on Theory of Computing, pages 229-234, 1988.
[56] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, San Mateo, CA, second edition, 1997.
[57] S. A. Przybylski. Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann, 1990.
[58] R. Rau. Properties and applications of the least-recently-used stack model. Technical Report 139, Dept. of Electrical Engineering, Stanford University, May 1977.
[59] U. L. Rhodes and W. Wolf. Unbalanced cache systems. In Proc. of the IEEE Int'l Workshop on Memory Technology, Design and Testing, pages 16-23, 1999.
[60] A. Seznec. A case for two-way skewed-associative caches. In Proc. of the 20th Int'l Symposium on Computer Architecture, pages 169-178, 1993.
[61] A. Seznec and F. Bodin. Skewed-associative caches. In Proc. of PARLE '93, 1993.
[62] A. Seznec and F. Bodin. Skewed associativity enhances performance predictability. In Proc. of the 22nd Int'l Symposium on Computer Architecture, pages 256-275, 1995.
[63] A. Silberschatz, J. Peterson, and P. Galvin. Operating System Concepts. Addison-Wesley, third edition, 1991.
[64] A. J. Smith. Cache memories. ACM Computing Surveys, 14(3):473-530, September 1982.
[65] A. J. Smith. A comparative study of set associative memory mapping algorithms and their use for cache and main memory. IEEE Transactions on Software Engineering, 4(2):121-130, March 1978.
[66] H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1990.
[67] R. A. Sugumar. Multi-Configuration Simulation Algorithms for the Evaluation of Computer Architecture Designs. Doctoral dissertation, University of Michigan, 1993.
[68] R. A. Sugumar and S. G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. In Proc. of ACM SIGMETRICS, May 1993.
[69] E. S. Tam, J. A. Rivers, V. Srinivasan, G. S. Tyson, and E. S. Davidson. Evaluating the performance of active cache management schemes. In Proc. of the 1998 IEEE Int'l Conference on Computer Design, pages 368-375, October 1998.
[70] E. S. Tam, J. A. Rivers, V. Srinivasan, G. S. Tyson, and E. S. Davidson. Active management of data caches by exploiting reuse information. IEEE Transactions on Computers, 48(11), November 1999.
[71] D. Thiebaut. On the fractal dimension of computer programs and its application to the prediction of the cache miss ratio. IEEE Transactions on Computers, 38(7), July 1989.
[72] E. Torng. A unified analysis of paging and caching. Algorithmica, 20:175-200, 1998.
[73] D. Truong, F. Bodin, and A. Seznec. Accurate data layout into blocks may boost cache performance. In Interact-2, February 1997.
[74] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun. A modified approach to data cache management. In Proc. of the 28th Annual Int'l Symposium on Microarchitecture, pages 93-103, Ann Arbor, Michigan, 1995. IEEE Computer Society TC-MICRO and ACM SIGMICRO.
[75] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: A survey. ACM Computing Surveys, 29(2):128-170, June 1997.
[76] D. A. B. Weikle, S. A. McKee, and W. A. Wulf. Caches as filters: A new approach to cache analysis. In Proc. of the 6th Int'l Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, July 1998.
[77] D. A. B. Weikle, K. Skadron, S. A. McKee, and W. A. Wulf. Caches as filters: A unifying model for memory hierarchy analysis. Technical Report CS-2000-16, Dept. of Computer Science, University of Virginia, June 2000.
[78] W. A. Wong and J-L. Baer. Modified LRU policies for improving second-level cache behavior. In Proc. of the 6th Int'l Symposium on High-Performance Computer Architecture, January 2000.
[79] C. E. Wu, Y. Hsu, and Y-H. Liu. A quantitative evaluation of cache types for high-performance computer systems. IEEE Transactions on Computers, 42(10), October 1993.
[80] W. A. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23:20-24, 1995.
[81] Q. Yang and S. Adina. A one's complement cache memory. In Proc. of the 23rd Int'l Conference on Parallel Processing, Volume 1: Architecture, pages 250-257, Boca Raton, FL, USA, August 1994. CRC Press.
[82] C. Zhang, X. Zhang, and Y. Yan. Two fast and high-associativity cache schemes. IEEE Micro, 17(5):40-49, September 1997.