Data Placement in Shared-Virtual-Memory Multiprocessors with Non-Uniform Memory Access Times

By

Jayashree Ramanathan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1992

ABSTRACT

Data Placement in Shared-Virtual-Memory Multiprocessors with Non-Uniform Memory Access Times

By Jayashree Ramanathan

Each processor in a multiprocessor usually has some local memory and, in addition, it may share a global memory with all the other processors. Such a memory organization is motivated by price/performance reasons, but results in a non-uniform memory access (NUMA) time due to the difference in the latency of the interconnects used to access the local and the non-local memories. Many NUMA multiprocessors allow processes of an application to interact by means of a shared virtual memory because of the advantages of virtual memory and the ease of programming using the shared-memory model. To achieve an acceptable performance in these multiprocessors, it is important to properly place the data of a parallel application in the memory hierarchy. The primary aim of this thesis is to identify the limitations of existing data placement techniques and to develop better methods of placing data when providing a shared virtual memory in NUMA multiprocessors.

Existing data placement techniques include data replication and data migration, both of which place data in terms of blocks such as pages or cache lines. Using analysis and trace-driven simulations, we study the factors affecting page-level replication, and show that it needs to be adaptive to page reference patterns and to hardware parameters such as the available physical memory and the time to transfer data between the local and the non-local memories. We also demonstrate that proper layout of data in the shared virtual memory reduces the false sharing of each page, creates simple page reference patterns, and simplifies data placement. These results apply to other block-level placement strategies as well. In the absence of help from the compiler or the applications programmer, however, such strategies incur runtime overhead when being adaptive and, further, are unable to optimally place blocks that are falsely-shared and those whose reference patterns change rapidly.

The conclusions from our investigation thus motivate our study of compiler-assisted data placement. Our approach to assist data placement is by using compile-time objects containing data of the same variable type and with similar reference patterns. We demonstrate how our approach can be incorporated in a compiler. We then develop a compile-time object-creation scheme that assists block-level placement; we also propose solutions to several related issues.
Next, we develop a scheme that assists object-level placement, which requires the applications programmer to specify when data placement operations are needed. We derive the compile-time overhead of applying our schemes. We also compare the performance of applications for these different types of placement schemes by experimental simulations. We demonstrate that there is significant performance improvement when the compiler assists data placement. Further, the best performance is obtained when data placement operations are entirely specified by the compiler. In summary, compiler-directed data placement is a promising approach to improve the performance of applications and the programmability of NUMA multiprocessors.

Copyright © by Jayashree Ramanathan 1992

ACKNOWLEDGEMENTS

I owe much to my parents. My mother who devoted her life to the upbringing of her children motivated us to be independent and well-educated. It is unfortunate that she did not live to see me complete my Ph.D. But she was and still continues to be a great source of inspiration and motivation for me. My father encouraged us to pursue higher education and to excel in whatever we chose to do. I gladly dedicate this thesis to my parents.

I am fortunate to have made several good friends along the way. Some of them helped shape my interests and values, out of whom I must mention Sudha, Nirmala, and Ravishankar. I also enjoyed and continue to enjoy the friendship of Chitra, CS, John, Poonam, Rajni, and Rama Seshu.

I am thankful to our department for offering me a teaching assistantship which enabled me to come to the U.S. to pursue a Ph.D. I owe thanks to both Ms. Sasikala Reddy of the School of Osteopathic Medicine, and Dr. Erik Goodman, the Director of the Case Center, for providing me research assistantships during my first summer in the U.S.
Thanks are also due to my thesis advisor Prof. Lionel Ni and our chairperson Prof. Anthony Wojick for hiring me as a lab manager from 1988 till 1991. The lab manager position not only provided me financial assistance, but also enriched my technical and interpersonal skills. I enjoyed working for Frank Northrup, our department's computing facilities coordinator. My thanks to our graduate secretary Lora Mae Higbee for her enormous help in administrative matters.

The courses that I took with Prof. Richard Enbody got me interested in the broad area of my dissertation. I also enjoyed participating in the meetings of the parallel processing and the distributed computing research groups organized by Prof. Richard Enbody and Prof. Matt Mutka, respectively.

I thank Krish Kannan of the IBM T.J. Watson Research Center for offering me a summer internship in 1989. My thanks are also due to Paul Carini, Fran Allen, and Jeanne Ferrante, all of the IBM T.J. Watson Research Center, for providing me with research papers and positive feedback on my dissertation work.

I thank Prof. Lionel Ni for suggesting that I look at the Mach operating system for my term paper work. He also encouraged me to apply to the IBM T.J. Watson Research Center for a summer internship. Both these factors eventually led to the choice of my dissertation topic. I also thank him for his understanding when I decided to complete my dissertation in Poughkeepsie, New York. I appreciate the hospitality shown by him and his wife.

Finally, I owe much to my husband Ram for his love, encouragement, and support throughout my dissertation work. He patiently bore with my long hours of work, and took care of most of the household chores which significantly reduced the time it took to write this thesis.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 A NUMA Multiprocessor
1.2 Shared Virtual Memory
1.3 Data Placement
1.4 Summary of Major Contributions
1.5 Thesis Organization
2 FACTORS AFFECTING DATA PLACEMENT
2.1 Related Work
2.2 Assumptions
2.3 Page Replacement
2.3.1 CLOCK Scheme
2.3.2 Software LRU Approximation Scheme
2.3.3 LRU Scheme
2.4 Method of Study
2.5 Performance Metrics
2.6 Workload
2.7 Experimental Results
2.8 Summary
3 COMPILER-ASSISTED DATA PLACEMENT
3.1 Motivation
3.2 Data Placement and the Memory Hierarchy
3.3 Reference Pattern of the Shared Virtual Memory
3.4 Design of the Compiler
3.5 Type and Reference Pattern of Variables
3.6 Related Work
3.7 Summary
4 COMPILER-ASSISTED BLOCK-LEVEL PLACEMENT
4.1 Compile-Time Object-Creation
4.2 Assistance in Block-Level Placement
4.3 Compilation of Applications
4.4 Internal Fragmentation
4.5 Code Generation for Array References
4.6 Compilation Overhead
4.7 Performance of Applications
4.8 Summary
5 COMPILER-DIRECTED OBJECT-LEVEL PLACEMENT
5.1 Motivation
5.2 Compile-Time Object-Creation
5.3 Compilation of Applications
5.4 Compilation Overhead
5.5 Performance of Applications
5.6 Application Execution Model
5.7 Results of the Application Execution Model
5.8 Summary
6 EXPERIMENTAL COMPARISON OF DATA PLACEMENT SCHEMES
6.1 Goals and Assumptions
6.2 Performance Measures
6.3 Methodology
6.4 Experimental Results
6.4.1 Application matrix multiplication
6.4.2 Application parallel iterative solver
6.4.3 Application multi-code-segment-array
6.4.4 Conclusions
6.5 Summary
7 CONCLUSIONS
7.1 Major Contributions
7.2 Future Directions
Glossary

LIST OF TABLES

2.1 Page state transition diagram: LERN with full replication
2.2 Page state transition diagram: LERN with no replication
2.3 Parameters
2.4 TLB hit accesses
2.5 Page table hit accesses
2.6 Remote page fault accesses
2.7 Disk cache fault accesses
2.8 Disk page fault accesses
2.9 Time to maintain consistency of data
2.10 Fixed parameters
5.1 Inexact reference pattern and OCDEP
5.2 Parameters of the application execution model
6.1 Consistency messages (multi-code-segment-array; OCRP, OCDEP)
6.2 Data transfer messages (multi-code-segment-array; OCRP, OCDEP)
6.3 Data transferred (in elements) (multi-code-segment-array; OCRP, OCDEP)
6.4 Total messages (multi-code-segment-array; OCRP, OCDEP)
6.5 Total objects (multi-code-segment-array; OCRP, OCDEP)
6.6 Summary of experimental results
6.7 Compiler-directed data placement (OCRP vs. OCDEP)

LIST OF FIGURES

1.1 A NUMA multiprocessor
1.2 Mapping from virtual memory to physical memory
1.3 Virtual-to-physical address translation using a three-level page table
1.4 A TLB with address mappings of multiple applications
1.5 Implementation of a shared virtual memory on networked workstations
1.6 Block-level data replication
1.7 Consistency models
1.8 Block-level data migration
1.9 Block-level placement in a shared-virtual-memory NUMA multiprocessor
2.1 The software LRU approximation scheme
2.2 The LERN page replacement scheme
2.3 Application parallel iterative solver
2.4 Different methods of data layout for a 4 x 4 array partitioned into four partitions
2.5 Parameter overhead (%) vs. No. of pages (FR/R=2/near neighbor/row major)
2.6 Parameter overhead (%) vs. No. of pages (NR/R=2/near neighbor/row major)
2.7 Parameter overhead (%) vs. No. of pages (FR/R=10/near neighbor/row major)
2.8 Parameter overhead (%) vs. No. of pages (NR/R=10/near neighbor/row major)
2.9 Effective memory access time vs. No. of pages (FR/near neighbor/row major)
2.10 Effective memory access time vs. No. of pages (NR/near neighbor/row major)
2.11 Parameter overhead (%) vs. No. of pages (FR/R=2/near neighbor/block major)
2.12 Parameter overhead (%) vs. No. of pages (NR/R=2/near neighbor/block major)
2.13 Parameter overhead (%) vs. No. of pages (FR/R=10/near neighbor/block major)
2.14 Parameter overhead (%) vs. No. of pages (NR/R=10/near neighbor/block major)
2.15 Effective memory access time vs. No. of pages (FR/near neighbor/block major)
2.16 Effective memory access time vs. No. of pages (NR/near neighbor/block major)
2.17 Parameter overhead (%) vs. No. of pages (FR/R=2/PIC)
2.18 Parameter overhead (%) vs. No. of pages (NR/R=2/PIC)
2.19 Parameter overhead (%) vs. No. of pages (FR/R=10/PIC)
2.20 Parameter overhead (%) vs. No. of pages (NR/R=10/PIC)
2.21 Effective memory access time vs. No. of pages (FR/PIC)
2.22 Effective memory access time vs. No. of pages (NR/PIC)
3.1 An example of a sequential application
3.2 Performance degradation with caching and without page placement
3.3 Reference pattern of the shared virtual memory
3.4 An example of a parallel application
3.5 A model of a parallel application with placement directives
3.6 Placement information about the shared virtual memory
3.7 Steps in compiling an application to assist data placement
3.8 Determination of reference pattern: Algorithm A
3.9 Determination of reference pattern: Algorithm B
4.1 Examples of objects for elements of a two-dimensional array
4.2 Algorithm to update the kind of an object
4.3 Placement information provided by OCRP
4.4 Parallel application after applying OCRP
4.5 Application matrix multiplication
4.6 Objects created by OCRP for application matrix multiplication
4.7 Application parallel iterative solver
4.8 Objects created by OCRP for application parallel iterative solver
4.9 Application recurrence-relation
4.10 Application with a DOACROSS loop
4.11 Code generation for array references when applying OCRP
4.12 Traversals for an array reference with linear subscripts
4.13 Algorithm used by OCRP to create objects
4.14 Data structure to store objects created by OCRP for an array variable
5.1 Examples of objects for elements of a two-dimensional array
5.2 Application matrix multiplication
5.3 Application parallel iterative solver
5.4 Objects created by OCDEP for application parallel iterative solver
5.5 Application with a DOACROSS loop
5.6 Application multi-code-segment-array
5.7 Algorithm used by OCDEP for updates before a code segment
5.8 Object list for OCDEP
5.9 Rectangular partitioning of an n x n array
5.10 Objects created by OCDEP_NOVLP for application parallel iterative solver
5.11 Object update time reduction due to OCDEP/BARRIER
5.12 Object update time reduction due to OCDEP/WAIT_SIGNAL
5.13 Object update time fraction/BARRIER
5.14 Object update time fraction/WAIT_SIGNAL
5.15 Execution time reduction due to OCDEP/BARRIER
5.16 Execution time reduction due to OCDEP/WAIT_SIGNAL
6.1 Legend for the performance graphs
6.2 Application matrix multiplication
6.3 Consistency messages vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.4 Data transfer messages vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.5 Data transferred vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.6 Total messages vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.7 Physical memory requirements (% of problem size) vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.8 Virtual memory requirements (% of problem size) vs. Number of processors (matrix multiplication for 128 x 128 square matrices)
6.9 Application parallel iterative solver
6.10 Consistency messages vs. Number of processors (parallel iterative solver)
6.11 Consistency messages vs. Number of processors (parallel iterative solver; OCRP, OCDEP)
6.12 Data transfer messages vs. Number of processors (parallel iterative solver)
6.13 Data transferred vs. Number of processors (parallel iterative solver)
6.14 Total messages vs. Number of processors (parallel iterative solver; OCRP, OCDEP)
6.15 Total messages vs. Number of processors (parallel iterative solver)
6.16 Total objects vs. Number of processors (parallel iterative solver; OCRP, OCDEP)
6.17 Physical memory requirements (% of problem size) vs. Number of processors (parallel iterative solver)
6.18 Virtual memory requirements (% of problem size) vs. Number of processors (parallel iterative solver)
6.19 Application multi-code-segment-array
6.20 Consistency messages vs. Number of processors (multi-code-segment-array; WF=.01)
6.21 Consistency messages vs. Number of processors (multi-code-segment-array; WF=.01)
6.22 Consistency messages vs. Number of processors (multi-code-segment-array; WF=.256)
6.23 Consistency messages vs. Number of processors (multi-code-segment-array; WF=.6)
6.24 Data transfer messages vs. Number of processors (multi-code-segment-array; WF=.01)
6.25 Data transfer messages vs. Number of processors (multi-code-segment-array; WF=.01, .256, .6)
6.26 Data transferred vs. Number of processors (multi-code-segment-array; WF=.01)
6.27 Data transferred vs. Number of processors (multi-code-segment-array; WF=.01, .256, .6)
6.28 Total messages vs. Number of processors (multi-code-segment-array; WF=.01)
6.29 Total messages vs. Number of processors (multi-code-segment-array; WF=.256)
6.30 Total messages vs. Number of processors (multi-code-segment-array; WF=.6)
6.31 Physical memory requirements (% of problem size) vs. Number of processors (multi-code-segment-array)
6.32 Virtual memory requirements (% of problem size) vs. Number of processors (multi-code-segment-array)

CHAPTER 1

INTRODUCTION

The speed of a single-processor computer has increased by several orders of magnitude since the first electronic digital computer was introduced in 1940, primarily due to advances in hardware technology. However, many important problems remain unsolved as their solution requires computational power that is far beyond the capabilities of the fastest single-processor computers currently available. For instance, there exist problems that need computers capable of executing 10^12 floating point operations per second (teraflops). Since hardware technology is approaching its physical limitations, such computational power can be realized only by introducing parallelism in computers. An important class of parallel computers is multiprocessors, which consist of multiple processors and memory modules connected together by one or more interconnects.

In this thesis, we consider multiprocessors with a non-uniform memory access (NUMA) time, i.e., the access time is not the same for all memory locations. Processors in such a multiprocessor concurrently execute either different applications or different processes of the same parallel application. We consider the latter case and assume that there are enough processors so that each application is allocated as many processors as the number of processes it creates and further, each process is statically assigned to a given processor. Most existing multiprocessor operating systems, such as the BBN's nX [1], allow such exclusive allocation of resources to a given application, thereby allowing its performance to be maximized, independent of other applications. We also assume that the processes of an application communicate and synchronize by using the same virtual addresses for data they share.
This method of communication is achieved by assigning addresses to the application's data and code from a set of virtual addresses shared by all processors, referred to as the shared virtual memory (SVM). Due to the variation in cost to access different physical memory locations in a NUMA multiprocessor, data in the SVM needs to be placed in the NUMA physical memory such that the time to access it is minimized for all processors. In this thesis, we propose and evaluate new data placement techniques that help to achieve this goal.

In this chapter, we first present a typical architecture of a NUMA multiprocessor and then discuss the advantages of a SVM and the implementation issues in providing it in a NUMA multiprocessor. Next we provide an overview of existing techniques to address the problem of data placement. We then summarize the contributions of our thesis in addressing the data placement problem. We conclude with an overview of the remaining chapters.

1.1 A NUMA Multiprocessor

Typically, each processor in a multiprocessor has some memory placed local to it and, in addition, it may share a global memory with all the other processors as shown in Figure 1.1. Such a memory organization is motivated by price/performance reasons similar to the cache and the main memory hierarchy prevalent in uniprocessors. For example, processors in small-scale multiprocessors such as the Alliant's FX/8 [2] and the Sequent's Symmetry [3] are connected by a single bus interconnect to a global main memory. Each processor has a local cache, and a snoopy hardware cache protocol keeps all the caches consistent by listening to memory accesses on the shared bus.

A single bus is limited in bandwidth for large-scale multiprocessors, and technology limitations make it too expensive to provide hardware connectivity from each processor to every other processor. Therefore, large-scale multiprocessors are built with intermediate connectivity using interconnects such as multi-stage interconnects as in the BBN's TC2000 [1] and the IBM's RP3 [4], point-to-point interconnects as in the Intel's DELTA [5], and hierarchical interconnects as in the Kendall Square Research's KSR1 [6] and the DASH multiprocessor [7].

[Figure 1.1. A NUMA multiprocessor: processors with local memories connected by local interconnects and by a global interconnect to a global memory.]

Since there is no longer a single bus to snoop at, some of these multiprocessors such as the TC2000, RP3, and DELTA do not provide hardware cache consistency. Others use a directory-based cache consistency protocol to maintain information about cache lines in either centralized or distributed directories as in the DASH. Some others, such as the KSR1, use a hardware protocol that traverses the interconnect hierarchically. Details of the IEEE scalable coherent interface and related projects can be found in [8].

Main memory in large-scale multiprocessors includes modules global to all processors just as in small-scale multiprocessors. In addition, there are modules placed local to each processor. Note that in small-scale multiprocessors, the memory local to each processor includes only its cache. However, in large-scale multiprocessors, each processor has a private cache as well as a portion of the main memory, which we refer to as the local memory.
For example, processors in the TC2000, DELTA, Horizon [9], Symult 2010 [10], and networked workstations have local memories, those in the RP3 can be set up to have local and/or global memory, while those in the IBM's ACE [11] have local and global memory. Further, either all available main memory can be directly addressed in hardware by all processors as in the RP3 and the ACE, or each processor can address all memory except another processor's local memory (called remote memory) as in the DELTA and networked workstations.

The memory access time is defined as the number of processor clock cycles that elapse between the time a processor issues a memory read request and the time the data arrives from memory. It depends on the speed with which the memory can deliver data and the characteristics of the interconnect(s) connecting the processor to the memory location containing the data. Two main characteristics of an interconnect are the latency and the peak bandwidth. The latency of an interconnect is the minimum number of clock cycles required to transfer one word of data either from one processor to another, or from one processor to a memory module. The peak bandwidth gives the maximum rate at which data can be transferred across the interconnect and is usually measured in words/cycle [12]. The variation in the latencies of the interconnects used to access the different memory locations results in a non-uniform memory access time, and hence the name NUMA multiprocessors. For example, the difference in the times to access the cache and main memory gives rise to a NUMA time in small-scale multiprocessors. An additional NUMA time in large-scale multiprocessors is due to the difference in the times to access local, global, and remote memory. As an example, in the TC2000, a remote memory access takes four times longer than a local memory access.
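This difference between local and non-local latencies is what data placement tries to exploit. As a rough illustration (the notation below is ours and is not used in the thesis; Chapter 2 defines its own performance metrics), suppose a fraction $f$ of a processor's references are satisfied from its local memory with access time $t_{local}$, and the remaining references go to remote or global memory with access time $t_{remote}$. The effective memory access time seen by that processor is then

$$ t_{eff} = f \, t_{local} + (1 - f) \, t_{remote} . $$

With the TC2000 ratio quoted above ($t_{remote} = 4\,t_{local}$), a placement that satisfies 90% of the references locally gives $t_{eff} = 1.3\,t_{local}$, while one that satisfies only 50% locally gives $t_{eff} = 2.5\,t_{local}$.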
1.2 Shared Virtual Memory

Due to cost considerations, all computer systems use physical memories of only limited size. However, in order to allow application programmers not to be constrained by physical memory limitations, most existing operating systems provide them a virtual memory, which is much larger than the actual physical memory. Therefore, a processor references any data element by using its virtual address in the address field of the instruction. The maximum size of virtual memory is limited only by the number of bits in this address field and the swap area of the disk which stores data that cannot fit in main memory. Virtual memory systems use either paging, where the virtual memory is divided into units of equal size called pages, or segmentation, where the virtual memory is divided into units of unequal size called segments, or a combination of both [13]. Since most current multiprocessors use paging, we restrict our discussion to paging.

In any virtual memory system, a map from virtual memory to physical memory is maintained as shown in Figure 1.2. For example, in a paged virtual memory system, a page table stores the mappings from virtual pages to physical pages. The page table can be an indexed page table as shown in Figure 1.3, or an inverted page table in which case the mappings from physical to virtual pages are maintained. We restrict our discussion to the commonly prevalent indexed page table.

[Figure 1.2. Mapping from virtual memory to physical memory: each virtual unit is mapped to a physical unit.]

A large virtual memory implies a large number of virtual pages and consequently, a large number of mappings to be stored in the page table. Therefore, the page table might not fit in a single physical page, in which case a multi-level page table is necessary. All physical pages belonging to a page table are referred to as page directories. The page table base register stores the pointer to the first page directory page. During the virtual-to-physical address translation, the virtual address is split into various fields that provide the index into the different physical pages of the page table (Figure 1.3). In practice, the entire page table does not reside in physical memory. Rather, pages belonging to the page table are brought into physical memory on demand.

[Figure 1.3. Virtual-to-physical address translation using a three-level page table: fields of the virtual address index page directories 0, 1, and 2, starting from the page table base register, to obtain the physical page.]

Consulting a multi-level page table for each instruction containing a memory reference results in performance degradation. To alleviate the above problem, virtual memory systems usually use a fast associative memory called the translation lookaside buffer (TLB) that has a subset of the mappings in the page table. During address translation, the processor first consults the TLB and if the mapping corresponding to the virtual page is not in the TLB, then it consults the page table. The TLB can be organized to allow mappings of multiple applications as shown in Figure 1.4. If it contains the mappings of only a single application, then there is no need for the field containing the application's ID. In this case, the TLB needs to be flushed every time a processor switches from executing one application to another. Each entry in the TLB and the page table also contains flags such as dirty, which indicates whether the page has been updated, reference, which indicates whether the page has been referenced, and protect, which indicates the page's protection level.

[Figure 1.4. A TLB with address mappings of multiple applications: each entry holds an application ID, a virtual page, a physical page, and flags.]
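The translation path just described, a TLB lookup backed by a walk of the multi-level page table of Figure 1.3, can be sketched in C. The fragment below is only an illustration under assumed parameters (a three-level table with 10-bit indices and a 12-bit page offset); the type and function names (tlb_lookup, page_fault, and so on) are invented for the sketch and are not taken from any system cited in this chapter.

/* Illustrative three-level translation: 10 + 10 + 10 bit indices, 12-bit offset.
 * Names and field widths are assumptions made for this sketch only.            */
#define PAGE_SHIFT 12
#define INDEX_BITS 10
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

typedef struct {                                  /* leaf page table entry       */
    unsigned frame;                               /* physical page number        */
    unsigned valid : 1, dirty : 1, referenced : 1, protect : 2;
} pte_t;

typedef struct level2 { pte_t          pte[1 << INDEX_BITS]; } level2_t;
typedef struct level1 { struct level2 *dir[1 << INDEX_BITS]; } level1_t;
typedef struct level0 { struct level1 *dir[1 << INDEX_BITS]; } level0_t;

extern level0_t *page_table_base;                       /* page table base register */
extern int  tlb_lookup(unsigned vpn, unsigned *frame);  /* returns 1 on a TLB hit   */
extern void tlb_insert(unsigned vpn, unsigned frame);
extern void page_fault(unsigned vpn);                   /* allocate and map a frame */

unsigned long translate(unsigned long vaddr)
{
    unsigned vpn    = vaddr >> PAGE_SHIFT;
    unsigned offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    unsigned frame = 0;

    while (!tlb_lookup(vpn, &frame)) {
        /* TLB miss: split the virtual page number into three directory indices. */
        unsigned i0 =  vpn >> (2 * INDEX_BITS);
        unsigned i1 = (vpn >> INDEX_BITS) & INDEX_MASK;
        unsigned i2 =  vpn & INDEX_MASK;

        level1_t *l1  = page_table_base->dir[i0];
        level2_t *l2  = l1 ? l1->dir[i1] : 0;
        pte_t    *pte = l2 ? &l2->pte[i2] : 0;

        if (!pte || !pte->valid) { page_fault(vpn); continue; }  /* then retry     */

        pte->referenced = 1;
        frame = pte->frame;
        tlb_insert(vpn, frame);
        break;
    }
    return ((unsigned long)frame << PAGE_SHIFT) | offset;
}

The page_fault call in this sketch corresponds to the event defined below; the placement and replacement policies that resolve it are discussed next.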
It follows that the page replacement determines page placement when there are no free physical pages. Once the page fault is resolved, the TLB and the page table are updated to reflect the resulting change in the address map. Specific virtual addresses are assigned to the data and code of an application ex- eCUting on a uniprocessor that provides virtual memory. Extending this concept to a Parallel application executing on a multiprocessor, all data and code, including those Shared among processors, are assigned the same virtual address in all processors. Such an assignment allows the application to be written in the shared-variable program- ming model, in which processors synchronize and communicate by means of shared data mapped in a shared virtual memory or SVM. Therefore, the SVM offers the benefits of virtual memory and the ease of programming in the shared-variable model to the applications programmer. Further, placing the operating system’s code and data in the SVM allows better memory utilization by enabling pages to be mapped to remote or global memory instead of disk. It also facilitates easy distribution of work among processors by means of one or more shared queues of ready-to-run processes. SVM is one example of a process-interaction paradigm which defines the method by which processes of a parallel application can synchronize and communicate with each other. Other process-interaction paradigms are the remote procedure call (RPC), object-oriented systems, and data-unit [14]. Each of these approaches have their mer- its and demerits. The RPC mechanism can be used among heterogeneous machines but lacks support for shared data and cannot be extended easily into the asynchronous domain. Object-oriented systems provide application-level features, are free from some of the problems with SVM which we discuss in a later section, but are not appropriate for all applications. The data-unit approach is a combination of features derived from the SVM and the object-oriented systems. A data unit is a region of virtual memory that can be mapped into the address map of various processors. As in object-oriented systems, application-level operations on data units are supported which enable synchronization and communication among processes. In this thesis, we consider the SVM process-interaction paradigm and develop methods to efficiently provide such a paradigm in N UMA multiprocessors. SVM implementations exist for NUMA multiprocessors in which either all proces- sors can directly address all available memory [15, 16, 17, 18], or processors cannot directly address remote memory [19, 20, 21, 22, 23, 24]. The implementation issues in the two cases are different as we discuss below. When providing a SVM in a NUMA Illlultiprocessor with local and/or global memory, any virtual address (either in the kel‘nel or in the user virtual memory) can be mapped to a physical memory location in any of these memories. Let us first consider the case when each memory location has a unique physical address and all locations can be directly addressed in hardware by 31] processors. Each processor obtains the physical address corresponding to the virtInal address in a memory reference from its TLB or page table. It directly issues a Shared NUMA A. Request to C. Request to Virtual Physical Page Owner page Holder Memory Memory B. Response from D. Response "0"" Page Owner Page Holder Local Memory3 Local Memory2 Local Memory1 Figure 1.5. Implementation of a shared virtual memory on networked workstations. 
In this case, all processors can share a single operating system which resides in the kernel portion of the shared virtual memory. The data area of this portion includes information about each virtual page such as whether it is placed in physical memory or not, and the processors sharing it. The code area of this portion includes various operating system services such as that provided by the page fault handler. We now illustrate how a page fault is handled in this case.

For simplicity, let us assume that all processors share a single physical copy of the kernel virtual memory. For ease of explanation, we consider a page placement scheme which statically places any virtual page in the local memory of the processor that first incurs a fault for the page. Consider for example what happens when a processor P1 incurs the first page fault for virtual page VP1. P1 switches to the kernel mode and looks up the information about VP1 in the kernel data area. It determines the placement scheme and also that VP1 is not placed anywhere in physical memory. VP1 is then mapped to a physical page PP1 in the local memory of P1, the TLB and page table of P1 are updated to include this new mapping, and the information about VP1 is updated in the kernel data area. Assume that processor P2 now incurs a fault for VP1, and switches to kernel mode. On looking up the information for VP1 in the kernel data area, it determines the placement scheme and also that VP1 is already mapped in physical memory. Therefore, it updates its TLB and page table to include this mapping, and references PP1 remotely.

The method described above is inapplicable if all processors cannot directly address all available memory. We illustrate the basic idea behind the strategy used in those situations by considering the implementation of SVM on a network of workstations (Figure 1.5). In this case, processors do not share a single operating system. Each processor's operating system resides in its private virtual memory and is placed in its local memory. Again, we consider a static placement scheme and refer to the processor in whose local memory a given virtual page is placed as the virtual page's page holder. Further, each virtual page is assigned a static owner processor that stores information about the page in the data area of its operating system. In order to get information about a given virtual page, processors contact the page's owner. Page ownership schemes deal with issues such as determination of a virtual page's owner, and efficient storage and dissemination of information about virtual pages. For ease of explanation, we assume that the information about owners of all virtual pages is either stored or can be easily computed by all processors.

[Figure 1.5. Implementation of a shared virtual memory on networked workstations: a faulting processor first exchanges a request and response with the page's owner, and then with the page's holder.]

Let us consider the actions in this case for the page fault sequence mentioned earlier. Assume that virtual page VP1 is owned by processor P3. When P1 incurs a page fault for VP1, it contacts P3 to get information about VP1. As in the previous case, a physical page PP1 is allocated in the local memory of P1, and the TLB and page table of P1 are updated. However, the virtual page information is updated by P3, which records that P1 is the page holder of VP1. When P2 incurs a fault for VP1, it contacts P3 and learns that P1 is the page holder of VP1. P2 sends a memory request to P1, which services the request and sends a response. The TLB and page table of P2 are not updated, and any further references to VP1 by P2 are handled in a similar fashion because P2 cannot address the local memory of P1 directly. Therefore, requests to a page in remote memory are serviced using messages after obtaining the page's holder information from its owner.
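Both fault-handling sequences can be summarized in one handler. The C sketch below illustrates only the static, first-touch placement scheme used in this example; every name in it (vpage_info, ask_owner, map_page, and so on) is invented for the sketch and does not correspond to Mach, IVY, or any other system discussed in this thesis.

/* Illustrative fault handler for a static, first-touch placement scheme.      */
struct vpage_info {            /* kept in the data area of the page's owner    */
    int placed;                /* is the virtual page mapped anywhere yet?     */
    int holder;                /* processor whose local memory holds it        */
};

extern int  page_owner(unsigned vpn);            /* static mapping, known to all */
extern struct vpage_info ask_owner(int owner, unsigned vpn);
extern void tell_owner(int owner, unsigned vpn, int holder);
extern unsigned alloc_local_frame(int proc);
extern void map_page(unsigned vpn, unsigned frame);  /* update page table + TLB  */
extern unsigned remote_frame(int holder, unsigned vpn);
extern int  can_address_remote_memory(void);
extern void forward_reference_to(int holder, unsigned vpn);

void handle_page_fault(unsigned vpn, int self)
{
    int owner = page_owner(vpn);
    struct vpage_info info = ask_owner(owner, vpn);  /* a message if owner != self */

    if (!info.placed) {
        /* First fault anywhere: first-touch placement puts the page in the
         * faulting processor's local memory and registers it with the owner.   */
        map_page(vpn, alloc_local_frame(self));
        tell_owner(owner, vpn, self);
    } else if (can_address_remote_memory()) {
        /* Directly addressable case: install a mapping to the remote frame;
         * later references travel over the interconnect to the holder.         */
        map_page(vpn, remote_frame(info.holder, vpn));
    } else {
        /* Networked-workstation case: no mapping is installed; each reference
         * to the page becomes a request message sent to the page holder.       */
        forward_reference_to(info.holder, vpn);
    }
}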
We discussed the two methods of implementing a SVM for a static page placement scheme. When this strategy is applied to a shared page, the references to the page by all but one processor require non-local memory accesses, which are expensive in a NUMA multiprocessor. This problem is addressed by data placement strategies which are covered in the next section.

1.3 Data Placement

The goal of data placement strategies is to place data in the NUMA physical memory such that the time to access it is minimized. One well-known method to achieve this goal is data replication, which replicates data in the local memory of the referencing processor. However, with replication, it is necessary to guarantee some form of relationship between these multiple physical copies. This relationship is referred to as the consistency model and is used by the applications programmer to write applications that execute correctly. To enforce the consistency model, when a processor issues a write operation, either all copies are updated, referred to as the write-update (WU) policy, or all other copies are removed, referred to as the write-invalidate (WI) policy.

Most NUMA multiprocessors provide replication at the level of a block, which is the unit of data transfer at a given level of the memory hierarchy. Block-level replication is shown in Figure 1.6, where a virtual block containing the referenced data element is replicated in local memory. For example, NUMA multiprocessors with hardware cache consistency replicate the virtual cache line containing the referenced data element and keep the various physical cache lines consistent [2, 25, 3, 26, 27, 7, 17]. Others provide software mechanisms that replicate virtual pages containing the referenced data element and keep the corresponding physical pages consistent [28, 15, 22, 29, 18, 24].

[Figure 1.6. Block-level data replication: without replication the virtual block resides in one local memory; with replication on reference it is copied into the local memory of each referencing processor.]

The strictest consistency model is sequential consistency [30], which requires the execution of the parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. For example, when sequential consistency and the WU policy are provided, a write to any data element is not complete unless all copies of the data element have been updated. There are NUMA multiprocessors that provide sequential consistency in hardware [2, 25, 3, 26, 27, 6], and those that provide it in software [28, 15, 22, 18]. Due to the high latency and the potential for contention among memory requests, sequential consistency is expensive for large-scale multiprocessors with intermediate connectivity. Further, the strictest consistency model provided by sequential consistency is not always necessary to guarantee correct program execution. These factors have motivated research on weaker consistency models. Figure 1.7 compares sequential consistency with release consistency, a weak consistency model provided in hardware by the DASH multiprocessor [31].
Release consistency exploits the fact that applications typically use synchronization variables to ensure that a processor has updated a data element before other processors read it. Therefore, it guarantees consistency not on the completion of each write but only when a release operation is executed on a synchronization variable. Release consistency is provided in software by the Munin [24] and the SSVM [23] operating systems. Another weak consistency model is weak coherence [17], which guarantees that all copies of a data element are updated in the same order. Other weak consistency models proposed in the literature include [32, 33]. Weak consistency models offer performance improvement at the cost of either additional programmer effort in writing a parallel application or extra compiler effort in automatically parallelizing a sequential application.

    Processor 1                     Processor 2
    acquire(lock)                   while empty
    write A             (t1)        acquire(lock)
    write B             (t2)        read A
    empty = 0                       read B
    release(lock)       (t3)        release(lock)

    Sequential consistency guarantees consistency of all copies of A at t1
    and of all copies of B at t2. Release consistency guarantees consistency
    of all copies of A and B at t3.

    Figure 1.7. Consistency models.

The approach of improving performance by transferring data only during synchronization operations has also been used in Clouds [21], an object-oriented shared-virtual-memory implementation on networked workstations.

Certain other issues need to be addressed when providing page-level replication. The entry for a replicated virtual page in each processor's TLB contains the physical page allocated in the processor's local memory. The problem of ensuring that the TLB entries in various processors for a given virtual page are consistent is referred to as the TLB consistency problem [34, 35, 36]. For example, with the WI policy, on a write operation, not only should the physical pages in other processors be deallocated, the corresponding TLB entry should be invalidated as well. Further, either all processors can share a single page table or the page tables can be replicated [37, 38]. In the latter case, page table entries in all processors for each virtual page need to be kept consistent, and this problem is referred to as the page table consistency problem [15].

Other than replication, another data placement technique used to minimize the data access time is data migration, which does not replicate but instead migrates the data element among the local memories of the processors that reference it. As in the case of data replication, most NUMA multiprocessors provide block-level migration as shown in Figure 1.8, where the virtual block containing the referenced data element is migrated among the local memories of the referencing processors. Interestingly, when caching and the WI policy are used, migration of a virtual cache line occurs on a write operation; however, several physical copies of the cache line exist during periods when it is only read and not written. Software-controlled page-level migration is provided by some of the NUMA multiprocessors lacking hardware cache consistency [39, 16, 19, 40, 18].

[Figure 1.8. Block-level data migration: the virtual block moves from one local memory to another as different processors reference it.]

Next, we discuss certain issues in implementing the data placement schemes. The information about each cache line which is required by hardware-controlled cache-line-level placement strategies is maintained in either a centralized or a distributed manner in hardware. Software-controlled page-level placement strategies also need to maintain information about each virtual page and make it efficiently available to all processors. For example, to maintain consistency of a page that is replicated, it is necessary to know which processors have its copy in their local memory. Similarly, for page migration, information about where the virtual page is currently placed in physical memory is needed. More sophisticated page placement strategies need additional information such as the history of read and write references to the various pages by different processors.
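Whatever the level, this bookkeeping amounts to a small record kept for every block. The structure below is a minimal sketch of such a per-page record; the field names and sizes are assumptions chosen to match the needs just listed (a copy set for consistency, the current holder for migration, reference counts for adaptive policies) and are not taken from PLATINUM, Mach, or any other cited system.

/* Illustrative per-page record for software page-level placement.            */
#define MAX_PROCS 64

struct page_record {
    unsigned long copies[MAX_PROCS / (8 * sizeof(unsigned long))];
                                   /* bitmap: processors holding a replica     */
    int      holder;               /* processor whose local memory holds the
                                      current copy; needed for migration       */
    int      owner;                /* processor that stores this record        */
    short    state;                /* e.g. UNMAPPED, READ_SHARED, WRITTEN      */
    unsigned reads[MAX_PROCS];     /* per-processor reference history used by  */
    unsigned writes[MAX_PROCS];    /* adaptive replication/migration policies  */
};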
When all available physical memory can be directly addressed by all processors, this information can be stored in the data area of the single operating system which is shared by all processors. When each processor cannot directly address remote memory, each virtual page's information is stored in the data area of the operating system of its owner.

1.4 Summary of Major Contributions

In this section, we summarize the major contributions of our thesis which aims to address the limitations of existing data placement techniques. We define a data placement strategy to be block-level if it places a data element in terms of the basic data block containing it. Therefore, cache-line-level and page-level strategies discussed earlier are all examples of block-level placement. We define the reference pattern of a data element or a block as a temporal order of the read and the write references originating from different processors to the data element or the block, respectively.

In order to identify the limitations of existing block-level placement strategies, we conduct simulations to study the relationship among page-level replication, hardware parameters, and the reference pattern of pages [41, 42]. Note that a block's reference pattern depends on the reference pattern of the data elements it contains. Also, the data elements contained in each block are determined by the layout of the data in the SVM. In our study, we vary hardware parameters related to replication, and we also vary the data layout to create virtual pages with different reference patterns for the same application. We conclude that the effectiveness of replicating a page depends on its reference pattern and also on hardware parameters. In addition, we found out that with proper data layout, the amount of sharing of the application's pages is reduced, and then the performance of the application with and without replication is almost the same because the two schemes are equivalent for a non-shared page. Hence, we conclude that proper data layout can simplify data placement. These results apply to other block-level strategies as well.

In related work, Bolosky et al. [43] conclude that page-level migration strategies need to be dependent on hardware parameters related to migration. Other studies (surveyed in [44, 45]) emphasize the need for a block's placement strategy to be dependent on its reference pattern. Previous studies have also made it straightforward to choose an appropriate block-level placement strategy once the block's reference pattern is known. For example, Bennett et al. [29, 46] identify different types of data objects that occur in typical applications, and propose placement schemes for each of these objects.
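The kind of per-block decision these studies arrive at can be stated in a few lines. The fragment below paraphrases the choices that Figure 1.9 illustrates next; the function name, the categories, and the read-to-write threshold are invented for this sketch and are not taken from the thesis or from the cited work.

/* Illustrative mapping from a block's observed reference pattern to a
 * placement choice.  The threshold of 10 is an arbitrary example value.      */
enum placement { PLACE_LOCAL, REPLICATE, SINGLE_COPY };

/* sharers: distinct processors referencing the block;
 * reads, writes: total read and write counts observed for it.                */
enum placement choose_placement(int sharers, unsigned reads, unsigned writes)
{
    if (sharers <= 1)
        return PLACE_LOCAL;     /* private block: place it with its only user  */
    if (writes == 0 || reads > 10 * writes)
        return REPLICATE;       /* read-only or read-mostly shared block       */
    return SINGLE_COPY;         /* actively read and written: keep one copy
                                   (for example in global memory) to avoid
                                   the consistency overhead of replication     */
}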
However, our study is unique in that we consider the influence of varying the layout of data in the SVM and also the relationship between replication and hardware parameters.

Figure 1.9 shows examples of how to choose placement strategies for virtual blocks based on their reference pattern. Here, a virtual block is denoted by b(p_i, p_k, ...), where b is its ID and each p_j denotes a processor j that references it. Private blocks 1(1), 2(2), and 3(N) are placed statically in the local memory of processors 1, 2, and N, respectively. Read-only block 4(1,2) is replicated in the local memory of processors 1 and 2. Block 7(2,N) is read more often than it is written, and is therefore replicated in the local memory of processors 2 and N. Blocks 6(1,2) and 8(1,2) are actively read and written by several processors, and are not replicated due to the high overhead of consistency. Block 6 is placed in global memory, while block 8 is placed in the local memory of processor 2. Therefore, the ideal placement strategy for a block can be decided once its reference pattern is known. However, techniques need to be developed to determine each block's reference pattern.

[Figure 1.9. Block-level placement in a shared-virtual-memory NUMA multiprocessor: private, shared read-only, and shared read-write virtual blocks placed across the local memories and the global memory.]

Our conclusion that proper data layout can simplify data placement is related to the problem of false sharing [47], which occurs when a virtual block contains data elements with different reference patterns. Data placement strategies are limited in their ability to place a falsely-shared block. For example, assume that block 7(2,N) in Figure 1.9 contains two data elements each of which is exclusively referenced by processors 2 and N, respectively. Replicating such a block leads to unnecessary consistency overhead, migrating it leads to a ping-pong effect, and placing it statically in a given processor's local memory results in non-local memory accesses for the other processor. Since the probability of false sharing is higher for larger blocks [48, 49, 50], the resulting performance loss is severe for NUMA multiprocessors providing page-level placement.

It is clear that to improve the performance of block-level placement, false sharing needs to be eliminated and the placement strategy should be tailored to the reference pattern of blocks. One approach is to address these issues without any help from the compiler or the applications programmer. For example, certain block-level placement strategies [15, 16] use the runtime reference history to adapt to changing reference patterns. Such schemes not only incur runtime overhead but are also unable to adapt soon enough to rapidly-changing reference patterns, thereby resulting in a non-optimal data placement. Further, they perform well only if the past reference history is an accurate predictor of future references. Other studies [50, 51] solve false sharing by providing block-level replication and a weak consistency model that updates only those copies of a data element that will actually be used, based on runtime information. However, these schemes either require additional hardware to store this information [50] or incur additional software time and space overhead [51].
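In the hypothetical C fragment below, the two counters are logically unrelated, but because they are declared together they normally fall on the same virtual page, so the page is falsely shared between processors 2 and N; the padded layout at the end is one simple way to separate them. PAGE_SIZE, NUM_PROCS, and the variable names are assumptions made for this illustration and do not come from the thesis.

/* Hypothetical fragment: the page containing both counters is falsely shared. */
#define PAGE_SIZE 4096
#define NUM_PROCS 16

int hits_p2;                    /* written only by processor 2                  */
int hits_pN;                    /* written only by processor N                  */

void count_hit(int self)
{
    if (self == 2) hits_p2++;   /* every write forces an invalidation, a        */
    else           hits_pN++;   /* migration, or a remote access of the page    */
}

/* One layout fix: pad each counter to a full page so that each processor's
 * counter falls on its own page and the false sharing disappears.             */
struct padded_counter { int count; char pad[PAGE_SIZE - sizeof(int)]; };
struct padded_counter hits[NUM_PROCS];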
The other approach to address the problems of block-level placement is to obtain help from the applications programmer or the compiler. Compiler optimizations [47] and program transformations [52] have been proposed to address the problem of false sharing of cache lines. LaRowe et al. [53] develop a parameterized page placement scheme which can be tuned by the applications programmer in order to be adaptive to hardware parameters and application characteristics. Both SSVM [23] and Munin [24] use programmer-specified information to solve the false sharing problem. They allow blocks to be falsely-shared and invoke placement operations only when they are actually required, as determined by the programmer-specified information. In addition, SSVM allows the programmer to place data in terms of objects rather than blocks, which we refer to as object-level placement.

Our approach to address the problems of block-level placement is by developing new techniques that assist data placement which can be incorporated in a compiler. These techniques reduce the runtime overhead and they also minimize the programmer effort involved in proper data placement. Our method for compiler-assisted data placement uses compile-time objects containing data with the same variable type and similar reference patterns. We design a compiler that uses our method, and we develop algorithms to determine the type and reference pattern of data. We also develop specific object-creation schemes for different types of data placement strategies, viz., the block-level and object-level types of placement.

Our object-creation scheme that assists block-level placement [54] creates objects that are not falsely-shared and that have temporal locality. We propose methods to implement these objects which ensure that blocks are not falsely-shared. Further, we use these objects to provide information required for proper placement, thereby reducing the runtime overhead of collecting this information. Our scheme that directs object-level placement creates objects based on the data exchange pattern [55, 56] and automatically generates all the programmer-specified information required by SSVM. We evaluate the cost of applying our schemes by deriving their time and space complexity.

We use experimental simulations to compare the performance of applications for the block-level placement (with and without compiler assistance) and the compiler-directed object-level placement strategies. We conclude that compiler-specified information about objects leads to a significantly better performance than that in the absence of such information. Further, the best performance is achieved when placement is directed rather than assisted by the compiler. Our work demonstrates that compiler-directed data placement is a promising approach to improve the performance of applications and the programmability of NUMA multiprocessors.
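To make the notion of a compile-time object concrete, consider a shared two-dimensional array that is partitioned among processes by rows. The C sketch below groups each process's rows into one page-aligned object so that no page holds data referenced by two different processes. It only illustrates the general idea; it is not the OCRP or OCDEP object-creation scheme, which are developed in Chapters 4 and 5, and all names and sizes in it are invented for the illustration.

/* Illustrative layout of compile-time objects for a row-partitioned array.   */
#define PAGE_SIZE  4096
#define N          1024
#define NUM_PROCS  16
#define ROWS_PER_P (N / NUM_PROCS)

/* Object p groups the rows that process p writes; aligning each object to a
 * page boundary ensures a page never mixes data from two objects.            */
struct object {
    double rows[ROWS_PER_P][N];
} __attribute__((aligned(PAGE_SIZE)));     /* GCC-style alignment, assumed here */

static struct object a[NUM_PROCS];         /* a[p] is referenced mainly by p    */

double *element(int i, int j)               /* element (i, j) of the logical array */
{
    return &a[i / ROWS_PER_P].rows[i % ROWS_PER_P][j];
}

Laying shared data out this way is what lets a block-level policy treat every page as if it had a single, simple reference pattern.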
CHAPTER 2

FACTORS AFFECTING DATA PLACEMENT

The primary aim of this thesis is to develop efficient data placement techniques, which minimize the data access time, when providing a shared virtual memory in a NUMA multiprocessor. As a first step toward achieving this aim, we conduct a study [41, 42] to identify the limitations of existing data placement strategies, most of which provide block-level placement. We present the results of our study in this chapter, which is organized as follows. First, we provide a detailed overview of related work, which was done prior to our study and which motivates the goals of our work. Next, we outline our assumptions about the hardware features of the NUMA multiprocessor and the manner in which the shared virtual memory (SVM) is provided. After presenting our work on page replacement, we outline the methodology, performance metrics, and workload of our study. Next, we discuss our experiments in detail, and present our conclusions.

2.1 Related Work

The issues of providing a SVM have been studied both for NUMA multiprocessors in which all processors can directly address all available memory and for those in which each processor cannot directly address another processor's local memory. Li [28] was the first to propose the idea of providing a SVM for a NUMA multiprocessor belonging to the latter category. SVM is also provided by the Mach operating system [57], which has been ported to a wide variety of paged uniprocessors and multiprocessors. Studies about SVM which were done prior to our study considered block-level data placement using either replication or migration. Further, in the case of block-level replication, studies differed in the consistency model used and in whether the write-invalidate (WI) or the write-update (WU) scheme was used to maintain consistency.

Data Replication

First, we discuss studies that considered block-level data replication and maintained sequential consistency using the WI scheme. IVY [28, 20] is a user-level SVM, which is implemented for a network of Apollo workstations. It uses several page ownership schemes such as the single-owner, statically-determined multiple-owner, and dynamically-determined multiple-owner schemes. The port of Mach to workstations interconnected by a token ring [58] accommodates multiple page sizes and heterogeneous architectures. It also uses the single-owner and multiple-owner page ownership schemes, with fault tolerance being implemented for the former. Mirage [22] is a kernel-level segmented-page SVM, implemented for three VAX 11/750s networked by an Ethernet. It uses a lazy sequential consistency model which introduces a delay before a write request to a segment shared by multiple readers is granted. PLATINUM [15] is a SVM implemented for a BBN Butterfly Plus. It selectively replicates pages depending on the number of times the WI scheme is invoked for each page. Further, it uses a directory-based page ownership scheme, and replicates page tables and keeps them consistent by interrupting processors that are actively using the updated page table entry. DUnX [59, 18] is a SVM implementation for BBN's GP-1000 [60] and provides several page-level replication schemes. While these studies considered sequential consistency, others provided a weaker consistency model instead. For example, Bisiani et al. [61] used trace-driven simulations to study their weak coherence model maintained using the WU scheme.
They concluded that full replication with a large number of processors is expensive and advocated selective replication instead. Weak coherence is provided in hardware by the PLUS multiprocessor [17], which also supports various hardware synchronization operations that enable applications to be written for this consistency model.

Data Migration

Next, we discuss studies which considered block-level data migration instead of replication. In the port of Mach on the ACE [16], a page is migrated to the local memory of the processor that references it; after a selective number of such migrations, the page is placed statically in global memory. Further, kernel pages are neither replicated nor migrated and are placed in global memory. Black et al. [39] used trace-driven simulations to study competitive page-level migration algorithms, which require hardware reference counters for implementation. Page-level migration is also provided in the implementation of a SVM for an iPSC/2 by Li and Schaefer [19]. Mizrahi et al. [62] used trace-driven simulations to study several block-level migration strategies that extend the memory hierarchy into the interconnect. Scheurich and Dubois [40] study page-level migration using page pivoting in a point-to-point mesh interconnect. DUnX [59, 18] also provides various page-level migration schemes. Different strategies to distribute the read-only pages of an application among the local memories of various processors have been studied in [63].

Other problems that need to be solved when providing a SVM in a NUMA multiprocessor have also been studied in the literature. For example, the TLB and the page table consistency problem has been studied in [34, 35, 36, 15]. Holland [37] studied three page table management schemes using software simulation of synthetic applications on a BBN Butterfly Plus. The study concludes that a single page table is a software bottleneck and a fully replicated indexed page table uses a lot of memory. Therefore, the best choice is a partially replicated indexed page table, which not only uses less memory but also performs almost as well as the fully replicated indexed page table. In a related paper [38], the most-recently-used page replacement scheme is considered, and the influence of the number of replacement daemons and their frequency of invocation, on typical applications, is studied.

Motivation and Goals of Study

Most of the previous work involves actual implementation, and performance is measured in terms of speedup or execution time. Such a performance measure applies to the hardware parameters of the specific multiprocessor in question. These studies do not compare the performance of a given placement strategy on different multiprocessors. Such a comparison is one of the goals of our study, and we realize this goal by characterizing each multiprocessor by hardware parameters that are related to the placement strategy in question. We also study page replacement schemes, which have not been addressed in detail by previous studies. Further, some of the block-level placement strategies proposed in previous studies are designed to be adaptive to the block's reference pattern. For example, PLATINUM [15] replicates pages depending on their invalidation frequency, and the port of Mach on the ACE [16] migrates pages only a certain number of times. Also, Bennett et al. [29, 46] propose data placement schemes that adapt to data objects of different types.
Since a block's reference pattern or a data object's type is determined by the layout of the application's data in the SVM, another goal of our study is to consider the interaction between the data layout and the data placement strategy. We realize these goals by using a combination of analytical and trace-driven simulation techniques, which allow great flexibility in parameter variation.

2.2 Assumptions

In this section, we outline our assumptions about the hardware features of the NUMA multiprocessor and also the manner in which the SVM is provided. We consider NUMA multiprocessors such as the BBN TC2000 which lack hardware cache consistency and in which main memory is organized as modules local to each processor. Further, all processors can directly address all available memory. We assume that all processors share a single operating system which is allocated in the kernel virtual memory. The code area of the operating system is replicated in local memory on reference, but there is a single copy of the data area. For example, all processors share a single copy of the data structure that stores information about the various virtual pages. We do not replicate the operating system's data because it is frequently updated by all processors, and the overhead of enforcing consistency might nullify the benefits of replication. Instead, we distribute this data evenly across all available local memory modules.

The data placement strategy we consider is software-controlled, page-level replication with sequential consistency maintained using either the WU or the WI scheme. We refer to the WU scheme alternatively as multicast, as its effect is the same as that of a multicast communication. We define full replication (FR) as the strategy that replicates a virtual page in the local memory of each processor that references it, and no replication (NR) as the case when virtual pages are not replicated. Further, we refer to the first physical page allocated to a given virtual page as the master, and each physical page allocated thereafter as a replica. Note that if the master is in remote memory relative to a processor issuing a write request, then both the master and the local copy need to be updated. We refer to such an update of the master copy as a remote master write. We do not use explicit page-level migration, but as discussed in the next section, pages are migrated during specific cases of page replacement.

We use two schemes for initial page placement: (1) fault processor placement, where the virtual page is placed in the faulting processor's local memory, and (2) modulo placement, where the virtual page is placed in the local memory of the processor given by (virtual page number) mod (number of processors). Page replacement has not been given much attention in existing studies, and it is important when using the FR scheme, which needs more memory; therefore, we study it in detail and postpone discussion of it until the next section. To enable fast address translation, each processor has its own TLB, and we use a single-hand clock scheme [64] for TLB replacement. Further, each processor has its own copy of an indexed page table, the pages of which remain allocated during the application's lifetime. In the FR case, each processor has its local copy of any virtual page, and the page's entry in the processor's TLB and page table also pertains to this local copy, and hence, TLB and page table consistency is not an issue.
On the other hand, in the NR case, all processors referencing a virtual page share its only physical page, and the page's entry in the TLB and the page table of all these processors pertains to this single physical page. The information in the TLB and the page table that is modified, and hence related to the consistency problem, consists of the dirty and the reference flags. Since our page replacement scheme does not use the reference flag, we consider only the dirty flag, which is used to decide whether a page, on being chosen to be replaced, needs to be written to the disk. We solve the consistency problem in the NR case by updating the dirty flag as follows. When a virtual page is written, the corresponding dirty flag is set in the TLB and the page table of only the processor issuing the write request. When a page is chosen to be replaced, it is written to the disk if the dirty flag is set in the TLB or the page table of any of the processors that referenced it.

We conclude this section by defining terms to denote some of the types of memory references that can occur during an application's execution. During the virtual-to-physical address translation, a processor might find the corresponding mapping in its TLB, referred to as a TLB hit; or it does not find the mapping in its TLB, but finds it in its page table, referred to as a page table hit; or it does not find it in either its TLB or its page table, referred to as a fault. We refer to a fault on a virtual page which is mapped in remote memory as a remote page fault (RPF). We assume that any physical page returned to the list of free pages is marked as a disk cache page, and refer to a fault on such a page as a disk cache fault (DCF). Such a disk caching scheme allows the page to be reused and prevents it from being copied unnecessarily from the disk. We refer to a fault on a virtual page which is not mapped anywhere in physical memory as a disk page fault (DPF).

2.3 Page Replacement

Efficient page replacement schemes are important when providing page-level replication, because more physical pages are used. This fact is particularly true for applications with a high degree of data sharing. In this section, we discuss our work on page replacement.

2.3.1 CLOCK Scheme

The first scheme we study is the adaptation of the CLOCK algorithm used in 4.3 BSD UNIX [64]. The CLOCK algorithm periodically resets each page's reference flag, thereby making it available for replacement. We conclude that it is preferred over the LRU algorithm [13] in 4.3 BSD UNIX mainly due to the hardware constraints of the VAX architecture rather than due to its superior performance. Since the CLOCK algorithm does not use the reference information for a page over a period of time, we do not study it further.

2.3.2 Software LRU Approximation Scheme

The next scheme we study is the adaptation of the software LRU approximation algorithm. The basic algorithm periodically shifts the reference bit of each physical page into the most significant bit of the page's reference counter, whose least significant bit is discarded (Figure 2.1). The page replacement scheme chooses the page with the least value of the reference counter.

[Figure 2.1. The software LRU approximation scheme: the hardware reference bit is periodically shifted into the most significant bit (bit 31) of a 32-bit reference counter, and the counter's least significant bit (bit 0) is discarded.]

We adapt this algorithm for page replacement with a NUMA physical memory as follows.
We maintain separate lists for the master and the replica pages, and for each page in these lists, we periodically update its reference counter and reset its reference bit. We also maintain a set of reference counters, one for each virtual page, and update these counters periodically using either the hardware reference bits or the reference counters of the corresponding master and replica physical pages. We allow all the update and the reset intervals to be variable design parameters. Therefore, the reference information for a virtual page is a combination of its local and remote reference information. In contrast, the adaptation of software LRU approximation in [59, 18] uses only local reference information. Software LRU approximation has the following drawbacks: (1) it is expensive because the reference counters need to be implemented in hardware for performance reasons, (2) the overhead of periodically updating all the counters and bits is high, and (3) in our experience, simulating its operation takes a lot of time.

2.3.3 LRU Scheme

Next, we study the adaptation of the LRU algorithm, which is known to perform the best for applications that exhibit locality. In the basic LRU algorithm, when a page is referenced, it is placed at the beginning of the list of active pages. The page replacement scheme replaces the page at the end of the list, which is the one least recently used. Clearly, implementing this algorithm in software is too slow to be practical, while a hardware implementation is too costly. However, even software LRU approximation needs a hardware implementation for performance reasons. Further, any approximation to LRU can at best perform as well as LRU, and therefore, LRU gives a best-case performance estimate. In our experience, it takes less time to simulate its operation than that of the software LRU approximation. Based on these considerations, in our study, we develop a new page replacement scheme based on LRU called LERN, an acronym for LRU Extension for Replicated NUMA memory management. We now describe how LERN works in the FR and the NR cases.

The state of any physical page is defined to be free, when it is not allocated to any virtual page, and is defined to be replica, when it is allocated for page-level replication and is used locally. Further, when a physical page is the master copy, its state is defined to be master_L, master_R, or master_L_R, depending on whether it is used locally, remotely, or in both places, respectively. Each processor maintains a free list (FL) that contains free pages, a remote list (RL) that contains master_R pages, and an LRU list (LRUL) that contains master_L, master_L_R, and replica pages (Figure 2.2). Note that in the case of NR, the LRUL has no replica pages.

[Figure 2.2. The LERN page replacement scheme: each processor maintains an LRU list (LRUL) of master_L, master_L_R, and replica pages, a remote list (RL) of master_R pages, and a free list (FL) of free pages.]

Tables 2.1 and 2.2 show the state transition diagram of a physical page when the LERN scheme is used for the FR and the NR cases, respectively.

Table 2.1. Page state transition diagram: LERN with full replication.
  Current State | Event | Next State
  free | Page replication | replica
  free | DPF | master_L
  free | DPF | master_R
  replica | Periodic replacement | free
  replica | Write invalidation | free
  replica | Demand replacement for page replication | replica (new)
  replica | Swap with master_R | master_L
  replica | Demand replacement for DPF | master_L
  replica | Demand replacement for DPF | master_R
  replica | Swap with master_R | master_L_R
  master_L | Periodic replacement | free
  master_L | Demand replacement for page replication | replica
  master_L | Demand replacement for DPF | master_L (new)
  master_L | Demand replacement for DPF | master_R
  master_L | Page replication on a remote node | master_L_R
  master_R | Periodic replacement | free
  master_R | No more replicas of the virtual page | free
  master_R | Demand replacement for page replication | replica
  master_R | Demand replacement for DPF | master_L
  master_R | Demand replacement for DPF | master_R (new)
  master_R | Page replication | master_R
  master_R | Local page fault | master_L_R
  master_L_R | No more replicas | master_L
  master_L_R | No local reference | master_R
  master_L_R | Page replication | master_L_R

Table 2.2. Page state transition diagram: LERN with no replication.

  Current State | Event | Next State
  free | DPF | master_L
  free | DPF | master_R
  master_L | Periodic replacement | free
  master_L | Demand replacement for DPF | master_L (new)
  master_L | Demand replacement for DPF | master_R
  master_L | Remote reference | master_L_R
  master_R | Periodic replacement | free
  master_R | Demand replacement for DPF | master_L
  master_R | Demand replacement for DPF | master_R (new)
  master_R | Remote reference | master_R
  master_R | Local page fault | master_L_R
  master_L_R | Remote reference | master_L_R
  master_L_R | No local reference | master_R

LERN for Full Replication

In the FR case, at the end of a reference to a virtual page by a processor, a corresponding physical page is found in the processor's LRUL. Further, each processor reorders its LRUL on every reference. When LERN is invoked, a page can be replaced either from the LRUL or from the RL. A page is replaced from the LRUL if it contains more than a minimum number of pages, which is equivalent to the LRU stack size. Otherwise, a page is replaced from the RL. If it is decided to replace a page from the LRUL, master_L_R pages are skipped as long as the number of replica and master_L pages is above a minimum value. If a master_L_R page is chosen as the candidate from the LRUL, it is transferred to the RL after changing its state to master_R, and the replacement policy on the RL is invoked. Otherwise, a replica or a master_L page is replaced from the LRUL. The RL replacement scheme replaces the page that has the least number of replicas, and uses a first-in-first-out rule when there is a tie. The first replica of the replaced page is made the new master, and in this case, the virtual page is effectively migrated.

LERN for No Replication

In the case of NR, a virtual page referenced by any processor might be allocated either in local or in remote memory. In the former case, the corresponding physical page is in the local LRUL; in the latter case, it is in the LRUL or the RL of another processor. To implement exact LRU, each processor, on every reference, needs to reorder the list (LRUL or RL) in which the page currently referenced is located. It is obvious that this reordering is costlier in the NR case than in the FR case. The other option is to reorder only in the case of local references and use a first-in-first-out ordering for remote references. When LERN is invoked, a page can be replaced either from the LRUL or the RL.
As in the FR case, a page is replaced from the LRUL if it contains more than a minimum number of pages, and from the RL otherwise. In the former case, master_L_R pages are skipped as long as the number of master_L pages is above a minimum value. If a master_L_R page is chosen as the candidate from the LRUL, it is transferred to the RL after changing its state to master_R, and the RL replacement scheme is invoked. Otherwise, a master_L page is replaced from the LRUL. The RL replacement scheme replaces the page that is used by the least number of remote processors. The TLB and the page table entries corresponding to the replaced page are invalidated in all these processors.

2.4 Method of Study

The technique of simulation has been used extensively to study the memory performance of both uniprocessors and multiprocessors. A given run of a simulator approximates the execution of an application for a certain data placement scheme such as cache-line-level replication with sequential consistency maintained using the WI scheme. One of the key advantages of simulation is the ability to study the effect of a wide range of parameters at a cost significantly lower than actual implementation. Two well-known simulation techniques are trace-driven simulation and execution-driven simulation [65]. In the former approach, the application is modeled by a global execution order of all its memory references, referred to as its address trace or just trace. Techniques for collecting such traces as accurately as possible have been developed (e.g., [66]). A trace-driven simulator simulates the actions that occur in response to each reference in this global execution order, one reference at a time. It does not account for changes that can occur to this global execution order itself due to the placement strategies that are being simulated.

In the case of an execution-driven simulator, the application is modeled not as a global execution order, but as a set of execution orders, one per processor, each of which contains an event for every instruction executed by the corresponding processor. A time counter is maintained for each processor and is initialized to zero. A given processor's time counter is incremented whenever an event from its execution order is processed. The counter is incremented by the amount of time taken for this event's execution, which depends on whether the event is a memory reference or not, and, in the former case, on the actions taken by the placement strategy. The simulator chooses its next event from the execution order of the processor that has the minimum value of all the time counters. When the values in the time counters of processors coincide, it makes an approximation by making a random choice. Such a simulation allows a given processor's time counter to be incremented by the time taken for actions of the placement strategy which are in response to an event from this processor's execution order. However, other processors also participate in some of these actions, such as invalidations that occur when the WI scheme is used to maintain consistency of replicated data. In order to incorporate the time for these actions in the time counters of the other processors, the time at which these actions took place relative to each of these processors is needed, and only approximate values can be assumed for such cases. Therefore, an execution-driven simulator is only able to approximately incorporate the changes in the global execution order due to placement strategies.
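To make the event-selection mechanism concrete, the following sketch (in C++) shows the core loop of an execution-driven simulator as just described: each processor has its own execution order and time counter, the processor with the minimum counter is chosen next, and ties are broken randomly. The Event type, the cost model, and the function names are illustrative assumptions; they are not code from [65] or from our own (trace-driven) simulator.

    #include <cstddef>
    #include <limits>
    #include <random>
    #include <vector>

    struct Event { bool is_memory_reference = false; };

    // One execution order and one time counter per processor.
    struct Processor {
        std::vector<Event> events;   // this processor's execution order
        std::size_t next = 0;        // index of its next unprocessed event
        double time = 0.0;           // its time counter, initialized to zero
    };

    // Stand-in cost model: the time charged for an event, including any
    // placement-strategy actions when it is a memory reference.
    double cost_of(const Event& e) { return e.is_memory_reference ? 5.0 : 1.0; }

    void simulate(std::vector<Processor>& procs, std::mt19937& rng)
    {
        for (;;) {
            // Find the minimum time counter among processors with events left.
            double tmin = std::numeric_limits<double>::infinity();
            for (const auto& p : procs)
                if (p.next < p.events.size() && p.time < tmin) tmin = p.time;
            if (tmin == std::numeric_limits<double>::infinity()) break;  // all orders consumed

            // Collect every processor at that minimum; ties are broken randomly.
            std::vector<Processor*> candidates;
            for (auto& p : procs)
                if (p.next < p.events.size() && p.time == tmin) candidates.push_back(&p);
            std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);

            // Process the chosen processor's next event and advance its counter.
            Processor& chosen = *candidates[pick(rng)];
            chosen.time += cost_of(chosen.events[chosen.next++]);
        }
    }

The corresponding trace-driven loop is simpler: it consumes a single global reference stream and needs no per-processor time counters.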
It is clear that execution-driven simulation is not only time-consuming, but also is not an accurate simulation of an application's execution. Further, it requires separate execution orders for each processor, and additional parameters such as the time taken to execute each type of instruction. The additional accuracy offered by such a simulation over trace-driven simulation comes at the cost of extra complexity, and is justified only if it is necessary to measure the execution time as accurately as possible. As discussed in the next section, we measure performance not by the execution time, but by a metric that needs only the number of memory references of each reference type. Due to these considerations, we use trace-driven simulations in our study. Independently, Bolosky et al. [43] also use a trace-driven simulator to study the relationship between page-level migration and hardware parameters when providing a SVM in a NUMA multiprocessor. They measure performance by the mean cost per reference, which is similar to one of our measures of performance. Further, they compare results obtained for simulations on traces which are generated by arbitrarily perturbing a given global execution order. They conclude that ignoring the perturbation caused by the placement strategies does not influence the results obtained.

Our simulator is written in Gnu C++ and its input parameters include the TLB size, the number of processors, the number of physical pages, the page size, the number of address bits, and the parameters related to the LERN scheme and the various schemes for the other problems involved in providing a SVM. Its output includes various statistical data such as the number of page replacements and the number of memory references for each reference type, which can be measured at various points of the simulation run.

2.5 Performance Metrics

Since the goal of data placement is to minimize the data access time, our key measure of performance is the effective memory access time (EMAT), which is the mean memory access time for the application averaged over all its memory references. The access time for a given memory reference depends on the type of the reference and the placement strategy used. For example, if a reference causes a disk page fault, then its access time includes the time to transfer the page from disk to memory. If the reference causes a remote page fault and page-level replication is used, the access time includes the time to allocate a page in local memory and the time to initialize the allocated page from the master, which is allocated in remote memory. The EMAT is given by:

\[ \mathrm{EMAT} = \frac{\text{Data consistency time} + \sum_{\text{reference type}} \bigl(\text{No. of references} \times \text{Access time}\bigr)}{\text{Total no. of references}} \tag{2.1} \]

As mentioned in the previous section, our simulator records the number of references that occur in each reference type category. We derive the access times taken by various types of references in terms of parameters that characterize the hardware and also those that represent the software overheads.

Table 2.3. Parameters.

  dfos | Disk fault kernel lookup
  dmc | Disk to memory copy
  inval | Write invalidation
  lern | LERN scheme
  local | Local memory access
  multicast | Write update
  ptl | Page table level
  remote | Remote memory access
  remwr | Remote master write
  rlc | Remote to local copy
  rpfos | Remote page fault kernel lookup

Table 2.4. TLB hit accesses.

  Page location | Access time
  Local | $t_{local}$
  Remote | $t_{remote}$
These parameters are listed in Table 2.3, and for each parameter p, $n_p$ represents the number of times it occurs and $t_p$ represents the time it takes. The expressions for the access times are enumerated in Tables 2.4 through 2.8. Table 2.9 lists the time taken to maintain consistency of data for all the memory references, where $t_{multicast}$ is normalized to $t_{remote}$.

Table 2.5. Page table hit accesses.

  Page location | Access time
  Local | $n_{ptl}t_{local} + t_{local}$
  Remote | $n_{ptl}t_{local} + t_{remote}$

Table 2.6. Remote page fault accesses.

  Replicate? | Master Locn. | Replica Locn. | LERN Replica? | Access time
  Yes | Remote | Local | No | $n_{ptl}t_{local} + t_{rpfos} + t_{rlc} + t_{local}$
  Yes | Remote | Local | Yes | $n_{ptl}t_{local} + t_{rpfos} + t_{rlc} + t_{local} + t_{lern}$
  Yes | Local | Local | NA | $n_{ptl}t_{local} + t_{rpfos} + t_{local}$
  No | Local | Local | NA | $n_{ptl}t_{local} + t_{rpfos} + t_{local}$
  No | Remote | Remote | NA | $n_{ptl}t_{local} + t_{rpfos} + t_{remote}$

Table 2.7. Disk cache fault accesses.

  Replicate? | Master Locn. | Replica Locn. | LERN Replica? | Access time
  Yes | Remote | Local | No | $n_{ptl}t_{local} + t_{dfos} + t_{rlc} + t_{local}$
  Yes | Remote | Local | Yes | $n_{ptl}t_{local} + t_{dfos} + t_{rlc} + t_{local} + t_{lern}$
  Yes | Local | Local | NA | $n_{ptl}t_{local} + t_{dfos} + t_{local}$
  No | Local | Local | NA | $n_{ptl}t_{local} + t_{dfos} + t_{local}$
  No | Remote | Remote | NA | $n_{ptl}t_{local} + t_{dfos} + t_{remote}$

Table 2.8. Disk page fault accesses.

  Replicate? | Master Locn. | Replica Locn. | LERN Master? | LERN Replica? | Access time
  Yes | Remote | Local | No | No | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{rlc} + t_{local}$
  Yes | Remote | Local | No | Yes | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{rlc} + t_{local} + t_{lern}$
  Yes | Remote | Local | Yes | No | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{rlc} + t_{local} + t_{lern}$
  Yes | Remote | Local | Yes | Yes | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{rlc} + t_{local} + 2t_{lern}$
  Yes | Local | Local | No | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{local}$
  Yes | Local | Local | Yes | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{local} + t_{lern}$
  No | Local | Local | No | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{local}$
  No | Local | Local | Yes | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{local} + t_{lern}$
  No | Remote | Remote | No | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{remote}$
  No | Remote | Remote | Yes | NA | $n_{ptl}t_{local} + t_{dfos} + t_{dmc} + t_{remote} + t_{lern}$

Table 2.9. Time to maintain consistency of data.

  Consistency scheme | Replicate? | Data consistency time
  WI | Yes | $t_{inval}n_{inval} + t_{remote}n_{remwr}$
  WI | No | 0
  WU | Yes | $t_{remote}t_{multicast}n_{multicast} + t_{remote}n_{remwr}$
  WU | No | 0

Our second performance measure is the parameter overhead, which is a measure of the influence of a particular parameter on the EMAT. By using Tables 2.4 through 2.9 in Eq. 2.1, the EMAT can be expressed as:

\[ \mathrm{EMAT} = \frac{\sum_{\text{parameters}} \bigl(\text{No. of times the parameter appears in the total access time} \times \text{Time for the parameter}\bigr)}{\text{Total no. of references}} \]

Therefore, the overhead of a particular parameter p is given by:

\[ \text{Overhead of } p = \frac{\bigl(\text{No. of times } p \text{ appears in the total access time}\bigr) \times \text{Time for } p}{\mathrm{EMAT} \times \text{Total no. of references}} \]

  DOSEQ S1  Initialize A
  S1 END DOSEQ
  DOSEQ S2  K = 1 TO m
    DOALL S3  I = 1 TO n
      DOALL S4  J = 1 TO n
        A(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
      S4 END DOALL
    S3 END DOALL
  S2 END DOSEQ
  DOSEQ S5  Output A
  S5 END DOSEQ

Figure 2.3. Application parallel iterative solver.

2.6 Workload

The workload for our simulations consists of synthetic traces as well as address traces of actual applications. We input synthetic traces of well-known reference patterns to the simulator, and test the validity of the simulator by comparing its output to the expected output. One of the actual application address traces used in our study is collected by executing the Particle-In-Cell benchmark on an Alliant emulator for an eight-processor configuration [67], and we refer to it as application PIC.
We also consider the class of applications which solve partial differential equations using iterative methods [68], and develop a program to generate the address trace for the case of a two-dimensional problem with a 5-point stencil, the typical code for which is shown in Figure 2.3. Here, each outermost loop iteration (corresponding to S2) uses the results of the previous iteration and hence must be executed sequentially. However, the iterations of the two inner loops (corresponding to S3 and S4) can be executed in parallel. Since this application involves communication only with the nearest neighbors, we refer to it as application near neighbor.

The inputs to the program that generates the address trace for application near neighbor include the array size, the number of processors, the virtual page size, the number of array elements per virtual page, and the layout of the array elements in the SVM. The address trace corresponds to a rectangular partitioning of the array elements among the various processors, each of which processes the elements allocated to it in a row-wise manner. We do not include the references corresponding to initialization, termination, and synchronization when generating the address trace, because for the purposes of our study, it is sufficient to record the read and the write references corresponding to the computation. We also assume the presence of an instruction cache and therefore ignore instructions.

We study three schemes for laying out the array A in the SVM, of which the row major and the column major schemes lay out the array elements in row major and column major form, respectively. We refer to the portion of the array which is assigned to each processor as a block, and the block major scheme lays out array elements that belong to different blocks in separate regions of the SVM. These schemes are illustrated for an example in Figure 2.4, where the number assigned to each array element is its virtual address (a short sketch of these mappings is given below).

[Figure 2.4. Different methods of data layout (column major, row major, and block major) for a 4 x 4 array partitioned into four partitions.]

We refer to the three address traces generated due to these three schemes as applications near neighbor/row major, near neighbor/column major, and near neighbor/block major, respectively. The behavior of the first two applications is similar, except that the latter incurs more page faults because the array elements are processed row-wise while the layout of the data is in column major form. Therefore, we do not consider application near neighbor/column major further in our study.

2.7 Experimental Results

Our objective is to study the interaction of page-level replication with hardware parameters and with page reference patterns created by various data layout schemes, and hence, we consider the FR and the NR placement strategies. Since replication is used to alleviate the high cost of accessing remote memory, we choose the ratio of the times to access remote and local memories (R) as a variable hardware parameter. Further, since replication needs more memory, we choose the number of pages allocated to the application as another variable hardware parameter. The number of pages is varied to cover both the low page range region, where there are a lot of disk page faults, and the high page range region, where increasing the number of pages further did not lead to an improvement in performance.
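Before turning to the individual applications, the sketch below (in C++) illustrates how the three data layout schemes of the previous section number the elements of an n x n array in the SVM. The function names, the zero-based indexing, and the block-dimension parameters are our own illustrative choices and are not taken from the trace-generating program itself.

    #include <cstddef>

    // Element numbering for an n x n array under the three layout schemes.
    // Indices i and j are zero-based; br and bc are the dimensions of one
    // processor's rectangular block, assumed to divide n evenly.

    std::size_t row_major(std::size_t i, std::size_t j, std::size_t n)
    {
        return i * n + j;                   // consecutive elements of a row are adjacent
    }

    std::size_t column_major(std::size_t i, std::size_t j, std::size_t n)
    {
        return j * n + i;                   // consecutive elements of a column are adjacent
    }

    // Block major: all elements of one processor's block occupy one contiguous
    // region of the SVM, so a virtual page rarely spans two blocks.
    std::size_t block_major(std::size_t i, std::size_t j, std::size_t n,
                            std::size_t br, std::size_t bc)
    {
        std::size_t blocks_per_row   = n / bc;                      // blocks across one array row
        std::size_t block_id         = (i / br) * blocks_per_row + (j / bc);
        std::size_t offset_in_block  = (i % br) * bc + (j % bc);
        return block_id * (br * bc) + offset_in_block;
    }

For the 4 x 4 array of Figure 2.4 partitioned into four 2 x 2 blocks, block_major(i, j, 4, 2, 2) places the four elements of each block at consecutive virtual addresses, which is why the block major layout yields virtual pages shared by at most the processors owning neighboring blocks.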
We use the applications PIC, near neighbor/row major, and near neighbor/block major to study various page reference patterns. We now outline the fixed parameters used in our experiments. We conduct detailed experiments and conclude that for our workload, the fault processor scheme is better than the modulo scheme for initial page placement, and also, the WU scheme is better than the WI scheme to maintain consistency. Therefore, we choose the fault processor scheme and the WU scheme as fixed parameters. The values for the other fixed parameters are listed in Table 2.10. These values are typical of the TC2000 multiprocessor and are all normalized to the local memory access time $t_{local}$, which is assumed to be 0.5 µs. We assume a disk access time of 20 ms and choose values for the software overhead to process various types of references, and that for the LERN scheme, based on the fact that the operating system's data is distributed evenly across all memory modules. The value of $t_{remote}$ for the multicast operation that enforces the WU scheme is based on the assumption that all remote updates can take place simultaneously, which might not always be true. As we will see, our results are in fact strengthened when this assumption, which favors the FR case, does not hold. We now discuss the results of simulations conducted with these fixed and variable parameters in terms of the performance metrics (EMAT and parameter overhead) defined earlier.

Table 2.10. Fixed parameters.

  Parameter p | $t_p$ (unit $t_{local}$)
  rpfos | 100
  dfos | 100
  rlc | 560
  dmc | 40000
  lern | 500
  multicast | $t_{remote}$

Application near neighbor/row major

Figures 2.5 through 2.8 show the parameter overhead versus the number of pages for R = 2 and R = 10, for both the FR and the NR cases. The parameter overhead graphs for the FR case can be divided into the low page range, the high page range, and the transition range regions. The overhead due to the LERN scheme, which is invoked in the low but not in the high page range region, causes the transition range region. The insignificance of this overhead in the NR case, because of infrequent page replacement, eliminates the transition range region in the corresponding graph. When the number of pages and R are both low (R = 2), disk page faults constitute the major overhead for both the FR and the NR cases. At high values of R (R = 10), remote accesses share the overhead with disk page faults for the NR case. However, disk page faults still constitute the major overhead for the FR case, as replication reduces the number of remote accesses. When the number of pages is high, disk page faults are the minimum possible for both the FR and the NR cases. In the case of FR, local memory accesses account for a major share of the overhead, which is desirable because our goal is to make the EMAT as close to $t_{local}$ as possible. Therefore, multicast and remote master writes constitute the major overhead, and their share is larger for higher values of R (R = 10). Remote accesses constitute the major overhead in the NR case for both high and low values of R.

[Figure 2.5. Parameter overhead (%) vs. No. of pages (FR/R=2/near neighbor/row major).]

The EMAT for the FR and the NR cases are shown in Figures 2.9 and 2.10, respectively.
The steep slope corresponds to the transition from the low page range region to the high page range region. When both the number of pages and R are low (R = 2), the FR case incurs many LERN invocations and remote page faults, and therefore, its EMAT is worse than that for the NR case. When R is high, these factors are overshadowed by the remote accesses in the NR case, and hence, the EMAT for the FR case is better than that for the NR case. When the number of pages is high, the EMAT graph flattens after a certain number of pages, the actual number being smaller for the NR case than for the FR case. Again, for R = 2, the reduction in the EMAT for the FR case from that for the NR case is small, while it is significant when R > 2. The best values of the EMAT occur in the high page range region for both the FR and the NR cases. The dominant parameters that affect the EMAT then are the remote accesses for the NR case and the remote master writes and the multicast for the FR case. At present, pages are migrated only during certain types of page replacement, which are absent in the high page range region. Explicit page-level migration can reduce the number of remote master writes and remote accesses, and therefore, reduce the EMAT. Further, efficient multicast algorithms can reduce the multicast overhead and reduce the EMAT for the FR case. These results show that the effectiveness of replication depends on hardware parameters such as R, the available physical memory, and the overhead of maintaining consistency.

[Figure 2.6. Parameter overhead (%) vs. No. of pages (NR/R=2/near neighbor/row major).]
[Figure 2.7. Parameter overhead (%) vs. No. of pages (FR/R=10/near neighbor/row major).]
[Figure 2.8. Parameter overhead (%) vs. No. of pages (NR/R=10/near neighbor/row major).]
[Figure 2.9. Effective memory access time vs. No. of pages (FR/near neighbor/row major).]
[Figure 2.10. Effective memory access time vs. No. of pages (NR/near neighbor/row major).]

Application near neighbor/block major

Figures 2.11 through 2.14 show the parameter overhead for R = 2 and R = 10, as the number of pages is varied, for the FR and the NR cases. As in the row major case, the overhead graphs for the FR case can be divided into the low page range, the high page range, and the transition range regions. The overhead due to page replacements is present when the number of pages is low and absent when it is high, and hence the transition range region in the FR case. The infrequent page replacements in the NR case eliminate the transition range region in the corresponding graph. When the number of pages is low, disk page faults constitute the major overhead for both R = 2 and R = 10, for both the FR and the NR cases. When the number of pages is high, for both the FR and the NR cases, disk page faults are at their minimum and it is desirable for the local access overhead to be high. Therefore, multicast and remote master write are the dominant parameters for the FR case, while remote access is the dominant factor for the NR case, for both R = 2 and R = 10. The EMAT for the FR and the NR cases are shown in Figures 2.15 and 2.16, respectively.
The steep slope corresponds to the transition from the low page range region to the high page range region. In all cases, the EMAT flattens in the high page range region after a certain number of pages; this number is smaller for the NR case than for the FR case. The difference between this number for the FR and the NR cases is higher for the block major scheme than for the row major scheme. Further, the EMAT for the NR case is better than that for the FR case, for all values of R and for the entire page range. The lower EMAT is because the block major scheme results in fewer shared pages and consequently, when the number of pages is high, the overhead for remote access in the NR case is less than that for multicast and remote master write in the FR case. Again, as in the row major case, the EMAT can be improved by explicit page-level migration and efficient multicast algorithms.

[Figure 2.11. Parameter overhead (%) vs. No. of pages (FR/R=2/near neighbor/block major).]
[Figure 2.12. Parameter overhead (%) vs. No. of pages (NR/R=2/near neighbor/block major).]
[Figure 2.13. Parameter overhead (%) vs. No. of pages (FR/R=10/near neighbor/block major).]
[Figure 2.14. Parameter overhead (%) vs. No. of pages (NR/R=10/near neighbor/block major).]
[Figure 2.15. Effective memory access time vs. No. of pages (FR/near neighbor/block major).]
[Figure 2.16. Effective memory access time vs. No. of pages (NR/near neighbor/block major).]

Layout of Data: Row Major vs. Block Major

The row major scheme creates virtual pages with a higher degree of sharing than the block major scheme and hence needs more memory for the FR case. Further, in the address trace for the row major scheme, the leftmost processor references a given row's virtual page first. This fact, combined with the fault processor scheme, leads to more pages being used by the leftmost processor. Therefore, application near neighbor/row major needs more memory for the NR case also. On the other hand, with the block major scheme, processors access virtual pages within their block, except when processing the border array elements, and consequently need fewer pages for both the FR and the NR cases. Therefore, in both cases, the transition from the low page range region to the high page range region occurs at fewer pages with the block major scheme than with the row major scheme. The best EMAT for the row major scheme occurs for the NR case when R = 2 and for the FR case when R > 2.
On the other hand, for the block major scheme, the best EMAT occurs in the NR case for all values of R. When R = 2, the remote access overhead is lower for the block major scheme when compared to that for the row major scheme. When R > 2, the remote access overhead for the block major scheme is less than the remote master write and multicast overhead for the row major scheme. Therefore, the best EMAT for application near neighbor occurs with the block major and the NR schemes, and in addition, this combination uses the least amount of memory. These results underscore the fact that proper layout of data in the SVM can simplify data placement.

Application PIC

Figures 2.17 through 2.20 show the parameter overhead for R = 2 and R = 10, as the number of pages is varied, for both the FR and the NR cases. For both high and low values of R in the FR case, disk page faults constitute the major overhead in the low page range region, and multicast and remote master writes constitute the major overhead in the high page range region. On the other hand, in the NR case, remote accesses constitute the major overhead for all values of R and in the entire page range, indicating a high degree of sharing in application PIC. The EMAT for the FR and the NR cases are shown in Figures 2.21 and 2.22, respectively. The slope corresponds to the transition from the low page range region to the high page range region. The EMAT graph flattens in all cases after a certain number of pages, the number of pages being fewer in the NR case than in the FR case. The fact that more pages are needed in the FR case concurs with the earlier statement that there is a high degree of sharing in application PIC. While the EMAT for the NR case is better than that for the FR case by a very small margin when R = 2, when R > 2, the EMAT for the FR case is better than that for the NR case. Therefore, we conclude that the sharing in application PIC is mostly-read, as otherwise, the overhead of multicast would have made the EMAT for the FR case much higher. In summary, when R is high, FR can improve the EMAT for an application with a high degree of mostly-read sharing.

[Figure 2.17. Parameter overhead (%) vs. No. of pages (FR/R=2/PIC).]
[Figure 2.18. Parameter overhead (%) vs. No. of pages (NR/R=2/PIC).]
[Figure 2.19. Parameter overhead (%) vs. No. of pages (FR/R=10/PIC).]
[Figure 2.20. Parameter overhead (%) vs. No. of pages (NR/R=10/PIC).]
[Figure 2.21. Effective memory access time vs. No. of pages (FR/PIC).]
[Figure 2.22. Effective memory access time vs. No. of pages (NR/PIC).]

Conclusions

Our experiments show that whether to use page-level replication or not must be determined based on hardware parameters related to replication. While full replication does not provide a substantial performance improvement over no replication when remote memory accesses are relatively inexpensive, it improves performance when they are expensive. Further, since full replication needs more physical memory, when it is used for an application with a lot of shared data, insufficient physical memory can lead to performance loss due to disk page faults. When data is fully replicated and sufficient physical memory is available, the performance is limited by the time taken to keep data consistent. In essence, the choice of the data replication strategy should be adaptive to hardware parameters such as the remote memory access time, the available physical memory, and the time it takes to make data consistent.

Our experiments also show that a page's reference pattern determines whether it should be replicated or not. When the sharing of virtual pages is low (application near neighbor/block major), performance is the same with and without replication. On the other hand, when the sharing is high (applications PIC and near neighbor/row major), the best performance is achieved when pages are fully replicated. Further, the amount of sharing determines the physical memory needed to ensure that there are no disk page faults. In the presence of sufficient physical memory, the performance of replication is determined by the overhead of maintaining consistency, which is directly related to the amount of sharing of a page.

Further, we find that the layout of data in the SVM influences which placement strategy provides a better performance. A data layout strategy that reduces the sharing of pages can eliminate the need for complex placement strategies. With proper layout of data, our experiments show that the performance of an application with and without replication is the same. We therefore conclude that proper data layout can simplify data placement. In related work, Bolosky et al. [43] have concluded that page-level migration schemes need to be adaptive to hardware parameters that are related to migration. Also, LaRowe et al. [53] have developed a parameterized page placement scheme which can be adapted to hardware parameters and application reference characteristics by varying parameters.

2.8 Summary

In this chapter, we presented the results of our study which aims at identifying the limitations of existing block-level placement strategies. We establish that for performance reasons, page-level replication strategies must be adaptive to certain hardware parameters of the NUMA multiprocessor and also to the page's reference pattern. These results apply to other block-level placement strategies as well. It is tedious for the applications programmer to obtain by trial and error the placement scheme that performs the best for a given application and a specific NUMA multiprocessor. Further, in the absence of help from the compiler or the applications programmer, the runtime overhead of adaptive placement schemes is high. Also, the results of our study demonstrated how the compiler can play a role in simplifying data placement by proper data layout in the SVM. These factors motivate our work on compiler-assisted data placement, which is covered in detail in the following chapters.

CHAPTER 3

COMPILER-ASSISTED DATA PLACEMENT

In this chapter, we present our approach to compiler-assisted data placement when providing a shared virtual memory (SVM) in a NUMA multiprocessor. First, we motivate the need for the compiler to assist in placing data in memory, and then demonstrate that such help is necessary at all levels of the memory hierarchy.
In order to assist data placement, the compiler needs information about the reference pattern of the SVM, and therefore, we next identify the various factors that determine this reference pattern. Then, we present our design of a compiler that assists data placement by using information about these factors. Next, we present our algorithms to determine this information, and finally we conclude by discussing related work. 3.1 Motivation Proper data placement is important to reduce the performance degradation due to non-local memory accesses when providing a SVM in a N UMA multiprocessor. Data placement can be either static or adaptive depending on whether a fixed strategy is adopted or not. For example, a static page-level placement strategy replicates any referenced page while an adaptive strategy might replicate a page depending on the number of processors sharing it. Studies on block-level placement schemes have concluded that in order to achieve a high performance, such schemes must be 59 60 adaptive to related hardware parameters [43, 41, 42, 53] and also to the reference pattern of the block in question [29, 46, 41, 42, 53]. Previous studies have made it straightforward to choose an adaptive block-level placement strategy once the block’s reference pattern is known. For example, Bennett et al. [29, 46] identify different types of data objects that occur in typical applications, and propose adaptive placement schemes for these objects. A read-only object is only read and never written, a mostly- read object is read more often than it is written, and a migratory object is shared by various processors in a round-robin fashion. Read-only and mostly-read objects can be replicated while migratory objects can be migrated. Such adaptive placement schemes can be implemented only if the reference pat- tern is known, and this information can be gathered either at compile-time or during runtime. Examples of the latter case are page-level placement schemes reported in [15, 16] which maintain each page’s reference history. In addition to space overhead, such schemes also incur time overhead because they frequently re-evaluate their de- cisions in order to ensure proper placement of pages whose reference patterns change rapidly. Such frequent re-evaluations, however, cause unnecessary overhead for pages whose reference pattern remains unchanged or changes infrequently. Further, these schemes can make proper placement decisions only if the past reference history is an accurate predictor of future references. LaRowe et al. [69] demonstrate that these schemes can benefit from hints in an application about changes in page reference pat- terns. Further, our study [41, 42] shows that simple reference patterns can simplify placement schemes. It is difficult for the applications programmer either to ensure that the reference patterns are simple or to provide hints about the reference pattern. It is therefore important to develop techniques that enable the compiler to achieve these objectives, when possible. Another problem that limits the performance of block-level placement strategies is false sharing of blocks. We shall illustrate the problem by considering the application shown in Figure 3.1, which we use for illustration throughout this chapter. Assume that p processors are allocated to the application and that loop 51 is parallelized by allocating consecutive sets of iterations to these processors. 
Further, the elements of 61 Initialization DU 5} I 3 1 T0 n A(I) - 3(1) + 0(1) .31 END D0 D032I=1T0m AC1) - 0(1) + E(I) 52 ENDDO no S3 I - 1 r0 q F(K(I)) = 6(1) + H(I) 53 END D0 Termination Figure 3.1. An example of a sequential application. the array variables are laid out consecutively in the SVM. Then, for certain values of the block size 3, it is possible for data elements which are exclusively accessed by different processors to reside in the same block (cache line or page), thus making it falsely-shared. For example, false sharing of certain blocks occurs when n = 1000, p = 10, s = 32 bytes, and each data element occupies four bytes, because each processor exclusively accesses 100 consecutive elements and a block contains eight elements. Block-level placement strategies consider every processor that references at least one of the elements of a given block as one that shares the entire block. Hence, they consider all data elements of a falsely-shared block to be actually shared by proces- sors referencing any of these data elements. Consider the situation when such a block is replicated and sequential consistency is maintained using the write-update (WU) scheme. On a write operation to a given data element, all copies of the block contain- 62 ing this data element are updated, even though some of the processors owning these copies will never reference this element. Though weaker consistency models such as release consistency reduce the frequency of updates, they also incur the performance loss due to unnecessary updates of falsely-shared blocks on a release operation. A write—invalidate (WI) scheme on the other hand causes unnecessary invalidations and subsequent misses. Alternatively, if this block is migrated among the local memories of interested processors, there is a ping-pong effect due to repeated invalidations and transfers. Instead, if the block is placed statically in a given processor’s local memory, every other processor referencing any of its data elements incurs the cost of accessing it remotely. Studies on caching in multiprocessors [48, 49, 50] have found that as the size of the cache line is increased, the number of cache misses decreases up to a certain line size, after which it increases. The initial reduction in cache misses is credited to prefetching facilitated by a larger block, while the subsequent increase is attributed to the higher degree of false sharing in a larger block. Since a page is much larger than a cache line, the potential of it being falsely-shared is correspondingly higher. Therefore, when NUMA multiprocessors lacking hardware cache consistency provide page-level placement schemes, it is important to eliminate the false sharing of pages, as rightly emphasized in [43, 69]. Our study [41, 42] also shows that proper layout of the data in the SVM reduces false sharing of pages and therefore simplies their placement. One can argue that the performance loss due to false sharing can be minimized by reducing the granularity of placement operations. For example, in the KSRl [6], each page is divided into several sub-pages, and the hardware provides replication with sequential consistency at a sub-page-level. The granularity of placement oper- ations can also be reduced for software placement schemes. This reduction can be achieved either using hardware features for locking sub-pages such as that provided by the IBM’s RISC System/ 6000 [70] or by maintaining software flags for each sub- page. 
However, the reduced block size in such a sub-page approach diminishes the benefits of prefetching. The reason large blocks provide the benefits of prefetching and do not suffer false sharing in uniprocessors is that they contain data elements which are all referenced by a single processor. Similar results can be achieved in the case of multiprocessors if it is ensured that the blocks are not falsely shared, irrespective of their size. Our study [41, 42] shows that false sharing of blocks can be reduced or eliminated by proper layout of data in the SVM, but this cannot easily be done by the applications programmer. Hence, it is important to develop techniques that enable the compiler to do it.

Another approach to solving the false sharing problem is to allow blocks to be falsely shared, but to invoke placement operations only when they are actually required. For example, consider one of the blocks belonging to array A when the loop S1 of Figure 3.1 is parallelized as mentioned earlier. The data elements of this block are each exclusively written by a single processor. When such a block is replicated, it is sufficient to maintain consistency at the end of the loop, not for every write operation within the loop. In the case where data elements of a block are shared by different sets of processors, it is sufficient to maintain consistency of the data elements which are actually shared, not the entire block. SSVM [23] and Munin [24] solve the false sharing problem in this manner, but they require the applications programmer to specify information about the reference behavior of data objects. It would be better if this information were provided by the compiler instead.

These factors motivate our work on compiler-assisted data placement when providing a SVM in a NUMA multiprocessor. There is some additional compile-time overhead when applying our techniques that assist data placement. This overhead, however, is not as critical as the runtime overhead of adaptive placement schemes which have no help from the compiler or the applications programmer. In addition, the overhead can be amortized over several runs of the application. It is also compensated by the ease of programming for the applications programmer, who neither needs to lay out the data properly nor provide information about the reference pattern. Finally, since the compiler can provide reasonably accurate information about the reference pattern in most situations, there is a high potential for improvement in runtime performance.

3.2 Data Placement and the Memory Hierarchy

The problem of data placement in NUMA multiprocessors has been addressed both by hardware cache-line-level protocols and by software page-level placement strategies, which are provided at the cache and main memory levels of the memory hierarchy, respectively. In this section, we show that even when data is cached and hardware cache consistency is provided, proper page-level placement in the main memory is important, because it reduces the remote page accesses that occur during cache line replacements, cache misses, and certain consistency operations. Therefore, techniques for compiler-assisted data placement should provide assistance to place data at all levels of the memory hierarchy.

We now derive an expression for the performance degradation in the absence of page placement, when data is cached. For simplicity, we ignore main memory accesses for cache consistency operations and cache line replacements, and measure performance by the EMAT, which we defined in Chapter 2.
We ignore the time taken to service TLB hits and misses and, further, assume that there are no disk page faults. Therefore, a memory reference can result in a cache miss or a cache hit, and in the former case the cache line is fetched either from remote or from local memory, depending on where the corresponding page is located. We assume that the cache line is one word long, and we also do not consider the times involved in page replication or migration when the page is absent from local memory. Let t_cache, t_local, and t_remote be the times to access a single word in the cache, the local memory, and the remote memory, respectively, and let L = t_local/t_cache and R = t_remote/t_cache. Let h_c and h_l be the cache and the local memory hit ratios, respectively, and p be the number of processors allocated to the application. We measure the EMAT, normalized to t_cache. In the presence of page placement, a cache miss is serviced from local memory with a probability of h_l, and therefore the EMAT is given by:

   EMAT_with_page_placement = h_c + (1 - h_c)(L x h_l + R x (1 - h_l))        (3.1)

In the absence of page placement, a cache miss has equal probability of being serviced from the local memory of any of the processors. Therefore, the EMAT is given by:

   EMAT_without_page_placement = h_c + (1 - h_c)(L/p + R x (p - 1)/p)         (3.2)

The performance degradation in the absence of page placement, Δ(h_c, h_l), is given by:

   Δ(h_c, h_l) = (EMAT_without_page_placement - EMAT_with_page_placement) / EMAT_with_page_placement     (3.3)

Substituting Equations 3.1 and 3.2 in Equation 3.3, we get

   Δ(h_c, h_l) = (1 - h_c)(h_l - 1/p)(R - L) / [h_c + (1 - h_c)(L x h_l + R x (1 - h_l))]     (3.4)

Equation 3.4 shows the relationship of the performance degradation Δ(h_c, h_l) to the various parameters. For example, when there are no cache misses (h_c = 1), when the main memory access time is uniform (R = L), or when the local memory hit ratio equals the random placement hit ratio (h_l = 1/p), there is no performance degradation in the absence of page placement; Equation 3.4 gives a value of zero for Δ(h_c, h_l) in each of these cases. For a given value of h_l, Δ(h_c, h_l) increases as h_c decreases. For a given value of h_c, Δ(h_c, h_l) increases as h_l increases. As the number of processors increases, the choice made by the random placement scheme becomes worse, and Δ(h_c, h_l) increases. The larger the values of L and R, the higher the performance degradation.

[Figure 3.2: plot of the performance degradation Δ(h_c, h_l) (%) against the local memory hit ratio h_l (.90 to 1.0), with one curve for each cache hit ratio h_c = .90, .92, .94, .96, .98, for L = 4, R = 16, and p = 10.]
Figure 3.2. Performance degradation with caching and without page placement.

Figure 3.2 shows how Δ(h_c, h_l) varies as h_c and h_l are varied, for fixed parameters L = 4, R = 16, and p = 10. It is seen that even for high cache hit ratios (.90-.98), the performance degradation in the absence of page placement is quite high (20-80%). Our analysis does not include the additional memory access involved when a cache line needs to be replaced to service a cache miss. Such replacements occur for a fully utilized cache in the steady state, and the benefits of proper page placement will be even higher under such conditions. We conclude that proper page placement is important even when data is cached, and therefore techniques for compiler-assisted data placement should be designed to help place data at all levels of the memory hierarchy.
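The expressions above are easy to evaluate numerically. The following Python sketch is our own illustration (it is not the simulator used elsewhere in this thesis); it simply evaluates Equations 3.1 through 3.3 and reproduces the kind of values plotted in Figure 3.2.

def emat_with_placement(hc, hl, L, R):
    # Equation 3.1: cache hit, else a local (prob. hl) or remote (prob. 1 - hl) access.
    return hc + (1 - hc) * (L * hl + R * (1 - hl))

def emat_without_placement(hc, p, L, R):
    # Equation 3.2: a missed line resides in any of the p local memories with equal probability.
    return hc + (1 - hc) * (L / p + R * (p - 1) / p)

def degradation(hc, hl, p, L, R):
    # Equation 3.3, expressed as a percentage.
    with_p = emat_with_placement(hc, hl, L, R)
    without_p = emat_without_placement(hc, p, L, R)
    return 100 * (without_p - with_p) / with_p

# Parameters of Figure 3.2: L = 4, R = 16, p = 10.
for hc in (0.90, 0.94, 0.98):
    print(hc, [round(degradation(hc, hl, 10, 4, 16), 1) for hl in (0.90, 0.95, 1.0)])

For h_c = .98 and h_l = 1.0 the degradation is about 20%, and for h_c = .90 it is above 80%, which is roughly the range quoted above.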
3.3 Reference Pattern of the Shared Virtual Memory

Data placement strategies require information about the reference pattern of the various portions of the SVM. As a first step toward compiler-assisted data placement, we identify the various factors that determine this reference pattern. The application's code and data are allocated virtual space in the SVM. The portion of the SVM that contains the application's code is only read and never written. The area containing the application's data can be allocated statically or dynamically. It is not possible to determine at compile-time the reference pattern of the area which is allocated dynamically, and hence we do not discuss it further. The area allocated for statically-declared variables can contain array, scalar, and synchronization variables. The reference pattern of a synchronization variable depends on the method of synchronization. The various factors that determine the reference pattern of scalar and array variables are shown in Figure 3.3.

[Figure 3.3: the control partitioning and the data references in the code determine the data reference pattern; the data reference pattern, the layout of data in the shared virtual memory, and runtime factors together determine the reference pattern of the shared virtual memory.]
Figure 3.3. Reference pattern of the shared virtual memory.

Here, control partitioning refers to the distribution of the work of a parallel application among the various allocated processors. For example, loop S1 in Figure 3.1 can be control-partitioned by assigning to each processor a distinct set of the values taken by the loop index I. A data reference refers to any reference to a data element in the program. A data reference can be a scalar or an array reference; an example of the latter is A(I), which appears in loop S1. In this array reference, the value of I determines which element of array A is referenced, and since A(I) appears on the left hand side of the program statement, the corresponding element is written. On the other hand, the array reference B(I) in the same loop appears on the right hand side, and hence the element referenced by it is read. When the iterations of loop S1 are partitioned as mentioned above, each processor executes the instances of the array references A(I), B(I), and C(I) corresponding to its assigned values of I. The data reference patterns of array variables are influenced by these factors, and a similar discussion applies to scalar variables as well. Hence, control partitioning and the nature of the data references in the code together determine each data element's reference pattern and hence that of its virtual address.

The layout of data determines which data elements reside in a given block of the SVM, such as a page or a cache line. Therefore, the reference pattern of each block is influenced by that of the data elements it contains. In addition, it is influenced by various runtime factors, which include the replacement and placement schemes, the synchronization needs of the application, and contention for hardware resources such as the interconnect, the memory module, and the memory port. These factors do not change the number of read and write references in the user state by the various processors to each block in the SVM, but change only the order in which these references occur. It is difficult at compile-time to accurately predict and incorporate the perturbation of the reference pattern due to these runtime factors.
Also, we do not think that this perturbation is significant enough to warrant a change to the placement strategy of a block chosen on the basis of its reference pattern in the absence of this perturbation. Hence, we ignore these runtime factors when developing techniques that assist data placement.

In general, the nature of references to data elements is different in various parts of the program. Further, control partitioning is different for the various loops of the application. Hence, the data reference pattern changes with time during the execution of a parallel application, and so does the reference pattern of the SVM. For example, the reference pattern of the elements of array A can be different in loops S1 and S2 of Figure 3.1, for some combinations of the values of n and m and certain methods of control-partitioning the iterations of these loops.

3.4 Design of the Compiler

In this section, we outline our design of a compiler that assists data placement. We present our design by considering the various factors that influence the reference pattern of the SVM (Figure 3.3), and also other factors.

DOSEQ
   Initialization
END DOSEQ
DOALL S1 I = 1 TO n
   A(I) = B(I) + C(I)
S1 END DOALL
DOALL S2 I = 1 TO m
   A(I) = D(I) + E(I)
S2 END DOALL
DOSEQ S3 I = 1 TO q
   F(K(I)) = G(I) + H(I)
S3 END DOSEQ
DOSEQ
   Termination
END DOSEQ

Figure 3.4. An example of a parallel application.

Application Programmer Interface

The applications programmer writes the application either in a sequential language or in a parallel language. In the former case, the application is converted into a parallel application as done by parallelizing compilers [71, 72, 73, 74, 75]. In both cases, the parallel application consists of several code segments. We define a code segment to be sequential if it is executed by a single processor, and parallel if it is executed by more than one processor. A parallel code segment contains DOALL loops [73], in which all iterations are independent and can be executed in parallel and in any order without any synchronization, or DOACROSS loops [73], in which there are dependencies among different loop iterations. Barrier synchronization operations are inserted after each parallel code segment. For example, the sequential application in Figure 3.1 can be converted into the parallel application shown in Figure 3.4, where DOSEQ refers to a sequential code segment.

Control-Partitioning

As shown in Figure 3.3, in order to determine the reference pattern of the SVM and assist data placement, the control-partitioning information needs to be available at compile-time. In general, loop iterations can be partitioned either at compile-time (static scheduling) or during runtime (dynamic scheduling). Though dynamic scheduling policies such as self-scheduling [76], guided self-scheduling [77], trapezoidal self-scheduling [78], and factoring [79] may reduce processing time, they do not allow the data reference pattern to be determined at compile-time. With processors getting faster relative to memory, the time spent on memory accesses constitutes the major fraction of the execution time and must be reduced for good performance, particularly on NUMA multiprocessors. Also, Abraham and Hudak [80] control-partition iterative parallel loops at compile-time in order to reduce interprocessor communication, and show that the performance is better than guided self-scheduling. It is precisely for the same reason that compile-time data partitioning techniques, e.g., [81, 82], have been studied for message-passing NUMA multiprocessors.
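As a concrete, purely illustrative example of such compile-time scheduling, the following Python sketch assigns consecutive blocks of iterations of a DOALL loop to processors; the function name and the blocking rule are our own assumptions rather than a scheme prescribed by this thesis.

def block_partition(lower, upper, p):
    """Map each processor to a contiguous range of loop-index values,
    decided entirely at compile time (static scheduling)."""
    n = upper - lower + 1
    base, extra = divmod(n, p)
    ranges, start = [], lower
    for proc in range(p):
        count = base + (1 if proc < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges

# Loop S1 of Figure 3.4 with n = 20 and five processors:
print(block_partition(1, 20, 5))   # [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20)]

Because the assignment is a pure function of the loop bounds and the number of processors, the compiler can evaluate it and therefore knows exactly which processor executes each iteration.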
Therefore, we consider compile-time partitioning of the loop iterations. In our compiler that assists data placement, for the general case, the applications programmer specifies the control-partitioning of iterations for each parallel code seg- ment. This requirement is similar to the need for the applications programmer to spec- ify the data partitioning in languages such as FORTRAN -D [83] for message-passing N UMA multiprocessors. For specific cases, our compiler uses heuristics, which are derived from studies in the area of either parallel algorithms [84, 68] or compiletime control—partitioning of iterations [80]. 71 Data References In order to assist data placement, the compiler should also be able to resolve data references. We develop new algorithms as well as extend existing dependence analysis techniques (surveyed in [85, 86]) to resolve data references, as discussed in detail in the next section. It is not possible, however, to resolve data references that are functions of values which are unknown at compile-time. For example, the data references in the loops in Figure 3.1 cannot be resolved when the loop upper bounds n, m, and q are unknown at compile-time. Even if the loop upper bound q is known at compile-time, it is not possible to resolve the array reference F ( K (I )) in loop 53. Another condition under which the data reference pattern cannot be determined at compile-time is when processors execute code conditionally and the condition depends on the input data. An example is the QR factorization application in which processors choose a column of an array that has the maximum sum of the square of its elements. Note however that the applications programmer also cannot resolve these references which are dependent on values known only during runtime. We propose that the compiler try to resolve such references using profiling as in IMPACT [87]. If profiling does not help either, then the corresponding data reference pattern cannot be determined and consequently, the compiler cannot assist in placing data for related portions of the SVM. Temporal Variation of Reference Pattern We represent the reference pattern of a data element in a given code segment by the processors that read and those that write the data element in this code segment. We take care of the temporal variation of the reference pattern by using this representa- tion individually for each code segment. Assistance in Data Placement The algorithms we discuss in the next section can be used to determine the type and the reference pattern of the data in each code segment. We use this information to assist data placement by means of compile-time objects which contain data of the 72 program —» {code segment}+ code segment -> sequential code segment I parallel code segment sequential code segment —+ sequential code | placement directives parallel code segment -» Nested DOALL | Nested DOACROSS Figure 3.5. A model of a parallel application with placement directives. same variable type and similar reference patterns. These objects assist data place- ment at all levels of the memory hierarchy, and the manner in which they are created depends on the type of placement. In our thesis, we consider block-level placement and also object-level placement which is similar to that provided by SSVM [23]. Be- tween each pair of consecutive code segments, our object-creation schemes insert a set of placement directives which assist data placement and which are to be executed sequentially. 
The parallel application is thus transformed as shown in Figure 3.5. In the next two chapters, we discuss the object-creation schemes, their compile-time complexity, the runtime performance they offer, and other implementation issues, for the block-level and ob ject-level types of placement, respectively. If the compiler is unable to resolve some data references, it cannot determine the reference pattern of corresponding data elements. In this case, it specifies that it cannot assist in placing data for related portions of the SVM. Theinfluence of these unresolved references depends on the type of data placement, and we discuss it in the context of the block-level and ob ject-level types of placement in the next two chapters. The data elements which cannot be placed by the compiler need to be placed using the runtime reference history. The runtime overhead is still less than that in the absence of any form of compiler assistance. Therefore, as shown in Figure 3.6, our approach is for the compiler to assist data placement whenever possible, and leave the rest of the cases for the runtime mechanisms. The compilation process in our compiler that assists data placement is shown in Figure 3.7. 73 Shared Virtual Memory Dynamlc Program Variables Static Scalar Synchronization v - Compiler provides placement information Compiler provides no placement Information Figure 3.6. Placement information about the shared virtual memory. 3.5 Type and Reference Pattern of Variables Our object-creation schemes which assist the block—level and object-level types of placement need the type and the reference pattern of data elements. It is easy to determine by inspection whether a variable is of the array or the scalar type. The type of each synchronization variable is specified by the applications programmer for applications written in a parallel language. It is provided by the compiler if the parallel application is derived from a sequential version using parallelizing techniques. We consider the barrier, wait/signal and the lock types of synchronization. We determine the reference pattern of synchronization variables as follows. In general, 74 Sequential Application Parallellzing Techniques Parallel Application Control-Partitioning Control-Partitioning Heuristics Applications Programmer I Determination of Variable Type and Reference Pattern 1 Object-Creation l Parallel Application with Placement Directives Figure 3.7. Steps in compiling an application to assist data placement. barrier synchronization is used to ensure that the execution of the code before the barrier is completed before the start of the execution of the code after the barrier. For example, it is can be used at the end of a parallel code segment, in which case the corresponding synchronization variable is referenced equally by all processors. Wait / signal synchronization is used either to wait for an event to occur or to signal that an event has occurred. For example, it can be used to enforce dependencies in a DOACROSS loop, in which case the corresponding synchronization variable is referenced by the processors involved in the dependency in question. The lock synchronization variable is used to ensure mutual exclusion in accessing shared data, and is referenced by the processors that share the data elements it protects. Its reference pattern is the same as that of these data elements. 75 We now present our algorithms to determine the reference pattern of statically- declared array and scalar variables in a given code segment. 
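Before presenting the algorithms, it helps to fix a concrete representation for this information. The following Python sketch of a per-element record is our own illustration (the class and method names are assumptions, not an interface defined in this thesis); both algorithms below can be thought of as filling in such records.

class RefPattern:
    """Per-element reference pattern for one code segment: one read counter
    and one write counter per processor. Flags are recovered by testing a
    counter against zero."""
    def __init__(self, p):
        self.reads = [0] * p
        self.writes = [0] * p

    def record(self, proc, is_write, count=1):
        (self.writes if is_write else self.reads)[proc] += count

    def readers(self):
        return {q for q, c in enumerate(self.reads) if c > 0}

    def writers(self):
        return {q for q, c in enumerate(self.writes) if c > 0}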
Recall that the reference pattern of a data element is represented by the processors which read and those which write it. We record the information about the read references for each data element either by a flag or by a counter per processor. The information about write references is recorded in a similar fashion. A flag records whether there is a reference or not, while a counter records the actual number of references. It is straightforward to determine the reference pattern for scalar variables and for array variables with constant subscripts. When such references occur within the scope of the induction variable whose iterations are control-partitioned among the processors, the corresponding element is referenced by all processors. Further, the element is read or written depending on whether the reference appears on the right or the left hand side of the program statement. We now present algorithms to determine the reference pattern for other types of references.

Algorithm A

The first algorithm, which we refer to as Algorithm A, is shown in Figure 3.8. For our purposes, a scalar variable is a single-element array variable. Algorithm A computes the data reference pattern by determining the elements referenced at each point in the iteration space for each array reference. Though we present the algorithm for a perfectly nested loop, it can easily be generalized to loops that are not perfectly nested. Since it scans the entire iteration space, Algorithm A provides the reference pattern of each data element in terms of the actual number of read and write references to it by each processor.

FOR each array reference
   FOR each loop induction variable
      FOR all iterations
         /* The element referenced is determined by the subscripts and the current
            values of the loop induction variables. The processor that references the
            element is the one that executes the iteration corresponding to the current
            values of the loop induction variables. The reference is a read (write) if
            the array reference appears on the right (left) hand side. */
         update reference pattern of appropriate element
      END FOR
   END FOR
END FOR

Figure 3.8. Determination of reference pattern: Algorithm A.

We now illustrate how Algorithm A determines the data reference pattern of a given parallel code segment. We consider the loop S1 shown in Figure 3.4. Assume that n = 20 and that four consecutive iterations of the loop index I are allocated to each of the five processors. Each point in the iteration space corresponds to a given value of I. For the array references in loop S1, each such point writes an element of A, and reads an element of both B and C. The processor that executes this point of the iteration space is also the one that executes these write and read operations. Using these facts, Algorithm A determines that the read counters for elements 0-3 of arrays A, B, and C are (0 0 0 0 0), (1 0 0 0 0), and (1 0 0 0 0), respectively. Similarly, the write counters for elements 0-3 of arrays A, B, and C are (1 0 0 0 0), (0 0 0 0 0), and (0 0 0 0 0), respectively. It determines the counters for the other elements in a similar manner.
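A compact rendering of Algorithm A in Python may make the scan of the iteration space explicit. It is only a sketch under our own assumptions -- a singly nested loop, subscripts given as functions of the loop index, and the RefPattern record sketched earlier -- and not the implementation used for the experiments in this thesis.

def algorithm_a(iterations, owner, refs, patterns, p):
    """iterations: the values taken by the loop index (the iteration space).
    owner(i): processor executing iteration i (from the control partitioning).
    refs: list of (array_name, subscript_fn, is_write) array references.
    patterns: dict mapping (array_name, element) to a RefPattern, filled in here."""
    for i in iterations:
        proc = owner(i)
        for name, subscript, is_write in refs:
            key = (name, subscript(i))
            if key not in patterns:
                patterns[key] = RefPattern(p)
            patterns[key].record(proc, is_write)

# Loop S1 of Figure 3.4 with n = 20, five processors, four consecutive iterations each:
patterns = {}
refs = [('A', lambda i: i, True), ('B', lambda i: i, False), ('C', lambda i: i, False)]
algorithm_a(range(1, 21), lambda i: (i - 1) // 4, refs, patterns, 5)
print(patterns[('A', 1)].writes)   # [1, 0, 0, 0, 0]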
We now derive the time and the space complexity of Algorithm A. Let l be the depth of the loop, i be the maximum number of iterations of any loop induction variable, r be the number of data references, d be the maximum number of dimensions of any array variable, and p be the number of processors allocated to the application. We ignore the time spent in parsing, and assume that the data references have already been obtained. Therefore, the worst-case time complexity to determine the reference pattern of all data elements referenced in a given code segment is O(l x i x r x d). The space complexity for each data element is (2 x p) units, where the unit is a bit if we only need to store whether a processor reads or writes a data element, and the storage space required for an integer if we need to store the actual number of reads and writes. It is possible to reduce the space requirements by allocating space for only one variable and processing variables one at a time.

Algorithm B

Since the loop depth and the number of loop iterations appear in the time complexity of Algorithm A, it is computationally intensive for large values of these parameters. Our next algorithm, referred to as Algorithm B, has a lower time complexity than Algorithm A, and is motivated by the work done by Jeremiassen and Eggers [86]. In this work, they propose an algorithm to determine the per-process side-effect information, which represents the variables read and those written by each process, and is equivalent to our representation of the data reference pattern for each code segment. Their algorithm uses existing dependence analysis techniques [71, 72, 73, 74, 75] that are used to extract the control and data dependences of a sequential application, which is then parallelized such that these dependences are satisfied. We now provide the traditional definition of data dependence [73]. Consider two statements S1 and S2 in the control flow graph (CFG) of a sequential application. A data dependence exists between statements S1 and S2 with respect to a variable X if and only if

1. there exists a path in the CFG from S1 to S2 with no intervening write to X, and
2. at least one of the following is true:
   (a) flow: X is written by S1, and later read by S2, or
   (b) anti: X is read by S1, and later written by S2, or
   (c) output: X is written by S1, and later written by S2, or
   (d) input: X is read by S1, and later read by S2.

In general, in order to determine the data dependences in a sequential application, dependence analysis techniques compute the side-effect information for each statement Si, which is represented by the USE(Si) and MOD(Si) sets containing the variables it uses and those it modifies, respectively. Input dependences represent reads of the same variable and can be performed in any order; therefore, they can be safely ignored for automatic parallelization. The dependences between any pair of statements are determined by applying the intersection operation to each of the other three pairs of sets which can be derived from their USE and MOD sets. For example, if S1 and S2 are two statements in a sequential application, there is no data dependence between them if the sets MOD(S1) ∩ USE(S2), USE(S1) ∩ MOD(S2), and MOD(S1) ∩ MOD(S2) are all empty. On the other hand, if one or more of these sets is non-empty, then a data dependence exists between S1 and S2. As surveyed in [86, 85], dependence analysis techniques can also analyze statements containing procedure calls and, in addition, can provide the side-effect information of an array variable for either individual elements or a set of its elements, referred to as an array section. Jeremiassen and Eggers [86] develop an algorithm to identify the per-process control flow graphs, given the control flow graph of the sequential application.
They suggest that the per-process side-effect information can then be determined by ap- plying existing dependence analysis techniques to the control flow graphs of the var- ious processes. Their motivation in determining the side-effect information is to assist cache-line—level placement, but they do not mention how they plan to provide such assistance. Their work motivates us to use dependence analysis techniques in Algorithm B, our next algorithm to determine the data reference pattern. Unlike their approach, our algorithm determines the data reference pattern by incorporating control-partitioning in existing techniques that compute interprocedural side-effect information. Further, we demonstrate how the data reference pattern can be used t0 assist data placement by developing techniques [54, 55, 56] to assist both block- level and object-level placement. Our techniques can also be applied when the data 79 FOR each data reference IF scalar reference IF any applicable loop induction variable is control—partitioned set write (read) flag of all processors if reference occurs on left (right) hand side END IF ELSEIF array reference replace each subscript that contains a loop induction variable by the corresponding (lower_bound, upper-bound, step) information IF no loop induction variable is control-partitioned set each processor’s array section to (lower-bound, upper-bound, step) ELSEIF compute array section for each processor based on partitioning END IF add each processor’s section to its write (read) list if reference occurs on left (right) hand side END IF END FOR /* convert read and write list information for each array variable into element reference pattern */ FOR each array variable FOR each element FOR each processor IF element exists in read list set read flag END IF IF element exists in write list set write flag END IF END FOR END FOR END FOR Figure 3.9. Determination of reference pattern: Algorithm B. D0 Val WI; read l00p (Ode deter; 80 reference pattern is determined using other algorithms such as theirs. Since dependence analysis techniques have been traditionally used to extract par- allelism from sequential applications, they are conservative and do not always provide the exact side-effect information. For example, in the case of a conditional IF state- ment, these techniques assume that both branches are taken. Further, they sacrifice exactness in order to reduce the space required to store the side-effect information. For example, the side-effect information for an array variable is stored for its sections, not for its individual elements, thereby saving space at the cost of losing exactness. It is important to determine the reference pattern accurately in order to assist data placement. Therefore, in extending these techniques to determine the data reference pattern, we preserve the exactness of the data reference pattern at the cost of addi- tional space and time complexity. For example, in the case of conditional statements, we compute the data reference pattern separately for each possible condition. We derive Algorithm B by incorporating control partitioning in the algorithm for interprocedural side-effect analysis discussed in [85]. Algorithm B is shown in Figure 3.9; it is applicable irrespective of whether the loop in the code segment is perfectly nested or not, and it handles linear subscripts. Unlike Algorithm A, it does not scan all the points in the iteration space. 
It maintains a read and a write list per variable for each processor, which contain the elements of the variable that are read and written by this processor, respectively. It adds entries to these lists by processing the various data references in the code segment. Any scalar variable within the scope of a loop induction variable which is control-partitioned is assumed to be referenced by all processors. This variable is added to the read or the write lists of all processors depending on whether it appears on the right or the left hand side of a program statement, respectively. In the case of each array reference, appropriate sections of the corresponding array are added to the read and write lists of the various processors based on the control-partitioning of the loop induction variable in each of the subscripts. Once all the data references in the code segment are processed in this manner, each data element's reference pattern is determined by examining the read and write lists of the corresponding variable. In this manner, Algorithm B provides the reference pattern for each data element in terms of whether each processor reads or writes it; it does not provide the actual number of read and write references.

We now illustrate how Algorithm B determines the reference pattern for an example parallel code segment. Again, we consider the loop S1 shown in Figure 3.4. We assume that n = 20 and that four consecutive iterations of the loop index I are allocated to each of the five processors. Algorithm B adds A(1:4) to the write list for variable A of processor 0, A(5:8) to the write list for variable A of processor 1, and so on. Similarly, it adds B(1:4) to the read list for variable B of processor 0, B(5:8) to the read list for variable B of processor 1, and so on. The array reference C(I) is handled in a similar fashion. Algorithm B determines the reference pattern for each data element by looking up the entries in the read and the write lists of the corresponding variable. An optimization is to stop the lookup as soon as an entry which contains the element is found. In this example, it is determined that A(1) is written by processor 0 because the entry A(1:4) is found in the write list for variable A of processor 0. Similarly, it is determined that A(1) is not written by the other processors because no entry in their write lists contains it. The reference pattern of other data elements can be determined in a similar manner.

We now derive the time and space complexity of Algorithm B. Let r be the number of data references, d be the maximum number of dimensions of any array variable, e be the maximum number of elements of any array variable, and p be the number of processors allocated to the application. The time complexity of Algorithm B is O(r x d x p + e x p x r x d). It requires the same space as that needed by Algorithm A to store the data reference pattern. Some additional space, with a worst-case complexity of O(2 x p x r), is required to store the read and the write lists for the various processors. Note that Algorithm B executes faster but requires more space than Algorithm A.

3.6 Related Work

Related work falls in the areas of parallelizing compilers, partitioning, locality enhancement, prefetching, software-assisted cache coherence, and false sharing. We discuss each of these below.
Parallelizing Compilers: In order for our compiler to assist data placement, the parallelism in the application needs to be expressed using the DOALL and the DOACROSS constructs. When the application is written in a sequential lan- guage, we use dependence analysis techniques in existing parallelizing compilers [71, 72, 73, 74, 75] to identify and express the application’s parallelism using these con- structs. In addition, we extend these techniques by incorporating control-partitioning to develop algorithms which determine the data reference pattern. Further, our compile-time object-creation scheme which directs object-level placement uses these techniques to determine when it is necessary to make copies of each data element consistent. Partitioning: We control-partition the iterations of each parallel code segment and develop techniques to place data in the NUMA physical memory. Independently, Hu- dak and Abraham [88, 89] have used control-partitioning and studied data placement for a single loop construct, viz., stencil-based iterative data parallel loops with near- neighbor communication. An example of an application with such a loop construct is the application near neighbor which we considered in Chapter 2. Their algorithm to determine the reference pattern, the sharing characteristics they consider, and their solution to the false sharing problem are all specific to this loop construct. They develop compile-time techniques that classify the data in this loop construct into var- ious sets depending on the near-neighbor sharing characteristics. They demonstrate the performance improvement due to these techniques by measuring execution times for different methods of placing these sets at various levels of the memory hierarchy of the BBN TC2000. Their techniques for control-partitioning [80] in order to minimize interprocessor communication perform better than guided self-scheduling, and are ap- plicable to our work. However, as we discuss below, our method of compiler-assisted 83 data placement [54] differs from their work in several important aspects. Firstly, our approach is more general because we consider an entire application containing multiple code segments instead of just a single loop construct. Our al- gorithms to determine the data reference pattern are applicable to any DOALL and DOACROSS loops, whether perfectly nested or not. .Also, our definition of the data reference pattern covers all possible sharing characteristics and not those for a spe- cific loop construct. Our idea of creating compile-time objects using both the data reference pattern and the variable type is new. Further, these objects are created differently for various types of placement schemes. Our solution to the false sharing problem which uses these objects is not specific to a given application. Our approach can assist in placing the data of the entire application at all levels of the memory hi- erarchy in a shared-virtual-memory N UMA multiprocessor. It is applicable whether direct remote memory access is possible or not and also for both the block-level and the object-level types of placement. We also develop algorithms for addressing issues such as code generation, thereby enabling easy incorporation of our schemes in a compiler. Other studies on control-partitioning are also of interest to us. Control- partitioning of two-dimensional iterative spaces has been studied in the literature. Vrsalovic et al. 
[84] develop a model to measure the speedup of several partitioning methods for an architecture in which each processor has both local and global mem- ory. By modeling both message-passing and shared-memory architectures, Reed et al. [68] conclude that the overall performance depends on a complex interaction of three factors, viz., the stencil, the partition type, and the architecture. This result is related to the mapping problem [90] which involves mapping a parallel application to a parallel machine. The problem of obtaining an optimal mapping of a general parallel application to a general parallel machine is NP-complete [91]. In practice, an applications programmer obtains the best partitioning and mapping of an application for a given N UMA multiprocessor by trial and error. An alternate approach is to use heuristic techniques such as simulated annealing [92] which we used in an earlier work [93, 94] on the mapping problem. Brochard and Freau [95] have studied the influence 84 of various partitioning and data placement strategies for the matrix multiplication and finite element applications for the IBM RP3. Unlike our work, these studies do not consider compile-time techniques that assist data placement. The approach of control-partitioning followed by data placement is different from that used by compilers for message-passing NUMA multiprocessors also referred to as distributed-memory message-passing (DMMP) multiprocessors. These approaches differ in the process-interaction paradigm (shared-memory versus message-passing) but aim at achieving the same objective of minimizing the performance degra- dation due to non-local accesses. In compilers for DMMP multiprocessors e.g., [96, 97, 98, 99, 74, 83, 100, 81, 82], data is partitioned among the local memories of the various processors, and each processor owns the data in its local memory. All processors execute the same program and the work done by each processor is de- termined by certain rules. For example, one such rule specifies that the owner of each data element computes its value. A processor uses messages to get a copy of a data element which it does not own. Most of these compilers require the applications programmer to specify the data partitioning. They then automatically partition the work among the processors and generate each processor’s code as well as the inter- processor communication. Communication primitives are generated during runtime [98] for data references which cannot be resolved during compilation. Static per- formance estimators [100] and automatic data partitioning techniques [81, 82] have been developed. The issues involved in translating control-partitioned programs into data-partitioned programs are discussed in [101]. Locality Enhancement: Another approach to improve memory performance is to maximize the locality and reusability of data. Abu-Sufah et al. [102] study program analysis and transformations to improve reusability of data in uniprocessor paging systems. Gannon et al. [103] define a reference window for a dependence as the vari- ables referenced by both the source and the sink of the dependence. After executing the source of the dependence, the cache hit ratio can be improved by saving the as- sociated reference window in the cache until after the sink has been executed. They estimate sizes of reference windows and suggest that program transformations can Sc prc sche dete this and main false Noble lines i. 
t*0 cad 85 be used to enable these windows to fit in the highest level of the memory hierarchy. Other studies [104] also address storage size limitations. Wolfe [105] uses a technique called tiling, which divides the iteration space of a loop into blocks or tiles with a fixed maximum size. Performance is improved when the tile fits into the highest level of the memory hierarchy. Wolfe and Lam [106] also describe program transformations to improve locality. These techniques can be used in conjunction with our approach. Prefetching: The performance loss due to the latency of remote memory accesses has also been addressed using data prefetching which takes place during computation. For example, Cornish et al. [107] use compile-time analysis to determine the earliest point in the program when data can be prefetched, and evaluate the scheme using simulations. Hudak and Abraham [88, 89] also use compile-time techniques to over- lap computation with memory accesses for iterative parallel loops and demonstrate performance improvement by execution time measurements on the BBN TC2000. Software-assisted Cache Coherence: Studies on software-controlled caches are of interest to us because object-level placement requires the compiler or the applications programmer to specify data placement operations. Cheong and Veidenbaum [108, 109, 110, 111] and Cytron et al. [112] have proposed software schemes which keep caches consistent by using data dependence analysis to determine when cache lines need to be invalidated or written back to global memory. Our compile-time object-creation scheme which directs object-level placement also uses data dependence analysis to determine when it is necessary to make copies of each data element consistent. Unlike these schemes however, our method uses the data reference pattern in order to create and transfer objects between the local memories of the processors with the goal of maintaining data consistency as efficiently as possible. False Sharing: Finally, we summarize related work in addressing the false sharing problem. Torellas et al. [47] discuss several solutions to solve the problem for cache lines including scalar expansion and record expansion. These techniques are specific to cache lines, depend on the cache line size, and might fragment the cache lines. Lam et al. [113] study the problem in the context of block algorithms and suggest 86 copying non-contiguous data to consecutive locations, but they do not discuss how this process can be automated by a compiler. Eggers and Jeremiassen [52] apply program transformations to eliminate false sharing of cache lines for certain types of shared data. But, in the absence of coherent caches, their indirection scheme leads to an extra memory access. Also, their work does not address false sharing of cache lines containing data shared by different sets of processors. Another approach to solve the false sharing problem is to provide a weak consis- tency model which updates only those copies of a data element that will actually be used by using runtime information and programmer-specified synchronization opera- tions. Examples are the delayed consistency for cache-line-level replication [50] and the lazy release consistency for page-level replication [51]. While the former requires additional hardware, the latter incurs extra runtime overhead. False sharing has been also been addressed by using programmer-specified information about the shar- ing characteristics of data objects in SSVM [23] and Munin [24]. 
While SSVH requires the programmer to specify the exact reference pattern of all objects, Munin requires the programmer to specify the nature of references to an object and determines the exact reference pattern during runtime. Our approach is to reduce the runtime over- head and eliminate the programmer effort by determining the data reference pattern at compile-time when possible. As discussed in Chapter 5, we develop object-creation methods which provide the information required by systems such as SSVH. 3.7 Summary In this chapter, we motivated the need for the compiler to assist in placing data at all levels of the NUMA physical memory hierarchy. The compiler can do so by providing information about the reference pattern of the SVM. After identifying the factors that determine this reference pattern, we designed a compiler that provides this information by means of compile-time objects. We also developed algorithms to determine the information needed by our object-creation schemes that assist data placement. In the next two chapters, we discuss in detail our compile-time object- 87 creation schemes for the block-level and object-level types of placement, respectively. CHAPTER 4 COMPILER—ASSISTED BLOCK-LEVEL PLACEMENT In this chapter, we present our work on compiler-assisted block-level placement when providing a shared virtual memory (SVM) in a NUMA multiprocessor. First, we outline our object-creation scheme that creates compile-time objects containing the application’s data. We then discuss how these objects can be used to assist block- level placement. Next, we present solutions to the other issues that arise when as- sisting block-level placement using objects created by our scheme. We then derive the compile-time overhead of applying our scheme. We conclude by discussing the performance improvement offered by our scheme. 4.1 Compile-Time Ob ject-Creation In order to assist block-level placement schemes, false sharing needs to be eliminated, which can be done by ensuring that all the data elements contained in a given block have the same reference pattern. Further, it is necessary to reduce the runtime overhead of re-evaluating placement decisions in order to adapt them to changes in the reference patterns. The overhead can be reduced by simplifying and when possible, specifying the reference pattern of blocks. These considerations motivate our object-creation scheme which assists block-level placement by creating compile- time objects using the data reference pattern, and we call it OCRP, an acronym for 88 H6 89 Object Creation based on Reference Pattern. In contrast, the Standard method of Object Creation lays out the application’s entire data contiguously in the SVM without any regard to the data reference pattern, and we call it (JCS. Since we include the temporal variation in the reference pattern by determining each data element’s reference pattern separately for various code segments, we also create objects individually for each code segment by applying OCRP. We assume that runtime requests for virtual memory allocation are serviced from distinct heaps for different processors, as proposed in [47] to address the false sharing problem. We also assume that private and temporary variables are placed in the local memory of the only processor that references them, and concentrate on object-creation for code and statically-declared variables. The type of a variable can be scalar, array, or synchronization. 
The type of a synchronization variable can be barrier, wait/signal, or lock.

Heuristics for Object-Creation: OCRP creates objects using each variable's type and its reference pattern, which are determined as outlined in Chapter 3. It allocates each processor's code to a distinct object, and uses the following heuristics to create objects for statically-declared data.

H1 Do not allocate different types of variables to the same object.
H2 Do not allocate different types of synchronization variables to the same object.
H3 Allocate scalar variables with identical reference patterns to a single object.
H4 Allocate elements of the same array variable with identical reference patterns to a single object.
H5 Allocate elements of the same array variable which are referenced by only one processor to a single object, irrespective of whether they are read or written by this processor.
H6 Allocate barrier synchronization variables of different parallel code segments to the same object.
H7 Allocate wait/signal synchronization variables with the same reference pattern to the same object.
H8 Allocate lock synchronization variables that protect data elements with the same reference pattern to the same object.

4.2 Assistance in Block-Level Placement

In this section, we discuss the manner in which block-level placement can be assisted by using the objects created for each code segment by applying OCRP.

False Sharing

When the data reference pattern for each code segment is exact, the objects created by OCRP are not falsely shared. Therefore, for each code segment, false sharing can be eliminated by laying out the elements of each object in a contiguous portion of the SVM and ensuring that objects do not share virtual pages. In addition to eliminating false sharing of pages as well as cache lines, such a scheme also facilitates prefetching, because the objects created by OCRP have temporal locality. One method of implementing this scheme is page padding, in which the virtual memory allocated to each object is padded to a page boundary. This method requires that the page size be known at compile-time, and necessitates recompilation of the application when it is ported to a NUMA multiprocessor with a different page size. The other method is for the runtime mechanism to provide primitives to allocate virtual memory for objects of various sizes. Such primitives can easily be provided by operating systems such as Mach [57]. Later in this chapter, we present solutions to issues that arise due to mapping objects to the SVM in this manner. These issues include internal fragmentation and code generation for references to array variables whose elements are non-contiguous in the SVM.

Since objects are mapped to the SVM for each code segment individually, variables that are referenced in more than one code segment need special consideration.

[Figure 4.1. Examples of objects for elements of a two-dimensional array.]

Typically, synchronization variables are independent in each code segment and therefore their values need not be propagated from one code segment to another. Most of the scalar variables are shared by all processors and have the same reference pattern in all the code segments. We propose that for both methods of eliminating false sharing, the remaining scalar variables be remapped by copying them from the objects of the previous code segment to those of the next code segment.
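A sketch of this copy-based remapping is given below; it is illustrative Python under our own assumptions (objects are modeled as word arrays and the layouts are maps from scalar names to an object and an offset), not the directive format actually emitted by OCRP.

def remap_scalars(prev_layout, next_layout, memory):
    """prev_layout, next_layout: scalar name -> (object id, offset) for the
    previous and the next code segment; memory: object id -> list of words.
    Copies each surviving scalar's value to its location in the new objects."""
    for name, (old_obj, old_off) in prev_layout.items():
        if name in next_layout:
            new_obj, new_off = next_layout[name]
            memory[new_obj][new_off] = memory[old_obj][old_off]

# Example: 'tol' stays shared, while 'count' becomes private to processor 0.
memory = {'seg1_shared': [3.14, 7], 'seg2_shared': [0.0], 'seg2_private0': [0]}
remap_scalars({'tol': ('seg1_shared', 0), 'count': ('seg1_shared', 1)},
              {'tol': ('seg2_shared', 0), 'count': ('seg2_private0', 0)},
              memory)
print(memory['seg2_shared'], memory['seg2_private0'])   # [3.14] [7]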
We also use copy opera- tions to remap array variables when page padding is used to eliminate false sharing. On the other hand, when the virtual memory for objects is allocated using runtime primitives, we remap array variables by requiring that the following additional prim- itives be provided: (1) split, which splits a single object into several objects, and (2) combine, which combines several objects into a single object. In general, after applying OCRP, operations to remap the various variables (copy, combine and split) are inserted between each pair of consecutive code segments. Both the combine and the split operations need to know how the elements in the objects are to be combined or split, respectively. Our first approach to specify- ing this information is by a parameter [54]. For example, interleave in the primitive split(input_object_id, interleave, interleave_factor, output-object_ids), specifies that the elements of the input object need to be allocated to the output objects in an inter- leaved fashion. Our second approach is to specify for each input and output object, W. 0n Prim 92 information about the layout of its elements in the SVM. We derive this second method from SSVM [23], where an object’s kind determines the layout of its elements in the SVM, and sequential and stride kinds of objects are supported. A sequential object contains elements which are contiguous in the SVM, while a stride object is a set of sequential objects of the same size that are non—contiguous, but equally spaced in the SVM. For example, assume that the elements of the two-dimensional array in Figure 4.1 are mapped in row major form to the SVM. Then, A is a sequential object, and B and C are stride objects. When a data element is assigned to an object during object-creation, OCRP updates the kind of the object using the algorithm shown in Figure 4.2. The objective of the combine operation is to create a consistent copy of the array variable in the local memory of the processor which executes the sequential code seg- ment. Hence, its effect is equivalent to that of flushing the cache of all the processors at the end of a code segment in software-assisted cache coherence schemes such as [108]. Its implementation is more involved however because unlike the flush-cache operation which writes to consecutive locations in the global memory, it needs speci- fication of the kind of all the objects that need to be combined. We propose several optimizations in using the combine operation. For example, the combine operation for an array is inserted only after a code segment in which it is modified and further, only if there is a change in the array’s reference pattern in the following code seg- ments. Moreover, only those elements of the array that are written are transferred during the combine operation. Adaptive Placement Schemes The manner in which OCRP creates objects allows specification of the placement in- formation for each object, which includes the variable type as well as the nature of the reference pattern of its elements. If false sharing is eliminated using page padding, OCR? provides placement information for the virtual pages of each object. On the other hand, if false sharing is eliminated using runtime primitives, additional primitives need to be provided that allow DCRP to specify placement information for 93 /* An element’s ID is its position in the SVM relative to its variable. 
*/ IF number of elements in object is zero /* newly-created object */ obj..kind = sequential obj-vir.addr_index = elem_ID obj_num.elem = 1 cur_seq_obj-num_elem = 1 ELSEIF (obj_kind is sequential) IF ((obj_vir-addr-index + obj-num_elem) equals elem_ID) /* next element in the sequential object */ increment obj-num-elem increment cur_seq_obj_num-elem ELSE obj_kind = stride increment obj_num_elem obj_stride = (elem_ID - (obj_start_vir_addr + obj-num-elem)) obj_seq.set_count = 2 seq_obj-num-elem = obj-num.elem cur.seq-obj-num_elem = 1 last_elem.ID = elem_ID EN DIF ELSEIF (obj.kind is stride) last-elem_ID = elem_ID increment ob j_num-elem increment cur.seq_obj-num_elem IF (cur.seq-obj-num.elem > seq_obj.num_elem) /* new sequential set of elements begins; check stride */ IF ((last.elem_ID - seq-obj_num_elem + l + obj-stride) not equals elem_ID) /* unknown object kind */ exit ELSE /* valid stride */ increment ob j.seq_set-count cur.seq_obj_num_elem = 1 last-elem_ID = elem_ID ENDIF EN DIF EN DIF Figure 4.2. Algorithm to undate the kind of an object. 94 Shared Virtual Memory of Application Code Data Runtlme-Ailocated Staticall -Declared Array Sca ar Synch nlzatlon Private Shared Barrier Walt/Signal Lock ’v Read-Only Read-Write Mostly-Read Active-Read-Write Figure 4.3. Placement information provided by OCRP. each object. The placement information allows the runtime (overhead of adaptive placement schemes to be reduced or in certain cases eliminated altogether. OCR? specifies the variable type of an object to be scalar, array, barrier, wait/signal, or lock. The information it provides about an object’s reference pat- tern depends on the data structure used to store the data reference pattern. We consider the case when the exact number of read and write references for each data element are available in its read and write counters, respectively. If the counters of only one processor of such an object are non-zero, it is a private object. If all the write counters of the object are zero, it is a read-only object. If the sum of all the read counters of the object is far greater than the sum of the write counters, it is read more often than written and is specified to be a mostly-read object. Finally, objects with almost identical, non-zero, read and write counters for the various processors are equally read and written by all processors and are specified to be active-read-write 95 objects. The placement information provided by DOE? is shown in Figure 4.3. When the reference pattern is represented using bits instead of counters, DCRP is still able to provide the same placement information except that it is unable to classify read-write objects into mostly-read and active-read-write objects. Previous studies on block-level placement have made it straightforward to choose a placement scheme, given the placement information provided by OCRP. For exam- ple, a private object is best placed statically in the local memory of the processor that references it, while mostly-read and read-only shared objects can be replicated. Consistency need not be enforced for read-only shared objects and it is better not to migrate or replicate an actively shared read-write object. Barrier synchronization and wait-signal synchronization objects (shared actively in a fine-grain manner) are best placed statically in the local memory of one of the processors that references them. Since the placement information is specified for each code segment, OCRP allows changes in placement strategies from one code segment to another. 
In the case of conditional statements, the data reference pattern is determined separately for each case of a conditional statement. The remap operations and placement information are also specified conditionally for each such case. When the data reference pattern is inexact, the objects created by OCRP might be falsely-shared and so are the resulting virtual blocks. Further, the placement information provided by OCRP is not exact. Since OCRP does not specify when copies of data elements need to be updated, but only controls the mapping of the data elements to the SVM, applications compiled by applying OCRP execute correctly even in the presence of an inexact reference pattern. In this manner, OCRP assists block-level placement at all levels of the memory hierarchy by creating virtual objects which are not falsely shared and which have temporal locality, and by providing placement information for these objects. After applying OCRP, the parallel application is transformed as shown in Figure 4.4.

[Figure: a sequential code segment, followed by remap operations and placement information, a parallel code segment, remap operations and placement information, another parallel code segment, remap operations and placement information, and a final sequential code segment.]
Figure 4.4. Parallel application after applying OCRP.

4.3 Compilation of Applications

In this section, we illustrate OCRP by applying it to certain applications. The first application we consider is matrix multiplication, and its typical parallel program is shown in Figure 4.5.

DOSEQ S1
    Initialize A,B,C
S1 END DOSEQ
DOALL S2 I = 1 TO n
    DOALL S3 J = 1 TO n
        DOSEQ S4 K = 1 TO n
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
        S4 END DOSEQ
    S3 END DOALL
S2 END DOALL
DOSEQ S5
    Output C
S5 END DOSEQ

Figure 4.5. Application matrix multiplication.

For the sequential code segment S1, OCRP creates three objects, one each for the array variables A, B, and C, and each variable's object contains all its elements. Let us assume that the parallel code segment represented by S2, S3, and S4 is control-partitioned by allocating sets of consecutive values of the loop index I to the various processors. For this case, OCRP creates a private read-write object for each processor containing the elements of the rows of C which are exclusively read and written by this processor. It also creates a private read-only object for each processor containing the elements of the rows of A which are exclusively read by this processor. Further, it creates a single shared read-only object for all the elements of B. Figure 4.6 illustrates object-creation in this manner when the parallel code segment is executed by three processors.

[Figure: the arrays C, A, and B drawn as per-processor read-write objects, per-processor read-only objects, and a single shared read-only object, respectively, with a legend distinguishing the portions belonging to Processor 1, Processor 2, Processor 3, and the object shared by Processors 1, 2, 3.]
Figure 4.6. Objects created by OCRP for application matrix multiplication.

For the sequential code segment S5, OCRP creates a single object containing all the elements of C.
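The object to which an element of this code segment belongs follows directly from the partitioning. The following C fragment is an illustrative sketch only; the numbering of the objects and the assumption that p divides n are ours, not part of OCRP.

    /* Map an element in row i (0-based) of an n x n array to the object OCRP
       creates for the row-partitioned parallel code segment of Figure 4.5.
       Objects 0..p-1 hold the rows of C owned by each processor, objects
       p..2p-1 the corresponding rows of A, and object 2p is the single
       shared read-only object for B.                                        */
    int object_of(char var, int i, int n, int p)
    {
        int rows_per_proc = n / p;      /* assumes p divides n               */
        int owner = i / rows_per_proc;  /* processor that owns row i         */

        switch (var) {
        case 'C': return owner;         /* private read-write object         */
        case 'A': return p + owner;     /* private read-only object          */
        default:  return 2 * p;         /* all of B: shared read-only object */
        }
    }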
The next application we consider is the parallel iterative solver for a two-dimensional problem with a 5-point stencil, which is part of the workload for our experiments outlined in Chapter 2. For ease of reference, we repeat the typical parallel program of this application in Figure 4.7. Here, each iteration corresponding to loop S2 uses the results of the previous iteration and hence must be executed sequentially. However, the iterations of the loops S3 and S4 can be executed in parallel. The parallel program consists of a sequential code segment S1 for initialization, a parallel code segment represented by S3 and S4 that is executed m times (corresponding to the m iterations of loop S2), and another sequential code segment S5 for termination.

DOSEQ S1
    Initialize A
S1 END DOSEQ
DOSEQ S2 K = 1 TO m
    DOALL S3 I = 1 TO n
        DOALL S4 J = 1 TO n
            A(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        S4 END DOALL
    S3 END DOALL
S2 END DOSEQ
DOSEQ S5
    Output A
S5 END DOSEQ

Figure 4.7. Application parallel iterative solver.

OCRP creates a single object containing all the elements of array variable A for each of the two sequential code segments S1 and S5. Let us assume that the parallel code segment represented by S3 and S4 is control-partitioned by assigning a different subset of the values for the I and J loop indices to the various processors. Each processor then computes the values of a rectangular partition of the array A for each iteration of loop S2. For the parallel code segment, OCRP creates nine objects for each processor as shown in Figure 4.8. Here, the processors are labeled Pi, 1 ≤ i ≤ 9, and the objects for P5 are labeled Oi, 1 ≤ i ≤ 9. O1 contains elements written by P5 and read by P2, P4, and P5; O2 contains elements written by P5 and read by P2 and P5; and so on.

[Figure: the array divided into a 3 x 3 grid of processor partitions P1 through P9; within the partition of P5, nine regions are labeled O1 through O9.]
Figure 4.8. Objects created by OCRP for application parallel iterative solver.

Another application we consider is the recurrence relation, and its typical parallel program is shown in Figure 4.9. OCRP creates a single object containing all the elements of array variable A for each of the two sequential code segments S1 and S3. The parallel code segment S2 can be control-partitioned into independent execution sets, which contain the sets of iterations (1, 7, 13, ...), (2, 8, 14, ...), and so on. For this case, OCRP creates a private read-write object for each processor containing the elements in its independent set.

DOSEQ S1
    Initialize A
S1 END DOSEQ
DOALL S2 I = 7 TO n
    A(I) = A(I) + A(I-6)
S2 END DOALL
DOSEQ S3
    Output A
S3 END DOSEQ

Figure 4.9. Application recurrence-relation.

Next, we illustrate OCRP for the application in Figure 4.10, which contains a DOACROSS loop with synchronization operations [114]. Assume that the iterations in the DOACROSS loop are control-partitioned such that each processor p executes the set of iterations starting from p until n, with a step equal to the number of processors. In this case, OCRP creates a single object for each array variable for the initialization code segment, and creates multiple objects for the parallel code segment based on the reference pattern of the data elements.

Examples of array variables occurring in more than one code segment are A in Figure 4.7 and C in Figure 4.5. In the case of each of these variables, a single object containing all its elements is created for the initialization code segment. The single object is then split into multiple objects for the parallel code segment. Assuming that the arrays are mapped to the SVM in row-major form, array C is split into multiple sequential objects, while array A is split into a combination of sequential and stride objects. In both cases, the multiple objects are later combined into a single object for the termination code segment.
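The exact interface of the split and combine primitives is left open here (either a parameter such as interleave or a per-object layout description is passed). The C sketch below is therefore only one possible rendering of the layout-based approach; the descriptor fields and the function signatures are assumptions made for illustration.

    typedef int object_id;

    typedef struct {                 /* layout of one output object in the SVM */
        enum { LAYOUT_SEQUENTIAL, LAYOUT_STRIDE } kind;
        unsigned long base;          /* offset of the first element            */
        unsigned long size;          /* bytes in one contiguous run            */
        unsigned long skip;          /* bytes skipped between runs (stride)    */
        unsigned long count;         /* number of runs (stride)                */
    } layout_t;

    /* Hypothetical runtime primitives. */
    extern void split(object_id in, int nout, const layout_t out[], object_id out_ids[]);
    extern void combine(int nin, const object_id in[], object_id out);

    /* Emitted before the parallel code segment of Figure 4.5: split the single
       object holding the n x n row-major array C into one sequential object
       per processor, each covering a block of n/p consecutive rows.           */
    void remap_C_for_parallel(object_id c_obj, int n, int p,
                              unsigned long elem_size, object_id c_parts[])
    {
        unsigned long block = (unsigned long)(n / p) * n * elem_size;
        layout_t parts[64];          /* assumes p <= 64 for brevity            */

        for (int q = 0; q < p; q++) {
            parts[q].kind  = LAYOUT_SEQUENTIAL;
            parts[q].base  = (unsigned long)q * block;
            parts[q].size  = block;
            parts[q].skip  = 0;
            parts[q].count = 1;
        }
        split(c_obj, p, parts, c_parts);
    }

    /* The inverse operation, emitted before the termination code segment:
       combine(p, c_parts, c_obj);                                            */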
In the case of all the above applications, barrier synchronization is implicit between successive code segments. OCRP allocates all the barrier synchronization variables of an application to a single object. Wait/signal synchronization variables are used to enforce dependencies in the DOACROSS loop shown in Figure 4.10. Using the data reference pattern and the arguments to a given SEND operation, it is possible to determine which processors are involved in the wait/signal synchronization and would reference the corresponding wait/signal synchronization variable. OCRP uses this information to create objects for these variables.

DOSEQ S1
    Initialize
S1 END DOSEQ
DOACROSS S2 I = 1 TO n
    WAIT(CSYNC,I-1)
    A(I) = B(I) + C(I-1)
    SEND(ASYNC,I)
    D(I) = A(I) * 2
    WAIT(ASYNC,I-1)
    C(I) = A(I-1) + C(I)
    SEND(CSYNC,I)
    WAIT(CSYNC,I-2)
    E(I) = D(I) + C(I-2)
S2 END DOACROSS
DOSEQ S3
    Terminate
S3 END DOSEQ

Figure 4.10. Application with a DOACROSS loop.

4.4 Internal Fragmentation

In order to assist block-level placement, the objects created by OCRP are not allowed to share virtual pages. The virtual pages of each object are mapped to one or more physical blocks (e.g., pages, cache lines) during runtime. When the size of the virtual page is large relative to the number of elements in the last virtual page of an object, internal fragmentation and wastage of virtual memory occur. Similarly, when the size of the physical block is large compared to the number of elements in the last physical block for an object, internal fragmentation and wastage of physical memory occur. In the extreme situation, it is impossible to satisfy the virtual and/or physical memory requirements of certain applications when they are compiled using OCRP.

Of the factors influencing internal fragmentation, the block size is an architectural parameter, while the number of objects and the number of elements per object are determined by application characteristics such as the size and the reference pattern of its data. Hence, for a given block size, internal fragmentation occurs for certain applications and not for others. For example, internal fragmentation of virtual memory does not occur for the application matrix multiplication if each object in Figure 4.6 needs an integer number of pages. In the case of application parallel iterative solver, the occurrence of internal fragmentation for objects Oi, i = 2, 4, 5, 6, 8, which are shown in Figure 4.8, depends on the number of processors and the size of array variable A. Objects Oi, i = 1, 3, 7, 9 in the same figure, however, contain just one element, irrespective of these factors, and are prone to internal fragmentation for all page sizes greater than one.

The extra physical memory needed to assist block-level placement can be compared to that needed to improve the availability and access time of data by replication. Such a space-time tradeoff is justified in all but the extreme situations of insufficient virtual and physical memory. To illustrate our claim, consider the application matrix multiplication, and let us assume that the page size is 4 Kbytes, that each data element occupies four bytes, and that A, B, and C are all 32 x 32 array variables.
Further, let us assume that there are p processors and that the parallel code segment is control-partitioned by assigning consecutive sets of values of the loop index I to these processors. OCS creates a single virtual page for each array variable, which contains all of its elements and is shared by all p processors. On the other hand, OCRP creates p private objects, one per processor, for each of the array variables A and C, each containing the elements referenced by this processor. It creates a single object for all the elements of B which is shared by all processors. When page-level replication is used, in both cases, p physical pages are allocated for each array variable, one each in the local memory of every processor. The amount of data transferred for OCS is more than that for OCRP. Further, OCRP offers a better runtime performance than OCS because it eliminates false sharing, while OCS incurs the overhead of maintaining consistency of its falsely-shared pages. Note, however, that though OCRP requires the same amount of physical memory as that needed by OCS, it requires more virtual memory.

Solutions: In general, the problem sizes in numerical and scientific applications are large enough not to cause internal fragmentation of virtual and physical memory when applying OCRP. Internal fragmentation is a problem under certain conditions, however, and therefore we propose schemes to help alleviate this problem. One solution is for the memory architecture to support pages of various sizes, as in the MIPS R4000 [115]. The appropriate page size can then be chosen based on application characteristics and the available virtual and physical memory. Another solution is for the compiler to map objects of different variables with the same reference pattern to the same virtual page. When the reference pattern is represented by a read bit and a write bit for each processor, this comparison can be done inexpensively using a logical AND operation. If counters are used instead of bits, the comparison is more expensive. One solution is to convert the counter information into the bit format for this purpose. When applied to variables in the same code segment, this solution alleviates internal fragmentation of both virtual and physical memory. This solution can also be applied to variables belonging to different code segments, in which case it reduces the internal fragmentation of virtual but not physical memory, assuming that the size of the working set is determined by the memory required for each code segment. In this case, it is not necessary for the variables to have the same reference pattern because placement decisions are made individually for each code segment. Modified versions of the combine and split operations are necessary when using this solution. For example, pages of an object containing variables referenced in the next code segment should not be deallocated during the combine operation before this code segment. A third solution is to use a relaxed version of OCRP which reduces false sharing instead of eliminating it altogether. For example, a relaxed version of OCRP applied to the parallel code segment in application parallel iterative solver creates one object per processor containing elements that are written by the processor. A fourth solution, the effectiveness of which depends on application characteristics, is to reduce the number of processors allocated to the application, thereby increasing the number of elements in objects which are prone to internal fragmentation. There is a tradeoff in this solution between the performance gain due to assisting block-level placement and the performance loss due to the reduced parallelism. For illustration of this solution, consider application matrix multiplication, and let m be the number of physical pages available and e be the number of elements in each of the array variables A, B, and C. Further, let us assume that there are p processors and that the parallel code segment is control-partitioned by assigning consecutive sets of values of the loop index I to these processors. Since only one object is created for B, the potential for internal fragmentation exists only for A and C. We need to allocate at least (3 x e/m) elements to each object of A and C in order to prevent internal fragmentation. While this solution helps in the case of all objects for application matrix multiplication, in the case of application parallel iterative solver, it helps in the case of objects Oi, i = 2, 4, 5, 6, 8, but not for objects Oi, i = 1, 3, 7, 9.
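Returning to the second solution above, the test of whether two objects have the same reference pattern reduces to a comparison of short bit vectors. The following C sketch shows one way to realize it; the packed 64-bit representation (and hence the limit of 64 processors) is an assumption made for brevity.

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit-format reference pattern of an object: one read bit and one write
       bit per processor, packed into 64-bit words.                          */
    typedef struct { uint64_t read_bits, write_bits; } refpat_t;

    /* Two objects may be mapped to the same virtual page if every processor
       reads and writes both objects in the same way.                        */
    bool same_pattern(refpat_t a, refpat_t b)
    {
        return ((a.read_bits ^ b.read_bits) | (a.write_bits ^ b.write_bits)) == 0;
    }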
4.5 Code Generation for Array References

In order to assist block-level placement, the objects created by OCRP are not allowed to share virtual pages. Therefore, it is possible for elements of an array variable to be mapped non-contiguously in the SVM. Existing code generation algorithms assume contiguous storage of array elements in the SVM, and calculate the virtual address of the element referenced in a given array reference as follows [116]. Consider a k-dimensional array, where the subscript in dimension j can take values from 1 to dj, and 1 ≤ j ≤ k. The virtual address of the element referenced in an array reference A(s1, s2, ..., sk) is given by:

    (s1 - 1) d2 d3 ... dk + (s2 - 1) d3 d4 ... dk + ... + (s_{k-1} - 1) dk + (sk - 1)
        = (s1 d2 d3 ... dk + s2 d3 d4 ... dk + ... + s_{k-1} dk + sk)
              - (d2 d3 ... dk + d3 d4 ... dk + ... + dk + 1)
        = (...((s1 d2 + s2) d3 + s3) ...) dk + sk
              - (d2 d3 ... dk + d3 d4 ... dk + ... + dk + 1)                    (4.1)

In general, each subscript sj in an array reference is a function of all loop induction variables which are valid in its scope. For example, considering the array reference A(I,K) in the parallel code segment in Figure 4.5, the loop induction variables which are valid in its scope are I, J, and K, and its subscripts are functions of I and K, respectively. With respect to a given loop depth and its induction variable, all other induction variables are constants. Hence, the virtual address of an element referenced in an array reference is a function of only the loop depth's induction variable. For example, with respect to loop depth S4, I and J are constants, and K is the only variable. Thus, the virtual address of the element referenced in A(I,K) is a function of only K. This property is used to move invariant factors outside a given loop depth when computing the virtual address using Equation 4.1. For example, in the case of loop depth S4 and array reference A(I,K), an address register is initialized to the virtual address of the element corresponding to the initial values of K and I, and is incremented by one to obtain the virtual address of elements referenced for successive iterations.
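Equation 4.1 is the row-major addressing rule written in Horner form, which is what makes the address-register optimization possible: all terms other than the innermost subscript are loop invariants. A small C sketch of the computation follows for illustration; the function name and interface are ours.

    #include <stddef.h>

    /* Offset (in elements) of A(s[0],...,s[k-1]) for 1-based subscripts and
       dimensions d[0..k-1], following Equation 4.1 in Horner form.          */
    size_t element_offset(const int s[], const int d[], int k)
    {
        size_t off = (size_t)(s[0] - 1);
        for (int j = 1; j < k; j++)
            off = off * (size_t)d[j] + (size_t)(s[j] - 1);
        return off;
    }

    /* For A(I,K) inside the K loop of Figure 4.5, I is invariant, so the
       offset is computed once for the initial value of K and the address
       register is then incremented by one element per iteration.            */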
Since OCRP does not guarantee contiguous storage of array elements in the SVM, the algorithm mentioned above is inapplicable. We develop a new algorithm to generate code for array references when applying OCRP. Our algorithm is shown in Figure 4.11 and is applicable to a perfectly nested loop. It exploits the fact that, with respect to a given loop depth, each array reference is a function of a single loop induction variable. It divides the loop into several subloops such that, within each subloop, all array references traverse a contiguous portion of the SVM. Therefore, the traditional code generation algorithm [116] can be used to generate code for each subloop. Between successive subloops, the address registers corresponding to each array reference with a break in contiguity are loaded appropriately.

/* The following algorithm is used to generate code for each processor.
   The variable loop_index is initialized to the first value of the loop
   induction variable assigned to the processor.  MAXITERATIONS is the
   maximum value of the loop induction variable.  LASTITERATION is the
   last value of the loop induction variable assigned to the processor.
   Each array reference has: (1) an address register containing the
   virtual address to be used in code generation; it is initialized to
   the virtual address of the element corresponding to the initial value
   of loop_index, and (2) a variable valid_loop_index containing the loop
   index till which the current address register contents are valid
   (i.e., a contiguous portion of the shared virtual memory is traversed). */

Initialize loop_index
prev_loop_index = loop_index
FOR each array reference
    Determine type and direction of traversal
    Initialize address register and valid_loop_index
END FOR
WHILE loop_index <= LASTITERATION
    FOR each array reference
        IF loop_index > valid_loop_index
            /* break in contiguity; obtain next virtual address */
            address register = virtual address of next element on traversal
            valid_loop_index = {loop index corresponding to last element
                                without break in contiguity}
            IF last contiguous region for the traversal
                valid_loop_index = MAXITERATIONS
            END IF
        END IF
    END FOR
    loop_index = minimum valid_loop_index
    create subloop for iterations from prev_loop_index till loop_index
    increment loop_index
    prev_loop_index = loop_index
END WHILE

Figure 4.11. Code generation for array references when applying OCRP.

We introduce some definitions to explain our new algorithm for code generation. We consider the subscripts of an n-dimensional array reference as representing the coordinates of an n-dimensional plane. Any point in this plane corresponds to a specific set of values for these subscripts and hence a given element of the array. We define a traversal for a given array reference as containing the points in this plane that correspond to elements which are referenced due to this array reference. The following steps are needed for each array reference in order to apply our new algorithm: (1) determination of the type of the traversal and (2) creation of a virtual-address data structure for this traversal, which contains a list of contiguous portions of the SVM that are traversed. We illustrate this process for the case of a two-dimensional array reference with subscripts (s1, s2), where s1 = a1 ip + b1 and s2 = a2 iq + b2. Here, a1, b1, a2, and b2 are integer constants, and ip and iq are loop induction variables. Let im be the loop induction variable corresponding to the loop depth of interest. We have four cases to consider:

Case A: (ip ≠ im) ∧ (iq ≠ im). In this case, s1 and s2 are both constants within the loop depth of interest, and hence a single element of the array is referenced.

Case B: (ip = im) ∧ (iq ≠ im). In this case, s1 is a linear function of im, while s2 is a constant, and a given column of the array is traversed.
Case C: (ip ≠ im) ∧ (iq = im). In this case, s1 is a constant, while s2 is a linear function of im, and a given row of the array is traversed.

Case D: (ip = im) ∧ (iq = im). In this case, both s1 and s2 are linear functions of im. Rearranging the equations for s1 and s2, we obtain ip = (s1 - b1)/a1 and iq = (s2 - b2)/a2, respectively. Substituting the condition for Case D, viz., ip = iq, we obtain

    a2 s1 - a1 s2 + (a1 b2 - a2 b1) = 0                                    (4.2)

Hence, the traversal is a straight line which is parallel neither to the x nor to the y axis. For the specific case when a1 and a2 are equal to 1, and b1 and b2 are equal to 0, the traversal is along the diagonal. The traversals corresponding to these four cases are shown in Figure 4.12, where D1 and D2 are the two possible subcases of Case D and correspond to straight lines with negative and positive slopes, respectively. In this manner, the type of the traversal corresponding to an array reference can be determined.

[Figure: the (s1, s2) plane of a two-dimensional array with the traversals for Cases A, B, C, D1, and D2 drawn as a single point, a column, a row, and two lines of negative and positive slope, respectively.]
Figure 4.12. Traversals for an array reference with linear subscripts.

Once the type of the traversal for an array reference is determined, the next step is to create a data structure that contains a list of contiguous portions of the SVM referenced for this traversal. Note that the manner in which this data structure is maintained should depend on the direction of the traversal. In general, the direction of the traversal is determined by the sign of the step of the loop induction variable. For example, the values of (1, 20, 1) and (20, 1, -1) for the parameters (lower bound, upper bound, step) of a loop induction variable I will cause forward and reverse traversals for the array reference A(I), respectively. The direction of the column and row traversals of a two-dimensional array reference can be determined in a similar manner because, for both cases, only a single subscript varies within the loop, just as in the case of the one-dimensional array reference A(I). The direction of the traversal for Case D depends on the sign of the step of the loop induction variable as well as on those of a1 and a2. For example, when all three are positive, the traversal is along D2 in the upward direction. When a1 and a2 are positive and the step of the loop induction variable is negative, the traversal is along D2 in the downward direction. The other cases can be derived similarly.

The virtual-address data structure for each array reference can be created by finding the traversed elements and their virtual addresses. It is straightforward to determine the elements traversed for Case A and for the column and row traversals of Cases B and C, respectively. The elements on the traversal for the case D1 can be determined by varying s2 from zero and obtaining the corresponding value of s1 using Equation 4.2, until s1 is zero. The elements on the traversal for the case D2 can be determined by varying s1 from zero and obtaining the corresponding value of s2 using Equation 4.2, until the values of s1 and s2 correspond to the element in the top right corner of the array. For each set of values of s1 and s2, the corresponding element on the traversal can be computed. The virtual address of each element depends on its relative position within the object to which it is allocated by OCRP. Each object's start virtual address is either known at compile time (when page padding is used) or obtained using runtime primitives. In this manner, the virtual-address data structure for each array reference can be created. This method can be extended to arrays of dimensions greater than two.
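The construction of the virtual-address data structure amounts to walking the traversal in its direction and recording every break in contiguity. The following C sketch illustrates this; vaddr_of() stands for the lookup, described above, of an element's virtual address from its position within its object, and the record layout is ours rather than the one used by the implementation (compare Figure 4.14).

    #include <stddef.h>

    /* One entry of the virtual-address data structure: a contiguous portion
       of the SVM together with the range of loop-index values it covers.    */
    typedef struct { size_t start_vaddr; int first_index, last_index; } run_t;

    /* Build the run list for one array reference over loop indices lo..hi.
       vaddr_of(i) returns the virtual address of the element referenced at
       loop index i, as determined by the traversal type and by the objects
       created by OCRP.  Returns the number of runs recorded.                */
    int build_runs(size_t (*vaddr_of)(int), int lo, int hi, size_t elem_size,
                   run_t runs[], int max_runs)
    {
        size_t prev = vaddr_of(lo);
        int nruns = 0;
        runs[0] = (run_t){ prev, lo, lo };

        for (int i = lo + 1; i <= hi; i++) {
            size_t va = vaddr_of(i);
            if (va == prev + elem_size) {
                runs[nruns].last_index = i;           /* still contiguous     */
            } else {
                if (nruns + 1 >= max_runs)
                    return nruns + 1;                 /* out of space         */
                runs[++nruns] = (run_t){ va, i, i };  /* break in contiguity  */
            }
            prev = va;
        }
        return nruns + 1;
    }

Each entry then corresponds to one load of the address register and one subloop in Figure 4.11.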
The number of times the address register for a given array reference is loaded is equal to the number of entries in the virtual-address data structure for this array reference. This number is influenced by the number of objects scanned in a traversal, because each object causes a potential discontiguity in storage. Further, this number also depends on whether or not successive elements on a given traversal that belong to the same object have consecutive virtual addresses. In order to illustrate this fact, consider the application matrix multiplication and assume that each processor computes the values of elements belonging to a set of rows of matrix C. Let us assume that OCRP processes the elements of each array variable in a columnwise order. Consider the elements traversed in the row traversal corresponding to array reference A(I,K) by one of the processors. OCRP allocates all these elements to a single object, but they do not have consecutive virtual addresses. On the other hand, elements traversed for the column traversal corresponding to B(K,J) belong to a single object and have consecutive virtual addresses. The number of times the address register is loaded can be minimized if the order in which OCRP processes the elements of an array variable is tailored to the references of the array variable. For example, A and B in the application matrix multiplication can be processed rowwise and columnwise, respectively. This optimization, however, cannot be applied when an array has references for both the row and the column traversals.

4.6 Compilation Overhead

The additional steps needed for each code segment when compiling using OCRP instead of OCS are:

1. Determination of the data reference pattern.
2. Object-creation.
3. Creation of the virtual-address data structure for each array reference.
4. Code generation using our new algorithm.
5. Generation of remap operations and placement information.

Time Complexity: We already derived the time complexity for step 1 in the previous chapter. The algorithm used in step 2 is shown in Figure 4.13.

FOR each code segment
    FOR each variable
        Determine reference pattern of all elements
        FOR each element
            IF an object does not exist corresponding to its reference pattern
                create an object corresponding to its reference pattern
            END IF
            add element to object corresponding to its reference pattern
        END FOR
    END FOR
    FOR each object
        map contiguously in the SVM; disallow objects to share virtual pages
    END FOR
END FOR

Figure 4.13. Algorithm used by OCRP to create objects.

Without loss of generality, we assume that in step 2 the elements of a k-dimensional array variable are processed by varying subscript sp faster than subscript sq, where 2 ≤ p ≤ k, 1 ≤ q ≤ k-1, and p > q. Further, the subscript in dimension j can take values from one to dj, where 1 ≤ j ≤ k. Also, the objects are organized as a separate list for each value of subscript s1. For example, the elements of a two-dimensional array are processed rowwise and the objects are stored in a separate list for each row. Let e be the maximum number of elements in any array variable and p be the number of processors allocated to the application. In the worst case, each element of an array variable is of a different reference pattern and is allocated to a distinct object, in which case there are e/d1 objects in each list. When adding an element, on average it is necessary to look up e/(d1 x 2) objects.
For each such lookup, it is necessary to compare the reference pattern of the object and the element. This comparison takes constant time when bits are used to represent the reference pattern. When counters are used, the time taken for the comparison is O(p). Hence, the worst-case time complexity of step 2 for an array variable is given by O(e x (e/(d1 x 2)) x p).

The time complexity of step 3 depends on the number of distinct traversals and the number of elements in each traversal. For each element, its object is looked up from the information generated in step 2. For example, to generate the objects for the column traversal of a two-dimensional array, in the worst case, e/d1 objects need to be looked up to get each element's object. Further, as in step 2, e/(d1 x 2) objects need to be looked up to insert this information in the objects for the column traversal. Therefore, the worst-case time complexity is O(e x (e/d1) x (e/(d1 x 2)) x p). The worst-case time complexity for Case D discussed earlier is O(t x (e/d1) x (t/2) x p), where t is the number of elements in the traversal. The time complexities of steps 4 and 5 depend on the number of array references and the number of entries in the virtual-address data structure for each such reference. Hence, they depend on the characteristics of the application and the method of control-partitioning.

Space Complexity: We already derived the space complexity for step 1 in the previous chapter. The data structure we use to store the objects for a given traversal is shown in Figure 4.14. The space complexity of steps 2 and 3 depends on the number of traversals and the number of entries in the virtual-address data structure for each traversal.

[Figure: a linked structure. A record per variable holds the variable name, variable type, object pointers, and a next-variable link; each object record holds the reference pattern, object information, an element-set pointer, and a next-object link; each element-set record holds the start element, the number of elements, and a next-element-set link.]
Figure 4.14. Data structure to store objects created by OCRP for an array variable.

4.7 Performance of Applications

The code for OCRP is larger than that for OCS due to the placement information, calls to runtime primitives, and extra loads of address registers, and thus additional physical memory is needed. We now discuss the various factors that affect the performance of an application when compiled with OCRP in relation to its performance when compiled using OCS. With respect to block size, a tradeoff between prefetching and false sharing exists for OCS because it prefetches and places data in terms of blocks. At one end of the spectrum, false sharing can be eliminated by placing data at an element level, but then there is no prefetching. Block-level data placement with larger blocks enhances prefetching but also increases the potential for false sharing. OCRP, on the other hand, creates virtual objects that provide the benefits of prefetching and at the same time eliminate false sharing.

OCRP reduces the execution time of an application obtained with OCS by:

1. eliminating false sharing,
2. enabling prefetching of an object into physical memory before the code segment in which it is referenced, while with OCS each physical block of the object is fetched individually on demand,
3. reducing the runtime overhead of determining the reference pattern of blocks, and
4. providing placement information.

In order to assist block-level placement, the objects created by OCRP
are mapped to the SVM such that they do not share virtual pages. Due to this restriction, the following factors increase the execution time of the application obtained with OCS: 1. additional loads of address registers due to the need to use new algorithms for code generation, 115 2. additional system calls to obtain virtual addresses of objects in using the new algorithms for code generation, when the virtual memory for objects is allocated by runtime mechanisms, 3. possible inapplicability of certain compiler optimizations due to non-contiguous storage of array variables, 4. remap operations before each code segment, and 5. pages which are referenced in two consecutive code segments not being reused because of deallocation due to the remap operations between them. - Clearly, there is a tradeoff between these two set of factors. The factor that appears to significantly degrade performance in the case of OCR? are the remap operations. Recall that an array variable is remapped when its reference pattern changes between code segments. In such a case, processors reference a different set of its elements in each of these code segments. Therefore, even in the case of OCS, it might be necessary to fetch additional blocks. Further, the combine operation of OCR? deallocates physi- cal blocks that were allocated for the previous code segment. In the case of OCS, these blocks are deallocated only when they are chosen for replacement. Blocks that are not deallocated are falsely-shared and contribute to the consistency overhead. Hence, the performance gain due to the elimination of false sharing and the prefetching and deallocation features of the remap operations when using OCR? may outweigh the per- formance loss due to the prevention of reuse of pages. Further, the runtime overhead of adaptive placement is less for OCR? than for OCS. Hence, we expect OCR? to per- form better than OCS for a majority of applications. In Chapter 6, we quantitatively compare the performance of applications for the two schemes. 4.8 Summary In this chapter, we developed a new object-creation scheme that assists block-level placement by eliminating false sharing and providing placement information. We also 116 proposed solutions to the issues of internal fragmentation and code generation that arise when applying our object—creation method. We derived the overhead involved in compiling applications using our scheme and discussed the various factors that affect the resulting runtime performance. Most of the factors that contribute to the compile-time and runtime overhead when applying our method are due to the remap operations. These operations are necessitated by the inherent nature of block-level placement schemes which invoke data placement operations for every reference to a block irrespective of whether these operations are actually required or not. In order to assist such strategies in placing data, the compiler needs to properly lay out data in the SVM and hence the remap operations. It appears that instead of laying out data to accommodate unnecessary placement operations, the compiler can be of better help if it is allowed to direct or specify when data placement operations are actually required. Such a facility is provided by ob ject-level placement and in the next chapter, we present our work on compiler-directed ob ject-level placement. 
CHAPTER 5 COMPILER—DIRECTED OBJECT-LEVEL PLACEMENT In this chapter, we present our work on compiler-directed object-level placement when providing a shared virtual memory (SVM) in a N UMA multiprocessor. We first discuss how compiler-directed object-level placement is able to use the compiler’s knowledge about the data reference pattern more effectively than compiler-assisted block-level placement. Next, we present our object-creation scheme which achieves compiler-directed object-level placement. After illustrating our scheme for certain applications, we derive the compilation overhead involved in applying it. We then discuss the various factors that influence the performance of an application when using block-level placement, compiler-assisted block-level placement, and compiler-directed object-level placement. We conclude by presenting the results of our study which models the execution of an application when ob ject-level placement is provided. 5.1 Motivation In the previous chapter, we considered block-level placement schemes that invoke data placement operations whenever each block is referenced. For example, when block- level replication and sequential consistency is provided, whenever a write occurs, all blocks containing the referenced data element are made consistent. On the other hand, when release consistency is provided, whenever a release operation occurs, all 117 118 blocks which are referenced after the corresponding acquire operation and before this release operation, are made consistent. The performance of such schemes suffers when blocks are falsely-shared and when their reference patterns change rapidly. We provided compiler assistance to address these problems by using our object-creation scheme OCR?. In order to assist block-level placement schemes, objects created by OCR? for each code segment are not allowed to share virtual pages. This restriction leads to perfor- mance loss due to certain factors. For example, array variables which are split into multiple objects need additional virtual memory to map these objects. Further, when their elements are mapped non-contiguously to the SVM, there is a loss of efficiency in the code generated. Array variables with a different set of objects in distinct code segments need to be remapped between these code segments. Also, internal fragmen- tation of virtual and physical memory occurs when there are few elements in the last block for an object relative to the block size. Block-level placement schemes are unable to effectively use the information the compiler knows about the data reference pattern because they (1) invoke data place- ment operations on every reference to a block irrespective of whether these operations are actually required or not and (2) provide data placement in terms of blocks rather than objects. The compiler’s knowledge about the data reference pattern can be used more effectively when it specifies when data placement operations are necessary, and also when it specifies them at the level of an object instead of a block. For example, this knowledge is best used by systems such as SSVM [23], which entrust the applications programmer with the responsibility of the placement of data in terms of objects. SSVM does not allow the objects specified by the applications programmer to over- lap and requires them to entirely cover the region of the SVM belonging to the application. 
In addition to the reference pattern, the information about each object consists of its kind, base, size, skip, and count, which together specify the layout of its elements in the SVM. As discussed in Chapter 4 in the context of combine and split operations, the kind of an object can be sequential or stride. The base of an 119 Figure 5.1. Examples of objects for elements of a two-dimensional array. object is its start virtual address, and the size is the size of the first element of the object (which is the whole object for the sequential case or the first element in the stride case) in bytes. The skip of an object is the number of bytes to skip before the next element in the stride case, and the count is the number of times to repeat the size and the skip elements in the stride case. For example, assume that the elements of the two-dimensional array in Figure 5.1 are mapped in row major form to the SVM, and that each element needs one byte of storage. Then, A is a sequential object with size=8, B is a stride object with size=2, skip=8, and count=4, and C is a stride object with size=2, skip=7, and count=3. Objects are placed in physical memory by copying them to interested processors (as specified by the reference pattern of the object) on detecting programmer-specified object-synchronization operations which are similar to release operations. These op- erations also update the address map for the virtual pages of the object. We extend this concept and propose that when a single physical copy of an object is maintained as is done for actively-shared objects, ob ject-synchronization operations only update the address map. Note that the single copy is in local memory and not in cache, because in general, each processor cannot directly reference the cache of the other processors. The object-synchronization operations can take place asynchronously be- tween processors, through a centralized server processor, or by means of a set of 120 distributed server processors. SSVM therefore uses programmer-specified information to eliminate the perfor- mance loss due to false sharing and also to prefetch objects before their use. Note that objects are allowed to share virtual pages, and only physical blocks containing updated elements of an object are allocated. For example, only pages or cache lines containing at least one element of the stride object C in Figure 5.1 are allocated in the local memory or the cache respectively, of the processor in which object C is updated. Since all physical blocks containing at least one referenced element is allocated in a processor’s local memory, it is possible that portions of certain physical blocks are unreferenced. Note that the restriction in SSVM that disallows objects to overlap is un- necessary, and as we show in a later section, it leads to a loss in performance. Hence, our compile-time object-creation scheme allows objects to overlap and provides the programmer-specified information required by SSVM. 5.2 Compile-Time Ob ject-Creation Similar to OCRP, our object-creation scheme creates objects based on the type and the reference pattern of each variable. The type of each variable is determined as in the case of OCR?, and the data reference pattern for each code segment is computed using one of the algorithms outlined in Chapter 3. Our scheme uses the same heuristics as those used by OCR? to create objects for synchronization variables. As outlined below, it uses a different algorithm however to create objects for array and scalar variables. 
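Before turning to the algorithm, it is convenient to fix a concrete representation for the per-object information inherited from SSVM (kind, base, size, skip, and count, together with the reference pattern). The C declaration below is a sketch of such a descriptor; the field and type names are ours, and the example initializers restate the objects of Figure 5.1 under the assumption that the array is stored row-major starting at some base address.

    enum obj_kind { OBJ_SEQUENTIAL, OBJ_STRIDE };

    struct object_info {
        enum obj_kind kind;
        unsigned long base;   /* start virtual address of the object           */
        unsigned long size;   /* bytes in the first (or only) contiguous run   */
        unsigned long skip;   /* bytes to skip before the next run (stride)    */
        unsigned long count;  /* number of times size and skip repeat (stride) */
        /* ... reference pattern of the object ... */
    };

    /* The objects of Figure 5.1, with one-byte elements, would be described as:
          A: { OBJ_SEQUENTIAL, base_A, 8, 0, 0 }
          B: { OBJ_STRIDE,     base_B, 2, 8, 4 }
          C: { OBJ_STRIDE,     base_C, 2, 7, 3 }
       where base_A, base_B, and base_C are the virtual addresses of the first
       element of each object.                                                  */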
Since the case when objects are not replicated is a subset of the replication case, we restrict our discussion to the latter. Further, in order to simplify our presentation, we assume that the copies of data elements allocated in local memory or the cache are not replaced. It is easy to extend the discussion to the case when replacements occur and the data is fetched from the next level of the memory hierarchy. Goals: Object-creation for array and scalar variables is motivated by the following goals: G1 Guarantee the following conditions: (1) a copy of a data element should be 121 updated only if it will be read and (2) before any data element’s copy is read, it should be consistent with the copy last written. G2 Create objects that satisfy G1, and allow them to overlap if necessary to minimize the time taken to update objects, provided correct execution of the application is guaranteed. G3 Specify information about each object. G4 Determine when object-synchronization operations are needed. G1 determines when placement operations are necessary, and the first step to satisfy it is to determine when it is necessary to update data elements. Clearly, a data element needs to be updated only if it is read but not last written by the same processor. We consider the updates necessary for each code segment separately, and classify these updates into those which are required before and those that are needed during the execution of the code segment. The data elements which need to be updated before the execution of a given code segment are determined as follows. A data element in a given code segment can be: (1) neither read nor written, (2) only written, (3) only read, (4) first read, and (5) first written. The first three cases can be detected using the reference pattern of the data element. The data element need not be updated for cases 1 and 2, and for case 3, it needs to be updated from the copy which was last written in one of the previous code segments. The data element needs to be updated similarly for case 4, and it need not be updated for case 5. In order to distinguish between cases 4 and 5 however, the read and write references to the data element within the code segment need to be ordered, which can be done using the data dependence graph. Alternately, the data element can be updated for case 5 as well because such an update does not cause incorrect execution of the application. A data element therefore needs to be updated before a given code segment if it is read in the code segment. The next step in satisfying G1 is to determine which copy of the data element was last written and by which processor. The barrier between successive code segments 122 ensures that reads and writes of a given code segment occur before those in the following code segments. Further, as is true in most applications, we assume that within a given code segment, every data element is written by at most one processor. Hence, data read in a given code segment should be updated from the code segment in which it was last written. This code segment as well as the processor which writes the data element can be determined using the reference pattern of the data element in the previous code segments. In addition to making sure that each data element has the correct value before the execution of a code segment begins, G1 also requires that they have the correct values during the execution of the code segment. 
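The rule developed above for updates before a code segment can be stated compactly: an element must be brought up to date on a processor exactly when that processor reads it in the segment and a different processor holds the copy that was last written. A minimal C sketch follows; reads_in_segment() and last_writer_before() are placeholders for lookups into the per-segment reference pattern and are not OCDEP's actual interfaces.

    #include <stdbool.h>

    extern bool reads_in_segment(int elem, int seg, int proc);
    extern int  last_writer_before(int elem, int seg);  /* -1 if never written */

    /* Does processor proc need an updated copy of element elem before
       executing code segment seg?                                        */
    bool needs_update_before(int elem, int seg, int proc)
    {
        if (!reads_in_segment(elem, seg, proc))
            return false;                  /* not read: no update needed   */
        int w = last_writer_before(elem, seg);
        return w >= 0 && w != proc;        /* update from the last writer  */
    }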
Within a sequential code segment or a parallel code segment containing DOALL loops, it is not necessary to update a data element which is written and then read because both these references belong to the same processor. On the other hand, if these references occur within a parallel code segment containing DOACROSS loops, they could belong to different proces- sors, in which case, updates to data elements are needed within the code segment. Such updates can be added before synchronization operations, which are used by par- allelizing compilers to enforce dependencies within DOACROSS loops [117, 118, 114]. Each data element is updated from the only processor that writes it to one or more processors which read it, and these processors are determined from the data element’s reference pattern. Once the information about data elements that need to be updated are determined in this manner, the next step is to create objects such that these updates are carried out efficiently. Unlike block-level placement, object—level placement allows physical blocks to be falsely-shared. Therefore, the restriction which is present in both OCR? and SSVM that disallows objects to overlap is unnecessary. G2 removes this restriction and allows the use of optimizations that help minimize the object update time. The optimization we use is to create a single object for all the elements of a given variable that need to sent from one processor to another. G3 ensures that the information needed to carry out the object-synchronization operations properly is provided. Each object’s reference pattern is recorded when it is created, and the other information 123 Table 5.1. Inexact reference pattern and OCDEP. Exact Reference Inexact Reference Comment Pattern Pattern Written Not Written N / A Read Not Read N / A Not Read Read Updates harmless, but unnecessary Not Written Written Incorrect execution if written element is used in update can be determined by the algorithm outlined in Chapter 4 which identifies sequential and stride objects. G4 can be satisfied by inserting ob ject-synchronization operations in the processor that writes each of the objects. We refer to the sequence of read and write references to a data element by the various processors as its data exchange pattern, and it is determined by using the data element’s reference pattern in the various code segments. Since our object-creation scheme creates objects based on the data exchange pattern, we call it OCDEP, an acronym for Object Creation based on Data Exchange Pattern. As discussed in Chapter 3, when array references cannot be resolved, the reference pattern of related data elements is inexact. Considering the reference pattern to a data element by a given processor, the four possible combinations of values for the inexact and the exact reference pattern are listed in Table 5.1. Since we assume that the reference pattern is estimated conservatively, two of the cases are inapplicable. OCDE? specifies when copies of data elements are updated, and therefore an inexact data reference can sometimes cause incorrect execution of the application, as shown in Table 5.1. In related work, other researchers have used software-based schemes to keep caches consistent when cache-line-level replication is provided. For example, Cheong and Veidenbaum [108, 109, 110, 111] have proposed three schemes. 
Two of these schemes ([108] and [109, 110]) assume that global memory always has the correct copy, and use compile-time dependence analysis in order to selectively invalidate cache lines so that the correct copy is fetched from global memory on the next reference. Since these 124 schemes do not make use of the data reference pattern, they cannot determine which processor last wrote an obsolete cache line. Hence, they unnecessarily invalidate cache lines that were written by the same processor which is currently reading it. The third scheme [111] addresses this problem by associating version numbers with each cache line, and these are updated on each write during runtime. A cache line is invalidated and its latest copy is fetched from global memory only when its version number is less than the current version number for the variable. Additional hardware is however required to implement the second and the third schemes. Cytron et al. [112] also propose a scheme to manage caches in software using compile-time dependence analysis which is similar to the second scheme discussed above. OCDE? also uses dependence analysis to determine when consistency operations are necessary, but differs from the other approaches in the manner in which it makes data consistent. It uses the data reference pattern to determine which processor has the latest copy and therefore avoids the problems of the first two schemes mentioned above. It then creates objects and inserts object-synchronization operations so that data elements are made consistent in an eflicient manner. Also, it transfers data in terms of objects not blocks, and hence, does not transfer unreferenced data elements such as those in falsely-shared blocks, and further, it facilitates prefetching. Under conditions of sufficient physical memory, it allows a distributed exchange of data among the processors, while in the other schemes all processors fetch data from global memory. In the case of DOACROSS loops, with the other schemes, before executing the wait synchronization operation, a processor invalidates its cache line containing the data it requires. Other data elements in this cache line which have the current value are also invalidated in this process. When the waiting processor is signaled by another processor, the cache line containing the correct value of the variable is fetched from global memory on reference. In contrast to this pull-based approach, OCDE? uses a push-based approach where the data is pushed into the local memory of the waiting processor using object-synchronization operations. In this manner, at the cost of additional compilation overhead, OCDE? has the potential of achieving a 125 DOSEQ 51 Initialize A,B,C .31 END DOSEQ DOALL Sb I 8 1 TO n DOALL 55 J 8 1 TO n DOSEQ S} K 8 1 TO n C(I,J) 8 C(I,J) + A(I,K) * B(K,.J) Sh END DOSEQ 53 END DOALL .3; END DOALL DOSEQ 53 Output C 53 END DOSEQ Figure 5.2. Application matrix multiplication. better performance than the schemes mentioned above and in addition, it does not require extra hardware. 5.3 Compilation of Applications In this section, we illustrate OCDE? by applying it to certain applications, and in all cases we refer to the single processor that executes the initialization (termination) code as the initialization (termination) processor. The first application we consider is matrix multiplication shown in Figure 5.2. 
We assume that the parallel code segment represented by S2, S3, and S4 is control-partitioned by allocating sets of consecutive values of the loop index I to the various processors. Since the only parallel code segment contains DOALL loops, data elements need not be updated within code segments. They need to be updated before every code segment except the first one, which contains the initialization code. For the parallel code segment, OCDEP creates one object for each processor containing the elements of the rows of C which are exclusively read and written by this processor. It also creates one object for each processor containing the elements of the rows of A which are exclusively read by this processor. Further, it creates one object per processor containing all the elements of B. These objects are updated from the initialization processor to all the processors executing the parallel code segment. For the code segment containing the termination code, OCDEP allocates the elements of C which are computed by each processor to a distinct object. The objects thus created are updated from these processors to the termination processor.

Next, we consider object-creation by OCDEP for the application parallel iterative solver shown in Figure 5.3.

DOSEQ S1
    Initialize A
S1 END DOSEQ
DOSEQ S2 K = 1 TO m
    DOALL S3 I = 1 TO n
        DOALL S4 J = 1 TO n
            A(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
        S4 END DOALL
    S3 END DOALL
S2 END DOSEQ
DOSEQ S5
    Output A
S5 END DOSEQ

Figure 5.3. Application parallel iterative solver.

Let us assume that the parallel code segment represented by loops S3 and S4 is control-partitioned by assigning a different subset of the values for the I and J loop indices to the various processors. In other words, each processor computes the values of a rectangular partition of the array A for each iteration of loop S2. As in the application matrix multiplication, all parallel code segments contain DOALL loops, and therefore data elements need to be updated only before each of the code segments, except the first one containing the initialization code. For the parallel code segment corresponding to the first iteration of loop S2, OCDEP creates one object per processor containing all the elements in its partition, and these objects are updated from the initialization processor to all the processors executing the parallel code segment. The objects created by OCDEP for each of the following parallel code segments for an example processor P5 are Oj,k, j = 5, k = 2, 4, 6, 8, as shown in Figure 5.4. Here, the array elements are represented by rectangles, the different shaded regions represent distinct objects, and the number of shades in a rectangle represents the number of objects containing the corresponding element. Each object Oj,k contains elements that are written by Pj and read by Pk. For example, the region shaded black is O5,8, and contains elements written by P5 and read by P8. Further, objects are allowed to overlap, and there is no consistency problem because elements shared among overlapping objects are updated in different processors. For example, the element in the top left corner of P5's partition of array A is allocated to both O5,2 and O5,4. Finally, for the sequential code segment containing the termination code, OCDEP creates one object per processor containing all elements in its partition. The objects thus created are updated from these processors to the termination processor.

[Figure: the 3 x 3 grid of processor partitions P1 through P9; the boundary rows and columns of P5's partition are shaded to show the overlapping objects O5,2, O5,4, O5,6, and O5,8.]
Figure 5.4. Objects created by OCDEP for application parallel iterative solver.
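The elements that make up each overlapping object follow directly from the stencil: the neighbour in a given direction reads the boundary row or column of the partition facing it. The following C sketch enumerates these regions for one processor's partition; the region representation and the function are illustrative only and are not taken from OCDEP's implementation.

    /* Boundary region of a partition covering rows r0..r1 and columns c0..c1
       (inclusive) of A that is read by the neighbour in the given direction
       under the 5-point stencil of Figure 5.3.  These four regions overlap at
       the partition's corners, which is why the corresponding objects of
       Figure 5.4 overlap.                                                     */
    typedef struct { int row0, row1, col0, col1; } region_t;

    region_t boundary_read_by(char neighbour, int r0, int r1, int c0, int c1)
    {
        switch (neighbour) {
        case 'N': return (region_t){ r0, r0, c0, c1 };   /* top row      */
        case 'S': return (region_t){ r1, r1, c0, c1 };   /* bottom row   */
        case 'W': return (region_t){ r0, r1, c0, c0 };   /* left column  */
        default : return (region_t){ r0, r1, c1, c1 };   /* right column */
        }
    }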
Next, we illustrate OCDEP for an application where data elements need to be updated both before and within code segments. We consider the application in Figure 5.5, which contains a DOACROSS loop with synchronization operations [114].

DOSEQ S1
    Initialize
S1 END DOSEQ
DOACROSS S2 I = 1 TO n
    WAIT(CSYNC,I-1)
    A(I) = B(I) + C(I-1)
    SEND(ASYNC,I)
    D(I) = A(I) * 2
    WAIT(ASYNC,I-1)
    C(I) = A(I-1) + C(I)
    SEND(CSYNC,I)
    WAIT(CSYNC,I-2)
    E(I) = D(I) + C(I-2)
S2 END DOACROSS
DOSEQ S3
    Terminate
S3 END DOSEQ

Figure 5.5. Application with a DOACROSS loop.

In order to update data elements before each code segment, OCDEP creates objects as explained earlier for the other applications. In addition, it inserts object-synchronization operations before each SEND operation in order to update the copy of the corresponding data element in the local memory or cache of the processors waiting for it. The processor that writes the element and those that read it are determined using the data element's reference pattern. For example, assume that the iterations in the DOACROSS loop are control-partitioned such that each processor p executes the set of iterations starting from p until n, with a step equal to the number of processors. In this case, before the first processor executes SEND(ASYNC,1), OCDEP ensures that the copy of A(1) is updated in the second processor's local memory or cache.

Next, we illustrate OCDEP for an array variable with different reference patterns in distinct code segments. An example is the variable A which appears in the application multi-code-segment-array shown in Figure 5.6. Here, all elements of A are read in loops S1 and S3, and a fraction of the elements are written in loops S2 and S4. The loops are partitioned by allocating consecutive sets of iterations to the various processors. Note that the lower bound of the loop index in loops S1 and S2 is different from that in loops S3 and S4, thereby making the data reference patterns different. OCDEP creates one object per processor containing all the elements it reads in loop S1, and these are updated before loop S1 from the initialization processor. No objects are updated before loops S2 and S4. Each element which is read in loop S3 and written in loop S2 needs to be updated from the corresponding processor. Elements which are not written in loop S2 and for which a local copy does not exist need to be updated from the initialization processor. OCDEP creates objects to satisfy these conditions using the data reference pattern of these loops and applying the algorithm shown in Figure 5.7.

/* All referenced elements of A are read in loops S1 and S3.
   Element A(I) is written in loops S2 and S4 if I mod (WF*e) = 0. */
DOSEQ
    Initialize
END DOSEQ
DOALL S1 I = 0 TO e-1
    ... = A(I)
S1 END DOALL
DOALL S2 I = 0 TO e-1
    A(I) = ...
S2 END DOALL
DOALL S3 I = k TO e-1
    ... = A(I)
S3 END DOALL
DOALL S4 I = k TO e-1
    A(I) = ...
S4 END DOALL
DOSEQ
    Terminate
END DOSEQ

Figure 5.6. Application multi-code-segment-array.

5.4 Compilation Overhead

The algorithm used by OCDEP to create objects for updating data elements before each code segment is shown in Figure 5.7. The objects created by OCDEP that need to
    /* The following algorithm creates objects that need to be updated before
       each code segment. Each object for a given code segment contains elements
       which are read and written by a unique set of processors. rproc denotes a
       processor that reads the data element in the current code segment. pwproc
       denotes the processor that last wrote the data element in one of the
       previous code segments. It is assumed that all elements are written at
       least once before being read. */
    FOR each data element
      FOR each code segment
        FOREACH processor rproc that reads the element
          IF not (local_copy(rproc))
            IF (rproc != pwproc)
              IF object(codeseg, var, rproc, pwproc) not present
                Create object(codeseg, var, rproc, pwproc)
              ENDIF
              Add element to object(codeseg, var, rproc, pwproc)
            ENDIF
            local_copy(rproc) = TRUE
          ENDIF
        END FOREACH
        IF the data element is written
          pwproc = processor that writes the element
          local_copy(pwproc) = TRUE
          FOREACH processor proc not equal to pwproc
            local_copy(proc) = FALSE
          END FOREACH
        ENDIF
      END FOR
    END FOR

    /* Algorithm above has been applied to all variables for all code segments */
    FOR each object
      Update information
      Synchronize before code segment in processor wproc
    END FOR

    Figure 5.7. Algorithm used by OCDEP for updates before a code segment.

The objects created by OCDEP that need to be updated before a given code segment are organized as shown in Figure 5.8.

    [Figure 5.8. Object list for OCDEP. The figure shows linked records whose fields include Write Processor, Next Write Processor, Variable Name, Next Variable, Object Pointer, Element Information, Read Processor, and Next Read Processor.]

The worst-case time complexity of the algorithm in Figure 5.7 for each array variable is O(e x s x p x (log p)^2). The worst-case space complexity is (2 x e x p x s) bits for storing the reference pattern, and O(p^2) for the object list in Figure 5.8.

5.5 Performance of Applications

In this section, we provide a qualitative comparison of the performance of an application when compiled using each of the object-creation schemes OCS, OCRP, and OCDEP. The experiments we conduct to compare them quantitatively are covered in the next chapter. We assume that data is replicated and sequential consistency is maintained for the case of block-level placement. For a given partitioning, the components of the execution time which are spent on processing, memory references, and synchronization are the same for all the three schemes. They however differ in the time spent in transferring data and keeping it consistent. We consider three cases: (1) OCS with block-level placement, (2) OCRP with block-level placement, and (3) OCDEP with object-level placement.

OCS maps the application's data to the SVM without any consideration of the data reference pattern and does not provide placement information. The time spent in transferring data and keeping it consistent depends on the block size. Performance loss occurs due to false sharing and the inability to adapt to rapidly changing reference patterns. OCRP creates objects which are not falsely shared and which have temporal locality. Data transfer messages are fewer than in the case of OCS because of the prefetching facilitated by the temporal locality of the objects. By disallowing these objects from sharing virtual pages, false sharing of the corresponding physical blocks is eliminated. Hence, the number of consistency operations is fewer than that for OCS. The restriction of not allowing objects to share pages however leads to other factors such as remap operations and code generation which might result in a loss in performance. Since object-level placement allows objects to share pages, these factors are absent. Further, since data elements are updated only when necessary rather than on every write or release operation to a block, the number of consistency operations is fewer for OCDEP when compared to OCRP. We therefore expect OCDEP
with object—level placement to perform better than either OCS or OCR? with block-level placement. The virtual and physical memory requirements for OCS and OCDE? are the same. 134 OCR? requires additional virtual memory in order to ensure that objects do not share virtual pages. It is possible for portions of the physical blocks to be unused in the case of OCS and OCDE?. The temporal locality of objects created by OCR? ensure that each physical block except the last one are always fully utilized. Internal fragmentation of the last block however occurs under certain conditions and might increase the virtual and physical memory needs considerably. 5.6 Application Execution Model In the next two sections, we present the results of our initial study [55, 56] on object- level placement. In this study, we model the execution of the application parallel iterative solver when object-level placement is provided. We consider the case when objects are allowed to overlap and that when they are not, and we refer to them as OCDE? and OCDE?.NOVLP, respectively. Our model allows us to study the various factors affecting performance when object-level placement is provided, and it also helps us to determine the performance improvement obtained by allowing objects to overlap. We choose this particular application because it has been studied exten- ' sively in the literature and its loop construct occurs in several important applications including the area of image processing [119]. For example, Reed et al. [68] have stud- ied several partitioning strategies for this application including rectangular partitions which are shown in Figure 5.9. Here, an n X n array is partitioned into p partitions, each of which is a n/ r X nr/ p array, where r is divisible by p and is a divisor of both n and p. The array is divided into horizontal strips when r = 1, and vertical strips when r = p. For our study, we use a value of J}? for r, and hence the array is divided into square partitions, which is clearly possible only when p is a perfect square. The objects created by OCDE? and OCDE?.NOVL? are the same for the first parallel code segment and the termination code segment, but are different for the other parallel code segments. The objects created by OCDE?.NOVL? for each of these parallel code segments are shown in Figure 5.10. An object-synchronization operation involves the overhead of looking up the object’s information and that of sending as many messages 135 <——n/r ——> 1 2 e e e r_1 r Figure 5.9. Rectangular partitioning of a n X n array. as the number of processors in which the object needs to be updated. OCDE? limits the number of objects and messages sent by each processor to the minimum possible value, which is equal to the number of processors reading data written by this processor, while OCDE?.NOVL? does not guarantee this condition. Considering the single-element object 01 created by OCDE?.NOVLP, two messages are required to send it to P2 and P4, for which the software overhead of the object update may be much greater than the overhead of transferring the single element. Further, two more messages are sent around the same time to each of these processors to update two other objects (02 and 03 for P2; 04 and 07 for P4). Hence, we expect OCDE? to offer a better performance than OCDEPJOVLP. We model the execution of the application parallel iterative solver under the following assumptions. Each processor has a local memory and cannot address remote memory directly. 
In addition, there is a global memory that is directly ad- dressable by all processors. For efficiency, synchronization variables reside in global memory, while all other variables are replicated in local memories on reference. Pro- cessors communicate with each other by means of messages through a mesh inter- 136 .30 .0 (JD P P P a Figure 5.10. Objects created by OCDE?.NOVLP for application parallel iterative solver. connection network. Message transmission is handled by a router that can send and receive messages simultaneously in all four directions. Messages may carry objects or acknowledgements and are assumed to be of sufficient size so that it is not necessary to split an object into multiple messages. The header of each message contains the message type, source, destination, and object-ID. While sending a message, the router sends the message size which is used and then discarded by the receiving router. In- formation about relevant objects is distributed to respective processors during the initialization phase and processors retrieve information about an object using the object-ID in the received message. Object update is done in a distributed manner, that is, each processor individually sends (receives) objects to (from) relevant proces- sors. We do not model prefetching and pipelining and also ignore the time needed for address translation using the TLB and the page tables. Further, we assume that sufficient amount of main memory is available to ensure that application pages are 137 not paged out. Typically, the parallel program consists of an initialization phase when the vari- ables are read from disk into memory, a computation phase when the loop iterations are executed, and a termination phase when the results are written back to disk from memory. For the compute-intensive parallel iterativesolver, the times for initial- ization and termination are small compared to the time for computation. Further, assuming that the time taken for each outermost loop iteration is the same, it is sufficient to consider the time taken for one such iteration with i processors, tgte,(i), in comparing OCDE?.NOVL? and OCDE?. We refer to this time as the execution time and its value depends on the times spent in processing, memory references, object update, and synchronization. Though OCDE?.NOVL? and OCDE? lead to different times for updating objects, the resulting reduction in execution time, normalized by the execution time with OCDE?.NOVLP, depends on the fraction of tgm(i) spent in updat- ing objects. This fraction depends on the time taken by the other components, out of which the times for processing and memory references depend on technology, and the synchronization time depends on the method of synchronization. To compare OCDE?.NOVL? and OCDE? for two widely different ways of synchronization, we study a method using barrier synchronization (referred to as BARRIER) and another using wait.signal synchronization (referred to as HAITSIGNAL). Table 5.2 lists the model parameters and their typical values in CPU cycles. These values are based on existing multiprocessors such as the BBN’s TC2000 and processors such as the Intel’s iWARP and the MIPS R4000. We assume a 50MHz clock driving the CPU, which amounts to a CPU cycle time of 20 ns. Entries in the table that have two values are for the BARRIER and the HAIT.SIGNAL cases, respectively. BARRIER: The following code is executed by each processor for every outermost loop iteration: Step 1. Compute the value of assigned elements Step 2. 
Barrier /* All processors have finished computation */
Step 3. Send (receive) locally-written (locally-needed) objects
Step 4. Barrier /* All processors have finished object update */

Table 5.2. Parameters of the application execution model.

    Parameter                                                        Notation            Value
    Size of the message header                                       h                   6
    Context restore time                                             t_context_restore   100
    Context save time                                                t_context_save      100
    Time for a floating point addition                               t_fpa               4
    Time for a floating point multiplication                         t_fpm               8
    Global memory access time                                        t_gm                40
    Local memory access time                                         t_lm                10
    Time to look up object information                               t_os_obj_lookup     100
    Time spent in checks before sending an object message            t_os_obj_msg_send   0 / 100
    Time spent in updating counters on receiving an object message   t_os_obj_msg_rec    0 / 100
    Time spent in checks before sending an acknowledgement message   t_os_ack_msg_send   0
    Time spent in updating counters on receiving an
      acknowledgement message                                        t_os_ack_msg_rec    0
    Time to process a router receive interrupt                       t_rec_int_proc      20
    Time to initiate a router send                                   t_initiate_send     20
    Transmission time per 32-bit word                                t_t                 5

Barrier synchronization in steps 2 and 4 is implemented as follows:

    /* barrier_counter is initialized to (number of processors) */
    acquire(barrier_lock)
    decrement(barrier_counter)
    IF (barrier_counter is 0)
      /* last processor to reach the barrier */
      barrier_counter = number of processors
      barrier_done = TRUE
      release(barrier_lock)
    ELSEIF (barrier_counter is equal to (number of processors - 1))
      /* first processor to reach the barrier */
      barrier_done = FALSE
      release(barrier_lock)
      poll UNTIL barrier_done
    ELSE
      /* all other processors */
      release(barrier_lock)
      poll UNTIL barrier_done
    ENDIF

To reduce the contention on barrier_counter, we use the additional variable barrier_done. The longest time taken by a processor for a single iteration determines the overall execution time for all processors for that iteration. Processors may start the computation phase at various instants and arrive at the barrier at different instants. We assume that these delays do not increase the time spent in the barrier by more than the time for each processor to execute the barrier code serially. Processors may also start the object update phase at different instants; in the worst case, actual object update begins only after all processors have left the barrier. Similarly, processors may leave the object update phase and arrive at the second barrier at different instants, and the assumptions for the first barrier also apply to the second one. A minimal C sketch of this barrier is given below.
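The counter-plus-flag barrier described above can be written down concretely. The following C fragment is a hedged sketch, not the dissertation's testbed code: pthreads stand in for the processors, the names mirror the pseudocode, and a production barrier would additionally need sense reversal and memory fences for safe reuse across iterations.

    /* Sketch of the barrier_counter / barrier_done barrier from the model. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 4                       /* number of processors (assumption) */

    static atomic_flag barrier_lock = ATOMIC_FLAG_INIT;
    static int  barrier_counter = NPROC;  /* initialized to number of processors */
    static volatile bool barrier_done;

    static void acquire(void) { while (atomic_flag_test_and_set(&barrier_lock)) ; }
    static void release(void) { atomic_flag_clear(&barrier_lock); }

    static void barrier(void)
    {
        acquire();
        if (--barrier_counter == 0) {              /* last processor to arrive  */
            barrier_counter = NPROC;               /* re-arm for the next use   */
            barrier_done = true;
            release();
        } else if (barrier_counter == NPROC - 1) { /* first processor to arrive */
            barrier_done = false;
            release();
            while (!barrier_done) ;                /* poll until the last one   */
        } else {                                   /* all other processors      */
            release();
            while (!barrier_done) ;
        }
    }

    static void *worker(void *arg)
    {
        barrier();                  /* "all processors have finished computation" */
        printf("processor %ld past the barrier\n", (long) arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROC];
        for (long i = 0; i < NPROC; i++) pthread_create(&t[i], NULL, worker, (void *) i);
        for (int i = 0; i < NPROC; i++) pthread_join(t[i], NULL);
        return 0;
    }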
Let the worst-case times for processing, memory reference, and barrier synchronization for i processors and an n x n array be t_proc(n, i), t_mem(n, i), and t_barrier(i), respectively. If t_object(n, i, OC) is the object update time when objects are created by the policy OC, the worst-case time t_iter(n, p, OC) for executing one iteration with p processors is given by:

    t_iter(n, p, OC) = t_proc(n, p) + t_mem(n, p) + t_barrier(p) + t_object(n, p, OC)        (5.1)

Since individual processor delays in starting the processing phase are compensated by the wait at the barrier, the processing time (in step 1) is given by:

    t_proc(n, p) = (n^2 / p) × (3 t_fpa + t_fpm)                                             (5.2)

Since data is replicated in local memory on reference, the time spent in memory references (in step 1) is given by:

    t_mem(n, p) = (n^2 / p) × (5 t_lm)                                                       (5.3)

Based on the barrier implementation code and since all synchronization variables reside in global memory, the total barrier time (in steps 2 and 4) is given by:

    t_barrier(p) = 2 (7p + 1) t_gm                                                           (5.4)

Since remote memory is not directly addressable, objects are updated using messages. The mesh interconnect matches the near-neighbor communication for the 5-point stencil case and therefore, no message forwarding is needed. The shared virtual space of the user process executing the partition is also available in the supervisor state, thereby allowing the processor to directly update each received object in the appropriate portion of the shared virtual space. In calculating the object update time, we assume that all processors start the object update phase at the same time. Therefore, while a processor is sending a message corresponding to a given object, another processor is sending a similar message to it. For example, considering the objects in Figure 5.10, when P5 sends its O1 to P4, P6 is sending its O1 to P5. Each processor is involved in the following activities while sending an object (for example, consider P5 sending O1 to P4): (1) save the context of the user process (object-synchronization is implemented as a system call), (2) look up object information, (3) write the object message in the router buffer, (4) initiate the message send, (5) save context and wait for the receive interrupt (the wait time is the greater of the object message transmission time and the context save time), (6) process the interrupt for the received object message (for example, P5 now receives O1 from P6), (7) look up object information, (8) update the object in the user virtual space, (9) write the acknowledgement message in the router buffer, (10) initiate the message send, (11) wait for the receive interrupt, (12) process the interrupt for the received acknowledgement (for example, by now P5 receives the acknowledgement for O1 from P4), (13) restore context to continue the object-synchronization system call in the supervisor state, and (14) restore the context of the user process after completion of the object-synchronization system call. Here, steps (3) through (13) are repeated for every processor in which the object is updated. Since the time for the barrier in step 2 includes the time until all processors leave the barrier, it includes the time spent in waiting for processors entering the object update phase at different instants. We also assume that the additional time in sending/receiving stride objects when compared to sequential objects is insignificant compared to the time for other factors in an object update.
Hence, the time to send an object of size s to k processors, including the time to process an acknowledgement, is given by:

    t_sendobj(s, k) = t_context_save + t_os_obj_lookup
                      + k × (t_os_obj_msg_send + t_lm × (s + h) + t_initiate_send
                             + t_wait_to_receive + t_rec_int_proc + t_os_ack_msg_rec
                             + t_context_restore)
                      + t_context_restore                                                    (5.5)

The time to receive an object of size s, including the time to send an acknowledgement, is given by:

    t_recobj(s) = t_rec_int_proc + t_os_obj_lookup + t_os_obj_msg_rec + s t_lm
                  + t_os_ack_msg_send + h t_lm + t_initiate_send + h t_t                     (5.6)

In using OCDEP_NOVLP, eight objects are sent and 12 objects (three from each neighboring processor) are received, and the object update time (in step 3) is given by:

    t_object(n, p, OCDEP_NOVLP) = 4 × t_sendobj(1, 2) + 4 × t_sendobj(n/√p - 2, 1)
                                  + 8 × t_recobj(1) + 4 × t_recobj(n/√p - 2)                 (5.7)

In using OCDEP, four objects are sent and four objects are received, and the object update time (in step 3) is given by:

    t_object(n, p, OCDEP) = 4 × t_sendobj(n/√p, 1) + 4 × t_recobj(n/√p)                      (5.8)

Note that as the number of processors is increased, the times taken for processing, memory reference, and object update all decrease, while the barrier time increases. The execution times for OCDEP_NOVLP and OCDEP are obtained using Equations (5.1) through (5.8).

WAIT_SIGNAL: The following code is executed by each processor for every outermost loop iteration:

Step 1. Compute the value of assigned elements
Step 2. Call "ready-to-receive-object"
Step 3. Send (receive) locally-written (locally-needed) objects
Step 4. Call "object-reception-status"

The times spent in processing and memory references in step 1 are the same as in the BARRIER case. Before starting the object update phase (step 3), each processor needs to wait only until all the destination processors for its objects finish computation. This wait can be ensured by using wait-signal synchronization between processor pairs (steps 2 and 4). For every processor, we allocate a lock-protected "ready" flag (initialized to 0) in global memory for each processor from which it receives objects. For the case of square partitions and a 5-point stencil, each processor receives objects from its four neighbors and hence has four "ready" flags. Setting a "ready" flag to 1 indicates that the processor is ready to receive objects from the corresponding neighbor. Also, each processor maintains a counter "num-received-objects" to record the number of objects it receives. Once step 1 is completed, a system call "ready-to-receive-object" is executed (step 2), parameters to which include the number of objects to be received from each neighbor and the total number of objects to be received. This call sets "num-received-objects" to 0 and atomically sets all the "ready" flags to 1.

A processor sends an object only if the "ready" flag of the receiving processor is 1. For example, P5 in Figure 5.10 sends O1 to P2 only if the "ready" flag of P2 corresponding to P5 is 1. Further, each processor sets the "ready" flag corresponding to a given processor to 0 on receiving the last object from that processor, before sending an acknowledgement for that object. These operations guarantee that the elements are updated for a new iteration only after they have been used for the previous iteration. Also, each processor increments "num-received-objects" whenever it receives an object. After all objects have been sent (in step 3), computation for the next iteration begins only when all objects have been received; a small sketch of the send-side ready-flag check is given below.
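The send-side use of the per-neighbor "ready" flags can be sketched as follows. This C fragment is illustrative only and is not the dissertation's implementation: the flag array, the stubbed send routine, and the function names are assumptions, and a real system-call path would also save and restore context as in the BARRIER case.

    /* Sketch: send an object only after the receiver has raised the "ready"
     * flag it keeps for this sender (WAIT_SIGNAL case). */
    #include <stdio.h>

    #define MAX_PROC 16

    /* ready_flag[recv][send]: in the model this lives in lock-protected
     * global memory; here it is an ordinary array for illustration.      */
    static volatile int ready_flag[MAX_PROC][MAX_PROC];

    static void send_object(int recv, int obj_id)        /* stub router send */
    {
        printf("sending object %d to processor %d\n", obj_id, recv);
    }

    void send_objects_to(int me, int recv, const int *obj_ids, int nobjs)
    {
        for (int i = 0; i < nobjs; i++) {
            while (ready_flag[recv][me] == 0)
                ;   /* wait until the receiver has called ready-to-receive-object */
            send_object(recv, obj_ids[i]);
            /* The receiver clears ready_flag[recv][me] itself when it gets the
             * last object from this sender, before acknowledging it, and it
             * increments its num-received-objects counter on every receive.    */
        }
    }

    int main(void)
    {
        int objs[1] = { 1 };
        ready_flag[2][5] = 1;          /* pretend P2 is ready to receive from P5 */
        send_objects_to(5, 2, objs, 1);
        return 0;
    }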
The reception of objects is checked by executing the system call "object-reception-status", which compares "num-received-objects" to the expected number of objects and returns a boolean value. Since "num-received-objects" is maintained in the supervisor state, it is not directly accessible in the user state, hence the system call. These operations guarantee that each user process begins the computation phase for the next iteration only after the objects it needs for this iteration have been updated. The time for executing one iteration in the WAIT_SIGNAL case is given by:

    t_iter(n, p, OC) = t_proc(n, p) + t_mem(n, p) + t_wait_signal + t_object(n, p, OC)       (5.9)

The BARRIER case allows comparison of the OCDEP_NOVLP and OCDEP policies for a poor method of synchronization. To allow comparison for an ideal synchronization method, we assume that the "ready" flags and their locks are allocated in global memory such that operations on them can proceed in parallel in all processors. The value of t_os_obj_msg_send is higher for the WAIT_SIGNAL case than in the BARRIER case to account for the checks performed on the "ready" flags while sending an object message. Similarly, t_os_obj_msg_rec is higher to account for the time spent in keeping track of the last object received from each processor and also incrementing "num-received-objects". Therefore, the time for wait-signal synchronization is given by:

    t_wait_signal = 2 t_context_save + 2 t_context_restore + 2 × (5 × 4 × t_gm) + 3 t_lm     (5.10)

Equations (5.9) and (5.10) can be used along with the other equations derived for the BARRIER case to obtain the execution times for both OCDEP_NOVLP and OCDEP. Since all components of the execution time decrease as the number of processors is increased, the WAIT_SIGNAL case is more scalable than the BARRIER case.

5.7 Results of the Application Execution Model

In this section, we present the results for the BARRIER and the WAIT_SIGNAL cases obtained by using typical values for the model parameters as listed in Table 5.2. The results presented here are derived by varying n from 128 to 1024 in steps of 2 and, in each case, varying p from 4 to 1024 in steps of 4. Figure 5.11 shows how the reduction in object update time for OCDEP from that for OCDEP_NOVLP varies with the number of processors and the array size, in the BARRIER case. This reduction is measured as (t_object(n, p, OCDEP_NOVLP) - t_object(n, p, OCDEP)) / t_object(n, p, OCDEP_NOVLP), using Equations (5.7) and (5.8).

    [Figure 5.11. Object update time reduction due to OCDEP/BARRIER. The plot shows the reduction (%) against the number of processors for array sizes 128 x 128, 256 x 256, 512 x 512, and 1024 x 1024.]

The difference in object update times is independent of the array size and the number of processors because both OCDEP_NOVLP and OCDEP transfer the same number of elements. Also, for a given array size, t_object(n, p, OCDEP_NOVLP) decreases as the number of processors is increased and hence the reduction increases. For a given number of processors, t_object(n, p, OCDEP_NOVLP) increases as the array size is increased, and hence the reduction decreases. The curves for the WAIT_SIGNAL case shown in Figure 5.12 are similar, and all values are higher than the corresponding values in the BARRIER case because of the higher values of the parameters t_os_obj_msg_rec and t_os_obj_msg_send.

The fraction of the execution time spent in updating objects for policy OC is measured as (t_object(n, p, OC) / t_iter(n, p, OC)) × 100, using the equations derived in the previous section. Figures 5.13 and 5.14 show how this fraction varies with the number of processors and the array size, for the BARRIER and WAIT_SIGNAL cases, respectively. As the number of processors is increased, the times for processing and memory reference, which are affected by p (Equations (5.2) and (5.3)), decrease at a faster rate than the object update time, which is affected only by √p (Equations (5.5) through (5.8)). This fact explains why, in the case of both OCDEP_NOVLP and OCDEP, with a given array size, the object update fraction of the execution time increases as the number of processors is increased, for the entire range of processors in the WAIT_SIGNAL case. The same behavior occurs for a small number of processors in the BARRIER case, but for a large number of processors there is a sharp decrease in the object update fraction due to the large amount of time spent in the barrier. The drop occurs at a higher number of processors for larger array sizes because of the higher non-barrier components of the execution time in those cases. In all cases, for a given number of processors, the fraction is higher for smaller grid sizes, because of the corresponding lower times for processing and memory references. Further, the fraction for OCDEP is always lower than that for OCDEP_NOVLP because of OCDEP's smaller object update time.

    [Figure 5.12. Object update time reduction due to OCDEP/WAIT_SIGNAL (same axes as Figure 5.11).]

    [Figure 5.13. Object update time fraction/BARRIER. The plot shows the object update time as a percentage of the execution time against the number of processors for the four array sizes, for both OCDEP and OCDEP_NOVLP.]

    [Figure 5.14. Object update time fraction/WAIT_SIGNAL (same axes as Figure 5.13).]

It is seen that for the inefficient BARRIER case, the highest value of
Similarly, a faster processor will reduce the fraction of the execution time spent in processing and lead to a higher object update time fraction and consequently lower execution time. 148 9 I i T I I 8 __ 128 X 128 + I 256 X 256 CI 7 - 512 X 512 X _ 1024 x 1024 A 6 _ Execution Time 5 " _ Reduction 4 __ _ (%) 3 - _ 2 .= 1 0 j. 0 200 400 600 800 1000 1200 No. of processors Figure 5.15. Execution time reduction due to OCDEP/ BARRIER. The component of the execution time that is different for OCDE? and OCDE?.NOVL? is that spent in transferring data and keeping it consistent. Both schemes achieve these functions by creating and transferring data objects. The time spent in transfer- ring objects depends on the number of objects, the number of processors that share each object, and the amount of data transferred. When the transfer of each object between a given pair of send and receive processors requires one message, the best performance is achieved for OCDE? which creates the least number of objects. OCDE? achieves this better performance by exploiting the nature of the application’s data exchange patterns when creating objects. It minimizes the number of objects by al- lowing them to overlap and combining objects which are created by OCDE?.NOVL? that are updated around the same time. Our results demonstrate that performance can be improved by allowing objects to overlap, and allocating to a single object elements of an array variable that need to be updated around the same time. It follows that when possible, objects of different array variables that need to be updated around the same time should also be combined and sent in a single message. 149 100 I T I T I 128 X 128 + 256 X 256 CI 80 r 512 x 512 x a 1024 X 1024 A Execution 60 r 4 Time - Reduction r-‘i‘ (%) 40 ~ 20 - 4A 0 1 1 1 1 0 200 400 600 800 1000 1200 No. of processors Figure 5.16. Execution time reduction due to OCDE?/NAI'LSIGNAL. 5.8 Summary In this chapter, we presented our new compile-time object-creation scheme which achieves compiler-directed object-level placement. We outlined the algorithm used by our scheme and derived the compilation overhead in applying it. We discussed how compiler-directed ob ject-level placement can provide a better performance than block-level placement. We also identified possible performance optimizations in pro- viding object-level placement by modeling the execution of a typical application. These optimizations motivate the performance measures we use in the next chapter to quantitatively compare various types of placement schemes. CHAPTER 6 EXPERIMENTAL COMPARISON OF DATA PLACEMENT SCHEMES In this chapter, we present the results of our experimental study in which we compare various data placement schemes that can be used when providing a shared virtual memory (SVM) in a N UMA multiprocessor. As discussed in the previous chapters, these schemes differ in the degree to which the compiler controls data placement operations. We first state the goals, assumptions, and performance measures of our study. We then outline the manner in which we determine the performance measures for the various data placement schemes. After discussing in detail the results of our experiments for various applications, we end by summarizing our conclusions. 6.1 Goals and Assumptions In the previous chapters, we presented our work on compiler-assisted block-level and compiler-directed object-level placement. 
We claimed that the compiler’s knowledge about the data reference pattern can be better used if it is allowed more control over data placement operations. Our goal in this chapter is to demonstrate our claim by quantitatively comparing data placement schemes which differ in the degree to which the compiler controls data placement operations. We consider block-level and ob ject- 150 151 level placement, and vary the amount of compiler control in data placement operations by compiling the application with different ob ject-creation schemes. More specifically, we compare the performance of a given application for each of the following cases: (1) block-level placement with no compiler assistance (compiled with OCS), (2) compiler- assisted block-level placement (compiled with OCR?), and (3) compiler-directed object- level placement (compiled with OCDE?). We refer to these three cases as OCS, OCR?, and OCDE?, respectively. Since our study [55, 56] which modeled object-level placement demonstrates that there is performance improvement in allowing objects to overlap, we choose OCDE? instead of OCDE?.NOVL? for the third case. We make the following assumptions. A single copy of the application exists in the hard disk, and a server processor is responsible for the transfer of data between the disk and the main memory. The server executes all the sequential code segments of the application and is distinct from the set of processors which execute the parallel code segments. We restrict ourselves to data placement at a given level of the mem- ory hierarchy (cache or main memory), and assume an infinite amount of physical memory in that level. The latter assumption is typical of studies on memory perfor- mance and removes the non-determinism introduced by replacement schemes. The interconnect allows reliable FIFO communication between any two processors, and we do not assume any specific broadcast or multicast capabilities. Each processor has an instruction cache which is large enough to hold its code, and we ignore instruction references. 6.2 Performance Measures In general, the execution time of a parallel application is comprised of times spent on initialization, termination, processing, synchronization, and memory references. The times spent on initialization and termination is dominated by the time taken for the transfer of data between the disk and the main memory of the server, and is the same for all the three cases of our study. Further, the times spent on process- ing and synchronization are the same for the three cases. Hence, we consider only 152 the time spent in memory references and this time depends on the data placement scheme. In the case of OCS and OCRP, we assume block-level data replication within each code segment, with sequential consistency maintained using the write-update (WU) scheme. The remap operations between successive code segments in the case of OCR? deallocate and allocate physical blocks and hence, migrate data under cer- tain conditions. In the case of OCDE?, object-synchronization operations are used to replicate data and keep copies consistent as and when necessary. Since data is replicated in all the three cases, any memory reference is serviced locally. The three cases however differ in the amount of time spent for (1) trans- ferring data to local memory on a data miss, and (2) keeping the various copies of data consistent. 
We compare the software overhead involved in transferring data by measuring in each case the number of data transfer and consistency messages, and the total number of messages. We also compare the overhead for actual transmission of data by measuring in each case the total amount of data transferred during both consistency and data transfer messages. Also, both OCR? and OCDE? package data into objects, and we compare the corresponding overhead by recording the total number of objects transferred by each of these schemes. Finally, in order to study the severity of fragmentation when applying OCRP, we measure the amount of physical and virtual memory required for the three cases as a percentage of the problem size. Other studies on providing a SVM in NUMA multiprocessors such as [51] have also measured performance by the number of messages and the amount of data transferred. Our performance measures do not account for the runtime overhead of adaptive place- ment, which is higher for OCS compared to the other schemes. It also does not include the possible performance loss due to inefficient code generation when applying OCRP. This loss is however insignificant when compared to the additional runtime overhead for OCS. Therefore, in comparing all the three cases, our results are skewed in favor of OCS. In comparing OCR? and OCDE?, our results are skewed in favor of OCRP. 153 6.3 Methodology In the case of OCS, we calculate the performance measures separately for each code seg- ment as follows. We determine each data element’s reference pattern using Algorithm A outlined in Chapter 3. For the purposes of our experimental study, we restrict the implementation of Algorithm A to handle applications’in which: (1) the parallel code segment consists of a perfectly nested loop with the lower bound, upper bound, and step being integer constants, (2) the depth of the loop is at most three, (3) the di- mension of the array variables is at most two, (2) there are no conditional statements, and (3) array references have linear subscripts, i.e., of the form (a X i + b), where a and b are integers and i is the loop index. We calculate each block’s reference pattern based on the reference pattern of its data elements. For example, consider a block which is referenced by p processors and written w times, and assume a block size of b data elements. For each of the p processors, the block is transferred if it does not already exist in the local memory of the processor. Assuming that the location of the block can be looked up locally and does not require any messages, each such transfer needs two messages, one to request for the block and the other to receive it. Hence, the number of data transfer messages for the block is (2 X p). Each write needs to update the block’s copy in the server as well as the other referencing processors. Assuming again that the information about the referencing processors can be looked up locally and does not require any messages, two messages are needed for updating each copy, one to send the update and the other to receive an acknowledgement. Hence, the number of consistency messages is (2 X p X w). We ignore the data sent during block requests and WU acknowledgements, and therefore, the total amount of data transferred is (p X b + p X to). Such a calculation gives us the steady-state performance because we assume that all processors start execution of the code segment at the same time. 
In practice, processors incur different delays and therefore it is possible for one of them to write a block before it is fetched by some of the others, in which case the block’s copy in those processors need not be 154 updated. In the case of OCR? and OCDE?, we distinguish between consistency and data trans- fer messages as follows. We consider messages sent from the server after the initializa— tion code segment and those sent by the other processors before the termination code segment as data transfer messages. The rest of the messages are used for consistency operations within and between code segments, and we consider them as consistency messages. OCR? transfers data in terms of objects, and maintains block-level consis- tency within parallel code segments. In addition, it maintains consistency between code segments by the remap operations which transfer data in terms of objects. The number of objects and the amount of data transferred for the data transfer and the remap operations of OCR? is determined using the algorithm outlined in Chapter 4. The number of messages and the amount of data transferred for the block-level con- sistency operations of OCR? are determined based on each block’s reference pattern as in the case of OCS. OCDE? on the other hand uses object-synchronization operations both for data transfer and consistency operations. We determine the number of ob- jects and the amount of data transferred for OCDE? using the algorithm outlined in Chapter 5. We define a data transfer unit as containing data to be transferred between a given pair of send and receive processors. In our previous study [55, 56], we considered each object as a single data transfer unit. We demonstrated that performance can be improved by minimizing the number of objects and hence the number of data transfer units. We did so by allowing objects belonging to the same array variable to overlap and combining several of these objects into a single object. On similar lines, performance can be improved by combining into a single data transfer unit objects of different array variables that need to be transferred around the same time between the same pair of send and receive processors. We apply the above optimization for both OCR? and OCDE?, and assume that two messages are needed for each data transfer unit, one for the actual transfer and the other for the acknowledgement from the receiving processor after it updates the ob- jects in its local memory. In this manner, we determine the number of consistency 155 and data transfer messages for OCR? and OCDE? which are required to transfer ob- jects during consistency and data transfer operations, respectively. As in the case of OCS, we do not include in the amount of data transferred data which is sent during acknowledgements for the data transfer units. Both OCS and OCDE? map the application’s data contiguously to the SVM. They require the same amount of virtual memory, and further, they both guarantee that all virtual pages except possibly the last one is fully utilized. We determine their virtual memory needs by using the information about the number of data elements of the application and the virtual page size. OCR? on the other hand uses more virtual memory than the other two schemes because it ensures that objects for a given code segment do not share virtual pages. Assuming that virtual space for objects can be reused in each parallel code segment, the additional virtual memory needed by OCR? 
is the maximum out of that required for each of the parallel code segments. OCR? guarantees that all virtual pages of a given object except possibly the last one are fully utilized. Hence, fragmentation of virtual memory can occur when there are few elements in the last virtual page of an object relative to the page size. We determine the virtual memory needs of OCR? by using information about its objects and the virtual page size. Both OCS and OCDE? allocate in local memory any non-local block which contains at least one referenced data element. They both use the same amount of physical memory, but portions of physical blocks might be unreferenced. We determine their physical memory needs by using each block’s reference pattern and the physical block size. OCR? on the other hand requires physical memory for the objects it creates for each of the code segments. The remap operations of OCR? deallocate blocks at the end of each parallel code segment and therefore physical memory can be reused. The amount of physical memory for OCR? is that required by the server and the maximum out of that needed for each of the parallel code segments. OCR? guarantees complete utilization of all physical blocks except possibly the last one used by each object. Fragmentation of physical memory occurs only if there are very few elements mapped to the last block relative to the block size. We determine the physical memory needs 156 OCDEP _— DCS _— Block size=64 0 Block size=256- + Block size=1024 Cl Block size=4096 >< WF=.010 * WF=.256 A WF=.600 ° Figure 6.1. Legend for the performance graphs. of OCR? using information about its objects and the physical block size. Since OCDE? requires the same amount of virtual and physical memory as OCS, we omit it in our performance graphs for virtual and physical memory, and plot only the results for OCS and OCRP. 6.4 Experimental Results We choose our workload to highlight when the different amounts of compiler control in data placement leads to a difference in performance and when it does not. Our workload consists of applications matrix multiplication and parallel iterative solver, and also the application multi-code-segment-array for the three different values (.01, .256, and .6) of the parameter WF. In all cases, we assume that each element occupies four bytes of storage, and that both OCS and OCDE? maps array variables to the SVM in column major form. We conduct experiments for block sizes of 64, 256, 1024, and 4096 bytes, which are representative of block sizes at both the cache and the main memory levels of the memory hierarchy. In the case of each application, for each of the three schemes, we determine the number of consistency 157 DDSED Sh Initialize A,B,C .9, END DOSEQ DOALL 55 I - 1 TO n DOALL 53 J . 1 r0 n DOSEQ s; K - 1 TO n C(I,J) . C(I,J) + A(I,K) * B(K,J) Sh END DOSEQ 53 END DOALL Sb END DOALL DOSEO 53 Output C 53 END DOSEQ Figure 6.2. Application matrix multiplication. and data transfer messages, the total number of messages, the total number of objects, the amount of data transferred, and the physical and virtual memory requirements. Note that the number of objects are applicable only for OCR? and OCDE?, and the virtual memory requirements is applicable only when the block is a virtual page. The symbols used in all our performance graphs are shown in Figure 6.1, where the block size is specified in bytes. 6.4.1 Application matrix multiplication Our first application matrix multiplication is shown in Figure 6.2. 
It consists of array variables which occur in a single parallel code segment and hence, no remap operations are needed between parallel code segments. We consider multiplication of 128 X 128 square matrices and partition the application by dividing the matrix 0 into as many square partitions as the number of processors (p). The iterations of loops 158 3.5€+07 I I I I I I OCS; Block size=64 €— 3e+07 ._ OCS; Block size=256 -j— - OCS; Block size=1024 B- 2.5e+07 __ OCS; Block size=4096 -)(—- ‘ Consistency 2e+07 b _ Messagesl5e+o7 — ~ 1e+07 — — 56+06 — V O O -‘ 0 1 1 1 n 1 1 0 10 20 30 40 50 60 70 No. of processors Figure 6.3. Consistency messages vs. Number of processors (matrix multiplication for 128 X 128 square matrices). 5'2 and 53 are each divided into J13 sets and each processor executes one of the p resulting combinations. With such a partitioning, elements of A and B are shared read-only, those of C are each read and written by a single processor, and no data element exhibits read-write sharing. OCR? and OCDE? both apply optimizations to reduce the number of objects and therefore create the same number of objects. We discuss below the other performance measures. Consistency Messages: OCR? eliminates false sharing, and OCDE? updates copies of elements only when necessary. Since no data element is shared in a read-write fashion, the number of consistency messages is zero for both OCR? and OCDE?. OCS on the other hand suffers from false sharing and has a non-zero number of consistency messages. Figure 6.3 shows how the number of consistency messages varies with the number of processors and the block size for OCS. For a given number of processors, this number increases as the block size increases because of the larger amount of false sharing. The consistency overhead due to false sharing is absent for blocks belonging to A and B because they are not written. It exists for the blocks of matrix C and is 159 35000 I r I I m A OCDE? — 30000 _ OCS; Block size=64 @— OCS; Block size=256 -+— _DCS; Block size=1024 B— DCS; Block size=4096 -X—- 25000 Data 20000 'IIansfer Messages 15000 10000 - 5000 - 0 l l 0 10 20 30 40 50 60 70 No. of processors Figure 6.4. Data transfer messages vs. Number of processors (matrix multiplication for 128 X 128 square matrices). proportional to the number of processors sharing each such block. Since the array elements are mapped to the SVM in column major form and C is partitioned into square partitions, at most ‘43 processors can share each block of C, as long as it does not cross one of the fi column partitions. For our experimental parameters, this condition is satisfied for block sizes of 1024 and 4096, and hence they have an equal number of consistency messages. For a given block size, the number of consistency messages is higher for more processors because each block of C is shared by more processors. Data 'D'ansfer Messages: Both OCR? and OCDE? fetch data in terms of objects and apply optimizations that reduce the number of data transfer units needed to transfer these objects. For example, objects for array variables A, B, and C that need to be sent from the server to a given processor are all combined into a single data transfer unit. For both schemes, the number of data transfer messages is the same and further, it is independent of block size. Due to the lack of compiler assistance, OCS transfers data in terms of blocks. It replicates in local memory any non-local block containing a referenced data element. 
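For the OCS scheme, the per-block accounting outlined in Section 6.3 can be turned into a small calculation. The following C fragment is only a sketch of that bookkeeping, and the numbers in main are illustrative rather than the experimental parameters: for a block of b elements referenced by q processors and written w times, it reports the 2q data transfer messages, 2qw consistency messages, and qb + qw elements moved.

    /* Sketch: steady-state OCS cost for one falsely-shared block, following
     * the per-block accounting of Section 6.3 (assumed helper names). */
    #include <stdio.h>

    struct block_cost { long transfer_msgs, consistency_msgs, elements_moved; };

    static struct block_cost ocs_block_cost(long q, long w, long b)
    {
        struct block_cost c;
        c.transfer_msgs    = 2 * q;        /* request + reply per sharing processor */
        c.consistency_msgs = 2 * q * w;    /* update + acknowledgement per write    */
        c.elements_moved   = q * b + q * w;
        return c;
    }

    int main(void)
    {
        /* Illustrative numbers only: a 1024-byte block (256 four-byte elements),
         * shared by 4 processors, with 256 writes during the code segment. */
        struct block_cost c = ocs_block_cost(4, 256, 256);
        printf("transfer=%ld consistency=%ld elements=%ld\n",
               c.transfer_msgs, c.consistency_msgs, c.elements_moved);
        return 0;
    }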
Therefore, the number of data transfer 160 2e+07 I m I I I I 1.8e+07 - OCS Bl k OCDEéI: Z ; oc Size: 1.6e+07 ’- OCS; Block size=256 +— 1.4e+07 _OCS; Block size=1024 B— OCS; Block size=4096 *— 1.2e+07 8 Data 'Il'ansferred le+07 - (in elements) 8e+06 - _ 6e+06 - - 4e+06 — — 2e+06 - /v\ 4) _ 0 ‘ Jun—M4 0 10 20 30 40 50 60 70 No. of processors Figure 6.5. Data transferred vs. Number of processors (matrix multiplication for 128 X 128 square matrices). messages for OCS depends on the block size. Figure 6.4 shows how the number of data transfer messages for OCDE? and OCS varies with the number of processors and the block size. It is seen that the number for OCS is much higher than that for OCDE? in the case of all block sizes studied. For a given number of processors, OCS needs more data transfer messages for smaller block sizes due to the reduced benefits of prefetching. For all schemes, the number of data transfer messages increases as the number of processors increases because data needs to be sent to more processors. Amount of Data 'Il'ansferred: OCDE? transfers only those data elements that will be read and actually need to be transferred. In general, OCR? transfers more data than OCDE? due to remap operations. Since these operations are absent for this application, both OCDE? and OCR? transfer the same amount of data, and further, this amount is independent of the block size. OCS transfers data in terms of blocks and each falsely-shared block contains data that is unreferenced and is therefore transferred unnecessarily. In addition, since consistency messages out number data transfer messages (Figures 6.3 and 6.4), they account for a significant fraction of the amount of data transferred. 161 3.56-1-07 l I I I I I OCDE? — 3.34.07 _ OCS; Block size=64 €- DCS; Block size=256 —l— _0CS; Block size=1024 B— DCS; Block size=4096 -x—- 2.5e+07 2 .- _ Total e+07 Messages 1.5e+07 - - 1e+07 - ~ 5e+06 *- W O <> - 0 l I L I 1 l 0 10 20 30 40 50 60 70 No. of processors Figure 6.6. Total messages vs. Number of processors (matrix multiplication for 128 X 128 square matrices). Figure 6.5 shows how the amount of data transferred varies with the number of processors for OCDE? and OCS, and with the block size for OCS. OCDE? transfers fewer data elements than OCS for all values for the block size and the number of processors. OCS transfers almost the same amount of data for block sizes of 1024 and 4096 because they have an equal number of consistency messages. For a given number of processors, it transfers more data for larger blocks because of the larger amount of false sharing. Both OCDE? and OCS transfer more data for a higher number of processors because some of the data is now shared by more processors. Total Messages: OCDE? and OCR? have the same number of consistency and data transfer messages and therefore the same total number of messages. Since consistency messages out number data transfer messages for OCS, the plots for the total number of messages shown in Figure 6.6 are similar to those for the consistency messages (Figure 6.3). Since the number of consistency messages for OCDE? is zero, the plot for its total number of messages is similar to that for its data transfer messages (Figure 6.4). It is seen that the total number of messages for OCDE? is much smaller than that for OCS for all values of the block size and the number of processors. 
162 3000 r T T i r I 2500 — OCS —" ~ Block size=64 0 Block size=256 + Block size=1024 Cl - Block size=4096 X 1 Physical 2000 Memory Requirements 1500 — s (% of problem size) 1000 _ q 500 _ v ”H: . _ ' _ _ . . . . . . . . - - : ' ’ " _1 0 1 1 1 1 I 1 0 10 20 30 40 50 60 70 No. of processors Figure 6.7. Physical memory requirements (% of problem size ) vs. Number of processors (matrix multiplication for 128 X 128 square matrices). Physical Memory Requirements: Figure 6.7 shows the variation of physical mem— ory requirements for OCR? and OCS with the block size and the number of processors. For small block sizes, the physical memory needs for OCR? and OCS are the same. For large block sizes however, OCR? needs less physical memory than OCS because its fragmentation is not severe and because OCS leaves larger unreferenced portion in the physical blocks. For a given number of processors, both schemes require more physical memory for larger blocks because of the higher fragmentation for OCR? and the higher degree of false sharing for OCS. For a given block size, both need more physical memory for more processors because of the increased number of objects for OCR? and the increased degree of sharing of blocks for OCS. Virtual Memory Requirements: Figure 6.8 shows the variation of virtual memory requirements for OCR? and OCS with the block size and the number of processors. For the parameters used in our experiments, for all values of the block size and the number of processors, the virtual memory needed by OCS is almost the same as the problem size. OCR? however requires additional virtual memory and this amount is higher for larger blocks and more processors because of the increased potential for 163 400 I I I I I 350 - OCS — - Block size=64 O 300 - Block size=256 + _ . . . ' X '1 , Block size=1024 Cl __..-"“ Vlrtual 250 - Block size=4096 x , , , . . . . .. 1 Memory Requirements 200 _. g ....... 3.. ~. -. '. '. ' .............. ' .................. a a (% of problem size) 150 r a 100 - 3 fi 9 - 50 - ~ 0 1 1 1 1 1 1 0 10 20 30 40 50 60 70 No. of processors Figure 6.8. Virtual memory requirements (% of problem size ) vs. Number of pro- cessors (matrix multiplication for 128 X 128 square matrices). fragmentation. Summary: OCR? and OCDE? perform several orders of magnitude better than OCS in terms of the total number of messages and the amount of data transferred. For these performance measures as well as the number of objects, OCR? and OCDE? perform the same because of the absence of remap operations and that of block-level consistency operations within parallel code segments. OCR? needs more virtual memory and less physical memory than OCDE? and OCS. 6.4.2 Application parallel iterative solver Our second application parallel iterative solver is shown in Figure 6.9. It con- sists of an array variable which occurs in more than one parallel code segment, but with the same reference pattern in each of them. Again, no remap operations are nec— essary between parallel code segments. We consider parallel iterative solution with a 5-point stencil on a 130 X 130 array. The elements of the interior 128 X 128 array need to be computed, and we partition the computation by dividing this array into 164 DOSEO Sh Initialize A 51 END DOSEQ DOSEQ 55 K 8 1 TO m DOALL Sh I 8 1 TO n DOALL 50 J 8 1 TO n A(I,J) 8 0.25 8 (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1)) Sh END DOALL 53 END DOALL Sb END DOSEQ DOSEO 5% Output A 53 END DOSEQ Figure 6.9. Application parallel iterative solver. 
as many square partitions as the number of processors (p). Unlike the previous ap- plication matrix multiplication, data elements are shared in a read-write manner in the parallel code segments of this application. As mentioned earlier, we assume block-level replication with sequential consistency in the case of OCS and OCRP. When data is placed in this manner, in practice, two arrays e.g., C and D are used instead of the single array A. Odd iterations of loop 52 read data from C and write to D, while even iterations of loop 5; read data from D and write to C. Such a scheme ensures that the values of the elements for the previous iteration of loop 52 are not overwritten before being used for the current iteration. Since the data reference pattern for C and D are the same as that of A, for our experiments, we consider a single array and further, we measure the performance for one iteration of loop 52. Also, we assume that processors compute either for a given number of iterations of loop 52 or until convergence is attained. In the latter 165 400000 I I I N OCRP---- 350000 - OCDEP — " OCS; Block size=64 €- 300000 - OCS; Block size=256 —l— a 8 I OCS; Block size=1024 E— 250000 b 0C3; Block size=4096 -)<—— - Clt‘)dnsistency200000 .- ' d essages 100000 — / _ 2. I ‘4) 50000 - M - 0 l l J ...... 1 ....... 1 ....... 1. . . 0 10 20 30 40 50 60 70 No. of processors Figure 6.10. Consistency messages vs. Number of processors (parallel iterative solver) case, each processor locally checks the convergence of the elements belonging to its partition. Since the computation involved in checking for convergence is the same for all the three object-creation schemes, we do not include it in our performance measure. Consistency Messages: The parallel code segments corresponding to the itera- tions of loop 52 each contain a DOALL loop. Therefore, OCDE? does not maintain consistency within these code segments, but uses ob ject—synchronization operations to maintain consistency between them. OCR? and OCS on the other hand maintain block- level consistency within these code segments for each write operation. OCS however incurs more consistency operations than OCR? because its blocks are falsely-shared. As shown in Figure 6.10, for all the three schemes, the variation of the number of consistency messages with the number of processors is similar to that for application matrix multiplication. Also, the number of consistency messages is independent of the block size for OCDE? and OCRP, while in the case of OCS, for a given number of processors, it is higher for larger blocks. As expected, the number of consistency mes- sages is the least for OCDE? and is the maximum for OCS. Figure 6.11 shows that the 166 8000 T I I I I I 7000]- OCRP"" ‘ OCDEP — . - ' ' 6000 - a 5000 - a Cleansmtency 4000 __ _ essages 3000 - - 2000 - 4 1000 - _ 0 _..__h——==T . I 1 1 0 10 20 30 40 50 60 70 N o. of processors Figure 6.11. Consistency messages vs. Number of processors (parallel iterative solver;OCRP,OCDE?) number of consistency messages for OCDE? is significantly lower than that for OCRP. Note that this number is for just one iteration of the loop 5;. Data 'It‘ansfer Messages: As in the case of application matrix multiplication, both OCDE? and OCR? apply optimizations in transferring objects and incur the same number of data transfer messages. Moreover, this number is independent of the block size for both schemes. The number of data transfer messages for OCS is higher than that for OCDE? 
Data Transfer Messages: As in the case of application matrix multiplication, both OCDEP and OCRP apply optimizations in transferring objects and incur the same number of data transfer messages. Moreover, this number is independent of the block size for both schemes. The number of data transfer messages for OCS is higher than that for OCDEP and OCRP, and for a given number of processors, it is higher for smaller blocks. As shown in Figure 6.12, the number of data transfer messages for all three schemes increases for a larger number of processors, as in the case of application matrix multiplication.

Figure 6.12. Data transfer messages vs. Number of processors (parallel iterative solver).

Amount of Data Transferred: During consistency operations, both OCDEP and OCRP transfer the data elements written in a given iteration to the processors that will read them in the next iteration. Further, due to the absence of remap operations, OCRP does not transfer data between parallel code segments. Also, as in the case of application matrix multiplication, both schemes transfer the same amount of data during data transfer messages as well. The amount of data transferred is hence the same for both OCDEP and OCRP. Also, this amount is independent of the block size for both schemes. OCS transfers more data than OCDEP and OCRP because it incurs more consistency messages, and further, it unnecessarily transfers unreferenced data in falsely-shared blocks. For a given number of processors, OCS transfers more data for larger blocks because of the higher amount of false sharing. As shown in Figure 6.13, the amount of data transferred for the three schemes increases for more processors, as in the case of application matrix multiplication.

Figure 6.13. Data transferred vs. Number of processors (parallel iterative solver).

Total Messages: Since consistency messages are fewer for OCDEP than for OCRP, and the data transfer messages for both schemes are equal, the total number of messages is fewer for OCDEP than for OCRP, as shown in Figure 6.14. In addition, the total number of messages for both schemes is independent of the block size because its component consistency and data transfer messages are independent of the block size. Since consistency messages outnumber data transfer messages for OCS, the graph for the total number of messages shown in Figure 6.15 is similar to that for the number of consistency messages (Figure 6.10). It is seen that the total number of messages is the smallest for OCDEP and the largest for OCS for all values of the block size and the number of processors.

Figure 6.14. Total messages vs. Number of processors (parallel iterative solver; OCRP, OCDEP).

Figure 6.15. Total messages vs. Number of processors (parallel iterative solver).
Total Objects: OCRP creates objects based on the reference pattern, while OCDEP creates them based on the data exchange pattern. Unlike application matrix multiplication, due to the presence of read-write sharing of data in this application, OCRP creates more objects than OCDEP (Figure 6.16).

Figure 6.16. Total objects vs. Number of processors (parallel iterative solver; OCRP, OCDEP).

Physical Memory Requirements: OCS and OCDEP require the same amount of physical memory. In the case of OCRP, fragmentation is severe for large blocks because of the single-element objects. Hence, the physical memory required by OCRP is higher than that required by OCS and OCDEP, in spite of the unreferenced portions in the blocks created by the latter schemes. For small blocks, however, fragmentation for OCRP is not severe, and it uses less physical memory than OCS and OCDEP. Figure 6.17 shows the variation of the physical memory requirements for OCRP and OCS with the block size and the number of processors, which is similar to that for application matrix multiplication.

Figure 6.17. Physical memory requirements (% of problem size) vs. Number of processors (parallel iterative solver).

Virtual Memory Requirements: Figure 6.18 shows the variation of the virtual memory requirements for OCRP and OCS with the block size and the number of processors. For the parameters used in our experiments, the virtual memory needed by OCS and OCDEP is almost the same as the problem size, and is lower than that needed by OCRP. The potential for fragmentation is higher for larger blocks and more processors, and hence the virtual memory needed by OCRP is correspondingly higher.

Figure 6.18. Virtual memory requirements (% of problem size) vs. Number of processors (parallel iterative solver).

Summary: In terms of the total number of messages, OCDEP performs the best and OCS performs the worst. OCRP sends more messages than OCDEP due to the block-level consistency operations within the parallel code segments. Both OCRP and OCDEP transfer the same amount of data, which is less than that for OCS. OCRP transfers more objects than OCDEP, and requires more virtual and physical memory than OCDEP and OCS.

6.4.3 Application multi-code-segment-array

We choose our next application with the aim of comparing the three schemes for array variables with a different reference pattern in distinct code segments. A typical application usually consists of several such array variables. In the presence of an infinite number of blocks, however, the performance of each of them can be considered separately.
Further, a multi-dimensional array variable is mapped to the SVM in a linear fashion, and is therefore equivalent to a one-dimensional array variable. Also, the performance of an array variable which occurs in more than two parallel code segments is characterized by its performance for each successive pair of code segments for which its reference pattern is different. Our next application, multi-code-segment-array, shown in Figure 6.19, consists of a one-dimensional array variable with one such pair of parallel code segments. An example of an application with such an array variable is the Burg algorithm [120] used in the area of image processing. All elements of A are read in loops S1 and S3, while a fraction of the elements are written in loops S2 and S4. This fraction is characterized by the parameter write frequency (WF). For example, an element i is written if (i mod (WF × e)) is zero, where e is the total number of array elements. As WF increases, fewer elements are written, and we vary WF to simulate various amounts of consistency overhead for the falsely-shared blocks of OCS. We choose a value other than zero for k, the lower bound of I in loops S3 and S4, in order to change the reference pattern of A in loop S3. For our experiments, we choose array references such that there is no read-write sharing of data within the parallel code segments. The results obtained can be extrapolated to the case when read-write sharing is present by using the results for application parallel iterative solver. We conduct experiments for e = 8192, k = 1000, and various values of WF. We present the results for those values of WF which demonstrate both the case when the consistency overhead of false sharing is low enough to allow OCS to perform as well as OCRP and OCDEP, and the case when it is not.

We partition the loops Si, 1 ≤ i ≤ 4, by assigning a consecutive set of values of the loop index I to each processor. OCRP creates one private read-only object per processor for loop S1 and a private write-only object per processor for loop S2. Since the set of elements in a given processor's object for loop S2 is a subset of that in its object for loop S1, no remap operations are necessary. A similar argument also applies for loops S3 and S4. The difference in the lower bound of I from loop S2 to loop S3 changes the reference pattern and necessitates remap operations between these loops.

    /* All referenced elements of A are read in loops S1 and S3.
       Element A(I) is written in loops S2 and S4 if I mod (WF * e) = 0. */
        DOSEQ
          Initialize
        END DOSEQ
    S1: DOALL I = 0 TO e-1
          ... = A(I)
        END DOALL
    S2: DOALL I = 0 TO e-1
          A(I) = ...
        END DOALL
    S3: DOALL I = k TO e-1
          ... = A(I)
        END DOALL
    S4: DOALL I = k TO e-1
          A(I) = ...
        END DOALL
        DOSEQ
          Terminate
        END DOSEQ

Figure 6.19. Application multi-code-segment-array.
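The sketch below illustrates the partitioning and write-frequency condition just described: each processor is assigned a consecutive range of indices, and within that range it writes only the elements selected by WF in loop S2. The function names and the use of a floored write stride are illustrative assumptions, not code from the dissertation.

```c
#include <stdio.h>

/* Consecutive-index partitioning of e elements among p processors:
 * processor q owns indices [lo, hi); the last processor absorbs any remainder. */
static void partition(int e, int p, int q, int *lo, int *hi)
{
    int chunk = e / p;
    *lo = q * chunk;
    *hi = (q == p - 1) ? e : *lo + chunk;
}

/* Number of elements processor q writes in loop S2: elements i in its range
 * with (i mod stride) == 0, where stride is taken here as floor(WF * e). */
static int writes_in_s2(int e, int p, int q, double wf)
{
    int lo, hi, count = 0;
    int stride = (int)(wf * e);
    if (stride < 1) stride = 1;
    partition(e, p, q, &lo, &hi);
    for (int i = lo; i < hi; i++)
        if (i % stride == 0) count++;
    return count;
}

int main(void)
{
    /* e = 8192 elements, 64 processors, WF = .01 as in the experiments */
    for (int q = 0; q < 4; q++)
        printf("processor %d writes %d elements in S2\n",
               q, writes_in_s2(8192, 64, q, 0.01));
    return 0;
}
```

A processor that writes at least one element in this way counts toward the writer count w used in the consistency-message expressions discussed below.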
Consistency Messages: Since there is no read-write sharing of data in this application, in the case of both OCRP and OCDEP, consistency messages transfer objects between parallel code segments. They are almost equal in number for both schemes, and further, this number is independent of the block size. Consistency messages for OCS, on the other hand, maintain block-level sequential consistency for each write operation, and their number depends on the block size. Figure 6.20 shows how the number of consistency messages for the three schemes varies with the number of processors and the block size when WF=.01. For small blocks, all three schemes have an almost equal number of consistency messages. As the block size increases, however, due to the increase in false sharing, OCS incurs more consistency messages than the other two schemes. For a given block size, OCS incurs more consistency messages for a larger number of processors because of the increased sharing of elements across code segments. A larger number of processors also increases the number of consistency messages for OCRP and OCDEP because more processors write the same number of elements.

Figure 6.20. Consistency messages vs. Number of processors (multi-code-segment-array; WF=.01).

Figure 6.21 shows an expanded portion of the graph in Figure 6.20, where all three schemes have the same number of consistency messages. The consistency messages for OCRP are due to the remap operations (combine and split) between loops S2 and S3. When p processors are used and w of them write elements in loop S2, the number of consistency messages for OCRP is 2 × (w + p), where 2w and 2p messages are used for the combine and the split operations, respectively. The consistency messages for OCDEP include the messages needed to update elements that are read in loop S3. Since consecutive elements of the array A are assigned to each processor, each of the w writing processors will send elements to at most two other processors. Also, elements that are not written in loop S2 and for which there are no local copies need to be fetched from the server processor; let us assume that s processors need such data transfers. The number of consistency messages for OCDEP is therefore 2 × (2w + s). For a small number of processors, the number of elements per processor is large, the probability of a processor writing an element in loop S2 is high, and w ≈ p. Hence, the number of consistency messages for OCDEP is higher than that for OCRP. For a large number of processors, the number of elements per processor is small, the probability of a processor writing an element in loop S2 is small, and w << p. Hence, the number of consistency messages for OCDEP is lower than that for OCRP.

Figure 6.21. Consistency messages vs. Number of processors (multi-code-segment-array; WF=.01).

Table 6.1. Consistency messages (multi-code-segment-array; OCRP, OCDEP).

No. of processors      WF=.01            WF=.256           WF=.6
                    OCRP   OCDEP      OCRP   OCDEP      OCRP   OCDEP
        16            64      72        40      36        36      32
        32           128     150        72      68        68      64
        64           256     274       136     132       132     128
       128           460     426       264     260       260     256
       256           716     684       520     516       516     512

As shown in Table 6.1, the actual difference in the number of consistency messages for OCRP and OCDEP is quite small for the entire range of processors. OCRP requires that objects be combined and split in a server processor. A single server will prove to be a bottleneck, and hence it is necessary to use multiple servers. The traffic in the case of OCDEP, on the other hand, is dependent on the nature of the data exchange pattern in the application.
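The two message-count expressions above can be written down directly; the sketch below encodes them as C functions. As in the text, the parameters w (processors that write in loop S2) and s (processors that must fetch unwritten elements from the server) are inputs; the example values are hypothetical and are chosen only so that the output matches the p = 64, WF = .01 row of Table 6.1.

```c
#include <stdio.h>

/* Consistency messages for OCRP between loops S2 and S3:
 * 2w for the combine operation plus 2p for the split operation. */
static long ocrp_consistency_msgs(long w, long p)
{
    return 2 * (w + p);
}

/* Consistency messages for OCDEP, following the expression in the text:
 * each of the w writers sends to at most two other processors, and s
 * processors fetch unwritten, uncached elements from the server. */
static long ocdep_consistency_msgs(long w, long s)
{
    return 2 * (2 * w + s);
}

int main(void)
{
    /* hypothetical inputs: p = 64 processors, all of which write (w = 64),
     * and s = 9 processors needing server fetches */
    printf("OCRP:  %ld\n", ocrp_consistency_msgs(64, 64));   /* 256 */
    printf("OCDEP: %ld\n", ocdep_consistency_msgs(64, 9));   /* 274 */
    return 0;
}
```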
Figures 6.22 and 6.23 show how the number of consistency messages varies with the block size and the number of processors for WF=.256 and WF=.6, respectively. Since fewer elements are written for higher values of WF, OCS incurs lower consistency overhead due to false sharing. Hence, the number of consistency messages for OCS is lower than that for the other two schemes for all but one block size when WF=.256, and for all block sizes when WF=.6. Some of the consistency messages for OCDEP and OCRP, however, are used to update data which is transferred using data transfer messages in the case of OCS. Again, as shown in Table 6.1, the actual difference in the number of consistency messages for OCRP and OCDEP is quite small for the entire range of processors. OCDEP incurs fewer consistency messages than OCRP for all values of the number of processors when WF=.256 and WF=.6, because w << p in all these cases.

Figure 6.22. Consistency messages vs. Number of processors (multi-code-segment-array; WF=.256).

Figure 6.23. Consistency messages vs. Number of processors (multi-code-segment-array; WF=.6).

Data Transfer Messages: Since both OCRP and OCDEP apply optimizations to reduce the number of data transfer units, they have almost the same number of data transfer messages (Table 6.2), and further, this number is independent of the block size. The data transfer messages for both schemes transfer data from the server to the other processors before loop S1. In addition, for OCRP, they transfer elements that are modified in loop S4 to the server (combine operation). We, however, consider messages which transfer elements that are modified in loop S2 to the server as consistency messages in the case of OCRP. In the case of OCDEP, data transfer messages transfer elements which are last modified in either loop S2 or S4 to the server before the termination code segment. For a large number of processors, the number of data transfer messages for OCDEP is hence slightly higher than that for OCRP.

Table 6.2. Data transfer messages (multi-code-segment-array; OCRP, OCDEP).

No. of processors      WF=.01            WF=.256           WF=.6
                    OCRP   OCDEP      OCRP   OCDEP      OCRP   OCDEP
        16            64      64        38      40        34      36
        32           128     128        70      72        66      68
        64           256     256       134     136       130     132
       128           434     440       262     264       258     260
       256           690     704       518     520       514     516

The data transfer messages for OCS transfer non-local blocks that contain referenced elements, and therefore, for a given number of processors, they are higher for smaller blocks. Figure 6.24 shows how the number of data transfer messages for the three schemes varies with the number of processors and the block size when WF=.01. Due to prefetching, for large blocks, the number of data transfer messages for OCS is comparable to that for OCRP and OCDEP. For small blocks, however, OCS needs a higher number of data transfer messages than the other schemes. For a given block size, the number of data transfer messages for OCS is higher for more processors because of the increased sharing of each block. The number of data transfer messages for OCRP
and OCDEP also increases for more processors because additional data transfer units need to be sent to the extra processors. Plots for WF=.256 and WF=.6 also exhibit the same behavior. Data transfer messages for OCRP and OCDEP are fewer for a higher WF because fewer elements are written and need to be updated in the server processor. The number of these messages for OCS, however, is independent of the value of WF because this value does not affect the number of blocks containing elements read in loops S1 and S3, which are transferred using these messages. OCRP and OCDEP are able to adapt the number of data transfer messages to the application's characteristics, but OCS lacks this ability. Figure 6.25 shows how the number of data transfer messages for OCDEP varies with WF. It also contains plots for this number in the case of OCS for two of the larger block sizes. We omit the plots for OCRP because they are similar to those for OCDEP.

Figure 6.24. Data transfer messages vs. Number of processors (multi-code-segment-array; WF=.01).

Figure 6.25. Data transfer messages vs. Number of processors (multi-code-segment-array; WF=.01, .256, .6).

Amount of Data Transferred: The amount of data transferred includes that for both the consistency and the data transfer messages. OCDEP transfers the minimum amount of data because it updates data elements only when needed. OCRP transfers more data than OCDEP during consistency messages because of the remap operations (Table 6.3). For example, in the case of OCRP, elements that are written in loop S2 are updated to the server after loop S2 (combine operation). The server then transfers all the elements of A to the various processors based on the reference pattern of A for loop S3 (split operation). OCDEP, on the other hand, transfers only elements that are written in loop S2 to processors that do not have a local copy and read them in loop S3.

Table 6.3. Data transferred (in elements) (multi-code-segment-array; OCRP, OCDEP).

No. of processors       WF=.01              WF=.256             WF=.6
                    OCRP    OCDEP       OCRP    OCDEP       OCRP    OCDEP
        16         15575    13622      15391    13524      15387    13522
        32         15575    14554      15391    14456      15387    14454
        64         15575    15002      15391    14904      15387    14902
       128         15575    15225      15391    15127      15387    15125
       256         15575    15331      15391    15233      15387    15231

Figure 6.26. Data transferred vs. Number of processors (multi-code-segment-array; WF=.01).

The amount of data transferred by OCS depends on the block size, which determines the amount of false sharing of each block. Figure 6.26 shows how the amount of data transferred for the three schemes varies with the number of processors and the block size when WF=.01. OCRP and OCDEP transfer almost the same amount of data relative to that transferred by OCS, and this amount is independent of the block size.
Since consistency messages for OCS greatly outnumber its data transfer messages, their contribution to the amount of data transferred is also higher. Due to the larger number of consistency messages, for larger blocks, OCS transfers more data than OCRP and OCDEP. For small blocks, all three schemes transfer comparable amounts of data. For a given number of processors, OCS transfers more data for larger blocks because of the increase in false sharing. For a given block size, OCS transfers more data for a higher number of processors because of the increased sharing of data elements. Plots for the other values of WF (.256 and .6) are similar. OCRP transfers all the elements of A before loops S1 and S3, and the elements written in loops S2 and S4 after each of these loops, respectively. The amount of data transferred by OCRP is therefore independent of the number of processors. OCDEP transfers more data for a higher number of processors because elements are then written by a larger number of processors. The amount of data transferred by OCRP and OCDEP for various numbers of processors and values of WF is shown in Table 6.3. Figure 6.27 shows the variation of the amount of data transferred for the three schemes with the number of processors and WF. In the case of OCS, we plot only the data for the minimum block size of 64 bytes, for which the amount of data it transfers is comparable to that for the other schemes. All schemes transfer less data for higher values of WF because fewer elements are written. In all cases, OCDEP transfers the least amount of data and OCS transfers the most.

Figure 6.27. Data transferred vs. Number of processors (multi-code-segment-array; WF=.01, .256, .6).

Total Messages: The total number of messages is the sum of the number of data transfer messages and the number of consistency messages. In the case of OCS, large blocks cause more consistency messages, while small blocks result in more data transfer messages. Hence, a tradeoff exists between false sharing and prefetching as the block size is increased. Since consistency messages account for a dominant fraction of the total number of messages for large blocks when WF=.01, the plots for the total number of messages in Figure 6.28 are similar to those for the consistency messages in Figure 6.20. As WF increases, fewer elements are written, and the fraction of the total number of messages contributed by data transfer messages becomes higher. The plots for the total number of messages when WF=.6, shown in Figure 6.30, are similar to the corresponding plots for the data transfer messages. When WF=.256, the plots for the total number of messages, shown in Figure 6.29, follow the corresponding plots for both the consistency and the data transfer messages. When the consistency overhead due to false sharing is low (WF=.6, when only one element is written), large blocks facilitate prefetching, and hence OCS performs as well as OCDEP and OCRP.

Figure 6.28. Total messages vs. Number of processors (multi-code-segment-array; WF=.01).
For most values of WF, however (WF ≤ .256, when more than one element is written), OCDEP and OCRP perform better than OCS. Note, however, that our performance measure does not include the runtime overhead of adaptive placement, which is higher for OCS than for OCDEP and OCRP. Hence, we expect OCRP and OCDEP to provide better runtime performance than OCS. The total numbers of messages for OCRP and OCDEP are almost the same, as shown in Table 6.4. The total number of messages for OCDEP is higher than that for OCRP for a small number of processors and a low value of WF, and it is lower in all the other cases. The total number of messages is the sum of the data transfer and the consistency messages, and its behavior can be explained based on our earlier discussion of its component messages.

Figure 6.29. Total messages vs. Number of processors (multi-code-segment-array; WF=.256).

Figure 6.30. Total messages vs. Number of processors (multi-code-segment-array; WF=.6).

Table 6.4. Total messages (multi-code-segment-array; OCRP, OCDEP).

No. of processors      WF=.01            WF=.256           WF=.6
                    OCRP   OCDEP      OCRP   OCDEP      OCRP   OCDEP
        16           128     136        78      76        70      68
        32           256     278       142     140       134     132
        64           512     530       270     268       262     260
       128           894     866       526     524       518     516
       256          1406    1388      1038    1036      1030    1028

Total Objects: The total number of objects for OCDEP and OCRP for various values of WF and the number of processors is shown in Table 6.5. Both schemes use objects to transfer data during consistency and data transfer operations. Both schemes apply the same optimizations to reduce the number of objects and data transfer units, and hence they transfer similar objects during data transfer operations. Moreover, the objects transferred by OCRP during remap operations are similar to those transferred by OCDEP during the consistency operations. Hence, the number of objects for both schemes is almost the same in all cases. In contrast, due to the presence of read-write sharing of data, for the application parallel iterative solver, OCRP creates more objects than OCDEP. Note that the behavior of this performance measure is similar to that of the total number of messages.

Table 6.5. Total objects (multi-code-segment-array; OCRP, OCDEP).

No. of processors      WF=.01            WF=.256           WF=.6
                    OCRP   OCDEP      OCRP   OCDEP      OCRP   OCDEP
        16            64      68        39      38        35      34
        32           128     139        71      70        67      66
        64           256     265       135     134       131     130
       128           447     433       263     262       259     258
       256           703     694       519     518       515     514
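The false sharing versus prefetching tradeoff noted above for OCS can be quantified with a simple counting argument. The sketch below, for a one-dimensional array partitioned contiguously among p processors, counts how many block fetches are needed (which grows as the block size shrinks) and how many blocks straddle a partition boundary and are therefore falsely shared (which is what grows in cost as the block size increases). The element size and parameters are assumptions for illustration; this is not the simulator used in the experiments.

```c
#include <stdio.h>

/* For e elements of elem_size bytes divided contiguously among p processors,
 * with a block (page or cache line) of block_size bytes, report:
 *   - blocks_fetched: total fetches if each processor fetches every block
 *     that overlaps its partition (the prefetching effect), and
 *   - falsely_shared: blocks that overlap more than one partition. */
static void block_counts(long e, int elem_size, int p, long block_size,
                         long *blocks_fetched, long *falsely_shared)
{
    long bytes = e * elem_size;
    long nblocks = (bytes + block_size - 1) / block_size;
    long part_bytes = bytes / p;            /* assume p divides e evenly */

    *blocks_fetched = 0;
    *falsely_shared = 0;
    for (long b = 0; b < nblocks; b++) {
        long first = (b * block_size) / part_bytes;
        long last  = (b * block_size + block_size - 1) / part_bytes;
        if (last > p - 1) last = p - 1;
        *blocks_fetched += last - first + 1;    /* one fetch per sharer        */
        if (last > first) (*falsely_shared)++;  /* spans a partition boundary  */
    }
}

int main(void)
{
    long fetched, shared;
    long sizes[] = {64, 256, 1024, 4096};
    for (int i = 0; i < 4; i++) {
        /* 8192 elements of 8 bytes each, 64 processors (illustrative values) */
        block_counts(8192, 8, 64, sizes[i], &fetched, &shared);
        printf("block=%4ld  fetches=%5ld  falsely shared blocks=%3ld\n",
               sizes[i], fetched, shared);
    }
    return 0;
}
```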
Figure 6.31. Physical memory requirements (% of problem size) vs. Number of processors (multi-code-segment-array).

Physical Memory Requirements: All schemes need the same amount of physical memory to allocate the application's data in the server's local memory. In addition, since we consider an infinite number of blocks and do not replace blocks, OCS and OCDEP need physical memory for each of the parallel code segments. OCRP, on the other hand, needs additional memory equal to the maximum of that needed for each of the parallel code segments. Hence, OCRP requires less physical memory than OCS, as shown in Figure 6.31, which shows how the physical memory requirements for OCS and OCRP vary with the block size and the number of processors. For a given number of processors, both schemes need more physical memory for larger blocks. The additional memory needed is due to the higher amount of fragmentation for OCRP and the larger number of unreferenced elements per block in the case of OCS. For a given block size, both schemes need more physical memory for more processors. The additional memory needed is due to the increased fragmentation of blocks for OCRP, and the higher number of processors referencing each block for OCS.

Figure 6.32. Virtual memory requirements (% of problem size) vs. Number of processors (multi-code-segment-array).

Virtual Memory Requirements: Figure 6.32 shows how the virtual memory requirements for OCS and OCRP vary with the block size and the number of processors. As expected, the needs of OCS are close to the problem size, while those of OCRP are much higher. For a given number of processors, OCRP requires more virtual memory for larger blocks because of increased fragmentation. For a given block size, it needs more virtual memory for more processors because of the fewer elements in each processor's object and the consequent increase in fragmentation.

Summary: When the consistency overhead due to false sharing is extremely low, the performance of all three schemes in terms of the total number of messages is comparable. In all other cases, OCDEP and OCRP perform better than OCS. Due to the absence of consistency operations within code segments, the performance of OCDEP and OCRP in terms of the total number of messages and objects is comparable. In terms of the amount of data transferred, OCDEP performs the best, followed by OCRP, and then OCS. OCRP requires more virtual memory than both OCS and OCDEP. For our experimental parameters, its physical memory requirements are less than those of the other two schemes.

Table 6.6. Summary of experimental results.

Application characteristics                                                                    Performance rank
False sharing of      True sharing     Consistency needed       Consistency needed             OCS   OCRP   OCDEP
blocks with OCS?      of blocks?       within code segments?    between code segments?
No                    No               No                       No                              1      1      1
No                    Yes              No                       No                              2      2      1
No                    Yes              Yes                      No                              1      1      1
Yes                   No               No                       No                              2      1      1
Yes                   Yes              No                       Yes                             3      2      1
Yes                   Yes              Yes                      Yes                             3      2      1
Yes                   Yes              Yes                      No                              2      1      1
Yes                   Yes              No                       No                              2      2      1

6.4.4 Conclusions

Our experiments highlight the manner in which various characteristics of an application can influence its performance for these data placement schemes.
Based on our experimental results, we rank the performance of the placement schemes for various combinations of these characteristics as shown in Table 6.6. Our performance rank is based on the total number of messages and the amount of data transferred. OCDEP performs the best of all the schemes, thereby verifying our claim that the compiler can best help in data placement if it is allowed to specify when data placement operations are necessary.

OCDEP performs better than OCRP because it directs rather than assists data placement, and OCRP can also be used in a similar manner. For example, in the case of the application parallel iterative solver, OCRP can specify that consistency need not be maintained within the parallel code segments. Between parallel code segments, optimizations can be applied to combine objects that need to be transferred between a given pair of send and receive processors. The performance rank for this application in the case of both OCRP and OCDEP will then be the same. Table 6.7 compares the other performance measures when compiler-directed object-level placement is provided using either OCRP or OCDEP. OCRP requires additional virtual memory and new code generation algorithms. Except when fragmentation is severe, OCRP uses less physical memory than OCDEP because it creates objects with temporal locality. The choice between these schemes for compiler-directed object-level placement therefore depends on the available physical and virtual memory, and the ease of generating code when using OCRP.

Table 6.7. Compiler-directed data placement (OCRP vs. OCDEP).

Performance measure                 OCRP                           OCDEP
Possible unused physical memory     Only in last block             In all blocks
Virtual memory needs                Problem size + objects         Problem size
Code generation                     New, more complex algorithms   Existing algorithms

In related work, Chen and Veidenbaum [121] use trace-driven simulations to compare two cache-line-level placement schemes. One is a directory-based hardware scheme (without any compiler assistance) and the other is a simple software-assisted scheme [108]. They conclude that both schemes perform the same, and that the hardware scheme suffers from false sharing while the software scheme suffers from unnecessary invalidations. They argue that sophisticated software schemes which reduce the number of invalidations will perform better and at the same time cost less than hardware schemes. In contrast, we compare schemes with varying degrees of compiler control (software assistance) over data placement operations by compiling with various object-creation schemes. Further, while they consider only block-level placement, we study placement in terms of blocks as well as objects.

6.5 Summary

In this chapter, we experimentally compared various placement schemes which differ in the degree to which the compiler controls data placement operations. We established that the placement scheme which entrusts the compiler with the complete responsibility for data placement (compiler-directed object-level placement) performs the best of all the placement schemes. Further, by transferring data in terms of objects rather than blocks, this scheme can potentially perform better than software-assisted, block-level schemes. We conclude that compiler-directed object-level placement shows promise as the approach to automate data placement and make it easier to use shared-virtual-memory NUMA multiprocessors.
CHAPTER 7

CONCLUSIONS

In this thesis, we attempted to develop new and better techniques to place the data of a parallel application in the physical memory of a multiprocessor with a non-uniform memory access (NUMA) time. We assume that the parallel application uses as many processes as the available number of processors, and that each process is statically assigned to a given processor. Further, processes communicate and synchronize by means of a shared virtual memory (SVM). In this chapter, we summarize the major contributions of our thesis, and provide directions for future research.

7.1 Major Contributions

Existing techniques place data in terms of basic data blocks such as cache lines or pages. Using analysis and trace-driven simulations, we studied the relationship among page-level replication, hardware parameters, the layout of data in the SVM, and page reference patterns. We demonstrated the need for page-level replication to be dependent on hardware parameters and the reference pattern of pages. We also showed that proper data layout reduces or, in some cases, even eliminates false sharing of pages. Moreover, it simplifies the reference pattern and the placement of pages. These results apply to other block-level placement strategies as well. In the absence of help from the compiler or the applications programmer, however, these strategies incur runtime overhead when being adaptive, and further, are unable to optimally place blocks which are falsely-shared and those with rapidly-changing reference patterns.

Motivated by the conclusions of our study, we developed a new approach for compiler-assisted data placement. Our approach helps data placement by creating compile-time objects that contain data of the same variable type and similar reference patterns. We outlined the design of a compiler which uses this approach. Our compiler allows applications to be written either in a sequential or a parallel language. (In the former case, the compiler parallelizes applications using techniques in existing parallelizing compilers [71, 72, 73, 74, 75].) Our compiler assumes static control-partitioning of the iterations of each parallel code segment. It uses partitioning heuristics for well-known loop constructs such as iterative parallel loops [80], but requires the applications programmer to specify the partitioning for the other cases. We also developed new algorithms to determine the data reference pattern and the type of each variable, which are required by our object-creation schemes. These schemes are different for the various types of data placement.

We first developed a new object-creation method that assists block-level placement by creating objects that are not falsely-shared and have temporal locality. We demonstrated how these objects can be used to eliminate false sharing and reduce the runtime overhead of adaptive placement through hints. We also developed new code generation algorithms that are required when applying our scheme. Further, we proposed solutions which can be used when our method fragments virtual and physical memory. We derived the time and space overhead involved in compiling applications using our method. We identified the factors that affect the performance offered by our scheme. We concluded that the additional runtime overhead incurred by our method is small when compared to the improvement in performance it achieves by addressing the limitations of block-level placement schemes. Moreover, the performance improvement is limited because block-level placement schemes invoke placement operations unnecessarily, without allowing the compiler to specify when they are required.
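As a simple illustration of the kind of layout such an object-creation scheme produces, the sketch below pads each processor's partition of a shared array so that no page holds data belonging to two processors. The page size, function name, and alignment approach are assumptions for illustration only and are not the dissertation's object-creation algorithm.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096   /* assumed block (page) size in bytes */

/* Lay out per-processor partitions of a shared array so that each partition
 * starts on a page boundary.  Because no page then holds elements written by
 * two different processors, page-level consistency traffic caused purely by
 * false sharing disappears; the cost is internal fragmentation in the last
 * page of each partition. */
static double **layout_partitions(int p, long elems_per_proc)
{
    long bytes  = elems_per_proc * (long)sizeof(double);
    long padded = ((bytes + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;

    double **part = malloc(p * sizeof *part);
    if (part == NULL)
        return NULL;
    for (int q = 0; q < p; q++) {
        /* aligned_alloc (C11) requires the size to be a multiple of the alignment */
        part[q] = aligned_alloc(PAGE_SIZE, padded);
    }
    return part;
}

int main(void)
{
    double **parts = layout_partitions(16, 1000);  /* e.g., 16 processors, 1000 doubles each */
    return parts == NULL;
}
```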
We then considered object-level placement schemes, which require the compiler or the applications programmer to specify data placement operations. In order to direct object-level placement, we developed a new object-creation scheme which creates objects based on the data exchange pattern among the processes of the parallel application. As in the case of block-level placement, we derived the time and space overhead involved in compiling applications using this method. We discussed why compiler-directed object-level placement performs better than block-level placement. By modeling the execution of an application when object-level placement is provided, we identified the performance optimizations possible when transferring objects.

Finally, we conducted a detailed performance study using experimental simulations to quantitatively compare various placement schemes, each of which allows a different degree of compiler control over data placement. We chose a workload that highlights the cases when a higher degree of compiler control improves performance and when it does not. Our results clearly demonstrate that block-level placement with compiler assistance performs significantly better than that without compiler assistance. In one application, for instance, the number of consistency messages in the absence of compiler assistance is several orders of magnitude higher than that in the presence of such help. The main reason for this performance improvement is the ability of our object-creation scheme that assists block-level placement to eliminate the consistency overhead due to false sharing. Our study also showed that compiler-directed object-level placement performs as well as or better than compiler-assisted block-level placement. For example, in the case of another application in our workload, the number of consistency messages for the former is an order of magnitude less than that for the latter. This performance improvement is achieved because our object-creation method (which directs object-level placement) avoids unnecessary data placement operations. By identifying the application characteristics of our workload, we concluded that these results hold for applications with various combinations of these characteristics.

7.2 Future Directions

The results of our thesis can be extended as described below.

Our object-creation schemes use the reference pattern of variables in each code segment. In Chapter 3, we developed algorithms to determine the reference pattern when the parallelism in a code segment is specified using the DOALL and the DOACROSS programming constructs, assuming that the subscripts are linear. Algorithms need to be developed for other parallel constructs and for non-linear subscripts. The basic concept behind our existing algorithms can be applied to these cases as well, provided the work done by each processor is known at compile time. More efficient algorithms than those proposed here will help further reduce the compilation overhead in applying our schemes.

Our approach to compiler-assisted data placement assumes static control-partitioning of the work in each code segment. Our compiler uses heuristics to control-partition well-known parallel constructs.
Since it requires the applications programmer to control-partition the other cases, it is important to make this task easy. This objective can be achieved by incorporating our object-creation algorithms in a programming environment. Such an environment can display the objects created for each code segment, provide an estimate of the fragmentation of virtual and physical memory for various block sizes, and provide an estimate of the performance. Further, partitioning heuristics for various loop constructs which are developed using such an environment can be incorporated in our compiler.

When compiling applications using our object-creation scheme for block-level placement, new algorithms are needed to address the issues of fragmentation and code generation for static array references. In Chapter 4, we developed code generation algorithms for the case of linear subscripts. Algorithms need to be developed for the other cases, and the solutions we proposed for fragmentation need to be implemented.

Our object-creation scheme for object-level placement, outlined in Chapter 5, assumes that at most one processor writes a data element within each code segment. For those applications in which this condition does not hold, it is necessary to develop algorithms that use the data dependence graph to establish an order on the writes within each code segment.

In our experimental comparison of different placement schemes, we considered placement at a single level of the memory hierarchy with an infinite number of physical blocks. This assumption is typical of studies on memory performance, e.g., [51]. Our object-creation schemes, however, are designed to assist placement at all levels of the memory hierarchy. Further, due to price/performance reasons, we expect future multiprocessors to have several levels of memory hierarchy. It would be interesting to extend our experimental study to the case of data placement at all levels of the memory hierarchy, assuming that each level has only a finite number of physical blocks. We expect the placement scheme that provides the compiler with the highest control over placement operations to perform the best even under these conditions.

Our object-creation schemes for both the block-level and the object-level types of placement need to be incorporated in a compiler, the design of which was outlined in Chapter 3. The runtime primitives required by our schemes also need to be implemented. Such an implementation will allow estimation of the additional time required for compilation when applying our object-creation schemes. It will also allow measurements of execution time for various applications when using the different placement schemes. We expect the results of these measurements to concur with the conclusions of our experimental study.

Compiler-assisted data placement has also been studied by researchers for message-passing NUMA multiprocessors, also referred to as distributed-memory message-passing multiprocessors [96, 97, 98, 99, 74, 83, 100, 81, 82]. In these studies, data is partitioned among the local memories of the various processors, and then the work to be done by each processor is decided. In contrast, our approach is to control-partition the work and to create objects which are then carefully placed to minimize the time spent in memory references. These approaches differ in the process-interaction paradigm, but attempt to use compiler help to achieve the same objective of reducing the performance degradation due to non-local data accesses.
It would be interesting to develop a methodology to compare these two approaches and determine which is appropriate under what conditions.

In summary, in this thesis we have identified the factors that limit the performance of existing block-level data placement strategies. We have developed a methodology for compiler-assisted block-level placement which improves performance by addressing these limitations. We have concluded that the performance can be improved even further if the compiler is allowed to direct rather than assist data placement, as provided by compiler-directed object-level placement. We have developed a method to assist object-level placement, and demonstrated that this method indeed provides the best performance. Due to price/performance reasons, we expect future multiprocessors to belong to the NUMA class. Results such as ours on compiler-directed data placement are therefore important steps towards improving the performance of applications and the programmability of NUMA multiprocessors.

APPENDICES

APPENDIX A

Glossary

Array reference: A reference to an array variable in a program statement.

Block: Unit of transfer in a given level of the memory hierarchy, e.g., cache line or page.

Block-level migration: A data placement scheme that migrates the virtual block containing a data element.

Block-level placement: A data placement scheme that allocates physical memory for the virtual block containing a data element.

Block-level replication: A data placement scheme that replicates the virtual block containing a data element.

Code segment: A set of statements in the program of an application.

Combine operation: An operation that combines several objects into a single object as per specifications.

Consistency model: Definition of the relationship among the various physical copies of a given data element or block.

Control partitioning: Division of the work in a parallel code segment among available processors.

Data dependence: The relationship between two program statements that read or write the same variable.

Data dependence analysis: A method to extract information about the data dependencies of an application.

Data exchange pattern: Pattern of communication and transfer of data among the processors executing a single parallel application.

Data layout: Map of data to locations in the shared virtual memory.

Data partitioning: Distribution of data among the local memories of available processors.

Data placement: Allocation of a physical memory location for a data element.

Data reference: A reference to any data element in a program statement.

Data reference pattern: A temporal order of the read and the write references to a data element originating from different processors.

Falsely-shared block: A block which contains data elements with different reference patterns.

Internal fragmentation: Condition in which blocks are only partially used.

Layout of data: See data layout.

Linear subscript: A subscript of the form ai + b, where a and b are integer constants and i is a loop induction variable.

Migration: Movement of a single physical copy of the data among the local memories of the processors that reference it.

Multiprocessor: A parallel computer which consists of multiple processors and memory modules connected together by one or more interconnects.

NUMA multiprocessor: A multiprocessor in which the time to access different memory locations is not the same.

Object: A collection of data elements, with similar reference patterns, treated as a unit.

Object-level placement: A data placement scheme that allocates physical memory for an object.
Ob ject-level placement: A data placement scheme that allocates physical memory for an object. 199 Page padding: A method of allocating virtual memory for an object in units of a virtual page, possibly with internal fragmentation in the last page. Parallel code segment: A code segment which is executed in parallel by several processors. Parallelizing compiler: A compiler that autOmatically extracts the parallelism available in a given application’s program and generates code that can run concur- rently on different processors. Placement: See data placement. Placement information: Information that assists data placement schemes. Reference pattern: See data reference pattern. Release consistency: A consistency model that makes the several physical copies of the same data consistent with one another only when a release operation on a synchronization variable is executed. Release operation: A hardware instruction that is used to implement the release consistency model. Remap operation: An operation that changes the layout of data in the shared virtual memory e.g., combine and split operations. Replication: Creation of multiple, co—existing physical copies of the same data. Sequential code segment: A code segment executed by a single processor. Sequential consistency: The strictest consistency model which requires the execu- tion of the parallel application to appear as some interleaving of the execution of the parallel processes on a sequential machine. For example, when sequential consistency and the write-update scheme is provided, a write to any data element is not complete unless all physical copies of the data element have been updated. Shared virtual memory (SVM): A programming model in which the processors executing a parallel application communicate and synchronize by means of data that is mapped to a region of the virtual memory shared by all processors. 200 Split operation: An operation that splits a single object into several smaller objects as per specifications. Sub-page: A contiguous portion within a virtual page. Write-invalidate (WI) A scheme that maintains consistency of multiple physical copies of the same data on a write operation by invalidating all copies except the one in the local memory of the processor issuing the write request. Write-update (WU) A scheme that maintains consistency of multiple physical copies of the same data by updating each of them on a write or a release operation. BIBLIOGRAPHY BIBLIOGRAPHY [1] BBN Advanced Computers Inc., Cambridge, MA, Inside the TC2000 Computer, February 1990. [2] R. Perron and C. Mundie, “The Architecture of the Alliant F X/ 8 Computer,” in Spring COMPCON ’86, pp. 390—393, March 1986. [3] T. Lovett and S. Thakkar, “The Symmetry Multiprocessor System,” in Proc. Intl. Conf. of Parallel Processing, pp. 303-310, August 1988. [4] G. Pfister et al., “The IBM Research Parallel Processor Prototype,” in Proc. Intl. Conf. on Parallel Processing, pp. 764-772, August 1985. [5] Intel Corporation, A Touchstone DELTA System Description, 1991. [6] Kendall Square Research, KSRI, March 1992. [7] D. Lenoski et al., “The directory-based cache coherence protocol for the DASH multiprocessor,” in Proc. 17th Intl. Symp. on Computer Architecture, pp. 148- 159, May 1990. [8] D. Gustavson, “The Scalable Coherent Interface and Related Standards Projects,” IEEE Micro, pp. 10—22, February 1992. [9] J. Kuehn and B. Smith, “The Horizon Supercomputing System: Architecture and Software,” in Supercomputing 1988, pp. 28—34, November 1988. [10] C. 
Seitz et al., "The Architecture and Programming of the Ametek Series 2010 Multicomputer," in Proc. 3rd Conf. on Hypercube Concurrent Computers and Applications, Volume I, pp. 33-36, ACM, January 1988.

[11] A. Garcia, D. Foster, and R. Freitas, "The Advanced Computing Environment Multiprocessor Workstation," Tech. Rep. RC 14491, IBM T.J. Watson Research Center, March 1989.

[12] J. Hayes, Computer Architecture and Organization. New York: McGraw-Hill, 1988.

[13] J. Peterson and A. Silberschatz, Operating System Concepts. Reading, MA: Addison-Wesley, 1985.

[14] W. Delaney, D. Cohn, K. Tracey, and M. Casey, "Data Units: A Process Interaction Paradigm," Tech. Rep. 91-3, Personal Computer Laboratory, Dept. of Computer Science and Eng., University of Notre Dame, March 1991.

[15] A. Cox and R. Fowler, "The Implementation of a Coherent Memory Abstraction on a NUMA Multiprocessor: Experiences with PLATINUM," in Proc. 12th ACM Symp. on Operating System Principles, pp. 32-44, December 1989.

[16] W. Bolosky, R. Fitzgerald, and M. Scott, "Simple But Effective Techniques for NUMA Memory Management," in Proc. 12th ACM Symp. on Operating System Principles, pp. 19-31, December 1989.

[17] R. Bisiani and M. Ravishankar, "PLUS: A Distributed Shared-Memory System," in Proc. 17th Intl. Symp. on Computer Architecture, pp. 115-124, May 1990.

[18] R. P. LaRowe Jr. and C. Ellis, "Experimental Comparison of Memory Management Policies for NUMA Multiprocessors," ACM Transactions on Computer Systems, vol. 9, pp. 319-363, November 1991.

[19] K. Li and R. Schaefer, "A Hypercube Shared Virtual Memory System," in Proc. Intl. Conf. on Parallel Processing, pp. 125-131, August 1989.

[20] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Transactions on Computer Systems, pp. 321-359, November 1989.

[21] U. Ramachandran, M. Ahamad, and M. Khalil, "Coherence of distributed shared memory: Unifying synchronization and transfer of data," in Proc. Intl. Conf. on Parallel Processing, vol. II, pp. 160-169, August 1989.

[22] B. Fleisch and G. Popek, "Mirage: A Coherent Distributed Shared Memory Design," in Proc. 12th ACM Symp. on Operating System Principles, pp. 211-223, December 1989.

[23] R. Bryant, P. Carini, H. Chang, and B. Rosenburg, "Supporting Structured Shared Virtual Memory under Mach," in Proc. USENIX Mach Symposium, November 1991.

[24] J. Carter, J. Bennett, and W. Zwaenepoel, "Implementation and performance of Munin," in Proc. 13th ACM Symp. on Operating System Principles, pp. 152-164, October 1991.

[25] A. W. Wilson Jr., "Hierarchical cache/bus architecture for shared-memory multiprocessors," in Proc. 14th Intl. Symp. on Computer Architecture, June 1987.

[26] J. Goodman and P. Woest, "The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor," in Proc. 15th Intl. Symp. on Computer Architecture, pp. 422-431, May 1988.

[27] D. Cheriton, H. Goosen, and P. Boyle, "Multi-level shared caching techniques for scalability in VMP-MC," in Proc. 16th Intl. Symp. on Computer Architecture, pp. 16-24, June 1989.

[28] K. Li, "IVY: A Shared Virtual Memory System for Parallel Computing," in Proc. Intl. Conf. on Parallel Processing, pp. 94-101, August 1988.

[29] J. Bennett, J. Carter, and W. Zwaenepoel, "Adaptive Software Cache Management for Distributed Shared Memory Architectures," in Proc. 17th Intl. Symp. on Computer Architecture, pp. 125-134, May 1990.

[30] L.
Lamport, "How To Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, pp. 690-691, September 1979.

[31] K. Gharachorloo et al., "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," in Proc. 17th Intl. Symp. on Computer Architecture, pp. 15-26, May 1990.

[32] S. Adve and M. Hill, "Weak Ordering - A New Definition," in Proc. 17th Intl. Symp. on Computer Architecture, pp. 2-14, May 1990.

[33] M. Dubois and C. Scheurich, "Memory access dependencies in shared-memory multiprocessors," IEEE Transactions on Software Engineering, vol. 16, pp. 660-673, June 1990.

[34] P. Teller, R. Kenner, and M. Snir, "TLB Consistency on Highly Parallel Shared Memory Multiprocessors," in Proc. 21st Annual Hawaii Intl. Conf. on System Sciences, pp. 184-192, IEEE Computer Society, 1988.

[35] D. Black et al., "Translation Lookaside Buffer Consistency: A Software Approach," in Proc. 3rd Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 113-122, April 1989.

[36] B. Rosenburg, "Low-Synchronization Translation Lookaside Buffer Consistency in Large-Scale Shared-Memory Multiprocessors," in Proc. 12th ACM Symp. on Operating System Principles, pp. 137-146, December 1989.

[37] M. Holliday, "Page Table Management in Local/Remote Architectures," in Proc. 1988 Intl. Conf. on Supercomputing, pp. 1-8, July 1988.

[38] M. Holliday, "Reference History, Page Size, and Migration Daemons in Local/Remote Architectures," in Proc. 3rd Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 104-112, 1989.

[39] D. Black, A. Gupta, and W. Weber, "Competitive Management of Distributed Shared Memory," in Proc. Spring COMPCON, 1989.

[40] C. Scheurich and M. Dubois, "Dynamic Page Migration in Multiprocessors with Distributed Global Memory," IEEE Transactions on Computers, pp. 1154-1163, August 1989.

[41] J. Ramanathan and L. Ni, "Critical Factors in NUMA Memory Management," Tech. Rep. MSU-CPS-ACS-34, Michigan State University, December 1990.

[42] J. Ramanathan and L. Ni, "Critical Factors in NUMA Memory Management," in Proc. 11th Intl. Conf. on Distributed Computing Systems, pp. 500-507, May 1991.

[43] W. Bolosky et al., "NUMA Policies and their Relation to Memory Architecture," in Proc. 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 212-221, April 1991.

[44] M. Stumm and S. Zhou, "Algorithms implementing distributed shared memory," IEEE Computer, pp. 54-64, May 1990.

[45] B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms," IEEE Computer, pp. 52-60, August 1991.

[46] J. Bennett, J. Carter, and W. Zwaenepoel, "Munin: Distributed shared memory based on type-specific memory coherence," in Proc. 2nd Symp. on Principles and Practice of Parallel Programming, pp. 168-175, March 1990.

[47] J. Torrellas, M. Lam, and J. Hennessy, "Shared data placement optimizations to reduce multiprocessor cache miss rates," in Proc. Intl. Conf. on Parallel Processing, vol. II, pp. 266-270, August 1990.

[48] D. Cheriton, A. Gupta, P. Boyle, and H. Goosen, "The VMP Multiprocessor: Initial Experience, Refinements and Performance Evaluation," in Proc. 15th Intl. Symp. on Computer Architecture, pp. 410-421, May 1988.

[49] S. Eggers and R. Katz, "The Effect of Sharing on the Cache and Bus Performance of Parallel Programs," in Proc. 3rd Intl. Conf.
[50] M. Dubois et al., "Delayed Consistency and Its Effects on the Miss Rate of Parallel Programs," in Supercomputing 1991, pp. 197-206, November 1991.
[51] P. Keleher, A. Cox, and W. Zwaenepoel, "Lazy Release Consistency for Software Distributed Shared Memory," in Proc. 19th Intl. Symp. on Computer Architecture, pp. 13-21, May 1992.
[52] S. Eggers and T. Jeremiassen, "Eliminating False Sharing," in Proc. Intl. Conf. on Parallel Processing, vol. I, pp. 377-381, August 1991.
[53] R.P. LaRowe Jr., C. Ellis, and L. Kaplan, "The Robustness of NUMA Memory Management," in Proc. 13th ACM Symp. on Operating System Principles, pp. 137-151, October 1991.
[54] J. Ramanathan and L. Ni, "Compiler-Directed Object Creation for Shared Virtual Memory Systems," Tech. Rep. MSU-CPS-ACS-48, Michigan State University, October 1991.
[55] J. Ramanathan and L. Ni, "Exploiting Data Exchange Patterns in Creating Objects for NUMA Shared Virtual Memory Systems," Tech. Rep. MSU-CPS-ACS-50, Michigan State University, January 1992.
[56] J. Ramanathan and L. Ni, "Exploiting Data Exchange Patterns in Creating Objects for NUMA Shared Virtual Memory Systems," in Proc. Intl. Conf. on Parallel Processing, August 1992 (to appear).
[57] R. Rashid et al., "Machine Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures," IEEE Transactions on Computers, pp. 896-908, August 1988.
[58] A. Forin et al., "Design, Implementation and Performance Evaluation of a Distributed Shared Memory Server for Mach," Tech. Rep. CMU-CS-88-165, Carnegie Mellon University, August 1988.
[59] R.P. LaRowe Jr. and C. Ellis, "Experimental Comparison of Memory Management Policies for NUMA Multiprocessors," Tech. Rep. CS-1990-10, Duke University, April 1990.
[60] BBN Laboratories, Cambridge, MA, Overview of THE BUTTERFLY GP1000, November 1988.
[61] R. Bisiani, A. Nowatzyk, and M. Ravishankar, "Coherent Shared Memory on a Message Passing Machine," Tech. Rep. CMU-CS-88-204, Carnegie Mellon University, December 1988.
[62] H. Mizrahi et al., "Extending the Memory Hierarchy into Multiprocessor Interconnection Networks: A Performance Analysis," in Proc. Intl. Conf. on Parallel Processing, pp. 141-150, August 1989.
[63] P. Clancey and J. Francioni, "Distribution of Pages in a Distributed Virtual Memory," in Proc. Intl. Conf. on Parallel Processing, vol. II, pp. 258-265, 1990.
[64] S. Leffler et al., The Design and Implementation of the 4.3BSD UNIX Operating System. New York: Addison-Wesley, 1988.
[65] C. Stunkel and W. Fuchs, "TRAPEDS: Producing Traces for Multicomputers Via Execution Driven Simulation," Performance Evaluation Review, vol. 17, pp. 70-78, ACM, 1989.
[66] S. Eggers et al., "Techniques for efficient inline tracing on a shared-memory multiprocessor," in Proc. Intl. Conf. on Measurement and Modeling of Computer Systems, pp. 37-46, May 1990.
[67] H.-B. Lim, "Characterization of Parallel Program Behavior on a Shared-Memory Multiprocessor," M.S. thesis, CSRD Technical Report No. 1138, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, August 1991.
[68] D. Reed, L. Adams, and M. Patrick, "Stencils and Problem Partitionings: Their Influence on the Performance of Multiple Processor Systems," IEEE Transactions on Computers, vol. C-36, pp. 845-858, July 1987.
[69] R.P. LaRowe Jr., J. Wilkes, and C. Ellis, "Exploiting Operating System Support for Dynamic Page Placement on a NUMA Shared Memory Multiprocessor," in Proc. 3rd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pp. 122-132, March 1991.
[70] IBM Corporation, Austin, Texas, IBM RISC System/6000 Technology, 1990.
[71] F. Allen et al., "An Overview of the PTRAN Analysis System for Multiprocessing," Journal of Parallel and Distributed Computing, vol. 5, pp. 617-640, 1988.
[72] H. Zima, H. Bast, and H. Gerndt, "SUPERB: A tool for semi-automatic MIMD/SIMD parallelization," Parallel Computing, vol. 6, pp. 1-18, 1988.
[73] M. Wolfe, Optimizing Supercompilers for Supercomputers. Cambridge, Massachusetts: The MIT Press, 1989.
[74] H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers. New York: Addison-Wesley, 1990.
[75] M. Wolfe and M. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 452-471, October 1991.
[76] Z. Fang, P. Yew, P. Tang, and C. Zhu, "Dynamic Processor Self-Scheduling for General Parallel Nested Loops," in Proc. Intl. Conf. on Parallel Processing, pp. 1-10, 1987.
[77] C. Polychronopoulos and D. Kuck, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Computers," IEEE Transactions on Computers, pp. 1425-1439, December 1987.
[78] T. Tzen and L. Ni, "Dynamic Loop Scheduling for Shared-Memory Multiprocessors," in Proc. Intl. Conf. on Parallel Processing, vol. 2, pp. 247-250, 1991.
[79] S. Hummel, E. Schonberg, and L. Flynn, "Factoring: A Practical and Robust Method for Scheduling Parallel Loops," in Supercomputing 1991, pp. 610-619, November 1991.
[80] S. Abraham and D. Hudak, "Compile-Time Partitioning of Iterative Parallel Loops to Reduce Cache Coherency Traffic," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 318-328, July 1991.
[81] J. Ramanujam and P. Sadayappan, "Compile-Time Techniques for Data Distribution on Distributed Memory Machines," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 472-482, October 1991.
[82] M. Gupta and P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers," IEEE Transactions on Parallel and Distributed Systems, April 1992.
[83] G. Fox et al., "Fortran D Language Specification," Tech. Rep. Rice COMP TR90-141, Rice University, December 1990.
[84] D. Vrsalovic et al., "The Influence of Parallel Decomposition Strategies on the Performance of Multiprocessor Systems," in Proc. 12th Intl. Symp. on Computer Architecture, pp. 396-405, 1985.
[85] P. Havlak and K. Kennedy, "An Implementation of Interprocedural Bounded Regular Section Analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 350-360, July 1991.
[86] T. Jeremiassen and S. Eggers, "Computing Per-Process Summary Side-Effect Information," Tech. Rep. 91-11-03, Dept. of Comp. Sci. & Eng., University of Washington, 1991.
[87] P. Chang et al., "IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors," in Proc. 18th Intl. Symp. on Computer Architecture, pp. 266-275, 1991.
[88] D. Hudak and S. Abraham, "Beyond Loop Partitioning: Data Assignment and Overlap to Reduce Communication Overhead," in Proc. 1991 Intl. Conf. on Supercomputing, pp. 172-182, June 1991.
[89] D. Hudak and S. Abraham, "Compile-Time Optimization of Near-Neighbor Communication for Scalable Shared-Memory Multiprocessors," Journal of Parallel and Distributed Computing, August 1992 (to appear).
[90] S. Bokhari, "On the Mapping Problem," IEEE Transactions on Computers, pp. 207-214, March 1981.
[91] M. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: Freeman, 1979.
[92] P. V. Laarhoven and E. Aarts, Simulated Annealing: Theory and Applications. Eindhoven, The Netherlands: Philips Research Lab, 1986.
[93] J. Ramanathan and L. Ni, "The Mapping Problem: Revisited," Tech. Rep. MSU-CPS-ACS-10, Michigan State University, December 1988.
[94] J. Ramanathan and L. Ni, "The Mapping Problem - In the context of new communication paradigms in multicomputers," in Proc. Fourth Conf. on Hypercube Concurrent Computers and Applications, March 1989.
[95] L. Brochard and A. Freau, "Designing Algorithms on Hierarchical Memory Multiprocessors," in Proc. 1990 Intl. Conf. on Supercomputing, pp. 414-427, June 1990.
[96] D. Callahan and K. Kennedy, "Compiling programs for distributed-memory multiprocessors," Journal of Supercomputing, vol. 2, pp. 151-169, October 1988.
[97] A. Rogers and K. Pingali, "Process decomposition through locality of reference," in Proc. ACM SIGPLAN '89 Conf. on Programming Language Design and Implementation, pp. 69-80, June 1989.
[98] R. Ruhl and M. Annaratone, "Parallelization of FORTRAN code on Distributed-memory Parallel Processors," in Proc. 1990 Intl. Conf. on Supercomputing, pp. 342-353, June 1990.
[99] C. Koelbel, P. Mehrotra, and J. Rosendale, "Supporting shared data structures on distributed memory architectures," in Proc. 2nd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pp. 177-186, March 1990.
[100] V. Balasundaram et al., "A Static Performance Estimator to Guide Data Partitioning Decisions," in Proc. 3rd ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pp. 213-223, March 1991.
[101] V. Balasundaram, "Translating Control Parallelism to Data Parallelism," in Proc. Fifth SIAM Conference on Parallel Processing for Scientific Computing, March 1991.
[102] W. Abu-Sufah, D. Kuck, and D. Lawrie, "On the Performance Enhancement of Paging Systems through Program Analysis and Program Transformation," IEEE Transactions on Computers, pp. 341-356, May 1981.
[103] D. Gannon, W. Jalby, and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformations," Journal of Parallel and Distributed Computing, vol. 5, pp. 587-616, October 1988.
[104] K. Gallivan, W. Jalby, and D. Gannon, "On the Problem of Optimizing Data Transfers for Complex Memory Systems," in Proc. 1988 Intl. Conf. on Supercomputing, pp. 238-253, July 1988.
[105] M. Wolfe, "More Iteration Space Tiling," in Supercomputing 1989, pp. 655-664, November 1989.
[106] M. Wolfe and M. Lam, "A Data Locality Optimizing Algorithm," in Proc. 3rd ACM SIGPLAN Conf. on Programming Language Design and Implementation, pp. 30-44, June 1991.
[107] E. Gornish, E. Granston, and A. Veidenbaum, "Compiler-Directed Data Prefetching in Multiprocessors with Memory Hierarchies," in Proc. 1990 Intl. Conf. on Supercomputing, pp. 354-368, June 1990.
[108] A. Veidenbaum, "A Compiler-Assisted Cache Coherence Solution for Multiprocessors," in Proc. Intl. Conf. on Parallel Processing, pp. 1029-1036, August 1986.
[109] H. Cheong and A. Veidenbaum, "A Cache Coherence Scheme with Fast Selective Invalidation," in Proc. 15th Intl. Symp. on Computer Architecture, pp. 299-306, June 1988.
[110] H. Cheong and A. Veidenbaum, "Stale Data Detection and Coherence Enforcement using Flow Analysis," in Proc. Intl. Conf. on Parallel Processing, vol. I, pp. 138-145, August 1988.
[111] H. Cheong and A. Veidenbaum, "A Version Control Approach to Cache Coherence," in Proc. 1989 Intl. Conf. on Supercomputing, pp. 322-330, June 1989.
[112] R. Cytron, S. Karlovsky, and K. McAuliffe, "Automatic Management of Programmable Caches," in Proc. Intl. Conf. on Parallel Processing, vol. 2, pp. 229-238, August 1988.
[113] M. Lam, E. Rothberg, and M. Wolf, "The cache performance and optimizations of blocked algorithms," in Proc. 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 63-74, April 1991.
[114] M. Wolfe, "Multiprocessor Synchronization for Concurrent Loops," IEEE Software, pp. 34-42, January 1988.
[115] MIPS Computer Systems, Inc., Sunnyvale, California, MIPS R4000 Microprocessor User's Manual, 1991.
[116] A. Aho and J. Ullman, Principles of Compiler Design. New York: Addison-Wesley, 1985.
[117] R. Cytron, "Doacross: Beyond Vectorization for Multiprocessors," in Proc. Intl. Conf. on Parallel Processing, pp. 836-844, August 1986.
[118] S. Midkiff and D. Padua, "Compiler-Generated Synchronization for Do Loops," in Proc. Intl. Conf. on Parallel Processing, pp. 544-551, August 1986.
[119] B. Schunk, "Image Flow Segmentation and Estimation by Constraint Line Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11(10), pp. 1010-1027, October 1989.
[120] D. Rover, V. Tsai, Y. Chow, and J. Gustafson, "Signal-Processing Algorithms on Parallel Architectures: A Performance Update," Journal of Parallel and Distributed Computing, vol. 13, pp. 237-245, 1991.
[121] Y. Chen and A. Veidenbaum, "Comparison and Analysis of Software and Directory Coherence Schemes," in Supercomputing 1991, pp. 818-829, November 1991.