ADAPTIVELY INCREASING PERFORMANCE AND SCALABILITY OF AUTOMATICALLY PARALLELIZED PROGRAMS

By

H. D. K. MOONESINGHE

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science and Engineering

2002

ABSTRACT

ADAPTIVELY INCREASING PERFORMANCE AND SCALABILITY OF AUTOMATICALLY PARALLELIZED PROGRAMS

By

H. D. K. MOONESINGHE

Compilers use various transformation techniques to obtain higher performance from programs. The optimal implementation, however, depends not only on the information available at compile time but also on the information available at run-time, when complete knowledge of the execution environment exists. In this thesis, we present adaptive execution techniques that increase the performance and scalability of automatically parallelized loops by executing them in parallel or serially, using the information available at both compile time and run-time. We have implemented several adaptation and performance estimation algorithms in our compiler preprocessor. The preprocessor inserts code that determines, at compile time or at run-time, how the loops are executed. Using a set of standard numerical applications written in Fortran77 and running them with our techniques on a 32-processor distributed shared memory machine (SGI Origin2000), our techniques perform, on average, 30%, 25%, 20%, and 14% faster than the original parallel programs on 32, 16, 8, and 4 processors, respectively. Some programs run more than twice as fast as the corresponding original parallel programs. Our approach also increases the scalability of parallel applications.

To My Parents

ACKNOWLEDGEMENTS

This thesis would not have happened without the help of many people, whom I wish to thank here. I express deep gratitude to my advisor, Dr. Jaejin Lee, for allowing me to pursue this research and helping me along every step of the way. Without his generous support and constant guidance, this thesis would simply not have been possible. I appreciate the confidence and encouragement he extended towards my work.

I sincerely thank Dr. Lionel M. Ni, Dr. Sandeep Kulkarni, and all the other professors of the Department of Computer Science & Engineering at Michigan State University for providing me with a sound foundation in all aspects of computer science, which helped me immensely in achieving this goal.

I would also like to thank all the staff members of CSE for their assistance. Although I cannot name them all here, I appreciate the help they provided to me in many ways throughout graduate school. Special thanks to Debbie and Linda for providing information and advice about administrative tasks.

A special word of thanks goes to my friend Niroshena for his support and encouragement throughout this period.
I would also like to thank all my other friends for the support given to me during this period.

These acknowledgements would not be complete without an expression of gratitude to my parents and teachers throughout the years. They have always been a source of support and encouragement.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction
1.1 Background
1.2 Related Work
1.3 Organization of the Thesis

2 Execution Model and Our Framework
2.1 Parallel Execution Model
2.2 An Overview of the Framework

3 Adaptive Execution Algorithms
3.1 Compile-time Cost Estimation
3.2 Run-time Cost Estimation
3.3 Adaptive Execution Algorithms
3.3.1 First Two Invocations with Timing (F2T)
3.3.2 Most Recent with Timing (MRT)
3.3.3 Static cost estimation and Most Recent with Timing (SMRT)
3.3.4 MRT with Run-time cost estimation (MRTR)
3.3.5 MRT with Static and Run-time cost estimation (SMRTR)

4 Evaluation Environment
4.1 Compiler
4.2 Applications
4.3 Target Architecture

5 Evaluation
5.1 Characteristics of the Parallel Loops
5.2 Results

6 Conclusions

APPENDIX

BIBLIOGRAPHY

LIST OF TABLES

4.1 Auto-parallelizing directives used for the SGI MIPSpro Fortran77 compiler
4.2 Applications used for the evaluation
4.3 Machine specification
5.1 Characteristics of parallel loops in the applications
5.2 Characteristics of parallel loops whose amount of work can be estimated by the compile-time cost estimation model
5.3 Speedup of Base and SMRTR strategies for each application
Execution time of the applications under each strategy
Execution time of the applications in APO mode

LIST OF FIGURES

2.1 Parallel execution model
2.2 Adaptation framework
3.1 The adaptation scheme
3.2 Determining the threshold value with compile-time cost estimation
3.3 Determining the threshold value with run-time cost estimation
5.1 Normalized execution time of Applu
5.2 Normalized execution time of Cg
5.3 Normalized execution time of Hydro2d
5.4 Normalized execution time of Lu
5.5 Normalized execution time of Mg
5.6 Normalized execution time of Mgrid
5.7 Normalized execution time of Sp
5.8 Normalized execution time of Su2cor
5.9 Normalized execution time of Swim
5.10 Average normalized execution time of all the applications
5.11 Speedup of Base and SMRTR for each application
5.12 SMRTR execution time normalized to APO

Chapter 1

Introduction

1.1 Background

High performance optimizing and parallelizing compilers perform various optimizations to improve the performance of sequential and parallel programs [2].
A problem that most such compilers currently face is the lack of information about machine parameters and input data at compile time. The performance of the algorithm that best solves a problem depends largely on the combination of the input and the hardware platform used to execute it. This information is difficult or impossible to obtain at compile time. Thus, it is difficult, and sometimes impossible, to statically fine-tune an application to a particular execution platform. To address this issue, recent work [1, 8, 17, 20, 21] has begun to explore the feasibility of run-time fine-tuning and optimization, when complete knowledge about the input data set and the hardware platform is available.

Automatic parallelizing compilers analyze and transform a sequential program into a parallel program without any intervention by the user. However, in order to achieve maximum performance from automatically parallelized programs, the user must consider the following with regard to the underlying multiprocessor architecture: the amount of parallelism contained in the program, the cache locality of the program, the hardware cache coherence mechanism, workload distribution among processors, data distribution, false sharing, coordination overhead incurred between processors, synchronization overhead, and so on. Since these factors interact synergistically in the performance of parallel loops, and their effects differ from one machine to another, identifying the performance bottlenecks of a parallel program is a tedious and difficult job for the programmer. Thus, the cost that a programmer pays to obtain reasonable performance from an automatically parallelized program adds extra difficulty to developing and maintaining parallel programs.

In our work, instead of identifying the performance bottlenecks of an automatically parallelized program manually, we avoid executing a parallel loop in parallel if its performance degradation exceeds a predefined threshold value during parallel execution of the program. No knowledge of the program is required from the programmer.

In this thesis, we present adaptive execution techniques that increase the performance and scalability of automatically parallelized programs. The adaptation and performance estimation algorithms are implemented in a compiler preprocessor. The preprocessor inserts code that automatically determines, at compile time or at run-time, how the parallelized loops are executed. Using a set of standard numerical applications written in Fortran77 and running them with our techniques on a 32-processor distributed shared memory machine (SGI Origin2000), we obtain performance that is, on average, 30%, 25%, 20%, and 14% faster than the original parallel programs on 32, 16, 8, and 4 processors, respectively. Some programs even run more than twice as fast as the original parallel version with our scheme.

1.2 Related Work

Recent work has begun to explore the possibilities of optimizations performed at run-time, when complete knowledge of the execution environment exists. Many different types of adaptive optimization techniques have been proposed recently in the literature. Multiple versioning of a loop for run-time optimization was first proposed by Byler et al. [4], and modern compilers still use this technique. Diniz et al. [8] used multiple versioning with dynamic feedback to automatically choose the best synchronization optimization policy for object-based parallel programs. The main problem with multiple versioning is significant code growth, since each version is a replication of a single code section with a particular optimization. In our work, we generate at most two versions of a parallel loop. Moreover, versioning of the parallel loop can be avoided altogether by using conditional parallelization directives, as the sketch below illustrates.
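To make this concrete, the following is a minimal sketch of conditional parallelization with the OpenMP IF clause; the program name, the threshold, and the array are illustrative inventions, not output of our preprocessor.

      PROGRAM CONDPAR
C     A single copy of the loop serves both cases: when the IF clause
C     is false, the OpenMP runtime executes the loop sequentially, so
C     no second (serial) version of the code section is generated.
C     N, THRESH, and A are illustrative names.
      INTEGER I, N, THRESH
      PARAMETER (N = 100000, THRESH = 2000)
      DOUBLE PRECISION A(N)
      DO 5 I = 1, N
         A(I) = 1.0D0
    5 CONTINUE
C$OMP PARALLEL DO IF (N .GT. THRESH)
      DO 10 I = 1, N
         A(I) = A(I) * 2.0D0
   10 CONTINUE
C$OMP END PARALLEL DO
      WRITE (*,*) 'A(1) =', A(1)
      END

In a real preprocessor, the condition would typically compare an estimated amount of work against a threshold rather than the raw trip count.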
Some approaches [10, 19, 7, 18] are based on parameterizing the code at compile time so that it can be restructured at run-time. Gupta and Bodik [10] dealt with the complexity of loop transformations, such as loop fusion, loop fission, loop interchange, and loop reversal, performed at run-time. Saavedra et al. [19] proposed an adaptive prefetching algorithm that can change the prefetching distance of individual prefetch instructions. Their adaptive algorithm uses simple performance data collected from hardware monitors at run-time.

Another adaptive optimization technique, based on the replication of objects in object-oriented programs, was proposed by Rinard et al. [17]. To avoid the synchronization overhead incurred in updating a shared object, they replicate the object adaptively at run-time. The replication policy adapts to the dynamic characteristics of each program execution. Holzle et al. [11] proposed a dynamic type feedback technique for improving the performance of object-oriented programs. These approaches are similar to our work in that the program dynamically adapts to its environment during execution using run-time information. However, we neither restructure the code at run-time nor deal with object-oriented programs. We focus on shared memory parallel programs with loop-level parallelism, and we use different adaptation strategies in order to improve performance and scalability.

A generic compiler-support framework called ADAPT was proposed by Voss and Eigenmann [20, 21] for adaptive program optimization. Users can specify types of optimizations and heuristics for applying the optimizations at run-time using the ADAPT language. The ADAPT compiler generates a complete run-time system by reading these heuristics and applying them to the target application.

Lee [14] proposed a technique for serializing small parallel loops using static performance prediction and some heuristics. A sophisticated static performance estimation model based on stack distances [16] was proposed by Cascaval et al. [5]. Even though we use a fairly simple static performance estimation model in our work, our run-time cost estimation models compensate for the inaccuracy of the static model. Our work is also related to adaptive compilers for heterogeneous processing-in-memory systems [15], where the heterogeneity of the system is exploited adaptively.

1.3 Organization of the Thesis

The rest of the thesis is organized as follows: Chapter 2 describes the parallel execution model and an overview of our framework; Chapter 3 presents the adaptive execution algorithms, including the compile-time and run-time cost estimation techniques; Chapter 4 describes the environment used to test our algorithms: the compiler, the benchmarks, and the execution platform; Chapter 5 presents the results obtained by evaluating the algorithms on the target architecture; Chapter 6 concludes the thesis.

Chapter 2

Execution Model and Our Framework

2.1 Parallel Execution Model

We use the OpenMP parallel programming model as our model of parallel execution. It is a master-slave thread model [3, 6, 12]. When a thread encounters a parallel region, it creates a team of threads with an execution context for each thread, and it becomes the master of the team. A parallel region is a block of code, in our case a parallel loop, that is executed by multiple threads in parallel. When a parallel loop is executed, the master thread creates slave threads and divides the iterations of the loop among the slaves, as shown in Figure 2.1. Each thread performs a chunk of the iterations concurrently. There is an implicit barrier at the end of each parallel region. If one thread finishes its portion of the iterations, it has to wait until the other threads finish executing their portions. After all the threads have finished their portions of the iterations, the barrier condition is satisfied and all the threads are released from the barrier. At this point in the execution of the program, the slave threads disappear and the master thread resumes execution of the code after the parallel loop [6, 3, 12].

[Figure 2.1: Parallel execution model. The master thread executes the serial portion and encounters a PARALLEL DO directive; slave threads are created, and the master and slaves divide the iterations and execute them concurrently; an implicit barrier follows, after which the master thread resumes execution.]

Creating slave threads for a parallel loop, distributing the iterations of the loop among the threads, the cache affinity of each thread executing different iterations, and the synchronization between threads at the end of the loop incur an overhead for running the loop in parallel. We call this the parallel loop overhead. A minimal example of this execution model is sketched below.
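The following small program is a minimal sketch of the model (all names are illustrative): the master thread reaches the PARALLEL DO, the team shares the iterations, and the implicit barrier at the end of the loop releases the threads only when every chunk is finished.

      PROGRAM MSLAVE
C     The master executes the serial portion; a team of threads shares
C     the N iterations of the PARALLEL DO; the implicit barrier at
C     END PARALLEL DO holds the team until every thread has finished
C     its chunk of iterations.  Names are illustrative.
      INTEGER I, N
      PARAMETER (N = 1000000)
      DOUBLE PRECISION X(N), Y(N)
      DO 5 I = 1, N
         X(I) = DBLE(I)
    5 CONTINUE
C$OMP PARALLEL DO
      DO 10 I = 1, N
         Y(I) = 2.0D0 * X(I)
   10 CONTINUE
C$OMP END PARALLEL DO
C     Only the master thread continues here, past the implicit barrier.
      WRITE (*,*) 'Y(N) =', Y(N)
      END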
2.2 An Overview of the Framework

Our framework is shown in Figure 2.2. An automatically parallelized program or a manually parallelized program is fed into the compiler preprocessor. The preprocessor inserts performance estimation and adaptive execution code into the parallel program, and it tailors this code using target-machine-specific parameters that are also fed into it. The output program from the preprocessor is compiled by a compiler that generates code for the target multiprocessor.

[Figure 2.2: Adaptation framework. A parallel program and machine-specific parameters enter the preprocessor, which performs cost estimation and adaptive code generation; the resulting adaptive parallel program is fed to the compiler for the target multiprocessor.]

We are particularly interested in adaptive optimization strategies that select code at run-time from a set of statically generated code variants, based on the execution environment. We generate two versions of the same parallel loop: a sequential version and a parallel version. Since we are interested in running a parallel loop sequentially or in parallel, we focus on parallel programs that contain a large amount of loop-level parallelism. A variety of selection algorithms are compared and evaluated with these highly parallel programs: compile-time cost estimation, run-time cost estimation, selection based on execution time, selection based on performance counters (such as the number of graduated instructions), and combinations of those strategies.

The key observation in this work is that most of the parallel loops are invoked many times, so adaptive execution of the parallel loops is feasible and effective. The preprocessor inserts instrumentation code into the parallel loop in order to measure, at run-time, the execution time and the number of instructions graduated in a single invocation; a sketch of such instrumentation follows.
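As an illustration, an instrumented invocation could look roughly like the program below. OMP_GET_WTIME is the OpenMP wall-clock timer routine; reading the graduated-instruction count requires a platform-specific hardware-counter interface (on the Origin2000, the R10000 event counters), so that part is only indicated in comments. All other names are illustrative.

      PROGRAM DECRUN
C     Sketch of the instrumentation a preprocessor might insert around
C     one invocation of a parallel loop.  Reading the hardware counter
C     is platform-specific and is only indicated in comments.
      INTEGER I, N
      PARAMETER (N = 100000)
      DOUBLE PRECISION A(N), T1, T2
      DOUBLE PRECISION OMP_GET_WTIME
      EXTERNAL OMP_GET_WTIME
      DO 5 I = 1, N
         A(I) = DBLE(I)
    5 CONTINUE
      T1 = OMP_GET_WTIME()
C     ... read hardware counter C1 here (platform-specific) ...
C$OMP PARALLEL DO
      DO 10 I = 1, N
         A(I) = A(I) * 0.5D0
   10 CONTINUE
C$OMP END PARALLEL DO
C     ... read hardware counter C2; graduated instructions = C2 - C1 ...
      T2 = OMP_GET_WTIME()
      WRITE (*,*) 'invocation time (s):', T2 - T1
      END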
Invocations that measure the execution time, the number of graduated instructions, or both are called decision runs. Based on the measurements taken in the decision runs of the parallel program, the adaptation code determines how the loop is executed, i.e., sequentially or in parallel, in the next or remaining invocations.

A drawback of our adaptive execution techniques with decision runs is that we have to run a parallel loop at least once sequentially and at least once in parallel in order to obtain useful information for adapting the loop to the run-time environment. To compensate for this penalty, we add a compile-time cost estimation model to the adaptation model. We also introduce the notion of an adaptation window for the adaptive execution techniques.

Chapter 3

Adaptive Execution Algorithms

We have implemented compile-time and run-time algorithms that adaptively execute automatically parallelized programs. The same techniques can be applied to parallel programs written by hand, as long as they use standard parallelization directives such as the ones used by automatic parallelizing compilers.

Our adaptation scheme has three parts: a compile-time cost estimation model, a run-time cost estimation model, and adaptation strategies that use the two models. First, the compile-time cost estimation model (Figure 3.1) filters out parallel loops that contain less work than the parallel loop overhead (i.e., small loops). Second, the run-time cost estimation model filters out parallel loops that are highly efficient due to the large amount of work in the loop. An efficient parallel loop is a parallel loop whose speedup is greater than 1; otherwise, it is an inefficient parallel loop. The model counts the number of instructions executed in an invocation of the loop at run-time. It also identifies small loops that cannot be handled by the compile-time cost estimation model because of run-time parameters in the loop. Finally, several adaptive execution strategies (including execution-time-based strategies) determine how the remaining loops are executed.

[Figure 3.1: The adaptation scheme. Along the axis of increasing amount of work: loops that are not efficient due to a small amount of work (filtered by the compile-time and run-time cost estimation) are executed sequentially; loops inside the adaptation window are executed adaptively; loops that are highly efficient due to a large amount of work (filtered by the run-time cost estimation) are always executed in parallel.]

The per-loop decision logic that results from these three parts is sketched below.
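The sketch below shows one way the three stages might compose for a single loop; the function and variable names are our own illustrative inventions, and the concrete strategies used inside the adaptation window (F2T through SMRTR) are the subject of Section 3.3.

      LOGICAL FUNCTION DOPAR (WORK, GINSNS, CTHRES, TLOW, THIGH)
C     Sketch of the three-stage decision for one parallel loop.
C     WORK   - compile-time estimated amount of work, W(n_i)
C     GINSNS - graduated instructions measured in a decision run
C     CTHRES - compile-time threshold; TLOW, THIGH - run-time thresholds
C     Returns .TRUE. if the loop should run in parallel.
      DOUBLE PRECISION WORK, GINSNS, CTHRES, TLOW, THIGH
C     Stage 1: the compile-time model filters out small loops.
      IF (WORK .LT. CTHRES) THEN
         DOPAR = .FALSE.
         RETURN
      END IF
C     Stage 2: the run-time model filters both ends of the spectrum.
      IF (GINSNS .LT. TLOW) THEN
         DOPAR = .FALSE.
         RETURN
      END IF
      IF (GINSNS .GT. THIGH) THEN
         DOPAR = .TRUE.
         RETURN
      END IF
C     Stage 3: inside the adaptation window, an execution-time-based
C     strategy (Section 3.3) decides; parallel is a placeholder here.
      DOPAR = .TRUE.
      END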
3.1 Compile-time Cost Estimation

The compile-time cost estimation model identifies parallel loops that are not efficient because of an insufficient amount of work in the loop. It is not beneficial to run this type of loop in parallel, because the amount of work in the loop is small compared to the parallel loop overhead. We define a threshold value for the amount of work contained in the loop. If the amount of work is smaller than the threshold value, we run the loop sequentially.

We use a fairly simple cost estimation model. The amount of work ($W$) in a loop can be estimated as a function of the number of iterations of the loop ($n_i$), the number of assignments ($n_a$), the numbers of floating point additions ($n_{f_{ADD}}$), multiplications ($n_{f_{MUL}}$), subtractions ($n_{f_{SUB}}$), and divisions ($n_{f_{DIV}}$), the numbers of intrinsic function and system procedure calls ($n_{f_i}$ and $n_{f_s}$), and the number of user-defined function and procedure calls ($n_{f_u}$). Consequently, the estimated amount of work ($W_i$) in an iteration of the loop is given by the following formula:

$$W_i = n_a c_a + n_{f_{ADD}} c_{f_{ADD}} + n_{f_{MUL}} c_{f_{MUL}} + n_{f_{SUB}} c_{f_{SUB}} + n_{f_{DIV}} c_{f_{DIV}} + n_{f_i} c_{f_i} + n_{f_s} c_{f_s} + n_{f_u} c_{f_u}$$

where $c_{op}$ is the cost of performing a single operation of type $op$ on the target machine. Thus, the total amount of work in an invocation of the loop is

$$W(n_i) = n_i \cdot W_i$$

Since, in general, we cannot determine the actual number of iterations of a loop at compile time, this formula is parameterized by $n_i$. When we estimate the amount of work in a loop that has branches, we give equal weight to each branch.

We determine the threshold value heuristically. First, we run several representative micro-benchmark programs that contain many different types of parallel loops. After measuring the sequential execution time and the parallel execution time of each loop contained in the programs on $p$ processors, we plot the speedup of each loop on the Y-axis and the estimated cost (the estimated amount of work) on the X-axis. Then we draw a line from the point that has the lowest estimated cost ($p_a$ in Figure 3.2) to the point that has the lowest estimated cost among those whose speedup is greater than 0.8 ($p_b$). We choose the cost of the point where this line intersects the horizontal line of speedup 1.0 ($p_t$) as our threshold value.

[Figure 3.2: Determining the threshold value with compile-time cost estimation. Speedup (Y-axis) is plotted against the compile-time estimated cost (X-axis); the threshold value is the cost at which the fitted line crosses speedup 1.0.]

When the loop is multiply nested, it is hard to determine the number of iterations of an inner loop before running the outer loop, because the upper bound, lower bound, and step of the inner loop may change within the outer loop. In other words, it is hard to obtain a cost function parameterized by the iteration counts of both the outermost loop and the inner loops. In this case, we simply pass the loop to the run-time cost estimation model.
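As a concrete illustration of the model (the per-operation costs here are assumed purely for illustration, not taken from our cost table): consider a loop of $n_i = 500$ iterations whose body contains $n_a = 2$ assignments, $n_{f_{ADD}} = 3$ additions, and $n_{f_{MUL}} = 1$ multiplication, with costs $c_a = 1$, $c_{f_{ADD}} = 2$, and $c_{f_{MUL}} = 3$. Then

$$W_i = 2 \cdot 1 + 3 \cdot 2 + 1 \cdot 3 = 11, \qquad W(500) = 500 \cdot 11 = 5500.$$

Against a threshold value of the order shown in Figure 3.2 (say, 1500), this loop would be kept parallel; the same loop with $n_i = 100$ would give $W(100) = 1100 < 1500$ and would be serialized.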
We run several representative benchmark programs that contain many different types of parallel loops. We mea- sure sequential execution time, parallel execution time, and the number of graduated instructions in the parallel execution of each loop in the benchmark programs on p processors. Then, we plot the speedup of each loop on the Y-axis and the number Of graduated instructions (run-time estimated cost) on the X-axis. For the higher threshold value, we draw a line from the point that has the lowest number of instructions executed (pa in Figure 3.3) to the point that has the lowest number of instructions executed among those whose speedup is greater than the average speedup of all the loops plotted (pb in Figure 3.3). We choose as our higher threshold value (TH) the cost of the point (pH) whose number of instructions executed is in the middle of the intersecting point with the speedup 1.0 (pc) and the intersecting point with the average speedup (pd). For the lower threshold value, the method is the same as the compile-time cost estimation model. Draw a line from the point with the lowest cost (pa) to the point with the lowest cost among those whose speedup is greater than 0.8 (pg). We choose the cost of the intersecting point (m) with the horizontal line of speedup 1.0 as our lower threshold value (TL in Figure 3.3). The region in between the lower threshold value and the higher threshold value is called the adaptation window. 12 4‘! .‘Afl‘fl I ‘ s Adaptation Window 9 ‘ ‘ : I I 0 : l I o 8 - i i A=B E I I 7 t F 3 Pb 6 : A A g g]: B L] / Average Speedup uE —————— -lt———-'—-—-T ———————————————————————— b g 5 3 I I /