This is to certify that the dissertation entitled Advanced Loop Parallelization: Dependence Uniformization and Trapezoid Self-Scheduling, presented by TEN-HWAN TZEN, has been accepted towards fulfillment of the requirements for the PhD degree in Computer Science.

Major professor

Date: July 30, 1992

ADVANCED LOOP PARALLELIZATION: DEPENDENCE UNIFORMIZATION AND TRAPEZOID SELF-SCHEDULING

By

Ten H. Tzen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1992

ABSTRACT

ADVANCED LOOP PARALLELIZATION: DEPENDENCE UNIFORMIZATION AND TRAPEZOID SELF-SCHEDULING

By Ten H. Tzen

Due to the increasing demand for computational power in the areas of science and engineering, multiprocessor systems provide a promising approach to high speed computing. It is indisputable that loops are the major source of parallelism in programs. This thesis studies loop parallelization, which improves the execution of iterative loops on multiprocessors.

Any nested loop can be parallelized as long as all dependence constraints among iterations are preserved by applying appropriate synchronizations. However, the performance is significantly affected by the run-time overhead, which is mainly induced by synchronization and data transfer between processors. For nested loops with a uniform dependence pattern, loop netting provides a partitioning method which does not induce data transfer and synchronization between processors. This method is particularly suitable for multiprocessors with a local cache. In the case of irregular and complex dependence constraints, it is very difficult to efficiently and systematically arrange synchronization primitives. The proposed data dependence uniformization overcomes the difficulties in parallelizing a nested loop with irregular dependence constraints. The approach is based on the concept of vector decomposition. A simple set of basic dependences is developed of which all dependence constraints can be composed. The set of basic dependences is then added to every iteration to replace all original dependences, so that the dependence constraints become uniform.

To dynamically allocate iterations of a DOall loop to processors, load balancing among processors may be achieved at the expense of run-time scheduling overhead. By linearly decreasing the chunk size at run time, the best tradeoff between the scheduling overhead and a balanced workload can be obtained in the proposed trapezoid self-scheduling approach. Due to its simplicity and flexibility, this approach can be efficiently implemented in any parallel compiler. The experiments conducted on a 96-node Butterfly GP-1000 clearly show the advantage of trapezoid self-scheduling over other well-known self-scheduling approaches.

To my mother and to the memory of my father

ACKNOWLEDGEMENTS

I wish to thank the members of my Ph.D. guidance committee for their support during my Ph.D. study. I would particularly like to express my appreciation to Dr. Jesse Fang of Hewlett Packard Research Lab. for his helpful guidance in the area of parallelizing compilers, and to Dr.
Richard Enbody for his careful reading of this dissertation and his many useful comments. I want to thank my good friend Amaresh Joshi, who has painstakingly read through drafts of this thesis and helped me to make this presentation much more readable. I would like to acknowledge my friend Honda Shing for providing expertise when mine was lacking. I especially would like to thank Chia-Ying for helping me through some pressing moments. Finally, I would like to thank Dr. Lionel M. Ni for having been an excellent advisor. Without his guidance, enthusiasm, and patience, I would not have been able to complete this thesis work. I am very grateful for his support and encouragement.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 Compilation Support for Loop Parallelization
1.2 Parallel Constructs
1.2.1 DOall Loop
1.2.2 DOacross Loop
1.3 Motivation and Research Objectives
1.4 Thesis Organization
2 CROSS-ITERATION DEPENDENCE
2.1 Basic Concepts
2.2 Solving Dependence Equations
2.3 Iteration Space
2.4 Synchronization Among Iterations
3 TRANSFORMATION OF UNIFORM DEPENDENCE LOOPS
3.1 Sequentially Iterated DOall Loop
3.2 Loop Netting
3.3 Wave-Front Method
4 DATA DEPENDENCE UNIFORMIZATION
4.1 Program Model
4.2 Analyzing Cross-Iteration Dependence
4.2.1 Solution Space and Dependence Convex Hull
4.2.2 S-Test: A Dependence Slope Test
4.3 Dependence Decomposition
4.4 Index Synchronization for Uniform Dependence Loops
4.5 Performance Evaluation
4.5.1 Theoretical Speedup
4.5.2 Experimental Results
5 SCHEDULING OF DOALL LOOPS
5.1 Review of Loop Scheduling Schemes
5.2 A Model for Loop Scheduling
6 TRAPEZOID SELF-SCHEDULING (TSS)
6.1 Generic Approach: TSS(f, l)
6.2 A Conservative Approach: TSS(N/(2P), 1)
6.3 Mutual Exclusion Implementation
6.4 Memory Management
6.5 Experimental Results
6.5.1 Workload and Performance Metrics
6.5.2 Run-Time Behavior
6.5.3 Performance Measurement
7 CONCLUSION AND FUTURE RESEARCH
7.1 Summary
7.2 Future Plans
A FINDING THE DCH IN AN N-DIMENSIONAL SPACE
A.1 Basic Principle
A.2 Data Structure
A.3 Algorithm
BIBLIOGRAPHY

LIST OF TABLES

6.1 The number of chores generated by various self-scheduling schemes

LIST OF FIGURES

1.1 A Parallelizing Compiler Model
2.1 The dependence graph of Example 2.1
2.2 The dependence graph of Example 2.2
2.3 A multiple-level nested loop model
2.4 The iteration space dependence graph of Example 2.5
2.5 The dependence graph of Example 2.6
3.1 The iteration space of Example 3.1(a)
3.2 Loop netting transformation
3.3 The iteration space dependence graph of Example 3.5
4.1 The iteration space dependence graph of Example 4.1
4.2 A doubly nested loop program model
4.3 The DCH of Example 4.1
4.4 The Algorithm of Finding the DCH in a 2-Dimensional Solution Space
4.5 The characteristic of DSF
4.6 A Monotonic DSF
4.7 The solution space of Example 4.2
4.8 The solution space of Example 4.1
4.9 The algorithm of S-test
4.10 The uniformized dependence pattern of Example 4.1
4.11 The Index Synchronization Method
4.12 Experiment 4.1: granularity is 120 microseconds
4.13 Experiment 4.2: granularity is 560 microseconds
4.14 Experiment 4.3: granularity is 1600 microseconds
5.1 State diagram of a processor in loop scheduling
5.2 Chunk size function of a uniformly distributed loop for SS, CSS(k), and GSS schemes
6.1 Four typical uniformly and nonuniformly distributed loops
6.2 Chunk size function of a uniformly distributed loop for the TSS scheme
6.3 T(1) of the TSS(N/(2P), 1) under the loop in Fig. 3(d)
6.4 Static memory management
6.5 Event log profile of a uniformly distributed loop using the TSS
6.6 Event log profile of a uniformly distributed loop using the CSS
6.7 Event log profile of a uniformly distributed loop using the GSS
6.8 Event log profile of a decreasing workload using the TSS
6.9 Event log profile of a decreasing workload using the CSS
6.10 Event log profile of a decreasing workload using the GSS
6.11 Event log profile of a uniformly distributed loop using the TSS, where processors start at different times
6.12 Event log profile of a uniformly distributed loop using the CSS, where processors start at different times
6.13 Event log profile of a uniformly distributed loop using the GSS, where processors start at different times
6.14 Experiment 6.1: I=100000, L(i)=110 microseconds
6.15 Experiment 6.2: I=10000, L(i)=2 milliseconds
6.16 Experiment 6.3: I=3000, L(i)=[50 to 77000] microseconds
6.17 Experiment 6.4: I=3000, L(i)=[77000 to 50] microseconds
6.18 Experiment 6.5: I=10000, L(i)=[36(50%) or 1000(50%)] microseconds
6.19 Experiment 6.6: I=100000, L(i)=112 microseconds, processors start at different times
6.20 Experiment 6.7: P=80, I=100000, L(i)=112 microseconds
A.1 A CH3 in a 3-dimension space
A.2 A CH3 cut by an HP2 in a 3-dimension space
A.3 The data structure of the CH3 in Figure A.1

CHAPTER 1

INTRODUCTION

Due to the increasing demand for computational power in the areas of science and engineering, multiprocessor systems provide a promising approach to high speed computing.
A multiprocessor is a single computer system containing multiple processors which are capable of communicating and cooperating at different levels in order to solve a given problem [1]. More specifically, through some regular interconnection network, processors exchange information to share the workload of a given program so that the computation can be performed in a parallel manner.

However, programming and compiling for multiprocessor systems is a far more complex task than traditional sequential programming. To efficiently and correctly utilize a multiprocessor, one has to consider such issues as qualification and quantification of parallelism, scheduling of tasks, synchronization between processors, as well as memory management for spatial and temporal locality. Due to the complexity of parallel programming, parallelizing compilers, which automatically transform a serial program into a parallel form, have become more and more popular.

It is indisputable that loops are the major source of parallelism in programs [2]. Current parallelizing compilers pay much of their attention to loop parallelization. Thus, the iterations of a loop can be spread across processors by having different processors execute different iterations simultaneously. This is also referred to as microtasking [3]. A review of compilation support for loop parallelization will be given in Section 1.1, where the procedure of parallelizing a loop will be detailed. Loops are usually transformed to proper parallel constructs. As will be defined in Section 1.2, two major types of parallel constructs, DOall loops and DOacross loops, are widely applied to explicitly express parallelism. However, the performance is restricted by dependence constraints that are caused by the underlying data flow between iterations. This thesis addresses several new mechanisms to improve the execution of iterative loops on multiprocessors. Variable locality, run-time overhead, workload balancing, and even complicated and irregular dependence constraints are considered in this research. A more detailed description of the research objectives will be given in Section 1.3. Finally, an overview of this thesis is given in Section 1.4.

1.1 Compilation Support for Loop Parallelization

A parallelizing compiler explores parallelism and generates parallel code to take advantage of the underlying parallel architecture. Currently, most supercomputer vendors supply parallelizing compilers for their machines [4] [5] [6]. A number of commercial and experimental restructuring compilers are also in use, which include the University of Illinois's Parafrase [7] [8] [2], Rice University's PFC [9] [10] [11], KAI's KAP [12] [13], and Pacific Sierra's VAST [14]. They are mostly constructed by inserting a parallelizing phase into, or by adding a parallelizing preprocessor before, an existing sequential compiler.

Usually, parallelizing loops is implemented in the optimization phase of a sequential compiler. As shown in Fig. 1.1, there are three steps in the loop parallelization phase: dependence analysis, loop transformation, and task scheduling. The goal of dependence analysis is providing the necessary dependence information of a nested loop, which can be used to restructure the loop from a serial to a parallel form in the next step, loop transformation.

Figure 1.1. A Parallelizing Compiler Model
Analyzing data dependence among iterations involves solving linear Diophantine equations and the bounds of real functions. It is very time-consuming to exactly analyze dependence relations among iterations. Parallelizing compilers usually apply only in-exact tests that provide overly conservative dependence information. Thus, some parallelism is not exploited. A more detailed study of dependence analysis will be given in Chapter 2.

Generally speaking, loop transformation (also called loop restructuring) is applied to serve the following purposes:

1. manipulate nested loops in order to explore potential parallelism,
2. transfer nested loops to proper parallel code, and then
3. simplify the target parallel code to reduce run-time overhead.

Many well-known transformation techniques, such as loop interchange, loop distribution, loop fusion, loop fission, strip mining, loop partition, cycle shrinking, loop skewing, and so on, have been proposed and widely applied [15] [16] [17] [18] [19] [20]. In order to be valid, any transformation of a given nested loop must honor the dependence constraints in that nested loop. Since different transformations require different dependence information, a universal method of dependence analysis is an important issue in loop parallelization. Also, given a nested loop and a parallel architecture, there are many ways to transform it. Choosing an optimal order of transformations is still an open problem.

In practice, depending on the dependence relation among iterations, loops and nested loops are manipulated and then transferred to different types of parallel constructs for concurrent execution. A parallel construct is a segment of code in which parallelism is explicitly represented. Parallel languages, which provide various types of parallel constructs, have been proposed in recent years [21] [22] [9]. However, none has been widely accepted. In general, there are two major types of parallel constructs that are provided in all parallel languages: DOall loops and DOacross loops. A formal definition and detailed description of these constructs will be given in the next section.

After parallelism has been explored, the next problem for parallelizing compilers is how parallelism can be efficiently exploited. This is the so-called loop scheduling problem for multiprocessors [23]. The goal of loop scheduling is to distribute independent iterations to idle processors in the most equitable manner and with the least amount of overhead. This is not an easy task because the actual execution time of each iteration may differ due to network contention, mutually exclusive access, and complicated control flow. In addition, the amount of run-time overhead is machine and even application dependent [24] [25] [26] [27].

1.2 Parallel Constructs

Generally, an iteration, which denotes a series of statements in the loop body, is a unit of work assigned to a processor. Therefore, the dependence constraints inside an iteration can be ignored in parallelizing a nested loop. The dependence constraint among different iterations, called cross-iteration dependence, is our major concern. As indicated in the previous section, after dependence analysis and loop transformation, loops or nested loops are usually restructured to DOall loops or DOacross loops. A loop or a level of a nested loop can be transformed to a DOall loop (also called a parallel loop) [21] [22] if there is no cross-iteration dependence among iterations.
On the other hand, if some dependences occur among iterations, synchronization between the processors executing different iterations must be applied to preserve those cross-iteration dependences. This technique is referred to as a DOacross loop.

1.2.1 DOall Loop

Since all iterations are independent of each other, iterations of a DOall loop can be executed in any order, or even simultaneously, without any extra synchronization. Consider the loop in Example 1.1(a).

Example 1.1:
do I = 1, N
do J = 1, N
S1: A(I, J) = B(I, J) + C(I+3, J+3)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo
(a) Source code

do I = 1, N
doall J = 1, N
S1: A(I, J) = B(I, J) + C(I+3, J+3)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo
(b) DOall loop transformation

The DOall loop transformation fails at the I-loop because statement S2 in iteration i overwrites array element C(i, j), which statement S1 still requires in iteration i-3; a dependence of distance 3 is therefore carried by the I-loop. Fortunately, DOall transformation can be applied at the J-loop because, given a fixed I, the iterations of the J-loop are independent of each other. Thus, the second-level loop can be transformed to a DOall loop. The parallel code is listed in Example 1.1(b). Ideally, the peak speedup is equal to the degree of parallelism, N. The degree of parallelism here represents the maximal number of iterations that can be executed concurrently. Because parallelizing a DOall loop involves assigning iterations to available processors and barrier synchronization at the end of the loop, in practice the performance is significantly affected by scheduling overhead, barrier synchronization overhead, and workload imbalance [23] [28]. Also, if a cache or local memory is employed in a multiprocessor, variable locality is another important factor in the overall performance. More discussion on DOall loop transformation and scheduling will be presented in Chapter 3 and Chapter 5, respectively.

1.2.2 DOacross Loop

A loop is termed a DOserial loop if some cross-iteration dependences exist. To obey dependence constraints, a DOserial loop can be simply executed serially by a single processor. However, any nested loop can be executed in parallel as long as all cross-iteration dependences are satisfied. As mentioned, one such technique is the DOacross loop, in which some synchronization between the processors executing different iterations must be added to obey cross-iteration dependences. The DOacross technique is suggested by [29] and completely proposed in [30] [31]. It should be emphasized that all cross-iteration dependences need to be detected so that synchronization can be applied. This may be difficult for some loops that carry irregular and complex dependence relations among iterations. In addition, the performance is determined by the degree of parallelism and particularly by the synchronization overhead, which might be significant.

Consider Example 1.2(a). DOall transformation fails at both the I-loop and the J-loop. Alternately, DOacross transformation can be applied at the I-loop by using a post&wait synchronization primitive. The parallel code is listed in Example 1.2(b).

Example 1.2:
do I = 1, N
do J = 1, N
S1: A(I, J) = B(I, J) + C(I+2, J+3)
S2: C(I, J) = A(I, J+1) + C(I, J)
enddo
enddo
(a) Source code

array syn(N)
syn(*) = 0
doacross I = 1, N
do J = 1, N
if (I-2 >= 1) wait(syn(I-2), J-3)
S1: A(I, J) = B(I, J) + C(I+2, J+3)
S2: C(I, J) = A(I, J+1) + C(I, J)
post(syn(I), J)
enddo
enddo
(b) DOacross loop transformation

In the parallel code, the inner loop is executed serially. There is only one barrier synchronization required, at the end of the outer loop. The variable syn(i) is a synchronization variable which indicates the number of inner iterations finished at iteration I = i. Iteration i can proceed as long as it is not more than 3 steps ahead of iteration i-2. Since all iterations of the outer loop I can be started at the same time, the degree of parallelism in the above parallel code is N. Thus, if the synchronization overhead is ignored, the ideal speedup is equal to N.
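As an illustration of how the post&wait primitives of Example 1.2(b) might be realized on a shared-memory machine, the following C sketch busy-waits on a shared counter array. The names syn and N follow the example, but the use of C11 atomics, the spin loop, and the array sizes are assumptions made for this sketch; real systems typically provide these primitives in hardware or in the run-time library.

#include <stdatomic.h>

#define N 100
double A[N+1][N+2], B[N+1][N+1], C[N+3][N+4];
atomic_int syn[N+1];   /* syn[i] = number of J-iterations finished at I = i */

/* Body of one DOacross iteration I (1 <= I <= N); under the DOacross
 * construct each value of I would be handed to a different processor. */
void doacross_body(int I)
{
    for (int J = 1; J <= N; J++) {
        /* wait(syn(I-2), J-3): block until iteration I-2 has finished J-3 steps */
        if (I - 2 >= 1)
            while (atomic_load(&syn[I-2]) < J - 3)
                ;                                    /* spin */
        A[I][J] = B[I][J] + C[I+2][J+3];             /* S1 */
        C[I][J] = A[I][J+1] + C[I][J];               /* S2 */
        atomic_store(&syn[I], J);                    /* post(syn(I), J) */
    }
}

int main(void)
{
    /* Run serially here only for demonstration; the waits are then trivially met. */
    for (int I = 1; I <= N; I++)
        doacross_body(I);
    return 0;
}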
1.3 Motivation and Research Objectives

Summarizing the above statements, the major concerns of loop parallelization are listed as the following:

- compile-time cost
- degree of parallelism
- workload balancing
- data locality
- run-time overhead

With a reasonable cost at compile time, we hope that the maximal degree of parallelism can be exploited, the workload can be fairly distributed across processors, the data transfer between processors can be reduced, and the run-time overhead of scheduling and synchronization can be minimized. However, it is difficult, and not necessary, to consider all of the points listed above. For example, for some loops that have complicated dependence relations, we focus on systematically arranging synchronization primitives. For some loops that have much more parallelism than the number of processors, overhead and workload balancing should be the major concern. Based on the dependence relations of loops, we develop three new methods to improve the execution of iterative loops in multiprocessor systems.

First, consider nested loops that carry regular dependence relations. Many techniques have been proposed to deal with such loops. Most of them focus on exploring more parallelism. For a multiprocessor system, however, more parallelism does not always result in better performance, due to too much data transfer and synchronization induced between processors. In the first part of this research, we develop a transformation method which sacrifices some potential parallelism to avoid data transfer and synchronization between processors.

Second, consider nested loops that carry complex and irregular dependences. Due to the complicated dependence relations between iterations, it is difficult to efficiently arrange synchronization to preserve all cross-iteration dependences. Usually, such a loop is executed serially and no speedup is obtained. The second objective of this thesis is providing a new mechanism, which requires a reasonable compile-time cost and induces a small run-time overhead, to explore more parallelism from such nested loops.

Third, consider a DOall loop, which is the most important parallel construct in parallel programming. Since all iterations are independent of each other, there is no need for synchronization and data transfer between processors. However, due to the unpredictable execution time of iterations, our major concern in scheduling a DOall loop is how to evenly distribute the workload with the least amount of overhead. Some well-known scheduling methods are good at distributing the workload, but complicated. Some methods are simple, but not as good at balancing the workload. The last objective of this research is designing a practical scheduling scheme which can achieve better workload balancing while inducing a smaller run-time overhead than all other scheduling schemes currently used.

1.4 Thesis Organization

The rest of the thesis is organized as follows.
Chapter 2 gives a brief review of the dependence concept and some background on dependence analysis and synchronization. Some associated terms used in the presentation are also defined.

Chapter 3 discusses the parallel execution of nested loops with uniform dependence relations between iterations. We present the concept and disadvantages of sequentially iterated DOall loops, which are frequently generated by many restructuring techniques. A new method, namely loop netting, is proposed. Loop netting transforms nested loops into a DOall loop enclosing serial loops, which does not induce cache thrashing and barrier synchronization.

In Chapter 4, the attention is turned to parallelizing loops with a complicated dependence pattern. The new mechanism, data dependence uniformization, is proposed. We develop a new dependence analysis method to determine more accurate dependence information. Based on that information, irregular dependence relations are uniformized so that synchronization can be easily and efficiently arranged.

In Chapter 5, we study the scheduling problem of DOall loops in multiprocessors. Several well-known self-scheduling schemes are discussed, followed by a loop scheduling model. This model, which provides a basis for analyzing a loop scheduling scheme, is used to present our new scheduling scheme in the next chapter.

The practical Trapezoid Self-Scheduling (TSS) scheme is described in Chapter 6. By linearly decreasing the number of iterations assigned to processors, TSS achieves both the objective of small scheduling overhead and that of workload balancing. Due to the linearity, an efficient memory management is also proposed. The experiments conducted on a 96-node Butterfly GP-1000 give a comparison of TSS with other self-scheduling schemes.

Finally, the conclusion is presented in Chapter 7. Improvement of the work in this thesis is discussed and plans for future work are given.

CHAPTER 2

CROSS-ITERATION DEPENDENCE

Since an iteration is a unit of work assigned to a processor, cross-iteration dependences are the major constraints in loop parallelization. A cross-iteration dependence occurs if some data computed in one iteration is also used by another iteration. Data dependence analysis gives information about the underlying data flow of a loop. If no data flows across iterations, a loop can be directly parallelized. Otherwise, proper synchronization must be applied in the transformed parallel code to completely preserve cross-iteration dependences so that no semantic difference arises.

In this chapter, we first present basic concepts of data dependence. Then, some fundamentals of analyzing cross-iteration dependence will be given in Section 2.2. To clearly describe the dependence relations, a loop is usually represented graphically. In Section 2.3, we define the iteration space dependence graph and some associated terms that will be used in this presentation. Finally, some common synchronization primitives are reviewed in Section 2.4.

2.1 Basic Concepts

A program is a list of statements. Each statement has a set of input variables and a set of output variables. A data dependence occurs if an input or output variable of a statement is also an output variable of another statement. Let Si denote the i-th statement in a program (counting in lexicographic order). As presented in the literature, we use IN(Si) to denote the set of input variables of statement Si, and OUT(Si) to denote the set of output variables of Si. Let i < j. Corresponding to the three possible combinations
of output and input variables, there are three types of dependence between Si and Sj, listed as the following.

1. flow-dependence: OUT(Si) ∩ IN(Sj) is not empty.
2. anti-dependence: IN(Si) ∩ OUT(Sj) is not empty.
3. output-dependence: OUT(Si) ∩ OUT(Sj) is not empty.

Consider Example 2.1. There are three data dependences: the flow-dependence from S1 to S2, the anti-dependence from S1 to S3, and the output-dependence from S2 to S4. The dependence relations of a program are conveniently described by its dependence graph: a directed graph with nodes representing statements and arcs representing dependences between statements. Usually, we denote flow-dependence by a plain arc, anti-dependence by a cross arc, and output-dependence by an arc with a small circle on it. The corresponding dependence graph of Example 2.1 is shown in Fig. 2.1.

Example 2.1:
S1: A = B + C
S2: D = A * C
S3: B = B * 2
S4: D = C + 3

Figure 2.1. The dependence graph of Example 2.1

For statements inside loops, we are interested in dependence relations between statements and also between iterations. We distinguish dependence relations between different iterations by numbering arcs in a dependence graph. These numbers, called dependence distances in the literature, represent the distance of dependences. For instance, a plain arc with a value 3 from Si to Sj represents that statement Sj in iteration I+3 is flow-dependent on statement Si in iteration I. An arc with a value 0 represents a dependence inside an iteration, i.e., it is not a cross-iteration dependence. Consider Example 2.2, which is a loop including the same set of statements as in Example 2.1. The corresponding dependence graph is illustrated in Fig. 2.2. It can be seen that there is only one cross-iteration dependence, from S1 to S2, while the other two dependences can be ignored in loop parallelization.

Example 2.2:
do I = 3, 10
S1: A(I) = B(I) + C(I+1)
S2: D(I) = A(I-3) * C(I)
S3: B(I) = B(I) * 2
S4: D(I) = C(I+2) + 3
enddo

Figure 2.2. The dependence graph of Example 2.2

2.2 Solving Dependence Equations

As mentioned in the previous section, detecting dependences among statements involves taking the intersection of the corresponding IN and OUT sets. For scalar variables, this is straightforward. However, for array variables, testing for cross-iteration dependences involves checking whether two subscript expressions could take on identical values during the execution of a loop. Various dependence analysis algorithms have been extensively studied and proposed in the context of automatic loop parallelization of sequential programs [32] [9] [15] [33] [34] [35]. They are mostly based on the theory of linear Diophantine equations and the bounds on real functions.

Let Ij denote the loop index of the j-th level DO loop and ij denote a particular value of Ij. We assume that all loops are normalized, i.e., all index strides are equal to 1. Let the lower and upper bounds of I1 be the constants 1 and N, respectively. A d-level nested loop has the general form listed in Fig. 2.3, in which f1, ..., fk, g1, ..., gk, ℓ2, ..., ℓd, and u2, ..., ud are linear functions and A is a k-dimensional array. The question is whether there are two different sets of integer values, I = (i1, ..., id) and J = (j1, ..., jd), such that they satisfy the following k equations simultaneously.

do I1 = 1, N
do I2 = ℓ2(I1), u2(I1)
...
do Id = ℓd(I1), ud(I1)
A(f1(I1, I2, ..., Id), ..., fk(I1, I2, ..., Id)) = ...
... = A(g1(I1, I2, ..., Id), ..., gk(I1, I2, ..., Id))
enddo
...
enddo
enddo

Figure 2.3.
A multiple-level nested loop model.

f1(i1, ..., id) = g1(j1, ..., jd)
...                                          (2.1)
fk(i1, ..., id) = gk(j1, ..., jd)

If the answer is negative, clearly there is no cross-iteration dependence. If the answer is affirmative, there may be a cross-iteration dependence. However, since I and J must lie within the loop limits, the following inequalities must be satisfied as well.

1 ≤ i1, j1 ≤ N
ℓ2(i1) ≤ i2, j2 ≤ u2(i1)
...                                          (2.2)
ℓd(i1) ≤ id, jd ≤ ud(i1)

The algorithm for finding the general solution of a system of Diophantine equations can be found in [36]. Depending on the number of levels, d, the algorithm is time-consuming. Most parallelizing compilers apply only in-exact tests that may give a too conservative answer, i.e., a yes is answered while actually there is no dependence. Two of the most common in-exact tests are the GCD test and the Banerjee bound test [32]. The GCD test determines whether Eq.(2.1) has an integer solution, ignoring the loop bounds in Eq.(2.2). The Banerjee bound test checks whether Eq.(2.1) has a real solution within the loop limits. For instance, consider Example 2.3.

Example 2.3:
do I = 1, 20
S1: A(3*I+10) = ....
S2: .... = A(6*I-5) ....
enddo

The corresponding Diophantine equation and inequality are listed below.

3*i - 6*j = -15
1 ≤ i, j ≤ 20

The GCD test ignores the inequality and checks whether the gcd of 3 and 6 divides 15. In this example, the answer is positive, which indicates that there are integer solutions. However, the integer solutions may not be located within the valid loop limits in many cases. Consider Example 2.4. The corresponding Diophantine equation and inequality are listed in Eq.(2.3).

Example 2.4:
do I = 1, 10
S1: A(I+1) = ....
S2: .... = A(2*I+12) ....
enddo

i - 2*j = 11
1 ≤ i, j ≤ 10                                (2.3)

The GCD test gives a positive answer, i.e., this loop cannot be transformed to a DOall loop. In fact, there does not exist any dependence between S1 and S2, since OUT(S1) = {A(2), A(3), ..., A(11)} does not intersect with IN(S2) = {A(14), A(16), ..., A(32)}. The Banerjee bound test computes the lower and upper bounds of the expression i - 2*j and checks whether 11 is located within them. In Example 2.4, the bound of the function i - 2*j is [-19, 8]. Thus, the bound test returns a correct answer, while the GCD test fails in this example.
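To make the two in-exact tests concrete, the following C sketch applies them to the single-subscript case used in Examples 2.3 and 2.4: a write reference A(a1*I + a0) against a read reference A(b1*I + b0) inside do I = 1, N. The helper names are assumptions made for this illustration; it is not the implementation of any particular compiler.

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) { a = abs(a); b = abs(b); while (b) { int t = a % b; a = b; b = t; } return a; }
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Dependence test for A(a1*i + a0) written and A(b1*j + b0) read, 1 <= i, j <= N.
 * The dependence equation is a1*i - b1*j = b0 - a0.  Returns 1 if a dependence
 * cannot be ruled out, 0 if independence is proved. */
static int may_depend(int a1, int a0, int b1, int b0, int N)
{
    int c = b0 - a0;
    if (c % gcd(a1, b1) != 0)              /* GCD test: no integer solution at all */
        return 0;
    /* Banerjee bound test: range of a1*i - b1*j over the real region [1,N]x[1,N] */
    int lo = imin(a1, a1 * N) - imax(b1, b1 * N);
    int hi = imax(a1, a1 * N) - imin(b1, b1 * N);
    return (lo <= c && c <= hi);           /* c outside the bounds -> independent */
}

int main(void)
{
    /* Example 2.3: A(3I+10) vs A(6I-5), N = 20 -> dependence cannot be ruled out */
    printf("Example 2.3: %s\n", may_depend(3, 10, 6, -5, 20) ? "maybe dependent" : "independent");
    /* Example 2.4: A(I+1) vs A(2I+12), N = 10 -> the bound [-19, 8] excludes 11  */
    printf("Example 2.4: %s\n", may_depend(1, 1, 2, 12, 10) ? "maybe dependent" : "independent");
    return 0;
}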
2.3 Iteration Space

To clearly describe the dependence relations among iterations, informally, a loop can be said to describe an iteration space. Let each level of the nested loop correspond to an axis of the iteration space. A d-level nested loop then describes a d-dimensional discrete Cartesian coordinate system, and each iteration of this loop corresponds to an integer point in this d-dimensional space. Let Ij denote the loop index of the j-th level DO loop and ij denote a particular value of Ij, as before. The iteration space of a nested loop is formally defined as the following.

Definition 2.1 Let L be a normalized d-level nested loop of the form in Fig. 2.3. The corresponding iteration space of L, namely Γ(L), is equal to

{(i1, ..., id) | 1 ≤ i1 ≤ N, ℓj(i1) ≤ ij ≤ uj(i1) for 2 ≤ j ≤ d, ij ∈ Z+ for 1 ≤ j ≤ d}

where Z+ is the set of non-negative integers and N is the upper bound of the first-level DO loop.

A vector v = (i1, i2, ..., id) is an iteration vector of the loop L if v ∈ Γ(L). Each iteration vector (i1, i2, ..., id) corresponds to a series of statements in the loop body, S1, S2, ..., St, for (I1, I2, ..., Id) = (i1, i2, ..., id).

A cross-iteration dependence from iteration vector v1 to iteration vector v2 can be represented by a dependence vector d = v2 - v1. A dependence vector is a directed edge in the iteration space in which the node at the head must be executed after the node at the tail. The dependence vector set of a nested loop L, namely D(L), is a vector set which contains all distinct dependence vectors in the iteration space. Each iteration may have its own dependence vectors. If all iterations of a loop have the same set of dependence vectors, D, it is a uniform dependence loop. The definition is formally described as the following.

Definition 2.2 A nested loop, L, is called a Uniform Dependence Loop (UDL) if, for any v ∈ Γ(L), v depends on v - d for all d ∈ D(L) such that v - d ∈ Γ(L), where D(L) is the dependence vector set and Γ(L) is the iteration space of L.

Example 2.5:
do I = 1, 5
do J = 1, 5
S1: A(I, J) = B(I, J) + C(I+1, J+1)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo

We close this section with the doubly nested loop in Example 2.5. The 2-dimensional iteration space is shown in Fig. 2.4, in which the dotted arrows show the semantic execution order of iterations and the solid arrows represent dependence vectors. Since the flow-dependence from S1 to S2 does not traverse iterations, there is only one cross-iteration anti-dependence, from S1 to S2, which occurs in every iteration. Thus, the loop is a UDL with the dependence vector set {(1, 1)}.

2.4 Synchronization Among Iterations

It is clear that any loop can be executed concurrently as long as the dependences among iterations are preserved. To preserve those cross-iteration dependences, additional code must be added to ensure proper synchronization between the processors executing different iterations. In the worst case, synchronization from the bottom of one iteration to the top of the next iteration results in a serial execution.

Figure 2.4. The iteration space dependence graph of Example 2.5

There are several types of synchronization primitives used in DOacross loops [37] [38]; the choice depends on the target machine's architecture and the dependence pattern of the loop. In this section, we survey three major types of synchronization: post&wait, critical section, and barrier. We discuss their strengths and weaknesses with some simple examples.

Example 2.6:
do I = 3, N
S1: A(I) = B(I) + C(I)
S2: C(I) = E(I+1) + C(I)
S3: D(I) = A(I-2) * 2
S4: E(I) = A(I) * B(I)
enddo

Consider Example 2.6. The dependence graph is shown in Fig. 2.5, in which only cross-iteration dependences are considered. This is clearly a uniform dependence loop with dependence vector set D = {(1), (2)} in a single-dimensional iteration space.

Figure 2.5. The dependence graph of Example 2.6

By using the post&wait mechanism, the above loop can be transferred to a DOacross loop as below.

Example 2.7: (Parallel code of Example 2.6)
post(Asyn, 1)
post(Esyn, 2)
doacross I = 3, N
S1: A(I) = B(I) + C(I)
post(Asyn, I)
S2: C(I) = E(I+1) + C(I)
post(Esyn, I)
wait(Asyn, I-2)
S3: D(I) = A(I-2) * 2
wait(Esyn, I-1)
S4: E(I) = A(I) * B(I)
enddo

As can be seen, the post&wait mechanism is flexible but may require many synchronization points and thus induce too much run-time overhead. Besides, if a statement is dependent on itself, post&wait is not applicable. Consider Example 2.8. Post&wait cannot be applied due to the flow-dependence from S4 to itself. In this example, a critical section is the only choice. However, it was demonstrated in [39]
However, it was demonstrated in [39] 23 that critical sections are a major factor of performance degradation in a shared mem- ory multiprocessor. The goal is to insert as few critical sections as necessary and to make them as small as possible. Generally speaking, only those statements involved in a data dependence cycle need to be enclosed in a critical section. Example 2.8: doacross I I 3, N 51: A(I) I A(I) I A(I) 32: 8(1) I 8(1) I 2 S3: C(I) I B(I) + A(I)I2 begin critical section S4: sun I sun + C(I) end critical section enddo The most restricted synchronization method is barrier synchronization which di- vides a loop into segments such that cross-iteration dependences occur only between segments, but not inside a segment. All statements of all iterations in a segment must be completed before the execution of any iteration in the next. segment is started. The principle of barrier synchronization can be simply illustrated by Fig. 2.5 in which the two cross-iteration dependences can be hold by a barrier synchronization (the dotted line). The parallel code is listed below. Example 2.9: (Parallel code of Example 2.6) doacross I I 3, N 31: A(I) I 8(1) + C(I) 32: C(I) I E(I+1) + C(I) barrier 83: 0(1) I A(I-2) * 2 54: 8(1) I MI) I 8(1) enddo 24 This primitive handles only lexically forward dependences. If dependence cycles exist, some techniques, such as node splitting [8] [9], may be employed to break depen- dence cycles. Statement reordering then can be used to change lexically backward dependences into lexically forward dependences [31]. Barrier synchronization also has been implicitly applied in many loop restructuring techniques. For instance, a barrier synchronization is essentially equivalent to a loop distribution [40]. In a large scale multiprocessor system, however, the barrier synchronization is one of the major factors to the performance degradation due to the hot-spot condition [41] and the atomic operation [42]. More discussion about barriers in loop restructuring will be given in the next chapter. CHAPTER 3 TRANSFORMATION OF UNIFORM DEPENDENCE LOOPS As indicated in Chapter 1, the major goal of loop transformation is to explore more parallelism. Based on the dependence information provided by dependence tests, loop transformation rearranges nested loops such that the implicit parallelism can be explicitly expressed by some parallel code. It is true in general that the more precise and complete dependence information is provided, the more efficiently that parallel code can be generated. Since exactly analyzing dependences is time-consuming and difficult, many proposed approaches deal with only uniform dependence loops that have a regular dependence pattern [43] [44] [45] [46] [18] [47] [48] [49] [50]. In this chapter, we study the transformation of uniform dependence loops. The dependence pattern of a uniform dependence loop usually is so simple and regular that the complete dependence vector set can be easily identified, and synchronization primitives can be systematically arranged. However, in addition to parallelism, there are two important concerns. First, in many multiprocessor systems, a local cache or local memory is employed to bridge the gap between fast processor cycle and slow 25 26 main memory access times. To reduce data transfer among caches and local memory of processors, data locality should be considered in restructuring nested loops. 
Second, to reduce run-time overhead induced by synchronization, selecting and arranging synchronization primitives is the other important issue for loop transformation. The rest of this chapter is organized as the follows. In Section 3.1, we discuss a sequentially iterated DOall loop which is a common parallel code generated by many transformation methods. In Section 3.2, we present a new partitioning method which does not involve much data transfer and synchronization among processors. Finally, the well-known wavefront method is reviewed for dealing with special uniform dependence loops with no explicit parallelism. 3.1 Sequentially Iterated DOall Loop Loop transformation uses the data dependence information to determine whether the existing dependences allow the 100p, or parts of it, to execute in parallel without violating the semantics. If no cross-iteration dependences exist, the loop can be simply transformed to a DOall loop. In most cases, however, dependences occur, and appropriate rearrangements are required to extract partial parallelism. It is common that a nested loop is transformed into a serial outer loop enclosing a number of DOall loops. This situation is referred to as sequentially iterated DOall loops in parallel programming [51]. Consider Example 3.1(a) in which a dependence cycle exists. The dependence vector set is { (1, 3), (3, 1)}. Since all dependences involved in this cycle are flow-dependences, this dependence cycle is not breakable. Thus, loop distribution cannot be applied [40]. 27 Example 3.1: do I I 3, N do J I 3, N 31: A(I, J) I B(I-l, J-3) . S2: B(I, J) I A(I-S, J-1) . . . enddo enddo (a) do I I 3, N doall J I 3, u 81: A(I, J) I 8(1-1, J-3) . . . S2: B(I, J) I A(I-3, J-1) . . . enddo enddo (b) Sequentially iterated DOall 100p In fact, the iterations in the second level loop are independent of each other. For example, by using selective cycle shrinking method [17], the parallel code is listed in Example 3.1(b). Assuming self-scheduling is applied [26] [25] [52], processors dy- namically fetch the next iteration of the DOall loop at run time. Due to the nonde- terministic run-time behavior, the assignment of iterations to processors is arbitrary and unpredictable. For instance, in the first iteration of the outer serial loop (I = 3), processor 1 is assigned to iteration vector (3, 3) and has array elements A(3,3), A(0,2), B(3,3) and'B(2,0) in its cache. Processor 2 is assigned to iteration vector (3,4) and has array elements A(3,4), A(0,3), B(3,4) and B(2,1) in its cache. Then, in the second iteration of the outer serial loop (I = 4), if processor 1 is assigned to iteration vec- tor (4, 7) and processor 2 is assigned to iteration vector (4,6), array element B(3,4) must be moved from the the cache of processor 2 to the cache of processor 1, and B(3,3) should be transferred in the opposite direction. The same behavior may occur again in the iterations I = 5,6, - - - ,N if the iterations, which access the same set 28 of data, are not assigned to the same processor. This phenomena, called cache or local memory thrashing, is one of major factors causing performance degradation in multiprocessor systems [53]. A partitioning method is proposed in [51] which employs a static scheduling scheme to partition sequentially iterated parallel loops with a uniform dependence pattern. It has been proved that the amount of data transfer between processors re- quired by this partitioning method is less than any other static partitioning methods. 
The other important factor in loop transformation is the amount of parallelism. In the above example, the degree of parallelism is equal to the bound of the second-level loop, M-3, which is possibly large enough. However, in a sequentially iterated DOall loop, barrier synchronization is implicitly applied in every iteration of the outer serial loop. If the granularity of the loop body is small, the run-time overhead induced by barriers may outweigh the benefit of the resulting parallelism, in particular when the outer loop bound is large and the inner loop bound is small.

3.2 Loop Netting

Rather than a sequentially iterated DOall loop, in this section we propose a new loop transformation technique, namely loop netting, to restructure a uniform dependence loop into a DOall loop enclosing a number of serial loops. Our new method involves neither cache and local memory thrashing nor barrier synchronization. We start our presentation by considering the UDL in Example 3.1(a). The iteration space dependence graph of Example 3.1(a) is partially shown in Fig. 3.1, in which the iteration space is composed of many independent sets. All iteration vectors in each set are chained by the two dependence vectors, (3, 1) and (1, 3), and are independent of the iteration vectors in the other sets. To make the figure clear, only two sets are displayed: one chained by solid arrows and the other chained by dotted lines. The formal relation between iteration vectors in the same set is defined as the following.

Definition 3.1 Given two iteration vectors u and v, v is reachable from u via dependence vector set D = {d1, d2, ..., dk} if there exist c1, ..., ck ∈ Z such that v - u = c1*d1 + c2*d2 + ... + ck*dk, where Z is the set of integers.

Lemma 3.1 The relation reachable from is symmetric and transitive.

Proof: Let the dependence vector set be D = {d1, d2, ..., dk}.

Symmetric: If v is reachable from u via D, then v - u = c1*d1 + c2*d2 + ... + ck*dk, where c1, ..., ck are integers. So u - v = -c1*d1 - c2*d2 - ... - ck*dk. Thus, u is also reachable from v.

Transitive: If v is reachable from u and u is reachable from w via D, we have

v - u = b1*d1 + b2*d2 + ... + bk*dk, and
u - w = c1*d1 + c2*d2 + ... + ck*dk,

where b1, ..., bk, c1, ..., ck are integers. Combining the above two equations, we have

v - w = (b1 + c1)*d1 + (b2 + c2)*d2 + ... + (bk + ck)*dk.

Thus, v is reachable from w. □

It is clear that if two iteration vectors are not reachable from each other, then there is no cross-iteration dependence between them. Note that the other direction is not necessarily true. For example, iteration vectors (3, 3) and (3, 11) are reachable from each other in Figure 3.1, but they are independent.

Figure 3.1. The iteration space of Example 3.1(a)

Each independent set, called a
Definition 3.2 Given a UDL with dependence vector set D, the iteration vector set {v}, - - - ,5;} is a net in I‘(L), if if,- is reachable from 23',- via dependence vector set D foranylSi,j$t. Theorem 3.1 Given a UDL with dependence vector set D, all nets in F(L) are independent of each other. Proof: The proof is done by contradiction. . Let D = {.113}, . - - ,d';} and a e T, and s e 7",, where T, and T2 are two distinct nets in I‘(L). Assume ii is dependent on ii. There exist non-negative integers c1, - - - , ck such that ii - ii = Cl I d; + c; * d; + ~ - - + c], * d1. From Definition 3.1, if is reachable from ii via D. Since the relation reachable is symmetric and transitive (Lemma 3.1), all iteration vectors in T1 are reachable from any one in T2. Thus T1 and T2 are in the same net. D Since nets are independent of each other, if all iterations in a net are performed by a single processor, no data transfer and synchronization between processors are required. The principle of loop netting is that a processor, which is assigned to an independent net T, starts executing from an initial iteration vector, namely entry point, and scans the rest of the iterations in T by following the direction of the dependence vectors. The most important part of this approach is finding all entry points. Obviously, the number of entry points is equal to the number of independent nets in the iteration space. We formally define the set of entry points as the following. Definition 3.3 Given a UDL, L, with dependence vector set D, a set of vectors E is an entry set in I‘(L), if the following two conditions hold. 32 (1 ) For any two distinct e'}, 6;- 6 E, 6',- is not reachable from e“;- via D. (2) For any ii 6 I‘(L), there exists a E 6 E such that ii is reachable from 6 via D. Consider Figure 3.1. Intuitively, the integer points inside the parallelogram which is enclosed by (3,3), (4,6), (6,4), (7,7) form an entry set. There are a total of 8 points in that parallelogram,i.e., there are 8 independent nets in Figure 3.1. However, due to irregularity, it is not a straightforward task to completely and mutually exclusively fetch integer points inside a parallelogram at run—time. We develop a simple and efficient method which locates the other set of entry points that are enclosed in a rectangle. For instance, the 8 points in the dotted rectangular form the other entry set in Figure 3.1. Our approach is formally described by the following lemmas and theorems. Lemma 3.2 If ng($1,$2, - - - ,xk) = g, then there exist a1,a2, - ' - ,ak such that al a: $1 + "' +ak*$k =9 when mlix2au'ixki aliGZiu'iak 6 Z Proof: This is a basic property of the greatest common divisor [54]. D Lemma 3.3 If gcd(.r1,:c2, ~ - ~ ,xk) = g, and a1 * z; + - ~ - + a). ,* an, = G, then there existc such thatG=c*g, where zl,x2,---,xk, a1,ag,---,a;,, andce Z. Proof: This is a basic property of the greatest common divisor [54]. 13 Theorem 3.2 Given a doubly nested UDL, L, with dependence vector set D = {(u1,v1),(u2,v2)}. The set E ={(z,y)]15 1? < 9. 1 S y < h, 3,3! 6 Z} , "1 01 is an entry set, where g = gcd(u1, 11.2), h = w/ g and w = 112 v; 33 Proof: Part 1: we prove any point (3:, y) E E are not reachable from each other. First, we prove that any two entry points (2:1, yl), (22, y;) 6 E are not reachable from each other, if z; 75 2:2. The proof is done by contradiction. Assume there exist a, b 6 Z such that 2:1 — 2:2 = a I u] + b I 112. From Lemma 3.3, we have 21 — 1:2 = csg where c 6 Z. But, we know 2:1 - 2:2 < g. 
Second, we prove that any two entry points (2:,y1), (2:,y2) E E are not reachable from each other, if y] gé yg. Again, the proof is done by contradiction. Assume there exist a, b E Z such that 32-: =a*u1+b*u2==0 (3.1) yl—yz =a*v1+b*v2 Thus, a = -b* EI' To ensure a 6 Z, we have b= c* 1‘;- and a = —c* 291 where c6 Z. Substituting a, b into Eq.(3.1), we have c c yl—y2=5*(ul*v2—u2*vl)=3*w=6*h (3.2) Part 2: we prove that any point (i, j) can be reached from (:r,y) via D, where (1:,y) 6 E. A Let so = —‘-"? and b0 = 2,1. We have a0 I (u1,v1) + b0 I (U2,‘Ug) = (0, f) = (0, h). Let i’ = i mod 9 and j’ =j mod h. There exist c.-, c, E Z such that i =c,-*g+i’ (3.3) j=CjIh+f From Lemma 3.2, there exist al, In 6 Z such that al I u1 + b1 I ([2 = c, I g. Let (01*01+b1*02) mod h = v’, there exist c. e Z such that also] +b1atv2 = chh+v’. 34 Case 1: j’ 2 v’. Let (1:,y) = (i’,j’ — v’) e E. We prove there exist two integers (a, b) = ((Cj - c.) I a0 + a1, (Cj — c.) * b0 + b,» such that a a: (u1,v1) + b I (u2,v2) + (:r,y) = (i,j). Compute i-part by the following equations. (c,-—c,,)*ao*u1+a1*u1+(cj—c,)*bo*u2+b1*u2+i’ = (Cj—Cu)*(ao*u1+bo*u2)+al*u1+b1*u2+i’ (3.4) = 0+q*g+¥ = 2 Compute j-part by the following equations. (cj—cv)*ao*v1+a1*vl+(cj—cv)*bo*v2+b1*v2+j’—v’ = (cj—c.)*(ao*v1+bo*v2)+a1*vl+b,*v2+j’—v’ (35) = (cj-ct)*h+ct*h+v’+j’—v’ \ = I Case 2: j’ S v’. Let (x,y) = (i’,j’ — v’+ h) 6 E. We prove there exist two integers (a, b) = ((Cj - c, - 1) *00 +01, (01' — c, ‘ 1) * b0 + 51)) such that a I (u1,v1) + b * (u2,v2) + (z,y) = (i,j). The proof is similar to in Case 1. Cl Theorem 3.3 Given a doubly nested UDL, L, with dependence vector set D = {(u1,vl),(ug,v2)}. The set E ={(:c,y)|15 a: < h, 15 y < g, 3,31 6 Z} ul 01 is an entry set, where g = gcd(v1,v2), h = w/g and w = 112 v2 35 Proof: The proof is similar to the proof in Theorem 3.2. CI Corollary 3.1 Given a doubly nested UDL, L, with dependence vector set D = {(u1,v1), (u2,v2)}. The number of independent nets is w = "1 v1 u; v; Now the last problem is how to scan a net from an entry point defined in The- orem 3.2,i.e., how to restructure the loop to parallel code. The scanning algorithm actually is embedded in the proof of Theorem 3.2. From an entry point, a processor jumps h steps in the second level loop until the bound is reached. Then, it jumps g steps in the first level loop by following dependence vectors to reach the other point which is in the same net. This algorithm results in a semantic order of serial execu- tion in a net. Thus, dependence constraints are naturally preserved. A loop netting transformation for a doubly nested UDL with two dependence vectors is shown in Figure 3.2. Note that loop coalescing can be applied to reduce two DOall loops to a singly DOall loop, if necessary. For instance, Example 3.1(a) can be transformed to Example 3.2 by using the loop netting method. Since g = 1, the second DOall loop in Figure 3.2(b) is not necessary. For example, the execution order of the first net, which is started from the entry point (3, 3), is (3, 3),(3,11),(3, 19), - - - , (4, 6), (4, 14), (4,22), - - -. Example 3.2: Parallel code of Example 3.1(a) in loop netting doall J1 I 3, 10 do I I 3, N, 1 do J I J1, H, 8 81: A(I, J) I B(I-1, J-3) . . . 82: B(I, J) I A(I-3, J-1) . . . enddo J1 I (J1 + 3 - 3) mod 8 + 3 enddo enddo 36 do I I n, N do J I m, N \* Loop Body I\ 81: A(I, J) I B(I-ul, J-vl) . S2: B(I, J) I A(I-u2, J-v2) . . . 
(a) Source code /* a, b are any integers such that aIu1 + qu2 I g *l g I gcd(u1, u2) h I (ulIv2 - u2Iv1)/g doall I1 I n, n+g-1 doall J1 I m, m+h-1 do I I I1, N, g do J I J1, H, h \I Loop Body *\ $1: . A(I, J) I B(I-ul, J-v1) . 32: B(I, J) I A(I-u2, J-v2) . enddo J1 I (J1 + (a*v1 + b*v2) - m) mod h + m enddo enddo enddo . (b) Parallel code in loop netting Figure 3.2. Loop netting transformation 37 The above theorems deal with UDLs with two dependence vectors. If a doubly nested UDL has only one dependence vector, a virtual dependence vector can be added. The choice of the virtual dependence vector should result in as much paral- lelism as possible. Let (u,v) be the only dependence vector, and [n,N] and [m,M] be the ranges of the first level loop and second level loop, respectively. Two good candidate virtual dependence vectors are (0, M - m + 1) and (N - n + 1,0). The choice is determined by the number of resulting independent nets. For instance, con— sider Example 3.3(a) in which D = {(2,1)}. By applying Theorem 3.2, (0,100) which results in 200 independent nets is the better choice. Since M = h = 100, the second level serial loop in Figure 3.2(b) is not needed. The parallel code is listed in Example 3.3(b). Example 3.3: do I I 1, 100 do J I l, 100 \* Loop Body *\ Sl: A(I+2, J+l) ' B(I, J) . . (a) Source code doall II I 1, 2 doall J I 1, 100 ‘ do I I I1, 100, 2 \* Loop Body *\ 51: A(I+2, J+1) I B(I, J) . . . J I (J + 1 -1) mod 100 + 1 enddo enddo (b) Parallel code in loop netting 38 On the other hand, if the number of dependence vectors is larger than 2, Theo- rem 3.2 must be repeatedly applied to extract various entry sets. Again, by applying the principle of greatest common divisor, the final entry set can be obtained. An extended theorem dealing with doubly nested UDLs with many dependence vectors is described as the following. Theorem 3.4 Given a doubly nested UDL, L, with dependence vector set D = {(241, v1), - - - , (ud, vd)}. Let g = gcd(u1, - - - , ud), h = gcd(hu, h13, . - - , hm). The set E={(z.y)|152 0) or (c1 = 0 and c; > 0). The dependence vector set D =~ {(c1,c2)]. Let I‘ be the iteration space and (q1(x,y), q2(x,y), q3(x,y), q4(x,y)) be the general solution of the corresponding Diophantine equations of L. According to the definition above, 9303,31) - 4107,31) = c1 9403,11) — 92(1'3 y) = c2 Let (X, Y) = (q3(x,y),q4(x,y)). The general solution can be normalized to (X - c1,Y - Cg,X, Y). Thus, for any (i,j) 6 I‘, (i,j) depends on (i,j) — (c1,c2) if (i,j) - (01,62) 6 I‘. So, the loop L is a UDL. Case 2: (c1 < 0) or (c1 = 0 and 62 < 0)- fll a. 47 The dependence vector set D = {(—c1,—C2)}. The proof is similar to Case 1. 0 Consider Example 3.4 in Chapter 3. There are two systems of Diophantine equa- tions. i1 =i2+1 i1 =i2 I1 =le .71 =j2+l The general solution of the first system is (x,y,x + 1,y). Thus, the dependence vector function is a constant vector (1,0). For the second system, the dependence vector function is another constant vector (0,1). Therefore, the loop is a UDL with dependence vector set {(1, 0), (0, 1)}. 4.2 Analyzing Cross-Iteration Dependence As observed in the previous section, finding exact data dependences involves solving Diophantine equations. After the general solution of Eq.(4.1) is computed, the system of inequalities (Eq.(4.2)) must be satisfied. In this section, we propose a new method to apply the constraints (Eq.(4.2)) to the general solution (Eq.(4.3)) so that more precise dependence information, a domain of dependence vectors, can be explored. 
4.2.1 Solution Space and Dependence Convex Hull Let R denote the set of real numbers. The two free variables of a general solution (Eq.(4.3)), x and y, form a 2-dimensional space, namely solution space 9 where 9 = {(x,y)]x,y E R}. Every integer point (x,y) in 9 corresponds to two vectors, (91(3,y),qz(x,y)) and (q3(x,y),q4(x,y)), which possibly forms a dependence vector. However, due to the bounds of the nested loop (Eq.(4.2)), only a limited region in the solution space is valid. In a multiple-dimensional solution space, this region is a 48 convex hull, called Dependence Convex Hull (DCH). Thus, every integer point in the DCH certainly corresponds to a dependence vector in the iteration space. Substituting Eq.(4.3) into Eq.(4.2), the DCH is defined as the following. Definition 4.2 Let L .be a doubly nested loop with the form in Fig. 4.2. The Depen- dence Convex Hull of the loop L, namely DCH(L), is the region H, where H is equal to {(x.y)l1 S May) S New 6 R} n {($,y)ll(qt($.y)) S (May) S U(ql(z.y)). m: 6 R} f) {(I.y)|1 S (1303,31) S New 6 R} n {($,y)|€(t13(z.y)) S May) S u(qa(a=.y)). any 6 R} where (q1,q2,q3,q4) is the general solution of Diophantine equations (1.1). y .,."':.x+y-I=I(XX) I [I ."I’. I IX+Y' 1 =1 FIMIIO:'J ’03:} 25,; “:12; ‘ y=1 . x=l “500 xélmo Figure 4.3. The DCH of Example 4.1 49' Consider Example 4.1 again. The general solution is (x, y, —-x + y — 1, 2x). Ac- . cording to the definition above, the DCH is listed as the following. {(x,y)]1 S x S 1000,x,y E R} n {(aym s y 3 1000.3.) 6 R} n {(xwlll S -a=+ y -1S1000.x.y 6 R} n {(x,y)|1 S 2x S 1000,x,y E R} The DCH in the solution space is also shown in Fig. 4.3 in which the DCH is bounded by four nodes, (1, 3), (1, 1000), (500, 1000), and (500, 502). Each integer point in the DCH corresponds to a dependence vector in the iteration space in Fig. 4.1. For instance, the point (1, 5) in the DCH is corresponding to the dependence vector from (1, 5) to (3, 2) in the iteration space. Lemma 4.2 If the dependence convex hull (DCH) of a nested loop L is empty, then no cross-iteration dependence occurs in L. Proof: The proof is done by contradiction. Let (q1(x,y),q2(x,y),q3(x,y),q4(x,y) be the general solution of the corresponding Diophantine equations of L. Assume there exists a dependence between iteration (inji) and iteration (ig,j2). There must exist two integers a,b such that (i1,j1) :- (q.(a, b),.,,(., b)) and (5,, j.) = (q3(a, b), q.(a, 1)). Also, since both (11,13) and (1,, j.) are in the iteration space, the following inequalities hold. 1 S QI(ai b) _<- N l(q:(a. b)) S q2(a.b) S u(311(a.b)) 1 S 93(aib) S N I(q3(ai b)) _<_. Q4(ai b) _<— u(q3(a’ b» 50 Thus, (a, b) is in the DCH. D If an inequality, q(x,y) S 0 (or q(x,y) 2 0), represents a half space, a DCH actually is the intersection of eight half spaces in the solution space 9. To efficiently compute the intersection of half spaces, a DCH is internally represented by a ring structure composed of the corners (or called nodes) of the convex hull. Initially, the DCH is set to be an infinitely large region (in practice we select 4 corners). Then, each half space is applied to cut the DCH. In our algorithm, the valid half space is defined as zoom-0 While the rest of solution space is called zoom-1. The evaluation of zoom value‘for a point (x,y) e 9 is simply checking the value of the function, g(x,y). If all nodes of a DCH are located in zoom-0, the intersection is the whole DCH. On the other hand, if all nodes are located in zoom-1, the intersection is empty. 
In case that nodes are located in both zoom-0 and zoom-1, the intersection points are computed and inserted into the ring structure. Then, the ring is scanned once again to remove the nodes in zoom-1. A detailed C-like algorithm to identify a DCH in a 2-dimensional solution space is listed in Fig. 4.4. Since each cut can increase at most one node, the maximal number of DCH nodes of a doubly nested loop is 12 (original 4 nodes plus 8 cuts). The complexity, in worst case, is 0(n2) where n is the number of half spaces. According to our experiments on a SPARE-2 workstation, this algorithm can be finished in about 350 psec when the number of half spaces is 8. 4.2.2 S-Test: A Dependence Slope Test Based on the general solution of Diophantine equations and the constraints derived from loop index bounds, we have defined and explained the concept of the solution space and the dependence convex hull (DCH). In this section, we propose an S—test to identify the bounds of dependence slopes of dependence vectors in a 2-dimensional iteration space. The dependence slope of a dependence vector, ii = (i, j), is defined 51 Input: A list of 8 half spaces (Def.(2)) Output: A DCH struct node { float (x, y) ; int zoom ; struct node *next ; } ; max = 9999999 ; begin Build the initial DCH ring which is composed of four nodes: (x1,y1) = ( max, max); ($02,112) = ( max. -maX); (xg,y3) == (~max, -max); (34,114) = (’max, max); while (the input list is not empty) Pop a half space from the list, named HS ; Scan the ring Determine the zoom value for each node ; if ((x,y) 6 HS) then zoom = 0 ; else zoom == 1 ; if (the zoom is different from previous node) then Compute the intersection point and give it zoom = 0 ; Insert it into the ring between the current node and the previous node ; Scan the ring again Remove the nodes with zoom = 1 ; if (the ring is empty) STOP ; end while end Figure 4.4. The Algorithm of Finding the DCH in a 2-Dimensional Solution Space 52 as j /i. The general formula of dependence slopes, called Dependence Slope Function (DSF), is defined as follows. Definition 4.3 Given a doubly nested loop L with the form in Fig. 4.2, the De- pendence Slope Function of L, namely DSF(L), is defined as s = $5], where (h1(x,y), h2(x, y)) is the the dependence vector function of L. Identifying the bounds of dependence slopes of a loop L is, therefore, transferred to computing the maximal and minimal values of DSF (L) in the DCH(L). The char- acteristic of a DSF can be explored by the equation below. h2(x,y) — h1(x,y) I s = 0 (4.4) Intuitively, Eq.(4.4) is a line in the solution space 6, in which all points have the same DSF value 3. If line h1(x, y) = 0 is not in parallel with line h2(x,y) 2: 0, line Eq.(4.4) passes the intersection point of line h1(x,y) = 0 and line h2(x,y) = 0 for all s E R. The geometrical relation between Eq.(4.4) and h; = 0 and hg = 0 is illustrated in Fig. 4.5(a). When 3 is increased from —oo to +00, by following the arrow, line Eq.(4.4) rotates from h] = 0, passing ’12 = 0, and back to h = 0. If line h1(x,y) = 0 is in parallel with line h2(x,y) = 0, Eq.(4.4) is a line in parallel with both line h1(x,y) = 0 and line h2(x,y) = 0. Let h: = a I h; + b, where a,b 61R and a > 0. The geometrical relation between Eq.(4.4) and h; = 0 and h; = 0 is shown in Fig. 4.5(b). Line Eq.(4.4) shifts from h, = 0, passing h; = 0, and goes back h1 = 0, when s is increased from —oo to +00. 
It can be seen that, the s-value in Eq.(4.4) is continuously increasing except that it jumps from +00 to —00, when P3381113 h1(x,y) = 0 (both Fig. 4.5(a) and (b)). Definition 4.4 Given a doubly nested loop L with DCH (L) = H and DSF(L) = 2%)}:3, we say DSF(L) is monotonic in DCH(L) if the maximal and minimal values 53 h >0 / 2 . \ sag-Q hl >0 0 v40 1“1:0 112-0 112 >0 (3)111yh2 (b) 1.11/th, h2=a*h1+b Figure 4.5. The characteristic of DSF Of w in H are located on the boundary of H. Lemma 4.3 Given a doubly nested loop L with DCH (L) = H and DSF (L) = fig], if line h1(x,y) = 0 does not pass through H in the solution space, then DSF (L) is monotonic in DCH(L). Proof: Since h1 = 0 does not intersect with H, the whole convex region H must be in either the half space h1(x,y) > 0 or h1(x,y) < 0. Assume that H is in the side 0f I11 > 0. Let s] and s; be the maximal and minimal values of filfij in H. As shown in Fig. 4.6, both line b; — s; I h} = 0 and line h; — s; I h; = 0 are two tangents of the convex region H. Thus, 31 and s; can only be generated on the boundary of H. By Definition 4.4, DSF(L) is monotonic in DCH(L). 0 Lemma 4.4 Given a doubly nested loop L with DCH (L) = H and DSF(L) = fig”)! if all nodes of H are located in one of the following two half solution spaces 54 h2>0 H (b) h1// h2, h2=a*h1 + b Figure 4.6. A Monotonic DSF (1) {($.v)l(h1($.y) Z 0 and May) 2 0) or (h1(=r,y)> 0 and h2(=€,y) < 0),x,y e R} (2) {(x.y)l(h1(z,y) S 0 and h2($.y) S 0) or (May) < 0 and h2($,y) > 0),I.y 6 R}, then DSF(L) is monotonic in DCH(L). Proof: This lemma is almost the same as Lemma 4.3 except that the maximal value of m in H is possibly equal to +00. The proof is trivial and the same as h! (1.”) the proof above. . , ' D Clearly, if h is a nonzero constant (h, will never be zero in the solution space), DSF, fig], is always monotonic in any convex region where h is an arbitrary func- tion. As mentioned, the goal of our algorithm is to compute the maximal and minimal DSF(L) values in DCH(L). If DSF(L) is monotonic in DCH(L), the maximal and min- imal values can be easily located on the nodeslof DCH(L). 55 Theorem 4.1 Given a doubly nested loop L, if DSF (L) is monotonic in DCH(L), then the minimal (or maximal) DSF (L) value can be found on a node of DCH(L). Proof: Let s; be the minimal value of DSF(L) in DCH(L). Since DSF(L) is mono- tonic in DCH (L), 31 must be located on the boundary of DCH ( L). For a convex region, the boundary is composed of segments. Assume that 31 is generated on a boundary point, namely (a, b). Then, (a,b) must be on the line h; — 31 I h; = 0. If (a, b) is an end of a boundary segment, (a, b) is a node of DCH(L). - On the other hand, if (a, b) is a middle point of a boundary segment, the bound- ary segment which (a, b) located on must also lie on the line hz — s1 * h; = 0 due to the characteristic of tangents. Thus, 31 can also be generated from both end of that boundary segment, i.e.,sl can be found on a node of DCH(L). 0 Example 4.2: do II1, 10 do JI1, 1O A(2IJ+3, 1+1) I ... I A(2II+J+1, I+J+3) .... and do end do Consider the above example. The general solution of the corresponding Diophantine equations is (x, y, -x +2y+ 4, 2x - 2y — 6) and the DSF is %. By applying the algorithm in Fig. 4.4, the DCH is bounded by four nodes, (10,6.5),(10,3.5),(5,1), and (4.5,1). Let h; denote 2x -‘ 3y - 6 and h; denote —2x + 2y + 4. The DCH (shaded area) and both lines h, = 0 and h; = 0 (dotted lines) in the solution space are illustrated in Fig. 4.7. 
It can be seen that line h1(x, y) = 0 does not pass though the DCH, i.e.,the DSF is monotonic in the DCH. The maximal value of the DSF 56 in the DCH is 1.833 which is located on the node (10,3.5) while the minimal value, -0.3889, is generated by the node (10, 6.5). —>‘< 10 b2 lh1 =-o.3889 Figure 4.7. The solution space of Example 4.2. Apparently, for a loop L with DSF (L) = $23, if h] = 0 passes DCH(L), the range of DSF(L) in the DCH(L) is possibly from —00 to +00. However, a —00 dependence slope is invalid, because it implies that an iteration depends on its next iteration in a serial execution. Also, due to the characteristic of integer functions, the range of dependence slopes can be further reduced. Lemma 4.5 Given a doubly nested loop L with the form in Fig. 4.2, the valid range of DSF(L) is from (1 - M) to +00, where M is equal to maxlSi5N(u(i) - l(i)+1). The proof is straightforward. 57 Theorem 4.2 Given a doubly nested loop L with DSF(L) = $37} where h; is in parallel with ’12, ifDSF (L) is not monotonic in DCH(L), then the range of the DSF (L) is (a - b, +00) when b > 0 (a + b, +00) when b S 0 where h2=a*h1+b, anda,b€R. Proof: dependence slope s = .— Since a is a constant, we only have to consider the second factor 361-. Case 1: when 6 > 0. Clearly the maximal s is 00 when h1 = 0. Since h; is an integer function, the minimal 3 is a — b when h1 = —1. Case 2: when b S 0. From Lemma(4.5), s = -00 is invalid. The maximal s is 00 by reversing the direction of the dependence vector when h} = O. The minimal s value is a + b when h, = 1. Again, we illustrate the above theorem by Example 4.1 which has the general solution (m,y, —a:+y—1,2a:). The DSF is {3 where h = —2x+y—1 and ’12 = 2.1—y. TheDCH (shaded area) and both lines h1 = 0 and h; = 0 (dotted lines) are illustrated in Fig. 4.8 in which line h1(a:,y) = 0 clearly passes though the DCH. Thus, the DSF 58 is not monotonic in the DCH. However, since h: = —1 at h] - 1, by applying the above theorem, the range of the DSF is (-2, 00). Figure 4.8. The solution space of Example 4.1 Corollary 4.1 Given a doubly nested loop L with DSF (L) = fig;- where ’12 is a constant b, if DSF(L) is not monotonic in the DCH (L), then the range of the DSF(L) is (—b, +00) when b > 0 (b, +00) when b S 0 Proof: If h, is a constant b, then h, = o ... h, + b. By Theorem 4.2. the Corollary is proved. El 59 Theorem 4.3 Given a doubly nested loop L with DSF (L) = 23:1: where hl is not in parallel with hg. If DSF (L) is not monotonic in the DCH (L), then the range of the DSF(L) is (— min(M — 1,T),00), where T is the maximal absolute h; value of the corners of the DCH, and M is equal to max15;51v(u(i) -— [(2') + 1). Proof: Since the h1(:c, y) and h2(:c, y) are integer functions, the minimal nonzero absolute value of h; is 1. Therefore, if T is the maximal absolute h; value in the DCH, the dependence slope, ff, is always large than —T. However, by Lemma 4.5, the valid range of dependence slope is (1 — M, +00). So, the range of DSF is (- min(M — 1, T), 00). Cl Concluding the above theorems, the algorithm of S-test is listed in Fig. 4.9. The input of this algorithm are a DCH and a DSF, 7’3. We first determine whether the DSF is monotonic in the DCH by using Lemma 4.4. If the answer is yes, Theorem 4.1 is applied to compute the bounds of the DSF. Otherwise, depending on whether h; parallels with hz, Theorem 4.2 or Theorem 4.3 will be employed. The complexity, in the worst case, is 0(n) where n is the number of nodes in the DCH. 
Again, since a DCH is composed of at most 12 nodes for a doubly nested loop, the execution time is very small. 4.3 Dependence Decomposition In the last section, an exact dependence analysis method has been presented to iden- tify the range of dependence slopes of a doubly nested loop. The range of dependence slopes, namely (31, 82), implies that any vector (I, J), where I 6 2*, J E Z, is possibly a dependence vector as long as the following equation is satisfied. NH. IA Co N (4.5) 60 Input: A DCH and a DSF,(%:) Output: The bound of the DSF in the DCH begin if (both h; nd h; are constants) Return the constant slope fif- ; if (the DSF is monotonic in the DCH) (Lemma 4.4) Compute the maximal and minimal dependence slopes by Theorem 4.1; else if (hl in parallel with hg) Compute the maximal and minimal dependence slopes by Theorem 4.2 ; else Compute the maximal and minimal dependence slopes by Theorem 4.3; end if end if end Figure 4.9. The algorithm of S-test 61 In order to be valid, the group of vectors satisfying Eq.(4.5), called dependence cone, must be considered in the dependence uniformization. It should be emphasized that the dependence vector set is a subset of the dependence cone. In this section, we propose the approach dependence decomposition in which all vectors in the dependence cone can be decomposed into a set of basic vectors. The set of basic vectors is called Basic Dependence Vector set then can be added to every iteration to replace all original dependence vectors so that the loop become a uniform dependence loop. Definition 4.5 Given a doubly nested loop L, a set of vectors B = {b-;, b}, . . . , 5;} is defined as a Basic Dependence Vector (BDV) set of L, if for any 56 C, there exists c1,c2,...,cb 6 2", such that d: c; *b: +cz*b;+~-+cb*b;, where Z" is the set of non-negative integers and G is the dependence cone of L. Theorem 4.4 Any doubly nested loop L can be transferred to a uniform dependence loop with dependence vector set B, where B is a BDV set of L. Proof: Let D denote the dependence vector set of L, I‘ denote the iteration space of L, and B = {b-;,b;, . . . ,b7,}. We want to prove that for any two vectors 17,13 6 I‘, where 17 = 21; + d for any d6 D, 17 must be executed after :3, as long as all vectors in B are preserved in every iteration. Since D is a subset of the dependence cone, by definition, there must exist positive integers c1,c2,...,cb such that d: c1 *b: +Cz*b;+'°'+cb*l;bg T0 obey bi, tad-bi must be executed after 213. Accumulatively, a; + c, * b: must be executed after 215. Similarly, to preserve b: for 2 S i S b, 175+c1 *b: +c2 *b; must beexecuted after t3+c1 *b-i. 13+c1*b:+c2*b;+-~+cb*b;mustbeexecutedafter t3+¢1*b'i+02*b;+"-+Cb-i *bb:l 62 Thus, ii = iii + (T will be executed after 113. D The above theorem indicates that, in the existence of a BDV set, a nested loop can always be transferred to a uniform dependence loop. This transformation is called dependence uniformization. However, there are many BDV sets for a dependence cone. The choice is determined by the synchronization technique being applied to the nested loop. In general, less vectors in a BDV set, more efficient synchronization can be arranged. The goal of our algorithm is to find a BDV set which is composed of the least number of basic dependence vectors. Unless 31 = 32 in Eq.(4.5), a BDV set of a doubly nested loop is composed of at least two vectors. The following theorems derive formulas to compute most simple BDV sets for a given dependence cone in a 2-dimensional iteration space. 
To cooperate with the proposed theorems, a synchronization method which induces very small synchronization overhead will be presented in the next section. Theorem 4.5 Given a doubly nested loop L with a dependence slope range (31,32), the set of vectors {(0,1),(1, [slj)} is a BDV set ofL. Proof: Let (I, J) denote‘a dependence vector where I E Z+ and J 6 Z. We want to prove that if (I, J) is in the dependence cone (Eq.(4.5)), there must exist two positive integers, c1, e: such that (I, J) = 61 * (0:1) + 02 * (la [sill- Since I, J, and [51] are all integers, there must exist a positive integer r, such that J: [31J*I+r 63 Now, (I,J) can be decomposed as the following (LJ) =(I,l~91l*1+"l =(Iai-91J*Il+(0,rl =I*(1,[slj)+r*(0,1) I Now, the theorem is proved by selecting c1 = r and c2 = I . Figure 4.10. The uniformized dependence pattern of Example 4.1 Consider Example 4.1 in Section ?? again. The general solution is (z, y, —a: + y — l, 22:) and the DSF is fizz-1:151. By applying the S-test in the last section, the range of dependence slope is (-2, 00). A BDV set { (0, 1), ( 1, —2)} then can be easily computed by using Theorem 4.5. Thus, the irregular dependence pattern in Fig. 4.1 can be uniformized to a uniform dependence pattern with two fixed dependence vectors, 64 (0,1) and (1,-2). The uniformized dependence pattern in the iteration space is shown in Fig. 4.10. Theorem 4.6 Given a doubly nested loop L with a dependence slope range, (31,32), the set of vectors {(0, -—1), (1, [32]” is a BDV set ofL, if 32 74 00. Proof: The proof is similar to the proof of Theorem 4.5. Let (I, J) denote a dependence vector where I 6 Z‘l' and J E Z. We want to prove that if (I, J) is in the dependence cone (Eq.(4.5)), there must exist two positive integers, c1, c2 such that (I,J) = c1 *(0,-1) +c2 as (1, [32]). 315 % $82 31*15 J S32*I J S[32]*I Since I, J, and [32) all integers, there must exist a positive integer r, such that J = [32] a: I — r. Now, (I,J) can be decomposed as (I,J) = (I, [32] *I—r) =(I:[32l*1)_(0:r) = I a: (1, [32]) + r ... (0, —1). Thus, the theorem is proved by selecting c1 = r and c2 = I. Corollary 4.2 Any doubly nested loop can be uniformized to a uniform dependence 100;? with only two dependence vectors. 65 Proof: By using Theorems 4.1, 4.2 and 4.3, a dependence cone can be always computed for any doubly nested loop. Then, a BDV set which is composed of two vectors can be derived by using Theorem 4.5 or Theorem 4.6. D 4.4 Index Synchronization for Uniform Depen- dence Loops There are many ways to parallelize a uniform dependence loop. The performance is mostly dependent on the efficiency of the synchronization primitives which are used to preserve those uniform dependence constraints. For instance, in the wavefront method [45], a two-level uniform dependence loop is transferred to a sequential iterative DOall loop. As indicated in Chapter 3, that loop involves many barrier synchronizations. In this section, we present the index synchronization method to preserve the constraints due to‘ the basic dependence vectors derived in the last section. The new method which does not require any barrier synchronization and atomic operation is introduced first. Then, we will evaluate the performance in the next section. The principle of our method is very similar to the post69‘wait technique in which a processor has to wait for the signal posted by another processor to continue its execution. 
However, since there are only two basic dependence vectors in a uni- formized dependence pattern, the synchronization overhead can be further reduced. As shown in Theorem 4.5, the first basic dependence vector is (0,1), and the second basic dependence vector is (1, t) where the t is determined by the lower bound of the DSF. The basic idea of our approach is to group the iteration space along the direction of the first basic dependence vector (0,1). Then synchronizations between adjacent groups are made to obey the second basic dependence vector (1, t). More Specifically, the inner loop is grouped and executed serially, and the outer loop is 66 performed concurrently while synchronizations are applied. The synchronization be- tween adjacent groups may be implemented in a post8wait manner. Since every group only depends on the previous group by the direction (1, t), the post operation is not needed. The processor executing the j’th group can directly check the index variable of the (j —1)’th group, which represents the current working iteration in the (j -1)’th group, to decide how far it can proceed. In order to be valid, every index variable (except the first one) must be less than its previous index variable at least by 1 - t. The value 1 —t is defined as delay factor (b in our approach. Thus, if the delay factor 43 is positive, every processor (except the first processor) has to be behind the previous processor by at least at steps. On the other hand, if d) is negative it means that each processor (except the first processor) cannot be ahead of the previous processor more than —¢ steps. moo TATLLEATA. ’ l+t 1+: 1+1 1H 1000 4 999 3 998 2 997 l 996 i, - ' i ' i .L, 1 E l-t l-t 314 4~1-t (a) BDV={ (0,1). (1.1)} (b) BDV={ (0.-1).(1,t)} Figure 4.11. The Index Synchronization Method 67 The concept of index synchronization is also illustrated in Fig. 4.11(a). The dotted rectangle is a group being assigned to a processor. The first basic dependence vector is naturally preserved in the serial execution of a group while the second basic dependence vector is transferred to the index synchronization. The are from index variable ii to index variable i j+1 indicates that i,“ must be always less than i ,- by at least 1 ——t during the execution of this loop. Consider Example 4.1. The uniform dependence pattern, which has a BDV set {(0,1),(1, -2)}, is shown in Fig. 4.10. By applying index synchronization, the depen- dence relation can be simplified as in Fig. 4.11(a) with t = —2. Thus, every group, except the first group, must be 3 steps behind previous group. The parallel code of Example 2 can be implemented by a Doacross 100p including a Doserial loop as the following. Example 4.1: ( Parallel code ) doacross I - 1, 1000 shared integer J[1000] doserial J[I] 3 1, 1000 if( I > 1 ) then vhile( JII-l] < JIIJ+3 ) ; $1: A(I+J, 3*I+J+3) - ,... $2: - A(I+J+1, I+2*J+4) ..... enddo JII] = 1000 + 3 enddo If Theorem 4.6 is applied, a BDV set has the form {(0,-1), (1,t)} where t 6 Z. To preserve the basic dependence vector (0, —l), the execution direction of the inner loop is reversed. As shown in Fig. 4.11(b), processors execute the inner loop in a reverse order, from the top to the bottom. Similarly, the second basic dependence vector (1, t) is transferred to index synchronization. The delay factor 45 in this case is equal to 1+t. Thus, the index variable i j+1 has to be always larger than index variable 68 i,- by at least 1 + t. 
It should be emphasized that each index variable i,- is accessed by only two processors. Thus, the hot-spot condition will not occur. Also, since the second processor only performs read operation, no atomic operation is required in this approach. 4.5 Performance Evaluation 4.5.1 Theoretical Speedup To simplify the analysis, it is assumed that ((i) = 1 and u(i) = M in the pro- gram model in Fig. 4.2. Intuitively, the delay factor 4), is the major factor to the performance of loop uniformization. As mentioned, if a BDV set has the form {(0,1),(1,t)}, the delay factor (b is defined as 1 - t. Otherwise, if a BDV set has the form {(0, —1),(1,t)}, the delay factor d) = 1 + t. There are two possibilities for a given BDV set: 45 S 0 or (t > 0. In the first case, each group can continue its execution immediately as long as it is not ahead of its previous group --¢ steps. Thus, no delay is required. If the scheduling and synchronization overhead is not taken into account, the peak speedup is simply equal to the bound of the first level loop N when assuming the number of processors is infinite. On the other hand, if it > 0, by the index synchronization method, group j has to be behind group j — l at least 43 steps. The last group, the N ’th group, need to delay accumulatively at least (N - l) a: 43 steps to start its execution. The equation 0f peak speedup is derived as the following. Peak speedup= W 69 = NtM W S if when ()5 > 0 (4.6) Peak speedup = N when (t S 0 Note that the upper bound 1% may not be tight when (M -— 1) is comparable to N * (ii. The peak speedup is clearly determined by the ratio of M and at. A larger ¢ induces a smaller peak speedup. In the worst case 45 = M, which actually results in a serial execution, the peak speedup is equal to one. Therefore, when both Theorems 4.5 and 4.6 are applicable to a nested loop, the choice should be determined by the theoretical peak speedup. ' Example 4.3: do 181, 1000 do j81, 1000 A(2*J+3, 1+1) = .... . A(I+J+3, 2*I+1) .... and do and d0 Consider the example above. The general solution is (2x,y,a:, -:c + 2y) and the DSF is :Ef". By applying S-test, the range of dependence slopes is (-499.5,0.499). Since —499.5 < 00, both Theorem 4.5 and Theorem 4.6 can be applied to choose a BDV set. However, if Theorem 4.5 is used, the BDV set is {(0,1),(1, —500)} which results in a very low peak speedup 1%. On the other hand, if Theorem 4.6 is used, the BDV set is {(0, -1),(1,1)} and the delay factor at is only 2. A much higher peak speedup 333 then can be obtained. Thus, Theorem 4.6 is a better choice in this example. 70 4.5.2 Experimental Results To evaluate the performance of index synchronization under various delay factors, our experiments were conducted on a 76-node BBN GP—lOOO parallel processor. The CP- 1000 is a physically distributedvmemory architecture running the MACH operating system that offers a logically shared virtual space to the programmer [56]. Each processor node is composed of a processor, a local memory module, and a network interface controller. The 76 processor nodes are interconnected through a multistage interconnection network (MIN), a slight variant of the Omega network, that allows any processor to access any memory module. Any remote memory access is made by sending a request message via the network controller to the remote peer network controller which then accesses the target memory module and sends back a response message. 
Therefore, a remote memory reference is about four times slower than a local memory reference which directly accesses the local memory module bypassing the interconnection network. If the traffic of the MIN is heavy, a remote memory access might take much longer than usual. The operating system allows an application to have an exclusive control of a cluster of nodes without sharing with other applications. , Detailed performance study and the characteristics of the GP-1000 can be found in [42]. The experimental programs are coded in Butterfly Fortran [22] which is a parallel language, in some sense, similar to the PCF Fortran [21]. To avoid too much influence from network and memory contention, the remote reference ratio is set to 5%. As for scheduling, dynamic self-scheduling scheme [26] is applied to schedule the outer loop. Based on different granularity, three experiments are made in this research. In each experiment, different delay factors, from 46 = 2 to ¢ = 200, are applied to a doubly nested loop with bounds N = 1000 and M = 1000. The results of Experiment 4.1 is shown in Fig. 4.12 in which the iteration granularity is only 120 psecs. As mentioned 71 before, delay factor (15 is the major factor to the performance. In Fig. 4.12, a smaller gt clearly results in a higher speedup. According to the analysis in Section 4.5.1, none should achieve a speedup higher than the peak speedup, 5:39. Thus, when the number of processors is larger than 1%, the speedup should start going down due to more memory or network contention induced by unused processors. For instance, when 43 = 50 the peak speed is 20. It can be seen that the highest speedup, 14, is generated when the number of processors P is equal to 20. Beyond 20, the speedup start dropping. In fact, due to a high ratio of the synchronization overhead to granularity, the speedup drops even earlier. In this experiment, speedup drops when the number of processors is larger than 30 under all ¢ values. However, it should be emphasized that, without applying loop uniformization, non-uniform dependence loops are always executed in serial. 35 ..‘O. $52 r r T r T ..+.. o=10 30— ..D.. «15:20 a ..x.. 45:50 ..®.. ¢=100 . 25* ..A. ¢=200 2or- $5 “ Speedup 8- 3 15_ g gnno .. g'QWXXXHHX "3:, 10‘ W0 X "22-6“ 8280-00-0~0-o--~o----@...,g'mix 5”.‘g'A.A.A.A.A..A...A...A...A ..... A ..... A 24 0 . 1 1 L 4 _4 _1 l 0 10 20 30 40 50 60 70 No. of Processors : P Figure 4.12. Experiment 4.1: granularity is 120 microseconds 72 7o ..'<>.. ¢E2 ' r T ‘ r + ..+.. ¢=10 ..D.. ¢=20 a 60F ..X.. 43:50 ..Q. ¢=100 50* ..A. ¢=2oo " 2:13:$:2r$ Q 40-— 5 " ‘ Speedup .3330 ----- D Q" (:1 30‘- au,‘ D-i 20r- Qm d :-X ........................... '9 x x x x x x ----x ...... x 10¢ 289999909990- “‘2 .A.A.A.A...A...A...A..A.....A.....A ..... A 0 J 1 I J L 4 J No. of Processors : P Figure 4.13. Experiment 4.2: granularity is 560 microseconds In Experiment 4.2, the granularity is increased from 120 usec to 560 psec. The results are plotted in Fig. 4.13. Since the ratio of the synchronization overhead to granularity is decreased, the overall performance is much better than in Experiment 4.1. However, due to network and memory contention caused by scheduling, syn- chronizations, and remote references, none can achieve a peak speedup. For example, when (:5 = 20 the highest speedup is about 37 which is smaller than the peak speedup, 50. Again, because of too many processors involved, speedup drops when P > 50. The results of Experiment 4.3 is shown in Fig. 
4.14 in which the granularity is 1600 psec. The synchronization overhead is not significant in this experiment due to the high granularity. It can be seen that speedup almost linearly rises until the number of processors is larger than the peak speedup. For «1) = 2 or d) = 10, the 73 speedup never drops because the peak speedup is larger than 72. 7o .T<>.. ¢EZ ‘1 ..+.. 43:10 _ ..n.. ¢=20 . 4 60 ..X.. ¢=5o 43$ ..9.. ¢=100 0-: 50- ..A. ¢=2oo ..;:-+" ‘ 40L 3 Speedup Hag... .....D ...... D ...... CJE' 0L ,9 3 _av' ‘ 20— ,9“ «QXXXXX ...... X ...... x ...... X 10L 2806099006 1 ‘3: .A.A.A.A....A...A...A...A.....A.....A ..A 0 l J | J J l l o 10 2o 30 40 50 60 70 No. of Processors : P Figure 4.14. Experiment 4.3: granularity is 1600 microseconds CHAPTERE» SCHEDULDfl30FIKMflL LOOPS As mentioned, DOall loop is the most important parallel construct in parallel pro- gramming. A critical issue in loop parallelization is how workload of a DOall loop can be equally or fairly distributed among processors. This is the so called loop schedul- ing problem. Consider the case in which a DOall loop of k independent iterations is executed in a multiprocessor system with 71 processors. The goal of scheduling is to distribute these I: iterations to n processors in the most equitable manner and with the least amount of overhead. This is not an easy task because the amount of work of each iteration may be different, the value of k may not be known during the compile time, the number of processors n is machine dependent, and the amount of overhead is also machine dependent. In this chapter, we first review various scheduling schemes of DOall loops by discussing their strength and weakness. Then, a loop scheduling model which will be used in the next chapter is given to express and analyze scheduling schemes. 74 75 5.1 Review of Loop Scheduling Schemes A DOall loop is uniformly distributed if the execution time of all iterations are very close. In a shared memory multiprocessor system, iterations of a DOall loop can be scheduled among processors either statically at compiler time or dynamically at run time. Static scheduling is usually applied to uniformly distributed loops, in which iterations of a DOall loop are pre-assigned to processors at compiler-time. How- ever, the actual execution time of each iteration still may vary due to network con- tention, memory contention, mutually exclusive access, and complicated control flow. Thus, workload imbalancing is its major disadvantage, especially when loops are non- uniformly distributed. Also, static scheduling is not applicable if the bound of loops and the number of processors are unknown at compile time. Dynamic scheduling allocates iterations to processors dynamically at run time. Thus, whenever a processor becomes available, one or more iterations will be assigned to that processor. Clearly, some scheduling overhead will be involved at run time, while a better workload balancing is achieved. The overhead to start a new task (or process) is usually expensive. Thus, self-scheduling in which no system calls are involved is widely employed in multiprocessor systems, such as HEP [57], Alliant FX/ 8 [4], Concurrent Computer [27], Cray X-MP, and BBN GP-lOOO [22]. In these systems, each processor accesses a set of shared variables by itself to retrieve the indices of iterations instead of invoking system calls provided by the operating system. However, the number of iterations a processor should fetch becomes an important issue. 
Getting a large number of iterations may cause load imbalancing, while getting a small number of iterations will induce too much scheduling overhead. Various self-scheduling schemes which put the scheduling responsibility on the compiler rather than on the operating system were proposed in the past [26], [20], [24], [25], [27]. The pure self-scheduling (SS) approach is to dynamically allocate one 76 iteration of a DOall loop to any idle processor at run time. The main advantage of this approach is that it can easily achieve load balancing, which is especially useful for those nonuniformly distributed loops. However, there are two drawbacks. First, the amount of scheduling overhead is proportional to the bound of the DOall 100p which might be tremendous. Second, when the granularity of each iteration is small, the high frequency of mutually exclusive accesses to shared variables, such as the loop index variable, will cause serious memory contention which will significantly reduce the overall performance. Detailed study of tradeoff between load balancing and scheduling overhead can be found in [58]. To reduce the frequency of mutually exclusive accesses to those shared variables, an improved self-scheduling scheme, chunk self-scheduling (CSS), which allocates a group of iterations to an idle processor, was deveIOped in [22]. In the CSS(lc), the number of iterations, Is, in a group, called chunk size, is fixed and has to be specified by programmers. When the chunk size is one, this scheme is the pure self-scheduling as discussed above. If the chunk size is selected as the bound of the DOall loop equally divided by the number of processors, this scheme becomes static scheduling in which the system will easily cause load imbalancing in the case of nonuniformly distributed loops. The efficiency of the CSS(k) scheme is based on the proper chunk size which is determined by the number of processors, the bound of the DOall loop, and the granularity of iterations. A large chunk size will cause load imbalancing, while a small chunk is likely to produce too much scheduling overhead and memory and network contention. Due to the unpredictable run-time behavior of parallel programs, it is difficult for programmers or compilers to properly select an appropriate chunk size. Another better method, namely guided self-scheduling (GSS) [25], can dynamically change the number of iterations assigned to each processor. More specifically, the next chunk size is determined by dividing the number of remaining iterations of a DOall 77 loop by the number of processors. The property of decreasing chunk size implies its efforts to achieve load balancing and to reduce scheduling overhead. By allocating large chunks at the beginning of a DOall loop, one can reduce the frequency of mutually exclusive accesses to those shared loop indices. The small chunks at the end of a loop serve to balance the workload across all processors. This approach obtains an optimal workload balancing under any initial configuration for uniformly distributed parallel loops. To avoid having many small chunks at the end of a loop, GSS(k), in which the optimal load balancing is sacrificed, is proposed to bound the chunk size from below by k, where k has to be specified by either programmers or compilers. However, the GSS approach is not perfect in some conditions. 
First, if the granu- larity of iterations is not uniformly distributed, the GSS may not balance the workload due to the large initial chunk size which is equal to the bound of the loop divided by the number of processors. An alternate approach in the GSS is to make P, the number of processors, artificially large to decrease the chunk size. However, the in- creasing of the number of scheduled units and the scheduling overhead is a serious side effect of this alternate approach. Second, even though the GSS is a very simple method, the complexity of the GSS algorithm, especially the GSS(k), is still more complicated than the SS or CSS approach. Without hardware support, the imple- mentation of mutually exclusive accesses of the CSS algorithm will be a bottleneck in a large-scale multiprocessor system when the granularity of DOall loops is small. Finally, the best minimum bound on the value I: in the extended (3830:) approach is machine and application dependent. Just like selecting the fixed chunk size in the 083(k), it is not easy for programmers or compilers to select a proper bound value k. Our experimental results for the GSS(k) scheme in the next chapter. will show that a large lc value causes more workload imbalancing, while a small I: value induces too much scheduling overhead. The best choice of I: value is mostly determined by the 78 workload distribution. 5.2 A Model for Loop Scheduling In this section a simple loop scheduling model is presented which allows us to clearly express and analyze different self-scheduling schemes. The performance of a self-scheduling scheme is dependent on two factors: one. is synchronization primitives provided by the system, and the other is the execution time of loop iterations, which is program dependent and is also affected by system run-time status. Thus, to design an efficient self-scheduling scheme, we have to consider the effect of these two factors. The machine considered in this research is a shared memory multiprocessor system consisting of P processors. Each processor can directly access any location in the global shared memory. Either hardware or software atomic primitives are provided by the system to allow the implementation of mutual exclusion or critical sections which are essential to any self-scheduling schemes. As to application programs, arbitrary nested parallel and serial loops are allowed, and the bound of loops may not be known at compile time. Due to the unpredictable system behavior, the execution time of each iteration is non-deterministic even at run time. A chore, defined in [21], is a unit of work to be assigned to a processor. In scheduling a DOall loop, a chore is usually a group of iterations. For example, in the CSS(k), a chore is composed of k iterations. For each chore, the indices of the first and last iterations are called chore starting point (a) and chore ending point ([9), respectively. In this research, the last chore to complete its execution is called the critical chore. The processor executing the critical chore is called the critical processor. For each processor, we distinguish three states in handling a DOall loop: scheduling state, computing state, and waiting state. The state diagram is shown in 79 Fig. 5.1. A Scheduling 1 Y Computing /‘ Waiting Done Figure 5.1. State diagram of a processor in loop scheduling When encountering a DOall loop, the first thing a processor has to do is to acquire a chore. In this scheduling state, the processor tries to acquire a chore, (0,13), from the remaining iterations. 
The time a processor stays in the scheduling state is con- sidered as scheduling overhead. In general, the scheduling overhead is determined by the implementation of mutually exclusive access to those shared variables, the num- ber of processors, and the average execution time of each chore. A simple scheduling algorithm, such as the SS, might be easily implemented by a single atomic fotchtadd instruction; while a complicated one, such as the GSS, is usually implemented in a critical section. As indicated in [58], a significant amount of the scheduling overhead may be caused by waiting for the availability of a critical section. A recent measure- ment of some shared-memory multiprocessors also indicated that critical sections are 80 a major source of performance degradation, especially when the number of proces- sors is large [42]. The total number of times that processors enter into the scheduling state, which is actually the number of chores plus the number of processors, is another important factor contributing to the total scheduling overhead. Whenever a processor acquires a chore, it enters the computing state and starts execution. After finishing the computation, it goes back to the scheduling state. The total time all processors stay in the computing state is very likely greater than the total load of the DOall loop due to network and memory contention. Thus, a scheduling algorithm which generates much network and memory traffic will affect the time processors stay in the computing state. A processor repeats scheduling and computing until the loop is exhausted. Then, the processor goes to the waiting state and waits for the completion of the remaining executing chores. Every processor, except the critical processor, has to stay in the waiting state due to this barrier synchronization. In general, the larger and later chores have a higher probability of becoming the critical chore. The actual critical chore is determined not only by the scheduling scheme but also by the execution time distribution of iterations. For example, the first chore in the GSS scheme may possibly become the critical chore if the execution time of the first few iterations is much larger than the rest. To minimize the waiting time in this state, i.e.,to achieve workload balancing, it should not be assumed that loops are always constant or uniformly distributed. The consideration of nonuniformly distributed loops is important in designing a practical scheduling scheme. The symbols used in this model are defined as follows: 0 P: the number of available processors in the system. 0 I : the total number of iterations in a DOall loop, which is application program dependent. 81 o L(i): the execution time of the i-th iteration. Thus, the total load of a loop is L = 23’ L(i). i=1 0 N: the total number of chores spawned from a DOall loop, which is determined by the scheduling scheme and the value I. o C(t): the chunk function which represents the number of iterations assigned to the t-th chore (1 S t S N). This function which might be either linear or nonlinear expresses the partition and dispatching rule of a scheduling scheme. 0 T(t): the load of the t-th chore, which is equal to 2:30)“)110'). Obviously, i=0 Xi, T(t) should be equal to the total load, L, of this DOall loop. 0 X: the total time processors stay in the computing state. Recall that due to network and memory contention, X is usually greater than or equal to the total load L. o 0: the total scheduling overhead of all chores. 
O W: the total waiting time of all processors due to barrier synchronization and load imbalancing. In this model, P, I , and L(i) are given by the system and programs. C(t) rep- resents the scheduling method which determines N and T(t). X, 0, and W are the final metrics that we are concerned with. An example of scheduling a uniformly dis- tributed loop for SS, 058(k) and GSS schemes are depicted in Fig. 5.2(a), Fig. 5.2(b), and Fig. 5.2(c), respectively. The area under the curve C (t) in each scheme is equal to the total number of iterations, I. 82 C (i) (a) SS 1 N - I Chore index: t C(t) (b) 0580:) k N = rs Chore index: t 00) (C) (3850) Chore index: t Iteration index: i L(i) T(l) T(2) T(N Iteration index: i Iteration index: i Figure 5.2. Chunk size function of a uniformly distributed loop for SS, CSSUc), and GSS schemes CHAPTER 6 TRAPEZOID SELF-SCHEDULING (TSS) As mentioned in the previous chapter, in order to design an efficient scheduling scheme which induces small scheduling overhead and achieves workload balancing, our goal is to minimize both 0 and W. To minimize 0, we are concerned with the number of chores N (number of mutually exclusive accesses) and the simplicity of the scheduling scheme (complexity of each mutually exclusive access). Thus, we would like to design a scheme like CSS(k) which is simple and only produces a small N if k is large enough. To minimize W, we want to reduce the waiting time for the completion of the critical chore. Thus, the GSS scheme seems to be a good choice under any initial processor configuration. In this chapter we propose the Trapezoid Self-Scheduling (TSS) scheme which includes the advantage, but excludes the disadvantage, of both the 0330:) and GSS schemes. The TSS method, which starts the execution of a loop by assigning a large number of iterations and linearly decreases the number of iterations until all iterations are exhausted, can obtain the best tradeoff between small overhead and load balancing for both uniformly or nonuniformly distributed parallel loops. The TSS method also substantially reduces the number of mutually exclusive accesses to 83 84 loop indices, which significantly restricts the speedup on large-scale shared memory machines. The rest of chapter is organized 'as follows. The generic concept of this scheme is discussed first in Section 6.1. A conservative yet powerful approach is presented in Section 6.2. Then, we shall address mutual exclusion implementation and static memory management issues. Finally, experimental results measured from running different self-scheduling schemes in a 96-node GP‘IOOO Butterfly machine will be discussed in Section 6.5. 6.1 Generic Approach: TSS(f,£) In addition to the simplicity of scheduling algorithm, workload distribution among iterations of a DOall loop is the major factor to the success of a self-scheduling scheme. In this research, four typical uniformly and nonuniformly distributed loops shown in Fig. 6.1 will be discussed and experimented. The basic idea of the T88 method is to extract the advantages of both CSS and GSS schemes. As indicated in Fig. 5.2(b) and (c), the CSS(k) scheme applies a linear, but fixed, chunk function while the GSS uses a decreasing, but nonlinear, chunk function. The linearity can make the chunk function simple enough to induce a small scheduling overhead while the decreasing chunk size can balance the workload. To combine both properties, intuitively, we design a linearly decreasing chunk function for the T83 scheme. 
Thus, instead of using a nonlinear chunk function, the T33 scheme linearly decreases the chunk size at run time. The linearity, which makes the T85 simple enough to be implemented with a few atomic instructions, is a major difference between the T38 and GSS schemes. Figure 6.2 clearly shows the key principle of the TSS( f ,l) method in which the area below the linear curve C (t) is a trapezoid. The size of the first chore and the 85 Parallel Do i = I to I Parallel Do i = I to I If( Expl ) then (*“ Loop body **) (‘"‘ Block 1 **) se (** Block 2 **) . .1 End Parallel . 1} End If L(z) * L(z) End Parallel I 14" ii”. (a) Constant workload (b) Random workload Parallel Doi = I to] ParallelDoi = l to] SerialDoj=Itoi SerialDoj=ltoI—i (“ Loop body **) (“ Loop body ") End Serial End Serial L(i)“ End pmnel my End Parallel ] 'i l ‘ z (c) Increasing workload (d) Decreasing workload Figure 6.1. Four typical uniformly and nonuniformly distributed loops 86 C t L ' < i (a) Tm ”HM/4') MA («2) “l ‘1 W TSS(f,!) 1‘0) "H21 3 N I Chore index: i Iteration index: i Figure 6.2. Chunk size function of a uniformly distributed loop for TSS scheme last chore, f and e, can be flexibly selected by compiler designers, programmers, or even intelligent compilers. The number of chores, N, is determined by f, l, and I . According to the formula of trapezoid area, the value N of TSS( f, l) is 2] N=[f+£. (6.1) After N has been computed, the decreasing step (6) can be obtained by the following formula. 6 = f — e TV: (6.2) For example, if I = 1000 and (f, t) = (88,12), then we have N = 20 and 6 = 4. The distribution of the chunk function C(t) is: C(l) = 88, 0(2) = C (1) — 4 = 84, 0(3) = 0(2) — 4 = 80, - - o, C(18) = 20, C(19) = 16, and C(20) = 12. The complete algorithm describing each processor in the scheduling state is listed below. 1. If this processor is the first processor that encounters the parallel loop, then perform the following initialization steps. (a) select a (f, [) pair. according to the policy decided by the compiler or spec- ified by the programmer. 87 (b) compute N and 6 according to Eq. (6.1) and Eq. (6.2). (c) set count:-O and csizez-f. 2. BEGIN MUTUAL EXCLUSION (a) 0:8 count-*1 (b) fl:-|_count+csizej (c) csize :8 csize-6 3. END MUTUAL EXCLUSION 4. If a > I, then enter the waiting state. 5. Iffl> I, then 3 := I. 6. enter the computing state. The flexibility of selecting (f, [) without requiring any extra statement in step 1 is an advantage of the T58 over the GSS. In fact, we can treat the SS and CSSUc) schemes as special cases of the T88. For example, the SS is equivalent to the TSS( 1,1), and the CSS(k) is equivalent to the TSS(lc,k). Although the GSS is not a linear scheduling scheme, it can be roughly simulated by the TSS(-,’;,l). Now, a similar question asked for in the 085(k) and GSS(lc) may be asked immediately: how to select an optimal f and l? Apparently, an optimal (f, [) pair is determined by P, I, L(i), and system run-time status. Before actually running that loop, it may not be possible to compute the optimal (f, l) pair. However, in practice, an optimal solution is not really necessary. A conservative yet efficient (f, l) pair is proposed in the next section. Our experimental results in Section 6.5 will show that this consecutive approach will result in a near optimal solution for most of the uniformly and nonuniformly distributed loops. 88 6.2 A Conservative Approach: TSS(fi, 1) An optimal choice of (f, l) is dependent on the execution time distribution of the parallel loop. 
It has been shown in Eq. (6.1) that to have a small N in the T58 the (f, l) pair should be chosen as large as possible. However, the load imbalancing will be a serious side effect especially when the workload is not uniformly distributed. Nonuniformly distributed loops, such as in Fig.i6.l(c) and Fig. 6.1(d), are often generated in real application programs. Ignoring the other extreme cases, Fig. 6.1(d) is almost the worst case for the scheduling schemes with a decreasing chunk function, because too much workload will be assigned to the first few chores. For example, if the GSS(k) scheme is used, the T(l) for the loop in Fig. 6.1(d) is equal to (when I: = l) or larger than (when k > 1) 3%- which is very likely to become the critical chore of this lOop. Similarly, Fig. 6.1(c) is the worst case of GSS(k) scheme with a large value 1:, because the last chore will be assigned too much workload and become a terrible critical chore of this loop. Figure 6.3. T(l) of the TSS(§%,1) under the loop in Fig. 3(d) 89 In order to ensure every T(i) is less than or equal to -f; ,instead of ,1, in the G38, the first chunk size f is set to —- in the conservative TSS method. In this case, T(l), the shade area in Fig. 6. 3 18 derived ln Eq. (6. 3). It can be seen that T(l) ls always less thanL 3, thus the workload of the rest iterations 13 still adequate to keep other processors busy even under the decreasing workload as shown m Fig. 6.1(d). T(I) = (I+ (I— é» an: (#) *% = (4134-1912) * L (6.3) After the f is fixed by #, due _to the property of the linearly decreasing chunk size in the TSS, the range of the last chunk size, I, is from 1 to W. Substituting (fl) = 2—p,1) and (f,£’) = (27,27) into Eq. (6.1), the range of the number of chores N, is from 4P to 2P. Thus even the l is set to the most conservative value 1, each processor will have at most two more mutual exclusion accesses than the other values. If the mutual exclusive access can be implemented in an atomic operation (will be described in the next section), the extra scheduling overhead is insignificant. Also, considering the increasing workload in Fig. 6.1(c), the last chunk size I is set to 1 to minimize the load imbalancing. This implies that each processor in average Will execute 4 chores. The comparison of N among those well known self-scheduling schemes is shown in Table 6.1. Table 6.1. The number of chores generated by various self-scheduling schemes Scheme 33 GSS(k) GSS(l)‘ TSS(#,1) N . I rt] Plnrtl 4P (worst case) 90 6.3 Mutual Exclusion Implementation As mentioned before, scheduling overhead is dependent on both the number of chores N and the implementation of chore dispatching. As for the first factor, the number of chores dispatched by the TSS(;%, I) from a parallel loop is 4P which can be further reduced if an optimal (f, l) pair is selected. As for the implementation of chore dispatching, the algorithm in Section 6.2 can be easily implemented in a critical section. However, as indicated in [58] and [42], the system performance will degrade significantly if the critical section takes much time and the number of processors is large. In this section, we present an efficient chore dispatching mechanism which replaces a critical section by a single atomic instruction. Because the chunk function C (t) of the TSS is linear, the chunk size of each chore is predictable. A simple way to compute the start iteration of the i-th chore, a(i), is to sum the first i — l chunk sizes and then add one. 
Similarly, B(i) is equal to the summation of the first i chunk sizes. A generic formula of calculating (a(i), fi(i)) for the TSS( f, l) is derived below. C (1) = f (64). 0(2) = f — (i — 1)6 (6.5) as) = 1'): 00)] +1 (6.6) .=1 By using the formula of trapezoid area, we have a“) = [(0(1)+C(i-2- 1))*(i— 1)J + 1. (6.7) 91 Substituting Eq. (6.4) and Eq. (6.5) to Eq. (6.7), we obtain a“): l(l—1)...(2f2_(e.—2)....5)J +1. (6.8) Similarly, the formula for fl(i) can be derived as as) = mince” = Li*(2f-(2i_1)*6)l- (6.9) In these equations, f and 6 are selected or computed by the first processor encoun- tering the parallel loop. Hence, the index of chores, i, is the only shared variable that processors have to access atomically. If a multiprocessor system supports atomic in- structions, such as fetchtaddO, the implementation of the TSS can be very efficient as shown below. 1. If this processor is the first processor that encounters the parallel loop, then perform the following initialization steps. (a) select a (f, 6) pair according to the policy decided by the compiler or spec- ified by the programmer. (b) compute N and 6 according to Eq. (6.1) and Eq. (6.2). (c) set t = 1. 2. i = fetchladd(t, l). 3. If i > N, then enter the waiting state. . 4. Compute (a(i), fi(i)) based on Eq. (6.8) and Eq. (6.9). 5. Enter the computing state. It should be emphasized that the execution overhead of Eq. (6.8) and Eq. (6.9) is very small. Accessing the shared variable is reduced to a single fetchtadd operation 92 which is usually insignificant. For example, according to our experimental results, the average time required by a fetchaadd Operation on a 96-node BBN GP-IOOO machine is only 30 psecs. Since some operations can be overlapped, the atomic phase of this instruction is about 4.5 psecs. Thus the average scheduling overhead for each processor in the TSS(§%, 1) scheme is about 120 psecs which is a small constant. On the other hand, without hardware support, the GSS(k) scheme needs a lock()/unlock() mechanism. The scheduling overhead then is determined by not only the time processors stay in the critical section, but also the time processors wait for entering the critical section. According to our recent measurement [42], the fre- quence of the critical section accesses is the most important factor to the performance degradation. If a small I: is chosen in the GSS(k) method, the frequency of mutu- ally exclusive accesses will become higher and higher during the execution of loops especially when the granularity of iterations is small. The waiting time for fetching a chore is possibly even larger than the execution time of that chore. If a larger 1: is chosen, the frequency of mutually exclusive accesses can be reduced. As mentioned, the load imbalancing is a serious side effect. The impact of critical sections to the system performance was studied in [58]. However, detailed theoretical analysis is still an open issue as no closed-form solution is available for a general and practical model. Instead, an extensive number of experiments will be conducted and demonstrated in Section 6.5. 6.4 Memory Management In this section, another interesting and important implementation issue, memory management, in the run-time model of a multiprocessor system is discussed. Due to concurrency, the traditional stack (or static allocation) is no longer able to manage memory for parallel programs. 
A tree-like structure called a cactus stack, which handles both local and global variables by directly mirroring the chore tree, is applied to deal with both parallelism and procedure calls. A simple way to explain the cactus stack is that when a chore with a given frame splits into several chores, a child frame is created for each child chore. The operation of frame creation will certainly induce a large run-time overhead.

Another improved approach, which treats chore locality differently from procedure locality, is proposed in [27]. Instead of dynamically creating chore frames at chore splitting, all chore frames are allocated at the procedure call by using the well-known scalar promotion technique. Thus, a promoted array is created at the beginning of the procedure which contains a parallel loop. The length of the array is the same as the number of chores to be split from that loop, and each element of the array corresponds to a chore frame. A strip mining technique, which is the same idea used in the GSS(k), has to be employed to bound the size of the promoted array. However, this approach has a major drawback: workload imbalancing.

Note that the number of chores N in the TSS is small and predictable. The approach discussed above can therefore be improved by replacing the strip mining with the trapezoid mining of the TSS scheme. This approach is explained by the simple program, which exchanges array A and array B, listed in Fig. 6.4. The first part is the source program, while the second part is the program translated using the TSS(I/(2P), 1) approach. In addition to m, the index of the DO loop has to be a local variable. We have to translate both m and i into two scalar-promoted arrays, m(j) and i(j). SP(j) and EP(j) are two functions that compute α(j) and β(j) using Eq. (6.8) and Eq. (6.9), respectively. Now the parallel do block can be easily implemented by a single fetch&add() instruction. This approach further reduces the run-time overhead of memory allocation and still keeps a balanced workload.

/** Source program **/
Procedure example()
  int I, A(1000), B(1000)
  int i
  I = 1000
  doall i = 1, I
    local int m
    m = A(i)
    A(i) = B(i)
    B(i) = m
  end
end;

/** Translated program **/
Procedure example()
  int j, I, A(1000), B(1000)
  int i(4P), m(4P)          /** Scalar promotion arrays **/
  I = 1000
  doall j = 1, min(4P, I)
    do i(j) = SP(j), EP(j)
      m(j) = A(i(j))
      A(i(j)) = B(i(j))
      B(i(j)) = m(j)
    end
  end
end;
/** SP(j): Starting Point of the j'th chore (Eq. (6.8)) **/
/** EP(j): Ending Point of the j'th chore (Eq. (6.9)) **/

Figure 6.4. Static memory management

6.5 Experimental Results

To evaluate the performance of various self-scheduling schemes, our experiments were conducted on a 96-node BBN GP-1000 parallel processor. A brief review of the BBN GP-1000 was given in Section 4.5.2; detailed information can be found in [56] and [42].

6.5.1 Workload and Performance Metrics

The experimental benchmarking programs are coded in Butterfly Fortran [22], a parallel language similar, in some sense, to the PCF Fortran then being standardized [59]. To make an objective comparison, the loop language construct PARALLEL REGION is used to represent parallel loops, which provides the same kind of barrier for synchronizing processors in the waiting state. To take advantage of atomic instructions, chore dispatching in the SS, CSS, and TSS schemes is coded using the atomadd32() instruction in the scheduling state. Based on [25], the GSS is implemented in a critical section using the lock() and unlock() mechanisms.
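To make the contrast between the two dispatching styles concrete, the sketch below shows what fetch&add-based TSS dispatching looks like when written today. C11 atomics and POSIX threads stand in for the GP-1000's atomadd32() and Butterfly Fortran's parallel constructs, the loop body is an empty placeholder, and the N and δ formulas are assumed as before; it illustrates the five-step algorithm of Section 6.3 rather than the code actually used in these experiments.

    /* A sketch of TSS chore dispatching around a single atomic fetch-and-add.
     * C11 atomics/pthreads are stand-ins for atomadd32() and Butterfly Fortran;
     * N and delta are assumed to follow Eq. (6.1) and Eq. (6.2). */
    #include <math.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define I 100000L                /* loop bound           */
    #define P 8                      /* number of processors */

    static atomic_long t = 1;        /* shared chore index (step 1c)   */
    static double f, delta;          /* first chunk size and decrement */
    static long   N;                 /* number of chores               */

    static long alpha(long i)        /* Eq. (6.8) */
    { return (long)(((i - 1) * (2.0 * f - (i - 2) * delta)) / 2.0) + 1; }

    static long beta(long i)         /* Eq. (6.9), clamped to the loop bound */
    { long b = (long)((i * (2.0 * f - (i - 1) * delta)) / 2.0); return b > I ? I : b; }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            long i = atomic_fetch_add(&t, 1);    /* step 2: the only shared access */
            if (i > N) break;                    /* step 3: no chores left         */
            long lo = alpha(i), hi = beta(i);    /* step 4: chore bounds           */
            for (long k = lo; k <= hi; k++)      /* step 5: execute the chore      */
                ;                                /* placeholder for the loop body  */
        }
        return NULL;
    }

    int main(void)
    {
        /* steps 1a/1b: conservative TSS(I/(2P), 1) */
        double l = 1.0;
        f = (double)I / (2.0 * P);
        N = (long)ceil(2.0 * I / (f + l));
        delta = (f - l) / (N - 1);

        pthread_t th[P];
        for (int p = 0; p < P; p++) pthread_create(&th[p], NULL, worker, NULL);
        for (int p = 0; p < P; p++) pthread_join(th[p], NULL);
        printf("dispatched %ld chores with one fetch&add each\n", N);
        return 0;
    }

With l = 1 each processor fetches about four chores, so it pays for only about four atomic operations regardless of the loop bound I, whereas a lock-based dispatcher pays a critical-section entry per chore.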
In addition to the selection of different self-scheduling schemes, the benchmark programs are characterized by a set of parameters: (P, I, L(i)). Four kinds of workload, as shown in Fig. 6.1, are tested. The program executes the parallel loop and keeps track of where the time is spent. As described earlier, there are three possibilities: 1) computing, 2) scheduling, and 3) waiting for synchronization. Therefore, besides the traditional speedup measurement Γ, two other interesting performance metrics, the degree of scheduling overhead Θ and the degree of load imbalancing Λ, are defined below.

    Speedup:                         Γ = (L * P) / (X + O + W)      (6.10)
    Degree of Scheduling Overhead:   Θ = (O * P) / (X + O + W)      (6.11)
    Degree of Load Imbalancing:      Λ = (W * P) / (X + O + W)      (6.12)

Note that Θ and Λ represent the average number of processors wasted in the scheduling and waiting states, respectively. In the ideal case, X = L and O = W = 0, and the speedup Γ is the same as the number of processors P. In practice, L ≤ X. Thus, Γ + Θ + Λ is always less than or equal to P. The difference between P and Γ + Θ + Λ is due to network and memory contention in the computing state, which is mostly dependent on the remote memory reference ratio in the computation of a chore. To avoid too much influence from it, the remote reference ratio is set to 5% so that the sum of Γ, Θ, and Λ can be very close to P in our experiments. To measure the effect of remote memory accesses on the scheduling schemes, one experiment based on different remote reference ratios, from 0% to 50%, is made as well.

For the CSS(k) scheme, the optimal choice of the chunk size k is machine and application dependent. When the input parameters (P, I, L(i)) are (72, 100000, 110 μsec), a uniformly distributed loop, our experiment shows that with k = I/P = 1389 we can achieve a speedup of 69.2, which is very close to the ideal speedup of 72. Therefore, our following experiments take k = I/P for the CSS(k) scheme, which produces minimal scheduling overhead and keeps a balanced workload under uniformly distributed loops. However, it is believed that there must be a better k value for the CSS(k) scheme when the workload is not uniformly distributed or when processors start executing at different times.

For the GSS(k) scheme, the best choice of the k value is mostly determined by the workload distribution and the number of processors. It is very difficult to predict the best value of k. In our experiments, we test different k values from 1, 2, 5, 10, and 20 up to ⌈I/P⌉. Then, the best results of GSS(k) and GSS(1) are plotted and discussed.

6.5.2 Run-Time Behavior

The first part of our experiments is performed in a 16-node cluster in order to analyze and compare the run-time behavior of the CSS(I/P), GSS(1), and TSS(I/(2P), 1) schemes. The event log library provided by the GP-1000 system is applied to record the events during the execution of loops. Then gist, an event-log analysis tool, is employed to profile the whole scheduling operation.

The first experiment analyzes a uniformly distributed loop, shown in Fig. 6.1(a), with a small granularity of 110 μsec. The resultant event profiles are shown in Fig. 6.5, Fig. 6.6, and Fig. 6.7 for the TSS, CSS, and GSS schemes, respectively. In these figures, the Y-axis is the logical processor number and the X-axis represents the execution time. The small squares represent events.
Corresponding to the three states discussed in Chapter 5, there are three kinds of events: scheduling events, computing events, and waiting events. A fourth event indicates the completion of the critical chore. Each large rectangular box explains the event to which it is connected. In Fig. 6.5, since the scheduling overhead of the TSS is very small, all computing events (squares labeled 2) occur almost immediately after the scheduling events (squares labeled 1). Thus, the squares labeled 1 sit almost right behind the squares labeled 2 in this figure, and are therefore invisible. The time period from any square box 2 to the next square box 1 is the execution time of a chore. As indicated in the figure, in the first round all processors are scheduled at about the same time and fetch chores with a linearly decreasing chunk size. The chunk sizes can be read in the large rectangles. Since the earlier processors are assigned larger chores, all processors almost synchronize at the end of the second round. The same behavior repeats in the third and fourth rounds. Thus, all processors finish their last chore at about the same time, which implies a balanced workload; the total waiting time is very small.

Figure 6.5. Event log profile of a uniformly distributed loop using the TSS.

Figure 6.6. Event log profile of a uniformly distributed loop using the CSS.

Figure 6.7. Event log profile of a uniformly distributed loop using the GSS.

Figure 6.8. Event log profile of a decreasing workload using the TSS.

Figure 6.9. Event log profile of a decreasing workload using the CSS.

Figure 6.10. Event log profile of a decreasing workload using the GSS.
Figure 6.11. Event log profile of a uniformly distributed loop using the TSS, where processors start at different times.

Figure 6.12. Event log profile of a uniformly distributed loop using the CSS, where processors start at different times.

Figure 6.13. Event log profile of a uniformly distributed loop using the GSS, where processors start at different times.

Since the workload is constant and very small, the CSS scheme works almost perfectly in this case. It can be seen in Fig. 6.6 that the CSS scheme produces the fewest events, only one round for each processor. The total execution time using the CSS or TSS is about 706 msec. The GSS scheme takes a little longer, about 20 msec more, than the TSS and CSS schemes. It achieves a perfect workload balance, but the mutually exclusive access is possibly the bottleneck in the case of a small-granularity workload. Figure 6.7 shows the run-time behavior of the last 12 msec of the execution of the loop. On some processors, such as processors (b) and (6), the scheduling overhead (from an event 1 to the next event 2) is even much larger than the computing time (from an event 2 to the next event 1).

A loop with a decreasing workload, shown in Fig. 6.1(d), is analyzed in the next experiment. The event profiles of the TSS, CSS, and GSS schemes are depicted in Fig. 6.8, Fig. 6.9, and Fig. 6.10, respectively. Since the loop is not uniformly distributed, processors do not synchronize their fetches every two rounds in the TSS scheme. The first chore, which is assigned to processor 0, is not so large as to become a terrible critical chore.
As in the previous experiment, all processors complete their last chore at almost the same time in the TSS scheme.

In Fig. 6.9, every processor, except processor 0 which acquires the last 180 iterations, fetches a chore with the same chunk size of 188 according to the CSS method. However, processor 3, which fetches the first 188 iterations, becomes a terrible critical processor due to the heavy workload at the beginning of the loop. The shaded periods from event 3 to event 4 on the other processors represent the time those processors wasted in the waiting state. The same problem caused by the GSS is shown in Fig. 6.10. Processor 0 fetches the first large chunk of 188 iterations and becomes a critical processor. Both the CSS and GSS schemes take about twice as long as the TSS scheme in this case.

In the last experiment on run-time behavior, processors start executing a uniformly distributed loop at different times. The event profiles of the TSS, CSS, and GSS schemes are exhibited in Fig. 6.11, Fig. 6.12, and Fig. 6.13, respectively. The TSS works as well as the GSS, in which a good load balance is achieved in any initial configuration, while the CSS causes serious load imbalancing. Clearly, processor f, which is the last processor to start executing the loop, is the critical processor in Fig. 6.12 under the CSS scheme. In this case, the CSS takes about 230 msec longer than the TSS and GSS.

6.5.3 Performance Measurement

The second part of our experiment is conducted to measure and compare the performance (which includes Γ, Θ, and Λ) of the SS, CSS(I/P), GSS(k), and conservative TSS(I/(2P), 1) approaches under different kinds of DOall loops, where the maximum number of processors, P, is 80.

The plots in Fig. 6.14 are the results of Experiment 6.1, where L(i) is a small constant, about 110 μsec, for all i, and I is 100000. The TSS and CSS perform almost identically well in this experiment. From the plots of Θ and Λ, the CSS has a slightly higher Λ (load imbalance) than the TSS, while the TSS has a slightly higher Θ (scheduling overhead). Both the SS and GSS(1) produce too much scheduling overhead in this experiment. The scheduling overhead, Θ, of the SS is the worst, up to 65 when P = 80. This is due to the hot spot condition [41] that occurs on the shared index of the SS algorithm. As discussed above, the low and long tail in the chunk function C(t) of the GSS(1) is the major source of Θ. In the experiments with GSS(k), as the value of k is increased the scheduling overhead decreases while the degree of workload imbalancing increases. The best value of k in the GSS(k) scheme is 80 under this workload. Beyond 80, the workload imbalancing dominates, and the overall performance goes down again. The speedup of GSS(80), however, is still a little lower than that of the TSS and CSS due to both higher scheduling overhead and load imbalancing.
Figure 6.14. Experiment 6.1: I = 100000, L(i) = 110 microseconds

Figure 6.15. Experiment 6.2: I = 10000, L(i) = 2 milliseconds

Experiment 6.2 is very similar to Experiment 6.1 except that L(i) is increased from 110 μsec to 2 msec. The results are plotted in Fig. 6.15. The TSS and CSS work as well as in the previous experiment, while the GSS(1) and SS show a significant improvement. This is due to the significant improvement in the ratio between the execution time of each iteration and the scheduling overhead. Note that the scale of the Y-axis for Θ and Λ is between 0 and 8 rather than between 0 and 80. The influence of the critical-section implementation overhead still keeps the GSS(1) from achieving a better performance than the other schemes. The best choice of the k value in the GSS(k) scheme is 5 in this case, which performs almost as well as the TSS and CSS schemes. Again, the effect of load imbalancing degrades the performance if the k value is larger than 5.

The loop distributions of Experiments 6.3 and 6.4 are based on Fig. 6.1(c) and Fig. 6.1(d), respectively. In Experiment 6.3, since the execution time of the last few iterations is very high, the last chore of the CSS becomes a very large critical chore. Results of this nonuniformly distributed loop are shown in Fig. 6.16. Obviously, the CSS scheme has the worst performance, in which Λ of the CSS is up to 39 when P = 80. The SS, GSS(1), and TSS schemes all achieve a very good speedup. The GSS(1) has a slightly higher scheduling overhead when P is very large due to its critical-section implementation. The TSS has slightly worse load imbalancing due to the heavy workload in the last few chores. The GSS(k) also has the same load imbalancing problem as the TSS when k ≥ 2 is chosen. Overall, under this increasing workload, the GSS(1) provides the best speedup, although the difference is insignificant compared with the SS, TSS, and GSS(k) with k ≥ 2.

Figure 6.16. Experiment 6.3: I = 3000, L(i) = [50 to 77000] microseconds

Figure 6.17 shows the results of the fourth experiment, which has a very small execution time in the last few iterations.
Again, the TSS achieves both objectives, very small scheduling overhead (Θ) and very small load imbalancing (Λ), which results in a linearly increasing speedup Γ as P increases. On the contrary, the major disadvantage of the GSS(1) scheme mentioned above, a very large first chore T(1), obviously appears in this experiment. The first chore of the GSS(1) and CSS, even though it is spawned first, becomes the critical chore in both schemes. The performance degradation of the GSS(1) and CSS is mainly caused by load imbalancing. When the k value is increased in the GSS(k) scheme, the results are even worse than when k = 1 due to the larger first chunk size.

In the same way, the random workload shown in Fig. 6.1(b) is measured in Experiment 6.5. The probability of taking the branch in the if statement is 50%. The execution times of Block 1 and Block 2 are 36 μsec and 1 msec, respectively. As shown in Figure 6.18, both the TSS and CSS work almost as well as in Experiment 6.1, while a slight load imbalance is created. The SS shows a significant improvement over Experiment 6.1. The GSS(20) is the best one in the GSS(k) group under this random workload. However, the critical-region implementation still induces too much scheduling overhead and prevents the GSS(k) from getting a better speedup.

In Experiment 6.6, processors start executing a uniformly distributed loop, with the same workload as in Experiment 6.1, at different times. The result is plotted in Figure 6.19. As expected, the SS is the worst one due to its large scheduling overhead, while the CSS loses its load balance as described above. The GSS(1) works as well as the TSS when the number of processors, P, is less than 40. Its speedup is seriously affected by the critical-section implementation when P is larger than 40, even though a good load balance is achieved. However, if k is increased in the GSS(k) scheme, the scheduling overhead can be significantly improved. The best result of the GSS(k) is produced by GSS(80), which works almost as well as the TSS scheme.

Figure 6.17. Experiment 6.4: I = 3000, L(i) = [77000 to 50] microseconds

Figure 6.18. Experiment 6.5: I = 10000, L(i) = [36 (50%) or 1000 (50%)] microseconds
Figure 6.19. Experiment 6.6: I = 100000, L(i) = 112 microseconds, processors start at different times

Figure 6.20. Experiment 6.7: P = 80, I = 100000, L(i) = 112 microseconds

Instead of setting the remote access ratio to 5%, the last experiment measures the speedup under different remote reference ratios in an 80-node cluster. The result is shown in Figure 6.20, in which the X-axis represents the remote reference ratio, from 0% to 50%. The GSS(150) is the best one in the GSS(k) group in this experiment. Due to memory and network contention, the speedup drops quickly in every scheme when the remote ratio is larger than 10%. However, the TSS is still at the top of the figure.

CHAPTER 7

CONCLUSION AND FUTURE RESEARCH

We have presented three new techniques to improve the execution of iterative loops on multiprocessor systems. In this chapter, we summarize our major contributions in this thesis. Improvements to the current work will be suggested, and plans for future work will be given.

7.1 Summary

In Chapter 1, the procedure of loop parallelization in most parallelizing compilers was reviewed. Two major types of parallel constructs, DOall and DOacross loops, were described. By discussing the important concerns of loop parallelization, the objectives of this thesis were outlined at the end of Chapter 1.

In order to preserve correct semantics, dependence constraints among iterations are the major concern in loop parallelization. In Chapter 2, cross-iteration dependence was studied. After introducing the basic concepts of data dependence, we presented the ideas of linear Diophantine equations and real inequalities that are the fundamentals of analyzing cross-iteration dependences. Some associated terms, such as iteration space and uniform dependence loop, were also defined and described.

In a shared-memory multiprocessor system, caches and local memories are widely employed to bridge the gap between fast processor cycles and slow main-memory access times. To exploit more potential parallelism, many transformation techniques ignore the problem of cache (or local memory) thrashing and barrier synchronization, which may outweigh the benefit of the resulting parallelism. Loop netting was proposed in Chapter 3 to parallelize uniform dependence loops without inducing cache thrashing and barriers. This method is particularly suitable for multiprocessor systems with a local cache or memory.

In Chapter 4, dependence uniformization was proposed to overcome the difficulties of parallelizing a doubly nested loop in which irregular dependence constraints occur among iterations.
Based on solving a system of Diophantine equations and a system of inequalities, we presented an efficient dependence test to compute the maximal and minimal dependence slopes of any regular or irregular dependence pattern in a 2-dimensional iteration space. Then, by applying the idea of vector decomposition, a set of basic dependences is chosen to replace all original dependence constraints in every iteration so that the dependence pattern becomes uniform. Thus all approaches for parallelizing nested loops with a uniform dependence pattern, such as the wave-front method, loop skewing, and so on, become applicable. We have proved that any doubly nested loop can always be uniformized to a uniform dependence loop with only two dependence vectors. To reduce the synchronization overhead, an index synchronization method, in which appropriate synchronization can be systematically inserted, was also proposed to preserve the two basic dependence vectors in a uniformized dependence pattern. Unlike the wave-front method or the loop skewing technique, the proposed method does not require loop index transformation in the original program. The results of the performance evaluation showed that the speedup is significantly affected by the range of dependence slopes and the granularity of iterations.

Since DOall loops are the most important parallel construct in parallel programming, efficient scheduling of DOall loops is vital to the performance of shared-memory multiprocessors. In Chapter 5, we reviewed various loop scheduling schemes by discussing their strengths and drawbacks. Some methods are good at distributing the workload, but complicated. Some methods are simple, but not good at balancing the workload.

To achieve both objectives, a balanced workload and low scheduling overhead, we proposed the simple and efficient Trapezoid Self-Scheduling scheme in Chapter 6. By assigning a large group of iterations at the beginning, the frequency of mutually exclusive accesses to loop indices is substantially reduced. The property of linearly decreasing chunk size induces a good workload balance even under extremely nonuniformly distributed parallel loops. Furthermore, the linearity of the chunk function allows a very efficient implementation of the chore dispatching function, which can significantly reduce the scheduling overhead. Instead of strip mining, if the TSS scheme is coupled with scalar promotion, one can statically manage memory at the procedure level and still keep a balanced workload among processors. The adjustable starting and ending chunk sizes give the TSS scheme more flexibility than other well-known schemes, and a conservative choice of starting and ending chunk sizes was proposed. With the availability of a 96-node GP-1000, we were able to actually measure, rather than simulate, the performance of different self-scheduling schemes. Our experimental results clearly indicate the advantage of the TSS scheme over other self-scheduling schemes.

7.2 Future Plans

To improve the work reported in this thesis, we identify at least the following areas that deserve further study.

Partition an independent net

Loop netting divides an iteration space into many independent nets that are assigned to processors. If the number of independent nets is small, loop netting gains only a small speedup. One idea that deserves further study is dividing an independent net into many sub-nets so that an independent net can be shared by many processors.
This partitioning scheme will lose some data locality and cause data transfer and synchronization between processors executing different sub-nets in the same independent net. The major concern in this research is how to minimize the amount of data transfer and synchronization.

Dependence uniformization considering the magnitude of dependence vectors

Clearly, the major factor in the performance of dependence uniformization is the range of dependence slopes, which determines the delay factor between processors. If a loop has a very wide range of dependence slopes, in the worst case the proposed method could result in a serial execution. One possible improvement is to consider the magnitude of dependence vectors in order to choose a more efficient BDV set.

Dependence uniformization of multiple nested loops

The proposed dependence uniformization deals with only doubly nested loops. In the case of multiple nested loops, one can separately apply the proposed method to any two adjacent levels of loops. Loop interchange and loop coalescing may be helpful in this case. However, detailed study and experiments should be further explored.

TSS with a more aggressive choice of starting and ending chunk sizes

As indicated in Chapter 6, one of the major advantages of the TSS is the flexibility of choosing the starting and ending chunk sizes. The conservative approach in this thesis is certainly not the best one. If the workload distribution and scheduling overhead are known, finding an optimal pair (f, l) for the TSS scheme is an interesting issue.

APPENDICES

APPENDIX A

FINDING THE DCH IN A N-DIMENSIONAL SPACE

In this appendix, we study the cases in which the general solution of a system of Diophantine equations is composed of more than 2 free variables. Thus, instead of a 2-dimensional solution space, the algorithm for finding the DCH should be extended to n-dimensional space. We first present our ideas through some lemmas and theorems. Then the data structure used to represent a convex hull is described, followed by the complete algorithm.

A.1 Basic Principle

Recall that finding a DCH in an n-dimensional solution space involves computing the intersection of an n-dimensional convex hull and several half-spaces that are derived from the constraints in Eq. (2.2). A half-space in an n-dimensional space can be represented by an (n-1)-dimensional hyper-plane and a symbol, ≤ or ≥. Thus, finding a DCH involves computing the intersection of an n-dimensional convex hull and an (n-1)-dimensional hyper-plane. To simplify our presentation, an m-dimensional convex hull is represented by CH^m and an n-dimensional hyper-plane is denoted by HP^n. Without formal proofs, we first describe the principle of our algorithm by the following lemmas and theorem.

Lemma A.1 A CH^m is a convex hull bounded by many connected CH^(m-1)'s.

For example, the CH^3 shown in Fig. A.1 is bounded by 6 connected CH^2's. Each CH^2 is a 2-dimensional plane bounded by connected edges that are essentially 1-dimensional convex hulls. For instance, plane C1 is bounded by four connected edges.

Figure A.1. A CH^3 in a 3-dimensional space

Definition A.1 Let H be an HP^(n-1) and C be a CH^m in an n-dimensional space. We say H cuts C if some part of C is on one side of H and the rest of C is on the other side of H, where m ≤ n.

Lemma A.2 Let H be an HP^(n-1) and C be a CH^m in an n-dimensional space. If H cuts C, the intersection of H and C is a CH^(m-1).
For example, Fig. A.2 shows an HP^2 cutting a CH^3 in a 3-dimensional space, in which the intersection is a 2-dimensional convex hull bounded by the edges op, pq, qr, and ro.

Figure A.2. A CH^3 cut by an HP^2 in a 3-dimensional space

Theorem A.1 Let H be an HP^(n-1) and C be a CH^m which is bounded by k connected CH^(m-1)'s, C1, C2, ..., Ck, in an n-dimensional space. If H cuts C, the intersection of H and C is a CH^(m-1) which is bounded by the CH^(m-2)'s C'1, C'2, ..., C'k, where C'i, which may be empty, is the intersection of H and Ci.

Consider the example in Fig. A.2 again, in which the CH^3 is bounded by 6 CH^2's, C1, C2, ..., C6. The intersection of the HP^2 and the CH^3 is the CH^2 bounded by the 4 edges op, pq, qr, and ro. As can be seen, this hyper-plane does not cut C1 and C2. The intersections of this hyper-plane with C3, C4, C5, and C6 are the edges op, pq, qr, and ro, respectively.

A.2 Data Structure

To flexibly represent an m-dimensional convex hull, a tree-like structure is employed in our algorithm. A convex hull is represented by a structure which is composed of six fields, as follows.

/* ----- structure of a Convex Hull ----- */
struct CH {
    struct CH   *child;    /* pointer to its child ring           */
    struct CH   *next;     /* pointer to its brother CH           */
    int          dim;      /* number of dimensions of this CH     */
    int          chdno;    /* number of CHs in its child ring     */
    struct NODE *left;     /* pointer to a node; used in a CH^1   */
    struct NODE *right;    /* pointer to a node; used in a CH^1   */
};

/* ----- structure of a Node ----- */
#define Mdim 10            /* define the maximal dimension        */
struct NODE {
    struct NODE *next;     /* pointer to next node in node list   */
    int coord[Mdim];       /* the coordinates of this node        */
    int degree;            /* the degree of this node             */
};

/* ----- structure of a Half-space ----- */
struct HS {
    int  coeff[Mdim+1];    /* the coefficients of the hyper-plane */
    char zone;             /* > or <                              */
};

As indicated in Lemma A.1, a CH^m is composed of many connected CH^(m-1)'s, which are stored in a ring structure, called the child ring, in our algorithm. The field child is a pointer to its child ring, while next points to the next convex hull in the child ring. The dimension of this convex hull is stored in dim, and the number of convex hulls in the child ring is stored in chdno. If this is a CH^1 (an edge), left and right point to the two nodes that form the edge. All nodes of this convex hull are maintained in a list structure, which is the major concern in computing the range of dependence slopes in the proposed dependence uniformization. For example, the data structure of Figure A.1 is partially shown in Figure A.3. A half-space, represented by a hyper-plane and a symbol, is also listed above. Note that Mdim represents the maximal dimension in this algorithm.

Figure A.3. The data structure of the CH^3 in Figure A.1.

A.3 Algorithm

Similar to the algorithm shown in Fig. 4.4, initially the whole n-dimensional space is represented by an infinitely large CH^n which is composed of 2^n nodes. Then, the half-spaces derived from the general solution of Eq. (2.1) and Eq. (2.2) are applied to cut the initial convex hull. The algorithm, in a C-like language, is listed below.

Input: A list of half-spaces.
Output: A DCH

int dim;                  /* the dimension of the initial convex hull */
struct CH   *ch;          /* pointer to the convex hull               */
struct NODE *node;        /* pointer to the node list                 */
struct HS   *hs;          /* pointer to a half-space                  */
struct CH   *dummy;

begin
  Build the initial convex hull pointed to by *ch and the node list pointed to by *node;
  while( the input list is not empty ) begin
    Pop a half-space from the list into *hs;
    CUT(ch, node, hs, dummy);
    if( node == Null ) STOP;
  end while
end

CUT(ch, node, hs, ich)
struct CH   *ch;
struct CH   *ich;         /* returns the intersection of ch and hs    */
struct NODE *node;
struct HS   *hs;
begin
  int i, chdno;
  struct CH *new_ch;
  struct CH *p1, *p2, *ichx;

  ich = Null;
  if( ch->dim == 2 ) {                    /* 2-dimensional convex hull        */
    PLANE(ch, node, hs, ichx);
    ich = ichx;
  } else {
    new_ch = (struct CH *) malloc( sizeof( struct CH ) );
    new_ch->child = Null;                 /* the new (m-1)-dim CH starts empty */
    new_ch->chdno = 0;
    p1 = ch->child;                       /* the 1st CH in the child ring     */
    p2 = p1->next;                        /* the second CH                    */
    chdno = ch->chdno;
    /* --- now, scan the child ring --- */
    for( i = 1 ; i <= chdno ; i++ ) {
      CUT(p2, node, hs, ichx);            /* recursive call                   */
      if( p2->chdno == 0 ) {              /* CH not in the right zone         */
        p1->next = p2->next;              /* remove it from the ring          */
        p2 = p1->next;                    /* go to next child CH              */
        ch->chdno -- ;
      } else {
        p1 = p1->next;                    /* go to next child CH              */
        p2 = p1->next;
        if( ichx != Null ) {              /* CH intersects the hyper-plane    */
          /* insert it into new_ch's child ring */
          if( new_ch->child != Null ) {
            ichx->next = new_ch->child->next;
            new_ch->child->next = ichx;
          } else {
            ichx->next = ichx;            /* 1st CH in the ring               */
            new_ch->child = ichx;
          }
          new_ch->chdno ++ ;
        }
      }
    }                                     /* end of scanning the child ring   */
    if( new_ch->child != Null ) {         /* if the new CH is not empty,      */
      new_ch->dim = ch->dim - 1;          /* return it to the previous level  */
      ich = new_ch;
    }
  }
end;

/*** Cut a 2-dim CH (a plane) ***/
PLANE(ch, node, hs, ich)
struct CH   *ch;
struct CH   *ich;         /* returns the intersection of CH and HS    */
struct NODE *node;
struct HS   *hs;
begin
  int i, chdno;
  struct CH   *new_ch1;
  struct CH   *p1, *p2;
  struct NODE *inode;

  ich = Null;
  new_ch1 = (struct CH *) malloc( sizeof( struct CH ) );
  new_ch1->dim   = 1;
  new_ch1->left  = Null;                  /* endpoints of the new edge        */
  new_ch1->right = Null;
  p1 = ch->child;                         /* the first child CH (an edge)     */
  p2 = p1->next;                          /* the second child CH              */
  chdno = ch->chdno;
  /*** scan the child ring ***/
  for( i = 1 ; i <= chdno ; i++ ) {
    EDGE(p2, node, hs, inode);
    if( p2->left == Null && p2->right == Null ) {  /* edge not in the right zone */
      p1->next = p2->next;                /* remove it from the child ring    */
      p2 = p1->next;                      /* go to next child CH              */
      ch->chdno -- ;
    } else {
      p1 = p1->next;                      /* go to next child CH              */
      p2 = p1->next;
      if( inode != Null ) {               /* this edge intersects the hyper-plane */
        /* fill the intersection node into new_ch1 */
        if( new_ch1->left == Null )
          new_ch1->left = inode;
        else
          new_ch1->right = inode;
      }
    }
  }                                       /*** end of scanning the child ring ***/
  if( new_ch1->right != Null ) ich = new_ch1;
end;

EDGE(ch, node, hs, inode)
struct CH   *ch;
struct NODE *node;
struct NODE *inode;       /* returns the intersection of the edge and *hs */
struct HS   *hs;
begin
  struct NODE *nd1, *nd2;
  struct NODE *new_nd;

  inode  = Null;
  new_nd = Null;
  nd1 = ch->left;                         /* nd1 points to one node of the edge */
  nd2 = ch->right;                        /* nd2 points to the other node       */

  case 1: (Both *nd1 and *nd2 are not in the half-space *hs) {
    if( --nd1->degree == 0 ) Remove *nd1 from the list *node;
    if( --nd2->degree == 0 ) Remove *nd2 from the list *node;
    ch->left  = Null;
    ch->right = Null;
  }
  case 2: ( *nd1 is not in the half-space *hs) {
    if( --nd1->degree == 0 ) Remove *nd1 from the list *node;
    new_nd = (struct NODE *) malloc( sizeof( struct NODE ) );
    Compute the intersection point and store it into *new_nd;
    Insert *new_nd into the node list *node;
    ch->left = new_nd;
  }
  case 3: ( *nd2 is not in the half-space *hs) {
    if( --nd2->degree == 0 ) Remove *nd2 from the list *node;
    new_nd = (struct NODE *) malloc( sizeof( struct NODE ) );
    Compute the intersection point and store it into *new_nd;
    Insert *new_nd into the node list *node;
    ch->right = new_nd;
  }
  case 4: (Both *nd1 and *nd2 are in the half-space *hs);

  if( new_nd != Null ) inode = new_nd;
end;

BIBLIOGRAPHY

[1] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing. McGraw-Hill Book Co., 1984.

[2] D. Kuck, A. Sameh, R. Cytron, A. Veidenbaum, C. Polychronopoulos, G. Lee, T. McDaniel, B. Leasure, C. Beckman, J. Davies, and C. Kruskal, "The effects of program restructuring, algorithm change and architecture choice on program performance," in Proceedings of the 1984 International Conference on Parallel Processing, Aug. 1984.

[3] M. Booth and K. Misegades, "Microtasking: A new way to harness multiprocessors," in Cray Channels, Summer 1986.

[4] Alliant Computer System Corp., Acton, MA, FX/Series Architecture Manual, 1985.

[5] Cray Research, Inc., Mendota Heights, Minn., Multitasking User Guide, Jan. 1985.

[6] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante, "An overview of the PTRAN analysis system for multiprocessing," in Proceedings of the International Conference on Supercomputing, June 1987.

[7] D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe, "The structure of an advanced vectorizer for pipelined processors," in Proc. IEEE Computer Society Fourth International Computer Software and Applications Conf., Oct. 1980.

[8] D. J. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe, "Dependence graphs and compiler optimizations," in Proceedings of the 8th ACM Symposium on Principles of Programming Languages, pp. 207-218, Jan. 1981.

[9] R. Allen and K. Kennedy, "Automatic translation of Fortran programs to vector form," ACM Transactions on Programming Languages and Systems, vol. 9, pp. 491-542, Oct. 1987.

[10] R. Allen and K. Kennedy, "Automatic loop interchange," in Proceedings of the ACM SIGPLAN 84 Symposium on Compiler Construction, pp. 233-246, June 1984.

[11] D. Callahan, K. Cooper, K. Kennedy, and L. Torczon, "Interprocedural constant propagation," in Proceedings of the ACM SIGPLAN 86 Symposium on Compiler Construction, pp. 59-67, July 1986.

[12] J. Davies, C. Huson, T. Macke, B. Leasure, and M. Wolfe, "The KAP/S-1: An advanced source-to-source vectorizer for the S-1 Mark IIa supercomputer," in Proceedings of the 1986 International Conference on Parallel Processing, (New York), pp. 833-835, IEEE Press, Aug. 1986.

[13] J. Davies, C. Huson, T. Macke, B. Leasure, and M. Wolfe, "The KAP/205: An advanced source-to-source vectorizer for the Cyber 205 supercomputer," in Proceedings of the 1986 International Conference on Parallel Processing, (New York), pp. 827-832, IEEE Press, Aug. 1986.

[14] B. Brode, "Precompilation of Fortran programs to facilitate array processing," Computer, vol. 14, no. 9, pp. 46-51, Sept. 1981.
[15] M. Wolfe, Optimizing Supercompilers for Supercomputers. Cambridge, Massachusetts: The MIT Press, 1989.

[16] D. A. Padua and M. J. Wolfe, "Advanced compiler optimizations for supercomputers," Communications of the ACM, vol. 29, pp. 1184-1201, Dec. 1986.

[17] C. D. Polychronopoulos, Parallel Programming and Compilers. 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061: Kluwer Academic, 1988.

[18] M. Wolfe, "Loop skewing: The wavefront method revisited," International Journal of Parallel Programming, vol. 15, pp. 279-293, Aug. 1986.

[19] M. Wolfe and D. J. Kuck, "Advanced loop interchanging," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 536-543, Aug. 1986.

[20] C. D. Polychronopoulos, D. J. Kuck, and D. A. Padua, "Execution of parallel loops on parallel processor systems," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 519-527, Aug. 1986.

[21] The Parallel Computing Forum, PCF Fortran: Language Definition, revision 1.5 ed., Aug. 1990.

[22] BBN Advanced Computers Inc., Cambridge, MA, Mach 1000 Fortran Compiler Reference, revision 1.0 ed., Nov. 1988.

[23] T. H. Tzen and L. M. Ni, "Dynamic loop scheduling for shared-memory multiprocessors," in Proceedings of the 1991 International Conference on Parallel Processing, vol. II, pp. 247-250, Aug. 1991.

[24] Z. Fang, P.-C. Yew, P. Tang, and C. Q. Zhu, "Dynamic processor self-scheduling for general parallel nested loops," in Proceedings of the 1987 International Conference on Parallel Processing, pp. 1-10, Aug. 1987.

[25] C. Polychronopoulos and D. Kuck, "Guided self-scheduling: A practical self-scheduling scheme for parallel supercomputers," IEEE Transactions on Computers, vol. C-36, pp. 1425-1439, Dec. 1987.

[26] P. Tang and P.-C. Yew, "Processor self-scheduling for multiple-nested parallel loops," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 528-535, Aug. 1986.

[27] M. Weiss, Z. Fang, C. R. Morgan, and P. Belmont, "Effective dynamic scheduling and memory management on parallel processing systems," in Proceedings of the 1989 COMPSAC, pp. 122-129, Sept. 1989.

[28] T. H. Tzen and L. M. Ni, "Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers," Tech. Rep. MSU-CPS-ACS-27, Michigan State University, 1990.

[29] D. Padua, Multiprocessors: Discussion of Some Theoretical and Practical Problems. PhD thesis, University of Illinois at Urbana-Champaign, Nov. 1979.

[30] R. Cytron, Compile-Time Scheduling and Optimization for Asynchronous Machines. PhD thesis, University of Illinois at Urbana-Champaign, Oct. 1984.

[31] R. Cytron, "Doacross: Beyond vectorization for multiprocessors," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 836-844, Aug. 1986.

[32] U. Banerjee, Dependence Analysis for Supercomputing. 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061: Kluwer Academic, 1988.

[33] X. Kong, D. Klappholz, and K. Psarris, "The I test: A new test for subscript data dependence," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 204-211, Aug. 1990.

[34] Z. Li, P.-C. Yew, and C.-Q. Zhu, "An efficient data dependence analysis for parallelizing compilers," IEEE Transactions on Parallel and Distributed Systems, vol. 1, pp. 26-34, Jan. 1990.
[35] D. Grunwald, "Data dependence analysis: the lambda test revisited," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 220-223, Aug. 1990.

[36] D. Malm, A Computer Laboratory Manual for Number Theory. COMPress, Inc., Wentworth, New Hampshire, 1980.

[37] S. Midkiff and D. Padua, "Compiler generated synchronization for do loops," in Proceedings of the 1986 International Conference on Parallel Processing, (Washington, DC), IEEE Computer Society Press, Aug. 1986.

[38] S. Midkiff and D. Padua, "Compiler algorithms for synchronization," IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1485-1495, 1987.

[39] A. K. Nanda, H. Shing, T. H. Tzen, and L. M. Ni, "Resource contention in shared-memory multiprocessors: A parameterized performance degradation model," Journal of Parallel and Distributed Computing, vol. 12, pp. 313-328, 1991.

[40] U. Banerjee, S.-C. Chen, D. J. Kuck, and R. A. Towle, "Time and parallel processor bounds for Fortran-like loops," IEEE Transactions on Computers, vol. C-28, pp. 660-670, Sept. 1979.

[41] G. F. Pfister and V. A. Norton, "Hot-spot contention and combining in multistage interconnection networks," IEEE Transactions on Computers, vol. C-34, pp. 934-948, Oct. 1985.

[42] A. K. Nanda, H. Shing, T. H. Tzen, and L. M. Ni, "A replicate workload framework to study performance degradation in shared-memory multiprocessors," in Proceedings of the 1990 International Conference on Parallel Processing, vol. I, pp. 161-168, Aug. 1990.

[43] Y. Muraoka, Parallelism Exposure and Exploitation in Programs. PhD thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Feb. 1971.

[44] L. Lamport, "The parallel execution of do loops," Communications of the ACM, pp. 83-93, Feb. 1974.

[45] D. J. Kuck, Y. Muraoka, and S. C. Chen, "On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup," IEEE Transactions on Computers, vol. C-21, pp. 1293-1310, Dec. 1972.

[46] R. Kuhn, Optimization and Interconnection Complexity for: Parallel Processors, Single-Stage Networks, and Decision Trees. PhD thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Feb. 1980.

[47] W. Shang and J. Fortes, "Independent partitioning of algorithms with uniform dependencies," in Proceedings of the 1988 International Conference on Parallel Processing, vol. 2, pp. 26-33, Aug. 1988.

[48] E. D'Hollander, "Partitioning and labeling of index sets in do loops with constant dependence vectors," in Proceedings of the 1989 International Conference on Parallel Processing, vol. 2, pp. 139-144, Aug. 1989.

[49] W. Shang and J. Fortes, "Tiling of iteration spaces for multicomputers," in Proceedings of the 1990 International Conference on Parallel Processing, vol. 2, pp. 179-186, Aug. 1990.

[50] U. Banerjee, "Unimodular transformations of double loops," in the Third Workshop on Programming Languages and Compilers for Parallel Computing, Aug. 1990.

[51] D. E. Hudak and S. G. Abraham, "Compiler techniques for data partitioning of sequentially iterated parallel loops," in Proceedings of the 1990 International Conference on Supercomputing, Amsterdam, pp. 187-220, 1990.

[52] Z. Fang, P.-C. Yew, P. Tang, and C. Q. Zhu, "Dynamic processor self-scheduling for general parallel nested loops," IEEE Transactions on Computers, vol. 39, pp. 919-929, July 1990.
[53] Z. Fang, "Cache or local memory thrashing and compiler strategy in parallel processing systems," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 271-275, Aug. 1990.

[54] D. E. Knuth, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, Massachusetts, 1981.

[55] Z. Shen, Z. Li, and P. C. Yew, "An empirical study on array subscripts and data dependences," in Proceedings of the 1989 International Conference on Parallel Processing, vol. II, pp. 145-150, Aug. 1989.

[56] BBN Advanced Computers Inc., Cambridge, MA, Overview of the Butterfly GP1000, Nov. 1988.

[57] E. Lusk and R. A. Overbeek, "Implementation of monitors with macros: A programming aid for HEP and other parallel processors," Tech. Rep. ANL-MCS-83-97, Argonne National Lab., 1983.

[58] L. M. Ni and C.-F. E. Wu, "Design tradeoffs for processor scheduling in shared-memory multiprocessor systems," IEEE Transactions on Software Engineering, vol. 15, pp. 327-334, Mar. 1989; also in Proc. of the 1985 Int. Conf. on Parallel Processing, pp. 63-70.

[59] The Parallel Computing Forum, PCF Fortran: Language Definition, revision 1.5 ed., Aug. 1990.