flaw" mm rw~v§~ ' . _ _ ' .' . ; ' - ' .1 .12.: :‘.l .,.-V.|S.‘ Ill/ll ., Ill/I llllwlsllt’ll/‘IA/l/l‘l’l This is to certify that the dissertation entitled MATRIX-BASED REPRESENTATIONS OF LOOP TRANSFORMATIONS presented by David Raymond Chesney has been accepted towards fulfillment of the requirements for Ph . D . degree in Computer Sc ience , Z Major professor Date 91’?!ng MSU is an Affirmative Action /Equal Opportunity Institution 0- 12771 LIBRARY Michigan State Unlverslty PLACE IN RETURN BOX to remove this Moat from your mood. TO AVOID FINES Mum on or More data duo. fiTE DATE DUE DATE DUE MSU I: An Nfinnltlvo Adlai/Emu Opportunity Instltwon Wm: _ VEK __. _..__._ _ __...\_ _V _ _ MATRIX-BASED REPRESENTATIONS OF LOOP TRANSFORMATIONS By David Raymond Chesney A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Computer Science 1995 ABSTRACT MATRIX-BASED REPRESENTATIONS OF LOOP TRANSFORMATIONS By David Raymond Chesney The execution speed of source code on a parallel architecture machine is bound by many factors, including the ability of a parallel compiler to determine parallelism in the target source code. Most of the available parallelism in a source code is contained in the loops and is exploited by applying a sequence of loop transformations. Different methods of representing and ordering sequences of transformations have been developed, including the use of unimodular transformations. A significant feature of the unimodular approach is that it unifies a kernel set of seemingly independent loop transformations into the familiar linear algebra domain. The unimodular approach has been applied to loop permutation, loop reversal, and loop skewing in perfectly nested loops. This research extends matrix-based representations of loop transformations to include a wider family of loop transformations applied to a broad range of source code structures. First, a matrix-based representation of additional loop transformations has been developed, namely loop normalization, loop blocking, strip mining, cycle shrinking, loop collapsing, loop coalescing, loop fission, and loop fusion. Second, the application of the extended and kernel transformations is generalized to handle both perfectly and imperfectly nested loops. Finally, properties of the original unimodular transformations, including composition, one- time mapping, and the evaluation of parallelism are preserved in the extended model. Copyright © by David Raymond Chesney 1995 To S.T., ILYTTMABA iv ACKNOWLEDGMENTS I would like to thank my advisor, Dr. Betty H.C. Cheng, for her assistance during my research. Early in my academic career she told me that earning a Ph.D. is like a roller- coaster ride. Thank you, Dr. Cheng, for believing in me in both the highs and the lows of my Ph.D. program. Thank you also to my committee. Dr. Lionel Ni once said that the key to success in a Ph.D. is finding the question. After that, the answer comes easy. His assistance in helping me find both the question and the answer were invaluable. Dr. Diane Rover helped me both technically, and by showing me that it is possible to balance family and academics. She is a wonderful role model. Dr. David Yen ably assisted on my committee by adding a mathematical perspective. The pursuit of my Ph.D. would not have been possible without the General Motors Corporation Fellowship. Mr. Dennis Meyers served in the non-always-easy position of corporate mentor during the fellowship. 
Our numerous early morning meetings led to sound career advice and opportunities. Thank you, Dennis. Several other faculty and staff stand out. Dr. Carl Page sowed the seeds of my initial interest in computer science, and saw to it that I had a solid foundation. Dr. Richard Enbody is a true friend and has given me direction in both academic and non-academic matters. I have a history with Mrs. Lora Mae Higbee from before my time in the Computer Science Department. She has been a wonderful "aunt" and listener throughout. Mrs. Linda Moore has always exuded genuine warmth and able assistance on my frequent visits to her office. Thank you to my friends and fellow students. I feel fortunate to be a part of such a remarkable group. I will truly miss our lunch and office conversations dealing with such topics as SPMD machine operation, LaTeX errors, and platypus breasts. By name, they are Andy Fung, Edgar Kalns, Barb Birchler, Gerry Gannod, Steve Turner, Marie-Pierre Dubuisson, Joe Sharnowski, Christian Trefftz, Tim Newman, Jon Engelsman, Jay Kusler, Steve Schafer, Enoch Wang and Oliver Christ. Special thanks to Maureen Galsterer and Linda Cherry for sharing the commute and providing rich conversation. Ron Sass offered critical review of my work that strengthened my overall approach and the potential value of my research. Thanks, Ron. My parents, Dale A. and Beverly E. Chesney, taught me how to dream big dreams. Their love and support have always been an inspiration to me. It's funny! Even as an adult, I still need to make them proud. My daughter, Mairin, came along midway through my academic career. Her giggles and hugs added more to any success at my research than any amount of time at a workstation. Mairin, I love you dearly, and being your (and your siblings', wink! wink!) Daddy is my true vocation. Finally, and most importantly, thank you to my wife, Jean. I don't know if you realized what the journey would be like when you married a dreamer. You are the absolute center of my life and there is no possible way that I could have completed my Ph.D. without you. I love you and thank you, from the bottom of my heart.

TABLE OF CONTENTS

LIST OF FIGURES

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Organization

2 Background and Definitions
2.1 Source Code Structure
2.2 Perfect and Imperfect Loop Nests
2.3 Dependences
2.4 Properties of Dependences
2.5 Unimodular Transformations
2.6 Global Pre- and Postconditions
2.7 Transforming Source Code

3 Kernel Transformations and Loop Normalization
3.1 Kernel Transformations
3.2 Loop Normalization

4 Loop Blocking and Loop Collapsing
4.1 Loop Blocking
4.2 Loop Collapsing

5 Loop Fission and Loop Fusion
5.1 Loop Fission
5.2 Loop Fusion

6 Integration of the Extended Loop Transformations
6.1 Previously Developed Algorithms
6.2 New Algorithms
6.3 Fine-grain parallelism for imperfect-only-child loop nests using the kernel transformation set
6.4 Coarse-grain parallelism for imperfect-only-child loop nests using the extended transformation set
6.5 Data locality for imperfect-only-child loop nests using the extended transformation set
6.6 Imperfect-only-child loop nests on vector architecture machines using the extended transformation set
6.7 Imperfect-only-child loop nests on a shared-memory machine using the extended transformation set

7 Implementation Model
7.1 Overview
7.2 Requirements
7.3 OMT Analysis

8 Related Work
8.1 Unimodular Matrices
8.2 Non-Singular Matrices
8.3 Schedules
8.4 Generalized Loop Transformations
8.5 Precondition Approach
8.6 Morphism Notation

9 Conclusions and Future Investigations

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Sample code: general form of a nested loop
2.2 Sample code and array index function for statement S1
2.4 Relationship of perfect and imperfect loop nests
2.5 Sample code containing scalar flow, anti, and output dependence
2.6 Sample code containing array flow, anti, and output dependences
2.7 Dependence graph (depth = 1)
2.8 Dependence graph (depth = 2)
2.9 Dependence graph (depth = 3)
2.10 Sample code: before unimodular transformation
2.14 Transformation of assignment statements
2.15 Algorithm to map the loop bounds
2.16 Transformation of loop bounds
3.1 Sample code: general form of an imperfectly nested loop
3.3 Sample code: illegal permutation of loops L1 and L2
3.4 Sample code: skew of loop L2 by m × L1
3.6 Sample code: different cases of unnormal loops
3.7 Sample code: false dependence
5.3 Algorithm to determine if two loop bound sets can be adjusted
6.1 Algorithm to obtain coarse-grain parallelism using the kernel transformations of perfect loop nests
6.2 Algorithm to obtain fine-grain parallelism using the kernel transformations on perfect loop nests
6.3 Sample code: n-deep imperfect-only-child loop nest
6.4 Algorithm to obtain fine-grain parallelism using the kernel transformations on imperfect-only-child loop nests
6.5 Sample code: initial code before application of mod_kernel_fine
6.6 Sample code: transformed code after application of mod_kernel_fine
6.7 Algorithm to obtain coarse-grain parallelism using the extended transformations on imperfect-only-child loop nests
6.8 Sample code: parallelization not allowed by kernel unimodular transformations
6.9 Sample code: parallelization after loop fission
6.10 Sample code: showing reuse and locality
6.11 Sample code: statement groupings for data locality {(S1, S4), (S2, S3)}
6.12 Algorithm to improve data locality using the extended transformations on imperfect-only-child loop nests
6.13 Sample code: before application of data locality algorithm
6.14 Sample code: after application of data locality algorithm
6.17 Algorithm to optimize source code for a vector machine using the extended transformations on imperfect-only-child loop nests
6.18 Sample code: before application of Algorithm ext_vector
6.21 Algorithm to optimize source code for a shared memory machine using the extended transformations on imperfect-only-child loop nests
6.22 Sample code: before application of Algorithm ext_sms
6.23 Sample code: after application of Algorithm ext_sms
7.1 Graphical user interface
7.2 Object diagram for overall system
7.3 Object diagram for source code dependences
7.4 Object diagram for transformation matrices
7.5 Object diagram for transformation sequence
7.6 Object diagram for source code
7.7 Backus-Naur Form of target imperative language (modified Tiny syntax)
7.8 State diagram for system
7.9 Level 0 data flow diagram
7.10 Level 1 data flow diagram
7.11 Level 2 data flow diagram for finding the transformation matrix T
7.12 Level 2 data flow diagram for mapping source code

CHAPTER 1

Introduction

The speed of executing source code on a parallel computer is theoretically limited by the number of processors [1, 2]. However, a practical limitation is the ability of a compiler for a parallel machine to recognize and exploit parallelism and data locality. Because it must parallelize code and localize data, a compiler for a parallel machine is often more complex than a compiler for a sequential machine. Methods need to be developed that easily and consistently exploit parallelism and locality, which leads directly to faster execution on parallel architecture machines. Most of the available parallelism in target source code is in the loops, and certain combinations of transformations to the loops maximize parallelism or data locality goals. Examples of the loop transformations include loop permutation, loop reversal, loop skewing, loop fission, loop fusion, loop blocking (tiling), strip mining, cycle shrinking, loop collapsing, loop coalescing, and loop normalizing. Individually, the transformations are fairly well understood; however, they are often treated independently.

Thesis Statement. The purpose of this research is to extend matrix-based representations of loop transformations to include loop normalization, loop blocking, tiling, cycle shrinking, strip mining, loop collapsing, loop coalescing, loop fission, and loop fusion. The transformations are applied to both imperfectly and perfectly nested loops, and preserve attractive properties of the traditional matrix-based models.
New algorithms are developed that use the kernel and extended transformation set to obtain both machine-independent and machine-dependent goals.

1.1 Motivation

A unified framework that represents a broad spectrum of transformation techniques is essential to the usefulness of a parallel compiler. An integrated framework facilitates the user's ability to manipulate and transform source code within a common context. Recently, several generalized transformation schemes have been proposed to unify the seemingly independent transformations, including unimodular matrices [3, 4], non-singular matrices [5, 6], schedules [7, 8], and other general approaches [9, 10, 11]. Each generalized model has its own set of strengths and weaknesses. One weakness of the generalized models is that only a limited number of loop transformations are represented. As an example, unimodular transformations represent loop permutation, reversal, and skewing. Second, there is typically an explicit limitation on the source code structure before the generalized model is applied. That is, the models typically require perfect loop nesting. Third, mapping rules for the dependences are often imprecise. Accurate dependence information is paramount to the evaluation of any transformation sequence. The accuracy needs to be maintained throughout the transformation sequence in order for parallelism goals to be evaluated.

A major strength of each transformation model is the unification of a set of loop transformations, which allows a user to manipulate different loop transformations within a common context. Unimodular transformations offer three additional strengths: one-time mapping of initial source code into transformed source code; composability; and the ability to evaluate parallelism goals using an abstract representation of the source code and transformation sequence. Unimodular transformations are a set of loop transformations that are represented by matrices with specific properties. Specifically, loop permutation, reversal, and skewing are all represented by unimodular matrices. A sequence of transformations is represented by the product of the individual transformations that comprise it, and is called the transformation matrix T. The user can determine if the transformed source code is legal based upon properties of the product of the transformation matrix T and a vector representation of the dependences D in the source code. Source code parallelism is also determined from the product of the transformation matrix T and the dependences D. The source code is mapped from initial to final form, given the transformation matrix T. That is, there exists an algorithm for source code mapping based upon the matrix representing the sequence of transformations.

1.2 Contributions

The major contributions of this research are: 1) to extend the matrix-based representations of loop transformations; 2) to generalize the kernel and extended set of transformations to apply to both perfect and imperfect loop nests; and 3) to develop new algorithms that are based upon combinations of the transformations to achieve goals relevant to a parallel compiler. The research presented in this dissertation preserves the strengths of unimodular transformations, yet overcomes the weaknesses of other generalized models [12]. First, this dissertation describes an extension to the unimodular approach to include a wider class of loop transformations.
Specifically, five new families of transformation techniques are described in terms of a matrix-based framework: loop normalization [13, 14, 15], loop blocking [4, 14, 15, 16, 17, 18, 19], loop collapsing [14, 15, 18], loop fission [17, 18, 19, 20, 21], and loop fusion [14, 15, 17, 18, 19]. Loop normalization is a transformation that transforms source code so that the lower bounds and step sizes of a loop are both one. Normalized loops are often a precondition for kernel transformations and for many extended transformations that are discussed in this dissertation. Loop fission is a loop transformation technique that divides a set of program statements within a nested loop into two separate nested loops. The inverse of loop fission is loop fusion, in which two adjacent nested loops are combined into one [22]. Loop blocking includes tiling, strip mining, and cycle shrinking; this transformation increases the depth of a nested loop. Loop collapsing includes loop coalescing and is the inverse of loop blocking because it decreases the depth of a nested loop [23]. Mapping for the source code and dependences is explicitly defined for each extended transformation. The rules for source code mapping show how the initial source code is mapped to its final form. Accurate dependence mapping rules ensure that the transformed source code is semantically equivalent to the initial source code and allow for evaluation of potential parallelism.

Another important advantage of the extended approach is the ability to transform imperfectly nested loops. Imperfect loop nests do not have all assignment statements contained in the deepest loop. The generalized unimodular framework presented in this dissertation enables both the kernel set (loop permutation, reversal, and skewing) and the extended set (loop normalization, fission, fusion, blocking, and collapsing) of loop transformations to be applied to both perfect and imperfect loop nests.

Loop transformations are used for numerous purposes in a parallel compiler. For example, loop permutation and skewing may be used to obtain parallelism. Loop blocking may result in improved data locality or reduced barrier synchronization. Combining loop blocking, permutation, and skewing enables parallelism, data locality, and barrier synchronization goals to be attained within a matrix-based framework. That is, extending the unimodular approach to include additional transformations broadens the potential advantages to include those of the extended transformation set.

Several new algorithms are introduced that combine the kernel and extended transformation sets to obtain machine-independent and machine-dependent goals. Machine-independent goals include data locality and parallelism, and machine-dependent goals include transformation of source code for a vector or shared-memory machine. The algorithms combine the loop transformations to obtain goals that are beyond the scope of the kernel transformation set alone. The current unimodular approach is a theoretically elegant method for applying a limited number of transformations to a restricted set of source code structures. The generalization presented here broadens the types of transformations that are applied to a wider class of source code structures.

1.3 Organization

The remainder of this dissertation is organized as follows. Chapter 2 gives definitions and background information used throughout the dissertation.
Chapter 3 discusses the generalization of the kernel transformations (loop permutation, reversal, and skewing) to include imperfectly nested loops, and also loop normalization. The next chapter discusses two additional transformations that modify the nest depth of a nested loop, namely loop blocking and collapsing. Chapter 5 discusses loop fission and fusion, which are the last of the extended transformations. Algorithms that attain machine-independent and machine-dependent goals using the kernel and extended transformation sets are discussed in Chapter 6. Chapter 7 uses the object modeling technique to describe the implementation framework of the system. Related work is discussed in Chapter 8. Finally, conclusions and future investigations are discussed in Chapter 9.

Chapters 3 through 5 describe matrix representations of individual loop transformations, and the format of all of the chapters is similar. First, the transformation is briefly described and an example is given. Second, the legality of the transformation is discussed in terms of invariant, explicit, and nesting preconditions. Next, the effect of the loop transformation on the distance vector set D is discussed. Accurate mapping of D is essential to evaluation of overall legality, parallelism, and other transformation goals. Finally, the source code mapping is discussed.

CHAPTER 2

Background and Definitions

This section gives background and definitions for terms used throughout the dissertation. First, the terminology related to source code structure is discussed. Next, the taxonomy of perfect and imperfect loop nests is described. Third, dependences and their properties are discussed. Next, fundamentals of unimodular transformations are described, followed by the global pre- and postconditions for applying a sequence of matrix-based transformations.

2.1 Source Code Structure

This section discusses the explicit notation used throughout the dissertation to describe loop bounds and subscript expressions.

Loop Bound Notation

In general, a loop nest is in the form shown in Figure 2.1, where n is the loop nest depth. The entire iterative statement "for I_i = lb_i, ub_i, s_i" is referred to as L_i. I_1 through I_n are index variables, and lb_i, ub_i, and s_i are the lower bound, upper bound, and step size, respectively, for the loop with index variable I_i. In order to prevent ambiguity, the subscripts for the upper bound, lower bound, and step size may contain the name of the index variable with which they are associated. As an example, lb_I1 is equivalent to lb_1. If there is an n-nested loop, then the loop bound set B has n loop bound vectors b. Each b has three loop bound expressions: the lower bound, upper bound, and step size.

L1 for I1 = lb1, ub1, s1
L2   for I2 = lb2, ub2, s2
...
Ln       for In = lbn, ubn, sn
S1         ARR[sub1, sub2, ..., subm] = ...

Figure 2.1. Sample code: general form of a nested loop.

As an example, the source code given in Figure 2.1 has the loop bound set:

  B = {(lb_1, ub_1, s_1), (lb_2, ub_2, s_2), ..., (lb_n, ub_n, s_n)}.

The first vector in B is b_1 = (lb_1, ub_1, s_1) and refers to the outermost loop. The last vector in B is b_n = (lb_n, ub_n, s_n) and refers to the innermost loop. A restriction on the form of the lower bounds, upper bounds, and step sizes in the loop nest is that each lb_i and ub_i, 1 ≤ i ≤ n, is a linear function of the surrounding index variables I_j, such that j < i, and that s_i is an integer constant. Then each lb_i is in the form:

  lb_i = a_{i,1}I_1 + a_{i,2}I_2 + ... + a_{i,j}I_j + c_i,
where j < i, a_{i,j} is the coefficient of index variable I_i with respect to I_j, and c_i is an integer constant. Similarly, each ub_i is in the form:

  ub_i = b_{i,1}I_1 + b_{i,2}I_2 + ... + b_{i,j}I_j + d_i,

where j < i, b_{i,j} is the coefficient of index variable I_i with respect to I_j, and d_i is an integer constant. As an example, suppose that we have a loop in the form "for I3 = 2I1 + 6, 8I2, m"; then:

  a_{3,1} = 2;  b_{3,1} = 0;  s_3 = m;
  a_{3,2} = 0;  b_{3,2} = 8;
  c_3 = 6;      d_3 = 0.

Subscript Expression Notation

In a loop nest, S_i refers to an entire assignment statement. If ARR is the name of an m-dimensional array, then each sub_i, 1 ≤ i ≤ m, is referred to as either an array subscript, a subscript expression, or a reference to index variables. Note that m does not necessarily equal n. That is, the loop nest depth and the dimension of the array are not necessarily the same. An intuitive representation of an array ARR uses array index function form [24]. An m-dimensional array ARR in an n-nested loop is in the form ARR[sub1, sub2, ..., subm], where each sub_i is a linear function of the loop index variables of the loops plus a constant term. The subscript expressions in array ARR are represented in the form:

  | H_{1,1} ... H_{1,n} |   | I_1 |   | c_1 |
  | H_{2,1} ... H_{2,n} | × | I_2 | + | c_2 |
  |   ...        ...    |   | ... |   | ... |
  | H_{m,1} ... H_{m,n} |   | I_n |   | c_m |

where the m × n matrix H is called the array index function. In general, the representation of an array in terms of an array index function is in the form [H][I] + [c], and is called array index function form. As an example, the assignment statement S1 shown in Figure 2.2 has the array index function form shown with the source code.

L1 for I1 = 1, ub1
L2   for I2 = 1, ub2
S1     ARR[3I1, I1 + 2I2 + 3, 4] = ...

  | 3 0 |            | 0 |
  | 1 2 | × | I1 | + | 3 |
  | 0 0 |   | I2 |   | 4 |

Figure 2.2. Sample code and array index function for statement S1.

Two array references have uniformly generated references if they refer to the same array and have the same array index function H. A uniformly generated set is an equivalence class, such that any two members of the set have uniformly generated references. There is little potential for reuse in an array that has non-uniformly generated references [24]. That is, arrays with the same name that also have the same array index function H will likely have data locality that can be exploited. If all H_{i,j} and c_i for a subscript expression sub_i are integer constants, then sub_i is linear. Otherwise, sub_i is nonlinear. If all subscript expressions in an m-dimensional array ARR are linear, then ARR is linear. If some subscript expressions are nonlinear, then ARR is partially linear. If no subscript expressions are linear, then ARR is nonlinear. Finally, if index variable I_i, 1 ≤ i ≤ n, appears in more than one sub_j, 1 ≤ j ≤ m, then ARR has coupled subscripts, where n is the loop nest depth and m is the dimension of the array [25].
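To make the array index function concrete, the following sketch (a minimal illustration using NumPy; the helper name is hypothetical) evaluates [H][I] + [c] for the reference in Figure 2.2 and reproduces its subscripts at one iteration.

    import numpy as np

    # Array index function for ARR[3*I1, I1 + 2*I2 + 3, 4] (Figure 2.2):
    # row i of H holds the index-variable coefficients of subscript i.
    H = np.array([[3, 0],
                  [1, 2],
                  [0, 0]])
    c = np.array([0, 3, 4])

    def subscripts(H, c, iteration):
        """Evaluate [H][I] + [c] at one iteration point I."""
        return H @ np.asarray(iteration) + c

    # At iteration (I1, I2) = (2, 5) the statement references ARR[6, 15, 4].
    assert tuple(subscripts(H, c, (2, 5))) == (6, 15, 4)

Because uniformly generated references share the same H, a compiler can group references into uniformly generated sets by comparing these matrices directly.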
2.2 Perfect and Imperfect Loop Nests

A loop nest hierarchy is a representation of the structure of a nested loop using a rooted tree, where the parent-child relation is defined in terms of loop nesting and the root is the outermost loop. Specifically, loop L_i is the parent of loop L_j if loop L_i is the next outer level of loops that completely encloses loop L_j. For example, in Figure 2.3(a), L1 is the parent of L2 and L2 is the parent of L3. In Figure 2.3(b), L4 is the parent of L5 and L5 is the parent of L6. Finally, in Figure 2.3(c), L7 is the parent of both L8 and L9. Note that in the source code in Figures 2.3(b) and 2.3(c), there are assignment statements between the respective for's and the endfor's. (The endfor's are included in Figures 2.3(a) through 2.3(c) for clarity. In other source code samples in this dissertation, the location of the endfor's is implied.) Two loops may have the same parent, as shown in Figure 2.3(c); however, no loop may have more than one parent.

Two loops are perfectly nested if there are no statements between their opening for statements, nor between their respective closing endfor statements. A loop nest is perfectly nested if each parent-child pair in the loop nest hierarchy is perfectly nested. Note that there is a difference between perfect nesting of two loops and perfect nesting of an entire loop nest. A loop nest that is not perfectly nested is imperfectly nested. For example, L1 and L2 are perfectly nested, as are L2 and L3. Therefore, the entire loop nest in Figure 2.3(a) is perfectly nested. L4 and L5 are imperfectly nested in Figure 2.3(b); however, L5 and L6 are perfectly nested. Therefore, the loop nest in Figure 2.3(b) is imperfectly nested. Similarly, L8 and L9 are imperfectly nested in L7, and the entire loop nest is imperfectly nested in Figure 2.3(c).

L1 for I1 ...
L2   for I2 ...
L3     for I3 ...
S1       var1 = ...
S2       var2 = ...
S3       var3 = ...
       endfor
     endfor
   endfor

Figure 2.3(a). Sample code and loop nest hierarchy: perfect nesting.

L4 for I1 ...
S1   var1 = ...
L5   for I2 ...
L6     for I3 ...
S2       var2 = ...
S3       var3 = ...
       endfor
     endfor
   endfor

Figure 2.3(b). Sample code and loop nest hierarchy: imperfect-only-child nesting.

L7 for K1 ...
S1   var1 = ...
L8   for K2 ...
S2     var2 = ...
     endfor
L9   for K3 ...
S3     var3 = ...
     endfor
   endfor

Figure 2.3(c). Sample code and loop nest hierarchy: imperfect-sibling nesting.

The taxonomy for perfect and imperfect loop nests is derived from the form of the loop nest hierarchy. Perfect loop nests (Figure 2.3(a)) are rooted trees of degree one with no statements between opening for's and closing endfor's of any parent-child loops. Imperfect-only-child loop nests (Figure 2.3(b)) are rooted trees of degree one with at least one assignment statement between a parent-child pair of for's or between a parent-child pair of endfor's. Thus, statement S1 in Figure 2.3(b) makes the loop nest imperfect-only-child. Finally, imperfect-sibling loop nests (Figure 2.3(c)) are rooted trees with degree greater than one. The relationship between these three different forms of loop nests is shown in Figure 2.4.

[Figure 2.4 depicts three nested regions: perfect loop nests contained within imperfect-only-child loop nests, contained within imperfect-sibling loop nests.]

Figure 2.4. Relationship of perfect and imperfect loop nests.

The applicability of unimodular transformations is dependent upon the distinction between perfect, imperfect-only-child, and imperfect-sibling loop nests. That is, currently unimodular transformations are applied only to perfect loop nests. This research shows how to apply transformation matrices to imperfect-only-child loop nests, and how to transform many imperfect-sibling loop nests into imperfect-only-child loop nests.

2.3 Dependences

In any assignment statement, the variable on the left-hand side (LHS) is being defined (def), and any variables on the right-hand side (RHS) are being used (use). The three types of dependences, flow, anti, and output, are defined in terms of the chains of defs and uses that result in a dependence. Flow dependence is defined as a def-use chain; anti dependence is defined as a use-def chain; and output dependence is defined as a def-def chain. Symbols representing flow, anti, and output dependence are δ^f, δ^a, and δ^o, respectively, where δ represents one or more occurrences of these types. That is, if the relation δ^f, δ^a, or δ^o holds, then δ holds. Source code that illustrates each type of dependence is given in Figure 2.5: S5 is flow dependent upon S1 due to A (S1 δ^f S5); S3 is output dependent upon S2 due to B (S2 δ^o S3); S4 is anti dependent upon S3 due to G (S3 δ^a S4); and S2 and S3 are anti dependent upon S1 due to B (S1 δ^a S2, S1 δ^a S3). Note that the order of the statements in the notational description implies the order in the execution of the source code. That is, S1 δ^f S5 implies that statement S1 must finish execution before statement S5 begins execution.

S1 A = A + B
S2 B = C + D
S3 B = G + I
S4 G = G + 1
S5 H = A + K

Figure 2.5. Sample code containing scalar flow, anti, and output dependence.
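The def-use view of dependences lends itself to a direct implementation. The sketch below (illustrative only, not the dissertation's dependence test) scans a straight-line statement list and reports every flow (def-use), anti (use-def), and output (def-def) pair; running it on Figure 2.5 reproduces the five dependences listed above.

    # Each statement: (label, defined variable, set of used variables).
    stmts = [("S1", "A", {"A", "B"}),
             ("S2", "B", {"C", "D"}),
             ("S3", "B", {"G", "I"}),
             ("S4", "G", {"G"}),
             ("S5", "H", {"A", "K"})]

    def scalar_dependences(stmts):
        deps = []
        for i, (si, di, ui) in enumerate(stmts):
            for sj, dj, uj in stmts[i + 1:]:
                if di in uj:      # def in si, later use in sj: flow
                    deps.append((si, "flow", sj))
                if dj in ui:      # use in si, later def in sj: anti
                    deps.append((si, "anti", sj))
                if di == dj:      # def in si, later def in sj: output
                    deps.append((si, "output", sj))
        return deps

    print(scalar_dependences(stmts))
    # [('S1', 'anti', 'S2'), ('S1', 'anti', 'S3'), ('S1', 'flow', 'S5'),
    #  ('S2', 'output', 'S3'), ('S3', 'anti', 'S4')]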
Symbols representing flow, anti, and output dependence are 6Jf , 6°, and 6", respectively, where 6 represents one or more occurrences of these types. That is, if the relation 6’, 6“, or 6° holds, then 6 holds. Source code that illustrates each type of dependence is given in Figure 2.5: 5'5 is flow dependent upon 5'1 due to A (51 6f 55); 53 is output dependent upon 5'2 due to B (5'2 6" S3); S4 is anti dependent upon 53 due to G (53 6“ S4); and $2 and 33 are anti dependent upon 51 due to B (51 6“ $2, $1 6“ 53). Note that the order of the statements in the notational 12 description implies the order in the execution of the source code. That is, $1 6! 55 implies that statement SI must finish execution before statement 35 begins execution. 51 A=A+B S: B=C+D $3 B=G+I $4 G=G+1 35 H=A+K Figure 2.5. Sample code containing scalar flow, anti, and output dependence. Dependences can be represented in two forms: distance vectors and direction vectors. In distance vectors, the components are integral, whereas in direction vectors, the components are symbolic (<, =, and >). A distance vector set D is a set representation of the dependences within a body of source code. Each member of the set is called a distance vector d: Finally, each component of a distance vector d,- is a distance. For an n-nested loop, d1 is the outermost loop, dn is the innermost loop, and d,-, 1 < i < n, refers to the respective loops between L1 and Ln. For example, the distance vector set for the loop nest in Figure 2.6 is: D = {(0,1)r(190)v(1’ _1)’(0’2)’(0’3)}’ where the first element of D, (0,1) is the distance vector for the output dependence between 32 and $1; the second element (1,0) is the distance vector for the anti dependence between 53 and 51; the third element of D, (1,-1) is the distance vector for the anti dependence between $4 and S]; the fourth element of D, (0,2) is the distance vector for the anti dependence between $1 and itself; and the fifth element of D, (0,3) is the distance vector for the anti dependence between $2 and 51. A symbolic direction in a dependence vector describes the temporal nature of depen- dences between statements. The integer distances in a distance vector explicitly define the 13 L1 for [1 =1,” L: for I; = 1,n 51 l[11,12]=1[11,12+2]+B[ll,12]+0[11.12] S: A[11,12 - I] = var; S3 BU1 - 1,12] = var; $4 C[I1 - 1,12 + I] = vars Figure 2.6. Sample code containing array flow, anti, and output dependences. distance in terms of iteration space between uses of a variable, and distance vectors are used in this dissertation to represent dependence information. It should be noted that di- rection vectors can be derived from distance vectors, but not vice-versa. As an example, the distance vector (2,-3,0) is equivalent to the direction vector (<, >, =). 2.4 Properties of Dependences This section formalizes definitions related to dependence vectors and dependence vector sets. A distance vector d in an n-nested loop is in the form (d1, . . .,d,-, . . .,d,,). A loop- independent dependence is one in which all of the distances in the distance vector are zero: (Vi:1$i§n:d;=0). A loop-carried dependence is one in which at least one distance is non-zero: (3j:lgjgn:dj¢0). In a loop-carried dependence, the dependence is carried at the loop depth of the first non- zero distance. That is, if d,- is the first non-zero distance in d: then the dependence is carried at depth j. Each dependence relation exists from a statement to a statement of source code. 
An intuitive representation of the dependence relations uses a directed graph G'(V, E) called a dependence graph. The vertices V are the statements in the source code and the edges E 14 are the dependences between the statements. Figure 2.7 is the dependence graph for the source code in Figure 2.6, where the are points in the direction of the dependence. (2,0,0) (1,0,0) (0,0,2) (0,3,0) 0 2 1 9 9 0—K v.“ 6:) (.-.,, 4x89 Figure 2.7. Dependence graph (depth = 1). Dependence graphs are also defined in terms of depths of the carried dependences. A dependence graph at depth i contains all of the dependences that are carried at depth i or greater. The depth 1 dependence graph represents all of the dependences in the source code. As an example, the depth 1 dependence graph for the source code shown in Figure 2.6 is given in Figure 2.7. The depth 2 dependence graph contains all dependences that are carried at a depth of 2 or greater. Thus, the distance vectors {(2,0,0),(1,0,0),(1,-1,0)} are eliminated from the dependence graph in Figure 2.8. Similarly, the depth 3 dependence graph is shown in Figure 2.9. v l) (0,2,1) (0,0,2) (0,3,0) it 1% Figure 2.8. Dependence graph (depth = 2). 15 GD @ (0,0,2) v ® @ Figure 2.9. Dependence graph (depth 2 3). A lexicographically positive distance vector, denoted d > 0, has some positive distance within the distance vector and all distances preceding the positive distance are nonnegative: d>0::((3izlgign:dg>0) /\ (Vj:1$j 0‘)v(v1m'g kgjzdk 2 0)). (2.1) That is, either the distance vectors are lexicographically positive preceding location i, or are non-negative in the range from d,- to dj. Fully permutable distance vectors are significant because, as the name implies, any permutation of loops in the range L,- through LJ- results in a legal transformation. Suppose that we have a predicate fully_perm(d., i, j) that evaluates whether distance vector cf is fully permutable in the range of d,- through dj. Evaluation of fully_perm(d: i, j) on d=(0,1,-3,2), 1 S i < j S 11 results in the following table: l6 i j fully_perm 1 2 yes 1 3 no 1 4 no 2 3 no 2 4 no 3 4 yes Typically, the objective of a loop transformation sequence is to maximize parallelism, and parallelism is determined based upon properties of the distance vectors d in the distance vector set D [4]. Consider a loop nest of depth n, with lexicographically positive distance vectors (VJ: d6 D : d.» (I). A loop L,- can be parallelized if either the distance vectors are lexicographically positive preceding location i, or the distance at location i is 0. That is, the i“ loop is parallelizable if and only if: (vi: J'e D : ((d1,...,d,-_1) > 0') v (d,- = 0)). This result is significant because it describes a property for determining available par- allelism based only upon the form of the distance vectors in D. That is, parallelism is determined from the abstract representation of the source code in terms of the dependence vector set. The criteria for vectorization relies more heavily on the dependence graph G'(V, E) and is less restrictive than the criteria for parallelization. That is, source code that cannot be parallelized may be vectorizable, but not vice-versa. The criteria for vectorization is dependent upon cycles in the dependence graph. Specifically, loop L,- can be vectorized in an assignment statement as long as there are no cycles in the dependence graph carried at depth i. 
Typically, the objective of a loop transformation sequence is to maximize parallelism, and parallelism is determined based upon properties of the distance vectors d in the distance vector set D [4]. Consider a loop nest of depth n, with lexicographically positive distance vectors (∀d : d ∈ D : d ≻ 0). A loop L_i can be parallelized if either the distance vectors are lexicographically positive preceding location i, or the distance at location i is 0. That is, the i-th loop is parallelizable if and only if:

  (∀d : d ∈ D : ((d_1, ..., d_{i-1}) ≻ 0) ∨ (d_i = 0)).

This result is significant because it describes a property for determining available parallelism based only upon the form of the distance vectors in D. That is, parallelism is determined from the abstract representation of the source code in terms of the dependence vector set. The criteria for vectorization rely more heavily on the dependence graph G(V, E) and are less restrictive than the criteria for parallelization. That is, source code that cannot be parallelized may be vectorizable, but not vice-versa. The criteria for vectorization depend upon cycles in the dependence graph. Specifically, loop L_i can be vectorized in an assignment statement as long as there are no cycles in the dependence graph carried at depth i. As an example, loop L1 of statements S1, S2, S3, and S4 cannot be vectorized because all of the statements are in cycles in the depth 1 dependence graph shown in Figure 2.7. Loop L2 can be vectorized for statements S1 and S3 because they are not in a cycle in the depth 2 dependence graph shown in Figure 2.8. However, statements S2 and S4 cannot be vectorized. All statements can be vectorized at depth 3, because none are in a dependence cycle, as shown in Figure 2.9.

2.5 Unimodular Transformations

The objectives of the unimodular approach are: 1) to represent loop permutation, reversal, and skewing as matrices; and 2) to transform initial source code into some final source code that is semantically equivalent to the original source code, but exhibits a desirable property (parallelism, data locality, etc.) [3, 4, 9, 26]. All transformations are represented as operations between a transformation matrix T and the dependence vectors d in the dependence vector set D. A unimodular matrix T has the following properties, where I is the set of all integers, and T[i,j] refers to the element of T in the i-th row and j-th column:

1. T is an n × n matrix;
2. (∀i : 1 ≤ i ≤ n : (∀j : 1 ≤ j ≤ n : T[i,j] ∈ I));
3. |det T| = 1.

This definition states that T is square; all components of T are integers; and the absolute value of the determinant of T is 1. Three kernel loop transformations represented by unimodular matrices are:

Permutation: interchanging two loops;
Reversal: modifying a loop to increment from the negation of the upper bound to the negation of the lower bound;
Skewing: adding an integer function of an outer loop to the indices of an inner loop.

The matrices that represent permutation, reversal, and skewing are all simple modifications to the identity matrix I. The following are examples of each kernel transformation matrix applied to the distance vector d = (1, 3, -2, 0), as shown in the source code in Figure 2.10.

L1 for I1 = lb1, ub1
L2   for I2 = lb2, ub2
L3     for I3 = lb3, ub3
L4       for I4 = lb4, ub4
S1         A[I1, I2, I3, I4] = ...
S2         A[I1-1, I2-3, I3+2, I4] = ...
/* D = {(1, 3, -2, 0)} */

Figure 2.10. Sample code: before unimodular transformation.

The 4 × 4 matrix t_p shown in Figure 2.11(a) is used to represent loop permutation of loops L1 and L2. Note that the original lower and upper bounds of loops L1 and L2 (Figure 2.10) are swapped by loop interchange, as shown in Figure 2.11(b). Also, the distance vector set after loop permutation of loops L1 and L2 is t_p × d = (3, 1, -2, 0).

  | 0 1 0 0 |
  | 1 0 0 0 |
  | 0 0 1 0 |
  | 0 0 0 1 |

Figure 2.11(a). Unimodular matrix t_p to permute loops L1 and L2.

L1 for J1 = lb2, ub2
L2   for J2 = lb1, ub1
L3     for J3 = lb3, ub3
L4       for J4 = lb4, ub4
S1         A[J2, J1, J3, J4] = ...
S2         A[J2-1, J1-3, J3+2, J4] = ...
/* D = {(3, 1, -2, 0)} */

Figure 2.11(b). Sample code: after loop permutation of loops L1 and L2.

In order to reverse loop L3, the transformation matrix t_r is used as shown in Figure 2.12(a), where t_r × d = (1, 3, 2, 0). Note the differences between the loop bound expressions and array subscripts for index variable J3 in Figures 2.10 and 2.12(b).

  | 1 0  0 0 |
  | 0 1  0 0 |
  | 0 0 -1 0 |
  | 0 0  0 1 |

Figure 2.12(a). Unimodular matrix t_r to reverse loop L3.

L1 for J1 = lb1, ub1
L2   for J2 = lb2, ub2
L3     for J3 = -ub3, -lb3
L4       for J4 = lb4, ub4
S1         A[J1, J2, -J3, J4] = ...
S2         A[J1-1, J2-3, -J3+2, J4] = ...
/* D = {(1, 3, 2, 0)} */

Figure 2.12(b). Sample code: after loop reversal of loop L3.

Finally, the matrix shown in Figure 2.13(a) represents skewing loop L2 with respect to loop L1 by a factor of 2. The dependence vector is multiplied by t_s, where t_s × d = (1, 5, -2, 0).

  | 1 0 0 0 |
  | 2 1 0 0 |
  | 0 0 1 0 |
  | 0 0 0 1 |

Figure 2.13(a). Unimodular matrix t_s to skew loop L2 by 2L1.

L1 for J1 = lb1, ub1
L2   for J2 = lb2 + 2J1, ub2 + 2J1
L3     for J3 = lb3, ub3
L4       for J4 = lb4, ub4
S1         A[J1, J2 - 2J1, J3, J4] = ...
S2         A[J1-1, J2 - 2J1 - 3, J3+2, J4] = ...
/* D = {(1, 5, -2, 0)} */

Figure 2.13(b). Sample code: after loop skewing of loop L2 by 2L1.
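The three kernel matrices are mechanical to generate and to check. The following sketch (NumPy; helper names are illustrative) builds the matrices of Figures 2.11(a), 2.12(a), and 2.13(a), verifies that each is unimodular, and applies each to d = (1, 3, -2, 0).

    import numpy as np

    def permute(n, i, j):
        """Identity with rows i and j (1-based) swapped."""
        t = np.eye(n, dtype=int)
        t[[i - 1, j - 1]] = t[[j - 1, i - 1]]
        return t

    def reverse(n, i):
        """Identity with entry (i, i) negated."""
        t = np.eye(n, dtype=int)
        t[i - 1, i - 1] = -1
        return t

    def skew(n, i, j, f):
        """Identity plus factor f at row j, column i: skew L_j by f * I_i."""
        t = np.eye(n, dtype=int)
        t[j - 1, i - 1] = f
        return t

    def unimodular(t):
        return t.shape[0] == t.shape[1] and abs(round(np.linalg.det(t))) == 1

    d = np.array([1, 3, -2, 0])
    for t in (permute(4, 1, 2), reverse(4, 3), skew(4, 1, 2, 2)):
        assert unimodular(t)
        print(t @ d)   # [3 1 -2 0], [1 3 2 0], [1 5 -2 0]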
L1 for J1 =lb1,ubl L2 for J2 = lb2 + 211,1102 + 2.11 l. 0 0 0 L3 for 13 = lbs, 003 t _ 2 l 0 0 L4 for J4 = “4,1104 ._ 0 0 I 0 S1 A[J1,J2-2J1,J3,J4]=... 0 0 0 1 $2 A[J1-l,Jz-3—2J1,Ja+2,.]4]=... /:u I'D = [(1, 5, --2,0)} t s/ Figure 2.13(a). Unimodular matrix to skew loop Figure 2.13(b). Sample code: after loop skewing L2 by 2L1. of loop L2 by 2L1. In this dissertation, a superscript is sometimes used on the distance vector set D and the transformation matrix T for clarity purposes. A superscript on the distance vector set D has the following interpretation: 0 Dan; the initial distance vector set for the source code before the transformation; 0 Dlx'll: the transformed distance vector set before the x“ transformation; 0 D”): the transformed distance vector set after the x“ transformation; 0 Dm): the final, transformed distance vector set. Similarly, superscripts on the transformation matrix T have the following interpretation: 0 Tlo): the identity matrix; 0 T("'1): the transformation matrix before the x“ transformation; 0 Tm: the transformation matrix after the x“ transformation; 0 Tm): the final transformation matrix; 0 to”): the elementary transformation matrix representing the 2:“ elementary operation. 20 The relation between D“) and TU) is as follows: 0 Tl”) = t(”) x T(""'1): the matrix representing the first x transformations is the product of the x"‘ elementary transformation matrix and the matrix representing the first x -1 transformations; 0 DH") 2 Tl”) X Dlo): the transformed distance vector set after x transformations is the product of the transformation matrix representing the first x transformations and the initial distance vector set; 0 D(”) = t”) x D(’"1): the transformed distance vector set after x transformations is the product of the x“ elementary transformation matrix and the distance vector set after x — 1 transformations. Unimodular matrices have a composition property; that is, sequences of unimodular transformations are also unimodular. As an example, a sequence of loop reversal, permu- tation, and skewing using the matrices shown in Figures 2.11(a), 2.12(a), and 2.13(a) may be composed into one transformation matrix T(3) = t(3) x tm X t“): 2 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 TB) = 1 0 0 0 = 2 1 0 0 x 1 0 0 0 X 0 1 0 0 0 0 —1 0 0 0 l 0 0 0 1 0 0 0 —1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 The distance vector set can be transformed in one of two ways: after each kernel trans- formation (D(’) = tlf) x db“) 6 D("1)); or after the final Tm is found thatrepresents the entire transformation sequence (D(‘”) = T“) X (1(0) 6 D(°)). Continuing with the earlier example, D(3) = T(3) x dlo) 6 BM: 2100 1 5 10 00>< 3_1 00—10 —2 2 0001 0 0 The composition property is significant because a sequence of transformations is represented by a single matrix. Therefore, mapping the distance vector set D and source code need occur only once. 21 2.6 Global Pre- and Postconditions Initially, all dependence vectors JIO) in the dependence vector set 0(0) are lexicographically positive. Formally, this property is described as (leo) : 4'10) 6 0(0) : dlo) >- 0). This condition is referred to as the global precondition, because it is true for all distance vectors dlo) before any transformations are applied. The global postcondition must be true after all unimodular transformations have been applied, and asserts that all transformed dependence vectors (din) 6 0(9)) must be lexico- graphically positive. This condition is formally described as (Vdm) : J19) 6 0(9) : dln) > 0), where 0(9) = Tm) x (1'10) 6 Dm). 
2.6 Global Pre- and Postconditions

Initially, all dependence vectors d^(0) in the dependence vector set D^(0) are lexicographically positive. Formally, this property is described as:

  (∀d^(0) : d^(0) ∈ D^(0) : d^(0) ≻ 0).

This condition is referred to as the global precondition, because it is true for all distance vectors d^(0) before any transformations are applied. The global postcondition must be true after all unimodular transformations have been applied, and asserts that all transformed dependence vectors (d^(g) ∈ D^(g)) must be lexicographically positive. This condition is formally described as:

  (∀d^(g) : d^(g) ∈ D^(g) : d^(g) ≻ 0), where D^(g) = T^(g) × d^(0), d^(0) ∈ D^(0).

Requiring lexicographically positive distance vectors after loop transformation ensures that the source code after transformation is semantically equivalent to the code before transformation. The global pre- and postconditions are significant because they govern the legality of a sequence of unimodular transformations. That is, all d^(0) ∈ D^(0) are initially lexicographically positive. Any sequence of transformations, represented by T^(g), can be legally applied to the source code, as long as all transformed distance vectors (T^(g) × d^(0), d^(0) ∈ D^(0)) are lexicographically positive.

2.7 Transforming Source Code

Wolf and Lam [4] describe an approach for mapping source code from initial to final form, given a transformation matrix T. The approach is divided into two portions: mapping assignment statements and mapping loop bounds. Mapping loop bounds is further subdivided into a four-step algorithm. Both assignment statement mapping and loop bound mapping are described below. Mapping assignment statements requires the inverse of the transformation matrix (T^(-1)), which is used to determine the appropriate linear combination of transformed loop indices to replace the initial loop indices. In matrix form, the following multiplication is used to determine new assignment statements, where the I_i are initial loop indices and the J_j are new loop indices:

  | I_1 |            | J_1 |
  | ... | = T^(-1) × | ... |
  | I_n |            | J_n |

It is also possible to determine the new loop indices in terms of the old loop indices in matrix form:

  | J_1 |       | I_1 |
  | ... | = T × | ... |
  | J_n |       | I_n |

and then solve for each I_i using algebraic substitution. As described later, a T representing the extended transformations may be non-square, and T^(-1) is undefined for non-square matrices. Therefore, algebraic substitution, rather than direct matrix-vector multiplication, is used to solve for I_i in terms of J_j. As an example, the assignment statements from the source code shown in Figure 2.10 are mapped as shown in Figure 2.14 using the transformation matrix T above. The bracketed expressions represent the algebraic solution of the initial index variables I_i in terms of the transformed index variables J_j. The final step is mapping the transformed assignment statements from the initial assignment statements.

  | 2 1  0 0 |   | I1 |   | J1 |
  | 1 0  0 0 | × | I2 | = | J2 |
  | 0 0 -1 0 |   | I3 |   | J3 |
  | 0 0  0 1 |   | I4 |   | J4 |

  J1 = 2I1 + I2      [I1 = J2]
  J2 = I1            [I2 = J1 - 2J2]
  J3 = -I3           [I3 = -J3]
  J4 = I4            [I4 = J4]

  A[I1, I2, I3, I4] ⇒ A[J2, J1 - 2J2, -J3, J4]
  A[I1 - 1, I2 - 3, I3 + 2, I4] ⇒ A[J2 - 1, J1 - 2J2 - 3, -J3 + 2, J4]

Figure 2.14. Transformation of assignment statements.
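For an invertible T, the substitution I = T^(-1) × J can be automated with symbolic algebra. The sketch below uses SymPy (an illustration; any computer algebra system would do) and recovers the bracketed solutions of Figure 2.14.

    import sympy as sp

    I1, I2, I3, I4 = sp.symbols("I1 I2 I3 I4")
    J = sp.Matrix(sp.symbols("J1 J2 J3 J4"))
    T = sp.Matrix([[2, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, -1, 0],
                   [0, 0, 0, 1]])

    # Initial indices in terms of transformed indices: I = T^(-1) * J.
    subs = dict(zip((I1, I2, I3, I4), T.inv() * J))
    print(subs)   # {I1: J2, I2: J1 - 2*J2, I3: -J3, I4: J4}

    # Rewrite the subscripts of A[I1-1, I2-3, I3+2, I4].
    old = [I1 - 1, I2 - 3, I3 + 2, I4]
    print([sp.expand(e.subs(subs)) for e in old])
    # [J2 - 1, J1 - 2*J2 - 3, 2 - J3, J4]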
Mapping the loop bounds requires a four-step algorithm that is shown in Figure 2.15. The first and second steps are to extract inequalities, minima, and maxima from the initial source code. Step 3 transforms the indices, given the transformation matrix T; the inequalities, minima, and maxima are also transformed based upon the new loop indices. Finally, the new loop bounds are calculated based upon the inequalities, minima, and maxima determined in Step 3.

Algorithm map_bounds
Maps the loop bounds from the initial index variables (I_1, ..., I_n) to the transformed index variables (J_1, ..., J_n).
Input:
  I_1, ..., I_n  /** the initial index variables **/
  B              /** the loop bound set for the initial index variables **/
  T              /** the transformation matrix **/
Output:
  J_1, ..., J_n  /** the transformed index variables **/
  B'             /** the loop bound set for the transformed index variables **/
Procedure:
1. Extract inequalities. /** Inequalities are derived from the upper and lower bounds of each loop index variable. **/
2. Find the absolute minimum and maximum for each initial index variable. /** Minima and maxima for each index variable are based upon the inequalities from Step 1. **/
3. Transform the indices using T. /** The transformation matrix T is used to map the initial index variables into transformed index variables. Also, the transformed index variables are substituted into the inequalities, maxima, and minima calculated in Steps 1 and 2. **/
4. Calculate the new loop bounds. /** The new loop bounds for the final index variables are determined based upon the inequalities and minima and maxima from Step 3. **/

Figure 2.15. Algorithm to map the loop bounds.

The loop bounds for the source code shown in Figure 2.10 are mapped as shown in Figure 2.16, assuming that the lower bounds for all loops are one (∀i : 1 ≤ i ≤ 4 : lb_i = 1). The bracketed expressions represent the algebraic solutions to the inequalities in terms of the transformed index variables. The transformed inequalities are used to determine the loop bounds of the transformed index variables (J_1, ..., J_4).

Step 1. Extract inequalities:
  I1 ≥ 1      I2 ≥ 1      I3 ≥ 1      I4 ≥ 1
  I1 ≤ ub1    I2 ≤ ub2    I3 ≤ ub3    I4 ≤ ub4

Step 2. Find maxima and minima:
  min(I1) = 1      min(I2) = 1      min(I3) = 1      min(I4) = 1
  max(I1) = ub1    max(I2) = ub2    max(I3) = ub3    max(I4) = ub4

Step 3. Transform the indices (using the results from assignment statement mapping):
  I1 = J2              I2 = J1 - 2J2              I3 = -J3            I4 = J4
  J2 ≥ 1               J1 - 2J2 ≥ 1               -J3 ≥ 1             J4 ≥ 1
  J2 ≤ ub1             J1 - 2J2 ≤ ub2             -J3 ≤ ub3           J4 ≤ ub4
  min(J2) = 1          min(J1 - 2J2) = 1          min(-J3) = 1        min(J4) = 1
  max(J2) = ub1        max(J1 - 2J2) = ub2        max(-J3) = ub3      max(J4) = ub4

Step 4. Calculate new loop bounds:
  J1 - 2J2 ≥ 1    →  J1 ≥ 1 + 2J2    →  J1 ≥ 1 + 2 min(J2)         [J1 ≥ 3]
  J1 - 2J2 ≤ ub2  →  J1 ≤ ub2 + 2J2  →  J1 ≤ ub2 + 2 max(J2)       [J1 ≤ ub2 + 2ub1]
  J1 - 2J2 ≤ ub2  →  J2 ≥ (J1 - ub2)/2;   J1 - 2J2 ≥ 1  →  J2 ≤ (J1 - 1)/2
  -ub3 ≤ J3 ≤ -1;   1 ≤ J4 ≤ ub4

Final code:
L1 for J1 = 3, ub2 + 2ub1
L2   for J2 = max(1, ⌈(J1 - ub2)/2⌉), min(ub1, ⌊(J1 - 1)/2⌋)
L3     for J3 = -ub3, -1
L4       for J4 = 1, ub4
S1         A[J2, J1 - 2J2, -J3, J4] = ...
S2         A[J2 - 1, J1 - 2J2 - 3, -J3 + 2, J4] = ...

Figure 2.16. Transformation of loop bounds.

The complexity of the source code transformation is dependent upon the loop nest depth n, the number of assignment statements m, and the number of inequalities derived from the source code q. Mapping assignment statements has the following complexity:

- matrix multiply: O(n²);
- search and replace occurrences: O(n(n + m)) = O(n² + nm), assuming that there is an index variable for each loop;
- total: O(n² + nm).

Mapping the loop bounds has the following complexity:

- Step 1, extract inequalities: O(q);
- Step 2, find minima and maxima: O(n);
- Step 3, transform loop bound indices: O(n² + q), a matrix multiply and a scan through the inequalities;
- Step 4, calculate new loop bounds: O(nq), since for each index variable the set of inequalities is searched;
- total: O(nq), since there are always at least two inequalities from each loop.

Note that the code is always transformed using this method; thus, the average and worst case complexity are equivalent. Also, the code is transformed only once, after the final transformation matrix T is found.
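Algorithm map_bounds is easiest to trust when its output can be checked mechanically. The sketch below (illustrative, with small fixed bounds) enumerates the original iteration space of Figure 2.10, maps every point through T, and confirms that the transformed loop nest of Figure 2.16 visits exactly the same set of points.

    import itertools, math
    import numpy as np

    T = np.array([[2, 1, 0, 0], [1, 0, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])
    ub1, ub2, ub3, ub4 = 3, 4, 2, 2   # example bounds (all lb = 1)

    # Original iteration space, mapped once through T.
    original = {tuple(T @ np.array(p))
                for p in itertools.product(range(1, ub1 + 1),
                                           range(1, ub2 + 1),
                                           range(1, ub3 + 1),
                                           range(1, ub4 + 1))}

    # Transformed loop nest of Figure 2.16.
    transformed = set()
    for J1 in range(3, ub2 + 2 * ub1 + 1):
        lo = max(1, math.ceil((J1 - ub2) / 2))
        hi = min(ub1, (J1 - 1) // 2)
        for J2 in range(lo, hi + 1):
            for J3 in range(-ub3, 0):            # -ub3 .. -1
                for J4 in range(1, ub4 + 1):
                    transformed.add((J1, J2, J3, J4))

    assert original == transformed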
CHAPTER 3

Kernel Transformations and Loop Normalization

This chapter briefly describes how the kernel set of loop transformations is applied to imperfect loop nests. This chapter also describes loop normalization, a transformation that modifies a loop L_i such that the lower bound lb_i and step size s_i are both one.

3.1 Kernel Transformations

The kernel set of transformations includes loop reversal, permutation, and skewing. The legality and applicability of the kernel set are well understood in the context of perfectly nested loops. This section discusses the legality of the kernel set in terms of imperfect-only-child loop nests. Transformation matrices cannot be applied to imperfect-sibling loop nests; however, many imperfect-sibling loop nests can be transformed into imperfect-only-child loop nests by loop fission (Section 5.1). For simplicity, a doubly-nested loop is used in the following discussion. However, the concepts can be easily generalized to an n-nested loop, where n > 2. The general form of an imperfect-only-child loop nest is shown in Figure 3.1, where f(), g(), and h() are functions of the enclosing index variables. In the following discussion, the predicate normal(B^(x), i) is true if loop L_i is normal in loop bound set B^(x). A loop is normal if both the lower bound and step size are one.

L1 for I1 = lb1, ub1, s1
S1   ARRAY[f(I1)] = ...
L2   for I2 = lb2, ub2, s2
S2     ARRAY[g(I1, I2)] = ...
     endfor
S3   ARRAY[h(I1)] = ...
   endfor

Figure 3.1. Sample code: general form of an imperfectly nested loop.

3.1.1 Loop Reversal

Loop reversal of loop L_i results in a modified loop and all references to the index variable I_i being negated. Specifically, if the initial loop is incremented from 1 to ub_i, then the reversed loop increments from -ub_i to -1. Loop reversal is applied to a single loop. Reversal of the outer loop (L1) of the source code shown in Figure 3.1 results in the source code shown in Figure 3.2(a), and reversal of the inner loop (L2) results in the source code shown in Figure 3.2(b).

L1 for I1 = -ub1, -lb1, s1
S1   ARRAY[f(-I1)] = ...
L2   for I2 = lb2, ub2, s2
S2     ARRAY[g(-I1, I2)] = ...
     endfor
S3   ARRAY[h(-I1)] = ...
   endfor

Figure 3.2(a). Sample code: reversal of loop L1.

L1 for I1 = lb1, ub1, s1
S1   ARRAY[f(I1)] = ...
L2   for I2 = -ub2, -lb2, s2
S2     ARRAY[g(I1, -I2)] = ...
     endfor
S3   ARRAY[h(I1)] = ...
   endfor

Figure 3.2(b). Sample code: reversal of loop L2.

As shown in Figures 3.2(a) and 3.2(b), the assignment statements S1, S2, and S3 have the same number of iterations before and after loop reversal. The loop bounds and corresponding references to the index variables are reversed; however, the source code is semantically equivalent (assuming that the global postcondition is true). Loop reversal is always legal for imperfect-only-child loops. That is, loop reversal can be applied to source code regardless of whether or not it is perfectly nested. All kernel unimodular transformations require that the step size be one prior to transformation. The invariant below corresponds to this requirement.

Preconditions
  Invariant: normal(B^(x-1), i)
  Explicit:  none
  Nesting:   none

3.1.2 Loop Permutation

Loop permutation of loops L_i and L_j results in the interchange of the lower bounds, upper bounds, and step sizes for the two loops. Figure 3.3 shows the permuted source code from Figure 3.1. The arrays in S1 and S3 are functions of the index variable of loop L1. After loop interchange, statements S1 and S3 reference index variable I1 outside of loop L1, so the functions f(I1) and h(I1) have no context. That is, an assignment statement that uses an index variable to calculate a subscript expression must be contained by the loop that increments the index variable.

L2 for I2 = lb2, ub2, s2
S1   ARRAY[f(I1)] = ...
L1   for I1 = lb1, ub1, s1
S2     ARRAY[g(I1, I2)] = ...
     endfor
S3   ARRAY[h(I1)] = ...
   endfor

Figure 3.3. Sample code: illegal permutation of loops L1 and L2.

Loop permutation may result in loop L_j following the assignment statement that references it. Therefore, if two loops L_i and L_j are permuted, then all parent-child pairs of loops in between loops L_i and L_j must be perfectly nested. Again, the invariant for loop permutation is that both loops L_i and L_j are normal. Finally, we assume that loop L_i is initially outermost when loops L_i and L_j are interchanged.

Preconditions
  Invariant: normal(B^(x-1), i) ∧ normal(B^(x-1), j)
  Explicit:  i < j
  Nesting:   (∀k : i ≤ k < j : perfect(L_k, L_{k+1}))

where perfect(L_i, L_{i+1}) is true if loops L_i and L_{i+1} are perfectly nested in each other.
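The nesting precondition can be checked directly on a loop-nest-hierarchy representation. The sketch below uses a hypothetical Loop record (not the dissertation's implementation model) and walks the parent-child chain, testing perfect(L_k, L_{k+1}) for every pair between the two loops being permuted.

    from dataclasses import dataclass

    @dataclass
    class Loop:
        name: str
        child: "Loop | None" = None
        stmts_before_child: int = 0  # between this for and the child's for
        stmts_after_child: int = 0   # between the child's endfor and this endfor

    def perfect(parent: Loop) -> bool:
        """perfect(L_k, L_{k+1}): no statements separate the two for's
        or the two endfor's."""
        return parent.stmts_before_child == 0 and parent.stmts_after_child == 0

    def permutable_nesting(outer: Loop, depth: int) -> bool:
        """Nesting precondition for permuting `outer` with the loop
        `depth` levels inside it: every pair in between must be perfect."""
        cur = outer
        for _ in range(depth):
            if cur.child is None or not perfect(cur):
                return False
            cur = cur.child
        return True

    # Figure 3.1: S1 separates the for's, S3 separates the endfor's.
    l2 = Loop("L2")
    l1 = Loop("L1", child=l2, stmts_before_child=1, stmts_after_child=1)
    print(permutable_nesting(l1, 1))   # False: L1 and L2 may not be permuted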
Therefore, if two loops L,- and L; are permuted, then all parent-child pairs of 29 loops in between loops L; and L,- must be perfectly nested. Again, the invariant for loop permutation is that both loops L; and L,- are normal. Finally, we assume that loop L; is initially outermost when loops L; and L,- are interchanged. Preconditions Invariant normal(B("1),i) A normal(B("'1),j) Explicit i < j Nesting [(Vk : i S k < j : perfect(Lk,Lk+1))l where perfect(L;, L;+1) is true if loops L; and L;+1 are perfectly nested in each other. 3.1.3 Loop Skewing Loop skewing results in the modification of the loop bounds and references to an index variable I ,- with respect to an outer index variable I;. The upper bound and lower bound of loop L j are increased by a linear function m of index variable 1;, and all references to index variable I; inside of the skewed loop are decreased by m x I;. As an example, skewing I2 by 111 times I1 results in the source code shown in Figure 3.4. L1 for 11 = 101,051,.” 31 ARRAY[f(Il)] = L2 forI2 =lb2+mI1,ub2+m11,s2 S2 ARMYLQ(11,12 -m11)] =... endfor S3 ARRAY[h(I1)] = . . . endtor Figure 3.4. Sample code: skew of loop L2 by m x L1. Perhaps surprisingly, loop skewing is like loop reversal in that it can be applied to a target source code regardless of the nesting structure of the code. Loop permutation results 30 in two loops being modified. Loop skewing is like loop reversal because only one loop is modified by a function of an enclosing loop. Therefore, the nesting precondition is true. We assume that loop L; is outermost and the invariant is that both loops are normal. Preconditions Invariant normal(B(“'1), i) A normal(B(”_l),j) Explicit i < j Nesting 3.2 Loop Normalization We have developed a matrix-based representation of the loop normalization transformation [13, 14, 15]. Loop normalization is a transformation that changes the bounds of a loop, such that the lower bound and step size are both one. The transformation is important because normalized loops are a precondition for many kernel and extended loop transformations. For example, the step size of a loop must equal one before any kernel transformations are applied, and fully normalized loops make loop blocking and collapsing easier to implement. Also, loop normalization may decrease the cardinality of the distance vector set D, resulting in greater opportunities for parallelism. An example of code that is not normalized is shown in Figure 3.5(a), where neither the lower bounds nor the step sizes are equal to one for loops L1 and L2. In general, loop normalization of loop L; results in ub; being replaced with [(ub; — lb; + s;)/s;] and all occurrences of index variable I; being replaced by lb; + (J; — 1)s;. In Figure 3.5(b), loop normalization is applied to the outermost loop (L1), and in Figure 3.5(c), loop normalization is applied to the innermost loop (L2). In order to normalize the outer loop in Figure 3.5(a) the upper bound of loop L1 is replaced with [(1th1 — lb;1 + s], )/s 1,]. As shown in Figure 3.5(b), [1 is renamed J1 and the 31 upper bound of .11 is calculated as: 11.11.]1 = [(1th1 -11111 +81,)/811_| = [(13 — 2+ 3)/3] = [_14/3] = 4. All occurrences of I1 are replaced with lb;1 + (J1 — 1)s;1: 11 =>lb11 + (.11 --1)811 = 2+ (.11 — 1)3 = 3.11 - 1. Note that occurrences of 11 in the loop bounds of loop L2, as well as in assignment statement 31, are replaced with 3.11 — 1. Similarly, the innermost loop (L2) is normalized as shown in Figure 3.5(c), where the upper bound is calculated as: ...,, = [(ubz. — 1b.. 
3.2 Loop Normalization

We have developed a matrix-based representation of the loop normalization transformation [13, 14, 15]. Loop normalization is a transformation that changes the bounds of a loop such that the lower bound and step size are both one. The transformation is important because normalized loops are a precondition for many kernel and extended loop transformations. For example, the step size of a loop must equal one before any kernel transformations are applied, and fully normalized loops make loop blocking and collapsing easier to implement. Also, loop normalization may decrease the cardinality of the distance vector set D, resulting in greater opportunities for parallelism.

An example of code that is not normalized is shown in Figure 3.5(a), where neither the lower bounds nor the step sizes are equal to one for loops L1 and L2. In general, loop normalization of loop L_i results in ub_i being replaced with ⌊(ub_i - lb_i + s_i)/s_i⌋ and all occurrences of index variable I_i being replaced by lb_i + (J_i - 1)s_i. In Figure 3.5(b), loop normalization is applied to the outermost loop (L1), and in Figure 3.5(c), loop normalization is applied to the innermost loop (L2).

In order to normalize the outer loop in Figure 3.5(a), the upper bound of loop L1 is replaced with ⌊(ub_I1 - lb_I1 + s_I1)/s_I1⌋. As shown in Figure 3.5(b), I1 is renamed J1, and the upper bound of J1 is calculated as:

  ub_J1 = ⌊(ub_I1 - lb_I1 + s_I1)/s_I1⌋ = ⌊(13 - 2 + 3)/3⌋ = ⌊14/3⌋ = 4.

All occurrences of I1 are replaced with lb_I1 + (J1 - 1)s_I1:

  I1 ⇒ lb_I1 + (J1 - 1)s_I1 = 2 + (J1 - 1)3 = 3*J1 - 1.

Note that occurrences of I1 in the loop bounds of loop L2, as well as in assignment statement S1, are replaced with 3*J1 - 1. Similarly, the innermost loop (L2) is normalized as shown in Figure 3.5(c), where the upper bound is calculated as:

  ub_J2 = ⌊(ub_I2 - lb_I2 + s_I2)/s_I2⌋ = ⌊((10 + 6*J1) - (3*J1 - 1) + 6)/6⌋ = ⌊(3*J1 + 17)/6⌋,

and occurrences of I2 in the source code are replaced with:

  I2 ⇒ lb_I2 + (J2 - 1)s_I2 = 3*J1 - 1 + (J2 - 1)6 = 3*J1 + 6*J2 - 7.

L1  for I1 = 2, 13, 3
L2    for I2 = I1, 12 + 2*I1, 6
S1      A[I1, I2] = ...

Figure 3.5(a). Sample code: before loop normalization.

L1  for J1 = 1, 4
L2    for I2 = 3*J1 - 1, 10 + 6*J1, 6
S1      A[3*J1 - 1, I2] = ...

Figure 3.5(b). Sample code: loop normalization of loop L1.

L1  for J1 = 1, 4
L2    for J2 = 1, ⌊(3*J1 + 17)/6⌋
S1      A[3*J1 - 1, 3*J1 + 6*J2 - 7] = ...

Figure 3.5(c). Sample code: loop normalization of loop L2.

3.2.1 Legality Criteria

Loop normalization is always legal, although it may not be necessary. That is, loop L_i may already be normal (lb_i = 1 and s_i = 1), in which case loop normalization does not modify the source code. However, if the application of a kernel or extended transformation results in unnormalized code, then loop normalization typically follows in the transformation sequence. The predicate normal(B^(x), i) is true if loop L_i is normal in loop bound set B^(x). The explicit precondition is that the loop is not normal before loop normalization is applied.

Preconditions
  Invariant:  (none)
  Explicit:   ¬(normal(B^(x-1), i))
  Nesting:    (none)
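The two normalization formulas are easy to check mechanically. A minimal sketch (Python; function names ours), verified against loop L1 of Figure 3.5(a):

    def normalized_upper_bound(lb, ub, s):
        """New upper bound: the iteration count floor((ub - lb + s) / s)."""
        return (ub - lb + s) // s

    def original_index(lb, s, j):
        """Replacement for the original index variable: lb + (J - 1) * s."""
        return lb + (j - 1) * s

    # Loop L1 of Figure 3.5(a): for I1 = 2, 13, 3
    print(normalized_upper_bound(2, 13, 3))                # 4
    print([original_index(2, 3, j) for j in range(1, 5)])  # [2, 5, 8, 11], i.e. I1 = 3*J1 - 1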
3.2.2 Effect on the Dependence Vector Set D

If loop L_i is not normal, then it is in one of three forms. Either the step size of loop L_i is not one, the lower bound of loop L_i is not one, or neither the step size nor the lower bound of loop L_i is one. Figure 3.6 shows each of these three cases: loop L1 fits Case 1 because the step size does not equal one (s_i ≠ 1 ∧ lb_i = 1); loop L2 fits Case 2 because the lower bound does not equal one (s_i = 1 ∧ lb_i ≠ 1); and loop L3 fits Case 3 because neither the step size nor the lower bound is one (s_i ≠ 1 ∧ lb_i ≠ 1). Each case is more thoroughly described below.

L1  for I1 = 1, m, 2
L2    for I2 = 3*I1, n
L3      for I3 = 2*I2, p, 5
S1        A[I1, I2, I3] = ...
S2        A[I1 - 4, I2 + 5, I3 - 10] = ...
          /** D = {(4, -5, 10)} **/

Figure 3.6. Sample code: different cases of unnormalized loops.

Case 1. s_i ≠ 1 ∧ lb_i = 1. Loop normalization results in modification to all distance vectors in the distance vector set. Specifically, the distances at location i (d_i) are divided by the current step size s_i. Intuitively, the modification to the distances is logical, since the distances between executions of assignment statements are reduced by a factor of s_i. Formally, if loop normalization is the x-th transformation, then the modification to the distance vector set D is described as:

  (∀d^(x-1) : d^(x-1) ∈ D^(x-1) : d_i^(x) = d_i^(x-1) / s_i).

If any distances in d^(x) are non-integral, then the distance vector is removed from the distance vector set. The effect on D is represented by the transformation matrix T. Suppose that T^(x-1) is the transformation matrix before loop normalization and t_n is the elementary matrix representing normalization. Then T^(x) = t_n × T^(x-1) becomes the new transformation matrix, where t_n is the identity matrix except that t_n[i, i] = 1/s_i.

As an example, the outermost loop L1 in Figure 3.6 is not normal. In order to normalize L1, t_n[1, 1] must equal 1/s_1, or 1/2. Therefore, the t_n that normalizes L1 is:

  t_n = [ 1/2  0  0 ]
        [  0   1  0 ]
        [  0   0  1 ]

and the modified distance vector set is t_n × d, d ∈ D:

  [ 1/2  0  0 ]   [  4 ]   [  2 ]
  [  0   1  0 ] × [ -5 ] = [ -5 ]
  [  0   0  1 ]   [ 10 ]   [ 10 ]

Case 2. s_i = 1 ∧ lb_i ≠ 1. Again, loop normalization of L_i results in modification to all d. The assumption is made that lb_i is constant, or that it is a linear function of an outer loop. Let a_{i,j} be the integer coefficient of the index variable I_j in the lower bound of loop L_i. As an example, a_{2,1} = 3 for loop L2 of the source code shown in Figure 3.6. In order to normalize loop L2, the lower bound of loop L2 (lb2) must be modified to equal one. Therefore, a skew is required, and the skew factor is equal to the negation of the coefficient a_{i,j}. In the case of loop L2 in Figure 3.6, the skew factor is -a_{2,1}, or -3. Also, since lb2 (i = 2) is skewed with respect to index variable I1 (j = 1), the location of the skew is t_n[i, j] = t_n[2, 1]:

  t_n = [  1        0  0 ]   [  1  0  0 ]
        [ -a_{2,1}  1  0 ] = [ -3  1  0 ]
        [  0        0  1 ]   [  0  0  1 ]

If loop L_i fits Case 2, then the form of the lower bound of loop L_i is lb_i = a_{i,j} × I_j. That is, the lower bound is a linear function of the index variable of an outer loop. The distances at location i (d_i^(x-1)) are modified by subtracting the coefficient a_{i,j} times the distance at location j (d_j^(x-1)), as in loop skewing. Formally, this condition is described as:

  (∀d^(x-1) : d^(x-1) ∈ D^(x-1) : d_i^(x) = d_i^(x-1) - (a_{i,j} × d_j^(x-1))).

Using the distance vector from the source code in Figure 3.6, d^(x) = t_n × d^(x-1):

  [  1  0  0 ]   [  4 ]   [   4 ]
  [ -3  1  0 ] × [ -5 ] = [ -17 ]
  [  0  0  1 ]   [ 10 ]   [  10 ]

Case 3. s_i ≠ 1 ∧ lb_i ≠ 1. In Case 3, neither the step size nor the lower bound of loop L_i equals one, and, not surprisingly, the modifications to D reflect both normalization of the step size to equal one (Case 1) and normalization of the lower bound to equal one (Case 2). First, we consider the lower bound lb_i. As was shown in Case 2, lb_i = a_{i,j} × I_j, where j < i. The matrix that normalizes the lower bound is t_n such that t_n[i, j] = -a_{i,j}. In Case 1, the step size of loop L_i was not one (s_i ≠ 1), and the matrix to normalize the loop was t_n such that t_n[i, i] = 1/s_i. If both Cases 1 and 2 are true, then t_n[i, j] is -a_{i,j}, and the entire i-th row is then multiplied by 1/s_i.

As an example, loop L3 in Figure 3.6 has a lower bound that is a linear function of I2, and the step size does not equal one. That is, lb3 = a_{3,2} × I2 = 2 × I2 (i = 3, j = 2, a_{3,2} = 2) and s_3 = 5. Therefore, t_n[i, j] = t_n[3, 2] = -2, and the i-th (3rd) row is multiplied by 1/s_3 = 1/5. The elementary transformation matrix t_n is:

  t_n = [ 1   0           0    ]   [ 1    0    0  ]
        [ 0   1           0    ] = [ 0    1    0  ]
        [ 0  -a_{3,2}/s3  1/s3 ]   [ 0  -2/5  1/5 ]

Without loss of generality, the form of t_n for normalization of all loops in a 3-nested loop is:

  t_n = [  1/s1          0            0   ]
        [ -a_{2,1}/s2   1/s2          0   ]
        [ -a_{3,1}/s3  -a_{3,2}/s3   1/s3 ]

where a_{i,j} is the appropriate coefficient and s_i is the step size for loop L_i. The matrix t_n can be easily generalized to n > 3, where n is the loop nest depth.
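The general form of t_n is easy to mechanize. The sketch below (Python, with exact fractions so non-integral distances are visible; names ours) builds t_n from the step sizes and lower-bound coefficients of Figure 3.6 and applies it to D = {(4, -5, 10)}, dropping any vector with a non-integral distance:

    from fractions import Fraction as F

    def normalization_matrix(steps, coeffs):
        """t_n for an n-deep nest: t_n[i][i] = 1/s_i and t_n[i][j] = -a_{i,j}/s_i for j < i.
        steps[i-1] is s_i; coeffs[i-1][j-1] is a_{i,j} (0 when the lower bound is constant)."""
        n = len(steps)
        t = [[F(0)] * n for _ in range(n)]
        for i in range(n):
            t[i][i] = F(1, steps[i])
            for j in range(i):
                t[i][j] = F(-coeffs[i][j], steps[i])
        return t

    def normalize_D(tn, D):
        """Apply t_n to every distance vector; remove vectors with non-integral entries."""
        out = []
        for d in D:
            nd = [sum(tn[r][c] * d[c] for c in range(len(d))) for r in range(len(tn))]
            if all(x.denominator == 1 for x in nd):
                out.append(tuple(int(x) for x in nd))
        return out

    # Figure 3.6: s = (2, 1, 5), lb2 = 3*I1 (a_{2,1} = 3), lb3 = 2*I2 (a_{3,2} = 2).
    tn = normalization_matrix([2, 1, 5], [[0, 0, 0], [3, 0, 0], [0, 2, 0]])
    print(normalize_D(tn, [(4, -5, 10)]))   # [(2, -17, 4)]: all distances integral here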
Another advantage of loop normalization is that it discriminates between many actual dependences and "apparent" dependences. Dependence tests make conservative assumptions when calculating dependences for source code. That is, if calculating the distances is too complex, then a dependence test assumes that a dependence exists. Also, more dependence vectors in D lead directly to more restrictions on the amount of attainable parallelism. Therefore, it is desirable that the distance vector set D have minimal cardinality and accurate distance vectors. Loop normalization allows the elimination of distance vectors that are erroneously in D as a result of a conservative assumption by a dependence test.

As an example, the source code in Figure 3.7 has an apparent distance vector set of D = {(8, 2, 5)}. The transformation matrix that represents loop normalization of loops L1, L2, and L3 is:

  [ 1/4   0    0  ]   [ 8 ]   [  2  ]
  [ -1   1/2   0  ] × [ 2 ] = [ -7  ]
  [  0    0   1/3 ]   [ 5 ]   [ 5/3 ]

After normalization, one of the distances in the distance vector set is non-integral (5/3). Since all distances in a distance vector must be integral, the distance vector (8, 2, 5) is eliminated from the distance vector set, removing potential restrictions on available parallelism in the source code.

L1  for i = 3, m, 4
L2    for j = 2 + 2*i, n + 2*i, 2
L3      for k = -p, -1, 3
S1        A[i, j, k] = ...
S2        A[i - 8, j - 2, k - 5] = ...
          /** D = {(8, 2, 5)} **/

Figure 3.7. Sample code: false dependence.

In summary, an elementary matrix t_n is found that represents the effect of loop normalization on the distance vector set. The matrix t_n is used to represent normalization of the step size, the lower bound, or both, of any loop. The matrix t_n composes with other kernel and extended transformations. Finally, loop normalization helps to eliminate "apparent" distance vectors in the distance vector set, thus increasing the amount of potential parallelism in the transformed source code.
3.2.3 Effect on Source Code Mapping

The general form of a doubly-nested loop before loop normalization is shown in Figure 3.8(a). Normalization of the outer loop L1 results in a new upper bound:

  ub_J1 = ⌊(ub1 - lb1 + s1)/s1⌋ = ⌊(d1 - c1 + s1)/s1⌋,

and any occurrences of I1 are replaced with:

  lb1 + (J1 - 1)s1 = c1 + J1*s1 - s1.   (3.1)

Similarly, normalization of the inner loop results in the upper bound of loop L2 becoming:

  ub_J2 = ⌊(ub2 - lb2 + s2)/s2⌋
        = ⌊((b_{2,1}(c1 + J1*s1 - s1) + d2) - (a_{2,1}(c1 + J1*s1 - s1) + c2) + s2)/s2⌋,

and any occurrences of I2 are replaced with:

  lb2 + (J2 - 1)s2 = (a_{2,1}(c1 + J1*s1 - s1) + c2) + J2*s2 - s2.   (3.2)

Source code with both L1 and L2 normalized is shown in Figure 3.8(b). The original assignment statement S1 was A[I1, I2] in Figure 3.8(a); after substitution using Equations (3.1) and (3.2), it becomes:

  A[c1 + J1*s1 - s1, a_{2,1}(c1 + J1*s1 - s1) + c2 + J2*s2 - s2].   (3.3)

L1  for I1 = c1, d1, s1
L2    for I2 = a_{2,1}*I1 + c2, b_{2,1}*I1 + d2, s2
S1      A[I1, I2] = ...

Figure 3.8(a). General form of an unnormalized doubly-nested loop.

L1  for J1 = 1, ⌊(d1 - c1 + s1)/s1⌋
L2    for J2 = 1, ⌊((b_{2,1}(c1 + J1*s1 - s1) + d2) - (a_{2,1}(c1 + J1*s1 - s1) + c2) + s2)/s2⌋
S1      A[c1 + J1*s1 - s1, a_{2,1}(c1 + J1*s1 - s1) + c2 + J2*s2 - s2] = ...

Figure 3.8(b). General form of a normalized doubly-nested loop.

Therefore, after loop normalization of loops L1 and L2, we know that:

  I1 = c1 + J1*s1 - s1,   (3.4)
  I2 = a_{2,1}(c1 + J1*s1 - s1) + c2 + J2*s2 - s2.   (3.5)

Manipulating Equation (3.4) and placing it in terms of J1, we obtain:

  J1 = I1/s1 - c1/s1 + 1.   (3.6)

Similarly, putting Equation (3.5) in terms of J2, we obtain:

  J2 = -(a_{2,1}/s2)*I1 + I2/s2 - (c2 - s2)/s2.   (3.7)

Recall from Section 2.7 that transforming source code involves the matrix representation of the new index variables in terms of the initial index variables. In matrix form, the representation is [J_i] = [T] × [I_i], where the J_i are the new index variables, T is the transformation matrix, and the I_i are the initial index variables. The transformation matrix T after loop normalization of loops L1 and L2 in Figure 3.8(a) is of the form:

  T = [  1/s1         0   ]   (3.8)
      [ -a_{2,1}/s2  1/s2 ]

Putting Equations (3.6) and (3.7) in matrix form results in:

  [ J1 ]   [  1/s1         0   ]   [ I1 ]   [ (c1 - s1)/s1 ]
  [ J2 ] = [ -a_{2,1}/s2  1/s2 ] × [ I2 ] - [ (c2 - s2)/s2 ]   (3.9)
                                           (additional terms)

The form of the matrix representation of Equations (3.6) and (3.7) is consistent with the kernel unimodular transformations ([J_i] = [T][I_i]). However, an additional term, which represents the effect of normalization, is subtracted from each new index variable. The additional terms from Equation (3.9) can be rewritten in terms of the normalization matrix t_n:

  [ (c1 - s1)/s1 ]   [  1/s1         0   ]   [ c1 - s1                    ]
  [ (c2 - s2)/s2 ] = [ -a_{2,1}/s2  1/s2 ] × [ a_{2,1}(c1 - s1) + c2 - s2 ]   (3.10)

Substituting Equation (3.10) into Equation (3.9) results in the matrix representation:

  [ J1 ]   [  1/s1         0   ]   [ I1 ]   [  1/s1         0   ]   [ c1 - s1                    ]
  [ J2 ] = [ -a_{2,1}/s2  1/s2 ] × [ I2 ] - [ -a_{2,1}/s2  1/s2 ] × [ a_{2,1}(c1 - s1) + c2 - s2 ]   (3.11)

The remainder of the source code mapping process from initial to final form, given a transformation matrix T, is the same after loop normalization as it was after only kernel transformations. That is, the four-step approach described in Section 2.7 is directly applied for source code mapping. In summary, the general form of the index variable mapping of a normalized loop is:

  [ J1 ]       [ I1 ]       [ V1 ]
  [ J2 ] = T × [ I2 ] - T × [ V2 ]
  [ ...]       [ ...]       [ ...]

where each J_i is a normalized index variable, each I_i is an initial index variable, T is the transformation matrix, and each V_i is a normalization expression that is a function of the coefficients and step sizes of the initial loops.
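The normalized mapping J = T × I - T × V can be checked numerically. The sketch below (Python; names ours) instantiates T and V for the concrete bounds of Figure 3.5(a) and confirms that the first iteration (I1, I2) = (2, 2) maps to (J1, J2) = (1, 1):

    from fractions import Fraction as F

    def map_indices(T, I, V):
        """J = T x I - T x V, per Equation (3.11)."""
        mul = lambda M, v: [sum(m * x for m, x in zip(row, v)) for row in M]
        return [a - b for a, b in zip(mul(T, I), mul(T, V))]

    # Figure 3.5(a): I1 = 2,13,3 (c1 = 2, s1 = 3); I2 = I1, 12 + 2*I1, 6
    # (a_{2,1} = 1, c2 = 0, s2 = 6).
    c1, s1, a21, c2, s2 = 2, 3, 1, 0, 6
    T = [[F(1, s1), F(0)], [F(-a21, s2), F(1, s2)]]
    V = [c1 - s1, a21 * (c1 - s1) + c2 - s2]   # the normalization expressions V1, V2
    print(map_indices(T, [2, 2], V))            # [Fraction(1, 1), Fraction(1, 1)]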
CHAPTER 4

Loop Blocking and Loop Collapsing

Loop blocking is the generic name for a family of transformations that increase the depth of a loop nest; it includes tiling, strip mining, and cycle shrinking. Loop collapsing is the inverse of loop blocking and decreases the depth of a loop nest. Repeated application of loop collapsing is called loop coalescing. This chapter describes the matrix-based representations of both loop blocking and loop collapsing [12, 23].

4.1 Loop Blocking

Loop blocking is the name given to a family of loop transformations that increase the nest depth of a nested loop. The objective of increasing the nest depth is to divide the initial iteration space into several iteration subspaces that are distributed and executed in parallel. Strip mining [4, 16, 17, 18, 19], cycle shrinking, and tiling [4, 19] are all generalized as one transformation in the extended matrix-based model presented here. Loop blocking represents each of these individual transformations, depending upon the quantity of loops that are blocked and which loops are parallelized.

Strip mining is a technique that transforms an n-nested loop into an (n+1)-nested loop. The n-th loop is divided into equal-size partitions (s_n) and is executed in parallel. Strip mining is referred to as a memory-optimizing transformation [17] because s_n is dependent upon some feature of the machine architecture. An advantage of strip mining is that fine-grained parallelism is obtained by parallelizing the n-th loop.

Cycle shrinking is similar to strip mining, since an n-nested loop is transformed into an (n+1)-nested loop. However, in cycle shrinking, s_n is dependent upon the source code, rather than the machine. Also, the (n+1)-st loop, rather than the n-th loop, is parallelized. An advantage of cycle shrinking is that each iteration within the (n+1)-st loop may be executed in parallel, as long as the initial dependence distances of the n-th loop are greater than one. Therefore, fine-grain intra-block parallelism is obtained.

Finally, tiling is a transformation that divides the initial iteration space into some fixed number of subspaces, or tiles. Each original loop may be partitioned based upon an architectural characteristic (as in strip mining) or a source code characteristic (as in cycle shrinking). Using tiling, an n-nested loop is transformed into a loop with a nest depth between n+1 and 2n. As an example, the 3-nested loop in Figure 4.1(a) becomes the 6-nested loop given in Figure 4.1(b) if each loop is tiled. An advantage of tiling is that the transformed source code can be evaluated for parallelism within tiles (intra-tile parallelism) and between tiles (inter-tile parallelism). Thus, tiling can be used to obtain both fine-grain and coarse-grain parallelism.

L1  for I1 = 1, n
L2    for I2 = 1, m
L3      for I3 = 1, p
S1        A[I1, I2, I3] = ...
S2        B[I1 - 2, I2 - 1, I3 - 1] = ...

Figure 4.1(a). Sample code: before transformation.

L1' for I1' = 1, n, s1
L2'   for I2' = 1, m, s2
L3'     for I3' = 1, p, s3
L1        for I1 = I1', min(I1' + s1 - 1, n)
L2          for I2 = I2', min(I2' + s2 - 1, m)
L3            for I3 = I3', min(I3' + s3 - 1, p)
S1              A[I1, I2, I3] = ...
S2              B[I1 - 2, I2 - 1, I3 - 1] = ...

Figure 4.1(b). Sample code: after loop blocking all loops.

Traditionally, loop blocking and tiling are considered synonymous, and strip mining and cycle shrinking are considered subclasses. That is, loop blocking typically refers to tiling a range of loops (L_i through L_j) within a loop nest, and strip mining and cycle shrinking refer to transforming a single loop (L_i). In this dissertation, blocking is applied to a single loop, as in strip mining. Tiling is equivalent to a combination of blocking and permutation using this definition of blocking. As an example, to tile the source code shown in Figure 4.1(a), we would invoke the function tile(L_i, L_j, b[i, ..., j]). Loops L_i through L_j are tiled with blocking factors of b_i through b_j, respectively. The resulting source code is shown in Figure 4.1(b). In contrast, obtaining the source code shown in Figure 4.1(b) is a six-step process using the primitives block and permute:

1. block(L1, s1);
2. block(L2, s2);
3. block(L3, s3);
4. permute(L1, L2');
5. permute(L1, L3');
6. permute(L1, L2).

First, the outer loop L1 is blocked with blocking factor s1 (Figure 4.2(a)). Next, loops L2 and L3 are blocked with blocking factors s2 and s3, respectively (Figure 4.2(b)). As is shown in Figure 4.2(b), blocking of all three loops results in each blocking loop L_i' directly enclosing its blocked loop L_i. Finally, permutation is used to move all of the blocking loops outermost and all of the blocked loops innermost. Figure 4.2(c) shows the transformed source code just prior to the final permutation.

L1' for I1' = 1, n, s1
L1    for I1 = I1', min(n, I1' + s1 - 1)
L2      for I2 = 1, m
L3        for I3 = 1, p
S1          A[I1, I2, I3] = ...
S2          B[I1 - 2, I2 - 1, I3 - 1] = ...

Figure 4.2(a). Sample code: after blocking loop L1.

L1' for I1' = 1, n, s1
L1    for I1 = I1', min(n, I1' + s1 - 1)
L2'     for I2' = 1, m, s2
L2        for I2 = I2', min(m, I2' + s2 - 1)
L3'         for I3' = 1, p, s3
L3            for I3 = I3', min(p, I3' + s3 - 1)
S1              A[I1, I2, I3] = ...
S2              B[I1 - 2, I2 - 1, I3 - 1] = ...

Figure 4.2(b). Sample code: after blocking loops L1 through L3.

L1' for I1' = 1, n, s1
L2'   for I2' = 1, m, s2
L3'     for I3' = 1, p, s3
L2        for I2 = I2', min(m, I2' + s2 - 1)
L1          for I1 = I1', min(n, I1' + s1 - 1)
L3            for I3 = I3', min(p, I3' + s3 - 1)
S1              A[I1, I2, I3] = ...
S2              B[I1 - 2, I2 - 1, I3 - 1] = ...

Figure 4.2(c). Sample code: before permuting loops L2 and L1.

Although redefining blocking seems like a complication, it is actually a simplification. The composed transformation tiling is subdivided into its constituent elementary transformations. Rather than discussing the legality of a composite transformation, the legality of the constituent transformations can be explored individually and more easily understood.
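Before turning to legality, the shape of a blocked loop is easy to see in executable form. A minimal sketch (Python; names ours) mirrors Figure 4.2(a): a blocking loop that steps by s encloses a blocked loop whose upper bound is guarded by min() for the final partial block:

    def strip_mined(n, s):
        """Iterate I = 1..n as a blocking loop I' (step s) enclosing a blocked loop I."""
        for i_blk in range(1, n + 1, s):                       # blocking loop L1'
            for i in range(i_blk, min(n, i_blk + s - 1) + 1):  # blocked loop L1
                yield i_blk, i

    # n = 7, s = 3 partitions the iteration space into {1,2,3}, {4,5,6}, {7}:
    print(list(strip_mined(7, 3)))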
4.1.1 Legality Criteria

Historically, tiling loops L_i through L_j was legal if they were fully permutable [24]. However, tiling involves blocking single loops and then permuting the blocking loops to the outermost position and the blocked loops to the innermost position. The precondition of full permutability is necessary only to allow the second stage (permutation) of the loops. In this dissertation, the transformations of loop blocking and loop permutation are separated; the precondition of full permutability is relevant only in the context of the kernel transformation of loop permutation. As an example, the source code in Figure 4.2(b) shows the blocking of each loop L1 through L3, without any permutation. Loop blocking, as an elemental transformation, is always legal as long as the bounds are normal prior to the transformation. It is assumed that loop normalization follows loop blocking in the transformation sequence and returns the source code to normal form.

Preconditions
  Invariant:  normal(B^(x-1), i)
  Explicit:   (none)
  Nesting:    (none)

4.1.2 Effect on the Dependence Vector Set D

Given that the loop nest depth is n, and loop blocking is the x-th transformation in the sequence, all distance vectors prior to loop blocking have length n (∀d^(x-1) : d^(x-1) ∈ D^(x-1) : |d^(x-1)| = n). Blocking a loop increases the nest depth by one and also increases the length of each distance vector by one (∀d^(x) : d^(x) ∈ D^(x) : |d^(x)| = n + 1). As can be seen, blocking a range of loops L_i through L_j results in all distance vectors becoming n' long, where n' = n + (j - i + 1).

Loop blocking may increase the cardinality of the distance vector set D, based upon the values calculated from the distances before loop blocking and the block size. Specifically, if d^(x-1) ∈ D^(x-1) are the distance vectors before loop blocking, and loop L_i is blocked, then:

- distances in loops outside of the blocked code (1 ≤ k < i) remain unchanged;
- distances at loop L_i (k = i) are assigned values calculated from the block size and the distance before the loop blocking transformation;
- and distances in loops inside of the blocked code (i < k ≤ n + 1) are assigned the original distances with a location offset by one.

Formally, the mapping rule from D^(x-1) to D^(x), where loop blocking is the x-th transformation and loop L_i is blocked with blocking factor s_i, is:

  (∀d^(x-1) : d^(x-1) ∈ D^(x-1) :
    (∀k : 1 ≤ k < i : d_k^(x) = d_k^(x-1)) ∧
    (d_i^(x), d_{i+1}^(x)) ∈ {(⌊d_i^(x-1)/s_i⌋, d_i^(x-1) mod s_i), (⌈d_i^(x-1)/s_i⌉, (d_i^(x-1) mod s_i) - s_i)} ∧
    (∀k : i < k ≤ n : d_{k+1}^(x) = d_k^(x-1))),

so that each candidate pair satisfies d_i^(x) × s_i + d_{i+1}^(x) = d_i^(x-1). When the two candidate pairs differ, both are added to D^(x), which is how loop blocking can increase the cardinality of the distance vector set.
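The mapping rule above can be exercised directly. A minimal sketch (Python; names ours), under the stated reading of the rule, emits both candidate (blocking, blocked) pairs, which is exactly how blocking can grow |D|:

    def block_distances(D, i, s):
        """Map each n-vector to (n+1)-vectors after blocking loop L_i (1-based) with
        factor s. Each emitted pair satisfies blk * s + rem = d_i."""
        out = set()
        for d in D:
            di = d[i - 1]
            pairs = {(di // s, di % s)}                # floor quotient, non-negative remainder
            if di % s != 0:
                pairs.add((di // s + 1, di % s - s))   # ceiling quotient, negative remainder
            for blk, rem in pairs:
                out.add(d[:i - 1] + (blk, rem) + d[i:])
        return out

    # Blocking L1 with factor 4 splits the distance (6, 1) into two candidates:
    print(sorted(block_distances({(6, 1)}, 1, 4)))   # [(1, 2, 1), (2, -2, 1)]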
4.1.3 Effect on Source Code Mapping

The index variables are transformed first; for the source code shown in Figure 4.3(a), the assignment statement becomes A[J3, -J1]. Next, the loop bounds are transformed. The inequalities, minima, and maxima are determined as in the original source code mapping algorithm.

  Extract Inequalities       Find min's and max's
  I1 ≥ 1;  I1 ≤ m;           min(I1) = 1;  max(I1) = m;
  I2 ≥ 1;  I2 ≤ n            min(I2) = 1;  max(I2) = n

When the indices are transformed, four new inequalities are introduced into the inequality set to reflect the new blocking and blocked index variables. Specifically, if loop L_j with index variable I_j is blocked by a blocking factor of s_j, becoming a blocking loop with index variable J_s and a blocked loop with index variable J_t, then the following inequalities are added to the inequality set:

  J_s ≥ lb_j;  J_s ≤ ub_j;  J_t ≥ J_s;  J_t ≤ J_s + s_j - 1.

Using the source code shown in Figure 4.3(a), the inequality set becomes:

  J3 ≥ 1;  J3 ≤ m;  min(J3) = 1;  max(J3) = m;
  -J1 ≥ 1;  -J1 ≤ n;  min(-J1) = 1;  max(-J1) = n;
  J2 ≥ 1;  J2 ≤ m;
  J3 ≥ J2;  J3 ≤ J2 + s2 - 1.

The final step, calculating the new loop bounds, is the same for blocked code as for code containing only kernel transformations. That is, the entire inequality set is used to determine the new loop bounds for J1, J2, and J3. For example, the inequalities J3 ≥ J2 and J3 ≤ J2 + s2 - 1 yield J2 ≤ J3, J2 ≥ J3 - s2 + 1, J2 ≤ max(J3), and J2 ≥ min(J3) - s2 + 1. Given the bounds calculated above, the transformed source code is shown in Figure 4.3(b).

4.2 Loop Collapsing

Loop collapsing is the inverse of loop blocking and decreases an n-nested loop to an (n-1)-nested loop [14, 15, 18]. Repeated application of loop collapsing is called loop coalescing and results in an n'-nested loop, where n' < n. The source code shown in Figure 4.4(a) is transformed into the source code shown in Figure 4.4(b) by loop collapsing. Additional statements (shown in Figure 4.4(b)) are introduced to recalculate the values of the index variables I1 and I2 in the transformed source code.

     integer W[2, 3], V[2, 3]
L1   for I1 = 1, ub_i
L2     for I2 = 1, ub_j
S1       W[I1, I2] = V[I1, I2] * 5

Figure 4.4(a). Sample code: before loop collapsing.

     integer W[2, 3], V[2, 3]
L1,2 for I1,2 = 1, ub_i × ub_j
S1'    if ((I1,2 mod ub_j) == 0)
S2'      then I1 = I1,2 / ub_j
S3'      else I1 = ceil(I1,2 / ub_j)
S4'    I2 = ((I1,2 - 1) mod ub_j) + 1
S1     W[I1, I2] = V[I1, I2] * 5

Figure 4.4(b). Sample code: after loop collapsing.

A loop nest may be collapsed to improve scheduling, reduce barrier synchronization, or improve utilization of the memory hierarchy. Scheduling is improved because the work is divided among fewer loops, and self-scheduling techniques are more apt to find work available when the size of the task varies. Barrier synchronization is reduced because fewer synchronization points exist due to the decrease in the number of loop nests. Finally, loop collapsing increases the effective vector length for vector machines, improving utilization of vector registers. Vector operations are typically applied to the loop with the highest iteration count. Collapsing two adjacent loops L_i and L_j results in a collapsed loop L_{i,j} with a greater iteration count than loops L_i or L_j had individually.
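The recomputation statements of Figure 4.4(b) behave as follows (a runnable sketch in Python; names ours):

    import math

    def recover_indices(i12, ub_j):
        """Recover (I_i, I_j) from the collapsed index I_{1,2}, as in Figure 4.4(b)."""
        if i12 % ub_j == 0:
            i = i12 // ub_j
        else:
            i = math.ceil(i12 / ub_j)
        j = ((i12 - 1) % ub_j) + 1
        return i, j

    # ub_i = 2, ub_j = 3: the collapsed loop runs I_{1,2} = 1..6 and visits the
    # original iteration space in the original order:
    print([recover_indices(k, 3) for k in range(1, 7)])
    # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]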
4.2.1 Legality Criteria

The legality criteria for loop collapsing were initially described by Wolfe [14] and later formally specified by Polychronopoulos [27]. However, this legality condition is too restrictive, because it renders loop collapsing illegal in many cases where its application results in semantically equivalent source code. Wolfe and Polychronopoulos state that two loops L_i and L_j can be collapsed as long as d_i and d_j are 0 for all of the direction vectors in the direction vector set. According to Wolfe and Polychronopoulos, loops L_i and L_j can be collapsed if:

  (∀d^(x-1) : d^(x-1) ∈ D^(x-1) : d_i = 0 ∧ d_j = 0).

As a counter-example to this legality criterion, the source code in Figure 4.5(a) can be collapsed into the source code in Figure 4.5(b), resulting in semantically equivalent source code, where D = {(1, 2)} (d1 ≠ 0 ∧ d2 ≠ 0).

L1  for I1 = 1, 2
L2    for I2 = 1, 3
S1      W[I1, I2] = ...
S2      ... = W[I1 - 1, I2 - 2]

Figure 4.5(a). Sample code: before loop collapsing transformation; illegal by Wolfe's [14] and Polychronopoulos's [27] definition.

L1,2 for I1,2 = 1, 6
S1'    if ((I1,2 mod 3) == 0)
S2'      then I1 = I1,2 / 3
S3'      else I1 = ceil(I1,2 / 3)
S4'    I2 = ((I1,2 - 1) mod 3) + 1
S1     W[I1, I2] = ...
S2     ... = W[I1 - 1, I2 - 2]

Figure 4.5(b). Sample code: after legal application of the loop collapsing transformation.

In this dissertation, it is legal to collapse two loops L_i and L_j as long as loop L_i is the parent of loop L_j in the loop nest hierarchy and loops L_i and L_j are both normal. Loop collapsing cannot be applied to two loops L_i and L_j if they are imperfectly nested. Collapsing code that is imperfectly nested results in an increased number of iterations for any statement S_k between the loops, which results in S_k executing more times than in the initial source code. As an example, statement S1 executes m times in the source code shown in Figure 4.6(a). However, statement S1 executes m × n times in the collapsed source code shown in Figure 4.6(b).

L1  for I1 = 1, m
S1    A[I1] = ...
L2    for I2 = 1, n
        ...

Figure 4.6(a). Sample code: imperfectly nested code before the loop collapsing transformation.

L1,2 for I1,2 = 1, m × n
       if ((I1,2 mod n) == 0)
         then I1 = I1,2 / n
         else I1 = ceil(I1,2 / n)
       I2 = ((I1,2 - 1) mod n) + 1
S1     A[I1] = ...
       ...

Figure 4.6(b). Sample code: illegal loop collapsing transformation of imperfectly nested code.

In summary, the legality criteria for loop collapsing are:

Preconditions
  Invariant:  normal(B^(x-1), i) ∧ normal(B^(x-1), j)
  Explicit:   j = i + 1
  Nesting:    perfect(L_i, L_j)

4.2.2 Effect on the Dependence Vector Set D

The initial loop nest depth is n, and all dependence vectors prior to loop collapsing, d^(x-1) ∈ D^(x-1), have length n (∀d^(x-1) : d^(x-1) ∈ D^(x-1) : |d^(x-1)| = n). The dependence vector set after loop collapsing is D^(x), if loop collapsing is the x-th transformation. As mentioned earlier, the loop nest depth decreases by one in loop collapsing, and correspondingly, the length of the dependence vectors d^(x) decreases by one (∀d^(x) : d^(x) ∈ D^(x) : |d^(x)| = n - 1). Each dependence vector d^(x) in the dependence vector set D^(x) is modified as follows:

- distances in loops outside of the collapsed code (1 ≤ k < i) remain unchanged;
- distances within the collapsed loops (k = i) are assigned a value that is a function of the initial distances and the upper bound of the inner loop;
- distances in loops inside of the collapsed code (j ≤ k ≤ n - 1) are assigned the original distances with locations offset by one.

Formally, the modification to the distances in each distance vector when loops L_i and L_j are collapsed is specified as follows, where ub_j is the upper bound for loop L_j:

  (∀d^(x-1) : d^(x-1) ∈ D^(x-1) :
    (∀k : 1 ≤ k < i : d_k^(x) = d_k^(x-1)) ∧
    (d_i^(x) = d_i^(x-1) × ub_j + d_j^(x-1)) ∧
    (∀k : j ≤ k ≤ n - 1 : d_k^(x) = d_{k+1}^(x-1))).

For the source code in Figure 4.5(a), for example, the distance vector (1, 2) becomes (1 × 3 + 2) = (5) after collapsing.
4.2.3 Effect on Source Code Mapping

As an example, consider the 3-deep loop nest shown in Figure 4.7(a), in which the two innermost loops, L2 and L3, are collapsed. After the index variables are transformed, the assignment statement is mapped to A[J1, I2, I3]. Note that the index variables I2 and I3 are not transformed in the final source code. As will be shown later, the indices are recalculated by some additional assignment statements. The loops are mapped after the index variables are calculated. The inequalities, minima, and maxima are determined from the initial source code.

  Extract Inequalities       Find min's and max's
  I1 ≥ 1;  I1 ≤ ub1;         min(I1) = 1;  max(I1) = ub1;
  I2 ≥ 1;  I2 ≤ ub2;         min(I2) = 1;  max(I2) = ub2;
  I3 ≥ 1;  I3 ≤ ub3          min(I3) = 1;  max(I3) = ub3

The indices are transformed using inequalities from the initial source code and the index variable mapping used for the assignment statements. Two inequalities for the index variable that represents the collapsed loops replace the initial inequalities. That is, if loops L_i and L_j are collapsed and J_s is the index variable in the transformed source code for the collapsed loops, then the following inequalities are introduced:

  J_s ≥ lb_i × lb_j;  J_s ≤ ub_i × ub_j

If we assume that the loops are normalized prior to loop collapsing (lb_i = 1 and lb_j = 1), then the first inequality is replaced with J_s ≥ 1 × 1 ≡ J_s ≥ 1. Continuing with the example, the transformed inequality set is:

  J1 ≥ 1;  J1 ≤ ub1;
  J2 ≥ 1 × 1;  J2 ≤ ub2 × ub3

Also, after loop collapsing, the array subscripts for I_i and I_j are recalculated. The recalculated array subscripts for I_i and I_j, with J_s representing the collapsed loops, are:

  if ((J_s mod ub_j) == 0)
    then I_i = J_s / ub_j
    else I_i = ceil(J_s / ub_j)
  I_j = ((J_s - 1) mod ub_j) + 1

The complete transformed source code is shown in Figure 4.7(b).

CHAPTER 5

Loop Fission and Loop Fusion

Loop fission is a transformation that changes the structure of a loop nest such that it becomes two adjacent loop nests. Loop fusion is the inverse of loop fission and combines two adjacent loop nests into one. The following discussion describes the legality criteria, the effect on the distance vector set D, and the effect on the transformation matrix T for both loop fission and loop fusion [12, 22].

5.1 Loop Fission

Loop fission [17, 18, 19, 20, 21] is a loop transformation technique that divides a set of program statements that are within a nested loop into two adjacent nested loops. As an example, the source code in Figure 5.1(a) becomes the source code in Figure 5.1(b) after loop fission. Applying loop fission improves the localities of reference such that the resultant code has the largest possible groupings of references to the same variables; it may also enable unimodular transformations, either by modifying the transformed source code so that all assignment statements are perfectly nested [28], or by reducing the cardinality of the distance vector set D.

L1  for I1 = 1, n
L2    for I2 = 1, m
L3      for I3 = 1, p
S1        A[I1, I2, I3] = B[I1 - 1, I2, I3]
S2        C[I1, I2, I3] = D[I1, I2 - 3, I3]
S3        B[I1, I2, I3] = A[I1, I2 - 2, I3 - 1]
S4        D[I1, I2, I3] = B[I1 + 1, I2 - 1, I3] + C[I1, I2, I3 - 2]
S5        var1 = var2

Figure 5.1(a). Sample code: before the loop fission transformation.

L1  for I1 = 1, n
L2    for I2 = 1, m
L3      for I3 = 1, p
S2        C[I1, I2, I3] = D[I1, I2 - 3, I3]
S4        D[I1, I2, I3] = B[I1 + 1, I2 - 1, I3] + C[I1, I2, I3 - 2]
L1' for I1 = 1, n
L2'   for I2 = 1, m
L3'     for I3 = 1, p
S1        A[I1, I2, I3] = B[I1 - 1, I2, I3]
S3        B[I1, I2, I3] = A[I1, I2 - 2, I3 - 1]
S5        var1 = var2

Figure 5.1(b). Sample code: after the loop fission transformation.

5.1.1 Legality Criteria

The initial source code has a dependence vector set D^(0) that contains all dependence vectors between statements in the initial code. As mentioned earlier, D^(0) can be represented by a dependence graph, where each assignment statement is a node and each dependence is a directed arc. A Π-block is a strongly connected component of the dependence graph [13]. Π-blocks are significant because they cannot be divided by most loop transformations. A Ψ-block (defined in this paper) is a set of statements whose members are strongly connected components. In other words, dividing a set of statements into two Ψ-blocks (Ψ1 and Ψ2) results in two partitions such that each element of Ψ1 and Ψ2 is a Π-block. The source code in Figure 5.1(a) has three Π-blocks: Π1 = (S1, S3), Π2 = (S2, S4), and Π3 = (S5). After transformation, the source code shown in Figure 5.1(b) has two Ψ-blocks: Ψ1 = (S2, S4) and Ψ2 = (S1, S3, S5). Note that none of the Π-blocks are divided by the loop fission transformation.
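Since Π-blocks are strongly connected components, they can be computed with any SCC routine. A sketch using the networkx package (a convenience assumption; any SCC algorithm works), with one plausible arc set for the dependence graph of Figure 5.1(a):

    import networkx as nx

    # Dependence graph for Figure 5.1(a): one arc per dependence between statements.
    G = nx.DiGraph()
    G.add_edges_from([
        ("S1", "S3"), ("S3", "S1"),   # S1 and S3 are mutually dependent (arrays A and B)
        ("S2", "S4"), ("S4", "S2"),   # S2 and S4 are mutually dependent (arrays C and D)
        ("S4", "S3"),                 # cross arc through array B; does not merge components
    ])
    G.add_node("S5")                  # S5 participates in no dependence

    pi_blocks = list(nx.strongly_connected_components(G))
    print(pi_blocks)   # [{'S1', 'S3'}, {'S2', 'S4'}, {'S5'}], in some order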
In the following discussion, loop fission is the x-th transformation in the sequence. The dependence vector set for the statements in Ψ1 is D^(x-1)_{Ψ1,Ψ1}, and the dependence vector set for the statements in Ψ2 is D^(x-1)_{Ψ2,Ψ2}. Finally, the dependence vector set from Ψ1 to Ψ2 is labeled D^(x-1)_{Ψ1,Ψ2}, and the dependence vector set from Ψ2 to Ψ1 is D^(x-1)_{Ψ2,Ψ1}. The union of all four of these sets is D^(x-1), the dependence vector set before the loop fission transformation:

  D^(x-1) = D^(x-1)_{Ψ1,Ψ1} ∪ D^(x-1)_{Ψ2,Ψ2} ∪ D^(x-1)_{Ψ1,Ψ2} ∪ D^(x-1)_{Ψ2,Ψ1}.

As an example, the dependence vector set D^(x-1) for the source code in Figure 5.1(a) is:

  D^(x-1) = {(0,2,1), (0,0,2), (1,0,0), (1,-1,0), (0,3,0)},   (5.1)

where D^(x-1)_{Ψ1,Ψ1} = {(0,0,2), (0,3,0)}, D^(x-1)_{Ψ2,Ψ2} = {(0,2,1), (1,0,0)}, D^(x-1)_{Ψ1,Ψ2} = {(1,-1,0)}, and D^(x-1)_{Ψ2,Ψ1} = ∅. As can be seen from Expression (5.1), D^(x-1) is the union of these four partitions.

The dependence vector sets D^(x-1)_{Ψ1,Ψ2} and D^(x-1)_{Ψ2,Ψ1} are critical to the legality and order of the loop fission transformation. If there are dependences from both Ψ1 to Ψ2 and from Ψ2 to Ψ1, then loop fission cannot be applied. A non-empty D^(x-1)_{Ψ1,Ψ2} implies that Ψ1 must execute before Ψ2 in the transformed source code. Similarly, a non-empty D^(x-1)_{Ψ2,Ψ1} implies that Ψ2 must execute before Ψ1. Clearly, both execution orders cannot be satisfied. Formally, loop fission cannot be applied if

  ((∃d : d ∈ D^(x-1)_{Ψ1,Ψ2}) ∧ (∃d : d ∈ D^(x-1)_{Ψ2,Ψ1})).

A simpler formal notation that describes the legality criteria is:

Preconditions
  Invariant:  (none)
  Explicit:   (D^(x-1)_{Ψ1,Ψ2} = ∅ ∨ D^(x-1)_{Ψ2,Ψ1} = ∅)
  Nesting:    (none)

This notation states that loop fission is legal if either the distance vector set from Ψ1 to Ψ2, or the distance vector set from Ψ2 to Ψ1, is empty, regardless of the nesting structure of the two Ψ-blocks.
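The explicit precondition is a one-line predicate. A minimal sketch (Python; names ours), applied to the sets of the example above:

    def fission_legal(d_psi1_to_psi2, d_psi2_to_psi1):
        """Loop fission is legal if at least one cross-partition dependence set is
        empty; the non-empty set dictates which Psi-block must execute first."""
        return len(d_psi1_to_psi2) == 0 or len(d_psi2_to_psi1) == 0

    # Figure 5.1(a): D_{Psi1,Psi2} = {(1,-1,0)} and D_{Psi2,Psi1} = {}, so fission
    # is legal, and Psi1 must precede Psi2, as in Figure 5.1(b).
    print(fission_legal({(1, -1, 0)}, set()))   # True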
5.1.2 Effect on the Dependence Vector Set D

The dependence vector set for the source code before the loop fission transformation is D^(x-1) = D^(x-1)_{Ψ1,Ψ1} ∪ D^(x-1)_{Ψ2,Ψ2} ∪ D^(x-1)_{Ψ1,Ψ2} ∪ D^(x-1)_{Ψ2,Ψ1}. The sets D^(x-1)_{Ψ1,Ψ2} and D^(x-1)_{Ψ2,Ψ1} were initially loop-carried dependences; however, after loop fission these partitions become loop-independent. Loop fission is referred to as a dependence-breaking transformation [17] because the dependences between Ψ-blocks are carried at a higher level after transformation. After loop fission, the dependence vector set for the first code block (D^(x)_{Ψ1,Ψ1}) consists of the dependences within the statement partition Ψ1. Similarly, the dependence vector set for the second code block (D^(x)_{Ψ2,Ψ2}) consists of the dependences within the statement partition Ψ2:

  Partition Ψ1: D^(x) = D^(x-1)_{Ψ1,Ψ1}
  Partition Ψ2: D^(x) = D^(x-1)_{Ψ2,Ψ2}

5.1.3 Effect on Source Code Mapping

Any unimodular transformation can be applied at any time during the loop transformation process (the composition property). In compliance with this property, loop fission can be applied at any point in the parallelizing process, given that the legality criteria are met. Loop fission results in two adjacent nested loops where only one previously existed. Therefore, both new blocks of code (Ψ1 and Ψ2) have a transformation matrix associated with them (T^(x)_{Ψ1} and T^(x)_{Ψ2}). Because of the composition property, both T^(x)_{Ψ1} and T^(x)_{Ψ2} are initialized to the transformation matrix that existed for the nested loop prior to loop fission (T^(x-1)). Notationally, this condition is expressed as:

  Partition Ψ1: T^(x)_{Ψ1} = T^(x-1)
  Partition Ψ2: T^(x)_{Ψ2} = T^(x-1)

5.2 Loop Fusion

This section describes the extension of the unimodular approach to include loop fusion [17, 18, 19], which is the complement of loop fission. Loop fusion combines two adjacent nested loops into one nested loop, resulting in a decrease in communication and synchronization overhead. As an example, the initial source code shown in Figure 5.2(a) is transformed into the source code shown in Figure 5.2(b) by loop fusion.

L1  for I1 = 1, n
L2    for I2 = 1, m
L3      for I3 = 1, p
S1        A[I1, I2, I3] = B[I1, I2, I3]
S2        A[I1 - 1, I2, I3 - 2] = D[I1, I2 - 2, I3]
L1' for I1 = 1, n
L2'   for I2 = 1, m
L3'     for I3 = 1, p
S3        E[I1 - 2, I2, I3] = F[I1, I2, I3] + G[I1, I2, I3]
S4        H[I1 - 3, I2, I3] = E[I1, I2, I3]

Figure 5.2(a). Sample code: before the loop fusion transformation.

L1  for I1 = 1, n
L2    for I2 = 1, m
L3      for I3 = 1, p
S1        A[I1, I2, I3] = B[I1, I2, I3]
S2        A[I1 - 1, I2, I3 - 2] = D[I1, I2 - 2, I3]
S3        E[I1 - 2, I2, I3] = F[I1, I2, I3] + G[I1, I2, I3]
S4        H[I1 - 3, I2, I3] = E[I1, I2, I3]

Figure 5.2(b). Sample code: after the loop fusion transformation.

5.2.1 Legality Criteria

Previous work related to loop fusion implies that the complexity of determining the legality of the transformation exceeds the potential gains from its implementation [14]. This section clarifies and simplifies the legality condition for loop fusion based upon properties of the loop bound set and the distance vector set.

A Θ-block is a set of statements that is associated by spatial locality in the source code. That is, all of the assignment statements in a loop nest make up one Θ-block. As an example, the source code in Figure 5.2(a) has two Θ-blocks: Θ1 = (S1, S2) and Θ2 = (S3, S4). A Θ-block is another name for a loop nest and is introduced in this paper to ensure consistent notation with loop fission. Unlike Π- and Ψ-blocks, nothing is assumed about the dependence relations within Θ-blocks.

Zima [15] and Wolfe [14] both state that a precondition for loop fusion is that the loop bound sets for both loop nests must match exactly, or be "adjustable" using a transformation such as loop unrolling. The following discussion shows that this precondition can be weakened into a predicate named can_adjust(B_Θ1, B_Θ2), which returns true if the loop bound sets for Θ1 (B_Θ1) and Θ2 (B_Θ2) can be adjusted to match exactly. To clarify, Zima and Wolfe state that the loop bound sets must already match, whereas can_adjust() determines whether the loop bound sets can be adjusted. An algorithm that determines the value of can_adjust(B_Θ1, B_Θ2) is given in Figure 5.3.

Algorithm can_adjust
Given two loop bound sets, B_Θ1 and B_Θ2, determines if the loop bound sets can be adjusted.
Evaluation of this function is a precondition for loop fusion.
Input:  B_Θ1  /** loop bound set for the first code block **/
        B_Θ2  /** loop bound set for the second code block **/
Output: FLAG  /** TRUE or FALSE **/
Procedure:
1. /** Determine iteration count for all 3-tuples in B_Θ1 and B_Θ2 **/
   For all b_i ∈ B_Θ1
     Θ1.count_i = ⌊(ub_i - lb_i + s_i)/s_i⌋
   For all b_j ∈ B_Θ2
     Θ2.count_j = ⌊(ub_j - lb_j + s_j)/s_j⌋
2. /** Sort Θ1.count and Θ2.count **/
   SORT(Θ1.count)
   SORT(Θ2.count)
3. /** Determine value of FLAG **/
   FLAG = TRUE
   For all i ∈ Θ1.count
     /** if the iteration counts differ for a loop in Θ1 and Θ2 **/
     If Θ1.count_i ≠ Θ2.count_i
       /** then the two loop bound sets cannot be adjusted **/
       Then FLAG = FALSE

Figure 5.3. Algorithm to determine if two loop bound sets can be adjusted.

Determining the set of transformations needed to adjust B_Θ1 and B_Θ2 may involve a complex solution. Note that can_adjust(B_Θ1, B_Θ2) determines whether the loop bound sets can be adjusted, but does not determine the actual adjustments. The algorithm to determine the adjustment matrix is dependent upon the cardinality of the set of transformations and the loop nest depth of the target source code.
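Figure 5.3 translates directly into a few lines. A sketch (Python; names ours) with each loop bound set given as (lb, ub, s) triples, checked against the nests of Figure 5.6(a) below:

    def iteration_counts(bounds):
        """Sorted per-loop iteration counts for a loop bound set of (lb, ub, s) triples."""
        return sorted((ub - lb + s) // s for lb, ub, s in bounds)

    def can_adjust(b_theta1, b_theta2):
        """True if the two loop bound sets can be adjusted to match exactly:
        the same multiset of iteration counts, in any loop order."""
        return iteration_counts(b_theta1) == iteration_counts(b_theta2)

    # Figure 5.6(a): Theta1 iterates (1..m, 1..n); Theta2 iterates (1..m, -n..-1).
    m, n = 10, 4
    print(can_adjust([(1, m, 1), (1, n, 1)], [(1, m, 1), (-n, -1, 1)]))   # True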
Typically, another legality criterion for loop fusion is based upon the dependences between the two code blocks Θ1 and Θ2. The dependence vector set after loop fusion has four partitions:

1. D^(x)_{Θ1,Θ1}: the loop-carried dependences between statements in Θ1.
2. D^(x)_{Θ2,Θ2}: the loop-carried dependences between statements in Θ2.
3. D^(x)_{Θ1,Θ2}: before the fusion transformation, this partition was loop-independent. However, after loop fusion, this partition is loop-carried and is the distance vector set from statements in Θ1 to statements in Θ2.
4. D^(x)_{Θ2,Θ1}: before the fusion transformation, this partition was loop-independent. However, after loop fusion, this partition is loop-carried and is the distance vector set from statements in Θ2 to statements in Θ1.

As an example, the dependence vector sets for the initial source code in Figure 5.4(a) are D^(x-1)_{Θ1,Θ1} = {(0)} and D^(x-1)_{Θ2,Θ2} = {(0)}, where Θ1 = (S1) and Θ2 = (S2). Note that there is also a loop-independent dependence from S2 to S1 caused by variable A[i]. After loop fusion is applied, as shown in Figure 5.4(b), the loop-independent dependences from S2 to S1 are now loop-carried (D^(x)_{Θ1,Θ2} = ∅ and D^(x)_{Θ2,Θ1} = {(1)}).

Wolfe [14] describes the legality of loop fusion based upon dependences between statements in the first loop nest Θ1 and the second loop nest Θ2. Zima [15] uses a similar, although more specific, definition, called serial-fusion preventing, to describe an illegal condition for applying loop fusion. Both discussions state that loop fusion is illegal if D^(x)_{Θ2,Θ1} ≠ ∅. In other words, if after the loop fusion transformation there exist any loop-carried dependences from the second to the first code block, then loop fusion is illegal. According to this legality criterion, the source code shown in Figure 5.4(a) cannot be fused while still maintaining semantic equivalence to the original source code. For illustrative purposes, the source code in Figure 5.4(b) shows the fused code, and, as can be seen, it is not semantically equivalent to the initial code because of the dependence from S2 to S1 caused by array A. However, if the loop is reversed after loop fusion is applied, as shown in Figure 5.4(c), then the transformed source code is semantically equivalent to the initial code.

This example contradicts the legality criteria proposed by Wolfe and Zima. That is, we are able to apply loop fusion to code that initially appeared non-fusible. From studying and exploiting properties of the dependence vector set, we developed a new strategy for applying loop fusion. The approach for determining the legal application of loop fusion is as follows.

L1  for I1 = 1, n
S1    A[I1] = A[I1] + C[I1]
L1' for I1 = 1, n
S2    D[I1] = A[I1 + 1] + D[I1]

Figure 5.4(a). Sample code: before the loop fusion transformation.

L1  for I1 = 1, n
S1    A[I1] = A[I1] + C[I1]
S2    D[I1] = A[I1 + 1] + D[I1]

Figure 5.4(b). Sample code: loop fusion results in non-equivalent source code.

L1  for I1 = -n, -1
S1    A[-I1] = A[-I1] + C[-I1]
S2    D[-I1] = A[-I1 + 1] + D[-I1]

Figure 5.4(c). Sample code: loop fusion followed by loop reversal results in equivalent source code.

It is understood that distance vectors are initially lexicographically positive and define the lexicographic execution order of the assignment statements. In the initial source code (Figure 5.4(a)), all iterations of statement S1 complete execution before any iterations of statement S2 begin execution. Clearly, the source code after loop fusion must preserve any dependences between S1 and S2. The distance vectors in D^(x)_{Θ2,Θ1} must reflect the execution order of Θ1 before Θ2. Therefore, the set D^(x)_{Θ2,Θ1} is negated when D^(x) is determined after the loop fusion transformation. That is, the distance vector set D^(x) is:

  D^(x) = D^(x-1)_{Θ1,Θ1} ∪ D^(x-1)_{Θ2,Θ2} ∪ D^(x)_{Θ1,Θ2} ∪ -(D^(x)_{Θ2,Θ1}).

As an example, the distance vector set for the source code in Figure 5.4(b) is:

  D^(x) = {(0)} ∪ {(0)} ∪ ∅ ∪ -{(1)} = {(0), (-1)}.
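The union with negation is mechanical. A minimal sketch (Python; names ours) reproduces the Figure 5.4(b) computation and flags the vector that violates the global postcondition:

    def fused_D(d11, d22, d12, d21):
        """D after fusion: the four partitions, with the Theta2 -> Theta1 set negated."""
        negate = lambda d: tuple(-x for x in d)
        return d11 | d22 | d12 | {negate(d) for d in d21}

    def lex_positive(d):
        """First nonzero entry is positive."""
        for x in d:
            if x != 0:
                return x > 0
        return False

    D = fused_D({(0,)}, {(0,)}, set(), {(1,)})
    print(sorted(D))               # [(-1,), (0,)]
    # (0,) is loop-independent; (-1,) is lexicographically negative, so a reversal
    # must follow the fusion, as in Figure 5.4(c):
    print(lex_positive((-1,)))     # False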
Determining D^(x) after loop fusion may result in lexicographically non-positive dependence vectors in -(D^(x)_{Θ2,Θ1}), as in the above case, where -(D^(x)_{Θ2,Θ1}) = -{(1)}. Even though some distance vectors may not be lexicographically positive during intermediate transformations, the global postcondition must be true when all transformations are complete. That is, all of the final, transformed distance vectors must be lexicographically positive. The reversal transformation is applied after loop fusion in the source code in Figure 5.4(c), which makes all of the distance vectors in D^(x) lexicographically positive.

A less trivial example is illustrated by the source code in Figure 5.5(a), which has only one distance vector (D^(x)_{Θ2,Θ1} = {(1, -1)}) after loop fusion is applied (D^(x-1)_{Θ1,Θ1}, D^(x-1)_{Θ2,Θ2}, and D^(x)_{Θ1,Θ2} are all empty). Reversal of the signs of the distances in D^(x)_{Θ2,Θ1} results in a distance vector set that reflects the execution order of Θ1 before Θ2:

  D^(x) = -(D^(x)_{Θ2,Θ1}) = -{(1, -1)} = {(-1, 1)}.

The distance vector in D^(x) is not lexicographically positive, and therefore the global postcondition is false. Two possible transformations that make the distance vector set lexicographically positive are: 1) reverse the outermost loop using the transformation matrix

  T = [ -1  0 ]
      [  0  1 ]

or 2) interchange the inner and outer loop using the transformation matrix

  T = [ 0  1 ]
      [ 1  0 ]

Loop reversal applied after loop fusion is shown in Figure 5.5(b), and loop interchange applied after loop fusion is shown in Figure 5.5(c). Both cases produce semantically equivalent code with respect to the initial source code. Loop reversal results in D^(x) = {(1, 1)}, and the transformed source code is semantically equivalent to the initial source code because the transformed distance vector set is lexicographically positive. Loop permutation results in D^(x) = {(1, -1)}; again, the distance vector set is lexicographically positive, and the transformation sequence is legal.

L1  for I1 = 1, 3
L2    for I2 = 1, 3
S1      A[I1, I2] = ...
L1' for I1 = 1, 3
L2'   for I2 = 1, 3
S2      ... = A[I1 + 1, I2 - 1]

Figure 5.5(a). Sample code: before the loop fusion transformation.

L1  for I1 = -3, -1
L2    for I2 = 1, 3
S1      A[-I1, I2] = ...
S2      ... = A[-I1 + 1, I2 - 1]

Figure 5.5(b). Sample code: loop fusion and loop reversal.

L2  for I2 = 1, 3
L1    for I1 = 1, 3
S1      A[I1, I2] = ...
S2      ... = A[I1 + 1, I2 - 1]

Figure 5.5(c). Sample code: loop fusion and loop interchange.

The above discussion shows that the legality criteria for loop fusion need not be based on the dependence vector set between the two fused blocks of code. Instead, the legality criterion for loop fusion is based strictly upon the predicate can_adjust(B_Θ1, B_Θ2). As with all unimodular transformation sequences, all d ∈ D^(x) must be lexicographically positive upon completion of the transformation sequence. In summary, the legality criteria for loop fusion are:

Preconditions
  Invariant:  (none)
  Explicit:   can_adjust(B_Θ1, B_Θ2)
  Nesting:    (none)

5.2.2 Effect on the Dependence Vector Set D

Modifications to the distance vector set D^(x) were discussed above in the context of loop fusion legality. There are four partitions to the distance vector set after loop fusion: D^(x-1)_{Θ1,Θ1}, D^(x-1)_{Θ2,Θ2}, D^(x)_{Θ1,Θ2}, and D^(x)_{Θ2,Θ1}. The distance vector set after loop fusion is the union of these four partitions with D^(x)_{Θ2,Θ1} negated. That is,

  D^(x) = D^(x-1)_{Θ1,Θ1} ∪ D^(x-1)_{Θ2,Θ2} ∪ D^(x)_{Θ1,Θ2} ∪ -(D^(x)_{Θ2,Θ1}).
5.2.3 Effect on Source Code Mapping

The composability property states that any unimodular transformation can be applied at any time. Therefore, the loop nest Θ1 may have a sequence of transformations applied to it prior to the loop fusion transformation (T_Θ1). The loop nest Θ2 may also have a sequence of transformations applied to it prior to loop fusion (T_Θ2). Also, the loop bound sets B_Θ1 and B_Θ2 may need "adjustment". In any case, assume that after loop fusion, the statements originally from Θ1 have a different transformation matrix associated with them (T_Θ1) than the statements originally from Θ2 (T_Θ2). The implementation of loop fusion must account for both T_Θ1 and T_Θ2 in the transformed source code. In other words, we have two transformation matrices for one loop nest after the loop fusion transformation. The final source code mapping proceeds in two steps. The assignment statements that were initially in Θ1 are mapped using T_Θ1. Similarly, the assignment statements that were initially in Θ2 are mapped using T_Θ2.

As an example, the source code in Figure 5.6(a) cannot be fused in its current form because the loop bounds for the inner loops do not exactly match (loop L2' is the reversal of loop L2). However, can_adjust(B_Θ1, B_Θ2) is true, because the outer loop and inner loop of Θ1 have the same number of iterations as the outer loop and inner loop of Θ2, respectively. If reversal is applied to the inner loop of Θ2, then the source code in Figure 5.6(b) is obtained. The loops can then be fused as shown in Figure 5.6(c), resulting in semantically equivalent source code. The transformation matrix for Θ1 is:

  T_Θ1 = [ 1  0 ]
         [ 0  1 ]

where Θ1 = (S1), and the transformation matrix for Θ2 is:

  T_Θ2 = [ 1   0 ]
         [ 0  -1 ]

where Θ2 = (S2) and T_Θ2 reflects the reversal of the inner loop. After loop fusion, T_Θ1 remains associated with Θ1 and is used to determine the form of the assignment statements in Θ1. Similarly, T_Θ2 remains associated with Θ2 and is used to determine the form of the assignment statements in Θ2. Finally, either T_Θ1 or T_Θ2 is used to map the loop bounds. That is, mapping the loop bounds for Θ1 using T_Θ1 results in the same loop bounds as mapping the loop bounds for Θ2 using T_Θ2.

L1  for I1 = 1, m
L2    for I2 = 1, n
S1      A[I1] = ...
L1' for I3 = 1, m
L2'   for I4 = -n, -1
S2      ... = A[-I4 - 1]

Figure 5.6(a). Sample code: before adjustment.

L1  for I1 = 1, m
L2    for I2 = 1, n
S1      A[I1] = ...
L1' for I3 = 1, m
L2'   for I4 = 1, n
S2      ... = A[I4 - 1]

Figure 5.6(b). Sample code: after adjustment.

L1  for I1 = 1, m
L2    for I2 = 1, n
S1      A[I1] = ...
S2      ... = A[I2 - 1]

Figure 5.6(c). Sample code: after adjustment and loop fusion.

In summary, the transformation matrix for the assignment statements originally in Θ1 is T^(x)_{Θ1}, and the transformation matrix for the assignment statements originally in Θ2 is T^(x)_{Θ2}:

  Θ1 statements: T^(x)_{Θ1} = T^(x-1)_{Θ1}
  Θ2 statements: T^(x)_{Θ2} = T^(x-1)_{Θ2}
CHAPTER 6

Integration of the Extended Loop Transformations

Algorithms that use the kernel and extended sets of transformations are broadly classified as either machine-independent or machine-dependent. Machine-independent algorithms determine a transformation matrix T that meets some general goal, such as parallelism or data locality. Machine-dependent algorithms determine a transformation matrix T that optimizes source code for a particular architecture, such as a vector or shared-memory machine.

The kernel unimodular transformations are currently used to obtain parallelism, and in some cases data locality, on source code in perfect loop nests. In order to provide context and motivation for developing algorithms to integrate the extended loop transformations, this chapter begins by reviewing the existing algorithms that use the kernel unimodular transformations to obtain coarse- and fine-grain parallelism. Next, new algorithms that we developed are described that use the kernel and extended sets of transformations to obtain new goals or that enable the application of transformations to a broader range of source code structures. First, the wavefront algorithm that achieves fine-grain parallelism is modified so that it is applicable to imperfect-only-child loop nests, in addition to perfect loop nests. Second, the extended transformations are used to expand the machine-independent goals: first, by finding parallelism where the kernel unimodular transformations fail; and second, by obtaining data locality within the matrix-based framework. Finally, the extended transformations are used to obtain machine-dependent optimization. Specifically, two algorithms are developed for source code executing on vector and shared-memory machines. All of the algorithms that use the extended set of loop transformations can be applied to source code with both perfect and imperfect-only-child loop nests.

Both Banerjee [3], and Wolf and Lam [4], have developed algorithms that yield coarse-grain parallelism in some cases, and fine-grain parallelism in all cases of perfect loop nests. Using their algorithms, coarse-grain parallelism may not be obtainable with the kernel unimodular transformations. However, we have developed an algorithm that uses the extended set of transformations and may allow parallelization of the outermost loops in cases where the kernel set of transformations fails. Similarly, fine-grain parallelism refers to the parallelism of the innermost loops of a perfect, n-deep loop nest. Wolf and Lam show that n - 1 degrees of parallelism can always be obtained from an n-deep perfect loop nest. A new algorithm is discussed that is a modification of the original algorithm for fine-grain parallelism and works on imperfect loop nests. Specifically, if assignment statements are nested at q unique depths in a loop nest, then the algorithm finds a minimum of n - q degrees of parallelism. As examples, q is one for a perfect loop nest because all statements are nested at one level, and q is greater than one for any imperfect loop nest.

Data reuse is defined as using the same datum in different iterations of a loop nest. Data locality occurs if reused data remains in the same memory hierarchy level between uses. An algorithm is presented that improves data locality using the transformations from the extended set, namely loop fission and fusion.

Next, two architecture-specific algorithms are introduced that optimize source code using the extended set of loop transformations. Specifically, an algorithm is given that optimizes source code for a vector machine using loop normalization, fission, permutation, and blocking. Finally, an algorithm for coarse-grain parallelism on a shared-memory machine is discussed, which uses loop normalization, fusion, permutation, and collapsing. The objective of the vector machine algorithm is to minimize the work performed in each loop and move the parallel loops innermost. In contrast, the objective of the shared-memory machine algorithm is to maximize the work performed in each loop and move the parallel loops outermost. The development of these algorithms shows that the extended set of loop transformations can be composed to achieve both machine-independent and machine-dependent compiler goals and can be applied to a wide variety of source code structures.
6.1 Previously Developed Algorithms

This section discusses two existing algorithms that are applied to perfect loop nests. The first algorithm obtains coarse-grain parallelism on a doubly-nested loop, but can be generalized to an n-nested loop. The second algorithm obtains fine-grain parallelism on an n-nested loop.

Loop L_i can be parallelized if all of the distance vectors d in the distance vector set D have specific properties. If each distance vector d is in the form (d1, d2, ..., dn), where n is the loop nest depth and d ≻ 0 means that distance vector d is lexicographically positive, then loop L_i can be parallelized if:

  (∀d : d ∈ D : ((d1, ..., d_{i-1}) ≻ 0) ∨ (d_i = 0)).   (6.1)

That is, loop L_i can be parallelized if each partial distance vector preceding location i is lexicographically positive, or each distance at location i is 0. In the following discussion, disj1 refers to the first disjunct of Expression (6.1), ((d1, ..., d_{i-1}) ≻ 0), and disj2 refers to the second disjunct of Expression (6.1) (d_i = 0). As an example, suppose that D contains only one distance vector: D = {(0, 2, -2)}. The following table shows the parallelism in the source code based upon disj1 and disj2:

        d_i = 0   (d1, ..., d_{i-1}) ≻ 0   Parallel
  L1    yes       not applicable            yes
  L2    no        no                        no
  L3    no        yes                       yes
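Expression (6.1) is directly checkable. A minimal sketch (Python; names ours) reproduces the table above for D = {(0, 2, -2)}:

    def lex_positive(v):
        for x in v:
            if x != 0:
                return x > 0
        return False

    def parallelizable(D, i):
        """Expression (6.1): every d in D has a lexicographically positive prefix
        (d_1, ..., d_{i-1}) (disj1) or d_i = 0 (disj2)."""
        return all(lex_positive(d[:i - 1]) or d[i - 1] == 0 for d in D)

    D = {(0, 2, -2)}
    print([parallelizable(D, i) for i in (1, 2, 3)])   # [True, False, True]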
Input: D /**the distance vector set”/ Output: T /"‘*the 2 x 2 transformation matrir"/ Procedure: 1. Move the loop with the highest iteration count to the outermost position using permutation. 2. Determine if all J E D are in basis form a(a1, a2), where a is a positive integer and (a1 , a2) is lexicographically positive. If all of d E D cannot be placed in this form, then terminate. Else, continue. 3. Find T[2, l],T[2, 2] and g = GCD(a1,a2) such that g = a; x T[2, 1] + a2 x T[2, 2], where T[l, l] = a2/g and T[l, 2] = -a1/g. Figure 6.1. Algorithm to obtain coarse-grain parallelism using the kernel transformations of perfect loop nests. Note that a transformation matrix T may not exist that results in coarse-grain paral- lelism. That is, Banerjee’s algorithm terminates (Step 2) if the distance vectors cannot be placed in basis form. The extended algorithm discussed in Section 6.4 shows how to increase the chances of finding a basis form for all distance vectors in the distance vector set. 76 6.1.2 Fine-grain parallelism for perfect loop nests using the kernel trans- formation set This section describes an algorithm that parallelizes the inner n — 1 loops of a n-deep, perfect loop nest using the kernel set of loop transformations. The approach is known as the wavefront transformation [4, 24], and is named kerneLfine in this dissertation. Two important proofs developed by Wolf and Lam [4] are critical to the applicability of the wavefront transformation. The first shows that any lexicographically positive distance vector can be made fully permutable. The second shows that any perfect, n-deep loop nest with a fully permutable distance vector set D can be transformed to have n — 1 degrees of parallelism. The proofs are paraphrased below, and the reader is referred to Wolf and Lam [4] for more details. All distance vectors in the initial distance vector set are lexicographically positive, where a lexicographically positive J has at least one positive distance (1),. The second disjunct of Expression (2.1) (definition of fully permutable) requires that distances d,- through (I, must be non-negative, if loops L; through L,- are fully permutable. Any negative distance in the range d,- through d,- can be made non-negative via skewing by the positive distance d2. That is, any negative distances between loops L,- and L,- can be made non-negative satisfying the second disjunct of Expression (2.1). Step 1 of Algorithm kernel-fine transforms a lexicographically positive distance vector set D into a fully permutable distance vector set D“). Transforming a lexicographically positive distance vector set into a fully permutable distance vector set involves skewing the inner loops by the positive distance d], in each distance vector d: As an example, the transformation matrix Tm for a 4-deep loop nest is: 1 0 0 #2,1 1 0 #3,1 [13,2 1 #44 #4,: #4,3 p-acoc where 11.3,,- = [max(-d,/d.-)], such that d,- > 0 and JG D. 77 After transforming the distance vectors to fully permutable form, loops L2 through L, are parallelized by satisfying disjl from Expression (6.1) (definition of parallelism). That is, loops L2 through L. are parallelized by ensuring that the distances associated with loop L1 are strictly positive (VJ: J6 D : d1 > 0). In Expression (6.1), dist states that loop L,- can be parallelized if (VJ: J6 D : (d1, . . . ,d.-_1) >- 6). In an n-nested loop, if we show that the first distance is greater than zero (d1 > 0), then disjl is satisfied. 
In fully permutable form, all distances in a distance vector set are non-negative. The wavefront transformation skews the innermost loop $L_n$ of an n-nested loop by each of the outer loops L1 through $L_{n-1}$. Therefore, all distances $d_n$ associated with the innermost loop are strictly positive. As a final step in the wavefront transformation, the innermost loop is moved outermost. Thus, rather than $d_n$ being positive for all $\vec{d} \in D$, $d_1$ is positive for all $\vec{d} \in D$, which satisfies disj1 from Expression (6.1).

The wavefront transformation matrix T skews the innermost loop by each of the outer loops and then moves the innermost loop to the outermost position. The wavefront matrix for a 4-deep loop is of the form:

$$T = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

Algorithm kernel_fine, given in Figure 6.2, obtains n - 1 degrees of parallelism from an n-deep, perfect loop nest. Step 1 transforms all lexicographically positive distance vectors in the distance vector set so that they are fully permutable. Step 2 skews the innermost loop by each of the outer loops. Finally, Step 3 moves the innermost loop to the outermost position using loop permutation.

Algorithm kernel_fine
Finds an n x n transformation matrix T that parallelizes the innermost n - 1 loops of an n-deep perfect loop nest.
Input:  D  /**the initial distance vector set**/
Output: T  /**the n x n transformation matrix**/
        D' /**the transformed distance vector set**/
Procedure:
1. /**Make all d ∈ D fully permutable by skewing**/
   For all i, 2 ≤ i ≤ n
     For all j, 1 ≤ j < i
       skew(L_i, L_j, max⌈-d_i/d_j⌉), such that d_j > 0
     Endfor
   Endfor
2. /**Skew the innermost loop L_n by each of the outer loops**/
   For all h, 1 ≤ h < n
     skew(L_n, L_h, 1)
   Endfor
3. /**Permute the innermost loop L_n to the outermost position**/
   permute(L_1, L_n)

Figure 6.2. Algorithm to obtain fine-grain parallelism using the kernel transformations on perfect loop nests.
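As an illustration, the following Python sketch builds the Step 1 skewing transformation and the wavefront matrix of Steps 2 and 3 for an n-deep nest. It is a sketch under the assumption that distance vectors are integer tuples; the helper names are ours.

```python
import math

def skew_to_fully_permutable(D, n):
    """Step 1 of Figure 6.2: lower-triangular skewing matrix that makes
    every distance vector in D non-negative in all positions."""
    T = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    cur = [list(d) for d in D]          # running copy: later skews see earlier ones
    for i in range(1, n):               # inner loop position (0-based)
        for j in range(i):              # skew by an outer loop
            if any(d[i] < 0 and d[j] > 0 for d in cur):
                f = max(math.ceil(-d[i] / d[j]) for d in cur if d[j] > 0)
                T[i][j] = f
                for d in cur:
                    d[i] += f * d[j]
    return T, [tuple(d) for d in cur]

def wavefront_matrix(n):
    """Steps 2-3 of Figure 6.2: skew L_n by every outer loop, then
    rotate L_n to the outermost position."""
    rows = [[1] * n]                    # new outer loop: the wavefront
    for i in range(n - 1):
        rows.append([1 if j == i else 0 for j in range(n)])
    return rows

def apply(T, D):
    return [tuple(sum(T[r][c] * d[c] for c in range(len(d)))
                  for r in range(len(T))) for d in D]

S, D1 = skew_to_fully_permutable([(0, 2, -2)], 3)   # D1 == [(0, 2, 0)]
D2 = apply(wavefront_matrix(3), D1)                 # D2 == [(2, 0, 2)]: d1 > 0
```

After the wavefront, the first distance is positive for every vector, so disj1 of Expression (6.1) holds for the inner n - 1 loops.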
6.2 New Algorithms

This section introduces several new algorithms that make use of the extended and kernel sets of matrix-based loop transformations to address different objectives of a parallel compiler. The first algorithm is a modified version of the wavefront transformation and can be applied to imperfect loop nests. The second algorithm achieves coarse-grain parallelism in cases where the kernel unimodular transformations cannot, and uses loop fission to reduce the cardinality of the dependence vector set D. The dependence vectors in D form the constraints on possible coarse-grain parallelism. Thus, reducing the number of constraints (i.e., reducing the cardinality of D) results in more potential parallelism.

Next, an algorithm is presented that obtains data locality. Data locality has not previously been addressed with respect to matrix transformations. However, loop blocking was used by Wolf [24] to obtain data locality external to the matrix-based framework. We use loop fission and fusion to obtain locality within the matrix-based model.

Finally, two architecture-specific algorithms are given that transform source code for a given machine. Specifically, an algorithm for vector machines is discussed, as is an algorithm for a shared-memory system.

6.3 Fine-grain parallelism for imperfect-only-child loop nests using the kernel transformation set

This section describes a modification to Algorithm kernel_fine that allows it to be applied to imperfect loop nests. The algorithm, named mod_kernel_fine, applies the wavefront transformation to partitions or regions of the nested loop. We show that if m statements (S1, ..., Sm) are nested at q unique levels in an n-deep loop, then the source code can be transformed such that it has a minimum of n - q degrees of parallelism.

As an example, the source code shown in Figure 6.3 has an assignment statement S1 nested i-deep, and several assignment statements (S2, ..., Sm) nested n-deep. Therefore, there are two unique levels at which assignment statements are nested (q = 2), and the source code can be transformed such that n - q = n - 2 degrees of parallelism result.

L1    for I1 = ...
L2      for I2 = ...
...
Li          for Ii = ...
S1            A[I1,...,Ii] = ...
Lj            for Ij = ...
Lj+1            for Ij+1 = ...
...
Ln                  for In = ...
S2                    B[I1,...,In] = ...
...
Sm                    C[I1,...,In] = ...

Figure 6.3. Sample code: n-deep imperfect-only-child loop nest.

Algorithm mod_kernel_fine, given in Figure 6.4, is similar to Algorithm kernel_fine except that it parallelizes within regions that are defined by the imperfectly nested assignment statements. Using the source code shown in Figure 6.3 as an example, Algorithm mod_kernel_fine first parallelizes loops L1 through Li using the wavefront transformation. Next, loops Lj through Ln are parallelized using the wavefront transformation. The source code is parallelized in regions because the legality condition of loop permutation does not allow interchange of loops that are not perfectly nested within each other. Any two loops in the range L1 through Li can be interchanged as needed, as can any two loops in the range Lj through Ln. However, loops cannot be permuted between the two ranges.

The first step of Algorithm mod_kernel_fine is to transform all distance vectors in the dependence vector set into fully permutable form using loop skewing. Loop skewing is legal regardless of the nesting structure of the source code. That is, loop skewing can be legally applied across imperfectly nested loops.

A significant result of the skewing transformation is that any partial distance vector is also fully permutable. As an example, if the distance vector $\vec{d} = (d_1, \ldots, d_n)$ is fully permutable, then so is any $(d_i, \ldots, d_j)$, where $1 \le i \le j \le n$. All of the distances are non-negative in a fully permutable distance vector, and therefore, any partial distance vector $(d_i, \ldots, d_j)$ is either the zero vector or lexicographically positive. If the distances in a region are all zero ($(d_i, \ldots, d_j) = \vec{0}$), then all loops $L_i$ through $L_j$ can be parallelized by disj2 of Expression (6.1). If the partial distance vectors are lexicographically positive, then the inner loop in the region can be skewed by each of the outer loops in the region and moved outermost. This strategy is similar to the wavefront transformation; however, it is applied across a region of an n-deep loop nest bounded by nested assignment statements, rather than all n loops.

The motive for the application of region wavefronting is to maximize parallelism despite the constraints imposed by the legality of loop permutation. That is, loops are permuted only within a region. Maximal parallelism within an n-deep imperfect-only-child loop nest is guaranteed because maximum parallelism is obtained within each region bound by an imperfectly nested assignment statement.
Algorithm mod_kernel_fine
Finds an n x n transformation matrix T that parallelizes a minimum of n - q loops of an n-deep imperfect-only-child loop nest, where q is the number of unique depths at which assignment statements are nested.
Input:  D           /**the initial distance vector set**/
        depth[1..q] /**ordered set of unique depths at which assignment statements are nested**/
Output: T           /**the n x n transformation matrix**/
        D'          /**the transformed distance vector set**/
Procedure:
1. /**Make all d ∈ D fully permutable by skewing**/
   For all i, 2 ≤ i ≤ n
     For all j, 1 ≤ j < i
       skew(L_i, L_j, max⌈-d_i/d_j⌉), such that d_j > 0
     Endfor
   Endfor
2. (A) /**Initialize the counters**/
       depth_index = 1
       start_depth = 1
   (B) /**For each partition of the loop:**/
       Repeat
         end_depth = depth[depth_index]
   (C)   /**Skew within regions**/
         For all h, start_depth ≤ h < end_depth
           skew(L_end_depth, L_h, 1)
         Endfor
   (D)   /**Permute within regions**/
         permute(L_start_depth, L_end_depth)
   (E)   /**Increment the counters**/
         depth_index++
         start_depth = end_depth + 1
       Until depth_index > q

Figure 6.4. Algorithm to obtain fine-grain parallelism using the kernel transformations on imperfect-only-child loop nests.

As an example, Algorithm mod_kernel_fine is applied to the source code given in Figure 6.5. The initial distance vector set for the source code is D = {(0,0,1,-1), (0,0,0,1)}. Assignment statement S1 is nested 2-deep and assignment statements S2 through S5 are nested 4-deep. An array named depth stores the ordered set of unique depths at which assignment statements are nested. Thus, q = 2 and depth[1..q] = [2,4].

L1    for I1 = 1, ub1
L2      for I2 = 1, ub2
S1        A[I1,I2] = ...
L3        for I3 = 1, ub3
L4          for I4 = 1, ub4
S2            B[I1,I2,I3,I4] = ...
S3            B[I1,I2,I3-1,I4+1] = ...
S4            C[I1,I2,I3,I4] = ...
S5            C[I1,I2,I3,I4-1] = ...
/* *D = {(0,0,1,-1), (0,0,0,1)}* */

Figure 6.5. Sample code: initial code before application of mod_kernel_fine.

Step 1 transforms the distance vector set such that all $\vec{d} \in D$ are fully permutable. Specifically, loop L4 is skewed by loop L3, and the resulting distance vector set is $D^{(1)}$ = {(0,0,1,0), (0,0,0,1)}. Note that all of the distances in the transformed distance vector set are non-negative.

Step 2(A) initializes the value for indexing the array depth, and Step 2(B) iterates through depth. Each region of the loop nest, defined by the depths of the assignment statements, is traversed. The first region to which the transformation is applied is loops L1 through L2. That is, end_depth = depth[1] = 2. The next iteration of Step 2(B) processes loops L3 through L4. That is, start_depth = end_depth + 1 = 3 and end_depth = depth[2] = 4.

Steps 2(C) and 2(D) are equivalent to Steps 2 and 3 from Algorithm kernel_fine, respectively, except that the wavefront transformation is applied to regions of the n-deep loop nest, rather than all n loops. The skewing transformation (Step 2(C)) skews loop L2 by loop L1. However, since d1 and d2 are zero for both distance vectors, the resulting distance vector set is $D^{(2(C))}$ = {(0,0,1,0), (0,0,0,1)}. Next, loop L4 is skewed by loop L3, resulting in $D^{(2(C))}$ = {(0,0,1,1), (0,0,0,1)}. Finally, loops L3 and L4 are permuted, resulting in $D^{(2(D))}$ = {(0,0,1,1), (0,0,1,0)}. Loops L1, L2, and L4 are parallelized using the definition of parallelism given in Expression (6.1), as shown in Figure 6.6.

L1    forall J1 = 2, ub2 + ub1
L2      forall J2 = max(1, J1 - ub2), min(ub1, J1 - 1)
S1        A[J2, J1 - J2] = ...
L3        for J3 = 3, ub4 + 2ub3
L4          forall J4 = max(1, ⌈(J3 - ub4)/2⌉), min(ub3, ⌊(J3 - 1)/2⌋)
S2            B[J2, J1 - J2, J4, J3 - 2J4] = ...
S3            B[J2, J1 - J2, J4 - 1, J3 - 2J4 + 1] = ...
S4            C[J2, J1 - J2, J4, J3 - 2J4] = ...
S5            C[J2, J1 - J2, J4, J3 - 2J4 - 1] = ...
/* *D' = {(0,0,1,1), (0,0,1,0)}* */

Figure 6.6. Sample code: transformed code after application of mod_kernel_fine.
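The region bookkeeping of Steps 2(A) through 2(E) is easy to trace in code. The following Python sketch applies the wavefront region by region to a distance vector set that has already been made fully permutable (Step 1); the function name and data layout are ours.

```python
def region_wavefront(D, depth):
    """Figure 6.4, Step 2: wavefront each region [start_depth, end_depth]
    delimited by the unique nesting depths of the assignment statements."""
    cur = [list(d) for d in D]
    start = 1                                    # 1-based loop positions
    for end in depth:                            # e.g. depth = [2, 4]
        s, e = start - 1, end - 1                # 0-based
        for d in cur:
            d[e] += sum(d[h] for h in range(s, e))   # 2(C): skew region's inner loop
        for d in cur:
            d.insert(s, d.pop(e))                # 2(D): move it outermost in the region
        start = end + 1                          # 2(E)
    return [tuple(d) for d in cur]

# Figure 6.5 example, after Step 1: D(1) = {(0,0,1,0), (0,0,0,1)}
print(region_wavefront([(0, 0, 1, 0), (0, 0, 0, 1)], [2, 4]))
# -> [(0, 0, 1, 1), (0, 0, 1, 0)], matching D(2(D)) above
```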
6.4 Coarse-grain parallelism for imperfect-only-child loop nests using the extended transformation set

Figure 6.7 gives an algorithm, named ext_coarse, that uses loop fission within a matrix-based framework to obtain parallelism where the kernel unimodular transformations alone would not. In addition to using loop fission, the permutation step of Algorithm kernel_coarse is modified in Algorithm ext_coarse. The modification is due to the illegality of loop permutation for imperfectly nested loops. Specifically, if there are any statements between loops L1 and L2, then loops L1 and L2 cannot be permuted. However, if there are no statements between loops L1 and L2, then the loop with the highest iteration count is moved outermost.

Algorithm kernel_coarse (Figure 6.1) terminated if any $\vec{d} \in D$ did not have the same basis $a(a_1, a_2)$. The distance vectors that cannot be placed in basis form are further examined in Algorithm ext_coarse. If possible, the statements causing the non-basis distances are separated into adjacent Ψ-blocks by loop fission. Chapter 5 showed that strongly connected components (SCCs) of the dependence graph cannot be divided by loop fission. Consistent with that constraint, Algorithm ext_coarse does not divide SCCs of the dependence graph. However, if the non-basis distances are not in a strongly connected component, then the statements causing the distances are placed in separate Ψ-blocks. Thus, the distance vectors preventing parallelism that were originally in the distance vector set D become loop-independent and are consequently ignored. Loop fission can be applied to both perfect and imperfect loop nests, and thus, an algorithm that uses loop fission can be applied to both perfect and imperfect loop nests.

The source code given in Figure 6.8 cannot be parallelized using Algorithm kernel_coarse, but can be parallelized using Algorithm ext_coarse. We want to parallelize L1 because its iteration count is significantly greater than the iteration count of L2. The strongly connected components of the dependence graph are {(S1, S2), (S3), (S4)}. The distance vector (0,3) can be placed in basis form 3(0,1), where a = 3 and the basis is (0,1). The distance vector (0,4) can also be placed in basis form 4(0,1), where a = 4. However, the distance vector (1,2) cannot be placed in basis form. That is, there is no positive integer a that, when multiplied by (0,1), equals (1,2). Therefore, the distance vector (1,2) causes termination of Algorithm kernel_coarse. Algorithm ext_coarse determines that S3 and S4 are not in the same strongly connected component, and the statements are placed in adjacent loop nests by loop fission, as shown in Figure 6.9. The outer loops of both Ψ-blocks are parallelized in the source code in Figure 6.9 because no dependence prevents it.
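The basis-form test at the heart of Algorithms kernel_coarse and ext_coarse is mechanical. The following Python sketch checks whether a distance vector set shares a single basis and, if not, reports the offending vectors that ext_coarse would try to split off by fission; it assumes the caller knows (from the dependence graph) which statements and SCCs produced each vector, and the helper names are ours.

```python
from math import gcd
from collections import Counter

def basis_of(d):
    """Write d = a*(a1, a2), a a positive integer, (a1, a2)
    lexicographically positive; returns (a1, a2), or None for (0, 0)."""
    a = gcd(abs(d[0]), abs(d[1]))
    return None if a == 0 else (d[0] // a, d[1] // a)

def split_by_basis(D):
    """Partition D into vectors sharing the most common basis and the
    non-basis vectors that block Algorithm kernel_coarse (Figure 6.1)."""
    bases = {d: basis_of(d) for d in D}
    common, _ = Counter(b for b in bases.values() if b).most_common(1)[0]
    good = [d for d in D if bases[d] == common]
    bad = [d for d in D if bases[d] not in (common, None)]
    return common, good, bad

# Figure 6.8 example: (1,2) cannot share the basis (0,1) of (0,3) and (0,4).
basis, good, bad = split_by_basis([(0, 3), (0, 4), (1, 2)])
print(basis, bad)   # (0, 1) [(1, 2)] -> candidate for loop fission
```

If the statements producing a vector in `bad` lie in different SCCs, Step 2(B) of Figure 6.7 separates them by fission and the vector becomes loop-independent; if they lie in the same SCC, the algorithm terminates.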
Algorithm ext_coarse
Finds a 2 x 2 transformation matrix T that parallelizes the outermost loop of a doubly-nested loop, if one exists.
Input:  D       /**the distance vector set**/
        SCC(D)  /**strongly connected components of D**/
Output: T(Ψ_i)  /**the 2 x 2 transformation matrix**/
Procedure:
1. If there are no statements between loops L1 and L2, then move the loop with the highest iteration count to the outermost position using permutation.
2. Determine if all $\vec{d} \in D$ can be put in basis form $a(a_1, a_2)$, where a is a positive integer and $(a_1, a_2)$ is lexicographically positive.
   (A) If any $\vec{d} \in D$ in non-basis form is caused by statements in the same strongly connected component, then terminate;
   (B) If any $\vec{d} \in D$ in non-basis form is caused by statements in different strongly connected components, then place the statements in separate Ψ-blocks;
   (C) If all $\vec{d} \in D$ are in basis form, then continue.
3. For each Ψ-block, find T[2,1], T[2,2], and $g = \gcd(a_1, a_2)$ such that $g = a_1 \times T[2,1] + a_2 \times T[2,2]$, where $T[1,1] = a_2/g$ and $T[1,2] = -a_1/g$.

Figure 6.7. Algorithm to obtain coarse-grain parallelism using the extended transformations on imperfect-only-child loop nests.

L1    for I1 = 1, 1000
L2      for I2 = 1, 6
S1        A[I1,I2] = B[I1,I2 - 3]
S2        B[I1,I2] = A[I1,I2 - 4]
S3        C[I1,I2] = ...
S4        C[I1-1,I2-2] = ...
/* *D = {(0,3), (0,4), (1,2)}* */

Figure 6.8. Sample code: parallelization not allowed by the kernel unimodular transformations.

L1    forall I1 = 1, 1000
L2      for I2 = 1, 6
S1        A[I1,I2] = B[I1,I2 - 3]
S2        B[I1,I2] = A[I1,I2 - 4]
S3        C[I1,I2] = ...
L3    forall I1 = 1, 1000
L4      for I2 = 1, 6
S4        C[I1-1,I2-2] = ...

Figure 6.9. Sample code: parallelization after loop fission.

6.5 Data locality for imperfect-only-child loop nests using the extended transformation set

This section discusses an algorithm that uses transformations from the extended transformation set to obtain data locality. Wolf [24] discusses the use of loop blocking (tiling) to obtain data locality, where loop blocking is considered as a separate transformation outside of the unimodular model. We present an algorithm that uses loop fission and fusion to obtain data locality. The legality conditions for loop fission and fusion are not dependent upon source code structure. Correspondingly, this algorithm is applicable to both perfect and imperfect-only-child loop nests.

Chapter 5 discussed the legality criteria for loop fission and fusion, but did not discuss the context in which the two transformations should be applied. Algorithm ext_locality shows when to apply loop fission and fusion in order to obtain data locality. Reuse was earlier defined as the use of the same datum on different iterations of a loop nest, and locality refers to the reused data remaining in the same memory hierarchy level (e.g., cache) between uses. The source code in Figure 6.10 shows that reuse does not guarantee locality, especially if the cache is small. That is, each element A[I2] is reused in each iteration of L1; however, A[I2] is local only if ub2 is smaller than the cache size. Data reuse is a property of the source code and is not dependent upon the loop structure or machine. In contrast, locality is dependent upon machine characteristics, such as cache size. Thus, increasing data locality results in a decreased number of memory accesses and improved use of the cache.

L1    for I1 = 1, ub1
L2      for I2 = 1, ub2
S1        A[I2] = ...

Figure 6.10. Sample code: showing reuse and locality.

Section 2.1 discussed the representation of array subscript expressions using array index functions and uniformly generated sets. A uniformly generated set is an equivalence class such that any two members of the set have the same array index function H and refer to the same array. Therefore, our objective is to determine the uniformly generated sets with maximum cardinality, or simply to form the largest possible groupings of similar references to the same array. Loop fission and fusion are used to separate and/or combine groupings of statements and provide the basis for the algorithm that improves locality. As an example, all assignment statements shown in Figure 6.11 reference the same array A.
However, the array index function of S1 is the same as that of S4. Similarly, the array index function of S2 is the same as that of S3. Thus, the greatest locality results from grouping S2 and S3 into one equivalence class, and S1 and S4 into another.

L1    for I1 = 1, ub1
L2      for I2 = 1, ub2
S1        A[3I1 + 2, I2, 3] = ...
S2        A[3I1 + 2, I2, I1 + 3] = ...
L3    for J1 = 1, ub1
L4      for J2 = 1, ub2
S3        A[3J1, J2 + 16, J1] = ...
S4        A[3J1, J2 + 16, 1] = ...

where the linear parts of the array index functions are H(S1) = H(S4) = [3 0; 0 1; 0 0] and H(S2) = H(S3) = [3 0; 0 1; 1 0].

Figure 6.11. Sample code: statement groupings for data locality {(S1, S4), (S2, S3)}.

Algorithm ext_locality, shown in Figure 6.12, obtains data locality using loop fission and fusion. The objective of the algorithm is to partition a set of statements (S1, ..., Sn) into the largest possible sets, such that the statements in the partitions have similar accesses to the same variable. The input to the algorithm is two adjacent Θ-blocks (loop nests). The algorithm uses loop fusion to combine similar accesses to the same variable and loop fission to separate accesses to different variables. Note that the statements in different Θ-blocks are combined only if the loop bound sets can be adjusted, and the statements are separated only if they are not in the same Π-block, regardless of the perfect or imperfect nesting structure of the source code.

The source code shown in Figure 6.13 becomes the source code shown in Figure 6.14 after application of Algorithm ext_locality. The partitions are initialized to the strongly connected components of the source code, {(S1), (S2,S6), (S3), (S4,S7), (S5)}; then, through pairwise comparison, the assignment statements are combined into larger sets. The transformation results in the following partition of the assignment statements: {(S1,S5), (S2,S3,S6), (S4,S7)}, because each set has similar references to the same variable, as indicated by the array index functions.

Algorithm ext_locality
Partitions the assignment statements in two adjacent Θ-blocks into groups with increased data locality.
Input:  B_Θ1  /**loop bound set for the first loop nest**/
        B_Θ2  /**loop bound set for the second loop nest**/
        D_Θ1  /**the distance vector set for the first loop nest**/
        D_Θ2  /**the distance vector set for the second loop nest**/
        S_Θ1  /**set of assignment statements in the first loop nest**/
        S_Θ2  /**set of assignment statements in the second loop nest**/
Output: Ψ1, ..., Ψm  /**partitions of S_Θ1 and S_Θ2**/
Procedure:
1. If can_adjust(B_Θ1, B_Θ2)
   (A) Initialize the partitions Ψ1, ..., Ψm to the strongly connected components of the assignment statements.
   (B) For each pair of assignment statements S_i and S_j:
       (i) If S_i and S_j are in separate SCCs
           (a) If S_i and S_j have the same array index function for the same variable, then combine the partitions containing S_i and S_j.

Figure 6.12. Algorithm to improve data locality using the extended transformations on imperfect-only-child loop nests.

L1    for I1 = 1, ub1
S1      G[I1] = ...
L2      for I2 = 1, ub2
S2        A[I1,I2] = B[I1 - 1,I2]
S3        var1 = B[I1 - 1,I2 - 1]
S4        C[I1,2I2] = D[I1,2I2]
L3    for J1 = 1, ub1
S5      ... = G[J1 - 2]
L4      for J2 = 1, ub2
S6        B[J1,J2] = A[J1 - 1,J2]
S7        D[J1 - 1,2J2] = C[J1 - 1,2J2]
/* *SCC = {(S1), (S2,S6), (S3), (S4,S7), (S5)}* */

Figure 6.13. Sample code: before application of data locality algorithm.
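Step 1(B) of Figure 6.12 is easy to prototype. The following Python sketch groups statements by (array name, linear part of the array index function), which is the uniformly-generated-set criterion used above; representing an index function as a tuple of coefficient rows is our simplification, and union-find stands in for the partition bookkeeping.

```python
def ext_locality_partitions(sccs, ref):
    """Merge SCC partitions whose statements reference the same array
    with the same array index function (Figure 6.12, Step 1(B))."""
    parent = {frozenset(s): frozenset(s) for s in sccs}
    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p
    block = {st: frozenset(s) for s in sccs for st in s}   # statement -> its SCC
    stmts = list(block)
    for i, si in enumerate(stmts):
        for sj in stmts[i + 1:]:
            if block[si] != block[sj] and ref[si] == ref[sj]:
                parent[find(block[si])] = find(block[sj])  # combine partitions
    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).update(p)
    return list(groups.values())

# Figure 6.11: ref maps a statement to (array, linear index matrix H).
H1, H2 = ((3, 0), (0, 1), (0, 0)), ((3, 0), (0, 1), (1, 0))
ref = {'S1': ('A', H1), 'S2': ('A', H2), 'S3': ('A', H2), 'S4': ('A', H1)}
print(ext_locality_partitions([{'S1'}, {'S2'}, {'S3'}, {'S4'}], ref))
# groups: {S1, S4} and {S2, S3}
```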
L1    for K1 = 1, ub1
S1      G[K1] = ...
S5      ... = G[K1 - 2]
L1    for K1 = 1, ub1
L2      for K2 = 1, ub2
S2        A[K1,K2] = B[K1 - 1,K2]
S3        var1 = B[K1 - 1,K2 - 1]
S6        B[K1,K2] = A[K1 - 1,K2]
L3    for K1 = 1, ub1
L4      for K2 = 1, ub2
S4        C[K1,2K2] = D[K1,2K2]
S7        D[K1 - 1,2K2] = C[K1 - 1,2K2]
/* *Final Partitioning = {(S1,S5), (S2,S3,S6), (S4,S7)}* */

Figure 6.14. Sample code: after application of data locality algorithm.

Comments on the coarse-grain parallelism and data locality algorithms.

Intuitively, Algorithm ext_coarse and Algorithm ext_locality could be merged into one algorithm that combines both parallelism and data locality goals. However, these two goals often conflict. That is, source code that is transformed to obtain maximum parallelism may have poor data locality, and vice versa. As an example, the source code shown in Figure 6.15(a) has two singly-nested loops. In the form shown in Figure 6.15(a), both L1 and L2 can be parallelized, since there is no data dependence that prevents parallelization. However, statements S1 and S2 have the same array index functions and could be combined if data locality were a priority. After the loops are combined, as shown in Figure 6.15(b), the data locality is improved, but the enclosing loop can no longer be parallelized, due to the data dependence between statements S1 and S2.

This simple example shows that parallelism goals and data locality goals may conflict, and blindly combining Algorithms ext_coarse and ext_locality may result in an algorithm that obtains neither parallelism nor data locality. A possible approach for a combined algorithm is to assign a weight to both the parallelism and data locality goals. When a potential conflict arises, the goal with the greater weight is met at the expense of the goal with the lesser weight. That is, if parallelism has a higher weight, then the example source code is transformed as shown in Figure 6.15(a). If data locality has a higher weight, then the source code is transformed as shown in Figure 6.15(b).

L1    for I1 = ...
S1      ARRAY[I1] = ...
L2    for J1 = ...
S2      ... = ARRAY[J1 - 1]

Figure 6.15(a). Sample code: parallel loops, but no data locality.

L1    for I1 = ...
S1      ARRAY[I1] = ...
S2      ... = ARRAY[I1 - 1]

Figure 6.15(b). Sample code: data locality, but no parallel loops.

6.6 Imperfect-only-child loop nests on vector architecture machines using the extended transformation set

Vectorization is the process of modifying source code for execution on a vector machine. The legality criteria for vectorization were given in Section 2.4. However, the legality criteria define what must be true for vectorization to occur, but do not describe what vectorization is. Vectorization is SIMD execution of assignment statements; that is, all iterations of an assignment statement S_i execute before any iterations of assignment statement S_j begin execution, where i < j. As an example, the source code in Figure 6.16(a) is shown before vectorization. The source code after vectorization is shown in Figure 6.16(b) and uses typical vector notation. The notation states that all m iterations of S1 execute simultaneously. After completion of S1, all m iterations of S2 execute simultaneously. The loop L1 is implied in the array notation and is eliminated from the source code shown in Figure 6.16(b).

L1    for I1 = 1, m
S1      A[I1] = B[I1]
S2      D[I1] = A[I1]

Figure 6.16(a). Sample code: before vectorization.

S1    A[1:m] = B[1:m]
S2    D[1:m] = A[1:m]

Figure 6.16(b). Sample code: after vectorization.
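The slice notation of Figure 6.16(b) has a direct analogue in array languages. As an illustration only (NumPy is our choice here, not part of the dissertation), the following sketch shows the same statement-at-a-time semantics:

```python
import numpy as np

m = 8
B = np.arange(m, dtype=float)
A = np.empty(m)
D = np.empty(m)

# Figure 6.16(b): every iteration of S1 completes before S2 starts.
A[0:m] = B[0:m]      # S1 as one vector operation
D[0:m] = A[0:m]      # S2 as one vector operation
```

This statement-at-a-time execution order is exactly why a loop-carried dependence cycle (an SCC) blocks vectorization: neither statement of a cycle can run all of its iterations before the other begins.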
Some objectives of a compiler for a vector machine are to:

o minimize the work done in each loop nest;
o move the legally vectorizable loop with the greatest iteration count to the innermost position;
o match the vector register length to the iteration count of the innermost loop;
o evaluate whether vectorization is worthwhile; and
o ensure that subscript expressions remain simple.

Each of these objectives is described in further detail below.

A compiler for a vector machine uses loop fission to minimize the work done in each loop nest. Vectorization of a loop L_i in a loop nest is applied universally to all statements (S1, ..., Sm) in the loop nest. If any statements in the nest are non-vectorizable, then none of the statements are vectorized. Loop fission is used to partition the statements into the smallest integral sets, namely Π-blocks. Separation into Π-blocks isolates the non-vectorizable partitions of assignment statements so that the remaining partitions can be vectorized.

One objective of a compiler for a vector machine is to execute as many iterations as possible as vector operations. We assume that vector operations are applied only to the innermost loop of a loop nest. Correspondingly, we want to permute the loop with the greatest iteration count, L_max, to the innermost position. As with previous algorithms, the legality condition of loop permutation does not allow the transformation to be applied across imperfectly nested loops. Loop L_max is found in the innermost region of the loop nest. As an example, the source code shown in Figure 6.18 has assignment statement S1 nested 1-deep and assignment statements S2 through S6 nested 3-deep. Loop L_max is either of loops L2 or L3, but cannot be loop L1, since loop L1 cannot be permuted innermost.

The cost of loading a vector register and executing vector operations may exceed the value of the resulting speedup in vector mode. As an example, if the vector register length is 256 elements, but the iteration count of loop L_max is 32, then vector operations may not be justified. That is, it may be more efficient to execute loop L_max in sequential mode. Therefore, a machine-dependent threshold is specified. If the iteration count of loop L_max is below the threshold value, then the assignment statements are executed sequentially. If the iteration count of loop L_max is above the threshold value, then the assignment statements are executed as vector operations.

Assuming that the iteration count of loop L_max exceeds the threshold value, it is strip mined according to some characteristic length of the target machine, such as the vector register length. As examples, the Convex C100 and C200 series vector machines have 128-element vector registers [29]. Strip mining ensures that the innermost loop is optimized so that the iteration count of the innermost loop and the vector register length exactly match.

Finally, subscript expressions need to remain simple for arrays in a loop nest that is vectorized. Typically, a gather-scatter directive consolidates a sparse array for vectorization, then redistributes it after the vector operations. Optimization manuals for vector machines imply that gather-scatter only works for simple subscript expressions, so we assume that all subscript expressions are linear functions of one index variable plus a constant. This assumption eliminates the possibility of using a wavefront transformation, because it typically results in complex, coupled subscript expressions.
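Strip mining itself is a purely mechanical rewrite of the innermost loop. The following Python sketch shows the iteration-space effect for an assumed trip count and block size; the loop shape mirrors the b_i-blocked loops of Figure 6.19, although the concrete bounds are our example.

```python
ub, b = 10000, 256            # trip count and vector register length

# Original innermost loop:  for I2 = 1, ub
work = []
for I2 in range(1, ub + 1):
    work.append(I2)

# Strip mined: an outer block loop of step b and an inner loop of at
# most b iterations, which is what gets issued as one vector operation.
mined = []
for I2p in range(1, ub + 1, b):               # blocked loop, step b
    for I2 in range(I2p, min(I2p + b - 1, ub) + 1):
        mined.append(I2)

assert mined == work          # same iterations, same order
```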
A newly developed algorithm for optimizing source code for a vector machine, named ext_vector, uses the extended and kernel sets of loop transformations and is given in Figure 6.17. Specifically, the algorithm uses loop normalization, fission, permutation, and blocking to obtain vectorization. As an example, Algorithm ext_vector is applied to the source code shown in Figure 6.18. The source code has an imperfect-only-child loop nest because assignment statement S1 is nested between loops L1 and L2. Also, loops L1 through L3 are initially normal, which greatly simplifies the mathematics of the following discussion without compromising the generality of Algorithm ext_vector.

Step 1. Normalize all loops. As mentioned, the source code shown in Figure 6.18 already has normal loops.

Step 2. Find the strongly connected components (SCCs) of the loop nest. SCCs of the dependence graph are sets of assignment statements that are in cycles, and individual assignment statements. Thus, the SCCs for the source code are {(S1), (S2,S3), (S4), (S5), (S6)}. The dependence from assignment statement S5 to S6 was initially loop-carried, but after loop fission is loop-independent. The distance vector (0,0,1) is eliminated from the distance vector set, and the new distance vector set is $D^{(1)}$ = {(0,2,1), (1,0,0)}.

Step 3(A). For each Π-block, find the permutable loop with the highest iteration count. There are two unique depths at which assignment statements are nested (q = 2), and so depth[1..2] = [1,3]. That is, assignment statement S1 is nested 1-deep and assignment statements S2 through S6 are nested 3-deep. Loop L_max is chosen from the innermost region, between loop L2 and loop L_depth[2] = L3. Loop L2 has an iteration count of 10000, and loop L3 has an iteration count of 32. Thus, loop L_max is assigned loop L2.

Algorithm ext_vector
Optimizes source code for execution on a vector architecture machine using the kernel and extended sets of loop transformations. Specifically, the objectives of the algorithm are to: minimize the work done in each loop (loop fission); move the vectorizable loop with the greatest iteration count to the innermost position (loop permutation); match the vector length to the iteration count of the innermost loop (loop blocking); evaluate if vectorization is worthwhile; and ensure that the subscript expressions remain simple.
Input:  {L1, ..., Ln}  /**an n-deep loop nest**/
        G(V, E)        /**a dependence graph for the source code, where V are the assignment statements and E are the dependences in the source code**/
        {S1, ..., Sm}  /**set of assignment statements in the loop nest**/
        b_i            /**blocking factor, such as vector register length**/
        depth[1..q]    /**ordered set of unique depths at which assignment statements are nested**/
Output: {Ψ1, ..., Ψq}  /**sets of assignment statements optimized for vector machines**/
Procedure:
1. /**Normalize all loops in the loop nest.**/
   For all i, 1 ≤ i ≤ n, normalize(L_i)
2. /**Find the strongly connected components (Π-blocks) for the loop nest.**/
   {Ψ1, ..., Ψq} = SCC(G)
3. /**For each Π-block:**/
   For all Ψ_j, 1 ≤ j ≤ q
   (A) /**Find the permutable loop with the highest iteration count, L_max.**/
       L_max = HIGHEST_IT_COUNT(Ψ_j) between L_depth[q-1] and L_depth[q]
   (B) /**If the iteration count of L_max is less than threshold, where threshold = b_i:**/
       threshold = b_i
       If L_max < threshold
         Then terminate.
   (C) Else
       (i)   /**Move L_max innermost.**/
             permute(L_max, L_n)
       (ii)  /**Block the innermost loop by the vector register length.**/
             block(L_n, b_i)
       (iii) /**Put the blocked loop in vector notation.**/
             vectorize(L_{n+1})

Figure 6.17. Algorithm to optimize source code for a vector machine using the extended transformations on imperfect-only-child loop nests.

L1    for I1 = 1, 3
S1      A[I1] = ...
L2      for I2 = 1, 10000
L3        for I3 = 1, 32
S2          B[I1,I2,I3] = C[I1 - 1,I2,I3]
S3          C[I1,I2,I3] = B[I1,I2 - 2,I3 - 1]
S4          E[I1 + 1,I2,I3 - 1] = ...
S5          F[I1,I2,I3] = ...
S6          F[I1,I2,I3 - 1] = ...
/* *D = {(0,2,1), (1,0,0), (0,0,1)}
     b_i = 256
     SCC = {(S1), (S2,S3), (S4), (S5), (S6)}* */

Figure 6.18. Sample code: before application of Algorithm ext_vector.

Step 3(B). If loop L_max has a low iteration count, then terminate; else, continue. The iteration count of L_max is 10000. The threshold is equal to the block size b_i, which has the value 256 in this example. Therefore, we continue.

Step 3(C(i)). Interchange loop L_max with the innermost loop. The second loop, L_max, and the third loop, L3, are interchanged. The transformation matrix representing the interchange is:

$$T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

and the transformed distance vector set is $D^{(2)}$ = {(0,1,2), (1,0,0)}.

Step 3(C(ii)). Block the innermost loop by b_i = 256. The innermost loop is blocked so that its iteration count matches the vector register length. The transformation is represented by the blocking matrix of Chapter 4 for the innermost loop with blocking factor b_i:

$$T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1/b_i \\ 0 & 0 & 1 \end{pmatrix}$$

where the 1/b_i row is evaluated with both the floor and the ceiling when applied to a distance vector. The transformed distance vector set is:

$$D^{(3)} = \{(0,1,\lfloor 2/b_i \rfloor,2),\ (0,1,\lceil 2/b_i \rceil,2),\ (1,0,\lfloor 0/b_i \rfloor,0),\ (1,0,\lceil 0/b_i \rceil,0)\} = \{(0,1,0,2),\ (0,1,1,2),\ (1,0,0,0)\}.$$

Step 3(C(iii)). Place the source code in vector notation. In the transformed distance vector set $D^{(3)}$, all of the loop-carried dependences are carried at a depth less than four. Thus, no dependence cycles exist for the innermost blocked loop, and the source code can be vectorized. Figure 6.19(a) shows the transformed source code before being placed in vector notation, and Figure 6.19(b) shows the same source code in vector notation.

L1    for I1 = 1, 3
S1      A[I1] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
L2          for I2 = I2', min(I2' + b_i - 1, 10000)
S2            B[I1,I2,I3] = C[I1 - 1,I2,I3]
S3            C[I1,I2,I3] = B[I1,I2 - 2,I3 - 1]
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
L2          for I2 = I2', min(I2' + b_i - 1, 10000)
S4            E[I1 + 1,I2,I3 - 1] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
L2          for I2 = I2', min(I2' + b_i - 1, 10000)
S5            F[I1,I2,I3] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
L2          for I2 = I2', min(I2' + b_i - 1, 10000)
S6            F[I1,I2,I3 - 1] = ...

Figure 6.19(a). Sample code: ready for vectorization.

S1    A[1:3] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
S2          B[I1, I2':I2' + b_i - 1, I3] = C[I1 - 1, I2':I2' + b_i - 1, I3]
S3          C[I1, I2':I2' + b_i - 1, I3] = B[I1, I2' - 2:I2' + b_i - 3, I3 - 1]
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
S4          E[I1 + 1, I2':I2' + b_i - 1, I3 - 1] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
S5          F[I1, I2':I2' + b_i - 1, I3] = ...
L1    for I1 = 1, 3
L3      for I3 = 1, 32
L2'       for I2' = 1, 10000, b_i
S6          F[I1, I2':I2' + b_i - 1, I3 - 1] = ...

Figure 6.19(b). Sample code: in vector notation.
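The floor/ceiling pair in Step 3(C(ii)) generalizes to any distance: a dependence distance d in the blocked dimension may or may not cross a block boundary, so both ⌊d/b⌋ and ⌈d/b⌉ must be kept. A small Python sketch of this expansion (our helper, following the blocking semantics used above):

```python
from math import floor, ceil

def block_innermost(D, b):
    """Insert the block-loop distance for the innermost dimension:
    each (..., d) becomes (..., floor(d/b), d) and (..., ceil(d/b), d)."""
    out = set()
    for *rest, d in D:
        out.add((*rest, floor(d / b), d))
        out.add((*rest, ceil(d / b), d))
    return sorted(out)

# Step 3(C(ii)) of the example: D(2) = {(0,1,2), (1,0,0)}, b_i = 256.
print(block_innermost([(0, 1, 2), (1, 0, 0)], 256))
# [(0, 1, 0, 2), (0, 1, 1, 2), (1, 0, 0, 0)]
```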
6.7 Imperfect-only-child loop nests on a shared-memory machine using the extended transformation set

Our model of a shared-memory machine is MIMD. If p is the number of processors, then each processor P_i, 1 ≤ i ≤ p, may operate on different instructions and a different data stream. As an example, the source code in Figure 6.20(a) is shown before execution on a MIMD shared-memory machine. Figure 6.20(b) shows the execution of the assignment statements S1 and S2 on processors P1 through Pm. (For simplicity, we assume that the number of processors and the number of iterations are the same, that is, p = m.) Note that there is no implicit or explicit barrier synchronization between execution of S1 and S2 on processors P_i and P_j, where i ≠ j. That is, assignment statement S2 may finish execution on processor P_i before processor P_j begins execution of statement S1.

L1    for I1 = 1, m
S1      A[I1] = B[I1]
S2      C[I1] = A[I1]

Figure 6.20(a). Sample code: before parallelization.

P1            P2            ...   Pm
A(1) = B(1)   A(2) = B(2)   ...   A(m) = B(m)
C(1) = A(1)   C(2) = A(2)   ...   C(m) = A(m)

Figure 6.20(b). Statement distribution on a SMS.

The objectives of a compiler for a shared-memory machine differ significantly from those of a vector architecture machine. Specifically, the objectives for a shared-memory machine are to:

o maximize the work done in each loop nest;
o move parallel loops outermost; and
o maximize the iteration count of the parallelized loops.

In vector machines, the objective is to minimize the work done in each loop, because the majority of the overhead on a vector machine is loading the vector registers. In contrast, significant overhead for a shared-memory machine is physically distributing the work to the available processors. Therefore, a primary objective of a compiler for a shared-memory system is to maximize the work done in each loop, which minimizes the work distribution overhead. Loop fusion is used to combine assignment statements from adjacent loop nests, which maximizes the work done in each loop nest.

Algorithm mod_kernel_fine (Figure 6.4) is a modified wavefront transformation that applies to imperfect-only-child loop nests. The algorithm for shared-memory machines, named ext_sms and shown in Figure 6.21, uses Algorithm mod_kernel_fine to obtain the maximal amount of parallelism.

One of the objectives for a shared-memory machine is to move the parallelizable loops outermost. As in previous algorithms, we cannot violate the legality condition for loop permutation. That is, if the outermost assignment statement S1 is nested j-deep, then the two loops with the highest iteration counts are found between loops L1 and Lj and moved outermost. The two loops with the highest iteration counts are named L_max1 and L_max2, respectively. In previous algorithms, only the single loop with the highest iteration count was found. In this algorithm, we find two loops, L_max1 and L_max2, because the final stage of Algorithm ext_sms is scheduling.

As mentioned earlier, work distribution on a shared-memory machine incurs significant overhead. The actual work distribution protocol is referred to as scheduling. There are two types of scheduling: self-scheduling, in which the work distribution is determined at run time, and pre-scheduling, in which the work distribution is determined at compile time. Loop collapsing aids scheduling by combining parallel loops that are distributed to processors. That is, loop collapsing decreases the amount of overhead associated with work distribution by simplifying the loop structure to which scheduling protocols are applied. Therefore, we collapse the two loops with the highest iteration counts (L2 and L3) in order to aid scheduling.

An algorithm that optimizes source code for a shared-memory machine, named ext_sms, uses the extended and kernel sets of loop transformations and is given in Figure 6.21. The algorithm uses loop normalization, fusion, skewing, permutation, and collapsing.
Algorithm ext_sms
Optimizes source code for execution on a shared-memory machine using the kernel and extended sets of loop transformations. Specifically, the objectives of the algorithm are to: maximize the work done in each loop (loop fusion); move the parallelizable loops with the greatest iteration count to the outermost position (loop permutation); and maximize the iteration count of the innermost loops (loop collapsing).
Input:  Θ1 = {M1, ..., Mn}  /**an n-deep loop nest**/
        Θ2 = {N1, ..., Nn}  /**another n-deep loop nest**/
        depth1[1..q1]       /**ordered set of unique depths at which assignment statements in Θ1 are nested**/
        depth2[1..q2]       /**ordered set of unique depths at which assignment statements in Θ2 are nested**/
Output: L                   /**the transformed loop nest optimized for a SMS**/
Procedure:
1. /**Normalize all loops in both loop nests.**/
   For all i, 1 ≤ i ≤ n
     normalize(M_i)
     normalize(N_i)
2. /**Fuse loop nests to maximize work.**/
   L = fuse(Θ1, Θ2)
   depth = depth1 ∪ depth2
3. /**Apply wavefront to obtain n - q degrees of parallelism.**/
   mod_kernel_fine(L)
4. /**Find the parallel loops with the highest iteration counts and move them to positions L2 and L3.**/
   L_max1 = max(L2, ..., L_depth[1])
   L_max2 = max({L2, ..., L_depth[1]} - {L_max1})
   permute(L2, L_max1)
   permute(L3, L_max2)
5. /**Collapse loops L2 and L3 to aid scheduling.**/
   collapse(L2, L3)

Figure 6.21. Algorithm to optimize source code for a shared-memory machine using the extended transformations on imperfect-only-child loop nests.

Algorithm ext_sms is applied to the source code shown in Figure 6.22. The first loop nest, Θ1, has four loops, M1 through M4, and assignment statements are nested at two unique depths. That is, q1 = 2 and depth1[1..2] = [3,4]. The second loop nest, Θ2, also has four loops, N1 through N4. All assignment statements have perfect nesting. Therefore, q2 = 1 and depth2[1] = [4]. For simplicity, all loops in both the first code block Θ1 and the second code block Θ2 are normal.

M1    for I1 = 1, 1000
M2      for I2 = 1, 32
M3        for I3 = 1, 500
S1          A[I1,I2,I3] = ...
M4          for I4 = 1, 50
S2            B[I1,I2,I3,I4] = ...
N1    for J1 = 1, 1000
N2      for J2 = 1, 32
N3        for J3 = 1, 500
N4          for J4 = 1, 50
S3            ... = B[J1,J2,J3,J4 - 2]

Figure 6.22. Sample code: before application of Algorithm ext_sms.

Step 1. Normalize all loops. As mentioned, the source code shown in Figure 6.22 already has normal loops.

Step 2. Fuse Θ1 and Θ2. After loop fusion, the assignment statements in loop nests Θ1 and Θ2 are combined into one loop nest, named L. The loop-independent dependence between statements S2 and S3 becomes loop-carried, and the dependence vector set is $D^{(2)}$ = {(0,0,0,2)}. Assignment statement S1 remains nested 3-deep in the fused source code.

Step 3. Apply the modified wavefront transformation. Algorithm mod_kernel_fine is applied to the loop nest to obtain maximal parallelism. The imperfectly nested assignment statement S1 causes the wavefront transformation to be applied to loops L1 through L3 and to loop L4. Loop L3 is skewed by loops L1 and L2, then loop L3 is moved outermost. The resulting dependence vector set is $D^{(3)}$ = {(0,0,0,2)}. According to the parallelism criteria (Expression (6.1)), loops L1 through L3 can be parallelized.

Step 4. Find the two permutable loops with the highest iteration counts and move them outermost. The iteration counts of loops L1, L2, and L3 are 1000, 32, and 500, respectively. Therefore, loop L1 has the highest iteration count (L_max1 = L1) and loop L3 has the second highest iteration count (L_max2 = L3).
Loops L_max1 and L_max2 are moved outermost using loop permutation. The distance vector set remains unchanged.

Step 5. Collapse the outermost loops. Finally, the outermost two loops are collapsed to aid scheduling. The resulting source code is shown in Figure 6.23. Note that the two outermost loops are parallel and the transformed distance vector set is $D^{(5)}$ = {((0 x 500) + 0, 0, 2)} = {(0,0,2)}.

L1,2  forall K1,3 = 1, 500000
L3      forall K2 = 1, 32
          if (K1,3 mod 500) == 0 then
            K1 = K1,3 / 500
          else
            K1 = ⌊K1,3 / 500⌋ + 1
          K3 = ((K1,3 - 1) mod 500) + 1
S1        A[K1,K2,K3] = ...
L4        for K4 = 1, 50
S2          B[K1,K2,K3,K4] = ...
S3          ... = B[K1,K2,K3,K4 - 2]

Figure 6.23. Sample code: after application of Algorithm ext_sms.
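The index recovery in Figure 6.23 is just the div/mod decomposition of the collapsed iteration space. The following Python sketch checks the same arithmetic, with the figure's if/else branch folded into one ceiling formula; the bounds are those of the example.

```python
def uncollapse(k13, n3=500):
    """Recover (K1, K3) from the collapsed index K1,3 of Figure 6.23."""
    k1 = (k13 + n3 - 1) // n3       # ceiling division; equals the if/else branch
    k3 = (k13 - 1) % n3 + 1
    return k1, k3

# Every (K1, K3) pair of the original 1000 x 500 space is reproduced
# exactly once as K1,3 runs over 1..500000.
pairs = [uncollapse(k) for k in range(1, 500001)]
assert pairs == [(k1, k3) for k1 in range(1, 1001) for k3 in range(1, 501)]
```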
CHAPTER 7

Implementation Model

This chapter gives the requirements analysis for the development of an environment that supports the theoretic model presented in previous chapters. The approach used for system analysis is the Object Modeling Technique (OMT) [30], which uses three diagramming notations to model orthogonal viewpoints of the system. The Object Model describes system structure and the relationships of objects to each other. The Dynamic Model describes the dynamic behavior of the system as it changes from state to state. Finally, the Functional Model captures data flow through the system. The three models describe complementary aspects of the system and can be used to guide the entire development process.

7.1 Overview

The implementation prototype is targeted for a graphics-based framework for applying matrix-based loop transformations to target source code. The loop transformations (loop permutation, loop reversal, loop skewing, loop normalization, loop blocking, strip mining, cycle shrinking, loop collapsing, loop coalescing, loop fission, and loop fusion) will be represented by matrices. Transformation matrices are determined based upon machine-independent goals, such as parallelism or data locality, or machine-dependent goals, such as optimization for vector or shared-memory systems. Finally, the transformations can be user-directed, where the user defines the transformation sequence and the matrix representing the sequence.

7.1.1 User Scenario

The following is a typical scenario of a session with the proposed framework.

1. The user selects a source code file to process.
2. The system checks the syntax of the source code.
3. The system loads the source code file into a window.
4. The user selects a portion of source code (e.g., loop nest, block, all) upon which to run dependence analysis.
5. The user selects which dependence test to use, such as:
   o Banerjee Test [15, 31];
   o GCD Test [15];
   o Lambda Test;
   o Power Test [32, 33];
   o Omega Test [34].
6. The system runs the dependence test on the target source code and determines the distance vector set D, which is displayed to the user in a window.
7. The system allows the user to specify either a machine-directed, goal-directed, or user-directed transformation matrix.
   (a) If machine-directed is chosen, then the system prompts the user for the machine type.
       o If vector machine is chosen, then the system generates a transformation sequence based upon Algorithm ext_vector for vector machines.
       o If shared-memory machine is chosen, then the system generates a transformation sequence based upon Algorithm ext_sms for shared-memory machines.
   (b) If goal-directed is chosen, then the system prompts the user for a goal.
       o If parallelism is chosen, then the system generates a transformation matrix T based upon Algorithm ext_coarse.
       o If data locality is chosen, then the system generates a transformation matrix T based upon Algorithm ext_locality.
   (c) If user-directed is chosen, then the user chooses a transformation from a menu, and a matrix-based representation of the transformation sequence is displayed in a window. If the transformation that the user chooses is illegal, then a corresponding error message is returned to the user.
       o If permutation is chosen, then the user is queried for two loops.
       o If reversal is chosen, then the user is asked which loop.
       o If skewing is chosen, then the user is asked to supply the outer loop, the inner loop, and the value of the skewing factor.
       o If normalization is chosen, then the user is asked which loop.
       o If fission is chosen, then the user is asked to partition the assignment statements into two Ψ-blocks by choosing individual or blocks of assignment statements.
       o If fusion is chosen, then the user chooses two adjacent Θ-blocks.
       o If blocking is chosen, then the user is queried for the loop and the blocking factor.
       o If collapsing is chosen, then the user is asked which two adjacent loops.
8. The transformation matrix T is displayed in a window, and the distance vector set D' is displayed in another window.
9. The user can request at any time to check the global postcondition by evaluating D'.
10. The user can request at any time to have the source code transformed based upon the current T.
11. If the user requests to transform the source code, then the system checks the global postcondition. If the global postcondition is true, then the code is updated. If it is not, then an error message is given, and the user is prompted to apply more transformations.
12. The updated code is displayed in a separate window.
13. The user may save the transformed source code and distance vector set to a file for future use.

Figure 7.1 shows a sample graphical user interface for the system. The source code transformation process proceeds from left to right using this GUI, with buttons for Find D, Precondition, Transformations, Update D & T, Postcondition, and Map T. Popup menus give the user choices of dependence tests (GCD, Omega, Power, Banerjee, Wolfe), transformation objectives (machine-directed, goal-directed, or user-directed; fine-grained, coarse-grained, or both), and the individual transformations (permutation, reversal, skewing, fission, fusion, blocking, coalescing).

Figure 7.1. Graphical user interface.

7.2 Requirements

Unimodular matrices are a method of representing a sequence of kernel transformations (loop permutation, reversal, and skewing) in matrix form. The dependences in the source code are represented by the dependence vector set D, and the transformation sequence is represented by a matrix T.

The distances in the initial distance vector set D are all lexicographically positive (global precondition). After transformation, the transformed distance vectors in the distance vector set must be lexicographically positive (global postcondition). Additional source code transformation goals, such as parallelism, are evaluated based upon the form of the transformed distance vector set.

The transformations that are represented include the kernel unimodular transformation set (loop permutation, reversal, and skewing), as well as the extended transformation set (loop normalization, blocking, collapsing, fission, and fusion).
A sequence of kernel and extended transformations is represented by the product of the matrices representing the individual transformations (composability). The source code is mapped only once, after the final transformation matrix T is determined.

The transformation matrix T is determined in one of three ways: 1) machine-dependent: the system generates T based upon machine characteristics; 2) machine-independent: the system generates T based upon machine-independent goals, such as parallelism and data locality; or 3) user-directed: T is generated based upon user directives.

The legality of the individual transformations is determined based upon the preconditions discussed in Chapters 3 through 5. If an individual transformation is illegal, then a popup window displays the legality criteria violation.

The system applies transformations to both perfect and imperfect loop nests. The transformation process continues, regardless of the source code structure, as long as the individual transformations are legal.

7.3 OMT Analysis

This section contains the analysis of the system using the Object Modeling Technique (OMT). The three submodels that comprise the model are the object model, the dynamic model, and the functional model. The diagrams that represent the respective models are the object diagram, the state diagram, and the data flow diagram.

7.3.1 Object Model

The object model describes the overall static structure of the system, where objects are discrete, distinguishable entities. The relationships between objects are represented in diagrams, where a triangle (Δ) means inheritance and a diamond (◇) means aggregation. Five object models are described below: 1) system; 2) dependences; 3) transformation matrices; 4) transformation sequences; and 5) source code.

The object model for the system is given in Figure 7.2. The parallel compiler consists of a dependence test utility, such as Tiny [19, 35]. Also, the compiler has a GUI, a transformation matrix T builder, and a source code mapper (for example, the algorithms developed by Wolf and Lam [4]). The transformation matrix builder consists of a global precondition checker, a global postcondition checker, and algorithms. The algorithms reflect operation of the system in either machine-independent, machine-dependent, or user-directed mode. Finally, the mapper has two components: one maps the distance vector set D, and another maps the source code.

The relationships between pages of the object model are represented by associations in Figure 7.2. Dependence tests generate the distance vector set D, and the distance vector set D is mapped by the distance vector set D mapper. The source code is mapped by the source code mapper. Finally, the global transformation matrix is generated by the algorithms in the system.

The object model representing dependences is shown in Figure 7.3. A distance vector set D consists of distance vectors, where each distance vector consists of distances. The distance vector set D can be represented by a dependence graph, where each node is an assignment statement and there is a one-to-one correspondence between arcs in the dependence graph and distance vectors.

Figure 7.4 shows the object model for the transformation matrices, and also the association between the transformation matrix and the transformation sequence (Figure 7.5). Each global matrix is a product, or sequence, of transformations. Each transformation is comprised of unimodular and elementary matrices. A unimodular matrix is a type of elementary matrix with special properties, which represents loop permutation, reversal, and skewing.

The object model for the transformation sequence is shown in Figure 7.5. A transformation sequence is comprised of an ordered sequence of transformations. Each transformation is either a kernel unimodular transformation (loop permutation, reversal, and skewing) or an extended transformation (loop normalization, blocking, collapsing, fission, and fusion).
A unimodular matrix is a type of ele- mentary matrix with special properties, which represents loop permutation, reversal, and skewing. The object model for the transformation sequence is shown in Figure 7.5. A transforma- tion sequence is comprised of an ordered sequence of transformations. Each transformation is either a kernel unimodular transformation (loop permutation, reversal, and skewing) or an extended transformation (loop normalization, blocking, collapsing, fission, and fusion). 109 is made up of Figure 7.2. Object diagram for overall system. 110 is comprised of Figure 7.3. Object diagram for source code dependences. is comprised of Figure 7.4. Object diagram for transformation matrices. 111 ence is made up of {ordered} ormat1o : \ is either Skewing No zation Tallapsing 'on ermu ation Reversal 31 I Fission Figure 7.5. Object diagram for transformation sequence. Figure 7.6 shows the object model for source code, which is the target to which the transformations are applied. Because of the recursive structure of the source code (i.e., a loop may be comprised of other loops), the object model may not be the best representation of the source code. However, the object model is used here for consistency with the OMT modeling approach. In Figure 7.6, source code is comprised of loop nests, where each loop nest consists of loops. Each loop has a header (i.e., for I,- = lb,-, ub2, 3,) which has an index variable, lower bound, upper bound, and step; assignment statements; and a tail (i.e., endfor). Assignment statements are comprised of use variables and def variables, which are typically arrays with subscript expressions. 9-blocks are defined in terms of loop nests and contain groupings of assignment statements. A ill-block is a specific type of 9-block with special dependence characteristics. Finally, a lI-block is a strongly connected component of the dependence graph, and ill-blocks are comprised of II-blocks. 112 is comprised of is comprised of is bound by matched with is a linear function of is a linear function of Figure 7.6. Object diagram for source code. 113 A more rigorous representation of source code is to use Backus-Near Form (BNF). The BNF of a target language fully describes its structure in a succinct manner. The BNF for a modified imperative language is shown in Figure 7 .7. 7.3.2 Dynamic Model The dynamic model describes the behavior of the system over time. State diagrams rep- resent the dynamic model, where nodes in the diagram are states and arcs are transitions (events) between states. In the state diagram, “ur” is shorthand for “user request”. Also, the parentheses following some transitions are conditions that must be true in order for the transition to trigger. The state diagram for the system is shown in Figure 7.8. The system is idle until the user requests that a file be loaded. If the syntax of the file is incorrect, then the system enters an error state until a file with good syntax is entered. Once a file with good syntax is loaded, the user can request a dependence test to calculate the initial distance vector set 11“”. There are three possible paths that the user chooses after the distance vector set is calculated. The machine-independent and machine-dependent paths generate the transfor- mation matrix T using the algorithms discussed in Chapter 6. The third alternative is for the user to generate her/ his own transformation sequence. There is extensive error checking while the user is choosing individual transformations. 
Each time a loop transformation choice is made, the legality of the transformation is evaluated. If the transformation is illegal, then the user is allowed to undo a transformation and choose another transformation. This process continues until a legal transformation is chosen. Once the user is satisfied with the transformation sequence, the system checks whether the global postcondition is true. If it is not true, then the user continues to choose transformations. If it is true, then the system proceeds.

All three paths converge after a transformation matrix T is determined. The user may request that the source code be mapped from initial to final form given the transformation matrix.

program         ::= stlist
stlist          ::= statement [';' statement]...
statement       ::= constdecl | integerdecl | realdecl | assignment | loop
constdecl       ::= 'const' constassignment [',' constassignment]...
constassignment ::= ID '=' expression
integerdecl     ::= 'integer' vardecllist
realdecl        ::= 'real' vardecllist
vardecllist     ::= vardecl [',' vardecl]...
vardecl         ::= ID
                  | ID '(' [expression ':'] expression [',' [expression ':'] expression]... ')'
assignment      ::= lhs '=' expression
lhs             ::= ID
                  | ID '(' expression [',' expression]... ')'
loop            ::= ['for' | 'doall'] ID '=' expression ['to' | ',' | ':'] expression
                    [['by' | ',' | ':'] expression] 'do' stlist 'endfor'
expression      ::= ID
                  | ID '(' expression [',' expression]... ')'
                  | INTCONST | FLOATCONST
                  | expression '+' expression | expression '-' expression
                  | expression '*' expression | expression '/' expression
                  | expression '**' expression
                  | '-' expression | '+' expression
                  | '(' expression ')'
                  | expression '<' expression | expression '<=' expression
                  | expression '<>' expression | expression '>' expression
                  | expression '>=' expression
                  | expression 'mod' expression
                  | expression 'max' expression | expression 'min' expression
                  | 'floor' '(' expression '/' expression ')'
                  | 'ceiling' '(' expression '/' expression ')'
                  | 'max' '(' expression [',' expression]... ')'
                  | 'min' '(' expression [',' expression]... ')'

Figure 7.7. Backus-Naur Form of target imperative language (modified Tiny syntax).

Figure 7.8. State diagram for system.

7.3.3 Functional Model

The functional model captures the flow of data through the system and is represented by a data flow diagram (DFD). DFDs are described in terms of levels, where each successive level shows greater detail of the data flow in the system. In a DFD, the ovals represent processes, that is, producers and consumers of data. The boxes represent actors, which are external entities that move data through the system. Two horizontal lines represent a data store, and the arcs represent data flow.

Figure 7.9 is the level 0 data flow diagram for the implementation, also known as a context diagram. The process Transformer does the actual source code transformations. As input, Transformer needs the initial distance vector set $D^{(0)}$ and the original source code.
7.3.3 Functional Model

The functional model captures the flow of data through the system, and is represented by a data flow diagram (DFD). DFDs are described in terms of levels, where each successive level shows greater detail of the data flow in the system. In a DFD, the ovals represent processes, that is, producers and consumers of data. The boxes represent actors, which are external entities that move data through the system. Two horizontal lines represent a data store, and the arcs represent data flow.

Figure 7.9 is the level 0 data flow diagram for the implementation, also known as a context diagram. The process Transformer does the actual source code transformations. As input, Transformer needs the initial distance vector set D(0) and the original source code. As output, Transformer produces a transformed distance vector set D(i) and transformed source code. We assume that an external process, such as Tiny [19, 35], is used to determine the distance vector set from the initial source code.

Figure 7.9. Level 0 data flow diagram.

The Transformer process is examined more closely in the level 1 data flow diagram shown in Figure 7.10. The initial distance vector set D(0) is used when determining the transformation matrix T. The source code is mapped from its initial to final form as long as both the global postcondition and the goals are met, as determined from D(i) = {T(i) × d : d ∈ D(0)}. The transformation matrix and initial source code are used to map the final source code. Both the mapped source code and the transformed distance vector set can be saved to external files.

Figure 7.10. Level 1 data flow diagram.

Figure 7.11 shows the level 2 data flow diagram for finding the transformation matrix T, and assumes that the user is choosing the transformation sequence. The initial distance vector set D(0) is input into the process. The user chooses an elementary transformation matrix t(x), and if t(x) is legal, then a new global transformation matrix T(x) is calculated. After T(x) is calculated, the distance vector set D(i) is updated. The user can choose more transformations, or map the source code.

Figure 7.11. Level 2 data flow diagram for finding the transformation matrix T.

The source code is mapped according to the approach described in Section 2.7, as shown in Figure 7.12. First, inequalities, minima, and maxima are determined from the initial source code. Then the index variables are transformed based upon the transformation matrix T(i). Finally, new loop bounds are calculated based upon the mapped index variables and the inequality set.

Figure 7.12. Level 2 data flow diagram for mapping source code.
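As an illustration of these mapping steps on a small case, the sketch below is illustrative only: rather than manipulating the inequality set symbolically as the tool would, it simply enumerates the iteration space of a 2-deep loop under a skewing matrix and reads off the transformed bounds.

    import numpy as np
    from itertools import product

    # Toy version of the mapping in Figure 7.12 for the 2-deep nest
    # "for I1 = 1, N; for I2 = 1, M" under the skewing matrix T below.
    # Enumerating the iteration space stands in for steps 1-3 (carrying
    # the inequality set {1 <= I1 <= N, 1 <= I2 <= M} symbolically);
    # the printed ranges are the new bounds of step 4.
    N, M = 4, 3
    T = np.array([[1, 0],
                  [1, 1]])          # J = T x I (skew inner loop by outer)

    images = [tuple(T @ np.array(I))
              for I in product(range(1, N + 1), range(1, M + 1))]
    J1_vals = sorted({j1 for (j1, _) in images})
    print("J1 ranges over", J1_vals[0], "..", J1_vals[-1])
    for j1 in J1_vals:
        inner = [j2 for (a, j2) in images if a == j1]
        print("  J1 =", j1, ": J2 =", min(inner), "..", max(inner))

For this skew, the sketch prints J2 = J1 + 1 .. J1 + M for each J1, which is exactly what eliminating I1 and I2 from the inequality set would yield.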
CHAPTER 8

Related Work

This chapter describes current research in ordering source code transformations. Several of the approaches are matrix-based, while others are based upon formal specification of the transformations. Some of the approaches suggest a static ordering of the transformations to obtain optimal transformed source code, while other approaches determine transformation ordering based upon source code structure. Finally, some representations include both scalar and loop transformations, while others contain only loop transformations.

8.1 Unimodular Matrices

Work by Banerjee [3] shows that unimodular matrices represent a sequence of transformations on a perfect, doubly-nested loop, where all dependences are represented by distance vectors. Banerjee also describes an algorithm that may obtain coarse-grain parallelism and an algorithm that always obtains fine-grain parallelism. Wolf and Lam [4] expand upon Banerjee's work and show the use of unimodular matrices to represent a sequence of transformations on an n-nested loop, where dependences are represented as either distance or direction vectors. Wolf and Lam also give a general approach to obtain coarse-grain parallelism and prove that n - 1 degrees of parallelism can always be obtained from an n-nested loop.

The basis of unimodular transformations is thoroughly described in Section 2.5, and is briefly summarized below. Sequences of kernel unimodular transformations (loop permutation, reversal, and skewing) are represented by a transformation matrix T, where T is:

1. n × n (square);
2. all T[i, j] are integers;
3. |det T| = 1.

The legality of a transformation sequence and the available parallelism are determined from the product of T and all of the distance vectors in D. Finally, an algorithm exists to map the source code from its initial to final form given the transformation matrix T.

Some restrictions of the traditional unimodular approach are that unimodular matrices represent only three kernel transformations and are applied only to perfect loop nests. These two restrictions imply that, although unimodular transformations are theoretically elegant, they have limited applicability.

Sass and Mutka [26, 28] have described two different methods for applying unimodular matrices to imperfect loop nests. The first method, called enabling, transforms some cases of imperfect loop nests into perfect loop nests. The second method, called the phase method, applies unimodular transformations directly to imperfect loop nests at the expense of increased barrier synchronization.

Sass and Mutka [28] describe a static ordering of transformations that enables unimodular transformations. Specifically, scalar forward substitution and loop distribution are used to transform source code into perfect loop nests. Next, unimodular transformations are used to obtain parallelism, as in the traditional approach. Finally, loop fusion and common subexpression elimination are used, as necessary, to restore the structure of the original source code. The only source code structures that cannot be transformed into perfect loop nests are source code with a transfer of control from within a loop nest and source code with an iterative command within an alternative command.

Alternatively, Sass and Mutka [26] describe a transformation, called the phase method, that transforms a 2-deep, imperfect loop nest into three adjacent, 2-deep, imperfect-only-child loop nests. The phase method has three steps: 1) calculate and lengthen the dependence vectors in the dependence vector set D; 2) invoke Banerjee's algorithm [3] to obtain fine-grain parallelism on a doubly-nested loop; and 3) map the source code into three phases with the innermost loop executing in parallel in all three phases.

8.2 Non-Singular Matrices

Ramanujam [5] and Li and Pingali [6] independently developed a similar extension of the unimodular approach using non-singular matrices. A non-singular matrix relaxes the third constraint of unimodular matrices (|det T| = 1), and allows the determinant of the transformation matrix T to be greater than or equal to one. Relaxation of this constraint allows more freedom in the choice of T, and, in fact, unimodular matrices are a subclass of non-singular matrices.
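Both the unimodularity conditions and the legality test are mechanical; a minimal sketch (function and variable names are ours) follows.

    import numpy as np

    def is_unimodular(T):
        # Conditions 1-3 above: square, integral, |det T| = 1.
        T = np.asarray(T)
        return (T.ndim == 2 and T.shape[0] == T.shape[1]
                and np.all(T == T.astype(int))
                and round(abs(np.linalg.det(T))) == 1)

    def lex_positive(v):
        # A vector is lexicographically positive if its first
        # nonzero component is positive.
        for x in v:
            if x != 0:
                return x > 0
        return False

    def legal(T, D):
        # A sequence is legal if every transformed distance vector
        # T x d remains lexicographically positive.
        return all(lex_positive(np.asarray(T) @ d) for d in D)

    D = [np.array([1, -1]), np.array([0, 2])]
    T = np.array([[1, 1],    # an example composed transformation
                  [1, 0]])
    print(is_unimodular(T), legal(T, D))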
The legality condition for non-singular matrices is the same as the global postcondition for unimodular matrices. That is, all transformed dependence vectors must be lexicographically positive. This condition is formally described as:

(∀ d : d ∈ D : T × d ≻ 0)

Also, optimal transformations for the non-singular approach are described in terms similar to optimal transformations for the unimodular approach. Specifically, fully permutable loop nests are related to dependence cones by Ramanujam [5]. Recall that fully permutable loop nests allow any ordering of the loops.

Using non-singular matrices, there is a method for mapping from the initial code to transformed code for both the assignment statements and the loop bounds. Although determining the assignment statements is relatively easy, determining the new loop bounds is more complicated for non-singular transformations than it is for unimodular transformations. Suppose that (I1, I2, ..., In) is the original iteration space, T is the transformation matrix, and (J1, J2, ..., Jn) is the transformed iteration space. The substitution in the assignment statements is determined as follows:

(I1, I2, ..., In) = T^(-1) (J1, J2, ..., Jn)

Then, the new values for I1, ..., In are substituted into the original assignment statements. Note that this is the case for both unimodular and non-singular matrices.

Transforming the loop bounds for a non-singular matrix is more complex than for a unimodular matrix. The reason for the complication is the value of the determinant of the transformation matrix T (det T ≥ 1). The transformed iteration space may be much more sparse than the original iteration space, and the step sizes of the transformed loops are typically greater than one.

The work presented by Li and Pingali, and by Ramanujam, relaxes the selection of the transformation matrix T. The relaxation enables T to be more easily selected, and enables the representation of one additional loop transformation technique, loop scaling. As discussed previously, the criteria for legality, optimality, and assignment statement mapping are equivalent for unimodular and non-singular matrices. However, the mapping of the loop bounds for the transformed code is more complex because of possible changes in the density of the iteration space. Thus, the use of non-singular matrices complicates the loop bound mapping strategy, while adding only limited expressive power (one additional transformation).
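The sparseness issue is easy to see in one dimension. In the sketch below (the loop bounds and the scaling matrix are chosen by us purely for illustration), only every other integer point of the transformed space is occupied, so the mapped loop requires a step of two.

    import numpy as np

    # Loop scaling: a non-singular (but not unimodular) transformation.
    T = np.array([[2]])                 # det T = 2 > 1
    original = range(1, 6)              # for I = 1, 5
    image = [(T @ np.array([i])).item() for i in original]
    print(image)                        # [2, 4, 6, 8, 10]: the image skips
                                        # every other integer point, so the
                                        # mapped loop is "for J = 2, 10, 2".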
8.3 Schedules

Kelly and Pugh introduce a representation of source code transformations, called schedules, that is an alternative to matrix-based representations [7]. Schedules are of the form:

T : [i1, ..., im] → [f1, ..., fn] | C

where T is the schedule; i1, ..., im are the original iteration variables; each fk is a function of the original iteration space; and C represents the domain restrictions. Note that the original iteration space may have a different loop nest depth (m) than the final loop nest (n). With unimodular matrices, a transformation matrix T is assigned to an entire block of source code. In contrast, with schedules, each individual assignment statement has its own schedule.

The legality of a schedule is based upon its variable part and constant part. The variable part is the portion of the schedule that is a function of the original index variables, and the constant part is what remains after the variable part is removed. A level refers to the nest depth of the transformed code. As an example, suppose that we are given the schedule:

[i, j, k] → [i + 2k + n + 1, 3j, 2, 2k + 3n]

Then the variable part of the schedule at level 1 is i + 2k, and the constant part is n + 1.

Legality is formally defined as:

∀ i, j, p, q, and all values of the symbolic constants: i →(p,q) j ⇒ T_p(i) ≺ T_q(j)

Informally, this definition states that if there is a dependence from statement p at iteration i to statement q at iteration j in the initial source code, then the transformed code must maintain the same execution order. A legal schedule T that satisfies the execution order constraint is calculated based upon its variable part and constant part. The variable part is algorithmically determined and the constant part is heuristically resolved.

Optimality of a schedule is determined based upon three criteria: simplicity, granularity, and locality [8]. Simplicity is a metric based upon the complexity of the schedule and the amount of parallel execution between statements. Granularity is a measure of the total number of iterations that are executed sequentially outside of the outermost parallel loop. Each loop size is assumed to be 100, since the value may not be known at compile time. Finally, locality is an estimate of the cache misses that may occur. Again, each loop is assumed to have 100 iterations, each cache line is assumed to contain 8 elements, and array layout is assumed to be row major.

Schedules are used to map the initial to transformed source code based upon intervals. The schedules for each assignment statement at each level are compared to determine whether or not the statement executes in a given interval. Based upon the comparisons, transformed source code is generated and simplified. Schedules are implemented in Omega/Tiny [7], which is a data dependence/source code transformation tool built on top of Tiny [19].

Schedules allow a large family of transformations to be represented. The ability to represent transformations in addition to loop skewing, reversal, and permutation leads to a more usable compiler. Also, the genericity of schedules implies that they are extensible to other loop transformations. However, large data structures are needed, since there exists a unique schedule for each source code statement. Also, the solution set for the variable and constant parts of the schedules is infinitely large; apparently, some clever heuristic is used in the implementation to prune the search space. As mentioned, an assumption made when determining optimality is that the loop size is always 100. In many cases this assumption is incorrect, and extraneous optimality results may be obtained. Finally, and most importantly, the mapping algorithm is exponential (O(3^n)) [36], because of the method used to determine intervals.
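The split into variable and constant parts is mechanical. The sketch below uses a representation we chose for illustration, with the symbolic constant n carried as just another coefficient, to decompose each level of the example schedule given earlier in this section.

    # Each level of the example schedule [i,j,k] -> [i+2k+n+1, 3j, 2, 2k+3n]
    # is an affine form, written here as a coefficient map over the index
    # variables i, j, k, the symbolic constant n, and the literal 1.
    schedule = [
        {'i': 1, 'k': 2, 'n': 1, '1': 1},   # level 1: i + 2k + n + 1
        {'j': 3},                           # level 2: 3j
        {'1': 2},                           # level 3: 2
        {'k': 2, 'n': 3},                   # level 4: 2k + 3n
    ]
    index_vars = {'i', 'j', 'k'}

    def split(level):
        # Variable part: terms over the original index variables.
        # Constant part: whatever remains (symbolic constants and literals).
        var = {v: c for v, c in level.items() if v in index_vars}
        const = {v: c for v, c in level.items() if v not in index_vars}
        return var, const

    print(split(schedule[0]))   # ({'i': 1, 'k': 2}, {'n': 1, '1': 1}),
                                # i.e., variable part i + 2k, constant n + 1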
8.4 Generalized Loop Transformations

Sarkar and Thekkath [9] present work that combines some of the benefits of a formal specification approach with those of a matrix representation of loop transformations. They present templates that formally describe the pre- and postconditions for the dependence vector set, the loop bounds, and the initialization statements for a kernel set of transformations applied to perfect loop nests. The kernel set of transformations is small, but extensible, and includes Unimodular, ReversePermute, Parallelize, Block, Coalesce, and Interleave.

Each technique is described in terms of a template that describes what must be true before the transformation can be applied, the input parameters that are required, and the effect of the transformation on the source code. The overall transformation of serial to parallel source code occurs in four steps:

1. Template instantiation
2. Sequence representation for loop transformations
3. Legality tests
   - Dependence vector test
   - Loop bounds test
4. Code generation for the transformed loop nest
   - Generate new loop bound expressions
   - Generate new initialization statements

Each of the four steps is briefly described below.

Template instantiation is performed by supplying the input parameters, such as the number of nested loops, for one of the transformation techniques. A sequence of transformations is represented by a global sequence T = (t1, t2, ..., tk), where each ti represents an individual transformation in the sequence. Note that the concept of a global sequence T has no real meaning in this context: in Sarkar and Thekkath's framework, the effect of T is determined by applying each of the individual ti's in the sequence. In contrast, in the unimodular approach the transformed code can be determined directly from T with no concern for the individual transformations.

The legality tests are divided into two parts: the dependence vector test and the loop bounds test. The dependence vector test ensures that all dependence vectors are lexicographically positive after transformation. The loop bounds test checks the bounds of the loop to determine whether the preconditions are met for the individual technique. Each transformation in the kernel set has a unique set of preconditions on the loop bounds.

The final step in the generalized transformation model is code generation. This step describes the effects of the transformations in terms of loop bounds (how the upper and lower bound of each loop in the loop nest is affected) and initialization statements (how the index variables are affected).

The transformation template for the dependence vector set for loop blocking is shown here for illustrative purposes. Suppose loops i through j of an n-deep nest are blocked. The template for the dependence vector set is:

D' = { (d1, ..., d(i-1), d'i, ..., d'j, d''i, ..., d''j, d(j+1), ..., dn) |
       (d1, ..., dn) ∈ D and (d'k, d''k) ∈ blockmap(dk) for all i ≤ k ≤ j }

where blockmap(dk) is defined as:

  {(0, 0)}                   if dk = 0
  {('*', '*')}               if dk = '*'
  {(0, dk), (dk, '*')}       if |dk| = 1
  {(0, dk), (dir(dk), '*')}  otherwise

and dir(dk) is defined as:

  dk    if dk is a direction value or if dk = 0
  '+'   if dk is a positive distance value
  '-'   if dk is a negative distance value

Suppose that we are given the following dependence vector set:

D = {(2, 0, +, -1), (3, -2, 1, -1), (+, -1, *, 2)}

Using the mapping just described for loop blocking, and given that the loop nest depth is four (n = 4), loops i = 2 through j = 3 are blocked, and bsize[1..2] is a vector representing the block sizes for loops i through j, the new dependence vector set is:

D' = {(2, 0, 0, 0, +, -1), (2, 0, +, 0, *, -1),
      (3, 0, 0, -2, 1, -1), (3, 0, 1, -2, *, -1),
      (3, -, 0, *, 1, -1), (3, -, 1, *, *, -1),
      (+, 0, *, -1, *, 2), (+, -1, *, *, *, 2)}
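The blocking template is mechanical enough to state as code. The following sketch (function and variable names are ours; direction values are represented as the strings '+', '-', and '*') reproduces the eight vectors of D' listed above.

    from itertools import product

    def dirval(d):
        # dir(dk): keep direction values and 0; reduce a distance to its sign.
        if d in ('+', '-', '*') or d == 0:
            return d
        return '+' if d > 0 else '-'

    def blockmap(d):
        # Candidate (block-level, element-level) pairs for one blocked loop.
        if d == 0:
            return [(0, 0)]
        if d == '*':
            return [('*', '*')]
        if isinstance(d, int) and abs(d) == 1:
            return [(0, d), (d, '*')]
        return [(0, d), (dirval(d), '*')]

    def block_dependences(D, i, j):
        # Expand each n-vector in D per the loop-blocking template:
        # components i..j (1-based) gain block-level parts followed
        # by element-level parts.
        Dp = []
        for d in D:
            pre, mid, post = d[:i - 1], d[i - 1:j], d[j:]
            for pairs in product(*(blockmap(c) for c in mid)):
                blocks = tuple(p[0] for p in pairs)
                elems = tuple(p[1] for p in pairs)
                Dp.append(pre + blocks + elems + post)
        return Dp

    D = [(2, 0, '+', -1), (3, -2, 1, -1), ('+', -1, '*', 2)]
    for v in block_dependences(D, 2, 3):
        print(v)   # prints the eight vectors of D' above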
This approach combines some elements of formal specification with a matrix-based representation of transformations. However, the transformations are not described in terms of a general framework: the individual transformations (the ti's) must each be applied to determine the global effect of the sequence (T). Consequently, overall legality of the sequence cannot be determined until all of the transformations have been applied. Therefore, if one of the early transformations in a sequence causes an illegal condition, then the compiler must "back up" through the transformation sequence, eliminate the illegal transformation, and then reapply the remainder of the sequence. In contrast, in the unimodular approach only the final transformation matrix T is used to translate the initial source code into final form; thus, if T causes an illegal condition, it is much easier to correct.

The applicability of a transformation also depends upon the accuracy of the dependence information. Potential parallelism is determined from properties of the transformed dependence vector set. Therefore, if the dependence vector set mapping rules produce overly restrictive or spurious dependence vectors, then potential parallelism may be lost. Finally, Sarkar and Thekkath's method requires perfect loop nests.

8.5 Precondition Approach

Whitfield and Soffa [37, 38, 39] propose a formalized method for applying parallel compiler techniques to translate source code. Their work describes a method to determine interdependences between compiler techniques, and also discusses methods for exploiting those interdependences in order to establish an optimal ordering for compilation. The ordering is based solely upon interactions between compiler methods and is completely independent of the properties of the underlying source code.

First, seven common parallel compilation methods are formally described in terms of pre- and postconditions. The seven code transformations are Dead Code Elimination, Constant Propagation, Invariant Code Motion, Loop Unrolling, Strip Mining, Loop Fusion, and Loop Interchange. Next, enabling and disabling conditions for each individual method are determined. The enabling and disabling conditions describe what must hold in order for a compiler method to be applied, and are used to determine any interactions between optimizations. Since each pre- and postcondition is formally described, reasoning about dependences between the individual transformations is also formalized. Finally, enabling and disabling graphs are developed between the seven transformations. The results from the enabling and disabling graphs are combined into a global directed graph that implies an overall transformation ordering.

More precisely, the enabling conditions are the preconditions (i.e., the conditions necessary for the optimization to be applied), and the disabling conditions are negations of the enabling conditions. The enabling and disabling conditions are compared in a pairwise fashion to determine enabling and disabling interactions between the transformations. Lack of interaction between code transformations is also proven using formal reasoning techniques.

The final step is constructing an overall dependency graph based upon the pairwise comparison mentioned above. This is a three-step process (a toy sketch of this consolidation is given below, after the suggested ordering):

1. Develop the enabling graph;
2. Develop the disabling graph;
3. Develop the overall dependency graph (optimization ordering).

Step 3 is a consolidation of steps 1 and 2, and any conflicts arising from steps 1 and 2 are reflected in the overall dependency graph. The ordering of the techniques is then obtained by reading the overall dependency graph. Whitfield and Soffa suggest the following order:

1. Constant Propagation
2. Dead Code Elimination (iterate)
3. Loop Interchange (iterate)
4. Invariant Code Motion (iterate)
5. Loop Interchange and Invariant Code Motion (iterate)
6. Loop Unrolling (iterate)
7. Loop Fusion (iterate)
8. Strip Mining
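As a toy rendering of the consolidation step, the sketch below encodes an enabling graph as adjacency sets and reads an ordering off a topological sort. The edges shown are invented placeholders for illustration, not Whitfield and Soffa's published interaction results.

    from graphlib import TopologicalSorter

    # Hypothetical enabling edges: an edge u -> v means "applying u can
    # enable v", so u should be ordered before v.
    enables = {
        'ConstantPropagation': {'DeadCodeElimination'},
        'DeadCodeElimination': {'LoopInterchange'},
        'LoopInterchange': {'InvariantCodeMotion'},
        'InvariantCodeMotion': {'LoopUnrolling'},
        'LoopUnrolling': {'LoopFusion'},
        'LoopFusion': {'StripMining'},
    }

    # graphlib expects predecessor sets, so invert the edge direction.
    preds = {}
    for u, vs in enables.items():
        for v in vs:
            preds.setdefault(v, set()).add(u)
    print(list(TopologicalSorter(preds).static_order()))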
Most of the available parallelism in source code is in the loops. Therefore, extending the transformation set to include more loop transformations may increase the applicability of this approach. There is also no metric in this approach to measure parallelism or data locality in the transformed source code; in most cases, parallelism or data locality is the goal of loop transformations, and having a metric shows whether the goal has been met. Another disadvantage of this approach is that the static ordering of the transformations is independent of the source code to which the transformations are applied. Different source code and different machines have different optimization goals, and attempting to obtain one ordering of the transformations that meets multiple goals seems optimistic.

8.6 Morphism Notation

Lu and Chen [40] present a mapping from original source code to parallelized code, rather than the transformations themselves. The mappings, or morphisms, are broken down into three subclasses: spatial, temporal, and reordering.

Temporal morphism is the mapping of the bounding loops of a nested loop. Temporal mapping determines which loops can be parallelized and which loops may be interchanged. It is important to note that the temporal morphism does not change the ordering of the statements within the loop, only the order and parallelism of the iterative structures (do's, for's) surrounding the statements.

Reordering maps the statements within the loops into some new ordering. As an example, a statement reordering morphism may indicate that the original statement ordering (S1, S2, ..., Sk, ..., Sn) is transformed into parallelized code with the statement order (Sk, S2, ..., Sn, ..., S1).

The final morphism is the spatial morphism. Spatial morphisms are hardware-dependent translations. In other words, they answer the question "How is the code domain mapped to the available architecture?" An example of a spatial morphism is strip mining: strip mining determines the distribution of data into "chunks" whose size depends upon the hardware vector size. Spatial morphisms were not the focus of Lu and Chen's work and are only briefly discussed below.

Assume that D is the domain of the initial source code, m is the transformation mapping, and P is the domain of the transformed (parallelized) source code. Then:

m(D) → P

Further, assume that the mapping function m has spatial (s), temporal (t), and reordering (r) components. Then:

m(D) = f(s(D), t(D), r(D)) → P

That is, the overall mapping is some function of the spatial, temporal, and statement reordering functions over the domain of the initial source code. The overall transformer that combines the spatial, temporal, and reordering components as described above is called the loop transformer. A mapping that combines only the temporal and reordering morphisms,

π(D) = g(t(D), r(D)) → P'

is called a loop scheduler. The loop transformer m is hardware dependent and the loop scheduler π is hardware independent. Lu and Chen's interest was hardware-independent translations, and so their focus was loop scheduling rather than overall loop transformation.
Suppose that the initial source code is of the form shown in Figure 8.1(a). A temporal mapping t indicates the order of the bounding loops. Any index name that is not included in t is parallelized (for → forall) and must be innermost after transformation. As examples, t(D) = (I3, I1, I2) results in the transformed loop shown in Figure 8.1(b), and t(D) = (I2, I3) is interpreted as shown in Figure 8.2(a).

    L1  for I1 = ...
    L2    for I2 = ...
    L3      for I3 = ...
            S1 ... Sn

Figure 8.1(a). Sample code: before morphism.

    L1  for I3 = ...
    L2    for I1 = ...
    L3      for I2 = ...
            S1 ... Sn

Figure 8.1(b). Sample code: after temporal morphism with t(D) = (I3, I1, I2).

The statement reordering function r indicates the order of the statements, by position, after transformation. As an example, r(D) = (2, 3, 1) indicates that the statement that was originally first is now second, the statement that was originally second is now third, and the statement that was originally third is now first. A complete loop schedule for the original example shown in Figure 8.1(a) is:

π(D) = (t(D), r(D)) = ((I2), (1, 3, 2))

and results in the loop transformation shown in Figure 8.2(b).

    L1  for I2 = ...
    L2    for I3 = ...
    L3      forall I1 = ...
            S1; S2; S3

Figure 8.2(a). Sample code: after temporal morphism with t(D) = (I2, I3).

    L1  for I2 = ...
    L2    forall I1 = ...
    L3      forall I3 = ...
            S1; S3; S2

Figure 8.2(b). Sample code: after temporal and reordering morphisms with π(D) = (t(D), r(D)) = ((I2), (1, 3, 2)).
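A loop schedule of this form can be applied mechanically. The sketch below (data layout and names are ours) applies π(D) = ((I2), (1, 3, 2)) to the nest of Figure 8.1(a) and prints the structure of Figure 8.2(b).

    # Apply a loop schedule pi(D) = (t(D), r(D)) to a toy 3-deep nest.
    loops = ['I1', 'I2', 'I3']          # original loop order (Figure 8.1(a))
    stmts = ['S1', 'S2', 'S3']          # original statement order

    t = ('I2',)                         # sequential loops, outermost first
    r = (1, 3, 2)                       # r[p-1] = new position of statement p

    # Loops not named in t become innermost forall (parallel) loops.
    schedule = [('for', v) for v in t] + \
               [('forall', v) for v in loops if v not in t]

    # Reorder statements: the statement at original position p moves to r[p-1].
    new_stmts = [None] * len(stmts)
    for p, s in enumerate(stmts):
        new_stmts[r[p] - 1] = s

    for depth, (kind, var) in enumerate(schedule):
        print('  ' * depth + f'{kind} {var} = ...')
    print('  ' * len(schedule) + '; '.join(new_stmts))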
An important constraint on transformations is that the resultant dependence vectors be lexicographically positive. This constraint means that any dependences that existed before transformation must still be respected after transformation.

A difficulty with this approach is that it only works on perfect loop nests. Any deviation from perfect nesting results in a loop structure that cannot be represented by the notation as described in the paper; some extension of the mappings that allows imperfect loop nests is needed. Also, how the morphisms are determined was only briefly discussed in the paper, and deriving them is a non-trivial task. That is, the representation of the mapping was described, but not the derivation of the mapping. Finally, the division into spatial, temporal, and statement reordering morphisms separates the hardware-independent portion of the mapping (loop scheduling) from the hardware-dependent portion of the mapping (the overall loop transformation).

CHAPTER 9

Conclusions and Future Investigations

This dissertation presented an extension to the kernel set of unimodular transformations, which comprises loop permutation, loop skewing, and loop reversal applied to perfect loop nests. Several advantages of the traditional unimodular approach are that it unifies seemingly independent transformations into a common framework; a sequence of transformations is composed into the product of the matrices representing the individual kernel transformations; the source code is mapped from its initial to final form only once, after a final transformation matrix T is found; and parallelism goals are evaluated based upon the product of the transformation matrix T and an abstraction of the source code in the form of a distance vector set D.

The unimodular transformations have been extended to include five other matrix-based transformations [12, 22, 23]. The extended transformations are: loop normalization, which modifies the lower bound and step size of a loop so that they are both one; loop blocking, which includes tiling, strip mining, and cycle shrinking, and increases the depth of a nested loop; loop collapsing, which includes loop coalescing and decreases the depth of a nested loop; loop fission, which divides a set of assignment statements within a loop nest into two adjacent loop nests; and loop fusion, which combines two sets of assignment statements from adjacent loop nests into a common loop nest. The legality of the extended set of transformations was discussed in terms of both perfect and imperfect loop nests. Explicit mapping rules for the distance vector set were described, as well as any modifications to the source code mapping.

Several algorithms were described that show the application of various combinations of transformations from both the kernel and extended transformation sets. Finally, the preliminary design for the implementation of a tool that uses the extended transformation set was described. The detailed requirements analysis can be used to guide the development of a parallel compiler that supports the extended, matrix-based model of loop transformations.

Potential areas of future investigation are to develop new algorithms using the extended, matrix-based framework, and to remove the constraints on the structure and syntax of the target source code language to which the extended framework is applied. New algorithms, including both the extended and kernel sets of loop transformations, can be designed to meet other machine-independent and machine-dependent goals. As an example, we did not discuss the application of the extended family of transformations to data distribution for a distributed-memory machine. Now that the unified model has been developed, there are numerous software optimization goals to which it can be applied.

Second, the theoretical model should be extended to include direction vectors as well as distance vectors. It is often the case that explicit distance information is unavailable because of complexities in the source code dependences. The ability to handle distances, directions, and combinations of the two could make the theoretical model applicable to a broader range of source code.

Finally, we assumed that the loop bounds and subscript expressions were linear functions of the index variables. This assumption is reasonable based upon statistical data derived from actual source code [25]. However, including more complex expressions would broaden the scope of the source code to which the unified model could be applied.

BIBLIOGRAPHY

[1] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. In AFIPS Conference Proceedings, 1967.

[2] Jau-Hsiung Huang. On Parallel Processing Systems: Amdahl's Law Generalized and Some Results on Optimal Design. IEEE Transactions on Software Engineering, 18(5):434-448, May 1992.

[3] Utpal Banerjee. Unimodular Transformations of Double Loops. In Proceedings of the 3rd Workshop on Languages, Compilers, and Parallel Computing, pages 192-219, August 1989.

[4] Michael Wolf and Monica Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452-471, October 1991.

[5] J. Ramanujam. Non-unimodular Transformations of Nested Loops. In Proceedings of Supercomputing '92, pages 214-223, November 1992.
[6] Wei Li and Keshav Pingali. A Singular Loop Transformation Framework Based on Non-singular Matrices. In 5th Workshop on Languages and Compilers for Parallel Computing, pages 249-260, August 1992.

[7] Wayne Kelly and William Pugh. A Framework for Unifying Reordering Transformations. Technical Report CS-TR-2995-1, University of Maryland, August 1993.

[8] Wayne Kelly and William Pugh. Determining Schedules Based on Performance Estimation. Technical Report UMIACS-TR-93-67/CS-TR-3108, University of Maryland, August 1993.

[9] Vivek Sarkar and Radhika Thekkath. A General Framework for Iteration-Reordering Loop Transformations. In Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 175-187, June 1992.

[10] David Chesney. A Formal Approach to Automatic Source Code Translation for Parallel Architectures. Technical Report CPS-91-15, Michigan State University, October 1991.

[11] David Chesney and Betty Cheng. Formal Specification of an Automatic Source Code Translator for Parallel Computer Architectures. In Minnowbrook Workshop on Software Engineering for Parallel Computing, August 1992.

[12] David R. Chesney and Betty H.C. Cheng. Generalizing the Unimodular Approach. In International Conference on Parallel and Distributed Systems, pages 398-403, December 1994.

[13] Randy Allen and Ken Kennedy. Automatic Translation of FORTRAN Programs to Vector Form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[14] Michael Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, 1989.

[15] Hans Zima and Barbara Chapman. Supercompilers for Parallel and Vector Computers. The ACM Press, 1990.

[16] James Davies, Christopher Huson, Thomas Macke, Bruce Leasure, and Michael Wolfe. The KAP/S-1: An Advanced Source-to-Source Vectorizer for the S-1 Mark IIa Supercomputer. In International Conference on Parallel Processing, pages 833-835, 1986.

[17] Ken Kennedy, Kathryn McKinley, and Chau-Wen Tseng. Analysis and Transformation in the ParaScope Editor. Technical Report CRPC-TR90106, Rice University, December 1990.

[18] David Padua and Michael Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.

[19] Michael Wolfe. The Tiny Loop Restructuring Research Tool. In International Conference on Parallel Processing, pages 1146-1153, 1991.

[20] Zhiyuan Li and Pen-Chung Yew. Interprocedural Analysis for Parallel Computing. In International Conference on Parallel Processing, pages 221-228, 1988.

[21] Constantine Polychronopoulos, Milind Girkar, Mohammed Haghighat, Chia Lee, Bruce Leung, and Dale Schouten. Languages and Compilers for Parallel Computing, chapter 21: The Structure of Parafrase-2: An Advanced Parallelizing Compiler for C and Fortran, pages 423-435. MIT Press, 1990.

[22] David Chesney and Betty Cheng. Application of the Unimodular Approach to Loop Fission and Loop Fusion. Presented at the Scalable High Performance Computing Conference, May 1994.

[23] David Chesney and Betty Cheng. Extending the Unimodular Approach to Other Transformation Techniques. Technical Report CPS-93-24, Michigan State University, April 1993.

[24] Michael Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stanford University, August 1992.

[25] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An Empirical Study of Fortran Programs for Parallelizing Compilers. IEEE Transactions on Parallel and Distributed Systems, 1(3):356-364, July 1990.

[26] Ron Sass and Matt Mutka.
Transformations on Doubly Nested Loops. In International Conference on Parallel Architectures and Compilation Techniques, 1994.

[27] Constantine Polychronopoulos. Loop Coalescing: A Compiler Transformation for Parallel Machines. In International Conference on Parallel Processing, pages 235-242, 1987.

[28] Ron Sass and Matt Mutka. Enabling Unimodular Transformations. In Supercomputing, pages 753-762, 1994.

[29] Convex Computer Corporation. Convex FORTRAN Optimization Guide, second edition, 1994.

[30] James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, and William Lorensen. Object-Oriented Modeling and Design. Prentice-Hall, Inc., 1991.

[31] Utpal Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic, 1988.

[32] Michael Wolfe and Chau-Wen Tseng. The Power Test for Data Dependence. Technical Report Rice COMP TR90-145, Rice University, December 1990.

[33] Michael Wolfe and Chau-Wen Tseng. The Power Test for Data Dependence. IEEE Transactions on Parallel and Distributed Systems, 3(5):591-601, September 1992.

[34] W. Pugh. The Omega Test: A Fast and Practical Algorithm for Dependence Analysis. In Proceedings of Supercomputing '91, pages 4-13, November 1991.

[35] Michael Wolfe. TINY: A Loop Restructuring Research Tool. Oregon Graduate Institute of Science and Technology, December 1990.

[36] Wayne Kelly. A Framework for Unifying Reordering Transformations. Personal communication, September 1993.

[37] Debbie Whitfield and Mary Lou Soffa. An Approach to Ordering Optimizing Transformations. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 137-146, March 1990.

[38] Deborah Whitfield and Mary Lou Soffa. Investigating Properties of Code Transformations. In International Conference on Parallel Processing, pages II-156-160, August 1992.

[39] Deborah Lynn Whitfield. A Unifying Approach for Optimizing Transformations. PhD thesis, University of Pittsburgh, 1991.

[40] Lee-Chung Lu and Marina Chen. New Loop Transformation Techniques for Massive Parallelism. Technical Report YALEU/DCS/TR-833, Yale University, October 1990.