.75... L). I I!!! Illlu3a...‘ I. I: J .1 1.73;)..T. 4.1...(11 )3 :11. .I1. 7.. i . : . 1.5.191. llllllllllllllllllllllllllllllllllllllllllllllllllll 31293 00903 2032 This is to certify that the thesis entitled The Parallelization of Vectorizable Programs presented by Paul Dennis Hovland has been accepted towards fulfillment of the requirements for Master of Science degree in Computer Science \jwwwg H . Ml Major professor Date 33mm? runs 0-7639 MS U is an Affirmative Action/Equal Opportunity Institution LIBRARY Michigan State University PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. DATE DUE DATE DUE DATE DUE MSU Is An Affirmative Action/Equal Opportunity Institution c:\circ\ddedue.pm3—p.1 The Parallelization of Vectorizable Programs By Paul Dennis Hovland A Thesis Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Master of Science Department of Computer Science December 2, 1992 ABSTRACT The Parallelization of Vectorizable Programs By Paul Dennis Hovland Many of today’s scientific applications are so computationally demanding that they require the use of the most powerful supercomputers available. In the past, this class of applications was solved on vector computers. Today, parallel computers, es- pecially distributed—memory machines, offer better performance. This introduces the question of how to parallelize codes originally targeted for vector architectures. For distributed-memory parallel architectures, the parallelization of programs that vec- torize well requires that we find a good way to distribute data among the processors, so that we may exploit global parallelism. A formal technique utilizing augmented data access descriptors (ADADS) to deter- mine this distribution is presented. This technique differs from previous approaches in that it views the problem of finding a good distribution as an extension of data dependence analysis. The application of this technique to the task of parallelizing the highly vectorizable programs generated by ADIFOR, an automatic differentiation tool, demonstrates its utility. ACKNOWLEDGEMENTS I would like to-thank my thesis advisor, Dr. Lionel Ni, and my entire thesis committee for their help and insight as I put this thesis together. Thanks as well to Chris Bischof and Andreas Griewank, for introducing me to the world of automatic differentiation and ADIFOR. Finally, thanks to my friends and family for the love and support they have provided over the years. iv TABLE OF CONTENTS LIST OF FIGURES 1 Introduction 1.1 Motivation ................................. 1.2 ADIFOR .................................. 1.3 Problem Statement ............................ 1.4 Organization ............................... 2 Foundations 2.1 Data Alignment .............................. 2.2 Data Dependence Analysis ........................ 2.3 Data Access Descriptors ......................... 2.4 Properties of Vectorizable Code ..................... 3 Augmented Data Access Descriptors 3.1 Augmented Data Access Descriptors .................. 3.2 Finding a Good Alignment Using ADADs ............... 3.2.1 Program Model .......................... 3.2.2 Computing Augmented Data Access Descriptors ........ 3.2.3 Combining Augmented Data Access Descriptors ........ 3.2.4 Using ADADS to Develop Alignment Statements ........ 3.3 An Example ................................ 3.4 Interprocedural Analysis ......................... 4 The Effects of Loop Transformations on ADADs 4.1 Promotion ................................. 4.2 Loop Unrolling .............................. 4.3 Loop Fusion ................................ 4.4 Loop Interchange ............................. 4.5 Strip Mining ................................ 4.6 Invariant Code Movement ........................ vii (OQOi-ht—ti-l 11 11 15 17 22 23 23 26 26 27 29 32 36 43 49 50 52 54 56 57 58 4.7 An Example ................................ 60 5 Performance Analysis 65 5.1 Parallelizing ADIFOR-generated code .................. 65 5.2 Applying ADADs to Vectorizable Code ................. 69 5.3 Empirical Results ............................. 74 6 Conclusions 77 6.1 Summary ................................. 77 6.2 Future Studies ............................... 79 BIBLIOGRAPHY 81 A Sample Programs 83 B ADIFOR-generated Code 88 vi LIST OF FIGURES 5.1 Performance of various parallelization methods for program 1 ..... 76 5.2 Performance of inside-out method for program 2 ............ 76 vii CHAPTER 1 Introduction Since the advent of powerful vector supercomputers such as the Cray and Convex, scientists have expended a great deal of energy creating code that vectorizes well. Furthermore, computer scientists have developed tools that can take ordinary code, perform some analysis, and produce new code that vectorizes well [3]. Such code is characterized by many small loops that traverse an (one or multi—dimensional) array in a single direction. 1 . 1 Motivation Today, another type of supercomputer has entered the picture—parallel computers. These machines are characterized by several, and often hundreds of, processors ca- pable of operating more or less autonomously. It is highly desirable that the code that executed well on the vector supercomputers of the past would execute as well on the supercomputers of today with little or no modification. Since each iteration of the small loops mentioned earlier may be executed in parallel (this is, in effect, what happens in a vector processor), one would expect vector code to parallelize easily, and with good performance. However, this is not always the case. The reason for this is in large part due to the hardware used to execute parallel programs. There is a start-up latency associated with executing several instructions in parallel. In addition, the data to be used for these parallel operations must be distributed to the processor where the computation will take place, the result of which must be stored in its destination. On a shared memory parallel supercomputer, a variable is retrieved from either the local cache or main memory. In either case, the access time is essentially inde- pendent of which processor is performing the computation. However, in a distributed memory parallel supercomputer, each processor has associated with it its own local memory. Accessing variables stored in local memory requires significantly less time than accessing variables located in the memory associated with some other processor, as the data must be communicated between the processors via some sort of network. Because it is more scalable, the predominant paradigm in massively parallel systems is distributed memory. So, if we wish to achieve good performance on a modern supercomputer, we must try to minimize the communication of data from one pro- cessor to another. As an example of a program that is well—suited for execution on a vector supercomputer, but exhibits poor performance on a parallel supercomputer, especially one with distributed memory, consider the following code segment: DOALL I=2, 100 Ml) = B(I) + B(I-1) ENDDD This loop corresponds to the vector operation: A(2:100) = B(2:100) + B(1:99) and could be executed at very low cost on a vector computer. This program is said to exhibit fine granularity; i.e., the amount of work performed within the parallel loop is very small. This situation presents a problem for all parallel supercomputers, because these machines must resynchronize after each parallel loop. If only a few synchronizations need to be done, this cost is negligible, but if it is done frequently, as in fine-grained programs, the additional cost can be substantial, and performance poor. In addition to synchronization costs, on a distributed memory supercomputer, execution of this code would require a substantial amount of communication. If A(k) and B(k) are stored on processor k, k = 1, - - - , 100, then in order for us to perform this computation in parallel, we must first communicate the value of B(k-1) from processor k — 1 to processor k. As a consequence of the high costs associated with entering and leaving a parallel loop, in order for code to execute efficiently on a parallel machine, it must have a high ratio of work performed inside parallel loops to number of parallel loops, or coarse granularity. Also, on a distributed memory machine, communication of data from one processor to another must be kept to a minimum. One approach to attaining the goal of coarse granularity is through the use of “global parallelism.” In this paradigm, the entire program executes as parallel tasks on the several processors. If data computed by one processor is required by one or several other processors, it must be communicated to this (these) processor(s). It is extremely important that we minimize the amount of communication that is performed. This problem reduces to deciding where data should be stored (that is, where it is computed) among the various processors. We address this problem, proposing the augmented data access descriptor as a means to determine the best distribution for arrays on a distributed memory parallel computer. Li and Chen have proven that the general problem of data distribution on distributed memory machines is NP-complete [8]. However, the method presented provides a good heuristic, in that it is often able to identify a communication free distribution, if one exists. As presented, the algorithm applies to all programs. How- ever, case studies will focus on vectorizable code, in particular, the code generated by ADIFOR, a Fortran source-to-source translator that produces code for derivative computation [2]. 1.2 ADIFOR In addition to existing programs intended for execution on vector computers, some code that is generated automatically may lend itself easily to vectorization, but not necessarily to parallelization. The code generated by ADIFOR, a tool for the au- tomatic differentiation of functions, falls in this category. In order to facilitate our discussion of the parallelization of these programs, a brief explanation of automatic differentiation and ADIFOR is in order. Traditionally, there have been three approaches to obtaining the derivatives of functions in a computer program. One approach is to manually code the derivative of the function. This task can be exceedingly complicated and tedious, and is prone to errors, particularly if the function is very complex. Alternatively, one can use finite differences. This approach is the simplest, but it is also the least accurate and can produce large errors if a poor step size is selected. A third option is to use a symbolic manipulator, such as Maple [4]. This is an excellent choice for many functions, but has some drawbacks of it’s own. First, most manipulators are unable to handle extremely complicated functions. Second, some redundant computations will be performed if common subexpressions are not eliminated. An alternative to all of these techniques is automatic differentiation. Automatic differentiation relies upon the fact that every function, no matter how complicated, is simply the composition of a small set of elementary functions, such as multiplication, sine, and square root. The mathematical notion of a composition corresponds directly to a program statement that performs some mathematical operation on the results of previous computations. Thus, the code segment X2 = X*X F = SIN(X2) corresponds mathematically to the composition of sine and multiplication, i.e. = sin :32 . As a conse uence of this correlation, we may apply the chain rule 9 >szg...)) (% (film) repeatedly to the composition of the elementary operations, allowing us to calculate g (9(t))|,:,0 = (gig-as the derivatives of a function computed by an elaborate program accurately and in a completely mechanical fashion. This procedure is referred to as automatic differenti- ation [6]. ADIFOR transforms Fortran 77 programs using this approach. As an example, suppose we have a Fortran subroutine that computes the value of f = 1'1?=1 58(2) The code sequence might be written as: subroutine prod5(x,f) real x(5), f f = x(1) * x(2) * x(3) * x(4) * x(5) return end Then, if we wish to compute the derivatives of f with respect to 93,-,i = 1,-~,5 and store the results in array g$f(), we might apply our knowledge of calculus and produce: f = x(1) * x(2) * x(3) * x(4) * x(5) g$f(1) = X(2) * XC3) * XC4) * X(5) g$f(2) = x(1) * xCB) * XC4) * x(5) g$f(3) = X(1) * X(2) * X(4) * X(5) g$f(4) = X(1) * XCZ) * X(3) * XCS) g$f(5) = X(1) * XC2) * X(3) * X(4) where we use g$f(i) to designate 63%,. If instead we apply ADIFOR to this subrou- tine, ADIFOR produces: r$1 = x(1) * x(2) r$2 = r$1 * x(3) r$3 = r$2 * x(4) r$4 = x(5) * x(4) r$5 = r$4 * x(3) r$1bar = r$5 * x(2) r$2bar = r$5 * x(1) r$3bar = r$4 * r$1 r$4bar = x(5) * r$2 do g$i$ = 1, g$p$ g$f(g$i$) = r$1bar*g$x(g$i$,1) + r$2bar*g$x(g$i$,2) + r$3bar*g$x(g$i$,3) + r$4bar*g$x(g$i$,4) + r$3*g$x(g$i$, 5) end do f = r$3 * x(5) Variables introduced by ADIF OR contain the $ sign, and continuation line charac- ters have been deleted to improve readability. The array variable g$x corresponds mathematically to g—:. From calculus, we know that 33, 0 fii¢j 1ifi=j Thus, g$x should be initialized to the 5 X 5 identity matrix. The variable g$p$ corresponds to the number of columns in this array and should be set equal to 5. If this initialization is performed, the code above is equivalent to: r$1 = x(1) * x(2) r$2 = r$1 * x(3) = x(1) * x(2) * x(3) r$3 = r$2 * x(4) = x(1) * x(2) * x(3) * x(4) r$4 = x(5) * x(4) r$5 = r$4 * x(3) = x(5) * x(4) * x(3) r$1bar = r$5 * x(2) = x(2) * x(3) * x(4) * x(5) r$2bar = r$5 * x(1) = x(1) * x(3) * x(4) * x(5) r$3bar = r$4 * r$1 = x(1) * x(2) * x(4) * x(5) r$4bar = x(5) * r$2 = x(1) * x(2) * x(3) * x(5) g$f(1) = r$1bar = x(2) * x(3) * x(4) * x(5) g$f(2) = r$2bar = x(1) * x(3) * x(4) * x(5) g$f(3) = r$3bar = x(1) * x(2) * x(4) * x(5) g$f(4) = r$4bar = x(1) * x(2) * x(3) * x(5) g$f(5) = r$3 = x(1) * x(2) * x(3) * x(4) f = r$3 * x(5) Comparing this code with the program derived using calculus reveals that the vector g$f does in fact contain 8—5—9. In addition, the ADIFOR-generated program computes the product in a binary-tree like fashion, allowing intermediate results to be utilized in the computation of the gradient. As a consequence, no redundant subexpressions are computed, and the ADIFOR—generated code may require fewer floating point operations than the code that we generated by hand. What is most important about ADIFOR-generated code for our purposes is that it contains many loops of length g$p$l If g$p$ (which, memory permitting, is equal to the number of elements in the independent variables) is large, this code vectorizes very well. In addition, the vectors under consideration are always the same—the first index of the gradient arrays is the only one that varies. Since arrays are always accessed along the same dimension, it would appear that distributing a different row (or set of rows, if there are more rows than processors) to each processor would minimize the amount of communication necessary. This conclusion is correct, but it does not take into consideration the need to communi— cate the scalar values computed in between the vector loops. We can eliminate this problem by using scalar promotion so that scalar computations are duplicated on all processors, but this approach increases the total amount of work being performed. Thus, we cannot achieve linear speedup, that is, execution on N processors requires 1For a. more complete discussion of the code and how it works, see [2]. % execution time on a single processor, our ultimate goal. Alternatively, we can try to take advantage of any parallelism inherent in the original code, which ADIFOR is guaranteed to preserve. 1 .3 Problem Statement In Section 5.1, some mathematical analysis using estimates for various properties of ADIFOR-generated code is performed. This analysis provides some very general rules for parallelizing this code. These rules also apply to other vectorizable code. However, the guidelines established are only useful if the programmer is able to find the best distribution for the original program. If the original program is exceedingly complex, this may be a formidable task if the best distribution must be determined by the programmer. Instead, we would like to determine the best distribution of data in an automatic fashion. Toward this aim, a formal technique utilizing aug- mented data access descriptors to compute a close approximation to this distribution is presented. Empirical results from execution on a Butterfly TC2000 reveal that conclusions reached by the proposed method are reasonable. 1 .4 Organization The organization of this thesis is as follows. Chapter 2 provides a brief explanation of four subjects that are the underpinnings of augmented data access descriptors- data alignment, data dependence analysis, Data Access Descriptors, and the special properties of vectorizable code. The following chapter presents the augmented data 10 access descriptor, specifying its format, computation, meaning, and application to modular programs. Chapter 4 describes the effects of various loop transformations— promotion, loop unrolling, loop fusion, strip mining, and invariant code movement— on augmented data access descriptors. In addition, an example of how goal-directed loop transformations can be used to eliminate the need for data communication in ‘ a program is presented. The final chapter examines the parallelization of ADIFOR- generated code, and provides empirical results to support the conclusions reached. The nature of augmented data access descriptors and how they should be applied to vectorizable code is reviewed, and topics requiring further study are presented. CHAPTER 2 Foundations This chapter serves as a brief introduction to data alignment, data dependence anal- ysis, Data Access Descriptors, and the special nature of vectorizable code. This back- ground information is necessary for a complete understanding of augmented data access descriptors, and why they are useful. 2.1 Data Alignment In order to aid parallelizing compilers in the task of code generation for distributed memory computers (or any computer with non—uniform memory access time), several languages have been developed that enable the programmer to include information concerning the way in which data should be distributed among the various processors [5, 12]. If the proper alignment of data is chosen by the programmer, then commu- nication costs may be minimized. Furthermore, the compiler may exploit multicast communication and other specialized communication techniques to further reduce the cost of communicating information [10]. For example, suppose we wish to multiply 11 12 an N x N matrix A by an N element vector B to produce an N element product vector, C. An ordinary Fortran 77 program to accomplish this task might look like: DOUBLE PRECISION A(N,N), B(N), C(N) INTEGER OUTER, INNER DO OUTER=1, N C(OUTER) = 0 DO INNER=1, N C(OUTER) = C(OUTER) + B(INNER)*A(OUTER,INNER) ENDDO ENDDO The Fortran D version [5] of this program would be something like: DOUBLE PRECISION ACN,N), B(N), C(N) INTEGER OUTER, INNER c The next 5 lines are Fortran D specifications regarding c the alignment of arrays A,B, and C. DECOMPOSITION X(N) ALIGN A(I,J) with X(I) ALIGN B(I) with x(*) ALIGN C(I) with X(I) DISTRIBUTE X(CYCLIC) PARALLEL DO OUTER=1, N C(OUTER) = 0 DO INNER=1, N C(OUTER) = C(OUTER) + B(INNER)*A(OUTER,INNER) ENDDO ENDDO The DECOMPOSITION statement indicates that an N element vector X is to be used as our frame of reference, or template. Each element of C aligns exactly with an element of X. In addition, one row of A and the entire vector B is aligned with each 13 element of X. The DISTRIBUTE statement simply indicates how the data aligned with each element of X should be distributed if there are not enough processors} If we assume that the number of processors is greater than N, then each processor contains a row of A, the vector B, and an element of C. Thus, each processor may compute an element of C without needing to communicate with any of the other processors. Unfortunately, determining the best distribution for a complicated program can be exceedingly difficult. Also, a small mistake in the alignment statements can be extremely costly. Suppose, for example, that we had produced the Fortran D program: DOUBLE PRECISION A(N,N), B(N), C(N) INTEGER OUTER, INNER DECOMPOSITION X(N) ALIGN A(I,J) with X(J) ALIGN B(I) with x(*) ALIGN C(I) with X(I) DISTRIBUTE X(CYCLIC) PARALLEL DO OUTER=1, N C(OUTER) = 0 DO INNER=1, N C(OUTER) = C(OUTER) + B(INNER)*A(OUTER,INNER) ENDDO ENDDO The only difference between this program and the previous example is the ALIGN statement for A. However, the change in performance would be dramatic. In order for a given processor to compute its assigned element of C, C,, it must first get the values of A,,,-(,-¢j) from the other processors. This communication could be very costly. 1For a more complete explanation of the syntax of this and other Fortran D instructions, see [5]. 14 In order to prevent such mistakes from occurring, and to assist in the automatic generation of data alignment information, several techniques have been developed that determine a good (but not necessarily the best) data alignment in a mechanical fashion. Li and Chen model the data alignment problem as a graph problem, and present a heuristic algorithm for solving the problem [8]. The heuristic approach produces performance results that are comparable to the optimal alignment (found using brute force), and significantly better than the performance of an alignment determined using a greedy algorithm. Gupta and Banerjee use a different approach, attacking the problem from the perspective of the whole program, and solving for an optimal combination of parallelism and communication costs, subject to certain constraints [7]. Both approaches have their merit and may be better suited for many applications than the technique described in Chapter 3. However, the approach we examine has the advantage that it is very simple (and thus suitable for use in a compiler, which must process a program in a reasonable amount of time) and can treat statements, regions of programs, and entire programs in an identical manner. This flexibility is an important feature, because it enables the separate examination of sections of a program. In many cases, one alignment is appropriate for one stage of a computation, while a different alignment is appropriate for a later stage. Using both alignments can result in better performance than just using one alignment or the other for the entire program [11]. 15 2.2 Data Dependence Analysis Instead of treating automatic data alignment as a completely new problem, we can view data alignment as an extension of data dependence analysis. Data dependence analysis is the investigation of the manner in which the execution of one statement influences the results of another. In order for statements to be executed in parallel, it is necessary that the result of each statement be independent of the results of all other statements. Data dependence may take one of three forms: flow dependence, antidependence, or output dependence. Flow dependence indicates that a variable is defined by statement 81, then used by statement S2. Antidependence indicates that a variable is used by statement S1, then defined by statement S2. Output dependence indicates that a variable is defined by statement SI and by statement S2. By far the most important of these is flow dependence, because anti— and output dependence may be eliminated through the use of temporary variables. Dependence analysis is straightforward when statements involve only scalar vari- ables; if a variable appears on the left side of an assignment statement, then a depen- dence exists with any other statement that references that variable. However, when arrays are referenced using an expression that may vary between iterations of a loop, it is possible for a dependence to exist between separate iterations of the loop. Such a dependence is called a loop carried dependence. There are many ways to test for loop carried dependences. One method is to solve a Diophantine equation describing the array references and check if the solution lies within the bounds of the loop(s). For example, suppose state— 16 ment Sl defines A(f1(i1,i2, . . .,ig,), . . . ,fm(i1,i2, . . . ,ig,)) and statement S2 uses A(g1(j1,j2,...,jg,),...,gm(j1,j2,...,jg,)), where array A has m dimensions, S1 is nested in 61 loops and S2 is nested in [2 loops. Then a flow dependence exists if the equation fk(i1)i2)' ' .,i£1)= gk(j1)j27'° '7jl2)31 S k S m has an integer solution within the loop bounds. Solving the Diophantine equation can be a computationally demanding task, especially when m is large. As a consequence, several conservative tests of data dependence have been developed. Among these are the GCD test, the Banerjee test, and the /\ test [9, 13]. These tests are conservative in that they test necessary conditions for a dependence to exist. However, these conditions may not be sufficient. Thus, a test may indicate that a dependence exists when in fact it does not, but it will never indicate that a dependence does not exist when it does. Traditionally, parallelizing compilers have relied upon data dependence analysis to determine whether particular sections of code or iterations of a loop may be executed in parallel. If a dependence exists, then the loop is not parallelized. However, in the global parallelization scheme, it is assumed that occasionally a variable will be defined by one processor and used by another. What is most important is that the number of times a dependence occurs (thereby creating a need for communication) is minimized. Thus, data alignment may be viewed as an advanced type of data dependence analysis, through which we attempt to minimize some function, which serves as an approximation to communication costs. 17 2.3 Data Access Descriptors In developing the Data Access Descriptor, Balasundaram takes a slightly different ap- proach to data dependence analysis [1]. Rather than focusing upon the dependence of one statement on another, Balasundaram examines the manner in which regions of a program influence one another. This approach allows traditional statement-to- statement data dependence analysis to be unified with interprocedural data depen— dence analysis, which is extremely important in modular programs. The Data Access Descriptor contains a variety of information regarding the way in which a particular variable is accessed. An extremely important aspect of the Data Access Descriptor is the simple section, a set of hyperplanes that form convex hulls that completely enclose the regions of arrays accessed by a section of a program (which may be a single statement or the entire program). By applying intersection and union operations to these descriptors, it is possible to determine whether a data dependence exists between two regions of a program. For example, suppose we have the following section of code: D0 I=1, 5 D0 J=1, 5 A(J,I) = 1.0 ENDDO ENDDO D0 I=1, 3 D0 J=1, 3 A(J-I+5,I+J-1) = 1.5 ENDDO ENDDO D0 I=6, 10 DO J=1,5 18 A(I,J) = A(I,6-J) ENDDO ENDDO which we wish to parallelize by executing the third pair of loops concurrently with the execution of the first and second pairs of loops. Then the portion of A referenced by the first pair of DO loops (a region of the program which we will denote R1) can be described by: 2£$+y310 —4Sw—y§4 where a: and y correspond to the first and second dimensions of A, respectively. This system of equations may be represented graphically as: where solid lines correspond to equations which actually impose bounds on the array dimensions, and dotted lines correspond to equations providing redundant informa- 19 tion. The second pair of DO loops, R2, access a portion of A described by: OSSC'i—USIO —4§:c—y§0 which may be represented graphically as: 0 2 4 6 8 10 7Sx+y315 ISr—yg9 This set of equations has a graphical representation of: 20 6 I I I/ l \l \ 5— / - / \ 4e / ‘s 3— .. 2— \ 7 1_ \. /_ \ / 0 1 L l\ l /1 0 2 4 6 8 10 The DADS describing two regions of a program can be combined using union to yield a DAD describing a larger region of the program. So, we can get the Simple section describing regions R1 and R2 by taking the union of the simple sections for R1 and R2, yielding: 21 2§x+y310 —4S:c—y§4 If we compute the intersection of this merged simple section with the simple section for R3, we discover that it is: 83$+y310 —4£x-yS—2 Or, graphically, The fact that the simple section is not empty indicates that the section of code composed of R1 and R2 accesses some of the same array elements as R3. Therefore, these two sections of the program may not be executed in parallel. It is also easy to determine that region R1 could not be executed in parallel with a section composed of R2 and R3, since SS(R1) fl (SS(R2) U SS(R3)) 74 0. 22 2.4 Properties of Vectorizable Code In addition to the standard considerations when performing data alignment analysis, code which vectorizes well has certain special properties which we may wish to keep in mind. Arrays are generally accessed along rows and columns, rather than in irregular patterns. It is these regular accesses that facilitate vectorization. They also facilitate data alignment, in that it is easier to determine how to best distribute data that is always accessed in the same, Simple pattern. As was seen in Section 1.2, in the case of ADIFOR—generated code, the accesses are even more regular, as they all occur along the leading dimension of the array. In addition to regular access patterns, vectorizable code is characterized by many small loops. This presents a problem for standard parallelization techniques, because the amount of work performed inside the loops is too small to make the high overhead of executing a loop in parallel tolerable. However, when we exploit global parallelism, this fine granularity may be acceptable. We will keep these special properties in mind as we examine a mechanism for determining the proper data alignment for vectorizable programs. CHAPTER 3 Augmented Data Access Descriptors We now explore a mechanism for finding a good data alignment. We define a good alignment as one that leads to a distribution that is free from communication. In cases where we cannot eliminate communication, we attempt to minimize it. 3.1 Augmented Data Access Descriptors One approach to quantifying the amount of communication overhead associated with various data distributions is to extend the data access descriptor so that it describes both the sections of an array that are accessed by a particular statement and also the manner in which they are accessed. As explained in Section 2.3, the original data access descriptor consists of a set of hyperplanes forming a convex hull within which all array accesses are guaranteed to occur. This information can be very useful, but may be expensive to compute. However, because of the restrictive nature of data 23 24 alignment languages like Fortran D and DINO [5, 12], as well as the propensity of vector codes to access arrays along the axes, it is sufficient to restrict our convex hulls to hypercubes. We are most interested in accesses that proceed along one dimension of the array (as opposed to cases where the subscripts are coupled). Such accesses are easy to identify, because they occur when one of the subscripts of the array is a function of only one of the iteration variables; e. g., A (3*I+7) =1 .0. These accesses are also impor- tant, because they indicate that the array may be distributed across that dimension (and parallelized along that dimension) without any need for communication. The augmented data access descriptor (ADAD) for an array referenced in a par— ticular statement will consist of a k X n array of tuples, where k is the level of nesting for the statement and n is the number of dimensions in the array, plus an integer element, whose value is set equal to 19.1 For example, in statement SI in the following code: DO I=1, 100 D0 J=2, 50 ACI,J,I+J) = B(I,J) (Si) ENDDO ENDDO the ADAD describing A would contain a 2 x 3 array of tuples, plus the integer 2, and the ADAD describing B would contain a 2 X 2 array of tuples, plus the value 2. For a k X n tuple array, each tuple is denoted by D(A, S, I, D), where A is the array being 1Strictly speaking, this integer element is not necessary, as we may deduce k from the number of rows in the ADAD. However, it is included to facilitate the use of a statement’s degree of nest- ing as a tie-breaker in determining alignments (the use of this heuristic for situations in which a communication—free alignment is not possible will be discussed later). 25 described, 5 is the statement with which we are concerned, I is an iteration variable, and D is a dimension of array A. The first two members of each tuple indicate the lower and upper bounds, re- spectively, on the value subscript D may assume (note that this value is independent of I). The third element gives some indication of whether parallelizing the iteration variable to which that row of the ADAD corresponds will create a need for commu- nication if the array in question is distributed across that dimension. If this element is an asterisk (*) then communication is necessary, and the fourth element in the tuple is the empty set. If the element is a positive integer, then communication is not necessary, if the array is distributed in bands of width equal to this positive in- teger. In this case, the fourth element of the tuple is a set of offsets indicating which rows (or columns, depending on perspective) are referenced by that statement. This information is important for combining ADADS, a procedure to be described later. If the third element is zero, no communication is necessary, so long as this is the only reference to the array. The value zero is used as an indicator that the subscript is a nonlinear function of the particular iteration variable. Distribution is complicated, but possible, as discussed in Section 3.2.4. AS in the case of an asterisk, the tuple’s fourth element is the empty set. The ADADS for the example above are: 26 2 dimensionl dimension2 dimension3 139,31) = 1 (1,100,1,{0}) (2,50,*,(0) (3,150,*,0) J (1,100,*,(Zl) (2,50,1,{0}) (3,150,*,(0) and 2 dimension 1 dimension 2 00351) = I (1,100,1,{0}) (2,50,*,(0) J (1,100,*,0) (2,50,1,{0}) The details of how these ADADS are computed will be discussed in the next section. 3.2 Finding a Good Alignment Using ADADS 3.2.1 Program Model In describing ADADS and how they should be used, several assumptions are made about the programs being evaluated. One assumption is that loops have a step size of one. Such a loop is said to be normalized. Since it is possible to normalize any loop [13], this restriction is not significant. We also assume that arrays are primarily indexed by iteration variables. Constants and induction variables are also often used to index arrays. Since constants do not provide a means for distribution and induction variables can always be replaced by expressions involving iteration variables, this 27 assumption does not impose unnecessary restrictions. The third assumption is that those statements located in the innermost loops are executed most frequently. In general, this is a reasonable assumption, especially if loops involve more than a few iterations. However, if a statement is guarded by a conditional, and the guard is usually false, then it may be executed less frequently than a statement with a smaller degree of nesting. However, since this assumption only governs our decision as to how data should be aligned when a communication-free distribution does not exist, correctness is not affected, and it is hoped that the effect on performance is minimal. 3.2.2 Computing Augmented Data Access Descriptors The computation of ADADS is very simple. Given a particular reference to a par- ticular array in a particular statement, for each loop within which that statement is nested, we compute a tuple, D(A, S, I, D), representing the dependences of a particu- lar subscript with respect to the iteration variable corresponding to that loop, as well as the upper and lower bounds of the subscript. The upper and lower bounds, which as mentioned earlier are the first and second elements of a tuple and which we denote D(A,S,I,D)[1] and D(A,S,I,D)[2], can often be determined in a straightforward fashion using the upper and lower bounds of the iteration variables. For example, if a subscript is a monotonic function, G, of iteration variables I and J, then the lower bound of the subscript, D(A, S, I, D)[1] is equal to the lower bound of the function: D(A,S.I,D)l1l = min(G(L(1), L(J))aG(L(1), U(J))aG(U(I),L(J)),GUJU), U(J))) 28 Similarly, the upper bound of the subscript, D(A, S, I, D)[2], is: 904,531,139]: m&X(G(L(I), 1301)): G(L(I), U(J)),G(U(I),L(J)),G(U(I), UUD) In cases where G is not monotonic, or where the bounds of the iterations are not known, a conservative assumption regarding these bounds (0 for the lower bound and positive infinity (00) for the upper bound) may be made. The computation of the third element of the tuple may be expressed as: r m if D is a linear function m] + b of the iteration variable D(A7 5: I) D)l3l : i 0 if D is a nonlinear function of the iteration variable, I otherwise Similarly, {b} if D is a linear function mI + b of the iteration variable D(A, S,I,D)[4] = (2) otherwise As an example, the augmented data access descriptor for statement 81 in the following code segment: DO 100 I=1,100 DO 110 J=1,100 A(I,J) = SQRT(I*J) (Si) 110 CONTINUE 100 CONTINUE is: 29 2 dimensionl dimension2 D(A,Sl) = 1 (1,100,1,{0}) (1,100,*,0) J (1,100,*,0) (1,100,1,{0}) This is equivalent to saying: “Statement 81 has a nesting level of 2. With respect to iteration variable I, array A may be distributed across its first dimension, since the first subscript is a linear function (U + 0) of I, but not across its second dimension. With respect to iteration variable J, array A may be distributed across its second dimension, since the second subscript is a linear function (1.] + 0) of J, but not across the first dimension.” This is in complete agreement with the actual syntax of SI. 3.2.3 Combining Augmented Data Access Descriptors Combining two ADADS that describe the same array is also a rather straightforward task, as long as the convex hulls described by the the first two elements of the tuples intersect. If they do not intersect, it may be possible to avoid communication by distributing those two regions of the array in different ways. However, since the ability to do this also implies that the program could be re—written in terms of two different arrays, each of which would have its own augmented data access descriptor, we can ignore this case and assume the hulls intersect. Definition 3.1 (Merged Augmented Data Access Descriptors) Merged ADADS consist of an I X n array of tuples, where l is the number of iteration variables in common to the ADADS being combined, and n is the number of columns 30 in the original ADADS {also equal to the number of dimensions in the variable being described). As in regular ADADS, each tuple is a I-tuple. For each iteration variable I and dimension D, we derive a new tuple according to the following procedure. 1. D(A, {Sl,S2},I,D)[1] = min(D(A,SI,I,D)[1],D(A,S2,I,D)[1] 2. D(A, {Sl,S2},I,D)[2] = max(’D(A,Sl,I,D)[2],D(A,S2,I,D)[2] The reason for choosing these elements in this manner is that accesses to the array by these statements must lie within the hypercube whose bounds correspond to the max- imum and minimum, respectively, of the upper and lower bounds of the hypercubes containing the accesses of the statements being merged. 3. If :D(A,31,I,D)[3] ,i D(A,32,I,D)[3] or D(A, Sl, 1,0)[3] = 0 or D(A, 52, I, D)[3] = * Then D(A,{51,S2},I,D)[3] = *;D(A,{51,S2},I,D)[4]=0; Else Let B : D(A, 51,1, D)[4] U D(A, 52,1, D)[4]; If max(B) —min(B) < D(A,Sl,I,D)[3] Then D(A, {51, 52},I,D)[3] = D(A, Si, I, D)[3]; D(A, {51, 52},I,D)[4] = B; Else D(A, {51, $2},I,D)[3] = *; D(A, {Sl,S2},I,D)[4] = 0; Endif 31 Endif As an example, consider the following ADADs: 3 dimension 1 dimension 2 dimension 3 I (1,49227{_1}) (1.100330) (1,100,*,0) D(A,Sl) = J (1,49,*,0) (1,100,1,{-1}) (1,100,*,0) K (1,49,*,lll) (1,100,*,@) (1,100,2,{0}) 3 dimension 1 dimension 2 dimension 3 I 2,100,2, 0 1,200,*,0 1,100,*,(b W’s” : < { }) < ) < ) J (2,100,*,(ll) (1,200,1,{0}) (1,100,*,(b) K (2,100,*,(0) (1,200,*,0) (1,100,3,{0}) Then the merged augmented data access descriptor is: 3 dimension 1 dimension 2 dimension 3 I (1,100,2,{—1,0}) (1,200,*,(0) (1,100,*,0) D(A,{Sl,82}) = J (17100370) (1.200,*,@) (1,100,*,0) K (1,100,393 (1.200330) (1400740) Definition 3.2 (Global Augmented Data Access Descriptors) A global aug- mented data access descriptor is a merged augmented data access descriptor for all 32 statements in the section of a program being analyzed. 3.2.4 Using ADADS to Develop Alignment Statements After all of the ADADS describing a particular array have been combined to form one global ADAD, we can make some decision as to what the best distribution for that array might be. Lemma 3.1 If the third element of the tuple corresponding to the iteration variable whose loop is to be parallelized and the dimension of the array across which we wish to distribute is a positive integer, m, then the array should be distributed in bands of width equal to m. In addition, they should be oflset by an amount corresponding to the smallest member of the fourth element. If this procedure is followed, then communication will not be necessary. This may be formalized as follows: Given: S — a section of code. I —- an iteration variable in section S. n — the number of arrays accessed in section S. Ak(k = 1..n) — the arrays accessed in section S. Dk — an index corresponding to a dimension of array Ak. D(A, S,I, D)[j] — the value of the jth element of the tuple corresponding to iteration variable I and dimension D in the ADAD describing the accesses to array A in section S. 33 If: (31:Loop(I)§S:(VAk:1gkgn: (3D,, : 1 g Dk S Numdims(Ak) :’D(Ak,S,I,Dk)[3] Z 0))) Then: The loop may be parallelized in a communication free manner. Algorithm: Given any code segment S for which the constraint holds, we can construct a data parallel Fortran D version that is free from communication in the following manner: 1. Change the DO loop associated with I to a parallel loop. 2. For each array Ak choose some dimension Di, satisfying the above constraint, and let Tk = ’D(Ak, S, I, Dk)[3]. Also, let Tlcm equal the least common multiple of the nonzero Tks. 3. Add a decomposition statement of the form DECOMPOSITION X(U) to the code, where X is an unused variable name, and U = max{’D(Ak,S,I,Dk)[2] X (Tim/TO}- 4. For each array Ak, if T}, 75 0 add an alignment statement of the form ALIGN Ak(I,J,...) with X(stride*M—(strideXoflset)) to the program, where stride = Tzcm/Tk, oflset = max(D(Ak, S,I,Dk)l4]), and M corresponds to the placeholding index variable (I, J, etc.) representing di- mension D1,. If T1,. = O, array A}, is referenced only once and its distribution 34 should be handled independently and in a manner consistent with that refer- ence. If the reference pattern is complicated enough, this may require Specifying the distribution of each row individually. 5. Add a distribution statement, such as DISTRIBUTE X(BLOCK-CYCLIC(Tzcm)) to the program. Example: AS an example of the application of this algorithm, suppose we have the following ADADS. dimensionl dimension 2 D(A,global) = I (1,100,*,(b) (2,100,1,{0}) dimension 1 dimension 2 D(B,global) : I (2,199,2,{-2,-1}) (1,100,*,(ll) Then, we should create a Fortran D version of the program by adding the following lines to the program. DECOMPOSITION X(200) ALIGN A(I,J) with X(2*I) ALIGN B(I,J) with X(I+1) DISTRIBUTE X(BLOCK_CYCLIC(2)) 35 Proof of correctness: Assume a need for communication exists. Then, there must be a reference to an element of array A,- by processor p, when that element is not stored on processor p. Since we have used the ADADS as a basis for distribution, there is a reference R to A,- such that the value of dimension D,- is not in the range [p X D(Aj, S,I, Dj)[3] + min(D(Aj, S, I, D,)[4]), (p +1) X D(Aj, S, I, Dj)[3] + min(D(A,-, S, I, Dj)[4]) — 1] (we assume D(Aj,S,I,Dj)[3] 75 0, for if it did, then there would be only one refer- ence to Aj, and distribution should be performed in a manner consistent with that reference, as discussed in Section 3.2.4). Because of the procedure for merging ADADS, D,- ¢ [p X D(Aj, S, I, D,)[3] +min(’D(A,-, S, I, D,)[4]),p X D(Aj, S, I, D,)[3] + ma.v(D(A,~, S, I, Dj)[4])], which in turn implies that the individual ADAD for array A,- and reference R, D(A,R), must have D(A,,R, I, Dj)[3] 75 D(Aj,S,I,Dj)[3], or that the single element of D(A,,R, I,D,-)[4] is not an element of D(Aj, S,I, Dj)[4]. But, under the rules for constructing D(Aj, S) provided in Section 3.2.3, either situ- ation is impossible. Thus, the assumption that communication is necessary is false. QED. A value of 0 for the third element indicates a special case for which the requi— site distribution is difficult—the rows being accessed by successive iterations must be distributed to successive processors. Describing such distributions is possible in lan- guages like Fortran D, but very complicated. Because of the large amount of work involved, as well as the potential for error, one would not typically attempt to add the necessary code by hand. However, since the alignment statements describing this dis- 36 tribution can be generated automatically, the presence of such references is a strong argument for the use of a tool that can generate alignment information automatically. If the third element of a tuple is an asterisk, then communication will be necessary if the array is distributed across that dimension and the loop designated by that iteration variable is to be parallelized. In order to determine which distribution will result in the least communication, we re-evaluate the global ADAD, considering only those ADADS whose level of nesting is maximal. This approach is approximate, based on the assumption that the most deeply nested statements will be executed most frequently. If the new global ADAD still has an asterisk for the third element, much communication will be necessary if we insist on using that iteration variable and distributing across that dimension. Any of the distributions favored by an individual ADAD of maximal nesting is acceptable (if no such distribution exists, then any will suffice). 3.3 An Example As a simple example of the application of augmented data access descriptors, con- sider the following program. The program does not perform any useful function; it is intended merely as a simple example of various types of array access and the corresponding ADADS. REAL A(1oo,1oo),B(1oo,100),c(2oo,1oo),D(2oo,2oo,100) REAL PI PARAMETER (PI = 3.14159265) 37 INTEGER I ,J,K,M,N DO 100 I=i,1oo D0 110 J=1,100 A(I,J) = SQRT(I*J) (Si) B(I,J) = (I+J)/2 (52) 110 CONTINUE D0 120 K=1,50 C(2*I-1,2*K-1) = PI * ACI,2*K-1) (53) C(2*I,2*K-1) = C(2*I-1,2*K-1) * B(I,2*K-1) (s4) C(2*I-1,2*K) = PI * BCI,2*K) (SS) C(2*I,2*K) = C(2*I-1,2*K) * ACI,2*K) (so) 120 CONTINUE DO 130 K=1,200 D(I,K,1) = 0.0 (57) 130 CONTINUE DO 140 M=2,1oo D0 140 N=M,100 D(I+N,I,M) = D(I+N,I,M-i) + C(2*I,I+N) (38) 140 CONTINUE 100 CONTINUE Then the data access descriptors for A, B, C, and D may be computed for each of 31 through S8. They are: 2 dimensionl dimension2 D(A,Sl) = I (1,100,1,{0}) (1.100330) J (1,100,*,0) (1,100,1,{0}) D(B,SZ) D(A,Ss) D(C,SB) D(B,S4) mass) 38 2 dimension 1 dimension 2 I (1,100.1,{0}) (1.100,*,0) J (1,100,*,(0) (1,100,1,{0}) 2 dimension 1 dimension 2 I (1,100,1,{0}) (1799530) [(1 (1:1007*90) (1199,27i'1I) 2 dimension 1 dimension 2 I (17199a23{'1}) (13997*30) [(1 (1)1997*7®) (179972>{‘1}) 2 dimension 1 dimension 2 I (1,100.1,{0D (1.99,*,®) [(1 (171007*70) (1399:27{'1}) 2 dimension 1 dimension 2 I (17199)2${'1}) (17993*7®) [(1 (1:1997*90) (1)99327I'II) Ds(c,s4) D(B,SS) D(C,SS) D(A,SG) D1(C,S6) 39 2 dimension 1 dimension 2 I (2,200.2,{0}) (1,9933%) K1 (2,200,*,(ll) (1,99,2,{-1}) 2 dimension 1 dimension 2 I (1,100,1,{0}) (2,100,*,0) K1 (1,100,*,(l)) (2,100,2,{0}) 2 dimension 1 dimension 2 1 (1,199,2,{-1}) (2,100.30) K1 (1,199,*,(Z)) (2,100,2,{O}) 2 dimension 1 dimension 2 I (1,100.1,{0}) (2,100.30) K1 (1,100,*,(ll) (2,100,2,{0}) 2 dimension 1 dimension 2 I (17199723{'1}) (271007*70) K1 (1,199,*,(b) (2,100,2,{0}) D2(C,Se) D(D,S7) D(C,S8) D1(D,S8) 40 2 dimension 1 dimension 2 I (2,200,2,{0}) (2,100,*,0) K1 (2,200,*,0) (2,100,2,{0}) 2 dimension 1 dimension 2 dimension 3 I (1,100,1,{0}) (1,200,*,0) (i,1,*,(0) K2 (1,100,*,(ll) (1,200,1,{0}) (1,1,*,(ll) 3 dimension 1 dimension 2 I (2,200,2,{0}) (3,200,*,0) M (2,200,*,0) (3,200,*,@) N (2,200,*,(b) (3,200,*,0) 3 dimension 1 dimension 2 dimension 3 I (3,200,*,(ll) (1,100,1,{0}) (1,99,*,0) M (3,200,*,lll) (1,100,*,0) (1,99,1,{-1}) N (3,200,*,0) (1,100,*,Q)) (1,99,*,fl) D2(D , 38) Note that the variable name K is used for two logically different iteration vari- ables. To accommodate this situation, the variable names in the ADADS have been 41 dimension 1 dimension 2 dimension 3 (3,200,*,(l)) (1,100,1,{0}) (2,100,*,0) (3,200,*,0) (1,100,*,Ql) (2,100,i,{0}) (3,200,*,0) (1,100,*,0) (2,100,*,(Z)) subscripted. From these ADADS we can derive: D(A,global) D(B,global) D(C,global) D(D,global) dimension 1 dimension 2 (1,100,1,{0}) (1,100,*,0) dimension 1 dimension 2 (1,100,1,{0}) (1,100,*,0) dimension 1 dimension 2 (1,200,2.{0}) (1,100,*,(2l) dimension 1 dimension 2 dimension 3 (1,200,*,0) (i,200,*,®) (i,100,*,0)) 42 Since I is the only loop common to all statements, it is the only candidate for parallelization. From the global augmented data access descriptors, it may be seen that arrays A and B should be distributed in a row—wise manner (that is, A(1 , 1. . 100) and B(1 , 1. .100) on one processor, A(2, 1. . 100) and B(2 , 1 . .100) on another, etc.). Matrix C should be distributed in striated row-wise fashion, with striations of width 2. Array D, however, cannot be distributed without a need for communication. Thus, the distribution favored by those statements with the greatest level of nesting is determined. In this case, the statement whose nesting level is maximal is statement S8 (rather than S7, the only other statement to reference D). The merged access descriptor is thus: dimension 1 dimension 2 dimension 3 D(D,{S8}) = I (3.200%) (1,100,1,{0}) (1.100320) So, array D should be distributed across its second dimension. It should be noted that all communication could be eliminated if the order of the subscripts in S7 were reversed, which is most likely the programmer’s intention. Thus, code that contains errors that do not necessarily lead to incorrect answers (suppose, for example, that the Fortran compiler automatically initializes all arrays to all zeroes) can still lead to inefficient parallel programs. The Fortran D code corresponding to this alignment is: REAL A(100, 100) ,B(100 , 100) ,C(200,100) ,D(200,200,100) DECOMPOSITION X(200) ALIGN A(I,J) with X(2*I) 43 ALIGN B(I,J) with X(2*I) ALIGN C(I,J) with X(I) ALIGN D(I,J,K) with X(2*J) overflow (WRAP) DISTRIBUTE X(BLOCK_CYCLIC(2)) REAL PI PARAMETER (PI = 3.14159265) INTEGER I,J,K,M,N PARALLEL DO 100 I=1,100 DO 110 J=1,100 A(I,J) SQRT(I*J) B(I,J) (I+J)/2 110 CONTINUE DO 120 K=1,50 C(2*I-1,2*K-1) = PI * A(I,2*K-1) C(2*I,2*K-1) C(2*I-1,2*K-1) * B(I,2*K-1) C(2*I-1,2*K) PI * B(I,2*K) C(2*I,2*K) = C(2*I-1,2*K) * ACI,2*K) 120 CONTINUE DO 130 K=1,2OO D(I,K,1) = 0.0 130 CONTINUE DO 140 M=2,100 DO 140 N=M,100 D(I+N,I,M) = D(I+N,I,M-1) + C(2*I,I+N) 140 CONTINUE 100 CONTINUE 3.4 Interprocedural Analysis It is often the case that a section of code that we wish to analyze contains one or more calls to subroutines. Therefore, we must establish a framework for computing ADADS 44 when such a call occurs. It is a fairly simple task to include procedure calls in the scheme for combining ADADS. First, we compute the ADAD(S) for the subroutine, as usual. Then, the ADAD(S) for the call statement can be computed by replacing the formal parameters in the ADAD(S) computed for the subroutine with the actual parameters of the call. If the arrays undergo reshaping due to the nature of the call, then this reshaping and the indexing used need to be taken into aCcount. Any rows corresponding to iteration variables local to the subroutine can be ignored, since we are restricting ourselves to an analysis at a higher level. If we wish to consider these variables in our alignment analysis, we should determine the best alignment for the section of code before the call, the best alignment for the subroutine, and the best alignment for the section of code after the call. Finally, the ADAD(S) for the call statement can be merged with the ADADS for the rest of the section. This procedure is best illustrated with an example. Suppose we have the following section of code: DOUBLE PRECISION A(100),X(100,100),Y(100) DO I=1, 100 A(I) = I (Si) CALL SUBR(I,A,X(1,I)) (s2) Y(I) = 0 (33) DO J=1, 100 Y(I) = Y(I) + X(J,I) (34) ENDDO ENDDO and that the code for SUBR looks like: SUBROUTINE SUBRCI,X,Y) DOUBLE PRECISION X(100),Y(100) 45 INTEGER I,J DO J=1, 100 Y(J) = SQRT(X(I)*J) ENDDO END Then, the ADADS for the subroutine are: 2 dimension 1 D(X,SUBR) = 1 (1,ee,i,{0}) J (1,oo,*,@) and 2 dimension 1 D(Y,SUBR) = 1 (1,100,*,@) J (1,100,1,{0}) Following the rules outlined above, we get the following ADADS for statement 82: 1 dimension 1 190%52) I (1,100J,{0}) and D(x,s2) These may be combined with the ADADS for the rest of the section: D(A,Si) D(Y,SB) D(X,S4) and 46 dimension 1 dimension 2 (1,100,*,0) (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 dimension 2 (1,100,*,(0) (1,100,1,{0}) (1,100,1,{0}) (1,100,*,(ll) 47 2 dimension 1 D(Yss‘l) = I (1,100,1,{0}) J (1,100,*,(Z)) to yield the ADADS for the section: 1 dimension 1 D(A,section) = 1 (1,100,1,{0}) 1 dimensionl dimension 2 D(X,section) = I (1,100,*,@) (1,100,1,{0}) and 1 dimension 1 D(Y,section) = I (1,100,1,{0}) Thus, this computation may be performed in a manner free from communication, if A and Y are distributed across their first dimensions, and X is distributed across its second dimension. The Fortran D code for this is: DOUBLE PRECISION A(100) ,X(100,100) ,Y(100) 48 DECOMPOSITION B(100) ALIGN A(I) with B(I) ALIGN X(I,J) with B(I) ALIGN Y(I) with B(I) DISTRIBUTE B(CYCLIC) PARALLEL DO I=1, 100 A(I) = I CALL SUBR(I,A,x(i,I)) Y(I) = 0 D0 J=1, 100 Y(I) = Y(I) + X(J,I) ENDDO ENDDO CHAPTER 4 The Effects of Loop Transformations on ADADS Most parallelizing compilers attempt to perform a number of different loop transfor- mations, in the hope of enabling the parallelization of loops that previously could not be parallelized. Loop transformations are also used to increase the granularity of a program. When we use a data parallel programming paradigm, our concerns are different. In particular, our primary goal is to reduce the amount of communi- cation required. Loop transformations may help us to achieve that goal. Therefore, the effects of some of the standard loop transformations on augmented data access descriptors are presented, not because it is easier to modify ADADS than it is to recalculate them, but so that our transformations may be guided by the effect they will have on the ADADS. 49 50 4. 1 Promotion Promotion entails the addition of a dimension to a variable, which is indexed by the iteration variable of a loop within which the variable is defined. This technique is par- ticularly appropriate for intermediate variables. For example, consider the following section of code: DO I=1, 100 DO J=1, 10 TEMPCJ) = COS(A(I,I+J)) (Sl) B(I,J) = 2.0*TEMP(J) (S2) ENDDO ENDDO The variable TEMP is not indexed by I and therefore can not be distributed among the processors using that variable. However, if we perform a promotion, we get the following code: DO I=1, 100 D0 J=1, 10 TEMP(I,J) = CDS(A(I,I+J)) (81’) B(I,J) = 2.0*TEMP(I,J) (82’) ENDDO ENDDO which enables TEMP, A, and B to be distributed among up to 100 different processors. After a promotion, the ADADS associated with the variable being promoted should have a dimension (column) added. The tuple for the iteration variable with which we are indexing the new dimension will be (L, U, 1, {0}), where L and U are the lower and upper bounds, respectively, on the iteration variable. The tuple for all other rows is (L, U, *, (ll). So, for TEMP in the example above, the original ADADS were: D(TEMP,Sl) = and D(TEMP,S2) = dimension 1 (1,10,*,(b) (1,10,1,{0}) dimension 1 (1,10,*,(ll) (1,10,1,{0}) After promotion, the ADADS are: D(TEMP,Sl’) = and D(TEMP,SZ’) = 51 dimension 1 dimension 2 (1,100,1,{0}) (1,10,*,0) (1,100,*,(Zl) (1,10.1,{0}) dimension 1 dimension 2 (1,100,1,{0}) (1,10,*,(b) (17100)*)®) (1.10.1,{0}) 52 4.2 Loop Unrolling In loop unrolling, the number of iterations of the leop is reduced by changing the step size, and adding statements to accommodate the values between steps. Since we have restricted actual step sizes to a length of 1, the effective step size may be changed by dividing the loop’s upper bound by some factor, which we will refer to as the unrolling factor, and multiplying the iteration variable by that factor. For example, DO I=1, 100 A(2*I) = B(I+2) (Si) ENDDO might become (with an unrolling factor of 4) DO I=1, 25 A(8*I-6) = B(4*I-1) (81a) A(8*I-4) = B(4*I) (Sib) A(8*I-2) = B(4*I+1) (Sic) A(8*I) = B(4*I+2) (Sid) ENDDO If a loop is unrolled by a factor M, the ADADS for statements within the loop could be expanded to M ADADS, one for each statement introduced by the unrolling. However, it is simpler to modify the original ADAD to one which describes the M derived statements. Only tuples in the row corresponding to the loop being unrolled are affected. The first and second elements in each tuple remain unchanged. If the third element, D(A, S, I, D)[3], is * or 0, it should remain unchanged. Otherwise, D(A, S, I, D)[3] should be multiplied by .M and the fourth element modified according to the following procedure. For each element E in the original set, D(A, S, I, D)[4], 53 introduce M elements to the new set, D(A, {S0,Sb, ...},I,D)[4], equal to E + k X D(A, S, I, D)[3], (k = [1 — M, 0]). The original ADADS for the example above are: 1 dimension 1 D(A,Si) = I (2,200,2,{0}) and 1 dimension1 D(B,Si) = I (3,102,1,{2}) The new augmented data access descriptors would be: 1 dimension 1 D(A,{Sia,S1b,Sic,Sid}) = I (2,200,8,{—6,—4,—2,0}) and 1 dimension 1 D(B,{S1a,Sib,Sic,S1d}) = I (3,102,4,{—i,0,i,2}) So, the Fortran D alignment statements might look like: DECOMPOSITION X(200) 54 ALIGN A(I) with X(I) C The expression 2*I-4 in the next statement derives from C the fact that 8/4=2 and B has offset 2, so we need to C align B(I) with X(2*(I-2)). ALIGN B(I) with X(2*I-4) overflow (WRAP) DISTRIBUTE X(BLOCK_CYCLE(8)) 4.3 Loop Fusion In loop fusion, two adjacent loops are merged to form one loop. The iteration variables of the original loops are replaced by the new iteration variable. For example, DO I=1, 100 B(I) = 2*I*A(2*I) (Si) ENDDO DO J=i, 100 C(J) = A(2*J-i) - 1 (S2) ENDDO could become DO K=1, 100 B(K) = 2*K*A(2*K) (Si’) C(K) = A(2*K-1) - 1 (82’) ENDDO In order to perform loop fusion, while preserving program correctness, there are certain criteria which must be satisfied. The formal verification that transformations are correctness-preserving is left to books on parallel compiler construction [13]. In- stead, we restrict ourselves to examples that are Simple enough that correctness is readily apparent. 55 Modifying the ADADS affected by a loop fusion, once it has been determined that one may occur, is Simple. The rows of the ADADS that corresponded to the original iteration variables now correspond to the new iteration variable. Also, the ADADS describing a region of code may change, and should be computed. Thus, for the example above, the original ADADS of 1 dimension 1 D(A,Si) = 1 (2,200,2sl0l) 1 dimension1 D(B,Si) = I (1,100,1,{0}) 1 dimensionl D(A,SQ) = J (17199127l'1}) 1 dimension1 D(C,S2) = J (1,100,1,{0}) D(A,{Si,s2}) NULL become: 56 1 dimension 1 D(A,Si’) = K (2,200.2,{0D 1 dimension1 D(B,Si’) = K (1,100,1,{0}) 1 dimension1 D(A,S2’) = K (1,199.2,{-1}) 1 dimension1 D(C,S2’) : K (1,100,1,{0}) 1 dimension 1 LKA,{Si’,S2’}) 2: K (1,200,2,{-1,0}) 4.4 Loop Interchange As the name implies, loop interchange entails the interchange of two loops. The outer loop becomes the inner loop, and the inner loop becomes the outer loop. Thus, DO I=1, 100 D0 J=1, 100 57 A(I,J) = 0.0 ENDDO ENDDO would become DO J=i, 100 D0 I=1, 100 A(I,J) = 0.0 ENDDO ENDDO As with loop fusion, the conditions under which loop interchange may safely be performed depend on sophisticated data dependence analysis [13]. However, assuming loop interchange can be performed, the ADADS associated with individual statements and with regions of the program will remain unchanged. 4.5 Strip Mining Strip mining is often used in parallelizing compilers to increase the amount of work performed inside of a loop. It is analogous to loop unrolling, except that rather than increasing the number of statements by some factor M, a new loop is introduced to cover the range of iterations between the outer steps. So, DO I=1, 1000 A(I) = 3.14*(R(I)**2) ENDDO becomes DO I=1, 10 58 DO J=1, 100 A(100*(I-1)+J) = 3.14*(R(100*(I-1)+J)**2) ENDDO ENDDO This is an extremely useful technique when we want to parallelize a particular loop for execution on a Shared memory computer with a relatively small number of processors. However, the process of strip mining results in all tuples associated with the original iteration variable, as well as those associated with the new iteration variable, having a third element of *. Thus, when we are attempting to exploit global parallelism on a large distributed memory computer, strip mining can have an adverse effect. Instead of increasing the work load per processor through strip mining, we should utilize the distribution facilities of the data parallel language being used. 4.6 Invariant Code Movement If a statement within a loop uses variables that do not change within the loop, defines a variable that is not defined elsewhere in the loop nor used earlier in the loop, and the statement does not occur within a conditional, then that statement may be moved to a position immediately before the loop. This is useful for eliminating extra calculations. For example, DO J=1, 5 V(J) = 0.0 (Si) DO I=1, 100 ‘ A(J) = 3.14*R(J)*R(J) (s2) V(J) = V(J) + A(J)*F(I) ($3) ENDDO ENDDO 59 can be changed to DO J=1, 5 V(J) = 0.0 (Si) A(J) = 3.14*R(J)*R(J) (82’) DO I=1, 100 V(J) = V(J) + A(J)*F(I) (SS) ENDDO ENDDO When a statement like this is moved outside of a loop, the row in the ADAD for the statement corresponding to that loop’s iteration variable should be eliminated, and the integer representing the nesting level of the statement Should be decremented. S0, in the previous example, 2 dimension 1 D(A’SQ) = I (1.5M) J (175717{0}) becomes 1 dimension 1 IMA,S2’) J (1,5,1,{0}) Although this technique does not assist in the parallelization of the program, it can result in significant performance improvements, by eliminating unnecessary work. 4.7 An Example To understand the usefulness of directed loop transformations, consider the following section of a program: DO I=1, 100 D0 J=1, 100 60 A(J,I) = A(J,I) + SQRT(J*I) ENDDO DO K=1, 50 A(2*K,I) = A(2*K-1,I) ENDDO DO L=1, 100 A(L,I+1) = A(L,I) ENDDO ENDDO If we attempt to determine the best alignment for this code, we get the following ADADS 2 dimension 1 dimension 2 WW) = I (1.100,*,@) (1,100,1,{0}) J (1,100,1,{0}) (1,100,*,(ll) 2 dimension 1 dimension 2 D2(A,Sl) = 1 (1,100,*,0) (1,100,1,{0}) J (1,100,1,{0}) (1,100,*.0) (81) (S2) (83) D1(A,82) D2(A,S2) D1(A,SB) D2(A,SB) and D(A,global) 61 dimension 1 dimension 2 (2,100,*,(ll) (1,100,1,{0}) (2,100,2,{0D (1,100,*,(2)) dimension 1 dimension 2 (1,99,*,(ll) (1,100,1,{0}) (1199127{'1}) (1,100,*,0) dimension 1 dimension 2 (1,100,*,(D) (2,101,{1}) (1,100,1,{0}) (2,101,*,0) dimension 1 dimension 2 (1,100,*,0) (1,100,{0}) (1,100,1,{0}) (1,100,*,(Z)) dimension 1 dimension 2 (1,100,*,(2)) (1,101,*,(Z)) 62 So, there is no way in which to distribute this code without the need for commu- nication being introduced. However, if we unroll the first and third loops by a factor of two, then fuse the three loops, we get: DO I=1, 100 D0 J=1, 50 A(2*J-1,I) = A(2*J-1,I) + SQRT((2*J-1)*I) (Sia’) A(2*J,I) = A(2*J,I) + SQRT(2*J*I) _ (Slb’) A(2*J,I) = A(2*J-1,I) (32’) A(2*J-1,I+1) = A(2*J-1,I) (SSa’) A(2*J,I+i) = A(2*J,I) (SSb’) ENDDO ENDDO which has the following ADADS: 2 dimension1 dimension 2 1310.781?) = I (i,99,*,0) (1,100,1,{0}) J (1,99,2,{—1}) (1,100,*,(ll) 2 dimension1 dimension 2 D2(A,Sla’) = I (1,99,*,(Z)) (1,100,1,{0}) J (1,99,2,{-1}) (1,100,*,(ll) 2 dimension1 dimension 2 IDNthib’) ll ll—l (2,100,*,0) (1,100,1,{0}) J (2,100,2,{0D (1,100.30) D2(A,Sib’) Dl(A,32’) D2(A,S2) D1(A,SSa’) D2(A,SBa ’) 63 dimension 1 dimension 2 (2,100,*,(0) (1,100,1,{0}) (2,100,2,{0D (11100330) dimension 1 . dimension 2 (2,100,*,0) (1,100,1,{0}) (2,100,2,{0}) (1,100,*,0) dimension 1 dimension 2 (1,99,*,0) (1,100,1,{0}) (1,99,2,{-1}) (1,100,*,(2)) dimension 1 dimension 2 (1,99,*,fl) (2,101,{1}) (1,99,2,{-1}) (2,101,*,(b) dimension 1 dimension 2 (1,99,*,(l)) (1,100,{0}) (1,99,2,{-1}) (1,100,*,(D) 64 dimension 1 dimension 2 D1(A,83b’) (2,100,*,0) (2,101,{1}) (2,100,2,{0}) (2,101,*.0) dimension 1 dimension 2 D2(A,S3b’) (2,100,*,Q)) (1,100,{0}) (2,100,2,{0}) (1,100,*.0) and dimension 1 dimension 2 D(A,global) (1,100,*,0) (1,101,*,0) (1,100,2,{-1,0}) (1,101,*,(l)) So, after the transformations, we may distribute A along the first dimension, using a stride Size of 2, and we will have no need for communication. This could be achieved using: DECOMPOSITION X(iOO) ALIGN A(I) WITH X(I) DISTRIBUTE X(BLOCK_CYCLE(2)) This represents a significant improvement over the original version, which would have had to forgo all parallelism, or suffer the added cost of communicating information between processors. CHAPTER 5 Performance Analysis Having described ADADS, the manner in which they are combined, how they should be interpreted, and the effect upon them of various loop transformations, we are now prepared to examine the manner in which they may be applied to the parallelization of vectorizable programs. We begin by examining some of the properties of ADIFOR- generated code and examine how an understanding of these properties assists in the task of parallelization. We then propose a method using ADADS to decide how a vectorizable program should be parallelized and apply this method to two simple programs. Finally, we confirm the conclusions made about these programs through performance measurements on a parallel computer. 5.1 Parallelizing ADIFOR-generated code The parallelization of code generated by ADIFOR is a complicated problem, be— cause nothing is known about the algorithms to which ADIFOR will be applied. In cases where the original algorithm is not parallelizable, it is highly desirable that the 65 66 parallelism inherent to ADIFOR-generated code be exploited to the greatest extent possible. However, it is not clear that this is the case when the original algorithm has a large amount of parallelism. In such cases, it may be best to ignore the additional parallelism introduced by ADIFOR. Because ADIF OR uses primarily the forward mode1 of automatic differentiation [2], a large number of vector operations are embedded in the newly generated code. For example, given the code section: IC=ICOMP(J,I) SIGT(J,I)=SIGMAT(IC) C(J,I)=CC(IC) QEXT(J,I)=Q(IC) ADIFOR will produce: ic = icompCj, i) C Sigt(j, i) = sigmat(ic) do 99798 g$i$ = 1, g$p$ g$sigt(g$i$, j, i) = g$sigmath$i$, ic) 99798 continue sigt(j, i) = sigmat(ic) C C(j, i) = cc(ic) do 99797 g$i$ = 1, g$p$ g$c(g$i$, j, i) = g$cc(g$i$, ic) 99797 continue C(j, i) = cc(ic) C qexth, i) = q(ic) do 99796 g$i$ = 1, g$p$ g$qext(g$i$, j, i) = g$q(g$i$, ic) 99796 continue qexth, i) = inc) 1for a discussion of the differences between the forward and reverse modes of automatic differen- tiation, please see [6]. 67 Two options for global parallelism present themselves. If the original algorithm possessed parallelism in an outer loop, the ADIFOR-generated code will too, and we can use this for parallelizing the code. However, if this parallelism does not exist, we can exploit the parallelism inherent in the vector loops. In order to achieve this, we replicate scalar operations across all processors, then perform all vector operations in parallel. This is roughly equivalent to promoting all scalars and fusing the vector loops. We shall refer to this technique as the inside—out method. For example, the code fragment: A=PI*(R**2) W[i . .30] =A*X[1 . .30] might become: PROCESSOR 1 PROCESSOR 2 PROCESSOR 3 A=PI*(R**2) A=PI*(R**2) A=PI*(R**2) W[1..10]=A*X[1..10] wEii..20]=A*xE11..2oJ w[21..30]=A*xC21..30] In order to get an upper bound on the speedup achievable with this method, assume that the time to branch and join is negligible compared to the total time for a computation. Also, let: N = number of processors to be used C = number of columns of the jacobian to be evaluated E (0) = time to compute original function. E (1) = time to compute function plus one column of J acobian E (2) = time to compute function plus two columns of J acobian Then, the time to compute one element of each of the gradient vectors (one column of the Jacobian), EV, is: 68 Furthermore, the time spent doing scalar operations (computing the value of the function, computing reverse mode objects, and loop overhead), E5, is: E5 = E(l) - EV = 2 * E(l) — E(2) The run time of the sequential program should therefore be: ES'l'CEV The time required using the inside—out method and a slice size of 1 is: Accordingly, speedup is: Es + CEV [1%] (Es + EV) Using a more appropriate slice Size ([C/N] ), we achieve a runtime of: C EsiltlEV Since E5 is guaranteed to be at least E (0), the running time of the original algorithm, this represents a Significant improvement in cases where C is substantially larger than 69 N. The speedup is: ES+CEV Es+ [19,—] Ev 5.2 Applying ADADS to Vectorizable Code Based on the results of the previous section, we propose a method for determining the best manner in which to parallelize a vectorizable program. First, the original program should be Simplified by replacing all vector operations by scalar operations. Next, ADADS may be used to determine an alignment for the resultant code. If this alignment if free from, or nearly free from, communication, then this alignment may be employed in parallelizing the original program. Otherwise, we may utilize the inside—out method, promoting the scalar operations and fusing the vector loops to yield a program with a high degree of global parallelism, at the cost of duplicate computations. AS an application of this technique, consider the programs in Appendix A. Pro- gram 1 computes f,- = sin(:z:,-) through a series of trigonometry identities. The reason for doing this is to make the solution easy to check. Program 2 computes f = fsinflri) through the same identities. Note that in the case of ADIFOR- ":1 generated code, replacing the vector operations with scalar operations yields a pro— gram roughly equivalent to the original program. Thus, it is sufficient to examine the original programs, and apply our conclusions to the ADIFOR-generated code, which 70 can be found in Appendix B. It is obvious from inspection that the first program pos- sesses a high degree of parallelism while the second is naturally sequentially. However, the same conclusions can be reached methodically through the use of ADADS. First, consider program 1. The augmented data access descriptors are: 1 dimension 1 D(temp1,Si) = I (1,100,1,{0}) 1 dimension 1 D1(X,Sl) : 1 dimension 1 D2(X,Sl) = 1 (1,100,1,{0}) 1 dimension 1 D(temp2,S2) = I (1,100,1,{0}) 1 dimension 1 D1(x,82) :— I (1,100,1,{0}) DQ(X,S2) D(temp3,83) D1(x,83) D2(X,33) D3(X,83) D4(X,83) D(f,S4) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) dimension 1 (1,100,1,{0}) 72 1 dimension 1 D1(temp1,S4) : 1 (1,100,1,{0}) 1 dimension 1 D2(temp1,S4) = I (1,100,1,{0}) 1 dimension 1 D(temp2,S4) = I (1,100,1,{0}) 1 dimension1 D(temp3,S4) = 1 (1,100,1,{0}) 1 dimension1 D(x,S4) = I (1,100,1,{0}) These may be combined to yield the global ADADs: dimension 1 D(x,global) = 73 dimension 1 D(temp1,global) = dimension 1 D(temp2,global) :— 1 (1,100,1,{0}) dimension 1 D(temp3,global) : I (1,100,1,{0}) dimension 1 D(f,global) : I (1,100,1,{0}) Thus, this program may be parallelized by parallelizing the I loop and distributing x,temp1,temp2,temp3, and f across their first (and only) dimension. The Fortran D code for doing the alignment is: DECOMPOSITION A(100) ALIGN F(I) with A(I) ALIGN X(I) with A(I) ALIGN TEMP1(I) with A(I) ALIGN TEMP2(I) with A(I) ALIGN TEMP3(I) with A(I) DISTRIBUTE A(CYCLIC) Next, we verify that program 2 can not be parallelized. Since it is impossible to 74 distribute a variable if it is only a scalar, we promote these variables, yielding program 3. However, even this program is inherently sequential. The augmented data access descriptors for f array are: 1 dimension1 D1(farray,S4) = and I (1,100,1,{0}) 1 dimension 1 D2 (f array,S4) = 1 (2,101,1,{1}) When combined, these yield the global ADAD: dimension 1 D(farray,global) = l (1,101,*,Ql) Thus, we are unable to distribute f array in any way that prevents communication. This limitation is what makes the inside—out method so attractive when we attempt to parallelize the ADIFOR-generated code. 5.3 Empirical Results In order to confirm the conclusions reached in the previous section, both programs were implemented on a BBN Butterfly TC2000. Although the TC2000 is a shared 75 memory machine, it is similar to distributed memory machines in that it exhibits non- uniform memory access times, dependent upon which processor attempts to reference a particular storage location. The programs were processed by ADIFOR and the resultant program parallelized. As can be observed from the data in Figures 1 and 2, the performance of actual implementations was in keeping with the projections made using our simplistic model in Section 5.1. In particular, note that the speedup of the inside-out method is bounded by N/Z. The degradation in efficiency as N increases can probably be attributed to the overhead associated with forks and joins, and would probably be less significant in larger problems. Taking this into account, we can conclude that for computationally intensive programs, the speedup of a program that requires no communication will be nearly linear, While only a speedup of N / 2 can be achieved for programs that must be parallelized using the inside—out technique to eliminate the need for communication. Thus, by using ADADS to determine a good alignment for the original code and by applying the conclusions of section 5.1 regarding the speedup available using the inside—out method for a particular ratio of vector to scalar computations, we can make accurate predictions concerning the best way to parallelize a program. In the case of ADIFOR-generated code, if the efficiency of a program containing alignment information is better than 50%, it is probably not worthwhile to attempt to use the inside-out technique on the derivative program. 76 6 r executipn time I 20 1 speedup seconds nodes l I efficiency solid - parallel function evaluation dotted - inside-out mechanism 0.6 ~ — dot-dash - both dashed - sequential version 0.4 I 0.2 - I 0 I l L 0 10 20 30 40 nodes Figure 5.1. Performance of various parallelization methods for program 1 5 execution time 5 T speedup r i 4 c - —8 — 3 _ i I: 0 § — 2 — — _ l .. .. 0 1 i 0 1 l O 10 20 30 0 10 20 30 nodes nodes 0 6 , efficiency I solid - inside-out parallel mechanism 0'5 _ I dashed =- sequential version 0.4 ~ ~ 0.3 r 1 0.2 L ~ 0.1 4 ' 0 10 20 30 nodes Figure 5.2. Performance of inside-out method for program 2 CHAPTER 6 Conclusions 6.1 Summary Many of today’s scientific applications, such as climate modeling and superconduc- tivity studies, are so complex that they require the use of the most powerful super— computers available. Today, this class is represented by parallel computers. A great deal of effort has gone into developing programs that vectorize well. In addition, there exists an abundance of naturally vectorizable code, such as that generated by ADIFOR, an automatic differentiation tool. Because there exists such a large class of computationally difficult problems that vectorize well, we would like to be able to parallelize vectorizable programs. However, the parallelization of programs that perform well on vector supercom- puters is not a trivial task, because vector supercomputers and parallel supercom- puters exploit different types of parallelism. Vector computers utilize fine-grained parallelism, while parallel computers, especially distributed memory supercomputers, 77 78 achieve better performance using coarse—grained parallelism. This results in part from the memory hierarchies of the machines in question. So, the problem becomes one of determining the best distribution of data among the various processors. We presented a formal technique for determining a good distribution of data using the augmented data access descriptor (ADAD). This technique differs from previous approaches in that it views the problem of data alignment as an extension of data dependence analysis, rather than a completely new problem. Thus, the Data Access Descriptor [1], used for data dependence analysis, may be augmented to facilitate data alignment analysis, yielding the augmented data access descriptor. This approach is simple, efficient, and quite accurate. Ways to handle interprocedural analysis and the effects of loop transformations were presented. The former was discussed so that programs with good modularity are not penalized by incomplete analysis. The effects of loop transformations were examined so that they might guide decisions as to which transformations should be performed. An example showing how this knowledge could be utilized was provided. The question of how to apply ADADS to the parallelization of vectorizable code was also addressed. ADADs can be used to discover a good data alignment for any program. However, in the case of vectorizable programs, a second option presents itself. If an analysis of the program with all vector operations replaced by a scalar operation reveals that an alignment free from, or nearly free from, communication exists, then this alignment may be used to parallelize the program. Otherwise, the scalar operations of the program may be promoted and the vector loops fused, en- abling a high degree of global parallelism, at the cost of duplicate computations. 79 We term this technique the inside-out method. The viability of this procedure was confirmed through performance analysis on a BBN Butterfly TC2000. 6.2 Future Studies Possible subjects for future research include the development of a tool to automati- cally compute ADADS, plus the development of a tool that can use ADADS and the original source code to automatically generate source code in some language with data alignment constructs, such as Fortran D or DINO [5, 12]. Construction of these tools, especially the former, would help to resolve several unknown aspects of ADAD analysis. In particular, it is not clear what the storage requirements for the data structures used to describe an actual application program might be. Also, analysis of real programs would facilitate an assessment of the viability of ADADs in the auto- matic data partitioning of vectorizable programs. Should ADADS prove inadequate, we may employ a more sophisticated analysis (at the cost of performance) which pre- serves the notion of data alignment as data dependence analysis and recognizes the importance of interprocedural analysis and directed loop transformations. An important feature of ADAD analysis is that it can be done incrementally. Anal- ysis is performed on small regions of a program, and the ADADS are then merged so that they describe larger regions of the program. This feature makes ADAD analysis particularly well suited for a programming environment, because small changes to the program do not mandate another complete analysis of the program, but instead incur a small additional cost. In addition, incremental merging of ADADS may allow 80 a programmer to detect what statements are blocking a communication free distri— bution. This knowledge may enable the programmer to rewrite the program in a manner better suited for global parallelization. Thus, future studies should examine what hierarchy of ADAD analysis best facilitates these two goals. Another issue yet to be addressed is whether it is better to apply the method of ADAD analysis to the vectorizable code itself, or a scalar version of that code (in the case of ADIFOR, the scalar version is simply the original program). In the section 5.2, we observed that one can make good predictions as to the best approach using information about the ratio of vector statements to scalar statements, and the application of ADADS to the scalar version of the code. However, we could have come to the same conclusion by applying ADADS to the ADIFOR-generated code, after several loop transformations. The former approach offers simplicity and speed, while the latter approach could potentially offer greater accuracy. It remains to some future analysis to determine which approach is best. BIBLIOGRAPHY BIBLIOGRAPHY [1] V. Balasundaram. Interactive parallelization of numerical scientific programs. [2] [4] l5] [8] Technical Report TR89—95, Dept of Computer Science, Rice University, 1989. Christian Bischof, Alan Carle, George Corliss, Andreas Griewank, and Paul Hov- land. Adifor: Generating derivative codes from Fortran programs. Scientific Programming, 1(1), 1992. David Callahan, Jack Dongarra, and David Levine. Vectorizing compilers: A test suite and results. In Proceedings of Supercomputing ’88, pages 98—105. IEEE Computer Science Press, 1988. also Argonne Technical Report ANL/MCS-TM- 109. Bruce W. Char, Keith O. Geddes, Gaston H. Gonnet, Benton L. Leong, Michael B. Monagan, and Stephen M. Watt. Maple V Language Reference Man- ual. Springer Verlag, New York, 1991. G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C.-W. Tseng, and M.-Y. Wu. Fortran (1 language specification. Technical Report CRPC-TR90079, Center for Research in Parallel Computation, Rice University, December 1990. Andreas Griewank. On automatic differentiation. In Mathematical Programming: Recent Developments and Applications, pages 83—108, Amsterdam, 1989. Kluwer Academic Publishers. M. Gupta and P. Banerjee. Demonstration of automatic data partitioning tech— niques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179~193, 1992. J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. Journal of Parallel and Distributed Computing, 13:213—221, 1991. 81 82 [9] Z. Li, P.-C. Yew, and C.-Q. Zhu. An efficient data dependence analysis for parallelizing compilers. IEEE Transactions on Parallel and Distributed Systems, 1(1):26—34, 1990. [10] P. K. McKinley, H. Xu, and L. M. Ni. Efficient communication services for scalable architectures. Technical report, Dept. of Computer Science, Michigan State University, 1992. [11] Lionel Ni. Private communication. [12] M.‘ Rosing, R. Schnabel, and R. Weaver. The DINO parallel programming lan- guage. Journal of Parallel and Distributed Computing, 13:30-42, 1991. [13] M. Wolfe. Optimizing Supercompilers for Supercomputers. MIT Press, Cam- bridge, Mass., 1989. APPENDICES APPENDIX A Sample Programs PROGRAM examplel C This program contains a great deal of parallelism double precision f(100),x(100) do 100 i = 1 , 100 X(i) = sqrt(dble(i)) * sin(dble(i)) 100 continue call func(f,x) do 110 i = 1, 10 write(*,*) X(i), f(i), sin(x(i)) 110 continue end subroutine func(f,x) 83 200 84 double precision f(100),x(100) double precision temp1(100), temp2(100), temp3(100) do 200 i = 1, 100 templCi) = (sin(x(i)))**2 + (cos(x(i)))**2 temp2(i) = (1/(Sin(x(i))*sin(X(i)))) temp3(i) = (cos(x(i))*cos(x(i)))/(sin(x(i))*sin(x(i))) f(i) = sin(x(i)) * (temp1(i)*temp2(i) - temp1(i)*temp3(i)) continue end C 100 110 200 85 PROGRAM example2 This program is naturally sequential double precision f,x(100),answer,deriv do 100 i = 1 , 100 x(i) sqrt(dble(i)) * sin(dble(i)) continue call func1(100,f,x) answer - 0 do 110 i 1, 100 answer - answer + sin(x(i)) deriv = deriv + cos(x(i)) continue write(*,*) f,answer end subroutine func1(n,f,x) integer n double precision f,x(n), templ, temp2, temp3 f = 0.0d0 do 200 i = 1, n templ = (sin(x(i)))**2 + (cos(x(i)))**2 temp2 = (1/(sin(x(i))*sin(x(i)))) temp3 = (cos(x(i))*cos(x(i)))/(sin(x(i))*sin(x(i))) f = f + sin(x(i)) * (temp1*temp2 - temp1*temp3) continue end 100 110 86 PROGRAM example3 This is a new version of program 2, with scalar variables promoted to vectors. double precision f,x(100),answer,deriv do 100 i = 1 , 100 x(i) = sqrt(dble(i)) * sin(dble(i)) continue call func1(100,f,x) answer = 0 do 110 i 1, 100 answer - answer + sin(x(i)) deriv = deriv + cos(x(i)) continue write(*,*) f,answer end subroutine funchn,f,x) integer n double precision f,farray(101),x(n) double precision temp1(100), temp2(100), temp3(100) farray(1) = 0.0d0 do 200 i = 1, n (sin(x(i)))**2 + (cos(x(i)))**2 (1/(sin(x(i))*sin(x(i)))) temp1(i) temp2(i) 87 temp3(i) = (cos(x(i))*cos(x(i)))/(sin(x(i))*sin(x(i))) farray(i+1) = farray(i) + sin(x(i)) * (temp1(i)*temp2(i) + - temp1(i)*temp3(i)) 200 continue f = farrayClOl) end APPENDIX B ADIFOR— generated Code subroutine g$func$6(g$p$, n, f, g$f, ldg$f, x, g$x, ldg$x) O 0000 integer g$p$ integer g$pmax$ The ADIFOR-generated subroutine for program 1 Formal x is active. Formal f is active. parameter (g$pmax$ = 100) integer g$i$ double precision double precision double precision double precision double precision double precision double precision double precision double precision double precision templbar d$9bar d$10 d$9 d$8 d$7 d$4bar d$6 d$5 d$4 88 89 double precision d$3 double precision d$2 double precision d$1 double precision d$0 integer 1 real sin real cos integer n double precision f(n), x(n), templ, temp2, temp3 double precision g$f(ldg$f, n), g$x(ldg$x, n), g$temp1(g$pmax$), * g$temp2(g$pmax$), g$temp3(g$pmax$) shared x,g$f,g$x integer 1dg$f integer ldg$x if (g$p$ .gt. g$pmax$) then print *, ’Parameter g$p is greater than g$pmax.’ stop endif do 99999, i = 1, n C templ = sin(x(i)) ** 2 + cos(x(i)) ** 2 d$0 = X(i) d$1 Sin(d$0) d$3 X(i) d$4 cos(d$3) do 99994 g$i$ = 1, g$p$ g$temp1(g$i$) = cos(d$0) * (2 * d$1) * g$x(g$i$, i) + ((-sin *(d$3) * (2 * d$4)) * g$x(g$i$, 1)) 99994 continue templ = d$1 ** (2) + (d$4 ** (2)) C temp2 = (1 / (sin(x(i)) * sin(x(i)))) d$0 = X(i) d$1 = sin(d$0) d$2 = X(i) d$3 = sin(d$2) d$4 = d$1 * d$3 d$5 1 / d$4 d$4bar = (-d$5 / (d$4)) do 99993 g$i$ = 1, g$p$ g$temp2(g$i$) = cos(d$0) * (d$4bar * d$3) * g$x(g$i$, i) + c 99993 99992 99991 200 99999 90 *os(d$2) * (d$4bar * d$1) * g$x(g$i$, i) continue temp2 = d$5 temp3 = (cos(x(i)) * cos(x(i))) / (sin(x(i)) * sin(x(i))) d$O = x(i) d$1 = cos(d$0) d$2 = X(i) d$3 = cos(d$2) d$5 = X(i) d$6 = sin(d$5) d$7 = X(i) d$8 = sin(d$7) d$9 = d$6 * d$8 d$10 = d$1 * d$3 / d$9 d$4bar = (1.0do / d$9) d$9bar = (-d$10 / (d$9)) do 99992 g$i$ = 1, g$p$ g$temp3(g$i$) = (-sin(d$0) * (d$4bar * d$3)) * g$x(g$i$, i) *+ ((-sin(d$2) * (d$4bar * d$1)) * g$x(g$i$, i)) + cos(d$5) * (d$9b *ar * d$8) * g$x(g$i$, i) + cos(d$7) * (d$9bar * d$6) * g$x(g$i$, i *) end continue temp3 = d$10 f(i) = sin(x(i)) * (templ * temp2 - temp1 * temp3) d$0 = X(i) d$1 sin(d$0) d$4 templ * temp2 - templ * temp3 temp1bar = -d$1 * (temp3) temp1bar = temp1bar + d$1 * temp2 do 99991 g$i$ = 1, g$p$ g$f(g$i$, i) = temp1bar * g$temp1(g$i$) + d$1 * templ * g$te *mp2(g$i$) + (-d$1 * (templ) * g$temp3(g$i$)) + cos(d$0) * d$4 * g$ *x(g$i$, 1) continue fCi) = d$1 * d$4 continue continue O 0000 91 subroutine g$func1$6(g$p$, n, f, g$f, ldg$f, x, g$x, ldg$x) The ADIFOR-generated subroutine for program 2 Formal x is active. Formal f is active. integer g$p$ integer g$pmax$ parameter (g$pmax$ = 100) integer g$i$ double precision temp1bar double precision d$9bar double precision d$10 double precision d$9 double precision d$8 double precision d$7 double precision d$4bar double precision d$6 double precision d$5 double precision d$4 double precision d$3 double precision d$2 double precision d$1 double precision d$0 integer 1dg$f integer i real sin real cos integer n double precision f, X(n), templ, temp2, temp3 double precision g$f(1dg$f), g$x(ldg$x, n), g$temp1(g$pmax$), g$ *temp2(g$pmax$), g$temp3(g$pmax$) integer ldg$x if (g$p$ .gt. g$pmax$) then print *, ’Parameter g$p is greater than g$pmax.’ stop endif 99993 99992 99991 92 r = 0.0d0 do 99993 g$i$ = 1. g$p$ g$f(g$i$) = 0.0d0 continue do 99999, i = 1, n templ d$0 d$1 d$3 d$4 = sin(x(i)) ** 2 + cos(x(i)) ** 2 x(i) sin(d$0) x(i) cos(d$3) do 99992 g$i$ = 1. g$p$ g$temp1(g$i$) = c08(d$0) * (2 * d$1) * g$x(g$i$, i) + (C-Sin *(d$3) * (2 * d$4)) * g$x(g$i$, 1)) continue templ = d$1 ** (2) + (d$4 ** (2)) temp2 = (1 / (sin(x(i)) * sin(x(i)))) d$0 = x(i) d$1 = sin(d$0) d$2 = x(i) d$3 = sin(d$2) d$4 = d$1 * d$3 d$5 = 1 / d$4 d$4bar = (-d$5 / (d$4)) do 99991 g$i$ = 1, g$p$ g$temp2(g$i$) = cos(d$0) * (d$4bar * d$3) * g$x(g$i$, i) + c *os(d$2) * (d$4bar * d$1) * g$x(g$i$, i) continue temp2 = d$5 temp3 = (cos(x(i)) * cos(x(i))) / (sin(x(i)) * sin(x(i))) d$0 = x(i) d$1 = cos(d$0) d$2 = x(i) d$3 = cos(d$2) d$5 = x(i) d$6 = sin(d$5) d$7 = x(i) d$8 = sin(d$7) d$9 = d$6 * d$8 d$10 = d$1 * d$3 / d$9 d$4bar = (1.0do / d$9) 93 d$9bar = (-d$1o / (d$9)) do 99990 g$i$ = 1, g$p$ g$temp3(g$i$) = C-Sin(d$0) * (d$4bar * d$3)) * g$x(g$i$, i) *+ ((-sin(d$2) * (d$4bar * d$1)) * g$x(g$i$, i)) + cos(d$5) * (d$9b *ar * d$8) * g$x(g$i$, i) + cos(d$7) * (d$9bar * d$6) * g$x(g$i$, i *) 99990 ' continue temp3 = d$10 C f = f + sin(x(i)) * (templ * temp2 - templ * temp3) d$0 x(i) d$1 sin(d$0) d$4 templ * temp2 - templ * temp3 temp1bar = -d$1 * (temp3) temp1bar temp1bar + d$1 * temp2 do 99989 g$i$ = 1, g$p$ g$f(g$i$) = g$f(g$i$) + temp1bar * g$temp1