This is to certify that the dissertation entitled "Compilation-Time Data Decomposition Optimization for Data Parallel Programs" presented by Hong Xu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: November 9, 1994

COMPILATION-TIME DATA DECOMPOSITION OPTIMIZATION FOR DATA PARALLEL PROGRAMS

Hong Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1994

ABSTRACT

COMPILATION-TIME DATA DECOMPOSITION OPTIMIZATION FOR DATA PARALLEL PROGRAMS

By

Hong Xu

Data decomposition is critical to the performance of data parallel programs on scalable parallel computers. A data decomposition model can be viewed as a two-level mapping of array elements to abstract processors. Data alignment determines which array elements are aligned relative to one another, and data distribution resolves how the group of aligned arrays is distributed onto the processors. Depending on whether the alignment relationship holds across dimensions or within a dimension, alignment is classified into base alignment and offset alignment.
The purpose of base alignment is to reduce the amount of unstructured communication. In this research, we use the data reference graph model to describe the various reference patterns associated with each array and to resolve conflicting alignment requirements. An efficient spanning tree algorithm addresses the fundamental issues in base alignment. Base alignment is further studied with consideration of optimal expression evaluation and dataflow analysis. Efficient base alignment algorithms are proposed to reduce redundant communication and optimize the RHS expression evaluation. These contributions distinguish this research from related work.

The purpose of offset alignment is to reduce the amount of data shift movement. This thesis models the cost of data shift movement using a piecewise linear function. This cost model solves the accuracy problem in measuring the quantity of data shift movement, a problem left unresolved by other work in this area. Based on this cost model, an optimal post-alignment algorithm is first proposed to go beyond the limitation of the owner-computes rule and minimize the amount of data shift movement after offset alignment is determined. The data reference graph model is used to address the problem of offset alignment and to develop efficient spanning tree algorithms. The RHS expression evaluation and dataflow optimizations are incorporated into the proposed offset alignment algorithms.

The purpose of data distribution is to reduce the impact of data shift movement and increase processor workload balance. Segment distribution is proposed to resolve the conflict between reducing data shift movement and increasing processor workload balance with regard to a particular dimension of the template array. An optimal processor allocation algorithm is introduced to minimize the overall cost of data shift communication across multiple dimensions of the template array. The segment distribution and optimal processor allocation proposed in this thesis provide the best data distribution support for most data parallel programs.

To my parents

ACKNOWLEDGMENTS

I wish to thank my thesis advisor, Prof. Lionel Ni, for his considerable help and guidance with my thesis and my doctoral program. His patience in listening to initial versions of many of these results is especially appreciated. His suggestions and questions have greatly improved both the accuracy and the presentation of this thesis. Without his enthusiasm and encouragement, the completion of this thesis work would be impossible.

I would like to thank Prof. Tien Yien Li, Prof. Abdol Esfahanian, Prof. Philip McKinley, and Prof. Diane Rover, the members of my Ph.D. guidance committee, for their guidance and support during my thesis work. I would particularly like to express my appreciation to Prof. Diane Rover for her careful reading of this dissertation and her many useful comments.

Special thanks go to my parents for constant encouragement and support for everything I do. Lastly and most importantly, I would like to thank my wife, Yiru, who makes everything I do worthwhile.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Scalable Parallel Computer Architectures
    1.1.1 The NUMA Multiprocessors
    1.1.2 The Message-Passing Based Multiprocessors
    1.1.3 Workstation Clusters
  1.2 Data Parallel Programming Model
  1.3 Motivation and Problem Definition
  1.4 Objectives and Scope of Research
  1.5 Thesis Outline
2 Data Decomposition Overview
  2.1 Loop Characteristics
    2.1.1 Parallelizable Loop
    2.1.2 Perfectly Nested Loops
    2.1.3 Special-Purpose Loops
  2.2 Interprocessor Communication
  2.3 Processor Workload Balance
  2.4 The Model for Data Decomposition
    2.4.1 Formulation for Data Alignment and Distribution
    2.4.2 Layered Structure for Data Decomposition
    2.4.3 The Base Alignment Phase
    2.4.4 The Offset Alignment Phase
    2.4.5 The Data Distribution Phase
  2.5 Data Re-distribution and Data Re-Alignment
  2.6 Data Flow Analysis
  2.7 Related Work
    2.7.1 Data Parallel Languages
    2.7.2 Preference Graph Model
    2.7.3 Using Communication Cost Estimation
    2.7.4 Dynamic Programming Methods
    2.7.5 Linear Algebra Methods
    2.7.6 Parallelizing Loops with Data Dependence
3 Base Alignment
  3.1 Terminology
  3.2 Base Alignment for Single Reference
    3.2.1 Base Alignment Equation
    3.2.2 A Legitimate Solution of Alignment Matrix
    3.2.3 Solving Base Alignment Equation
  3.3 The Cost of Reorganization Communication
    3.3.1 Reorganization Communication
    3.3.2 The Weighted Cost
  3.4 Spanning-Tree Base Alignment Algorithm
    3.4.1 Data Reference Graph
    3.4.2 Spanning Tree Base Alignment Algorithms
    3.4.3 Experimental Results
  3.5 Optimizing RHS Expression Evaluation
    3.5.1 RHS Expression Evaluation Optimization
    3.5.2 Alignment Graph
    3.5.3 AG Base Alignment Algorithm
    3.5.4 Experimental Results
  3.6 Avoiding Redundant Communication
    3.6.1 Redundant Communication
    3.6.2 Enhanced Alignment Graph
    3.6.3 EAG Base Alignment Algorithm
4 Offset Alignment
  4.1 Offset Alignment for Single Reference
    4.1.1 Offset Alignment Equation
    4.1.2 Multiple Aligned Base Groups
    4.1.3 Calculating Alignment Offset
  4.2 The Cost of Neighboring Communication
    4.2.1 The Basic Cost
    4.2.2 The Weight of the Basic Cost
  4.3 The Impact of Access Offset
    4.3.1 Piecewise Linear Cost Function
    4.3.2 Properties of Piecewise Linear Cost Function
  4.4 Spanning-Tree Offset Alignment Algorithm
    4.4.1 Offset Reference Graph
    4.4.2 Spanning Tree Offset Alignment Algorithms
  4.5 Optimizing RHS Expression Evaluation
    4.5.1 RHS Expression Evaluation Optimization
    4.5.2 Post-Alignment Optimization
    4.5.3 Alignment Graph
    4.5.4 AG-Based Offset Alignment Algorithm
    4.5.5 Performance Comparison
  4.6 Avoiding Redundant Communication
    4.6.1 Redundant Communication
    4.6.2 Enhanced Alignment Graph
    4.6.3 EAG-Based Offset Alignment Algorithm
5 Data Distribution
  5.1 Segment Distribution
    5.1.1 The Limitation of Existing Distribution Types
    5.1.2 Segment Distribution
    5.1.3 Multiple-Variable Density Function
    5.1.4 Experimental Results
  5.2 Virtual Processor Allocation
    5.2.1 Reducing the Overall Neighboring Communication
    5.2.2 Optimal Processor Allocation
    5.2.3 Performance Results
6 Conclusions and Future Research
  6.1 Research Contributions
  6.2 Directions of Future Work
BIBLIOGRAPHY

LIST OF TABLES

2.1 Related work in base alignment analysis
2.2 Related work in offset alignment analysis
2.3 Related work in data distribution analysis
3.1 Values of T, Q1, and Q2 in executing the MICS algorithm for Example 7
3.2 Resolving base alignment for Example 9
3.3 Resolving base alignment for Example 10
4.1 Values of T, Q1, and Q2 in executing the MICS algorithm for Example 13
4.2 Resolving offset alignment for Example 13 by using the AGOA algorithm
4.3 Comparison of the MWST, STOA, and AGOA algorithms using Example 13

LIST OF FIGURES

2.1 Pipeline computation and communication
2.2 The model for data decomposition
2.3 Layered structure for data decomposition process
2.4 Example 1: A NAS benchmark loop
2.5 Base alignment in Example 1
2.6 Example 2: A Purdue benchmark loop
2.7 Base alignment in Example 2
2.8 Example 3: A NAS benchmark loop
2.9 Offset alignment in Example 3
2.10 Data distribution in Example 3
2.11 Example 4: A Heatwave benchmark loop
2.12 Processor allocation in Example 4
2.13 A dynamic programming algorithm for data re-alignment and data re-distribution
3.1 Example 1 in single occurrence statements
3.2 The compactness of the image on T
3.3 Example 5: A Whetstone benchmark loop
3.4 The alignment for A and B in Example 5
3.5 Example 6: Inner product benchmark loop
3.6 Reorganization communication for D_X = (1, -1) and D_Y = (1, 1) in Example 1
3.7 Example 7: A Lapack benchmark loop
3.8 The DRG and base alignment for Example 7
3.9 The minimum-weight induced communication algorithm
3.10 Use the MICS algorithm to resolve base alignment for Example 7
3.11 Comparison of MWST algorithm and MICS algorithm on 16-node nCUBE-2
3.12 Example 8: A Splash benchmark loop
3.13 Optimal evaluation trees for Example 8
3.14 Example 9: An Electric benchmark loop
3.15 Alignment graph for Example 9
3.16 The alignment graph base alignment algorithm
3.17 Comparison of the MWST, MICS, and AGBA algorithms on 16-node nCUBE-2
3.18 Example 10: An Oceanwater benchmark loop
3.19 Enhanced alignment graph for Example 10
3.20 The enhanced alignment graph base alignment algorithm
4.1 Example 11: An Eispack benchmark loop
4.2 Example 12: A Weather-Climate benchmark loop
4.3 Neighboring communication in Example 3
4.4 The sum of c_{A,i} and c_{A,j}
4.5 The sum of c_{A,i} and c_{B,j}
4.6 The minimal-cost pairwise offset alignment algorithm
4.7 Example 13: A Livermore benchmark loop
4.8 The offset reference graph for Example 13
4.9 The offset reference graph for Example 13
4.10 The spanning-tree offset alignment algorithm
4.11 Resolving offset alignment for Example 13
4.12 Example 14: A Livermore Kernel 7 loop
4.13 The optimal evaluation trees for Example 14
4.14 An example of post-alignment optimization
4.15 The AG for Example 13
4.16 The alignment graph offset alignment algorithm
4.17 Example 15: A Dhrystone benchmark loop
4.18 Enhanced alignment graph for Example 15
4.19 The enhanced alignment graph offset alignment algorithm
5.1 Example 16: An Electromagnetic benchmark loop
5.2 Different types of data distribution for Example 16
5.3 Linpack TQL2 benchmark loop
5.4 Some common loop patterns
5.5 An example of no optimal distribution pattern
5.6 Comparison of processor workload balance between block distribution and segment distribution
5.7 Two different processor allocation patterns for Example 4
5.8 A Livermore kernel 18 benchmark loop
5.9 Comparison of different processor allocation strategies on a 64-node nCUBE-2

CHAPTER 1

Introduction

Sequential computers are approaching a fundamental physical limit, the speed of light, on their potential computational power. Such a limit cannot satisfy the computational power requirements of the so-called grand-challenge problems, including three-dimensional fluid flow calculations, real-time simulations of complex systems, and intelligent robots, which require over 1,000 times the computing power of the maximum-power uniprocessors. It has been widely recognized that parallel computing represents the only plausible way to continue to increase the computational power available to such grand-challenge applications.
Such parallel computers, also known as scalable parallel computers (SPCs), are designed to offer corresponding increases in the processing capability of the system, memory bandwidth, and total interprocessor communication bandwidth as the number of processors and the number of memory modules are increased.

However, programming on SPCs is much more complicated than programming on uniprocessor computers. The naive approach of exposing system architecture features directly to programmers has proved to be time-consuming, tedious, and error-prone due to the difficulty of compromising among the various conflicting requirements arising from concurrent processing and distributed data allocation. As a result, an easy-to-use and highly efficient programming model is critical to the success of SPCs. The data-parallel programming model has been accepted as one of the most efficient programming models provided for SPCs. The basic idea of a data-parallel program is to decompose the global data objects across the processors, each of which executes the same program structure on the data objects it owns. Based on this single-program-multiple-data (SPMD) mode of computation, data decomposition determines both the process scheduling and the interprocessor communication. Selecting an efficient data decomposition is one of the most important intellectual steps in developing quality data-parallel programs.

The purpose of this dissertation research is to make an in-depth study of optimizing data decomposition for data-parallel programs. The proposed algorithmic results provide a framework for data decomposition optimization used by parallelizing compilers in optimizing user-written data-parallel programs at compilation time. The theoretical results obtained in this thesis are derived from the fundamental architecture features shared by existing SPC platforms. It is not the intention of this thesis to address specific issues involved in a particular machine architecture. Instead, our proposed framework takes a machine-independent approach as the focus of our study.

In this introductory chapter, we provide a brief review of the fundamental features of SPC architectures, present an overview of the data-parallel programming model and its implementations, and describe the objectives and scope of this research.

1.1 Scalable Parallel Computer Architectures

There are two major classes of parallel computers: shared-memory multiprocessors and message-passing multiprocessors. Differing in how the memory is shared and distributed among the processors, shared-memory multiprocessors can be further classified into uniform-memory-access (UMA) multiprocessors and non-uniform-memory-access (NUMA) multiprocessors. In UMA multiprocessors, all processors have equal access time to all memory words. This feature means that performance cannot benefit from data locality with regard to data allocation in the memory. Moreover, limited by current technology and cost, the delayed memory access time due to the added interconnection network significantly degrades overall performance as the number of processors and memory modules increases. For this reason, it is believed that UMA multiprocessors are not a good candidate for SPCs.

1.1.1 The NUMA Multiprocessors

Typically in NUMA multiprocessors, each processor has a local memory. The collection of all local memories forms a global address space accessible by all processors.
However, the memory access time varies with the location of the memory word. It is quicker for a processor to access its own local memory. Access to remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 [1] is an example of a commercial machine using the NUMA architecture. The TC-2000 can be configured with up to 512 M88100 processors interconnected by a multistage cube network. In the TC-2000, a remote memory access is about five times as expensive as a local memory access. Besides the TC-2000, the Cray T3D [2] and Convex Exemplar [3] are two other important commercial machines using the NUMA architecture.

1.1.2 The Message-Passing Based Multiprocessors

Distributed-memory multiprocessors are organized as ensembles of nodes, each of which is a programmable computer with its own processor, local memory, and other supporting devices. As the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase. Nodes communicate with each other by passing messages through the interconnection network. For this reason, distributed-memory multiprocessors are also known as message-passing based multiprocessors. A novel switching technique, known as wormhole routing [4], uses a cut-through approach to reduce the impact of the physical distance between two communicating nodes. Commercial machines characterized as message-passing based multiprocessors include the Intel Paragon (successor to the Touchstone DELTA [5]), the nCUBE nCUBE-2 [6], the Meiko CS-2 [7], and the TMC CM-5 [8]. In those systems, however, the communication latency is orders of magnitude more expensive than a local memory access.

1.1.3 Workstation Clusters

Recently, the evolution of fast LAN-connected workstation clusters has created yet another kind of SPC. High-performance workstation clusters interconnected through high-speed switches have been advocated in place of special-purpose multicomputers. The IBM SP-1 [9] development has already moved in this direction. In the SP-1 configuration, a collection of IBM RS/6000s is interconnected through the IBM high-performance Vulcan switch [10]. Though great efforts are being made to reduce the communication overhead involved in operating systems and communication protocols [11, 12], message transmission is still much more expensive than local memory access in workstation clusters.

Overall, the hierarchy of local memory and remote memory with a significant difference in access latency characterizes the fundamental feature of current SPCs. Exploiting data locality is critical to the performance of message-passing based multicomputers, NUMA multicomputers, and workstation clusters. The data objects referenced by each processor should be arranged to reside in the local memory of that processor in order to avoid the communication overhead added by remote data access. The importance of data locality motivates the research of this thesis. It should be emphasized that it is not the intent of this thesis to consider the detailed implementation of remote data access, such as remote memory reference through the crossbar in NUMA multicomputers or message transmission through the interconnection network in distributed-memory multiprocessors. Our research is based on an abstract memory hierarchy model which characterizes existing SPC architectures.
1.2 Data Parallel Programming Model

The data parallel programming model has been recognized as one of the most successful programming models for SPCs. Data objects are pre-distributed across the local processor memories before a data parallel program is executed. During the execution, each processor runs an identical copy of the original program but writes only to the data objects owned by that processor. As a result, instruction partition in data parallel programs is fully determined by data partition. With a concept similar to the lockstep operations of the SIMD programming model, data parallel programs are easier to write, easy to port, and much more scalable. However, unlike the SIMD programming model, data parallel programs emphasize medium-grain parallelism and synchronization at the subprogram level rather than at the instruction level.

It has been commonly accepted that programming languages supporting a global name space are much easier to use than programming languages supporting separate name spaces, in particular for those scientific application programmers who wish to write their programs in some dialect of Fortran when using SPCs. One of the greatest advantages of the data parallel programming model is its natural support of a global name space at the language level. Many successful data parallel languages, including Fortran 90 [13] and High Performance Fortran (HPF) [14], support a global name space.

Though programmers may give hints about how to decompose the data objects using specific language extensions, the compiler is responsible for arranging and optimizing data allocation across the processor local memories. The type of memory space addressing is hidden from the viewpoint of programmers. If the underlying memory architecture serves a global address space, the reference to a data object owned by a remote processor is simply achieved by a remote memory access. If the underlying system architecture supports only separate address spaces, the reference to a data object owned by a remote processor is implemented by a proper message passing operation inserted at compilation time. Such a data parallel programming model maximizes programmability.

1.3 Motivation and Problem Definition

There are two levels of data parallel activities in data-parallel application programs. First, there is the question of how arrays should be aligned with respect to one another, both within and across array dimensions. This is called the problem mapping [15, 16] induced by the structure of the underlying computation. It represents the minimal requirements for reducing data movement for the program, and is largely independent of any machine considerations. The alignment of arrays in the program depends on the natural fine-grain parallelism defined by individual members of data arrays.

Second, there is the question of how arrays should be distributed onto the SPCs. This is called the machine mapping caused by translating the computation structure onto the finite resources of the machine. Data distribution provides opportunities to reduce data movement, but must also maintain load balance. The distribution of arrays in the program depends on the coarse-grain parallelism defined by the SPCs.

The objective of data decomposition is to minimize data movement across distinct processor local memories and maximize the processor workload balance among all processors. The first effort in reducing data movement is to optimize data alignment. There are two levels of alignment: base alignment and offset alignment. Base alignment determines the alignment across dimensions of various arrays. Offset alignment specifies the alignment within each dimension of various arrays.

Traditionally, data arrays are decomposed dimension by dimension, and base alignment is dimension-based alignment. However, the limitation of such a dimension-based alignment prevents the optimizer from exploring the inherent parallelism available in many programming structures. Recently, this limitation has been removed by partitioning a data array into a family of parallel hyperplanes in the data space defined by the array [17, 18]. Array elements residing on the same hyperplane are allocated to the same processor local memory, so that any unstructured data reference among elements on the same hyperplane is free of interprocessor communication. Based on this new partitioning strategy, this thesis proposes efficient base alignment algorithms to resolve the conflicting communication-free partitions imposed by different computation structures.

The goal of offset alignment is to reduce data shift movement. The previous work [19] on the offset alignment problem focuses on the SIMD programming model on SIMD multiprocessors, where each processor is assigned a single element of each array operand. In the SPMD programming model on SPCs, however, each processor can be assigned a collection of elements of the same data array. As a result, the amount of shifted data with respect to each local processor is the accumulation of all shifted data requested in defining each element owned by the local processor. Since each processor owns a collection of data elements in SPMD programs, the data shifted in defining one local element has a good chance of overlapping another element on the same processor. However, this fact has been ignored by the other research done so far in the area of data decomposition. This thesis establishes a mathematical framework to model the problem of offset alignment for data parallel programs. The theoretical results built upon this framework provide efficient solutions for optimizing offset alignment.

This thesis emphasizes the impact of optimal expression evaluation on the data alignment analysis [20, 21]. The traditional implementation of data parallel programs follows the owner-computes rule, in which all the right-hand side (RHS) operands must first be transmitted to the local processor and the evaluation of the RHS then takes place on that local processor. Data movement may be reduced by evaluating an intermediate result on a remote processor rather than on the local processor which owns the left-hand side (LHS) operand. Of course, this approach of optimal expression evaluation is based on the presumption that the associative and commutative properties of the RHS are not violated. The issues of optimizing expression evaluation for the SIMD mode of programming have been addressed in [19]. However, the algorithm in [19] is given only for a single assignment statement, with the assumption that the alignment of each array operand is given. This thesis is the first to study optimal expression evaluation with regard to multiple assignment statements. The algorithmic results given by this thesis take advantage of such optimal expression evaluation in resolving both base alignment and offset alignment.
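To make the contrast concrete, the following small sketch, which is ours and not part of the thesis, counts remote element accesses for one hypothetical assignment A(i) = B(i+1) + C(i+1) + D(i) under the owner-computes rule and under optimal RHS expression evaluation; the statement, the block distribution, and all names are assumptions made only for this illustration.

    # Illustrative sketch (not from the thesis): remote-access counts for the assignment
    # A(i) = B(i+1) + C(i+1) + D(i), 0 <= i < n-1, assuming all four arrays are
    # identically aligned and block-distributed, so element k lives on processor k // block.

    def owner(k, block):
        return k // block

    def remote_accesses(n, block):
        owner_computes = 0
        optimal_eval = 0
        for i in range(n - 1):
            p_lhs = owner(i, block)        # owns A(i) and D(i)
            p_rhs = owner(i + 1, block)    # owns B(i+1) and C(i+1)
            if p_rhs != p_lhs:
                owner_computes += 2        # fetch B(i+1) and C(i+1) separately
                optimal_eval += 1          # form B(i+1)+C(i+1) remotely, ship one value
        return owner_computes, optimal_eval

    print(remote_accesses(n=1024, block=64))   # -> (30, 15)

In this hypothetical setting the partial evaluation on the remote processor halves the shifted traffic at every block boundary; this is exactly the kind of saving the owner-computes rule forgoes and that the thesis pursues systematically across multiple assignment statements.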
In traditional parallelizing compilers, interprocessor communication optimization, including redundant message avoidance, message vectorization, and overlapping communication with computation, is performed after data decomposition analysis. However, it is not true that the profitability of every communication optimization technique is only passively determined by the result of data decomposition. In this thesis, the effect of redundant communication avoidance is considered in the first place, during the decision making for data alignment. A more efficient data alignment solution can be obtained by the alignment algorithms proposed in this thesis because the dataflow analysis eliminates the impact of redundant communication in cost estimation and helps the data alignment analyzer make a more accurate decision. Message vectorization usually reduces the software latency per message, provided that the software latency in a SPC is significant relative to the network latency [22]. Since the emphasis of this thesis is on machine-independent data decomposition, we make no assumption about detailed machine parameters, including the ratio between the software latency and the network latency. Therefore, the message vectorization technique is not discussed in this thesis. For the same reason, we make no assumption about the computation speed relative to the communication speed. Hence, the technique of overlapping communication with computation is left to the back-end compilation optimizer after the data decomposition is performed.

Data distribution is responsible for reducing data shift communication and increasing processor workload balance. The amount of data shift communication can be greatly reduced if elements are distributed to processors in block fashion. On the other hand, limited by the owner-computes rule, the requirement of processor workload balance favors cyclic distribution when the workload is not uniformly distributed among all the LHS elements. The conflict between block and cyclic distribution has become an open issue in the research on data distribution. In this thesis, we propose a segment distribution which minimizes the impact of data shift movement by allocating elements consecutively to processors and balances the processor workload by varying the size of the segments assigned to different processors. An optimal processor allocation algorithm is given to minimize the overall cost of data shift communication across multiple dimensions of the template array. The segment distribution and optimal processor allocation proposed in this thesis provide the best data distribution support for most data parallel programs.

1.4 Objectives and Scope of Research

The results of our data decomposition analysis are presented based on data parallel Fortran languages, which are written in the global name space. The reason we choose data parallel Fortran languages is that scientists wish to use the SPCs in their familiar dialect of sequential Fortran. Parallelizable loops are the major source of parallelism available in data parallel Fortran programs. Many sophisticated loop transformation techniques [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34] have been developed to transform various sequential loops into parallelizable loops in order to explore the inherent parallelism. In this thesis, we assume that the data decomposition analysis takes place after the loop transformation has been performed.
Given an application program, our data decomposition analysis neglects the non-parallelizable subprogram structures and focuses only on the parallelizable subprogram structures, which can be represented by the HPF FORALL structures [14]. We believe that this approach is reasonable and practical for the following reasons:

- DOACROSS loops can be successfully parallelized by pipelining both computation and communication. The corresponding decomposition for arrays referenced in the DOACROSS loops is also straightforward: block distribution on the partitioned dimension.

- Most non-parallelizable subprogram structures in data parallel Fortran programs can be formalized into a few intrinsic functions whose efficient implementation is machine dependent and may resort to special system software or even hardware support.

We do not address the data decomposition issues involved in procedure calls because the techniques available for interprocedural data dependence testing are still at a preliminary stage. Without the support of a mature interprocedural data dependence testing technique, we feel it is extremely difficult to achieve quality data decomposition results.

Our data decomposition framework is based on an abstract model of the SPCs. The basic assumption is that interprocessor communication is much more expensive than local memory access and thus the performance of data parallel programs is dictated by data locality. Only the number of elements involved in remote data references is measured as the cost of interprocessor communication. Detailed machine-dependent parameters, such as the network latency, software latency, and switching techniques, are ignored. In the data distribution analysis, we simply assume that data elements are distributed to virtual processors which are interconnected through a fully connected network. Though the actual architecture of a SPC is not likely to afford a fully connected network topology, a scalable communication library [35, 36, 37, 38] usually can offset the adverse impact of the network topology and achieve efficient data communication. In addition, other machine-dependent optimization issues [39, 40], including run-time support [41], will not be addressed in this thesis.

It should be emphasized that it is not the intention of this thesis to address the compilation techniques [41] involved in generating efficient parallel programs for the SPCs after the data decomposition is well-defined. The data partition, instruction partition, and communication generation have actually been performed in order to obtain the performance results for the benchmark subprogram structures. However, they are treated only as a part of the implementation and will not be further addressed in the rest of this thesis.

1.5 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, an overview of data decomposition for data parallel programs is presented. Different subprogram structures are clarified and different types of interprocessor communication are examined. A layer-structured data decomposition model is used to illustrate the tasks accomplished in each major phase. Strategies and difficulties are discussed for data re-distribution and data re-alignment. Major previous work is classified based on the proposed layer structure model. Chapter 3 addresses the base alignment phase.
Efficient base alignment algorithms are proposed with regard to various concerns, including optimal expression evaluation and redundant communication avoidance. Chapter 4 addresses the offset alignment phase. Efficient offset alignment algorithms are proposed by considering both optimal expression evaluation and redundant communication avoidance. Chapter 5 addresses the data distribution phase. An optimal virtual processor allocation strategy is proposed and a closed form of segment distribution is given to maximize the processor workload balance and minimize the neighboring communication. Experimental results for benchmark program structures are shown in each of Chapters 3, 4, and 5. Finally, Chapter 6 summarizes the major contributions of this research and provides directions for future research.

CHAPTER 2

Data Decomposition Overview

This chapter gives an overview of the data decomposition problem. Different subprogram structures are clarified and different types of interprocessor communication are examined. A layer-structured data decomposition model is proposed to illustrate each major phase in data decomposition. Strategies and difficulties are discussed for data re-distribution and data re-alignment. Influential previous work is classified based on the proposed layer structure model.

In this thesis, we follow the Fortran 90 array specification. In a one-dimensional array, an array subscript is declared as 0 : n-1 : s, where n is the total number of elements and s is the value of the stride. An array subscript always starts with zero. The stride option can be omitted if its value is one. For example, A(0 : n-1) represents A(0 : n-1 : 1). Consecutive array elements A(l), A(l+1), ..., A(r) (l < r) can be represented by the array segment A(l : r). In a multi-dimensional array, the array subscripts declared for the dimensions are separated by commas. For example, an 8 x 10 two-dimensional array A can be represented by A(0 : 7, 0 : 9). A(1, 2) is an element of A(0 : 7, 0 : 9). This notation for array subscripts is also applicable to loop statement declarations.

2.1 Loop Characteristics

Loops are the major source of parallelism in data parallel Fortran programs [42]. What type of parallelism a loop can explore is determined by the pattern of data dependence among different iterations of the loop.

2.1.1 Parallelizable Loop

A loop is a parallelizable loop if a data object defined in one iteration is not used or re-defined in other iterations of the loop [24, 43]. A general form of parallelizable loops can be modeled by the FORALL loop in HPF [14]. A FORALL loop may consist of one or more FORALL assignment statements, or FORALL assignments for short. A FORALL assignment is implemented by first executing the evaluation of the RHS of the assignment statement for all combinations of loop index subscripts, and then assigning to the corresponding elements of the array referenced at the LHS of the assignment statement. A FORALL assignment is free of loop-carried data dependence. Semantically, in a FORALL loop consisting of several FORALL assignments, the execution of the next FORALL assignment will not take place until the execution of the previous FORALL assignment is fully completed.

In a FORALL assignment, the access to a RHS data element owned by a remote processor is achieved by passing a message to the local processor. Since the FORALL assignment is free of loop-carried data dependence, messages sent to the local processor can be combined and transferred prior to the loop execution. As a result, in the execution of a FORALL loop consisting of several FORALL assignments, computation and communication alternate statement by statement. Since it is fully parallelizable, the performance of a FORALL loop is determined only by processor workload balance and interprocessor communication overhead. Therefore, the FORALL loop becomes one of the most important program structures on which data decomposition issues are studied. A survey of previous work with regard to the FORALL loop can be found in Section 2.7.

2.1.2 Perfectly Nested Loops

In many perfectly nested loops with constant distance vectors, the outermost loop cannot be parallelized by any legal loop transformation [29]. This feature makes the coarse-grain parallelism difficult to explore and thus degrades the overall performance on the SPCs due to the large overhead of barrier synchronization at the end of each iteration of the outermost loop. A closer inspection reveals that computation and communication can be pipelined in a perfectly nested loop with constant distance vectors. For example, consider the following four-point difference operation.

    DO i1 = 2, n-1
      DO i2 = 2, n-1
    s1:   X(i1,i2) = (X(i1-1,i2) + X(i1+1,i2) + X(i1,i2-1) + X(i1,i2+1))/4
      END DO
    END DO

Figure 2.1(a) shows the original iteration space, in which an arrow represents the direction of data dependence. Figure 2.1(b) illustrates how this loop can be executed in pipeline fashion. The iterations are assigned to different processors in shaded blocks, and so are the columns of array X. The row segment assigned to each processor is labeled by its lock step in the pipeline. As soon as the computation on the first row is finished, all processors can start executing in parallel. This technique is also known as tiling [44]. The tiled code that corresponds to Figure 2.1(b) is as follows.

    DO i4 = 2, n-1, b
      DO i1 = 2, n-1
        DO i3 = i4, min(n-1, i4+b-1)
    s1:     X(i1,i3) = (X(i1-1,i3) + X(i1+1,i3) + X(i1,i3-1) + X(i1,i3+1))/4
        END DO
      END DO
    END DO

Here b is the length of the row segment assigned to each processor. The original i2 loop is split into two dimensions: the outer i4 loop and the inner i3 loop. Iterations of the i4 loop are spread across processors, while iterations of the i1 and i3 loops reside on the same processor.

    [Figure 2.1. Pipeline computation and communication: (a) original iteration space; (b) pipelining computation and communication.]

An important feature of pipelining is that computation and communication can be overlapped [17]. As shown in Figure 2.1(b), processor p1 can send the result of X(i1, b) to processor p2 while processor p2 is still working on an element X(i1-1, i3), where b <= i3 <= 2b. The pipelining technique can also be applied to perfectly nested loops with irregular and complex dependence constraints using the dependence uniformization methods [45].

2.1.3 Special-Purpose Loops

Special parallel algorithms may be required to parallelize special-purpose operations, such as prefix, suffix, gather, scatter, and other combining operations. Generally speaking, these algorithms are machine-dependent or architecture-dependent. As a result, they are typically supported as library routines, such as intrinsic procedures in HPF, and usually not further optimized at compilation time.
An important issue which the compilation optimizer should deal with is how to efficiently utilize these library routines, in particular when the library routines are designed for various data decomposition patterns. Issues in designing scalable library routines for MPCs can be found in [46].

2.2 Interprocessor Communication

Interprocessor communication occurs when a data object referenced by a local processor resides in a remote memory. As mentioned in [47], the interprocessor communication generated in executing data parallel Fortran programs can be classified into intrinsic communication and residual communication. Intrinsic communication arises from special-purpose computational operations, such as scatter and gather, that require data motion as an integral part of the operation. As addressed in the previous section, minimizing intrinsic communication is a major task of efficient intrinsic library implementation [48, 35, 36, 37, 49, 38, 50, 51] and thus is beyond the scope of compiler optimization.

Residual communication arises from nonlocal data references required in a computation whose operands are not aligned with respect to each other. Typically, residual communication can be further separated into neighboring communication and reorganization communication. Neighboring communication refers to nearest-neighbor shifts of data. Reorganization communication is due to a mismatch in data decomposition, which requires reorganizing the entire data structure. Not all residual communication patterns are equally expensive on SPCs. Neighboring communication can be significantly reduced by distributing data arrays in blocks. On the other hand, reorganization communication is often much more expensive because the data movement pattern is unstructured, such as a transpose, a change of stride, or vector-valued subscripts.

Reducing interprocessor communication by properly allocating data objects to processors is a key issue in the data decomposition analysis.

2.3 Processor Workload Balance

The main reason to employ SPCs is to reduce the overall execution time. In most data parallel scientific programs, parallelizable loops are the major source of parallelism. For concurrent execution of a parallelizable loop, the iterations have to be assigned to processors. This is also known as workload distribution. Ideally, given a fixed amount of overall workload, the workload should be evenly distributed to each processor, so that the overall execution time of the whole system is minimized. Otherwise, the overall execution time is longer whenever one processor sits idle while other processors are still working. Note that a barrier synchronization is typically required between two adjacent subprogram phases. As a result, a processor that finishes its own work earlier than the others has to remain idle rather than start working on the next subprogram phase.

Previous study of processor workload balance has focused on run-time scheduling on shared-memory multiprocessor systems [52, 53, 54, 55, 56, 57, 58, 59]. In data parallel programs, the workload is distributed in such a way that a data object is written only by the local processor which owns it. Therefore, the workload distribution depends on the pattern of data decomposition, which is determined at compilation time. In this thesis, we study static processor workload balance, another key issue in the data decomposition analysis. It has been shown [60] that static processor workload balance is critical to the overall performance even with the support of runtime scheduling.

2.4 The Model for Data Decomposition

    [Figure 2.2. The model for data decomposition: arrays are mapped by data alignment (the problem mapping, machine independent) onto the template; the template is mapped by data distribution (the abstract processor mapping) onto abstract processors; abstract processors are mapped by the physical processor mapping (machine dependent) onto physical processors.]

As illustrated in Figure 2.2, the data decomposition model can be viewed as a two-level mapping of array elements to abstract processors. Array elements are first aligned relative to one another, known as data alignment; the group of aligned arrays is then distributed onto a set of virtual processors, known as data distribution. Data alignment, the problem mapping, represents the requirements of reducing data movement induced by the structure of the underlying computation. Data alignment is achieved by aligning array elements to an abstract index space, known as the template in HPF [14], which is typically represented by a rectilinear space. Data distribution, the virtual-machine mapping, represents the requirement of efficiently allocating computation structures to finite machine resources. Data distribution is attained by allocating template elements onto the rectilinear arrangement of virtual processors. As a result, all array elements which are aligned to the same template element are allocated to the processor to which that particular template element is allocated. The data alignment and distribution phases are machine-independent. The final step, in which the rectilinear arrangement of virtual processors is mapped to the physical processors, is machine-dependent. The issues in physical processor mapping are beyond the scope of this thesis. For simplicity, in the rest of this thesis a processor refers to a virtual processor rather than a physical processor.

2.4.1 Formulation for Data Alignment and Distribution

A mathematical model is presented for data alignment and data distribution. An array of dimension m defines an array space A, an m-dimensional rectangle. Each element in the array is accessed by an integer vector a = (a1, a2, ..., am). In the same way, a template of dimension h defines an array space T, an h-dimensional rectangle. Each element in the template is accessed by an integer vector t = (t1, t2, ..., th).

Definition 2.1 For each index a of an m-dimensional array, the data alignment of the array onto an h-dimensional template is a function δ_A(a) : A -> T, where

    δ_A(a) = D_A a + d_A

D_A is an h x m linear transformation matrix and d_A is a constant vector.

Definition 2.2 D_A is called the alignment matrix of array A. d_A is called the alignment offset of array A.

The rectilinear arrangement of abstract processors is represented by a processor array. The dimension of the processor array should be the same as the dimension of the template. The processor array defines an array space P, an h-dimensional rectangle. Each element in the processor array is accessed by an integer vector p = (p1, p2, ..., ph).

Definition 2.3 For each index t of an h-dimensional template, the data distribution of the template onto an h-dimensional processor array is a function π(t) : T -> P.

The function π may not necessarily be an affine function and varies with different applications.
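As an illustration of Definitions 2.1 through 2.3, the following sketch, which we add here and which is not part of the thesis, evaluates the two-level mapping for one array. The transpose-style alignment matrix, the offset, and the block form chosen for π are assumptions made only for this example; in general π need not take this shape.

    # Illustrative sketch (not from the thesis): the two-level mapping of Definitions
    # 2.1-2.3 for one array.  delta maps an array index to a template index via
    # delta(a) = D_A a + d_A; pi maps a template index to a processor.  D_A, d_A, and
    # the block distribution used for pi are assumptions made only for this example.

    import numpy as np

    D_A = np.array([[0, 1],
                    [1, 0]])              # alignment matrix: exchange the two dimensions
    d_A = np.array([0, 2])                # alignment offset

    def delta(a):
        # Data alignment (Definition 2.1): array index -> template index.
        return D_A @ np.asarray(a) + d_A

    def pi(t, block=(4, 4)):
        # Data distribution (Definition 2.3): template index -> processor index.
        # Here a simple block distribution; pi need not be affine.
        return tuple(ti // b for ti, b in zip(t, block))

    a = (5, 1)                            # array element A(5, 1)
    t = delta(a)                          # template element it is aligned to: [1 7]
    p = pi(t)                             # processor that owns it: (0, 1)
    print(t, p)

The two function compositions mirror the two mapping levels of the model: δ realizes the problem mapping onto the template, and π realizes the machine mapping onto the processor array.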
2.4.2 Layered Structure for Data Decomposition

A typical data decomposition process can be described by the three-phase layered structure shown in Figure 2.3. The base alignment phase determines the alignment matrix D_A in the alignment function δ_A for each array A. The offset alignment phase specifies the alignment offset d_A in the alignment function δ_A for each array A. The base alignment and offset alignment phases accomplish the task of data alignment by mapping array elements to template elements. The distribution phase decides in what fashion the template elements are distributed to the processors. All array elements which are aligned to the same template element are allocated to the same processor to which that particular template element is allocated. As a result, through the bridge of the template, the distribution phase fulfills the requirement of data distribution.

    [Figure 2.3. Layered structure for data decomposition process: various multi-dimensional arrays pass through base alignment; each dimension of the template (the 1st, the 2nd, ..., the h-th) then passes through its own offset alignment; data distribution finally yields the aligned and distributed arrays.]

2.4.3 The Base Alignment Phase

A multi-dimensional array can be treated as a multi-dimensional bounded linear space, called the data space, in which each integer grid point represents an array element. A multi-dimensional array can be partitioned into a family of parallel hyperplanes such that array elements on the same hyperplane are always allocated to the same processor. As a result, the potential overhead of interprocessor communication involved in mutual references between elements in the same hyperplane can be fully avoided.

Consider the NAS benchmark loop in Example 1 (Figure 2.4). Array elements Y(i1, i2) and Y(i2, i1) referenced in FORALL assignment s1 lie on the same off-diagonal. Therefore, FORALL assignment s1 is free of interprocessor communication if array Y is partitioned into a family of off-diagonals, as shown in Figure 2.5(b).

    FORALL (i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s1:  Y(i1, i2) = Y(i2, i1)
    s2:  X(i1, i1+i2) = Y(i1, i2)
    END FORALL

    Figure 2.4. Example 1: A NAS benchmark loop

In Figure 2.5(b), y1 is the index of the column dimension and y2 is the index of the row dimension of array Y. T represents the one-dimensional template array. The size of array Y is declared as 8 x 8 and the size of T is 8. Each array element is represented by a circle. Elements on each off-diagonal (represented by a solid line) are collapsed and mapped (represented by a dashed line) to the same element in the template. Some array elements are not covered by any solid line since they are outside the loop boundary. Vector (1, -1), the direction vector of an off-diagonal, is called the collapse base of array Y. Two Y elements must be collapsed and mapped to the same element in the template if the vector starting at one element and ending at the other is parallel to the collapse base (1, -1). Vector (1, 1), which is orthogonal to the collapse base (1, -1), is called the distribution base of Y. Two Y elements must be mapped to different elements in the template, and thus may be distributed to different processors, if the inner product of the distribution base and the vector constructed by these two Y elements is non-zero.

The alignment matrix is constructed from the distribution bases. In this example, since there is only one distribution base (1, 1), D_Y = (1, 1). Thus, the alignment function δ_Y can be specified as follows:

    t = δ_Y((y1, y2)) = D_Y (y1, y2)^T = (1, 1) (y1, y2)^T = y1 + y2

where Y(y1, y2) is a Y element and T(t) is a template element.
Alignment matrix is constructed by distribution bases. In this example, since there is only one distribution base (1,1), Dy = (1,1). Thus, alignment function by can be specified as follows. III 311 311 t=6y( )=Dy =(l,l) =y1+y2 312 312 312 where Y(y1, 3,12) is a Y element and T(t) is a template element. Base alignment only 24 determines the value of alignment matrices. The value of alignment offsets is decided in the offset alignment phase. In the above 6X and by, we simply assume that dx = 0 and dy = 0. ’ x T 00113936 base (1,-1) distribution base (1.1) C" ‘* ‘“ m r '-*- ---i:'7 J a e: I": if? It? ’/ ff“ ‘ —"‘ .__ - 7 —— — H3 (gait:- in" .1: o .2"; I 1. f // fL" “‘ v— “a —— -~-cJ Q I)” (:5 3.3 If? <1) If} // I] / QT “I I I T “ . r c! a :3 0 o / / / // if“ T- ..._ h v US " m" (a O O. O Y I (1* M " / / / / / / :3 :T 3 )3” O I I“ O O O / / / / / ,/'/ / i O J I o O O C» of 1 (f “ f’ ‘i” ‘i‘ —> y2 (D If f (f, (3" ([1 $ 0 C) I; I? It q, 3'1 o o 0 ti» If g- x O O m I) $ T (If, (T. _ ' , p X 9 Q t r) c. (b rt ( {T 2 o O o o o o q) [ X o o o o o o o It 1 collapse base (1,0) distribution base (0,1) +v] (a) (b) Figure 2.5. Base alignment in Example 1 Generally speaking, the vector represented by each row in an alignment matrix indicates a distribution base. As a result, the rows in the same alignment matrix must be mutually linear independent. On the other hand, the vector which is orthogonal to all rows in a alignment matrix implies that that vector is a collapse base. The distinct collapse bases must also be mutually linear independent. All collapse bases and distribution bases should span the entire data space with regard to each array. 25 Consider the array subscript structure in FORALL assignment 32 (Figure 2.4). Since Y elements are collapsed along the off-diagonal, X elements have to be collapsed in columns so as to make FORALL assignment 32 free of interprocessor communication. Therefore, the collapse base of array X is equal to (1,0), the direction vector of a column. The distribution base of array X is equal to (0,1), which is orthogonal to collapse base (1,0). Therefore, we have Dx = (0,1). Thus, alignment function 6x can be specified as follows. t=6x( $1 )=DX $1 =(o,1) “”1 :33; x2 x2 x2 where X (x1, x2) is an X element and T(t) is a template element. Figure 2.5(a) shows the alignment of array X with regards to the template. In Figure 2.5(a), x1 is the index of the column dimension and x2 is the index of the row dimension of array X. The size of array X is declared as 8 x 8. X elements in the same column (represented by each solid line) are collapsed and mapped (represented by each dash line) to the same element in the template. Some elements are not covered by any solid line since they are out of loop boundary. Note that the X partition in columns and Y partition in off-diagonals are not randomized. They are in fact inducted from the requirement of minimizing inter- processor communication based on the given array subscript structures in Example 1. Such a relationship between the alignment of X and the alignment of Y is called base alignment. Base alignment represents the alignment relationship between various distribution bases of various arrays and the alignment relationship between various collapse bases of various arrays. Depending on the pattern of how array elements are mapped to the template, an array may have more than one distribution base and / or more than one collapse base. Consider the Purdue benchmark loop in Example 2 (Figure 2.6). 
Consider the Purdue benchmark loop in Example 2 (Figure 2.6).

    FORALL(i1 = 0 : n1-1, i2 = 0 : n2-1)
    s1:   W(i1, i2) = Z(i2, i1)
    END FORALL

    Figure 2.6. Example 2: A Purdue benchmark loop

Unlike Example 1, in Example 2 there is no self-reference regarding arrays W and Z. Both W and Z can be partitioned in both the row and column dimensions. Thus, the collapse base of both W and Z degenerates to (0,0). Both W and Z have distribution bases (1,0) and (0,1), which span the entire data space. This implies that there is no requirement that two particular elements in the same array should be mapped to the same template element.

[Figure 2.7. Base alignment in Example 2. Part (a): alignment of array W (distribution bases (1,0) and (0,1)) with the two-dimensional template T; part (b): alignment of array Z (distribution bases (1,0) and (0,1)) with T.]

Figure 2.7 shows the alignment of W and Z. In Figure 2.7, the size of W is declared as 4 x 8 and the size of Z is declared as 8 x 4. Template T is declared as a two-dimensional array whose size is 4 x 8. w1, z1, and t1 represent the column dimension of arrays W, Z, and T, respectively. w2, z2, and t2 represent the row dimension of arrays W, Z, and T, respectively. Figure 2.7(a) shows the alignment of array W: dimension w1 is aligned with dimension t1 and dimension w2 is aligned with dimension t2. Therefore, $D_W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and alignment function $\delta_W$ is specified as follows.

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_W\!\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}\right) = D_W \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \]

where $W(w_1, w_2)$ is a W element and $T(t_1, t_2)$ is a template element. Figure 2.7(b) shows the alignment of array Z: dimension z2 is aligned with dimension t1 and dimension z1 is aligned with dimension t2. Therefore, $D_Z = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and alignment function $\delta_Z$ is specified as follows.

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_Z\!\left(\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\right) = D_Z \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} z_2 \\ z_1 \end{bmatrix} \]

where $Z(z_1, z_2)$ is a Z element and $T(t_1, t_2)$ is a template element.

For arrays each of which has multiple (linearly independent) distribution bases, it is crucial to decide which distribution base of one array should be aligned with which distribution base of the other. In Example 2 (Figure 2.6), in order to make FORALL assignment s1 free of interprocessor communication, distribution base (1,0) (the column dimension) of W is aligned with distribution base (0,1) (the row dimension) of Z, and distribution base (0,1) (the row dimension) of W is aligned with distribution base (1,0) (the column dimension) of Z. Aligned distribution bases in various arrays comprise an aligned-base group. Example 2 has two distinct aligned-base groups. One consists of distribution base (0,1) of W and distribution base (1,0) of Z. The other consists of distribution base (1,0) of W and distribution base (0,1) of Z. By the definition of the alignment function, each aligned-base group can be uniquely identified by a dimension in the template array.
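As with Example 1, the alignment of Example 2 can be checked directly. The short sketch below (illustrative only; the 4 x 8 bounds are assumptions) verifies that $\delta_W$ and $\delta_Z$ as defined above send W(i1,i2) and Z(i2,i1) to the same two-dimensional template element.

    # Sketch: base alignment of Example 2, D_W = identity, D_Z = row/column swap.
    def delta_W(w1, w2):
        return (w1, w2)

    def delta_Z(z1, z2):
        return (z2, z1)

    n1, n2 = 4, 8
    for i1 in range(n1):
        for i2 in range(n2):
            # s1: W(i1,i2) = Z(i2,i1)
            assert delta_W(i1, i2) == delta_Z(i2, i1)
    print("s1 in Example 2 is free of interprocessor communication")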
Different distribution bases of the same array can never be aligned with each other and can never be included in the same aligned-base group.

2.4.4 The Offset Alignment Phase

As described in the previous section, an array can be partitioned into a family of parallel hyperplanes in each of which the array elements are collapsed and mapped to the same element in the template. However, base alignment only determines the constituent base(s), known as distribution base(s) and collapse base(s), of such a family of parallel hyperplanes. Offset alignment is responsible for the displacement of each hyperplane with regard to the template. In other words, while base alignment determines $D_A$, offset alignment determines $d_A$ in alignment function $\delta_A$ for each array A.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-3-i1)
    s1:   Y(i1, i2) = Y(i2, i1)
    s2:   X(i1, i1 + i2 + 2) = Y(i1, i2)
    END FORALL

    Figure 2.8. Example 3: A NAS benchmark loop

Consider the NAS benchmark loop in Example 3 (Figure 2.8).

[Figure 2.9. Offset alignment in Example 3. Array Y is partitioned in off-diagonals with alignment offset 0, and array X is partitioned in columns with alignment offset 2; both are mapped to the one-dimensional template T.]

Except for the access offset 2 in the array subscript X(i1, i1 + i2 + 2) (Figure 2.8), Example 3 is similar to Example 1. For this reason, the base alignment analysis of Example 3 is identical to that of Example 1: $D_X = (0,1)$ and $D_Y = (1,1)$. Figure 2.9 shows the alignment for arrays X and Y in Example 3. In Figure 2.9, Y is partitioned in off-diagonals and X is partitioned in columns. Unlike Example 1 (Figure 2.5), however, in Example 3 (Figure 2.8) the (k+2)-th column in X should be aligned with the k-th off-diagonal in Y in order to make FORALL assignment s2 free of interprocessor communication. Such an alignment relationship between the column displacement in X and the off-diagonal displacement in Y is called offset alignment. In Example 3, we have $d_X = 2$ and $d_Y = 0$. Alignment function $\delta_X$ can be written as follows:

\[ t = \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + d_X = x_2 + 2 \]

where $X(x_1, x_2)$ is an X element and $T(t)$ is a template element. Alignment function $\delta_Y$ can be written as follows:

\[ t = \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} + d_Y = y_1 + y_2 \]

where $Y(y_1, y_2)$ is a Y element and $T(t)$ is a template element.

Generally speaking, the value of an alignment offset is represented by a constant vector. Offset alignment should determine each component of such an alignment offset vector. In Chapter 4, we will show that the components of the alignment offset vector are independent of one another. Each component can be decided based on a distinct aligned-base group.
2.4.5 The Data Distribution Phase

The data distribution phase determines how to map the template elements to virtual processors. Since both the template and the virtual processor arrangement are represented by multi-dimensional arrays, there are two major decisions to make regarding each dimension of the template: what is the distribution type, and how many virtual processors should be allocated.

We first consider the case in which both the template array and the processor array are one-dimensional. There are numerous ways to distribute template elements across processors. Among them, the cyclic distribution and the block distribution are the most popular. In cyclic distribution, template elements are assigned to processors in a round-robin fashion. In block distribution, template elements are contiguously allocated to each processor.

Figure 2.10 shows the data distribution phase of Example 3. In Figure 2.10, arrays X and Y are declared as 8 x 8. There are two processors available in total, denoted as P(0) and P(1). The template array is distributed cyclically.

[Figure 2.10. Data distribution in Example 3.]

As a result, processor P(0) owns the even-numbered X columns and the even-numbered Y off-diagonals. Processor P(1) owns the odd-numbered X columns and the odd-numbered Y off-diagonals. The data distribution function, $\pi$, can be written as follows:

\[ p = \pi(t) = t \bmod 2 \]

where $P(p)$ is an element in the processor array and $T(t)$ is an element in the template array.

Note that the number of LHS array elements mapped to each template element varies. For example, two Y elements are mapped to T(1) but five Y elements are mapped to T(4). Since the workload assigned to each processor depends on the number of LHS elements owned by that processor, the workload imposed by each template element can vary. For this reason, the cyclic distribution increases the processor workload balance. Another reason for using cyclic distribution is that interprocessor communication has already been fully eliminated by the alignment function. Otherwise, significant communication cost could occur if cyclic distribution were employed.

The template array and processor array can be multi-dimensional. The task of processor allocation is to determine the layout of the multi-dimensional processor array.

    FORALL(i1 = 1 : n1-3, i2 = 1 : n2-3)
    s1:   W(i1, i2) = Z(i1, i2) + Z(i1+1, i2) + Z(i1-1, i2)
                    + Z(i1, i2+1) + Z(i1, i2-1)
    s2:   Z(i1, i2) = W(i1, i2)
    END FORALL

    Figure 2.11. Example 4: A Heatwave benchmark loop

Consider the Heatwave benchmark loop in Example 4 (Figure 2.11). By the construction of the array subscripts in FORALL assignment s1, there is no interprocessor communication as long as W(i1,i2) and Z(i1,i2) are mapped to the same template element. On the other hand, however, since both Z(i1+1,i2) and Z(i1,i2) are referenced, assignment s1 is free of communication only if Z elements are collapsed in the row dimension. Similarly, since both Z(i1,i2+1) and Z(i1,i2) are referenced, assignment s1 is free of communication only if Z elements are collapsed in the column dimension. This conflict implies that interprocessor communication in executing assignment s1 cannot be avoided no matter how W and Z are aligned. For this reason, alignment functions $\delta_W$ and $\delta_Z$ are simply defined as follows:

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_W\!\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}\right) = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_Z\!\left(\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\right) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \]

where $W(w_1, w_2)$ is a W element, $Z(z_1, z_2)$ is a Z element, and $T(t_1, t_2)$ is a template element.

[Figure: distribution of W and Z onto a two-dimensional processor array P(0,0), P(0,1), P(0,2), ...]
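The two distribution types discussed above can be written as simple index maps from a one-dimensional template to processors. The following sketch is illustrative only (a template of n elements and p processors are assumed); the cyclic map is exactly the $\pi(t) = t \bmod 2$ used for Example 3.

    # Sketch: cyclic and block distribution of a one-dimensional template.
    def cyclic(t, p):
        # round-robin assignment of template element t to one of p processors
        return t % p

    def block(t, p, n):
        # contiguous chunks of ceil(n/p) template elements per processor
        chunk = -(-n // p)
        return t // chunk

    n, p = 8, 2
    print([cyclic(t, p) for t in range(n)])    # [0, 1, 0, 1, 0, 1, 0, 1]
    print([block(t, p, n) for t in range(n)])  # [0, 0, 0, 0, 1, 1, 1, 1]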
[...] where A is an array variable and k is the statement number. Let $F_{A,k}$ be the access matrix for the instance of array A referenced in FORALL assignment $s_k$. In Example 1 (Figure 3.1), we have

\[ F_{Y,3} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad F_{Y,4} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad F_{Y,2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad F_{X,2} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \]

In order to make the base alignment results transparent to the single occurrence transformation, the access function of a temporary variable is forced to be the same as that of the data array instance which the temporary variable replaces. In Example 1 (Figure 3.1), $F_{TT,4} = F_{TT,3} = F_{Y,3} = \left[\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. Moreover, the alignment function of the temporary variable is also forced to be the same as that of the data array instance which the temporary variable replaces. For temporary variable TT in Example 1 (Figure 3.1), $\delta_{TT}$ is always equal to $\delta_Y$. It is assumed that any program structure used in this chapter has been pre-processed by the necessary single occurrence transformation and that all base alignment theorems and algorithms are stated based on the denotation of single occurrence statements.

Definition 3.2 If array A is referenced on the LHS and array B is referenced on the RHS in statement $s_k$, the read/write relationship between arrays A and B is called a reference and is represented by the symbolic form "A <- B@s_k".

In Example 1 (Figure 3.1), we have references "TT <- Y@s3", "Y <- TT@s4", and "X <- Y@s2". The notation "A <- B@s_k" is not ambiguous because A and B each have only one type of instance referenced in statement $s_k$ by the assumption of the single occurrence statement.

3.2 Base Alignment for Single Reference

In this section, we use linear space theory to model the inter-relationship between base alignment and the access function with regard to a single reference.

3.2.1 Base Alignment Equation

Consider reference "X <- Y@s2" in Example 1 (Figure 3.1). In iteration $(i_1, i_2)$, assume that $X(x_1, x_2)$ is referenced on the LHS and $Y(y_1, y_2)$ is referenced on the RHS. Using the access function specification, we have

\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.1) \]

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.2) \]

Multiplying both sides of Equation 3.1 by $D_X$ and both sides of Equation 3.2 by $D_Y$, we have

\[ D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = D_X F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.3) \]

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = D_Y F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.4) \]

The purpose of data alignment is to eliminate interprocessor communication. In this case, interprocessor communication is guaranteed to be avoided if both $X(x_1,x_2)$ and $Y(y_1,y_2)$ can be aligned to the same template element. In other words,

\[ \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) \]

Therefore, we search for a solution of $D_X$ and $D_Y$ such that

\[ D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \]

Substituting $D_X (x_1, x_2)^T$ by Equation 3.3 and $D_Y (y_1, y_2)^T$ by Equation 3.4, we get

\[ D_X F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = D_Y F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.5) \]

Equation 3.5 can be re-written as follows:

\[ (D_X F_{X,2} - D_Y F_{Y,2}) \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = 0 \]

Since $(i_1, i_2)$ represents any iteration in the iteration space, Equation 3.5 holds if and only if

\[ D_X F_{X,2} - D_Y F_{Y,2} = 0 \qquad (3.6) \]

Equation 3.6 can be easily extended to the following property for any multi-dimensional arrays A and B.

Proposition 1 Reference "A <- B@s_k" is free of interprocessor communication if the following equation holds.

\[ D_A F_{A,k} - D_B F_{B,k} = 0 \quad \text{or} \quad D_A F_{A,k} = D_B F_{B,k} \qquad (3.7) \]

Alignment matrices $D_A$ and $D_B$ are compatible with respect to statement $s_k$ if Equation 3.7 holds. Otherwise, $D_A$ and $D_B$ are called incompatible with respect to that statement.
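The compatibility test of Proposition 1 is a plain matrix identity. The sketch below (an illustration, not the thesis implementation) checks it with NumPy for the reference "X <- Y@s2" of Example 1.

    # Sketch: D_A and D_B are compatible w.r.t. s_k iff D_A F_{A,k} == D_B F_{B,k}.
    import numpy as np

    def compatible(D_A, F_Ak, D_B, F_Bk):
        return np.array_equal(np.dot(D_A, F_Ak), np.dot(D_B, F_Bk))

    F_X2 = np.array([[1, 0], [1, 1]])   # access matrix of X(i1, i1+i2)
    F_Y2 = np.array([[1, 0], [0, 1]])   # access matrix of Y(i1, i2)
    print(compatible(np.array([0, 1]), F_X2, np.array([1, 1]), F_Y2))   # True
    print(compatible(np.array([1, 0]), F_X2, np.array([1, 1]), F_Y2))   # False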
Consider reference "Y <- TT@s4" in Example 1 (Figure 3.1). Using Proposition 1, we get

\[ D_Y F_{Y,4} = D_{TT} F_{TT,4} \qquad (3.8) \]

Since TT is the temporary variable used to replace $Y(i_2, i_1)$ in statement s3, it is always true that $D_{TT} = D_Y$ and $F_{TT,4} = F_{Y,3}$. Substituting the values of $D_{TT}$ and $F_{TT,4}$ into Equation 3.8, we get $D_Y F_{Y,4} = D_Y F_{Y,3}$, which can be re-written as $D_Y (F_{Y,4} - F_{Y,3}) = 0$. On substitution of $F_{Y,4}$ and $F_{Y,3}$, we have

\[ D_Y \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \right) = 0 \]

\[ D_Y \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = 0 \qquad (3.9) \]

which yields $D_Y = (1,1)$. Substituting the result of $D_Y$ into alignment function $\delta_Y$, we have

\[ \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = y_1 + y_2 \]

Alignment function $\delta_Y$ implies that all elements on an off-diagonal must be collapsed to the same element in the template. This guarantees that $Y(i_1,i_2)$ and $Y(i_2,i_1)$ must be assigned to the same processor, and thus assignment s1 is free of reorganization communication. Substituting the values of $D_Y$, $F_{Y,2}$, and $F_{X,2}$ into Equation 3.6, we have $D_X = (0,1)$ and the alignment function of X can be specified as $\delta_X((x_1, x_2)) = x_2$. This implies that, in order to make assignment s2 free of reorganization communication, X must be partitioned in columns if Y is partitioned in off-diagonals.

3.2.2 A Legitimate Solution of the Alignment Matrix

Note that there are infinitely many solutions of $D_Y$ which satisfy Equation 3.9. The reason we chose $D_Y = (1,1)$ is that the image projected by $\delta_Y$ from $\mathcal{Y}$, the data space of Y, to $\mathcal{T}$, the data space of T, should be compact.

Definition 3.3 A subspace is compact if and only if the subspace includes every integer grid point on any line whose two endpoints are included in the subspace.

An image on the template space $\mathcal{T}$ is compact if and only if the image includes every integer grid point on any line whose two endpoints are included in the image. The concept of compactness is explained in Figure 3.2. In Figure 3.2, the template elements included in the image projected by the alignment function are shown in dark. Figure 3.2(a) shows the compact image projected by $\delta_Y$ defined by $D_Y = (1,1)$. Figure 3.2(b) shows the non-compact image projected by $\delta_Y$ defined by $D_Y = (2,2)$. In Figure 3.2(b), T(3) is an integer grid point on the line incident with T(0) and T(7). However, T(3) is not included in the image, while T(0) and T(7) are.

[Figure 3.2. The compactness of the image on T. Part (a): $D_Y = (1,1)$ and the image on T is compact. Part (b): $D_Y = (2,2)$ and the image on T is not compact.]

Given an n x m matrix A, let $A_r$ be a square submatrix of A such that its determinant $|A_r| \neq 0$ and r is the rank of the original matrix A. The value of $|A_r|$ is defined to be the determinant of the rectangular n x m matrix A, denoted as $|A|$. Array elements which are defined or used in a FORALL structure are called effective elements. The effective domain of an array is composed of all effective elements in the array. Generally speaking, the solution of an alignment matrix $D_A$ is legitimate if the effective domain of A is compact and $|D_A| = 1$.
In this case, the image on the template space projected by $\delta_A$ is guaranteed to be compact. In Example 1 (Figure 2.5), the effective domain of array Y consists of the whole upper triangle and thus is compact. Therefore, $D_Y = (1,1)$ is legitimate since $|D_Y| = 1$.

However, the determinant of a legitimate $D_A$ can be any rational number if the effective domain of array A is not compact. Under such a circumstance, the stride in A's effective domain has to be considered in order to find a correct solution of $D_A$. This can be illustrated using the Whetstone benchmark loop shown in Example 5 (Figure 3.3).

    FORALL(i1 = 0 : n/2-1, i2 = 0 : n/3-1)
    s1:   B(2i1, 3i2) = A(i1, 2i2)
    s2:   A(i1, 2i2) = B(i1, 3i2)
    END FORALL

    Figure 3.3. Example 5: A Whetstone benchmark loop

For Example 5, by Proposition 1, we get

\[ D_B F_{B,1} = D_A F_{A,1} \qquad D_A F_{A,2} = D_B F_{B,2} \]

where $F_{B,1} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$, $F_{B,2} = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}$, and $F_{A,1} = F_{A,2} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. Since $F_{A,1}$ and $F_{A,2}$ are invertible, the above equations can be re-written as follows:

\[ D_B F_{B,1} F_{A,1}^{-1} = D_A \qquad (3.10) \]

\[ D_A = D_B F_{B,2} F_{A,2}^{-1} \qquad (3.11) \]

By substitution, we have $D_B F_{B,1} F_{A,1}^{-1} = D_B F_{B,2} F_{A,2}^{-1}$, which can be re-written as follows:

\[ D_B (F_{B,1} F_{A,1}^{-1} - F_{B,2} F_{A,2}^{-1}) = 0 \]

\[ D_B \left( \begin{bmatrix} 2 & 0 \\ 0 & \tfrac{3}{2} \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & \tfrac{3}{2} \end{bmatrix} \right) = 0 \]

\[ D_B \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = 0 \qquad (3.12) \]

Any vector (0, h) could be a solution of Equation 3.12. However, there is one legitimate solution of $D_B$ which makes the image on the template space projected by alignment function $\delta_B$ compact. Since B(2i1, 3i2) and B(i1, 3i2) are referenced in Example 5 (Figure 3.3), the stride in B's effective domain with regard to the row dimension is 3. Therefore, we choose $D_B = (0, \tfrac{1}{3})$. Substituting the value of $D_B$ into Equation 3.11, we have $D_A = (0, \tfrac{1}{2})$. Therefore, the alignment functions of A and B can be specified as follows.

\[ \delta_A\!\left(\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}\right) = \tfrac{1}{2} a_2 \qquad \delta_B\!\left(\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\right) = \tfrac{1}{3} b_2 \]

[Figure 3.4. The alignment for A and B in Example 5.]

Figure 3.4 shows the alignment of A and B in Example 5. In Figure 3.4, the effective elements of A and B are covered by solid lines. The effective domain of A only contains elements $(a_1, a_2)$ such that $a_2$ is a multiple of 2. The effective domain of B only contains elements $(b_1, b_2)$ such that $b_2$ is a multiple of 3. Though neither A's effective domain nor B's effective domain is compact, the image on the template space projected by $\delta_A$ and $\delta_B$ is compact.
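The direction constrained by Equation 3.12 can also be obtained numerically: $D_B$ must lie in the left null space of $M = F_{B,1} F_{A,1}^{-1} - F_{B,2} F_{A,2}^{-1}$. The following sketch (illustrative; the SVD route is one of several ways to compute a null space) reproduces that constraint for Example 5.

    # Sketch: compute the admissible direction of D_B for Example 5.
    import numpy as np

    F_B1 = np.array([[2.0, 0.0], [0.0, 3.0]])   # access matrix of B(2*i1, 3*i2) in s1
    F_B2 = np.array([[1.0, 0.0], [0.0, 3.0]])   # access matrix of B(i1, 3*i2)   in s2
    F_A  = np.array([[1.0, 0.0], [0.0, 2.0]])   # access matrix of A(i1, 2*i2)   in s1 and s2

    M = F_B1 @ np.linalg.inv(F_A) - F_B2 @ np.linalg.inv(F_A)
    print(M)                                     # [[1. 0.], [0. 0.]]

    # Left null space of M = null space of M transposed.
    u, s, vt = np.linalg.svd(M.T)
    print(vt[np.isclose(s, 0.0)])                # spans multiples of (0, 1)
    # The thesis then scales this direction by the stride of B's effective domain (3),
    # giving the legitimate solution D_B = (0, 1/3).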
3.2.3 Solving the Base Alignment Equation

If access matrix $F_{A,k}$ is non-singular, Equation 3.7 can be easily transformed into $D_A = D_B F_{B,k} F_{A,k}^{-1}$, and the relationship between $D_A$ and $D_B$ is straightforward. However, solving Equation 3.7 may not be simple when both $F_{A,k}$ and $F_{B,k}$ are singular. For singular matrices $F_{A,k}$ and $F_{B,k}$, let $F_{A,k}^R$ be the right inverse matrix of $F_{A,k}$ and $F_{B,k}^R$ be the right inverse matrix of $F_{B,k}$. Then Equation 3.7 is equivalent to the following two equations

\[ D_A I_{r_A} = D_B F_{B,k} F_{A,k}^R \qquad D_A F_{A,k} F_{B,k}^R = D_B I_{r_B} \]

where $r_A$ is the rank of $D_A$ and $r_B$ is the rank of $D_B$. The method to find a right inverse matrix of a given matrix can be found in [78]. Example 6 (Figure 3.5) is used to illustrate the basic idea.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1)
    s1:   ZZ(i1, i2) = 0
          DO i3 = 0 : n-1
    s2:     ZZ(i1, i2) = ZZ(i1, i2) + XX(i1, i3) * YY(i3, i2)
          END DO
    END FORALL

    Figure 3.5. Example 6: Inner product benchmark loop

In Example 6, the inner DO loop is sequential and thus will not be distributed across the processors. The access matrices are

\[ F_{ZZ,2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad F_{XX,2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \text{and} \quad F_{YY,2} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \]

Consider the equation

\[ D_{ZZ} F_{ZZ,2} = D_{XX} F_{XX,2} \]

for reference "ZZ <- XX@s2". Since $F_{ZZ,2}$ and $F_{XX,2}$ are singular matrices, the above equation is equivalent to

\[ D_{ZZ} I_{r_{ZZ}} = D_{XX} F_{XX,2} F_{ZZ,2}^R \qquad D_{ZZ} F_{ZZ,2} F_{XX,2}^R = D_{XX} I_{r_{XX}} \]

where $r_{ZZ} = r_{XX} = 2$, $F_{XX,2}^R = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}$, and $F_{ZZ,2}^R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$. On substitution of those values, we have

\[ D_{ZZ} = D_{XX} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \qquad D_{ZZ} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = D_{XX} \]

Therefore, the solutions of both $D_{ZZ}$ and $D_{XX}$ have to be of the form (h, 0). A similar approach is used in solving the equation

\[ D_{ZZ} F_{ZZ,2} = D_{YY} F_{YY,2} \]

for reference "ZZ <- YY@s2", since $F_{ZZ,2}$ and $F_{YY,2}$ are singular matrices. The solution specified by this equation requires both $D_{ZZ}$ and $D_{YY}$ to be of the form (0, g). The requirements on $D_{ZZ}$ imposed by the two equations conflict with each other, which implies that interprocessor communication cannot be avoided.

3.3 The Cost of Reorganization Communication

Reorganization communication occurs if Proposition 1 does not hold for a given reference. This section studies the cost of reorganization communication.

3.3.1 Reorganization Communication

Consider reference "X <- Y@s2" in Example 1 (Figure 2.4). In iteration $(i_1, i_2)$, assume that $X(x_1, x_2)$ is referenced on the LHS and $Y(y_1, y_2)$ is referenced on the RHS. Using the access function specification, we have

\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \]

Since $F_{X,2}$ and $F_{Y,2}$ are invertible, the above equations can be rewritten as

\[ \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = F_{Y,2}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \]

where $F_{X,2}^{-1}$ is the inverse of $F_{X,2}$ and $F_{Y,2}^{-1}$ is the inverse of $F_{Y,2}$. By substitution, we get

\[ F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{Y,2}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \qquad \text{i.e.} \qquad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

Multiplying both sides of the above equation by $D_Y$, we get

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = D_Y F_{Y,2} F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad (3.13) \]

Assume that, given values of $D_Y$ and $D_X$, Proposition 1 does not hold for reference "X <- Y@s2". In other words, $D_X F_{X,2} \neq D_Y F_{Y,2}$. Since $F_{X,2}$ is invertible,

\[ D_X \neq D_Y F_{Y,2} F_{X,2}^{-1} \qquad (3.14) \]

On substitution of $D_Y F_{Y,2} F_{X,2}^{-1}$ from Inequality 3.14, Equation 3.13 can be rewritten as

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \neq D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

In other words, $\delta_Y((y_1, y_2)) \neq \delta_X((x_1, x_2))$. This means that array elements $(y_1, y_2)$ and $(x_1, x_2)$ are never aligned to the same element in the template. Moreover, by Equation 3.13, we have

\[ \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) - \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} - D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = (D_Y F_{Y,2} F_{X,2}^{-1} - D_X) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

By Inequality 3.14, the difference between $D_Y (y_1, y_2)^T$ and $D_X (x_1, x_2)^T$ is an affine function of $(x_1, x_2)$. However, if template elements are distributed in regular patterns, such as block or cyclic, $\delta_Y((y_1, y_2))$ and $\delta_X((x_1, x_2))$ are unlikely to be mapped onto the same processor. This implies that to write an X element, the local processor almost always requires a remote access to the RHS Y element.

Figure 3.6 is used to illustrate the idea. In Figure 3.6, arrays X and Y (in Example 1) are declared as 8 x 8. The template array is one-dimensional and has eight elements. There are four processors, denoted as P(0:3). The template elements are distributed onto these four processors in block. Since $D_X = (1,-1)$ and $D_Y = (1,1)$, X is distributed along the diagonal and Y is distributed along the anti-diagonal.
[Figure 3.6. Reorganization communication for $D_X = (1,-1)$ and $D_Y = (1,1)$ in Example 1.]

Equation 3.7 does not hold for reference "X <- Y@s2". To write an X element on the line $x_1 - x_2 = 1$ (highlighted in dark), a distinct RHS Y element on the line $y_1 = 1$ (highlighted in dark) is accessed. However, all X elements on the line $x_1 - x_2 = 1$ are owned by processor P(0), while six out of seven elements on the line $y_1 = 1$ are owned by the other three processors. Based on the above observation, we have the following proposition.

Proposition 2 Assume that array elements are evenly distributed across the processors. For reference "A <- B@s_k", the cost of reorganization communication is $\frac{n}{p}$ if $D_A F_{A,k} \neq D_B F_{B,k}$, where n is the total number of elements involved in reference "A <- B@s_k" and p is the total number of available processors.

Since p is the total number of available processors and is a fixed constant in the base decomposition analysis, n, the total number of elements involved in reference "A <- B@s_k", determines the cost of reorganization communication. In the rest of this chapter, the cost of reorganization communication is simply measured in n.

Proposition 2 justifies the correctness of the single occurrence transformation with regard to base alignment. In the following loop, the original FORALL assignment s1 references two distinct instances of array Y: Y(i1,i2) and Y(i1, i1+i2).

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s1:   X(i1, i1 + i2) = Y(i1, i2) + Y(i1, i1 + i2)
    END FORALL

In order to distinguish the access matrices of these two instances, let $F_{Y,1} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$ and $F'_{Y,1} = \left[\begin{smallmatrix} 1 & 0 \\ 1 & 1 \end{smallmatrix}\right]$. Assume that $D_X = (1,-1)$ and $D_Y = (1,1)$. Since $D_X F_{X,1} \neq D_Y F_{Y,1}$, by Proposition 2, the number of remote Y elements required to be accessed is $\frac{n}{p}$ with regard to instance Y(i1,i2). Similarly, since $D_X F_{X,1} \neq D_Y F'_{Y,1}$, by Proposition 2, the number of remote Y elements required to be accessed is $\frac{n}{p}$ with regard to instance Y(i1, i1+i2). Moreover, since $D_Y F_{Y,1} \neq D_Y F'_{Y,1}$, the corresponding elements Y(i1,i2) and Y(i1, i1+i2) are allocated to different remote processors. Therefore, the total cost of reorganization communication is $\frac{2n}{p}$. This cost estimation is consistent with the result obtained from the following transformed loop, which has the property of single occurrence.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s2:   TT(i1, i2) = Y(i1, i2)
    s3:   X(i1, i1 + i2) = TT(i1, i2) + Y(i1, i1 + i2)
    END FORALL

For this reason, we claim that the single occurrence transformation does not bring any side effect into the base alignment analysis.

3.3.2 The Weighted Cost

In Proposition 2, the total number of elements involved in reference "A <- B@s_k" may not be equal to the number of all elements in array A or B. Typically there are three different cases resulting in different costs of reorganization communication.

First, the number of elements involved in reference "A <- B@s_k" is limited to the effective domain imposed by the loop boundary. In Example 1 (Figure 2.4), suppose that the size of arrays X and Y is $n \times n$. The number of effective elements limited by the loop boundary is only equal to $\frac{1}{2} n^2$.

Second, the number of effective elements involved in reference "A <- B@s_k" can be limited by the probability of the WHERE clause used in a FORALL assignment. For example, in the following code, the total number of elements in array X is 10,000.
    FORALL(i1 = 0 : 100-1, i2 = 0 : 100-1)
          WHERE(X(i1,i2).NE.0)
    s1:     X(i1, i2) = X(i1, i2)**2
          END WHERE
    END FORALL

However, if the probability for the condition in the WHERE clause to be true is only 90%, the cost of reorganization would be only 9,000, provided that the elements whose values are zero are evenly distributed among the processors.

Third, the number of elements involved in reference "A <- B@s_k" can be equal to the number of elements in B, where the size of B is much larger than that of A. For example, in the following code, the inner loop s2 is a sequential loop.

    FORALL(i1 = 0 : 100-1, i2 = 0 : 100-1)
    s1:   X(i1, i2) = Y(i1, i2)
          FOR(i3 = 0 : 100-1)
    s2:     X(i1, i2) = X(i1, i2) + Y(i1, i2, i3)
          END FOR
    END FORALL

Therefore, the cost of reorganization communication is 1,000,000, which equals the number of elements in array Y, if $D_X$ and $D_Y$ are incompatible with respect to statement s2.

3.4 Spanning-Tree Base Alignment Algorithm

3.4.1 Data Reference Graph

The problem of base alignment can be simply modeled by a data reference graph (DRG). Given a program structure, a DRG G = (V, E) is constructed as follows. An array is represented by a node in V. For an array which has multiple instances referenced in one or more FORALL assignments, there is only one corresponding node in the DRG. There is a distinct edge in E connecting two nodes for each reference between the corresponding two arrays. A DRG is undirected.

Figure 3.8(a) shows the DRG for Example 7 (Figure 3.7). In Figure 3.8(a), each array variable is represented by a node labeled by the variable name. Edges are constructed based on the references generated by each FORALL assignment in Example 7. For example, edge (A, X) connects nodes A and X due to reference "A <- X@s2". There is no edge between nodes X and Y because these two arrays are not involved in the same reference. Since there is a one-to-one correspondence between an array and a node, the terms "array" and "node" are used interchangeably in the rest of this chapter. Similarly, since there is a one-to-one correspondence between an edge and a reference, the terms "edge" and "reference" are also used interchangeably in the rest of this chapter.

In Figure 3.8(a), each edge is weighted by the cost of reorganization communication incurred if the alignment matrices of the two arrays connected by the edge are incompatible with respect to the reference represented by the edge. The weight on each edge is decided based on the probability of a true condition in the WHERE clause.

    FORALL(i1 = 0 : 10, i2 = 0 : 10)
    s1:   Y(i1, i2) = A(2i1 + i2, i1 + i2)
          WHERE(A(i1,i2).NE.0)     !HPF the probability of A(i1,i2) != 0 is 83%
    s2:     A(i1, i2) = W(i1, i2) + X(i1 + i2, i1 + 2i2)
          END WHERE
          WHERE(Z(i1,i2).NE.0)     !HPF the probability of Z(i1,i2) != 0 is 83%
    s3:     Z(i1, i2) = B(i1, i2) * Y(i1, i2)
          END WHERE
    s4:   W(i1, i2) = Z(i1 + i2, i1 + 2i2)
          WHERE(B(i1,i2).NE.0)     !HPF the probability of B(i1,i2) != 0 is 53%
    s5:     B(i1, i2) = A(2i1 + i2, i1 + i2)
          END WHERE
          WHERE(X(i1,i2).NE.0)     !HPF the probability of X(i1,i2) != 0 is 53%
    s6:     X(i1, i2) = Z(i1, i2)
          END WHERE
    END FORALL

    Figure 3.7. Example 7: A Lapack benchmark loop

[Figure 3.8. The DRG and base alignment for Example 7. Part (a): the DRG with edge weights 121, 100, and 64; part (b): a max-weight spanning tree alignment and its induced communication subgraph; part (c): an optimal alignment and its induced communication subgraph.]
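The DRG is a simple data structure. The sketch below (not the thesis implementation) builds the weighted graph of Figure 3.8(a) from a list of references; the weights follow the 11 x 11 iteration space of Example 7 scaled by the stated WHERE-clause probabilities (121, 100, and 64).

    # Sketch: constructing the data reference graph of Example 7.
    from collections import namedtuple

    Edge = namedtuple("Edge", "arrays statement weight")
    references = [                      # (LHS, RHS, statement, weight)
        ("Y", "A", "s1", 121), ("A", "W", "s2", 100), ("A", "X", "s2", 100),
        ("Z", "B", "s3", 100), ("Z", "Y", "s3", 100), ("W", "Z", "s4", 121),
        ("B", "A", "s5", 64),  ("X", "Z", "s6", 64),
    ]
    nodes = sorted({a for lhs, rhs, _, _ in references for a in (lhs, rhs)})
    edges = [Edge(frozenset((lhs, rhs)), s, w) for lhs, rhs, s, w in references]

    print("nodes:", nodes)
    for e in sorted(edges, key=lambda e: -e.weight):
        print(sorted(e.arrays), e.statement, e.weight)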
Generally speaking, there may not exist a solution of the alignment matrices such that every reference can be free of reorganization communication. This is because the compatibility requirement imposed by one reference may conflict with that imposed by another reference, in particular when the two edges representing these two references are involved in a cycle. For example, Figure 3.8(a) contains the cycle A -> W -> Z -> Y -> A. By Proposition 1, the four references represented by the four edges in the cycle are free of reorganization communication if and only if the following equations have a non-trivial solution of alignment matrices $D_A$, $D_W$, $D_Z$, and $D_Y$.

\[ D_A F_{A,2} = D_W F_{W,2} \qquad D_W F_{W,4} = D_Z F_{Z,4} \qquad D_Z F_{Z,3} = D_Y F_{Y,3} \qquad D_Y F_{Y,1} = D_A F_{A,1} \]

where $F_{A,2} = F_{W,4} = F_{Z,3} = F_{Y,1} = F_{Y,3} = F_{W,2} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$, $F_{Z,4} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 2 \end{smallmatrix}\right]$, and $F_{A,1} = \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$. Since all access matrices are invertible, the above equations can be re-written as follows.

\[ D_A = D_W F_{W,2} F_{A,2}^{-1} \qquad D_W = D_Z F_{Z,4} F_{W,4}^{-1} \qquad D_Z = D_Y F_{Y,3} F_{Z,3}^{-1} \qquad D_Y = D_A F_{A,1} F_{Y,1}^{-1} \]

By substitution, we get

\[ D_A = D_A F_{A,1} F_{Y,1}^{-1} F_{Y,3} F_{Z,3}^{-1} F_{Z,4} F_{W,4}^{-1} F_{W,2} F_{A,2}^{-1} \]

Using the value of each access matrix, the above equation can be written as

\[ D_A = D_A \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \]

\[ D_A \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 3 & 4 \\ 2 & 3 \end{bmatrix} \right) = 0 \]

\[ D_A \begin{bmatrix} -2 & -4 \\ -2 & -2 \end{bmatrix} = 0 \]

In order to satisfy the above equation, $D_A$ has to be (0,0). This implies that there does not exist a solution of the alignment matrices such that all four references in the cycle are free of reorganization communication. In other words, there exists at least one reference with respect to which any given solution of alignment matrices is incompatible.

3.4.2 Spanning Tree Base Alignment Algorithms

If all access matrices are invertible, a conflict of compatibility requirements can only occur within a cycle of a DRG. Intuitively, such a conflict can be resolved by a spanning tree. Base alignment between two arrays is specified by the tree edge connecting these two arrays using Equation 3.7. As a result, references represented by tree edges are always free of reorganization communication. Each non-tree edge determines a unique fundamental cycle of the DRG with respect to the spanning tree. Reorganization communication may not be avoidable for a non-tree edge. Each edge is weighted. Given a choice between two edges, the tree edge would be chosen as the one with the higher weight. As a result, the spanning tree for base alignment would be chosen as a maximum-weight spanning tree.

For example, Figure 3.8(b) shows a maximum-weight spanning tree for base alignment in Example 7. In Figure 3.8(b), $D_W = D_A$, $D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$, $D_Y = D_A \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$, $D_B = D_A \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$, and $D_Z = D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$. Though (Z, X) is a non-tree edge, $D_X$ and $D_Z$ are compatible with respect to edge (Z, X) because $D_X$ and $D_Z$ satisfy the equation

\[ D_X F_{X,6} = D_Z F_{Z,6} \]

Therefore, non-tree edge (Z, X) is also free of reorganization communication. Given $D_W = D_A$ and $D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$, edges (W, Z) and (X, Z) are called homogeneous, since the base alignment equations imposed by the two edges are equivalent.

\[ D_X F_{X,6} = D_Z F_{Z,6} \qquad D_W F_{W,4} = D_Z F_{Z,4} \]
Therefore, the total cost of reorganization communication in G is equal to that in G'. The ICS for the maximum—weight spanning tree alignment is shown in Figure 3.8(b). The total cost of reorganization communication in Figure 3.8(b) is equal to 200. The problem of optimizing base alignment is to find a base alignment such that the cost of reorganization communication in the ICS is minimal. Does the maximum-weight spanning tree (MWST) algorithm always minimize the cost of neighboring communication? The answer is no. Figure 3.7(c) shows yet another spanning tree alignment. The cost of reorganization communication in Figure 3.7(c) is only 185 because Dz and Dy are compatible with respect to non- tree edge (X, Y). The failure of the MWST algorithm is due to the assumption that only tree edge is free of reorganization communication. However, in fact not every non-tree edge is necessary included in the ICS. A non—tree edge can also be free of reorganization communication as long as the alignment matrices of two incident arrays are compatible. Next, we introduce a new spanning tree base alignment algorithm which aims to minimize the sum of weights on those non-tree edges included in the ICS. We name it as the minimum-weight ICS (MICS) algorithm. Definition 3.4 Given a DRG G = (V, E), let U be an arbitrary subset of V. Node A has the single-degree connectivity with U if and only ifA is not in U and there is only one node B in U such that edge (A, B) is in E. 70 Definition 3.5 Given a DRG G = (V, E), let U be an arbitrary subset ofV. Node A has the multi-degree connectivity with U if and only if A is not in U and there exist at least two distinct nodes X and Y in U such that edges (A,X) and (A, Y) are in E. Definition 3.6 Given a DRG G = (V, E), let U be an arbitrary subset of V. Node A is a single-degree neighbor ofU ifA has the single-degree connectivity with U. Node B is a multi-degree neighbor of U if B has the multi-degree connectivity with U. For example, in Figure 3.8(a), let U = {A,X,Y}. Thus, W and B are single-degree neighbors of U. Z is a multi-degree neighbor of U since it is incident with both X and Y. Given a DRG G = (V, E), the MICS algorithm can be formalized as follows: (1) T = 43 (2) while T aé V do (3) Let Q1 be the set of single-degree neighbors of T (4) if Q1 # ()5 then (5) Find a node A in Q1 such that edge (A, B) has the maximum weight among all edges incident with one node in Q1 and another in T (6) Define DA such that DAFAJc = DBFBJ; where (A, B) represents reference “A 4— B@sk” (7) T = T U {A} (8) else (9) Let Q2 = V — T (10) Find a node A in Q2 such that the homogeneous edge set, each edge in which is incident with A and a node in T, has the maximum accumulated weight (11) Define DA such that DAFA,’c = DBFBJ; where reference “A «— B@sk” is represented by an edge (A, B) in the above homogeneous edge set (12) T = T U {A} (13) end if (14) end while Figure 3.9. The minimum—weight induced communication algorithm 71 Figure 3.10 illustrates how base alignment is resolved using the MICS algorithm (Figure 3.9). Table 3.1 shows the contents of T, Q1, and Q2 at each step of outmost while loop (lines (2)-( 14)). Initially, T is empty and any node in V is assumed to be a single-degree neighbor of an empty set. In Figure 3.10, the nodes included in T at each step are highlighted. The algorithm starts with X which is arbitrarily selected and T = {X}. In step (b), since the weight of edge (X, A) is greater than that of (X, Z), A is chosen to be a new member of T (line (5)). 
Figure 3.10 illustrates how base alignment is resolved using the MICS algorithm (Figure 3.9). Table 3.1 shows the contents of T, Q1, and Q2 at each step of the outermost while loop (lines (2)-(14)). Initially, T is empty and any node in V is regarded as a single-degree neighbor of the empty set. In Figure 3.10, the nodes included in T at each step are highlighted. The algorithm starts with X, which is arbitrarily selected, so T = {X}. In step (b), since the weight of edge (X, A) is greater than that of (X, Z), A is chosen to be a new member of T (line (5)). Arrays X and A are aligned by $D_A F_{A,2} = D_X F_{X,2}$ (line (6)). For the same reason, in step (c), node Y is included in T because edge (A, Y) has the maximum weight among the edges incident with nodes W, Z, Y, and B, all single-degree neighbors of {A, X} (line (5)). Arrays A and Y are aligned by $D_Y F_{Y,1} = D_A F_{A,1}$ (line (6)). The same procedure repeats in steps (d) and (e): nodes W and B are included in T, respectively. Eventually, node Z becomes a multi-degree neighbor of T.

Figure 3.10. Use of the MICS algorithm to resolve base alignment for Example 7

Table 3.1. Values of T, Q1, and Q2 in executing the MICS algorithm for Example 7

    step      T                       Q1                     Q2     base alignment
    Initial   empty                   {A, W, X, Y, B, Z}     empty
    (a)       {X}                     {A, Z}                 empty
    (b)       {X, A}                  {W, Y, B, Z}           empty  D_A = D_X F_{X,2}
    (c)       {X, A, Y}               {W, B}                 {Z}    D_Y = D_A F_{A,1}
    (d)       {X, A, Y, W}            {B}                    {Z}    D_W = D_A
    (e)       {X, A, Y, W, B}         empty                  {Z}    D_B = D_A F_{A,5}
    (f)       {X, A, Y, W, B, Z}      empty                  empty  D_Z = D_Y

In step (f), Q1 is empty. By line (9), Q2 = V - T = {Z}. Node Z is incident with two homogeneous edge sets. One set includes edges (W, Z) and (X, Z). The other includes edges (Y, Z) and (B, Z). The accumulated weight of the set {(W, Z), (X, Z)} is equal to 185, while the accumulated weight of the set {(Y, Z), (B, Z)} is equal to 200. Since nodes W, X, Y, and B are all included in T, by line (11), $D_Z$ is chosen such that $D_Z F_{Z,3} = D_Y F_{Y,3}$, because edge (Z, Y) is in the homogeneous edge set with the larger weight.

The MICS algorithm (Figure 3.9) is an improvement of the MWST algorithm. If Q1 is not empty, like the MWST algorithm, the tree edge is selected as an edge with the largest weight among all the edges which are incident with one node in Q1 and another node in T (lines (4)-(7)). If Q1 is empty but Q2 is not, the alignment matrix is determined such that every edge in a homogeneous edge set with the maximum accumulated weight is free of reorganization communication. By its construction, the MICS algorithm is superior to the MWST algorithm in general. If the heap-sort algorithm [79] is used in lines (5) and (10), the time complexity of finding the maximum-weight edge or homogeneous edge set can be reduced to O(log|E|), where |E| is the number of edges in DRG G = (V, E). As a result, the time complexity of the MICS algorithm is O(|E| log|E|).

3.4.3 Experimental Results

[Figure 3.11. Comparison of the MWST algorithm and the MICS algorithm on 16-node nCUBE-2 (communication cost in msec versus number of processors; benchmark loop: Example 7).]

Figure 3.11 shows the comparison of communication cost between the MWST algorithm and the MICS algorithm on the 16-node Purdue nCUBE-2. Example 7 (Figure 3.7) is used as the benchmark loop in our experiment. We increase the iteration space in Example 7 by setting the loop boundary to (i1 = 0 : 34, i2 = 0 : 34). In Figure 3.11, when the number of processors is 16, the size of the messages sent out from each processor reaches its minimum and the number of messages reaches its maximum. Since the startup software latency is much more expensive than the network latency in message transmission on the Purdue nCUBE-2, the startup latency dominates the overall communication latency when the message size is relatively small. This explains why the overall communication overhead increases when the number of processors becomes 16.

3.5 Optimizing RHS Expression Evaluation

3.5.1 RHS Expression Evaluation Optimization

In the MICS algorithm, it is assumed that all the operations in a FORALL assignment are performed on the local processor which owns the LHS operand.
If a RHS operand is owned by a remote processor, message passing is invoked in order to first transfer the operand to the local processor. However, this limitation, known as the owner-computes rule, can be exceeded by evaluating different parts of the RHS expression of a FORALL assignment on different processors. Given a FORALL assignment which consists of one kind of associative and commutative operation, data movement can be minimized by an optimal evaluation tree in which an intermediate result may be evaluated by a remote processor rather than the one which owns the LHS operand.

    FORALL(i1 = 0 : 9, i2 = 0 : 9)
    s1:   A(i1, i2) = B(i1, i2) * X(i1, i2) * Y(i1, i2) * Z(i1, i2)
    s2:   B(i1, i2) = A(2i1 + i2, i1 + i2) + X(2i1 + i2, i1 + i2)
                    + Y(2i1 + i2, i1 + i2) + Z(i1, i2)
    END FORALL

    Figure 3.12. Example 8: A Splash benchmark loop

In Example 8 (Figure 3.12), it is assumed that base alignment is pre-determined such that $D_A F_{A,1} = D_B F_{B,1} = D_X F_{X,1} = D_Y F_{Y,1} = D_Z F_{Z,1}$. Therefore, FORALL assignment s1 is free of reorganization communication. Nevertheless, reorganization communication cannot be eliminated in FORALL assignment s2, because $D_B F_{B,2} \neq D_A F_{A,2}$, $D_B F_{B,2} \neq D_X F_{X,2}$, and $D_B F_{B,2} \neq D_Y F_{Y,2}$. If the owner-computes rule is followed, remote elements A(2i1+i2, i1+i2), X(2i1+i2, i1+i2), and Y(2i1+i2, i1+i2) have to be transferred to the local processor which owns the LHS B(i1, i2). As a result, the cost of reorganization communication is equal to 300.

[Figure 3.13. Optimal evaluation trees for Example 8. For statement s1, B(i1,i2), X(i1,i2), Y(i1,i2), and Z(i1,i2) are combined locally into A(i1,i2). For statement s2, A(2i1+i2,i1+i2), X(2i1+i2,i1+i2), and Y(2i1+i2,i1+i2) are first summed into TT(2i1+i2,i1+i2) on a remote processor, which is then combined with Z(i1,i2) into B(i1,i2).]

However, closer inspection reveals that $D_A F_{A,2} = D_X F_{X,2} = D_Y F_{Y,2}$. This implies that the sum of A(2i1+i2, i1+i2), X(2i1+i2, i1+i2), and Y(2i1+i2, i1+i2) can first be calculated on a remote processor without any data movement. Figure 3.13 shows such an optimal evaluation tree for Example 8. As shown in Figure 3.13, TT(2i1+i2, i1+i2), a temporary variable which stores the sum, is then added to Z(i1, i2) by passing a message to the local processor which owns B(i1, i2). In Figure 3.13, each arc represents a reference relationship from one array to the other. Each arc is weighted by the cost of reorganization communication incurred if the alignment matrices of the two corresponding arrays are incompatible with respect to the reference represented by the arc. The optimal evaluation of Example 8 can be rewritten as follows:

    FORALL(i1 = 0 : 9, i2 = 0 : 9)
    s1:   A(i1, i2) = B(i1, i2) * X(i1, i2) * Y(i1, i2) * Z(i1, i2)
    s2.1: TT(2i1 + i2, i1 + i2) = A(2i1 + i2, i1 + i2) + X(2i1 + i2, i1 + i2)
                                + Y(2i1 + i2, i1 + i2)
    s2.2: B(i1, i2) = TT(i1, i2) + Z(i1, i2)
    END FORALL

The original assignment s2 is transformed into two assignments, s2.1 and s2.2. The alignment matrix of TT is determined as $D_{TT} = D_A$. Statement s2.1 is free of reorganization communication. Reorganization communication only occurs in assignment s2.2. The total cost of reorganization communication is reduced to 100.

Generally speaking, given a FORALL assignment, the RHS expression evaluation can be optimized using the following rule.

Proposition 3 Assume that in a given FORALL assignment, the RHS expression is operated on by one kind of associative and commutative operation. Each group of RHS array operands whose alignment matrices are compatible with one another should be evaluated together. Only the intermediate results need to be transmitted to the local processor which owns the LHS operand if the intermediate results are generated on a remote processor.

In order to take advantage of optimal expression evaluation, we assume that the original program has been pre-processed by transforming each original FORALL assignment into an equivalent set of FORALL assignments, each of which has its RHS expression operated on by one kind of associative and commutative operation.

3.5.2 Alignment Graph

An alignment graph (AG) is used to model the alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, $G = (V_a, V_s, E)$. An array is represented by a node in $V_a$. A statement is represented by a node in $V_s$. An undirected edge in E connects an array and a statement if the array is referenced in the statement.
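Because an AG is just a bipartite graph, it can be built directly from the per-statement reference lists, as the small sketch below illustrates for the loop of Example 9, shown next (Figure 3.14); the sketch is illustrative only.

    # Sketch: the bipartite alignment graph of Example 9.
    refs = {"s1": ["A", "B", "X", "Y"],
            "s2": ["B", "A", "X", "Y"],
            "s3": ["X", "B", "A"]}
    Va = sorted({a for arrays in refs.values() for a in arrays})   # array nodes
    Vs = sorted(refs)                                               # statement nodes
    E = [(a, s) for s in Vs for a in refs[s]]                       # one edge per reference
    print(Va, Vs)
    print(E)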
Only the intermediate results needs to be transmitted to the local processor which owns the LHS operand if the intermediate results are generated on a remote processor. In order to take advantage of optimal expression evaluation, we assume that the original program has been pre-processed by transforming each original FORALL as- signment into an equivalent set of FORALL assignments, each of which has the RHS expression operated by one kind of associate and commutative operations. 3.5.2 Alignment Graph An alignment graph (AG) is used to model the alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, G = (Va, V,, E). An array is represented by a node in Va. A statement is represented by a node in V,. A undirected edge in E connects an array and a statement if the array is referenced in the statement. Figure 3.15 shows the AG for Example 9 (Figure 3.14). 77 FORALL(2'1 = 0 : 9, i2 = 0 : 9) 812 A(ihig) = B(i1,i2) * X(i1,i2) * Y(i1, 22) .92: B(i1,i2) = A(i1 +i2,i1) + X(i1 + i2,i1) + Y(i1 + 2'2, 2'1) 833 X(i1,i2) = B(i1,i2) * A(Zl + t2,i1) END FORALL Figure 3.14. Example 9: An Electric benchmark loop Figure 3.15. Alignment graph for Example 9 3.5.3 AG Base Alignment Algorithm The AG-based base alignment (AGBA) algorithm is shown in Figure 3.16. In an AG G = (Va,V,, E), there is a set, denoted as Qk, associated with each node 3,, in V,. Initially, each set Q. is empty. During the execution, each Oh will contain elements of type DAFAJc. Each element of Qk, DAF A1,, is further weighted by the number of effective elements in array A referenced in assignment 3],. By Proposition 3, a group of the RHS array operands with compatible alignment matrices can be evaluated together. Thus, the cost of reorganization communication only depends on the effective elements in the intermediate results, rather than the accumulation of effective elements among all the RHS array operands. For this reason, with respect to a single FORALL assignment, the fewer the number of the distinct groups of compatible alignment matrices among all the RHS array operands, the less 78 (1) for each node A in Va (2) T = <25 (3) for each neighboring node 3;, (in V,) of A (4) for each element DBFBJ: in Q, (5) T = T U DBFB,kFX,i¢ (6) endfor (7) endfor (8) Choose DA = D131"‘153,k1",;}c such that DBFB,]¢FA_’}c is the member in T which has the maximum accumulated-weight (9) for each neighboring node 3;. (in V,) of A (10) If DAFA'k is not in Qk then (11) Q]: = Qk U {DAFAJc} (12) endif (13) endfor (14) endfor Figure 3.16. The alignment graph base alignment algorithm the cost of reorganization communication. In Figure 3.16, Qk records the distinct groups of compatible alignment matrices among all operands in statement 3;, (lines (10)-(12)). When array A is referenced in statement 31,, we hope that DAFA,k can be chosen in the way that it matches to another element DBFBJ, existing in Q, and the size of Q. is not increased. In other words, DA should be compatible with DB. Therefore, no extra reorganization communication occurs since A and B can be evaluated together and the communication cost of moving intermediate results is equal to that of moving the operand B. On the other hand, however, array A may be referenced in multiple statements and the compatibility requirements imposed by different FORALL assignment may conflict with one another. T collects all the types of different DB F MFR}. 
In Figure 3.16, $Q_k$ records the distinct groups of compatible alignment matrices among all operands in statement $s_k$ (lines (10)-(12)). When array A is referenced in statement $s_k$, we hope that $D_A F_{A,k}$ can be chosen in such a way that it matches another element $D_B F_{B,k}$ already present in $Q_k$, so that the size of $Q_k$ is not increased. In other words, $D_A$ should be compatible with $D_B$. Then no extra reorganization communication occurs, since A and B can be evaluated together and the communication cost of moving the intermediate results is equal to that of moving the operand B. On the other hand, however, array A may be referenced in multiple statements, and the compatibility requirements imposed by different FORALL assignments may conflict with one another. T collects all the distinct values of $D_B F_{B,k} F_{A,k}^{-1}$ for every statement $s_k$ in which both array A and array B are referenced. Note that T may contain multiple identical elements, each of which comes from a distinct FORALL assignment. As a result, $D_A$ is chosen to be the member in T which has the maximum accumulated weight (line (8)). The accumulated weight refers to the weight accumulated among all identical elements. Thus, the maximum amount of reorganization communication is eliminated.

Table 3.2 illustrates how the AGBA algorithm works using Example 9. It shows the content of each $Q_k$ at each step of the outermost loop between lines (1)-(14) in Figure 3.16. The algorithm begins with A. After executing lines (9)-(13), Q1, Q2, and Q3 become $\{D_A F_{A,1}\}$, $\{D_A F_{A,2}\}$, and $\{D_A F_{A,3}\}$, respectively, where $F_{A,1} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$ and $F_{A,2} = F_{A,3} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. Next, B is selected. After executing lines (3)-(7), $T = \{D_A F_{A,1} F_{B,1}^{-1}, D_A F_{A,2} F_{B,2}^{-1}, D_A F_{A,3} F_{B,3}^{-1}\}$. Since $F_{B,1} = F_{B,2} = F_{B,3} = I$,

\[ T = \left\{ D_A,\; D_A \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix},\; D_A \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \right\} \]

Consequently, $D_A \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$ is the member in T which has the larger accumulated weight. Thus, $D_B = D_A \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$ (line (8)). Continuing the algorithm, $D_X$ and $D_Y$ are obtained likewise.

Table 3.2. Resolving base alignment for Example 9 (M denotes $\left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$)

    step   alignment matrix   Q1              Q2        Q3
    A      D_A                {D_A}           {D_A M}   {D_A M}
    B      D_B = D_A M        {D_A, D_A M}    {D_A M}   {D_A M}
    X      D_X = D_A          {D_A, D_A M}    {D_A M}   {D_A M, D_A}
    Y      D_Y = D_A          {D_A, D_A M}    {D_A M}   {D_A M, D_A}

In Figure 3.16, the order in which $V_a$ nodes are visited (line (1)) is important in minimizing reorganization communication. An efficient heuristic approach can be described as follows. Let U be a subset of $V_a$. A node in $V_s$ is a neighboring node of U if this node is connected to a node in U. Initially, U is empty. The algorithm begins with a node in $V_a$, say A, which has the maximum connectivity degree; U = U + {A}. Then, in each of the remaining steps, a node in $V_a - U$, say B, is selected such that B is connected to a neighboring node of U, say $s_k$, where $s_k$ has the minimum connectivity degree among those neighboring nodes of U; U = U + {B}. This procedure repeats until U contains every node in $V_a$. If the heap-sort algorithm [78] is used in line (8), the time complexity of finding the element with the maximum accumulated weight is reduced to O(log|E|), where |E| is the number of edges in AG G = (V, E). As a result, the time complexity of the AGBA algorithm is O(|E| log|E|).

3.5.4 Experimental Results

[Figure 3.17. Comparison of the MWST, MICS, and AGBA algorithms on 16-node nCUBE-2 (communication cost in msec versus number of processors; benchmark loop: Example 9).]

Figure 3.17 shows the comparison of communication cost among the MWST, MICS, and AGBA algorithms on the 16-node nCUBE-2. Example 9 (Figure 3.14) is used as the benchmark loop in our experiment. We increase the iteration space in Example 9 by setting the loop boundary to (i1 = 0 : 24, i2 = 0 : 24). The AGBA algorithm outperforms the other two algorithms. In Figure 3.17, as the number of processors increases, the number of messages sent out from each processor increases while the size of each message decreases. The startup software latency is much more expensive than the network latency in message transmission on the Purdue nCUBE-2. Therefore, the startup latency dominates the overall communication latency when the message size is relatively small.
Since the amount of communication generated by the proposed MICS and AGBA algorithms is small compared with that of the MWST algorithm, the performance of the MICS algorithm and the performance of the AGBA algorithm are both dominated by the startup latency and become close to each other once the number of processors exceeds 8.

3.6 Avoiding Redundant Communication

3.6.1 Redundant Communication

The same context of array elements may be referenced in more than one FORALL assignment. If these elements reside on remote processors, a local processor should receive a single copy of these elements rather than multiple identical copies. This technique is known as redundant communication avoidance. The identification of redundant communication not only reduces the communication overhead following the data decomposition phase, but also has a great impact on the alignment analysis. In this section, we focus on the issues of avoiding redundant reorganization communication.

In Example 10 (Figure 3.18), assume that $D_B = D_Z = D_A$. As a result, $D_B F_{B,2} \neq D_A F_{A,2}$ and $D_Z F_{Z,3} \neq D_A F_{A,3}$, where $F_{A,2} = F_{A,3} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. In other words, remote element A(i1+i2, i1) is referenced in both statements s2 and s3. However, $D_B = D_Z$ implies that the LHS element B(i1,i2) in statement s2 and the LHS element Z(i1,i2) in statement s3 reside on the same local processor. A(i1+i2, i1) is required to be sent from the same remote processor to the same local processor in executing both statements s2 and s3. Since A is not written between statements s2 and s3, the message transmission of remote element A(i1+i2, i1) in statement s2 is redundant with that of remote element A(i1+i2, i1) in statement s3. These two messages transmitting the same A(i1+i2, i1) comprise redundant communication.

    FORALL(i1 = 0 : 10, i2 = 0 : 10)
          WHERE(X(i1,i2).NE.0)     !HPF the probability is 83%
    s1:     X(i1, i2) = A(i1, i2)
    s2:     B(i1, i2) = A(i1 + i2, i1)
    s3:     Z(i1, i2) = A(i1 + i2, i1) * B(i1, i2)
          END WHERE
    s4:   A(i1, i2) = Z(i1, i2)
    s5:   B(i1, i2) = X(i1, i2) + A(i1, i2)
    END FORALL

    Figure 3.18. Example 10: An Oceanwater benchmark loop

The identification of redundant communication relies on both the location and the context. The location requires that the senders of the redundant communication be the same, and likewise the receivers. The location requirement can be judged by the alignment matrices. In Example 10, since $D_B F_{B,2} = D_Z F_{Z,3}$, elements B(i1,i2) and Z(i1,i2) are
B has two single assignment blocks:{s2,s3} and {.35}. Z has one single assignment block: {33, .94}. The single assignment block is based on the def/use flow for array variables. Therefore, the identification of a single assignment block can be easily obtained using data flow analysis algorithms [63]. The single assignment block has the following important property: Proposition 4 The context of an array element is not changed within the same single assingment block. The contexts of the same array element in difl'erent single assingment blocks are difi'erent. 3.6.2 Enhanced Alignment Graph An alignment graph can be enhanced to model the use/def flow for array variables by using the concept of single assignment block. The definition of an enhanced 84 alignment graph (EAG) G = (Va,V,, E,A,) is similar to that of an alignment graph G = (Va,V,, E) except that an EAG has an extra arc set A,. The definition of Va, V,, and E can be found in Section 3.5. There is an arc in set A, from node 3( to node 3;, if an array defined in statement 3; is used in statement 3],. Figure 3.19 shows the EAG for Example 10 (Figure 3.18). 1,5,1, __ kl, I \ ( X) l“ - ‘~ , " I it 3 l ,._+ ,\ _‘ ,,.-+ as 0. For other cases, see Section 3.3. 99 A closer inspection of many benchmark programs reveals that the density of dif- ferent template elements can be different. The density function, MAJ“ is defined such that the value of wA,k(t_) is the number of effective elements in array A which are mapped to template element T(t) with respect to FORALL assignment 3k. Consider Example 3 (Figure 2.8). Suppose alignment functions of X and Y are defined as follows: 171 i=6x( )=$2+1 $2 yl t=5Y( )=yl+312 312 where T(t) is a template element. Thus, X is partitioned in columns and Y is partitioned along diagonal. Alignment offsets d x = 1 and dy = 0 imply that the (k + 1)-th column in X is aligned with the k—th off-diagonal. By the definity of density function, we have wx,2(t) = max{0,t — 1} wY,2(t) = t Since the density function is not uniform, the amount of neighboring communication between different pairs of neighboring processors is different. In this thesis, neighbor- ing processors refer to processors P(i) and P(i + 1). Figure 4.3 shows the amount of neighboring communication between each pair of neighboring processors. In Figure 4.3, arrays X and Y are declared as 8 x 8 and the number of available processors is 3. Each array element is represented by a cirle. In each array, elements mapped to the same element in the template are connected by the same solid line. Only effective elements are covered by solid lines. By Proposition 6, 100 neigboring communic on Figure 4.3. Neighboring communication in Example 3 the basic cost of neighboring communication is equal to 1. Figure 4.3 gives the real .meaning which the value 1 stands for. In order to define the LHS X elements on a column, each processor requires to access remote Y elements which form one off- diagonal line. For example, in order to write elements X (331,1:2) on line x2 = 5, processor P(2) accesses elements Y(y1,y2) on line y] + y2 = 3 which are owned by remote processor P (1) Therefore, the amount of neighboring communication between processors P(1) and P(2) is equal to the number of elements on line yl + y2 = 3, which is 4. Similarly, the amount of neighboring communication between processors P(O) and P(1) is equal to 2. Note that the amount of neighboring communication between P(O) and P(1) is less than that between P(1) and P(2). 
Since the messages for neighboring communication can be transmitted simultaneously among pairs of neighboring processors, the cost of neighboring communication among all processors Should be equal to the maximum amount of neighboring communication per pair of 101 processors. Thus, the actual cost of neighboring communication is 4 in Figure 4.3. This justifies the reason that the basic cost can be used to measure the actual cost of neighboring communication. Like the cost of reorganization communication, the cost of neighboring commu- nication can be weighed by various array sizes. For example, in the following code segment, loop 32 is a sequential loop and can not be parallelized among the processors. ALIGN Y(j) WITH X(j) ALIGN Z(j) WITH X(j) FORALL(2'1 = 0 : 10) 31: X(i1) =Y(i1+1) FOR(22 = 0 1 10) 321 X(21)=X(21)+Z(21+1,22) END FOR END FORALL Y( j ) is aligned with X (i), so is Z (2) By Proposition 5, the basic cost of neighboring communication involved in reading Y(i1 + 1) is 1, and the basic cost of neighboring communication involved in reading Z (i1 +1, i2) is also 1. However, in the former case, the actual cost of neighboring communication is 1. In the later case, the actual cost of neighboring communication is 10 since the whole column in Z is mapped to the same template element. In the rest of this Chapter, we use the weight of the basic cost to formalize the impact of the density function and different array sizes on the actual amount of neighboring communication imposed by a given offset alignment. The actual amount of neighboring communication will be represented by the so-called weighted basic cost. 102 4.3 The Impact of Access Offset 4.3.1 Piecewise Linear Cost Function The following code pattern is very common in scientific application programs. Multiple instances of array Y referenced on the RHS have the same access matrix FORALL(z'1 = 0 : 10) 51: X(i1) =Y(i1)+Y(i1+1)+Y(i1+2) END FORALL but different access offsets. The access offset of instance X (i1) is 0, the access offset of instance X(i1 + 1) is 1, and the access offset of instance X(i1 + 2) is 2. Using the notation from the previous section, assume that there are totally q processors and template elements T(m, + 1 : m,“ — 1) are owned by the local processor P(i) where0 =mo < m1 < m2 < < mq_1 < m, =n—1.Sincet=6x(x)= x+dx, the X elements owned by processor P(i) are X (m,- — d x : m,+1 — l — d X). Conse- quently, by the construction of statement 31 , processor P (z ) requires to access elements Y(m.- —dx : m,“ —1-dx) with respect to instance Y(i1), Y(m,- —dx +1 :m;+1 —dx) with respect to instance Y(i1 + 1), and Y(m.~ — dx + 2 : m,“ — dx + 1) with respect to instance Y(i1 + 2). On the other hand, since t = 6y(y) = y + dy, the Y elements owned by processor P(i) are Y(m,- — dy : m,“ -- 1 — dy). Consider the remote Y elements accessed by processor P(i). If d X < dy, remote elements Y(m.-+1 — dy : m,“ — dx) are accessed with respect to instance Y(i1). Remote elements Y(m,-+1 — dy : m,“ — dx + 1) are accessed with respect to instance Y(i1 + 1). Remote elements Y(m;+1 — dy : m,“ — d x + 2) are accessed with respect to instance Y(i1 +2). Elements in these three segments overlap. 103 The union of these three segments is Y(m.-+1 — dy : m,“ — dx + 2). Therefore, the cost of neighboring communication is equal to dy — dx + 2 if d X < dy. If dx > dy + 2, remote elements Y(m.~ — dx : m,- — 1 — dy) are accessed with respect to instance Y(i1). Remote elements Y(m,~ — d X + 1 : m, — 1 — dy) are accessed with respect to instance Y(i1 + 1). 
Remote elements Y(m,- — dx + 2 : m.- — 1 — dy) accessed with respect to instance Y(i1 + 2). The union of these three segments is Y(m,~ — d X : m.- — 1 — dy). Therefore, the cost of neighboring communication is equal to dx—dy ifdx >dy+2. In particular, if dx = dy, remote elements Y(m,- — dy) and Y(m,- — dy + 1) are accessed with respect to instances Y(i1 +1) and Y(i1 +2), respectively. If dy +1 2 dx, remote elements Y(m,- —dx - 1) and Y(m,- —dy) are accessed with respect to instances Y(i1) and Y(i1 + 2), respectively. If dy + 2 = dx, remote elements Y(m,- — dx — 2) and Y(m.~ — dx —- 1) and Y(m.- — dy) are accessed with respect to instances Y(i1) and Y(i1 + 1), respectively. Overall, under these three special conditions, the cost of neighboring communication is equal to 2. Let cBJc be the cost of neighboring communication involved in accessing all in- stances of array B referenced on the RHS of statement 3],. In the above example, we have the following result. 2—(dx-Cly) when (dx—dy) <0 CY.1 = 2 when 0 S (dx — dy) S '2 (4.15) (dX—dy) when 2<(dx —dy) Unlike the cost of reorganization communication, Equation 4.15 shows that the cost of neighboring communication cannot be treated independently for each distinct in- stance when there are multiple instances for the same array referenced in the same statement. In the above example, if the cost of neighboring communication regard- ing to instances Y(i1), Y(i1 + 1), and Y(i1 + 2) were estimated independently by 104 Proposition 5, the overall cost would be Idx —dY|+|dx —dY+1I+|dx —dY+2I The value of the above equation would be three times as much as CY’I. For this reason, the single occurrence transformation is not used in the offset alignment analysis since it can change the cost estimation and thus affect the result of offset alignment. For ease of explanation, in the rest of this chapter, we use the symbol “A <— B@sk” to represent the references of all distinct instances of array B in order to write the LHS A elements in statement sk. For the sake of consistency, the symbol “A +- B@sk” is still called a reference. Equation 4.15 can be extended as follows: Proposition 7 Assume that in a FORALL assignment 3],, all instances of array B are referenced on the RHS. These instances are represented by B(i1 + r1), B(i1 + r2), . ., and B(i1 + r5) where r1 < r2 < . .. < rt. The access offset tuple is denoted as up,’‘ = {r1, r2, . . . rt}, which is distinguished by the array variable B and the statement number 1:. c1“, the cost of neighboring communication generated by accessing all B instances in executing statement 3),, can be formalized by the following piecewise linear equation. rt — (dA — d3) when (dA — (13) < r1 CBJ: = 7‘( — 7‘1 when 7‘1 S (dA — dB) S 1‘3 (4.16) (dA — dB) — 1‘1 when 7“: < (61,; — (13) where array A is referenced on the LHS. For simplicity, 63,}; is called the cost function. Proposition 7 can be easily extended to the case where the array referenced on the LHS has different instance(s) referenced on the RHS in the same statement. Assume that A(i1 + r,-) is referenced on the LHS and other instances of A referenced on the RHS are A(i1 + r1), ..., A(i1 + rj-1), A(i1+ n+1), ..., and A(i1 + r;) where r1 < 105 . . . < rJ-_1 < r,- < rj+1 < . . . < r1. By using the similar analysis as in Equation 4.16, the cost of neighboring communication involved in accessing A elements is equal to r; — r1 no matter whatever dA is. In other words, the cost is a constant which is independent with the alignment of array A. 
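The shape described by Equation 4.16 is easy to evaluate directly. The following Python fragment is a minimal sketch of that cost function (the function name and argument layout are ours and are not taken from any compiler described here): given d = d_A - d_B and the access offset tuple u_{B,k}, it returns c_{B,k}.

def neighboring_cost(d, offsets):
    # offsets is the access offset tuple u_{B,k}; d is d_A - d_B.
    # Piecewise linear cost of Equation 4.16: the size of the union of the
    # remote segments accessed for all instances B(i1 + r), r in offsets.
    lo, hi = min(offsets), max(offsets)
    if d < lo:
        return hi - d          # all accessed remote elements lie beyond the right end point
    if d > hi:
        return d - lo          # all accessed remote elements lie beyond the left end point
    return hi - lo             # flat region between the two end points

# Example: for the tuple (0, 1, 2) of statement s1 above,
# neighboring_cost(d, (0, 1, 2)) equals 2 - d for d < 0, 2 for 0 <= d <= 2,
# and d for d > 2, which matches Equation 4.15.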
4.3.2 Properties of Piecewise Linear Cost Function In the rest of this section, we explore more properties of the piecewise linear cost function. Consider the following program structure. Assignment 3; is free of neigh- 313 X(i1)=Y(i1)+Y(i1+1)+Y(i1+2) 32: Z(i1) = X(i1) * *2 33: X(i1) =X(i1)+Y(z'1 +4) +Y(z'1 +5)+Y(z'1 +7) END IF END FORALL boring communication as long as dz = dx. Consider the relationship of dy and dx. Since uy,1 = {0,1, 2}, we have 2 — (dx —- dy) When (dx - dy) < 0 CY,1 = 2 when 0 S (dx — dy) S 2 (dx -— dy) When 2 < (dx — dy) Since uy,3 = {4,5, 7}, we have 7 — (dx — dy) when (dx — dy) < 4 CY,3 : 3 when 4 S (dx —dy) S 7 (dx — Cly) — 4 when 7 < (dx — dy) 106 The task of offset alignment is to find the optimal value of d X — dy such that the sum of cm and Cy,3 is minimal. Figure 4.4 shows the sum of eyJ and ey,3 in a coordinate system. “"[‘— dx’dv _|' I I l i I +i_ — ———++++ .[ 7 Figure 4.4. The sum of CY’l and ey,3 Studying Figure 4.4, we obtain the following results. If the position of dx —- dy is chosen between 2 and 4, such as the postion g, the value of Cy,1 is (g - 2) + 2 and the value of ey,3 is (4 — g) + 3. As a result, the sum of CY'l and ey,3 is equal to (4 - g) + (g — 2) + 2 + 3, or (4 — 2) + 2 + 3. Note that (4 —- 2) is the distance between the maximum element in tuple uy,1 and the minimum element in tuple uy,3. If the position of d X — dy is chosen to the left of 2, such as the position f, the value of cyl is (2 — f) + 2 and the value of ey,3 is (4 — f) +3. Since f extends left to 2, (4 - f), the distance between f and 4, is greater than 2, the distance between 2 and 4. Therefore, 107 the sum of (2 -— f) + 2 and (4 — f) + 3 is greater than (4 —— 2) + 2 + 3. On the other hand, if the position of d X — dy is chosen to the right of 4, such as the position h, the value of eyJ is (h — 2) + 2 and the value of ey,3 is (h — 7) + 3. Since h extends right to 4, (h — 2), the distance between 2 and h, is greater than 2, the distance between 2 and 4. Overall, we conclude that the minimal value of ey,1 + Cy,3 is 7 when dX — dy is chosen as any integer between (including) 2 and (including) 4. We further notice that the middle elements, 1 in uy,1 and 5 in uy,3, do not affect the shape of the piecewise linear function and thus are not relevant to the decision making for the minimal sum of the cost. For this reason, the minimum and maximum elements in an access offset tuple are sufficient to identify a piecewise linear cost function. For this reason, the cost function cBJc is be denoted as 63,]; =< 11;,k,r1_:3,;c > where 13,], is the minimum element and 78,]; is the maximum element in u3,k. Element 13,}; is called the left end point and element uBJc is called then right end point. Both (3,]: and uBJc are called end points for simplicity. Using the similar analysis as in Figure 4.4, we can obtain the general results as follows. Proposition 8 Assume that cBJ- is the cost function for accessing remote B elements in executing reference “A +— B@sj ” and 63,]; is the cost function for accessing remote B elements in executing reference “A (— B@sk”. Let cad =< l3,j,r3,j > and 63,]; =< IB,k,r3,k >. If l3,,- < T3,)” the sum of cBJ- and 63,]: is minimized when dA — dB is chosen as any integer between rBJ- and 131.. The minimum value of the sum of egg- and 03,1: is equal to TB,j — 13$. Figure 4.4 only shows the case where TB’J' < lB,k. Proposition 8 also holds for both case 7‘8,j < (19,;c and case TBJ‘ > 13¢, as shown in Figure 4.5(a) and (b). 
A close inspection of Figure 4.5 reveals that the sum of clay and CBJc can be represented by the piecewise linear function when dA — dB varies between lBJ' and 108 'T dA-dB n l [lll-lll;_i .. _ r8 _ < IKI k ll. I] 18 B. .. C 1 e + s n m _ a _ B. C , ['8']. B d _A d /. k B. m C m l < . a I ..m +,. _ _ a ,_l I. lllll l_ m M I I L. _ m... I ...x C rB.k lB,k 1'8". 13.] Figure 4.5. The sum of cBJ- and C3,]; ‘1" a, 109 T3,]? This can be formalized as follows. Proposition 9 Assume that CB,j is the cost function for accessing remote B elements in executing reference “A (— B@sj ” and CBJc is the cost function for accessing remote B elements in executing reference “A +— B@sk”. Let cad =< IB,J-,r3,j > and C3,]; =< l3,k,r3,k >. If lgd- < TB'k, the sum of CB“,- and C3,]; have the following property. 03,1 + 63,1: = (dA - dB) - ’34 + (Ta/c - Ink) max{TB,j,lB,k} < (dA - 0'3) S. TBJ: T3,, — lg.)c min{r3,,-, lB,lc} S ((1,; — (13) S max{rBJ, lBJc} TBJ: — (dA — 613) + (raj —- 13,1) lB,j S (61.4 — £13) < min{r3,j,13,k} For example, in Figure 4.4, the sum of cm and Cy,3 can be represented as the following piecewise linear function where the value of dx — dy varies between 0 and 7. T—(dx—dy)+2 When 0S(dx—dy)<2 CY,1+CY,3= 7 when 2 S(dx —dy) S4 (dx—dy)+3 When 4((dx—dy)S7 Proposition 9 is important in determining offset alignment between arrays A and B such that the sum of the cost of neighboring communication with regard to multiple FORALL assignments is minimal. More precisely, the problem can be formalized as follows. Assume that there are m distinct references “A +— B@s,-” where ch =< lB,j,1'B’j > for 1 _<_ j g m. The following algorithm can be used to find the value of dA — d3 such that 2;, cBJ- can be minimized. The minimal-cost pairwise offset alignment (MPOA) algorithm is shown in F ig- ure 4.6. In Figure 4.6, piecewise linear cost function Gay is monotonously increasing when dA — dB extends right to r31 or left to 13,5. Therefore, if 130- is the minimal end point and rBJc is the maximum end point among all access offset tuples in T, each 110 (1) T = U311{} (2) while T contains more than one tuple do (3) Find j such that 134- has the minimal value among all end points in T (4) Find k such that r15”c has the maximum value among all end points in T ) if k = j then ) T=T—{< lB,j,T‘B’j >} ) else ) T=T— {< lB,j,TB,j >,< lB,k,7‘B,k >} ) T = T U {< min{r3,j,lg,k},max{rB,J-,13*} >} 0) end if 1) end while 2) dA — d3 can be chosen as any integer number between (including) l3 1 and (including) T3,; where {13,5, rgy} IS the only tuple left 1n T (5 (6 (7 (8 (9 (1 (1 (1 Figure 4.6. The minimal-cost pairwise offset alignment algorithm piecewise linear function cBJ- (1 S j S m) is monotonously increasing when dA - d3 extends left to 13,; or right to r11;c (lines (3)-(4)). This implies that the value of dA — d3 must be between 13,; and ray, such that 2;, cBJ can be minimized. If j = k (line (5)), 13,,- and r3.)c are two end points in the same tuple 113*. By Equation 4.16, the cost function 63,]; is a constant when d A — dB is between 13,; and T'BJc. Therefore, this tuple is no longer considered in the rest steps (line (6)). If j 7’: k (lines (7)-(9)), the sum of c153,,- and 63’]: is considered. By Proposition 9, the sum function can be rep- resented by the piecewise linear function < min{rB,J-,lB,k},max{r3,,-,lg,k} > when dA — dB is between 113,,- and r3,» Finally, when there is one tuple left in step ( 12), the optimal value of dA — d3 can be chosen using Equation 4.16. 
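As an illustration of the merging process, the following Python fragment is a small sketch of the MPOA loop of Figure 4.6 (the data layout and names are ours); it takes the list of end-point pairs <l_{B,j}, r_{B,j}> and returns an interval from which any integer value of d_A - d_B is optimal in the unweighted case.

def mpoa(tuples):
    # tuples: list of (l, r) end-point pairs, one per reference "A <- B@s_j".
    # Returns (lo, hi): any integer d_A - d_B in [lo, hi] minimizes the sum
    # of the piecewise linear cost functions.
    T = list(tuples)
    while len(T) > 1:
        j = min(range(len(T)), key=lambda x: T[x][0])   # minimal left end point
        k = max(range(len(T)), key=lambda x: T[x][1])   # maximal right end point
        if j == k:
            T.pop(j)            # this tuple is constant on the remaining range; drop it
        else:
            lj, rj = T[j]
            lk, rk = T[k]
            for idx in sorted((j, k), reverse=True):    # remove both tuples
                T.pop(idx)
            T.append((min(rj, lk), max(rj, lk)))        # merge them (Proposition 9)
    return T[0]

# For c_{Y,1} = <0, 2> and c_{Y,3} = <4, 7> this returns (2, 4), matching
# the discussion of Figure 4.4.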
The MPOA algorithm terminates in m — 1 steps of the while loop because one tuple is deleted from T at each step (lines (6) and (8)). By its construction, the MPOA algorithm does find the optimal dA — dB. If the heap-sort algorithm is used to search the maximum and minimum end points (lines (3) and (4)), the time complexity of the MPOA algorithm is O(mlog(m)). 111 No new end point can be generated in the MPOA algorithm, though new tuples may be generated (line (9)). This important feature implies that given a set of CB’J' =< IB,j,TB’J‘ > for 1 S j S m, there exists an end point to which dA — d3 can be assigned such that 23-":1 C3“, is minimized. This fact results in a straight forward algorithm to find the minimal value of 237;, 63“,“: assign each end point to dA — dB and find the one which generates the minimal-cost. Proposition 10 Given a set of CB’J' =< (Ba-,rBJ- > for 1 S j S m, there exists an end point to which dA — dB can be assigned such that 237;, cBJ is minimized. As mentioned in the previous section, the access offset can have different weights due to various array sizes. For example, in the following program structure, the cost FORALL(2’1 = 0 : 10) 31: X0.) = Y(i1,0) + Ya, +1,0)+ Y(z‘1 + 2,0) FOR(2'2 = 0 : 10) 32: X(21) = X(21) + Y(21 + 4, 22) + Y(21 + 5, 22) + Y(21 + 7,22) END FOR END FORALL of neighboring communication is cy,1 + 10 x Cy’g where 2 — (dx - dy) when (dx — dy) < 0 CY,1 = 2 when 0 S (dx — dy) S 2 (dx — dy) when 2 < (dx — dy) (7 — (dx — dy)) when (dX — dy) < 4 CY,2 = 3 when 4 S (dx — dy) S 7 ((dx — dy) — 4) When 7 < (dx — dy) The cost eyg is weighted by 10 since a column of remote Y elements are accessed 112 in the innerloop 32. Therefore, the problem of finding the optimal dA — dB in the presence of weighted cost situation can be formalized as follows: Assume that there are m distinct references “A «— B@5j” where cBJ- =< IB,J-,r3,j > for 1 S j S m. Each ch- is weighted by w,. The optimal dA — d3 should minimize the value of 22;, wj x cBJ. One approach to resolve this alignment problem is to use the MPOA algorithm. In line (1), let T consist of 111,- identical copies of tuple < l3,,-,r3,j > for 1 S j S m. Though it may take excessive time if w,- is large, the MPOA algorithm guarantees that Proposition 10 still holds regarding to the weighted cost of neighboring commu- nication. Therefore, we can assign dA — dB to be each distinct end point in all access offset tuples and find the one with the minimal-cost. 4.4 Spanning-Tree Offset Alignment Algorithm 4.4.1 Offset Reference Graph The problem of offset alignment can be modeled by oflset reference graph (ORG). Given a program structure in which the alignment matrix of each array is pre- determined, an ORG G = (V, E) is constructed as follows. An array is represented by a node in V. For an array which has multiple instances referenced in one or more F ORALL assignments, there is only one corresponding node in the ORG. A reference which is free of reorganization communication is presented by an edge in E. The edge connects two nodes which represent two arrays specified by the corresponding reference. There is only one edge connecting the LHS array A and the RHS array B for all B instances which have the same access matrix and are referenced in the same statement. A ORG is undirected. Since offset alignment is studied independently for each aligned-base group, different ORGS are used to model different aligned-base groups. 
For simplicity, the examples used in the rest of this chapter only contain a 113 single aligned-base group and there is only one ORG for each program structure. FORALL01=O:9) 31: A(21)=X(21+2,Z2)+A(21+4) END FOR 82: A(21)= A(21)+ Z(21+ 3,22) + Z(21 + 5,32) END FOR FOR02=02U 34: Y(21) = Y(21) + Z(21 + 2, 22) + Z(21 + 4,22) END FOR END FORALL Figure 4.7. Example 13: A Livermore benchmark loop Figure 4.8 shows the ORG for Example 13 (Figure 4.7). In Figure 4.8, each array variable is represented by a node labeled by the variable name. Edges are constructed based on the references generated by each FORALL assignment in Example 13. For example, edge (A, X) connects nodes A and X due to reference “A (— X@31”. Edge (A, Z) connects nodes A and Z because of reference “A «— Z @32”. Note that reference “A 4— Z @32” represents the access to all distinct instances of array Z in statement 32. Therefore, there is only one edge incident with A and Z. Since there is one- to-one correspondence between an array and a node, terms “array” and “node” are used alternatively in the rest of this chapter. Similarly, since there is one-to—one correspondence between an edge and a reference, terms “edge” and “reference” are also used alternatively in the rest of this chapter. 114 Figure 4.8. The offset reference graph for Example 13 9 Consider reference “A +— B@sk’ with access offset tuple uggc. By Proposi- tion 4.16, the minimum cost of neighboring communication is zero if UBJC only contains one element. The minimum cost of neighboring communication is equal to T3,]: —l 3,], if u 3,}, contains more than one element and 13,1: is the left end point and ray; is the right end point. However, considering multiple references, there may not exist a solution of the alignment offsets such that the minimum cost of neighboring communication can be achieved for every reference. The reason is that the minimum-cost requirement imposed by one reference may conflict with that imposed by another reference, in particular, when the two edges representing these two references are involved in a cycle. For example, Figure 4.8 has a cycle A -—+ X —-> Y —> A. By Equation 4.16, since mm = {2}, the minimum cost for reference “A 4— X @31” is zero if the following equation is satisfied. dA—dx =2 (4.17) Since ux,3 = {—2}, the minimum cost for reference “Y +— X@s3” is zero if the following equation is satisfied. dy — dx = —2 (4.18) Since uA,3 = {—3, —1,1}, the minimum cost for reference “Y <— A@s3” is 4 if the 115 following inequality is satisfied. —3de—dAS1 (4.19) However, subtracting Equation 4.18 from Equation 4.17, we have (1,; — dy = 4 which does not satisfy Inequality 4.19. This implies that the minimum cost of neigh- boring communication cannot be attained for at least one reference among the above three. 4.4.2 Spanning Tree Offset Alignment Algorithms The conflict of minimum-cost requirement can only occur if an ORG consists of cycles. Intuitively, such conflict can be resolved by a spanning tree: offset alignment between two arrays is specified by the tree edge connecting these two arrays such that the minimum cost can be achieved on the reference represented by this tree edge. Each non-tree edge determines a unique fundamental cycle of the ORG with respect to the spanning tree. The minimum cost of neighboring communication may not be achieved for each non-tree edge. Each edge is weighted. 
Since loop 32 in Example 13 is a sequential loop, the cost of neighboring communication for reference “A +— Z” is weighted by 10, the number of elements in a row of two-dimensional array Z. Consequently, edge (A, Z) is weighted by 10. For simplicity, the weight on edge (A, B) is denoted as wA,B. In Figure 4.8, wAz = 10, wa = wy,z = 2, and wxy = wA,y = 1. Given a choice between two edges, the tree edge would be chosen as the one with higher weight. As a result, the spanning tree for offset alignment would be chosen as a maximum-weight spanning tree (MWST). 116 Similar to the analysis in section 3.4 for the MICS algorithm, the MWST algo- rithm may not generate offset alignment which offers the minimum overall cost of neighboring communication regarding to all references in a program structure. For Example 13, the maximum-weight spanning tree is shown in Figure 4.9(a). Alignment offsets are selected such that dA — dz = 3, dy — dz = 2, and d4 — dx = 2. Thus, the minimum cost requirements of neighboring communication imposed by edges (A, Z), (Y, Z), and (A,X) are satisfied, respectively. Without loss of generality, let dA = 0. We have dz = —3, dy = 2, and d X = —2. Therefore, by Proposition 7, the overall cost of neighboring communication is 42 regarding to all five references in Example 13. However, the spanning tree alignment in Figure 4.9(b) generates lower overall cost of neighboring communication. In Figure 4.9(b), alignment offsets are selected such that dA — dz = 3, dy — d4 = 0, and d4 — dX = 2. Thus, the minimum cost requirements of neighboring communication imposed by edges (A,Z), (Y, A), and (A,X) are satisfied, respectively. Without loss of generality, let d4 = 0. We have dz = +3, dy = 0, and dx = —2. Therefore, by Proposition 7, the overall cost of neighboring communication is only 40. /'\ Q 2 \ 2 \ [)6 1 ( i) l X) l ( Z) l 2 l 2 "\fl , _lV / (g (y) (a) Maximum-weight spanning tree (b) Minimum-cost spanning tree Figure 4.9. The offset reference graph for Example 13 In this section, we propose an efficient spanning-tree offset alignment (STOA) 117 algorithm which is an improvement of the MWST algorithm. Given an ORG G = (V, E), the STOA algorithm can be formalized in Figure 4.10. (1) Choose an arbitrary array variable A Let T = {A} (2) while T ¢ V do (3) Let Q1 be the set of single-degree neighbors of T (4) if Q1 76 45 then (5) Find a node B in Q1 such that there exists an edge (B, H) where H E T and 1123,” = min{XeQ1,Y6T} wx,y (6) Let d3 — d” = 1H,): where edge (B, H) represents reference “B (— H@3k” and [H’k is the left end point in 2112],]; (7) T = T U {B} (8) else (9) Let Q2 = V — T (10) for each node X in Q2 do (11) Let wx = 2011710)“)! (12) end for (13) Find a node B in Q2 such that 203 = max{XeQz} wx ( 14) Use Proposition 10 to determine the value of d3 which achieves mlnde-dA} Z:{(B,H)EE,HeT, and (H,B) represents “Ho—B@s;’} CB’J' (15) T = T u {B} (16) end if (17) end while Figure 4.10. The spanning-tree offset alignment algorithm In Figure 4.10, the single-degree neighbor and the multi-degree neighbor is defined as in section 3.4. If Q1 is not empty, like the MWST algorithm, the tree edge is selected as an edge with the largest weight among all the edges which are incident with one node in Q1 and another node in T (lines (4)-(7)). If Q1 is empty but Q2 is not, we find the next tree edge as follows. First, w x, the sum of the weight over all edges connecting X and another node in T, is computed for each node X in Q2 (lines 118 (10)-( 12)). 
Then the node B in Q2 with the maximum sum of the weight is selected to be the next tree node (line (15)). The value of alignment offset (13 is determined by using Proposition 10. By this construction, the STOA algorithm is superior to the MWST algorithm. \Al \5) ( A/l (So 1 (z: (x: 1 <1 2) (3;, 1 z\ 1 2 1 ’ 2 1 2 ,J. _ _ , (’Y) (X) (Y Figure 4.11. Resolving offset alignment for Example 13 Figure 4.11 shows how the STOA algorithm works to resolve offset alignment for Example 13. In Figure 4.11, array A is selected as the template (line(1)). Note that the selection is arbitrary. The values of other alignment offsets will be in the form of : (L; + k where k can be any integer. T = {A}. The first phase is shown in Figure 4.11(a). Since all X, Y, and Z are single-degree neighbors of T, Q1 2 {X, Y, Z}. Since it has the maximum weight among edges (A,Y) and (A, Z), edge (A, Z) is selected as the tree edge (line (5)). Since edge (A, Z) represents reference “A +— Z@32” where 112,2 = {3,5}, (1,4 — dz = 3 (line (6)). Note that line (6) only shows one case. In another case, edge (B, H) may represent reference “H +— B@sk”. As a result, dB should be determined as (1;; —- d3 = 13),. The second phase is shown in Figure 4.11(b). In the second phase, T = {A, Z}. Therefore, node X becomes the only single-degree neighbor of T. Similar to the first phase, we have dA — dx = 2. 119 The last phase is shown in Figure 4.11(c). In the last phase, T = {A, X, Z}. Node Y is multi-degree neighbor of T. The cost functions for references “Y +— X @33”, “Y (— A@33”, and “Y «— Z@s4” can be represented by piecewise linear functions cx,3 =< —2 >, 6,43, =< —3,1 >, and czA =< 2,4 >, respectively. Note that ex; =< —2 > is a function of dy — dx. Since dX = dA — 2 (step (b)), cx,3 can be rewritten as < O > with respect to dy — (1,4. 62,4 =< 2,4 > is a function of dy — dz. Since dA — dz = 3 (step (a)), 62.4 can be rewritten as < —1,1 > with respect to dy — d A. Therefore, using Proposition 10, we conclude that the minimal value of the sum cz,4 + CA3 + cx,3 can be achieved when dy — (1,4 = 0 (lines (14)). The values of T, Q1, and Q2 in each phase are illustrated in Table 4.1. Table 4.1. Values of T, Q1, and Q2 in executing the MICS algorithm for Example 13 phases T Q1 Q2 offst alignment (3) {A} {X,Y,Z} <15 51.4 —dz = 3 (b) {M} {Z} {Y} dx = d. — 2 (C) {A,X,Y} <15 {Y} dY - dA = 0 In the STOA algorithm, each edge in the ORG G is referenced only once. If the heap-sort algorithm [79] is used in lines (5) and (13), the time complexity of finding the maximum-weight edge can be reduced to 0(loglEl) where IEI is the number of edges in DRG G = (V, E). As a result, the time complexity of the STOA algorithm is 0(|E|log|E|). 120 4.5 Optimizing RHS Expression Evaluation 4.5.1 RHS Expression Evaluation Optimization Similar to the base alignment analysis, neighboring communication can be minimized by an optimal evaluation tree in which an intermediate result may be evaluated on a remote processor instead of the local processor which owns the LHS operand. The limitation of the owner-computes rule can be exceeded by executing different parts of a FORALL assignment in distinct processors. We assume the original program has already been pre-processed by transforming each original FORALL assignment into an equivalent set of FORALL assignments, each of which has the RHS expression operated by one kind of associate and commutative operations. FORALL(z'1 = o : 9) 31: 14(21): B(ZI)*X(21)*Y(21) END FORALL Figure 4.12. 
Example 14: A Livermore Kernel 7 loop Consider Example 14 in Figure 4.12. In Example 14, offset alignment are pre- determined such that d A = d3 = d X = dy. By Equation 4.16, the cost of neighboring communication for all references in statement 31 is 0. Therefore, statement .91 is free of neighboring communication. However, such an offset alignment decision makes ev- ery reference in statement .32 not free of neighboring communication. Consider how to minimize the cost of neighboring communication in statement 32. Since the interme- diate result can be generated by a remote processor rather than the one which onws the LHS operand, the key issue is to find what operands should be operated together 121 8(5) 1 m(il+1) 2/’ \\0 :45) rr1(i,+3) A(i,+1) f O i I”! \ B(i,) X(i,) Y(i1) X(il+3) Y(ii+4) statement 51 statement 8 2 Figure 4.13. The optimal evaluation trees for Example 14 on what processor? Figure 4.13 shows an optimal evaluation tree for statement .92. In Figure 4.13, each reference is represented by an arc. The cost of neighboring commu- nication required for each reference is represented by the weight on the corresponding arc. X (i1 + 3) and Y(z'1 + 4) are first added and the result is saved to a temporary variable TT1(i1 + 3) at the remote processor which owns X (i1 + 3). dTTl = dx. The cost of neighboring communication for reference “TTl (— Y” is 1. TT1(i1 + 3) and A(il + 1) are then added and the result is saved to another temporary variable TT2(2’1 + 1) at the remote processor which owns A(z'l + 1). dTT2 = dA. The cost of neighboring communication for reference “TT2 (— TT 1” is 2. Finally, TT2(i1 + 1) is assigned to B(z'1). The cost of neighboring communication for reference “B +— TT2” is 1. Therefore, the optimal evaluation for Example 14 can be written as follows: FORALLUI=0:9) 31: A(i1)= B(z'1) an: X(z'1) * Y(i1) 3232 8(21) = TT2(21 + 1) END FORALL 122 4.5.2 Post-Alignment Optimization The study of Example 14 reveals two important phases in determining the optimal expression evaluation with regard to multiple FORALL assignments. One phase is the offset alignment phase. By Proposition 6, no cost of neighboring communication is involved in a RHS expression evaluation if an appropriate offset alignment is used. For example, in Example 14, assignment 31 is free of neighboring communication when d A = d3 = dx = dy. However, for multiple assignments, neighboring communication may not be fully avoided no matter how good offset alignment is. Therefore, the other phase is to minimize the cost of neighboring communication in each statement after alignment offset for each array is determined. This phase is known as post-alignment optimization. In Example 14, the post-alignment optimization for assignment 32 is shown in Figure 4.13. These two phases are not isolated. By the definition, the post-alignment optimization is carried out after the offset alignment analysis is done. On the other hand, however, the decision of offset alignment depends on the accurate cost estimation provided by post-alignment optimization techniques. We study the post-alignment optimization technique as follows. Proposition 11 Assume that in a F ORALL assignment, the RHS expression is oper- ated by one kind of associate and commutative operations. Ao(i1 + r0) is referenced on the LHS. All instances referenced on the RHS are A1(i1 +r1), A2(i1 +r2), ..., and Am(i1 + rm). Assume that dAo, dA“ dA2, ..., and (1.4". are pre-determined. Assume that the sequence ofro+dAo,r1 + dA, , r2 +dA,, . . . 
, rm + (1.4," is in monotonous (either increase or decrease) order. Therefore, the optimal evaluation of the assignment is as follows: 1. Begin with Am(i1 +rm). Operate Am(i1 +rm) with Am_1(i1 +rm_1) and save the intermediate result in the temporary variable TTm_1(i1 + rm_1) at the processor which owns Am_1(i1 + rm_1). 123 2. Operate TTJ-(il +r,-) with Aj_1(i1+r,-_1) and save the intermediate result in the temporary variable TTJ-_1(i1 + rJ-_1) at the processor which owns Aj_1(i1 + rJ-_1) for 2 Sj S m — 1. 3. Assign TT1(i1 + r1) to A0(i1 + r0). It is easy to prove that Proposition 11 is true. Consider the cost of neighboring communication involved in reading Am(i1 + rm). Since (rm + (1.4m) — (r0 + dAo) > r,- + dAJ. — (r0 + dAo) for (1 Sj S m - 1), the distance between Am(i1 + rm) and the LHS operand is the longest among that between any other RHS operand and the LHS operand. Therefore, we hope that Am(i1 + rm) can be operated with another operand which is closer to the LHS operand. On the other hand, the cost of moving Am(i1 + rm) to such an operand should be as small as possible. Since the closest operand to Am(i1 + rm) is Am_1(i1 + rm_1), Am_1(i1 + rm_1) becomes the best candidate. Repeating the similar approach, the optimism in Proposition 11 can be proved. In Proposition 11, the symbols A0, A1, A2, ..., and Am are not required to be pairwise distinct. Proposition 11 still holds if an arbitrary subset of them represents the same array. For example, in the following FORALL assignment, assume d X = dy = 0. FORALL(2‘1 = 0 : 9) 31: X(i1) = Y(i1 — 3) + Y(z'1 — 1) + Y(z'1 +1) + X(i1 — 2) + X(z'1 + 2) +X(i1 + 6) END FORALL Figure 4.14 shows the result of the post-alignment optimization for the above FORALL assignment. In Figure 4.14, each reference is represented by an arc. The cost of neighboring communication required for each reference is represented by the 124 weight on the corresponding arc. Note that the sequence of instances with respect to the monotonous order of access offset is {Y(i1 — 3), X(i1 — 2), Y(i1 — 1), X(i1), Y(i1 + 1), X(i1 + 2), X(i1 + 6)}. Since the LHS instance X(i1) resides in the middle, the se- quence is separated into two subsequences: {Y(i1 — 3), X(i1 — 2), Y(i1 — 1), X(i1)} and {X(i1), Y(i1 +1), X(i1 +2), X(i1 +6)}. Proposition 11 works on each of subsequences as shown in Figure 4.14. Note that the unit cost of neighboring communication must have the same weight with regard to all arrays referenced in the same FORALL as- signment. Therefore, Proposition 11 holds with the consideration of the weighted cost. i 1 r X( l)‘.\\l Tr2(il “—1) 1730,“) ,//l ‘\\l 0 0/] \-\\1 /. "’ .. , / \\ n10, -2) Y( il —1) Y( i, +1) 'I'I‘4(il+2) I 15" \\ 0 0/“ \ 4 \\ /" \\ Y( H -3) X( i] -2) X( il+2) X( i, +6) Figure 4.14. An example of post-alignment optimization 4.5.3 Alignment Graph Like base alignment, alignment graph (AG) is used to model the offset alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, G = (Va,V,, E). An array is represented by a node in Va. A statement is represented by a node in V,. A undirected edge in E connects an array A and a statement 3), if A is the LHS array. A undirected edge in E connects an array B and a statement 5), if DBFBJc = DAFAJc where A is referenced on the LHS and B 125 is referenced on the RHS in statement 3),. There is only one edge connecting the LHS array A and the RHS array B for all B instances which have the same access matrix and are referenced in the same statement. 
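To make the bipartite structure concrete, the following Python fragment sketches one possible way to build an AG from a list of references; the record layout and names are assumptions of this sketch, not the representation used by the compiler.

def build_alignment_graph(references):
    # references: iterable of (lhs, rhs, stmt, offsets) records, where
    # `offsets` is the access offset tuple of the RHS array in that
    # statement; only references free of reorganization communication
    # (D_B F_{B,k} = D_A F_{A,k}) are expected here.
    array_nodes, stmt_nodes = set(), set()
    edges = {}                        # (array, stmt) -> merged offset tuple
    for lhs, rhs, stmt, offsets in references:
        array_nodes.update((lhs, rhs))
        stmt_nodes.add(stmt)
        edges.setdefault((lhs, stmt), ())          # edge for the LHS array
        key = (rhs, stmt)                          # one edge per (array, stmt) pair
        edges[key] = tuple(sorted(set(edges.get(key, ())) | set(offsets)))
    return array_nodes, stmt_nodes, edges

# For Example 13 this yields array nodes {A, X, Y, Z}, statement nodes
# {s1, ..., s4}, and edges such as (Z, s2) carrying the merged tuple (3, 5).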
Figure 4.15 shows the AG for Example 13 (Figure 4.7). Figure 4.15. The AG for Example 13 4.5.4 AG-Based Offset Alignment Algorithm The AG-based offset alignment (AGOA) algorithm is shown in Figure 4.16. In an AG G = (Va,V,, E), there is a set, denoted as Qk, associated with each node 31C in V,. Initially, each set Q)c is empty. During the execution, each Q;c will contain elements in access offset tuple for each array which is referenced in statement 31c and whose alignment offset has been determined. In Figure 4.16, g4,” represents an element in the access offset tuple u,”c = {g1,A,k,g1,A,k,...,gh,,4,k} (lines (5) and (15)). The first array selected in line (1), say A, serves as the template. The value of other alignment offset is in the form of i (I); + k where k is any integer. Set T contains all choices of : dA + k for every other array to select as alignment offset. Given an array B, dB is chosen among all 126 elements in T such that the sum of the cost of neighboring communication over each FORALL assignment where B is referenced can be minimized (line (13)). The cost of neighboring communication in executing each FORALL assignment is estimated using Proposition 11. If the heapsort algorithm [79] is used in finding the minimum value, the time complexity in line (13) is 0(109 Z iuBJci) (B,sk)€E (1) for each node B in Va do (2) T = <15 (3) for each neighboring node 3;, (in V,) of B do (5) for each element 943,}; in up)c do 6) for each element q in Q)c do (7) if q — gag), is not in T then (8) T = T U {q - geek} (9) end if (10) end for (11) end for (12) end for (13) Assign dB to be the element in T such that the sum of the cost of neighboring communication over each FORALL assignment where B is referenced is the minimum (14) for each neighboring node 3), (in V,) of B do (15) for each 943,): in 113,}; do (16) if d3 + 943,], is not in Q, then (17) Q]: = Qk U {(13 + 943*} (18) end if (19) end for (20) end for (21) end for Figure 4.16. The alignment graph offset alignment algorithm 127 where IUBJcl is the number of distinct elements in u3,k. Thus, the time complexity of the AGOA algorithm is 0( Z luBJcllOg Z iuBJci) (8,3).) (8,3).) Table 4.2. Resolving offset alignment for Example 13 by using the AGOA algorithm Z Y X Q1 {dA,dA + 4} {dAadA + 4} {dAadA + 4} Q2 {dA} {dAadA +2} {dAidA +2} Q3 {dA—3,dA,dA+l} {dA-3,dA,dA+l} {dA—3,dA,dA+I} Q4 {} {dA-1,dA+1} {dA-1,dA+1} result dzsz—3 dy=dA+1 dX=dA+2 Table 4.2 illustrates how offset alignment in Example 13 is resolved by using the AGOA algorithm (Figure 4.16). The algorithm begins with A since A has the maximum degree of connectivity. Thus, A serves as the template. Arrays Z, Y, and X are chosen in sequential (line (1)). Table 4.2 shows the content of Q1, Q2, Q3, and Q; at the end of each iteration in the outmost for loop, where each of arrays A, Z, Y, and X is selected. When A is first select, sets Q1, Q2, Q3, and Q4 are all empty. Therefore, no operation is done in line ( 13). Since A is referenced in statements 31, 32, and 33, Q1 2 {dA,dA + 4}, Q2 2 {d4}, and Q3 = {dA — 3,dA,dA +1} by lines (14)-(20). Next, Z is selected. Since Z is referenced in statement 82 and ”(12,2 =< +3, +5 >, T = {dA — 3,d,4 — 5}. Note that though Z is referenced in statement 34, Q4 is still empty. Therefore, Q4 makes no contribution to T. By Proposition 11 (line (13)), either dA — 3 or (L4 — 5 can be the value of dz. Thus, we choose dz = dA — 3. Executing lines (14)-(20), Q2 :2 {dA,dA + 2} and Q4 2 {dA —— 1,dA + 1} since Z is referenced in both statements 32 and s4. 
128 Next, Y is selected. Since Y is referenced in statements 33 and 34, for executing lines (2)-(12), T = {dA — 3,dA —- 1, dA,dA +1}. Consider CA3, the cost of neighboring communication in accessing the RHS elements A(i1 — 3), A(il), and A(il + 1) in statement 33. By Proposition 11 (line (13)), 6,43 = 4 if dy is chosen as any element in T. Consider cz,3, the cost of neighboring communication in accessing the RHS elements Z(i1 + 2) and Z(i1 + 4) in statement 34. By Proposition 11 (line (13)), Cy,4 = 8 if dy is selected as dA — 3. CYA = 4 if dy is selected as either dA — 1, or dA, or dA + 1. Thus, we choose dy = dA + 1. Executing lines (14)-(20), Q3 and Q. are not changed. Last, X is selected. Since X is referenced in statements 31 and 33, by executing lines (2)-(12), T = {dA—2, dA—l, dA +2, dA +3}. Consider ch, the cost of neighboring communication in accessing the RHS elements X (i1 + 2) in statement 31. No extra cost of neighboring communication is paid if d X is chosen as either dA — 2 or dA + 2. If dx = dA — 2, elements A(il) and X(i1 + 2) are owned by the same processor. If dx = dA + 2, elements A(il + 4) and X(i1 + 2) are owned by the same processor and the RHS expression can be evaluated on the processor which owns A(il +4). Consider CX,3, the cost of neighboring communication in accessing the RHS elements X (i1 — 2) in statement 33. Similarly, no extra cost of neighboring communication is paid if dx is chosen as either dA — 1 or dA + 2. Therefore, dA + 2 is the only solution of d X (line (13))- 4.5.5 Performance Comparison Table 4.3 shows the comparison of the overall cost of neighboring communication generated by the MWST, STOA, and AGOA algorithms using Example 13. The AGOA algorithm outperforms the other two algorithms. In the AGOA algorithm, each element in access offset tuple is taken count into in performing the best post- alignment optimization. For example, when dx is chosen as dA + 2, in statement 129 Table 4.3. Comparison of the MWST, STOA, and AGOA algorithms using Example 13 Algorithms Overall cost of neighboring communication MWST 42 STOA 40 AGOA 36 33 elements X (i1 — 2, 0) and A(il) are owned by the same processor. However, this information is not utilized by the STOA algorithm since the STOA algorithm only considers the two end points in u A.3 =< —3, 0, l > and ignores the middle element 0. 4.6 Avoiding Redundant Communication When the same context of remote elements are referenced in more than one F DRALL assignment, the local processor should receive a single copy of these elements instead of multiple identical copies. The importance of such redundant communication avoid- ance has been shown in Section 3.6 regarding to the base alignment analysis. In this section, we concentrate on issues of avoiding redundant communication regarding to the offset alignment analysis. 4.6.1 Redundant Communication In Example 15 (Figure 4.17), array A is two-dimensional and other arrays are one- dimensional. A is partitioned in columns such that all elements in the same column are collapsed to the same processor. Only the outmost loop indexed by i1 is distributed across the processors. Assume that offset alignment is pre-determined such that (1,, 2 d3 = dz = 0. Since d3 = dz, the LHS element Z(i1) in statement 33 and the LHS element B ( il) in statement 34 are located to the same local processor. Therefore, 130 the two copies of A(i1+3, 0) referenced in both statements are identical. The messages transmitting remote element A(i1+3, 0) in statement 33 and 34 are redundant. 
Assume that processor P(i) owns elements A(m, : m,“ — 1,0). The remote A elements referenced by P(i) in executing statement 33 and 34 are A(m,+1 : mg+1+2, 0). Consider statement 32. Since dA = d3, the remote A elements referenced by P(i) in executing statement 32 are A(m,+1 : m,-+1 +4, 0). Therefore, the messages transmitting remote A elements in statements 33 and 34 are further redundant with the messages transmitting remote A elements in statements 32. Combining the remote A elements referenced by local processor P(i) in executing statements 32, 33, and 34, we conclude that A(m;+1 : m,“ + 4,0) are only remote A elements required to be transmitted from the remote processor to the local processor. FORALL(i1 = 0 : 9) 812 AU], 0) = X01) 822 3(21) = A(Zl + 4,0) + A(Z] + 5,0) 832 2(21) = A(Zl + 3,0) * 8(21) S42 8(21) =X(21)+A(21 +3,0) FORALL(i2 = O :9) S52 A(il,i2) = A(il,i2 — I) + t2 END FORALL 862 8(21) 2' 8(21) + A(Zl +1,i2) END FOR END FORALL Figure 4.17. Example 15: A Dhrystone benchmark loop As addressed in section 3.6, the identification of redundant neighboring commu- nication depends on both location and the context. The location requires that the senders of redundant communication must be the same, so do the receivers. The 131 context requirement implies that the context of the redundant remote elements must be the same. As introduced in Section 3.6, the concept of single assignment block can be used to identify the distinct contexts with regard to a particular array variable. In Example 15, there are two single assignment blocks with regard to A: {31, 32, s3, s4} and {35,36}. Therefore, the message transmission of A(m;+1,0) in executing state- ment .96 is not redundant with the message transmission of A(m,+1 : mg+1 + 4, 0) in executing statements 33, 33, and 34 because the context of array A is re—defined in statement 35. More accurately, redundant neighboring communication can be classified into two types: fully redundant messages and partially redundant messages. In respect to fully redundant messages, the messages carrying remote elements for executing two distinct FORALL assignments are identical, such as the messages transmitted in executing statements 33 and 34 (Example 15). Only one copy of the fully redundant messages needs to be transmitted and the cost of neighboring communication is equal to the size of the single message. In respect to partially redundant messages, the messages carrying remote elements for executing two distinct FORALL assignments overlap in a subset of common elements. However, the two messages are not identical, such as the messages transmitted in executing statements 32 and 33 (Example 15). In this case, the redundant elements should be only transmitted once. In Example 15, the result of combining remote A elements accessed in executing statements 32 and 33 can be treated as if instances A(il +3, 0), A(i1+4, 0), and A(i1+5, 0) are accessed in the same statement and thus the access offset tuple becomes < 3, 4, 5 >. The understanding of avoiding partially redundant messages is important in correctly estimating the cost of necessary neighboring communication. 132 4.6.2 Enhanced Alignment Graph As addressed in Section 4.6, the enhanced alignment graph (EAG) is used to identify single assignment block. The definition of an enhanced alignment graph (EAG) G = (Va,V,, E, A,) is similar to that of an alignment graph G = (Va,V,, E) except that an EAG has an extra arc set A,. The definition of Va, V3, and E can be found in Section 4.5. 
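The effect of merging partially redundant messages on the cost model can be sketched in a few lines of Python (the function below and its input layout are ours): for one array and one single assignment block, it combines the access offset tuples of the statements in that block into the single tuple that the necessary neighboring communication corresponds to.

def merge_block_offsets(offset_tuples):
    # offset_tuples: access offset tuples of one array within one single
    # assignment block, e.g. [(4, 5), (3,), (3,)] for statements s2, s3,
    # and s4 of Example 15.  Remote elements shared by several statements
    # are counted once, so the block behaves like a single reference.
    return tuple(sorted(set().union(*offset_tuples)))

# merge_block_offsets([(4, 5), (3,), (3,)]) == (3, 4, 5); the corresponding
# cost function is the piecewise linear function <3, 5>.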
There is an arc in A, from node 35 to node 3;, if an array defined in statement 3; is used in statement 3),. A distinct EAG is used for the offset alignment analysis in each aligned-base group. Figure 4.18 shows the EAG for Example 15 (Figure 4.17). In Figure 4.18, edges are weighted 1 if their weights are not labeled. //_.v —T K“ _ 1"]— 31...!” \E‘ , . (I "I - \\ ‘3 All “i i i . / I . . .\\ I \ .~.\ // / x i/ \ ~ ) / / ‘ Z ’ k\§/ / / - ‘ _/' I r 1’ __ ',,4-/—’—“\_\ f—‘x /“s* \~ ”’ .1- . s w 3 K 2 /> / K\ S3 _//) T\",—4f”"/ 10 \FL/ \_ <\\ / f —’—"Q ~“ ['14’,' \ //” I \. 1’/ , \ / ’ /,/ ‘4’ / \ //’ AT“\\~ , / TTV‘\\\T\‘ /’ I,” \ l l/ , f)>-.:'\ /"’;~.< \ I“ . 'V /x\\ // // ,»’ ” .// \\‘ ‘ \\ i / x I" ,/' ,.—.\ I/ ,z/ i ’1 / I i i r// // ,«’/// '(,/' i, ‘(\3§)\~ , ,///,x///-/ "'l :1“, '/ 4 2 4,2,4 (AH—_— ( s ) \\-’/ \\\~§.—// Figure 4.18. Enhanced alignment graph for Example 15 4.6.3 EAG-Based Offset Alignment Algorithm The EAG-based offset alignment (EAGOA) algorithm is shown in Figure 4.19. The EAGOA algorithm is an invariant of AGOA algorithm introduced in Section 4.5. Like the AGOA algorithm, in Figure 4.19, set Q)c is associated with each node 3), in V3. 133 Initially, each set Q)c is empty. During the execution, each Q)c will contain elements in access offset tuple for each array which alignment offset has been determined and is referenced in statement 3),. gm”, represents an element in the access offset tuple uAJ, = {g1,,4,k,g1,,4,k,. . .,gh,A,k} (lines (5) and (15)). The first array selected in line (1), say A, serves as the template. The values of other alignment offsets are in the form of i d A + k where k is any integer. (1) for each node B in Va do (2) T = 45 (3) for each neighboring node s)c (in V,) of B do (5) for each element guy, in 213,), do (6) for each element q in Q6 do (7) if q — gag), is not in T then (8) T = T U {q '- game} (9) end if (10) end for (11) end for (12) end for (13) Assign dB to be the element in T such that the sum of the cost of neighboring communication over each single assignment block with respect to B is the minimum (14) for each neighboring node 3): (in V,) of B do (15) for each 91.8.1: in uBJc do (16) If d3 + gag]; is not in Qk then (17) Q]: = Qk U {dB + 91,8,k} (18) end if (19) end for (20) end for (21) end for Figure 4.19. The enhanced alignment graph offset alignment algorithm The major difference between the EAGOA and AGOA algorithms is how to es- 134 timate the sum of the cost of neighboring communication (line (13)). In the AGOA algorithm, the sum of the cost is estimated based on each FORALL assignment where a particular array is referenced. However, in the EAGOA algorithm, the sum of the cost is estimated based on each single assignment block with respect to that par- ticular array. Therefore, the EAGOA algorithm prevents the selection of alignment offset from the adverse impact of redundant neighboring communication. This feature makes the EAGOA algorithm superior to the AGOA algorithm. Example 15 can be used to illustrate the idea. In Figure 4.18, A has the maximum connectivity degree. Therefore, A is selected first. Executing lines (14)-(20) in the EAGOA algorithm, we have Q1 2 {dA}, Q2 = {dA+4,dA+5}, Q3 = {dA+3}, Q4 = {dA+3}, Q5 = {dA}, and Q5 = {dA+1}. Assume that B is selected next. Since B is referenced in statements 32, s3, s4, and 36, T is only constructed based on Q3, Q3, Q4, and Q6. By lines (3)-(12), we have T = {dA +1, dA + 3, d A + 4, dA + 5}. 
Statements s2, s3, and s4 lie within the same single assignment blocks with regard to arrays A, X, and Z. When redundant neighboring communication is removed, the sum of the cost of neighboring communication in statements s2, s3, and s4 can be modeled by the piecewise linear function <3, 5> with respect to d_B - d_A. On the other hand, the cost of neighboring communication generated by statement s6 is modeled by the piecewise linear function <1> with respect to d_B - d_A. Since statement s6 is not in the same single assignment block in which statements s2, s3, and s4 are (with regard to array A), the sum of the cost in line (13) is equal to <3, 5> + <1> + <1>. Note that the cost of neighboring communication is weighted by 2 for statement s6. Therefore, the minimal value of <3, 5> + <1> + <1> is achieved when d_B = d_A + 1. In contrast, if the AGOA algorithm is used, the sum of the cost of neighboring communication is estimated based on each FORALL assignment. In other words, the cost sum would be c_{B,2} + c_{B,3} + c_{B,4} + c_{B,6}, where c_{B,2} = <4, 5>, c_{B,3} = <3>, c_{B,4} = <3>, and c_{B,6} = <1> + <1> with respect to d_B - d_A. This wrong cost estimation leads to the solution d_B = d_A + 3, which minimizes the value of <4, 5> + <3> + <3> + <1> + <1>.

The extra time complexity in the EAGOA algorithm is spent in identifying single assignment blocks for each array variable. Using the dataflow analysis technique proposed in [64], the extra time complexity is O(|V_a||V_s|). Thus, the overall time complexity is

O(|V_a||V_s| + Σ_{(B,s_k)} |u_{B,k}| log Σ_{(B,s_k)} |u_{B,k}|)

CHAPTER 5

Data Distribution

The task of data distribution is to determine the mapping of the template elements onto the processors. Data elements are assigned to the processor to which the mapped template element is assigned. The goal of data distribution is both to reduce neighboring communication and to increase processor workload balance. In this chapter, segment distribution is introduced as the best distribution pattern for data distribution within a single dimension. An optimal processor allocation algorithm is proposed to further minimize the overall cost of neighboring communication across multiple dimensions of the template array.

5.1 Segment Distribution

In this section, we study data distribution with regard to a single dimension of the template array. The issues involved in data distribution across multiple dimensions of the template array will be addressed in the next section.

5.1.1 The Limitation of Existing Distribution Types

We illustrate the limitation of existing distribution types by using Example 16 (Figure 5.1).

FORALL(i1 = 0 : n - 1, i2 = i1 : n - 2)
s1:   B(i1, i2 + 1) = B(n - 1 + i1 - i2, i2) ** 2
END FORALL

Figure 5.1. Example 16: An Electromagnetic benchmark loop

In Example 16, array B is distributed in columns. In other words, the distribution function δ_B is defined as t = δ_B(b1, b2) = b2, where T(t) is a template element and B(b1, b2) is a B element. By Proposition 1, assignment s1 is free of reorganization communication. However, neighboring communication cannot be avoided. By Proposition 5, the basic cost of neighboring communication generated by assignment s1 is 1. An effective element is a data element that is used or defined in assignment s1. Limited by the loop boundary, the effective elements form the upper triangle in the two-dimensional data space.
Thus, the density function ω_B can be defined as ω_B(t) = t, where T(t) is a template element.

Figure 5.2 shows three different types of data distribution for Example 16: cyclic distribution, block distribution, and segment distribution. In Figure 5.2, array B is declared as 11 x 11. Template T is a one-dimensional array with 11 elements. There are three processors available, denoted P(0), P(1), and P(2). Figure 5.2 shows the distribution patterns of both the template elements and the effective elements in array B.

Figure 5.2. Different types of data distribution for Example 16: (a) cyclic distribution, (b) block distribution, (c) segment distribution

Since the RHS computation is uniform among all the LHS elements, the workload on a processor is proportional to the number of LHS elements that the processor owns. For simplicity, ℓ_{P(i)}, the workload on processor P(i), is represented by the number of LHS elements that processor P(i) owns. The quality of processor workload balance imposed by a particular distribution pattern can be evaluated by the variance of workload,

max_{0 <= i, j <= q-1} { |ℓ_{P(i)} - ℓ_{P(j)}| }

The cost of neighboring communication generated on processor P(i) is denoted c_{P(i)}. The value of c_{P(i)} is determined by the number of remote elements that processor P(i) accesses in executing assignment s1. Since the basic cost of neighboring communication in Example 16 is 1, each processor accesses one solid line owned by a remote processor. However, since the lengths of the solid lines vary, the value of c_{P(i)} differs from one processor to another. Therefore, the cost of neighboring communication is measured by

max_{0 <= i <= q-1} { c_{P(i)} }

Figure 5.2(a) shows cyclic distribution, in which the template elements are distributed to the three processors in a round-robin fashion. Though it attains a good processor workload balance (|ℓ_{P(1)} - ℓ_{P(2)}| = 8), cyclic distribution generates an extremely high cost of neighboring communication (c_{P(0)} = 22). Figure 5.2(b) shows block distribution, in which the template elements are distributed to the three processors in a block fashion. The number of template elements each processor owns is equal. Though it attains a low cost of neighboring communication (c_{P(1)} = 8), the poor quality of processor workload balance (|ℓ_{P(2)} - ℓ_{P(0)}| = 20) is hard to accept. A new distribution pattern, segment distribution, is employed in Figure 5.2(c). In segment distribution, the template elements are consecutively assigned to the processors. However, the number of template elements owned by different processors can be varied.
In Figure 5.2(c), the number of template elements owned by each processor is arranged so that processor workload is balanced (|ℓ_{P(2)} - ℓ_{P(0)}| = 3) and the cost of neighboring communication is small (c_{P(1)} = 9).

The study of Example 16 reveals two important facts. First, as shown in Figure 5.2(a), with cyclic distribution a remote element is read for writing each local element in Example 16. This implies that the cost of neighboring communication generated by an inappropriate distribution pattern can be as large as the cost of reorganization communication generated by mismatched alignment bases. Second, the impact of neighboring communication can be significantly reduced by consecutively assigning template elements to processors. As shown in Figure 5.2(b) and (c), template elements are consecutively assigned to processors in both block distribution and segment distribution. As a result, the cost of neighboring communication is small: c_{P(1)} = 8 in Figure 5.2(b) and c_{P(1)} = 9 in Figure 5.2(c). The relative difference in the cost of neighboring communication between block distribution and segment distribution becomes smaller as the size of the arrays increases.

5.1.2 Segment Distribution

As discussed in the previous section, in segment distribution template elements are consecutively assigned to processors in order to minimize the impact of neighboring communication. The length of the segment assigned to each processor may vary. This feature gives segment distribution great flexibility to exploit the maximum processor workload balance. In data parallel programs, data elements are typically written by the processor which owns them. As a result, the workload assigned to a processor is determined by the LHS elements owned by that processor. Therefore, the distribution strategy for the LHS array elements of each assignment statement determines processor workload balance.

Given q processors in total, template array T(0 : n - 1) is partitioned into q pieces such that each piece T(m_i : m_{i+1} - 1) is owned by processor P(i), where 0 <= i <= q - 1, m_0 = 0, and m_q = n. Our goal is to find the values of the break points m_i such that the workload imposed by the elements in each segment is equal. This can be formalized as follows:

Σ_{t=m_0}^{m_1 - 1} ω_A(t) = Σ_{t=m_1}^{m_2 - 1} ω_A(t) = ... = Σ_{t=m_{q-1}}^{m_q - 1} ω_A(t)    (5.1)

where A serves as the LHS array in a FORALL assignment and ω_A(t) is the density function of A. In Equation 5.1, the density ω_A(t) only counts the number of LHS A elements mapped to the template element T(t). Strictly speaking, the number of LHS elements may not be proportional to the workload imposed by those elements. For example, in the Linpack TQL2 loop (Figure 5.3), array Z is referenced on the LHS. Since Z is a one-dimensional array, there is only one Z element mapped to each template element. However, because of the inner loop indexed by i2, the workload imposed by different Z elements is obviously different: nn - 1 inner loop iterations are executed for element Z(2), while only one inner loop iteration is executed for element Z(nn).

FORALL(i1 = 2 : nn)
s1:   Z(i1) = Z(i1 + 1)
      DO (i2 = i1, nn)
s2:      Z(i1) = Z(i1) + S * (H(i1, i2) + F * H(i1 - 1, i2))
      END DO
END FORALL

Figure 5.3. Linpack TQL2 benchmark loop

In order to precisely model the workload, we modify the definition of the density function ω_A as follows:

ω_A(t) = Σ_{δ_A(a) = t} π_A(a)

where T(t) is a template element, A(a) is a data element, and π_A(a) represents the computation estimate for defining the LHS element A(a).
In most scientific applications, the density function of a LHS array is typically a constant or an affine function of a single variable. Figure 5.4 shows a few common loop patterns extracted from scientific application programs. In Figure 5.4, A is a two-dimensional array and B is a one-dimensional array. The template array is one-dimensional, and A is distributed by columns. In Figures 5.4(d), (e), (f), and (h), the inner loop is sequential. In Figure 5.4(a), $w_A(t) = t$ for element T(t). In Figure 5.4(b), $w_A(t) = n-1-t$ for element T(t). In Figure 5.4(c), $w_A(t) = \min\{t,\, n-1-t\}$ for element T(t). In Figure 5.4(d), $w_B(t) = t$ for element T(t). In Figure 5.4(e), $w_B(t) = n-1-t$ for element T(t). In Figure 5.4(f), $w_B(t) = \min\{t,\, n-1-t\}$ for element T(t). In Figure 5.4(g), $w_B(t) = 1$ for element T(t). In Figure 5.4(h), $w_B(t) = n$ for element T(t). In Figure 5.4(i), $w_A(t) = n$ for element T(t).

For this reason, in this thesis we assume that $w_A(j) = aj + b$ for each template element T(j), where both a and b are fixed integers. Suppose there are a total of q processors. Template elements T(0 : n-1) are partitioned into q segments such that each segment $T(m_i : m_{i+1}-1)$ is owned by processor P(i), where $0 \le i \le q-1$, $m_0 = 0$, and $m_q = n$. The break points $m_i$ ($1 \le i \le q-1$) are chosen such that the workload imposed by each segment is equal. In other words,

$$\sum_{j=m_0}^{m_1-1} (aj+b) \;=\; \sum_{j=m_1}^{m_2-1} (aj+b) \;=\; \cdots \;=\; \sum_{j=m_{q-1}}^{m_q-1} (aj+b)$$

(a) FORALL(i1 = 0 : n-1)
      FORALL(i2 = 0 : i1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(b) FORALL(i1 = 0 : n-1)
      FORALL(i2 = i1 : n-2)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(c) FORALL(i1 = 0 : n-1)
      FORALL(i2 = i1 : n-1-i1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(d) FORALL(i1 = 0 : n-1)
      FOR(i2 = 0 : i1)
        B(i1) = ...
      END FOR
    END FORALL

(e) FORALL(i1 = 0 : n-1)
      FOR(i2 = i1 : n-2)
        B(i1) = ...
      END FOR
    END FORALL

(f) FORALL(i1 = 0 : n-1)
      FOR(i2 = i1 : n-1-i1)
        B(i1) = ...
      END FOR
    END FORALL

(g) FORALL(i1 = 0 : n-1)
      B(i1) = ...
    END FORALL

(h) FORALL(i1 = 0 : n-1)
      FOR(i2 = 0 : n-2)
        B(i1) = ...
      END FOR
    END FORALL

(i) FORALL(i1 = 0 : n-1)
      FORALL(i2 = 0 : n-1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

Figure 5.4. Some common loop patterns

We find these optimal break points by extending the density function $w_A$ to all real numbers in the range [0, n-1]; the extended density is $w_A(v) = av + b$ for any real v in [0, n-1]. The segment [0, n-1] is partitioned into q pieces $[v_i, v_{i+1}]$, where $0 \le i \le q-1$, $v_0 = 0$, and $v_q = n-1$. We need to find break points $v_i$ ($1 \le i \le q-1$) such that

$$\int_{v_0}^{v_1} (av+b)\,dv \;=\; \int_{v_1}^{v_2} (av+b)\,dv \;=\; \cdots \;=\; \int_{v_{q-1}}^{v_q} (av+b)\,dv$$

This implies

$$\int_{v_0}^{v_i} (av+b)\,dv \;=\; \frac{i}{q} \int_{v_0}^{v_q} (av+b)\,dv$$

that is,

$$\int_{0}^{v_i} (av+b)\,dv \;=\; \frac{i}{q} \int_{0}^{n-1} (av+b)\,dv$$

Solving the integrals, we get

$$\frac{1}{2} a v_i^2 + b v_i \;=\; \frac{i}{q}\left(\frac{1}{2} a (n-1)^2 + b(n-1)\right)$$

or

$$v_i^2 + \frac{2b}{a} v_i \;=\; \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)$$

Solving this quadratic equation for $v_i$ (taking the nonnegative root), we get

$$v_i \;=\; -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \qquad (5.2)$$

Since $v_i$ may not fall on an integer grid point, $m_i$ is chosen as an integer approximation of $v_i$:

$$m_i \;=\; \left\lceil -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \right\rceil$$
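As a concrete illustration of Equation 5.2 and the integer approximation above, the following sketch computes the break points for an affine density with a > 0. It is only a sketch; the function name and the clamping used to keep the break points monotone and in range are my own additions.

# Sketch of the closed-form break points of Equation 5.2 for w(j) = a*j + b, a > 0.
import math

def affine_breaks(a, b, n, q):
    """Return break points [m_0, ..., m_q] for template T(0 : n-1) on q processors."""
    K = (n - 1) ** 2 + (2 * b / a) * (n - 1)        # common constant in Equation 5.2
    breaks = [0]
    for i in range(1, q):
        v_i = -b / a + math.sqrt((b / a) ** 2 + (i / q) * K)
        breaks.append(min(n, max(breaks[-1], math.ceil(v_i))))   # m_i = ceil(v_i)
    breaks.append(n)
    return breaks

# Example: w(j) = j (a = 1, b = 0), 11 template elements, 3 processors.
print(affine_breaks(a=1, b=0, n=11, q=3))           # -> [0, 6, 9, 11]

With w(j) = j, eleven template elements, and three processors, this yields the break points [0, 6, 9, 11], i.e., segments of 6, 3, and 2 template elements carrying workloads of 15, 21, and 19, respectively.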
On the other hand, we also need to determine which processor a particular template element T(j) is assigned to. Suppose that T(j) is owned by processor P(i); then we should have $v_i \le j \le v_{i+1}$. Thus, by Equation 5.2, we have

$$-\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \;\le\; j \;\le\; -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i+1}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)}$$

This can be re-written as

$$\frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)} - 1 \;\le\; i \;\le\; \frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)}$$

In other words, the owner of T(j) can be obtained directly as

$$i \;=\; \left\lfloor \frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)} \right\rfloor$$

so the owning processor of any template element can be computed in constant time.
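Read as code, the owner lookup just derived is a one-line formula. The sketch below is an assumed helper (the name and the clamping of the last template element are mine), again for an affine density with a > 0; on the example used above it reproduces the same assignment of elements to processors as the break points [0, 6, 9, 11].

# Sketch of the constant-time owner lookup for w(j) = a*j + b, a > 0.
def owner_of(j, a, b, n, q):
    K = (n - 1) ** 2 + (2 * b / a) * (n - 1)
    i = int(q * ((2 * b / a) * j + j * j) / K)    # floor of the bound derived above
    return min(i, q - 1)                          # clamp j = n-1 onto the last processor

# Example: w(j) = j, 11 template elements, 3 processors.
print([owner_of(j, a=1, b=0, n=11, q=3) for j in range(11)])
# -> [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2]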