This is to certify that the dissertation entitled "Compilation-Time Data Decomposition Optimization for Data Parallel Programs" presented by Hong Xu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major professor

Date: November 9, 1994

COMPILATION-TIME DATA DECOMPOSITION OPTIMIZATION FOR DATA PARALLEL PROGRAMS

Hong Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1994

ABSTRACT

COMPILATION-TIME DATA DECOMPOSITION OPTIMIZATION FOR DATA PARALLEL PROGRAMS

By

Hong Xu

Data decomposition is critical to the performance of data parallel programs on scalable parallel computers. A data decomposition model can be viewed as a two-level mapping of array elements to abstract processors. Data alignment determines which array elements are aligned relative to one another, and data distribution resolves how the group of aligned arrays is distributed onto the processors. Depending on whether the alignment relationship holds across dimensions or within a dimension, alignment is classified into base alignment and offset alignment.
The purpose of base alignment is to reduce the amount of unstructured communication. In this research, we use the data reference graph model to describe the various reference patterns associated with each array and to resolve conflicting alignment requirements. An efficient spanning tree algorithm addresses the fundamental issues in base alignment. Base alignment is further studied with consideration of optimal expression evaluation and dataflow analysis. Efficient base alignment algorithms are proposed to reduce redundant communication and optimize the RHS expression evaluation. These contributions distinguish this research from related work.

The purpose of offset alignment is to reduce the amount of data shift movement. This thesis models the cost of data shift movement using a piecewise linear function. This cost model solves the accuracy problem in measuring the quantity of data shift movement, a problem left unresolved by other work in this area. Based on this cost model, an optimal post-alignment algorithm is first proposed to go beyond the limitation of the owner-computes rule and minimize the amount of data shift movement after offset alignment is determined. The data reference graph model is used to address the problem of offset alignment and to develop efficient spanning tree algorithms. The RHS expression evaluation and dataflow optimizations are incorporated into the proposed offset alignment algorithms.

The purpose of data distribution is to reduce the impact of data shift movement and increase processor workload balance. Segment distribution is proposed to resolve the conflict between reducing data shift movement and increasing processor workload balance with regard to a particular dimension of the template array. An optimal processor allocation algorithm is introduced to minimize the overall cost of data shift communication across multiple dimensions of the template array. The segment distribution and optimal processor allocation proposed in this thesis provide the best data distribution support for most data parallel programs.

To my parents

ACKNOWLEDGMENTS

I wish to thank my thesis advisor, Prof. Lionel Ni, for his considerable help and guidance with my thesis and my doctoral program. His patience in listening to initial versions of many of these results is especially appreciated. His suggestions and questions have greatly improved both the accuracy and the presentation of this thesis. Without his enthusiasm and encouragement, the completion of this thesis work would be impossible.

I would like to thank Prof. Tien Yien Li, Prof. Abdol Esfahanian, Prof. Philip McKinley, and Prof. Diane Rover, the members of my Ph.D. guidance committee, for their guidance and support during my thesis work. I would particularly like to express my appreciation to Prof. Diane Rover for her careful reading of this dissertation and her many useful comments.

Special thanks go to my parents for constant encouragement and support for everything I do. Lastly and most importantly, I would like to thank my wife, Yiru, who makes everything I do worthwhile.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Scalable Parallel Computer Architectures
    1.1.1 The NUMA Multiprocessors
    1.1.2 The Message-Passing Based Multiprocessors
    1.1.3 Workstation Clusters
  1.2 Data Parallel Programming Model
  1.3 Motivation and Problem Definition
  1.4 Objectives and Scope of Research
  1.5 Thesis Outline
2 Data Decomposition Overview
  2.1 Loop Characteristics
    2.1.1 Parallelizable Loop
    2.1.2 Perfectly Nested Loops
    2.1.3 Special-Purpose Loops
  2.2 Interprocessor Communication
  2.3 Processor Workload Balance
  2.4 The Model for Data Decomposition
    2.4.1 Formulation for Data Alignment and Distribution
    2.4.2 Layered Structure for Data Decomposition
    2.4.3 The Base Alignment Phase
    2.4.4 The Offset Alignment Phase
    2.4.5 The Data Distribution Phase
  2.5 Data Re-distribution and Data Re-Alignment
  2.6 Data Flow Analysis
  2.7 Related Work
    2.7.1 Data Parallel Languages
    2.7.2 Preference Graph Model
    2.7.3 Using Communication Cost Estimation
    2.7.4 Dynamic Programming Methods
    2.7.5 Linear Algebra Methods
    2.7.6 Parallelizing Loops with Data Dependence
3 Base Alignment
  3.1 Terminology
  3.2 Base Alignment for Single Reference
    3.2.1 Base Alignment Equation
    3.2.2 A Legitimate Solution of Alignment Matrix
    3.2.3 Solving Base Alignment Equation
  3.3 The Cost of Reorganization Communication
    3.3.1 Reorganization Communication
    3.3.2 The Weighted Cost
  3.4 Spanning-Tree Base Alignment Algorithm
    3.4.1 Data Reference Graph
    3.4.2 Spanning Tree Base Alignment Algorithms
    3.4.3 Experimental Results
  3.5 Optimizing RHS Expression Evaluation
    3.5.1 RHS Expression Evaluation Optimization
    3.5.2 Alignment Graph
    3.5.3 AG Base Alignment Algorithm
    3.5.4 Experimental Results
  3.6 Avoiding Redundant Communication
    3.6.1 Redundant Communication
    3.6.2 Enhanced Alignment Graph
    3.6.3 EAG Base Alignment Algorithm
4 Offset Alignment
  4.1 Offset Alignment for Single Reference
    4.1.1 Offset Alignment Equation
    4.1.2 Multiple Aligned Base Groups
    4.1.3 Calculating Alignment Offset
  4.2 The Cost of Neighboring Communication
    4.2.1 The Basic Cost
    4.2.2 The Weight of the Basic Cost
  4.3 The Impact of Access Offset
    4.3.1 Piecewise Linear Cost Function
    4.3.2 Properties of Piecewise Linear Cost Function
  4.4 Spanning-Tree Offset Alignment Algorithm
    4.4.1 Offset Reference Graph
    4.4.2 Spanning Tree Offset Alignment Algorithms
  4.5 Optimizing RHS Expression Evaluation
    4.5.1 RHS Expression Evaluation Optimization
    4.5.2 Post-Alignment Optimization
    4.5.3 Alignment Graph
    4.5.4 AG-Based Offset Alignment Algorithm
    4.5.5 Performance Comparison
  4.6 Avoiding Redundant Communication
    4.6.1 Redundant Communication
    4.6.2 Enhanced Alignment Graph
    4.6.3 EAG-Based Offset Alignment Algorithm
5 Data Distribution
  5.1 Segment Distribution
    5.1.1 The Limitation of Existing Distribution Types
    5.1.2 Segment Distribution
    5.1.3 Multiple-Variable Density Function
    5.1.4 Experimental Results
  5.2 Virtual Processor Allocation
    5.2.1 Reducing the Overall Neighboring Communication
    5.2.2 Optimal Processor Allocation
    5.2.3 Performance Results
6 Conclusions and Future Research
  6.1 Research Contributions
  6.2 Directions of Future Work
BIBLIOGRAPHY

LIST OF TABLES

2.1 Related work in base alignment analysis
2.2 Related work in offset alignment analysis
2.3 Related work in data distribution analysis
3.1 Values of T, Q1, and Q2 in executing the MICS algorithm for Example 7
3.2 Resolving base alignment for Example 9
3.3 Resolving base alignment for Example 10
4.1 Values of T, Q1, and Q2 in executing the MICS algorithm for Example 13
4.2 Resolving offset alignment for Example 13 by using the AGOA algorithm
4.3 Comparison of the MWST, STOA, and AGOA algorithms using Example 13

LIST OF FIGURES

2.1 Pipeline computation and communication
2.2 The model for data decomposition
2.3 Layered structure for data decomposition process
2.4 Example 1: A NAS benchmark loop
2.5 Base alignment in Example 1
2.6 Example 2: A Purdue benchmark loop
2.7 Base alignment in Example 2
2.8 Example 3: A NAS benchmark loop
2.9 Offset alignment in Example 3
2.10 Data distribution in Example 3
2.11 Example 4: A Heatwave benchmark loop
2.12 Processor allocation in Example 4
2.13 A dynamic programming algorithm for data re-alignment and data re-distribution
3.1 Example 1 in single occurrence statements
3.2 The compactness of the image on T
3.3 Example 5: A Whetstone benchmark loop
3.4 The alignment for A and B in Example 5
3.5 Example 6: Inner product benchmark loop
3.6 Reorganization communication for D_X = (1, -1) and D_Y = (1, 1) in Example 1
3.7 Example 7: A Lapack benchmark loop
3.8 The DRG and base alignment for Example 7
3.9 The minimum-weight induced communication algorithm
3.10 Use the MICS algorithm to resolve base alignment for Example 7
3.11 Comparison of MWST algorithm and MICS algorithm on 16-node nCUBE-2
3.12 Example 8: A Splash benchmark loop
3.13 Optimal evaluation trees for Example 8
3.14 Example 9: An Electric benchmark loop
3.15 Alignment graph for Example 9
3.16 The alignment graph base alignment algorithm
3.17 Comparison of the MWST, MICS, and AGBA algorithms on 16-node nCUBE-2
3.18 Example 10: An Oceanwater benchmark loop
3.19 Enhanced alignment graph for Example 10
3.20 The enhanced alignment graph base alignment algorithm
4.1 Example 11: An Eispack benchmark loop
4.2 Example 12: A Weather-Climate benchmark loop
4.3 Neighboring communication in Example 3
4.4 The sum of c_{A,i} and c_{A,j}
4.5 The sum of c_{A,i} and c_{B,j}
4.6 The minimal-cost pairwise offset alignment algorithm
4.7 Example 13: A Livermore benchmark loop
4.8 The offset reference graph for Example 13
4.9 The offset reference graph for Example 13
4.10 The spanning-tree offset alignment algorithm
4.11 Resolving offset alignment for Example 13
4.12 Example 14: A Livermore Kernel 7 loop
4.13 The optimal evaluation trees for Example 14
4.14 An example of post-alignment optimization
4.15 The AG for Example 13
4.16 The alignment graph offset alignment algorithm
4.17 Example 15: A Dhrystone benchmark loop
4.18 Enhanced alignment graph for Example 15
4.19 The enhanced alignment graph offset alignment algorithm
5.1 Example 16: An Electromagnetic benchmark loop
5.2 Different types of data distribution for Example 16
5.3 Linpack TQL2 benchmark loop
5.4 Some common loop patterns
5.5 An example of no optimal distribution pattern
5.6 Comparison of processor workload balance between block distribution and segment distribution
5.7 Two different processor allocation patterns for Example 4
5.8 A Livermore kernel 18 benchmark loop
5.9 Comparison of different processor allocation strategies on a 64-node nCUBE-2

CHAPTER 1

Introduction

Sequential computers are approaching a fundamental physical limit, the speed of light, on their potential computational power. Such a limit cannot satisfy the computational power requirements of the so-called grand-challenge problems, including three-dimensional fluid flow calculations, real-time simulations of complex systems, and intelligent robots, which require over 1,000 times the computing power of the maximum-power uniprocessors. It has been widely recognized that parallel computing represents the only plausible way to continue to increase the computational power available to such grand-challenge applications.
Such parallel computers, also known as scalable parallel computers (SPCs), are designed to offer corresponding increases in the processing capability of the system, memory bandwidth, and total interprocessor communication bandwidth as the number of processors and the number of memory modules are increased.

However, programming on SPCs is much more complicated than programming on uniprocessor computers. The naive approach of exposing system architecture features directly to programmers has proved to be time-consuming, tedious, and error-prone due to the difficulty of compromising among the various conflicting requirements arising from concurrent processing and distributed data allocation. As a result, an easy-to-use and highly efficient programming model is critical to the success of SPCs. The data-parallel programming model has been accepted as one of the most efficient programming models provided for SPCs. The basic idea of a data-parallel program is to decompose the global data objects across the processors, each of which executes the same program structure on the data objects it owns. Based on this single-program-multiple-data (SPMD) mode of computation, data decomposition determines both the process scheduling and the interprocessor communication. Selecting an efficient data decomposition is one of the most important intellectual steps in developing quality data-parallel programs.

The purpose of this dissertation research is to make an in-depth study of optimizing data decomposition for data-parallel programs. The proposed algorithmic results provide a framework for data decomposition optimization used by parallelizing compilers in optimizing user-written data-parallel programs at compilation time. The theoretical results obtained in this thesis are derived from the fundamental architecture features shared by existing SPC platforms. It is not the intention of this thesis to address specific issues involved in a particular machine architecture. Instead, our proposed framework takes a machine-independent approach as the focus of our study.

In this introductory chapter, we provide a brief review of the fundamental features of SPC architectures, present an overview of the data-parallel programming model and its implementations, and describe the objectives and scope of this research.

1.1 Scalable Parallel Computer Architectures

There are two major classes of parallel computers: shared-memory multiprocessors and message-passing multiprocessors. Differing in how the memory is shared and distributed among the processors, shared-memory multiprocessors can be further classified into uniform-memory-access (UMA) multiprocessors and non-uniform-memory-access (NUMA) multiprocessors. In UMA multiprocessors, all processors have equal access time to all memory words. This feature means that performance cannot benefit from data locality with regard to data allocation in the memory. Moreover, limited by current technology and cost, the delayed memory access time due to the added interconnection network significantly degrades overall performance as the number of processors and memory modules increases. For this reason, it is believed that UMA multiprocessors are not a good candidate for SPCs.

1.1.1 The NUMA Multiprocessors

Typically in NUMA multiprocessors, each processor has a local memory. The collection of all local memories forms a global address space accessible by all processors.
However, the memory access time varies with the location of the memory word. It is quicker for a processor to access its own local memory. Access to remote memory attached to other processors takes longer due to the added delay through the interconnection network. The BBN TC-2000 [1] is an example of a commercial machine using the NUMA architecture. The TC-2000 can be configured with up to 512 M88100 processors interconnected by a multistage cube network. In the TC-2000, a remote memory access is about five times as expensive as a local memory access. Besides the TC-2000, the Cray T3D [2] and Convex Exemplar [3] are two other important commercial machines using the NUMA architecture.

1.1.2 The Message-Passing Based Multiprocessors

Distributed-memory multiprocessors are organized as ensembles of nodes, each of which is a programmable computer with its own processor, local memory, and other supporting devices. As the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase. Nodes communicate with each other by passing messages through the interconnection network. For this reason, distributed-memory multiprocessors are also known as message-passing based multiprocessors. A novel switching technique, known as wormhole routing [4], uses a cut-through approach to reduce the impact of the physical distance between two communicating nodes. Commercial machines characterized as message-passing based multiprocessors include the Intel Paragon (successor to the Touchstone DELTA [5]), the nCUBE nCUBE-2 [6], the Meiko CS-2 [7], and the TMC CM-5 [8]. In those systems, however, the communication latency is orders of magnitude more expensive than a local memory access.

1.1.3 Workstation Clusters

Recently, the evolution of fast LAN-connected workstation clusters has created yet another kind of SPC. High-performance workstation clusters interconnected through high-speed switches have been advocated in place of special-purpose multicomputers. The IBM SP-1 [9] development has already moved in this direction. In the SP-1 configuration, a collection of IBM RS/6000s is interconnected through the IBM high-performance Vulcan switch [10]. Though great efforts are being made to reduce the communication overhead involved in operating systems and communication protocols [11, 12], message transmission is still much more expensive than local memory access in workstation clusters.

Overall, the hierarchy of local memory and remote memory with a significant difference in access latency characterizes the fundamental feature of current SPCs. Exploiting data locality is critical to the performance of message-passing based multicomputers, NUMA multicomputers, and workstation clusters. The data objects referenced by each processor should be arranged to reside in the local memory of that processor in order to avoid the communication overhead added by remote data access. The importance of data locality motivates the research of this thesis. It should be emphasized that it is not the intent of this thesis to consider the detailed implementation of remote data access, such as remote memory reference through the crossbar in NUMA multicomputers or message transmission through the interconnection network in distributed-memory multiprocessors. Our research is based on an abstract memory hierarchy model which characterizes existing SPC architectures.
1.2 Data Parallel Programming Model

The data parallel programming model has been recognized as one of the most successful programming models for SPCs. Data objects are pre-distributed across the local processor memories before a data parallel program is executed. During the execution, each processor runs an identical copy of the original program but writes only to the data objects owned by that processor. As a result, instruction partition in data parallel programs is fully determined by data partition. With a concept similar to the lockstep operations of the SIMD programming model, data parallel programs are easier to write, easy to port, and much more scalable. However, unlike the SIMD programming model, data parallel programs emphasize medium-grain parallelism and synchronization at the subprogram level rather than at the instruction level.

It has been commonly accepted that programming languages supporting a global name space are much easier to use than programming languages supporting separate name spaces, in particular for those scientific application programmers who wish to write their programs in some dialect of Fortran when using SPCs. One of the greatest advantages of the data parallel programming model is its natural support of a global name space at the language level. Many successful data parallel languages, including Fortran 90 [13] and High Performance Fortran (HPF) [14], support a global name space.

Though programmers may give hints about how to decompose the data objects using specific language extensions, the compiler is responsible for arranging and optimizing data allocation across the processor local memories. The type of memory space addressing is hidden from the viewpoint of programmers. If the underlying memory architecture serves a global address space, the reference to a data object owned by a remote processor is simply achieved by a remote memory access. If the underlying system architecture supports only separate address spaces, the reference to a data object owned by a remote processor is implemented by a proper message passing operation inserted at compilation time. Such a data parallel programming model maximizes programmability.

1.3 Motivation and Problem Definition

There are two levels of data parallel activities in data-parallel application programs. First, there is the question of how arrays should be aligned with respect to one another, both within and across array dimensions. This is called the problem mapping [15, 16] induced by the structure of the underlying computation. It represents the minimal requirements for reducing data movement for the program, and is largely independent of any machine considerations. The alignment of arrays in the program depends on the natural fine-grain parallelism defined by individual members of data arrays.

Second, there is the question of how arrays should be distributed onto the SPCs. This is called the machine mapping caused by translating the computation structure onto the finite resources of the machine. Data distribution provides opportunities to reduce data movement, but must also maintain load balance. The distribution of arrays in the program depends on the coarse-grain parallelism defined by the SPCs.

The objective of data decomposition is to minimize data movement across distinct processor local memories and maximize the processor workload balance among all processors. The first effort in reducing data movement is to optimize data alignment. There are two levels of alignment: base alignment and offset alignment. Base alignment determines the alignment across dimensions of various arrays. Offset alignment specifies the alignment within each dimension of various arrays.

Traditionally, data arrays are decomposed dimension by dimension, and base alignment is dimension-based alignment. However, the limitation of such a dimension-based alignment prevents the optimizer from exploring the inherent parallelism available in many programming structures. Recently, this limitation has been removed by partitioning a data array into a family of parallel hyperplanes in the data space defined by the array [17, 18]. Array elements residing on the same hyperplane are allocated to the same processor local memory, so that any unstructured data reference among elements on the same hyperplane is free of interprocessor communication. Based on this new partitioning strategy, this thesis proposes efficient base alignment algorithms to resolve the conflicting communication-free partitions imposed by different computation structures.

The goal of offset alignment is to reduce data shift movement. The previous work [19] on the offset alignment problem focuses on the SIMD programming model on SIMD multiprocessors, where each processor is assigned a single element of each array operand. In the SPMD programming model on SPCs, however, each processor can be assigned a collection of elements of the same data array. As a result, the amount of shifted data with respect to each local processor is the accumulation of all shifted data requested in defining each element owned by the local processor. Since each processor owns a collection of data elements in SPMD programs, the data shifted in defining one local element has a good chance of overlapping another element on the same processor. However, this fact has been ignored by the other research done so far in the area of data decomposition. This thesis establishes a mathematical framework to model the problem of offset alignment for data parallel programs. The theoretical results built upon this framework provide efficient solutions for optimizing offset alignment.

This thesis emphasizes the impact of optimal expression evaluation on the data alignment analysis [20, 21]. The traditional implementation of data parallel programs follows the owner-computes rule, in which all the right-hand side (RHS) operands must first be transmitted to the local processor and the evaluation of the RHS then takes place on that local processor. Data movement may be reduced by evaluating an intermediate result on a remote processor rather than on the local processor which owns the left-hand side (LHS) operand. Of course, this approach of optimal expression evaluation is based on the presumption that the associative and commutative properties of the RHS are not violated. The issues of optimizing expression evaluation for the SIMD mode of programming have been addressed in [19]. However, the algorithm in [19] is given only for a single assignment statement, with the assumption that the alignment of each array operand is given. This thesis is the first to study optimal expression evaluation with regard to multiple assignment statements. The algorithmic results given by this thesis take advantage of such optimal expression evaluation in resolving both base alignment and offset alignment.
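To make the contrast concrete, the following small sketch, which is ours and not part of the thesis, counts remote element accesses for one hypothetical assignment A(i) = B(i+1) + C(i+1) + D(i) under the owner-computes rule and under optimal RHS expression evaluation; the statement, the block distribution, and all names are assumptions made only for this illustration.

    # Illustrative sketch (not from the thesis): remote-access counts for the assignment
    # A(i) = B(i+1) + C(i+1) + D(i), 0 <= i < n-1, assuming all four arrays are
    # identically aligned and block-distributed, so element k lives on processor k // block.

    def owner(k, block):
        return k // block

    def remote_accesses(n, block):
        owner_computes = 0
        optimal_eval = 0
        for i in range(n - 1):
            p_lhs = owner(i, block)        # owns A(i) and D(i)
            p_rhs = owner(i + 1, block)    # owns B(i+1) and C(i+1)
            if p_rhs != p_lhs:
                owner_computes += 2        # fetch B(i+1) and C(i+1) separately
                optimal_eval += 1          # form B(i+1)+C(i+1) remotely, ship one value
        return owner_computes, optimal_eval

    print(remote_accesses(n=1024, block=64))   # -> (30, 15)

In this hypothetical setting the partial evaluation on the remote processor halves the shifted traffic at every block boundary; this is exactly the kind of saving the owner-computes rule forgoes and that the thesis pursues systematically across multiple assignment statements.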
In traditional parallelizing compilers, interprocessor communication optimization, including redundant message avoidance, message vectorization, and overlapping communication with computation, is performed after data decomposition analysis. However, it is not true that the profitability of every communication optimization technique is only passively determined by the result of data decomposition. In this thesis, the effect of redundant communication avoidance is considered in the first place, during the decision making for data alignment. A more efficient data alignment solution can be obtained by the alignment algorithms proposed in this thesis because the dataflow analysis eliminates the impact of redundant communication in cost estimation and helps the data alignment analyzer make a more accurate decision. Message vectorization usually reduces the software latency per message, provided that the software latency in a SPC is significant relative to the network latency [22]. Since the emphasis of this thesis is on machine-independent data decomposition, we make no assumption about detailed machine parameters, including the ratio between the software latency and the network latency. Therefore, the message vectorization technique is not discussed in this thesis. For the same reason, we make no assumption about the computation speed relative to the communication speed. Hence, the technique of overlapping communication with computation is left to the back-end compilation optimizer after the data decomposition is performed.

Data distribution is responsible for reducing data shift communication and increasing processor workload balance. The amount of data shift communication can be greatly reduced if elements are distributed to processors in block fashion. On the other hand, limited by the owner-computes rule, the requirement of processor workload balance favors cyclic distribution when the workload is not uniformly distributed among all the LHS elements. The conflict between block and cyclic distribution has become an open issue in the research on data distribution. In this thesis, we propose a segment distribution which minimizes the impact of data shift movement by allocating elements consecutively to processors and balances the processor workload by varying the size of the segments assigned to different processors. An optimal processor allocation algorithm is given to minimize the overall cost of data shift communication across multiple dimensions of the template array. The segment distribution and optimal processor allocation proposed in this thesis provide the best data distribution support for most data parallel programs.

1.4 Objectives and Scope of Research

The results of our data decomposition analysis are presented based on data parallel Fortran languages, which are written in the global name space. The reason we choose data parallel Fortran languages is that scientists wish to use the SPCs in their familiar dialect of sequential Fortran. Parallelizable loops are the major source of parallelism available in data parallel Fortran programs. Many sophisticated loop transformation techniques [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34] have been developed to transform various sequential loops into parallelizable loops in order to explore the inherent parallelism. In this thesis, we assume that the data decomposition analysis takes place after the loop transformation has been performed.
Given an application program, our data decomposition analysis neglects the non-parallelizable subprogram structures and focuses only on the parallelizable subprogram structures, which can be represented by the HPF FORALL structures [14]. We believe that this approach is reasonable and practical for the following reasons:

- DOACROSS loops can be successfully parallelized by pipelining both computation and communication. The corresponding decomposition for arrays referenced in the DOACROSS loops is also straightforward: block distribution on the partitioned dimension.

- Most non-parallelizable subprogram structures in data parallel Fortran programs can be formalized into a few intrinsic functions whose efficient implementation is machine dependent and may resort to special system software or even hardware support.

We do not address the data decomposition issues involved in procedure calls because the techniques available for interprocedural data dependence testing are still at a preliminary stage. Without the support of a mature interprocedural data dependence testing technique, we feel it is extremely difficult to achieve quality data decomposition results.

Our data decomposition framework is based on an abstract model of the SPCs. The basic assumption is that interprocessor communication is much more expensive than local memory access and thus the performance of data parallel programs is dictated by data locality. Only the number of elements involved in remote data references is measured as the cost of interprocessor communication. Detailed machine-dependent parameters, such as the network latency, software latency, and switching techniques, are ignored. In the data distribution analysis, we simply assume that data elements are distributed to virtual processors which are interconnected through a fully connected network. Though the actual architecture of a SPC is not likely to afford a fully connected network topology, a scalable communication library [35, 36, 37, 38] usually can offset the adverse impact of the network topology and achieve efficient data communication. In addition, other machine-dependent optimization issues [39, 40], including run-time support [41], will not be addressed in this thesis.

It should be emphasized that it is not the intention of this thesis to address the compilation techniques [41] involved in generating efficient parallel programs for the SPCs after the data decomposition is well-defined. The data partition, instruction partition, and communication generation have actually been performed in order to obtain the performance results for the benchmark subprogram structures. However, they are treated only as a part of the implementation and will not be further addressed in the rest of this thesis.

1.5 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, an overview of data decomposition for data parallel programs is presented. Different subprogram structures are clarified and different types of interprocessor communication are examined. A layer-structured data decomposition model is used to illustrate the tasks accomplished in each major phase. Strategies and difficulties are discussed for data re-distribution and data re-alignment. Major previous work is classified based on the proposed layer structure model. Chapter 3 addresses the base alignment phase.
Efficient base alignment algorithms are proposed with regard to various concerns, including optimal expression evaluation and redundant communication avoidance. Chapter 4 addresses the offset alignment phase. Efficient offset alignment algorithms are proposed by considering both optimal expression evaluation and redundant communication avoidance. Chapter 5 addresses the data distribution phase. An optimal virtual processor allocation strategy is proposed and a closed form of segment distribution is given to maximize the processor workload balance and minimize the neighboring communication. Experimental results for benchmark program structures are shown in each of Chapters 3, 4, and 5. Finally, Chapter 6 summarizes the major contributions of this research and provides directions for future research.

CHAPTER 2

Data Decomposition Overview

This chapter gives an overview of the data decomposition problem. Different subprogram structures are clarified and different types of interprocessor communication are examined. A layer-structured data decomposition model is proposed to illustrate each major phase in data decomposition. Strategies and difficulties are discussed for data re-distribution and data re-alignment. Influential previous work is classified based on the proposed layer structure model.

In this thesis, we follow the Fortran 90 array specification. In a one-dimensional array, an array subscript is declared as 0 : n-1 : s, where n is the total number of elements and s is the value of the stride. An array subscript always starts with zero. The stride option can be omitted if its value is one. For example, A(0 : n-1) represents A(0 : n-1 : 1). Consecutive array elements A(l), A(l+1), ..., A(r) (l < r) can be represented by the array segment A(l : r). In a multi-dimensional array, the array subscripts declared for the dimensions are separated by commas. For example, an 8 x 10 two-dimensional array A can be represented by A(0 : 7, 0 : 9). A(1, 2) is an element of A(0 : 7, 0 : 9). This notation for array subscripts is also applicable to loop statement declarations.

2.1 Loop Characteristics

Loops are the major source of parallelism in data parallel Fortran programs [42]. What type of parallelism a loop can explore is determined by the pattern of data dependence among different iterations of the loop.

2.1.1 Parallelizable Loop

A loop is a parallelizable loop if a data object defined in one iteration is not used or re-defined in other iterations of the loop [24, 43]. A general form of parallelizable loops can be modeled by the FORALL loop in HPF [14]. A FORALL loop may consist of one or more FORALL assignment statements, or FORALL assignments for short. A FORALL assignment is implemented by first executing the evaluation of the RHS of the assignment statement for all combinations of loop index subscripts, and then assigning to the corresponding elements of the array referenced at the LHS of the assignment statement. A FORALL assignment is free of loop-carried data dependence. Semantically, in a FORALL loop consisting of several FORALL assignments, the execution of the next FORALL assignment will not take place until the execution of the previous FORALL assignment is fully completed.

In a FORALL assignment, the access to a RHS data element owned by a remote processor is achieved by passing a message to the local processor. Since the FORALL assignment is free of loop-carried data dependence, messages sent to the local processor can be combined and transferred prior to the loop execution. As a result, in the execution of a FORALL loop consisting of several FORALL assignments, computation and communication alternate statement by statement. Since it is fully parallelizable, the performance of a FORALL loop is determined only by processor workload balance and interprocessor communication overhead. Therefore, the FORALL loop becomes one of the most important program structures on which data decomposition issues are studied. A survey of previous work with regard to the FORALL loop can be found in Section 2.7.

2.1.2 Perfectly Nested Loops

In many perfectly nested loops with constant distance vectors, the outermost loop cannot be parallelized by any legal loop transformation [29]. This feature makes the coarse-grain parallelism difficult to explore and thus degrades the overall performance on the SPCs due to the large overhead of barrier synchronization at the end of each iteration of the outermost loop. A closer inspection reveals that computation and communication can be pipelined in a perfectly nested loop with constant distance vectors. For example, consider the following four-point difference operation.

    DO i1 = 2, n-1
      DO i2 = 2, n-1
    s1:   X(i1,i2) = (X(i1-1,i2) + X(i1+1,i2) + X(i1,i2-1) + X(i1,i2+1))/4
      END DO
    END DO

Figure 2.1(a) shows the original iteration space, in which an arrow represents the direction of data dependence. Figure 2.1(b) illustrates how this loop can be executed in pipeline fashion. The iterations are assigned to different processors in shaded blocks, and so are the columns of array X. The row segment assigned to each processor is labeled by its lock step in the pipeline. As soon as the computation on the first row is finished, all processors can start executing in parallel. This technique is also known as tiling [44]. The tiled code that corresponds to Figure 2.1(b) is as follows.

    DO i4 = 2, n-1, b
      DO i1 = 2, n-1
        DO i3 = i4, min(n-1, i4+b-1)
    s1:     X(i1,i3) = (X(i1-1,i3) + X(i1+1,i3) + X(i1,i3-1) + X(i1,i3+1))/4
        END DO
      END DO
    END DO

Here b is the length of the row segment assigned to each processor. The original i2 loop is split into two dimensions: the outer i4 loop and the inner i3 loop. Iterations of the i4 loop are spread across processors, while iterations of the i1 and i3 loops reside on the same processor.

    [Figure 2.1. Pipeline computation and communication: (a) original iteration space; (b) pipelining computation and communication.]

An important feature of pipelining is that computation and communication can be overlapped [17]. As shown in Figure 2.1(b), processor p1 can send the result of X(i1, b) to processor p2 while processor p2 is still working on an element X(i1-1, i3), where b <= i3 <= 2b. The pipelining technique can also be applied to perfectly nested loops with irregular and complex dependence constraints using the dependence uniformization methods [45].

2.1.3 Special-Purpose Loops

Special parallel algorithms may be required to parallelize special-purpose operations, such as prefix, suffix, gather, scatter, and other combining operations. Generally speaking, these algorithms are machine-dependent or architecture-dependent. As a result, they are typically supported as library routines, such as intrinsic procedures in HPF, and usually not further optimized at compilation time.
An important issue which the compilation optimizer should deal with is how to efficiently utilize these library routines, in particular when the library routines are designed for various data decomposition patterns. Issues in designing scalable library routines for MPCs can be found in [46].

2.2 Interprocessor Communication

Interprocessor communication occurs when a data object referenced by a local processor resides in a remote memory. As mentioned in [47], the interprocessor communication generated in executing data parallel Fortran programs can be classified into intrinsic communication and residual communication. Intrinsic communication arises from special-purpose computational operations, such as scatter and gather, that require data motion as an integral part of the operation. As addressed in the previous section, minimizing intrinsic communication is a major task of efficient intrinsic library implementation [48, 35, 36, 37, 49, 38, 50, 51] and thus is beyond the scope of compiler optimization.

Residual communication arises from nonlocal data references required in a computation whose operands are not aligned with respect to each other. Typically, residual communication can be further separated into neighboring communication and reorganization communication. Neighboring communication refers to nearest-neighbor shifts of data. Reorganization communication is due to a mismatch in data decomposition, which requires reorganizing the entire data structure. Not all residual communication patterns are equally expensive on SPCs. Neighboring communication can be significantly reduced by distributing data arrays in blocks. On the other hand, reorganization communication is often much more expensive because the data movement pattern is unstructured, such as a transpose, a change of stride, or vector-valued subscripts.

Reducing interprocessor communication by properly allocating data objects to processors is a key issue in the data decomposition analysis.

2.3 Processor Workload Balance

The main reason to employ SPCs is to reduce the overall execution time. In most data parallel scientific programs, parallelizable loops are the major source of parallelism. For concurrent execution of a parallelizable loop, the iterations have to be assigned to processors. This is also known as workload distribution. Ideally, given a fixed amount of overall workload, the workload should be evenly distributed to each processor, so that the overall execution time of the whole system is minimized. Otherwise, the overall execution time is longer whenever one processor sits idle while other processors are still working. Note that a barrier synchronization is typically required between two adjacent subprogram phases. As a result, a processor that finishes its own work earlier than the others has to remain idle rather than start working on the next subprogram phase.

Previous study of processor workload balance has focused on run-time scheduling on shared-memory multiprocessor systems [52, 53, 54, 55, 56, 57, 58, 59]. In data parallel programs, the workload is distributed in such a way that a data object is written only by the local processor which owns it. Therefore, the workload distribution depends on the pattern of data decomposition, which is determined at compilation time. In this thesis, we study static processor workload balance, another key issue in the data decomposition analysis. It has been shown [60] that static processor workload balance is critical to the overall performance even with the support of runtime scheduling.

2.4 The Model for Data Decomposition

    [Figure 2.2. The model for data decomposition: arrays are mapped by data alignment (the problem mapping, machine independent) onto the template; the template is mapped by data distribution (the abstract processor mapping) onto abstract processors; abstract processors are mapped by the physical processor mapping (machine dependent) onto physical processors.]

As illustrated in Figure 2.2, the data decomposition model can be viewed as a two-level mapping of array elements to abstract processors. Array elements are first aligned relative to one another, known as data alignment; the group of aligned arrays is then distributed onto a set of virtual processors, known as data distribution. Data alignment, the problem mapping, represents the requirements of reducing data movement induced by the structure of the underlying computation. Data alignment is achieved by aligning array elements to an abstract index space, known as the template in HPF [14], which is typically represented by a rectilinear space. Data distribution, the virtual-machine mapping, represents the requirement of efficiently allocating computation structures to finite machine resources. Data distribution is attained by allocating template elements onto the rectilinear arrangement of virtual processors. As a result, all array elements which are aligned to the same template element are allocated to the processor to which that particular template element is allocated. The data alignment and distribution phases are machine-independent. The final step, in which the rectilinear arrangement of virtual processors is mapped to the physical processors, is machine-dependent. The issues in physical processor mapping are beyond the scope of this thesis. For simplicity, in the rest of this thesis a processor refers to a virtual processor rather than a physical processor.

2.4.1 Formulation for Data Alignment and Distribution

A mathematical model is presented for data alignment and data distribution. An array of dimension m defines an array space A, an m-dimensional rectangle. Each element in the array is accessed by an integer vector a = (a1, a2, ..., am). In the same way, a template of dimension h defines an array space T, an h-dimensional rectangle. Each element in the template is accessed by an integer vector t = (t1, t2, ..., th).

Definition 2.1 For each index a of an m-dimensional array, the data alignment of the array onto an h-dimensional template is a function δ_A(a) : A -> T, where

    δ_A(a) = D_A a + d_A

D_A is an h x m linear transformation matrix and d_A is a constant vector.

Definition 2.2 D_A is called the alignment matrix of array A. d_A is called the alignment offset of array A.

The rectilinear arrangement of abstract processors is represented by a processor array. The dimension of the processor array should be the same as the dimension of the template. The processor array defines an array space P, an h-dimensional rectangle. Each element in the processor array is accessed by an integer vector p = (p1, p2, ..., ph).

Definition 2.3 For each index t of an h-dimensional template, the data distribution of the template onto an h-dimensional processor array is a function π(t) : T -> P.

The function π may not necessarily be an affine function and varies with different applications.
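As an illustration of Definitions 2.1 through 2.3, the following sketch, which we add here and which is not part of the thesis, evaluates the two-level mapping for one array. The transpose-style alignment matrix, the offset, and the block form chosen for π are assumptions made only for this example; in general π need not take this shape.

    # Illustrative sketch (not from the thesis): the two-level mapping of Definitions
    # 2.1-2.3 for one array.  delta maps an array index to a template index via
    # delta(a) = D_A a + d_A; pi maps a template index to a processor.  D_A, d_A, and
    # the block distribution used for pi are assumptions made only for this example.

    import numpy as np

    D_A = np.array([[0, 1],
                    [1, 0]])              # alignment matrix: exchange the two dimensions
    d_A = np.array([0, 2])                # alignment offset

    def delta(a):
        # Data alignment (Definition 2.1): array index -> template index.
        return D_A @ np.asarray(a) + d_A

    def pi(t, block=(4, 4)):
        # Data distribution (Definition 2.3): template index -> processor index.
        # Here a simple block distribution; pi need not be affine.
        return tuple(ti // b for ti, b in zip(t, block))

    a = (5, 1)                            # array element A(5, 1)
    t = delta(a)                          # template element it is aligned to: [1 7]
    p = pi(t)                             # processor that owns it: (0, 1)
    print(t, p)

The two function compositions mirror the two mapping levels of the model: δ realizes the problem mapping onto the template, and π realizes the machine mapping onto the processor array.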
2.4.2 Layered Structure for Data Decomposition

A typical data decomposition process can be described by the three-phase layered structure shown in Figure 2.3. The base alignment phase determines the alignment matrix D_A in the alignment function δ_A for each array A. The offset alignment phase specifies the alignment offset d_A in the alignment function δ_A for each array A. The base alignment and offset alignment phases accomplish the task of data alignment by mapping array elements to template elements. The distribution phase decides in what fashion the template elements are distributed to the processors. All array elements which are aligned to the same template element are allocated to the same processor to which that particular template element is allocated. As a result, through the bridge of the template, the distribution phase fulfills the requirement of data distribution.

    [Figure 2.3. Layered structure for data decomposition process: various multi-dimensional arrays pass through base alignment; each dimension of the template (the 1st, the 2nd, ..., the h-th) then passes through its own offset alignment; data distribution finally yields the aligned and distributed arrays.]

2.4.3 The Base Alignment Phase

A multi-dimensional array can be treated as a multi-dimensional bounded linear space, called the data space, in which each integer grid point represents an array element. A multi-dimensional array can be partitioned into a family of parallel hyperplanes such that array elements on the same hyperplane are always allocated to the same processor. As a result, the potential overhead of interprocessor communication involved in mutual references between elements in the same hyperplane can be fully avoided.

Consider the NAS benchmark loop in Example 1 (Figure 2.4). Array elements Y(i1, i2) and Y(i2, i1) referenced in FORALL assignment s1 lie on the same off-diagonal. Therefore, FORALL assignment s1 is free of interprocessor communication if array Y is partitioned into a family of off-diagonals, as shown in Figure 2.5(b).

    FORALL (i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s1:  Y(i1, i2) = Y(i2, i1)
    s2:  X(i1, i1+i2) = Y(i1, i2)
    END FORALL

    Figure 2.4. Example 1: A NAS benchmark loop

In Figure 2.5(b), y1 is the index of the column dimension and y2 is the index of the row dimension of array Y. T represents the one-dimensional template array. The size of array Y is declared as 8 x 8 and the size of T is 8. Each array element is represented by a circle. Elements on each off-diagonal (represented by a solid line) are collapsed and mapped (represented by a dashed line) to the same element in the template. Some array elements are not covered by any solid line since they are outside the loop boundary. Vector (1, -1), the direction vector of an off-diagonal, is called the collapse base of array Y. Two Y elements must be collapsed and mapped to the same element in the template if the vector starting at one element and ending at the other is parallel to the collapse base (1, -1). Vector (1, 1), which is orthogonal to the collapse base (1, -1), is called the distribution base of Y. Two Y elements must be mapped to different elements in the template, and thus may be distributed to different processors, if the inner product of the distribution base and the vector constructed by these two Y elements is non-zero.

The alignment matrix is constructed from the distribution bases. In this example, since there is only one distribution base (1, 1), D_Y = (1, 1). Thus, the alignment function δ_Y can be specified as follows:

    t = δ_Y((y1, y2)) = D_Y (y1, y2)^T = (1, 1) (y1, y2)^T = y1 + y2

where Y(y1, y2) is a Y element and T(t) is a template element.
Alignment matrix is constructed by distribution bases. In this example, since there is only one distribution base (1,1), Dy = (1,1). Thus, alignment function by can be specified as follows. III 311 311 t=6y( )=Dy =(l,l) =y1+y2 312 312 312 where Y(y1, 3,12) is a Y element and T(t) is a template element. Base alignment only 24 determines the value of alignment matrices. The value of alignment offsets is decided in the offset alignment phase. In the above 6X and by, we simply assume that dx = 0 and dy = 0. ’ x T 00113936 base (1,-1) distribution base (1.1) C" ‘* ‘“ m r '-*- ---i:'7 J a e: I": if? It? ’/ ff“ ‘ —"‘ .__ - 7 —— — H3 (gait:- in" .1: o .2"; I 1. f // fL" “‘ v— “a —— -~-cJ Q I)” (:5 3.3 If? <1) If} // I] / QT “I I I T “ . r c! a :3 0 o / / / // if“ T- ..._ h v US " m" (a O O. O Y I (1* M " / / / / / / :3 :T 3 )3” O I I“ O O O / / / / / ,/'/ / i O J I o O O C» of 1 (f “ f’ ‘i” ‘i‘ —> y2 (D If f (f, (3" ([1 $ 0 C) I; I? It q, 3'1 o o 0 ti» If g- x O O m I) $ T (If, (T. _ ' , p X 9 Q t r) c. (b rt ( {T 2 o O o o o o q) [ X o o o o o o o It 1 collapse base (1,0) distribution base (0,1) +v] (a) (b) Figure 2.5. Base alignment in Example 1 Generally speaking, the vector represented by each row in an alignment matrix indicates a distribution base. As a result, the rows in the same alignment matrix must be mutually linear independent. On the other hand, the vector which is orthogonal to all rows in a alignment matrix implies that that vector is a collapse base. The distinct collapse bases must also be mutually linear independent. All collapse bases and distribution bases should span the entire data space with regard to each array. 25 Consider the array subscript structure in FORALL assignment 32 (Figure 2.4). Since Y elements are collapsed along the off-diagonal, X elements have to be collapsed in columns so as to make FORALL assignment 32 free of interprocessor communication. Therefore, the collapse base of array X is equal to (1,0), the direction vector of a column. The distribution base of array X is equal to (0,1), which is orthogonal to collapse base (1,0). Therefore, we have Dx = (0,1). Thus, alignment function 6x can be specified as follows. t=6x( $1 )=DX $1 =(o,1) “”1 :33; x2 x2 x2 where X (x1, x2) is an X element and T(t) is a template element. Figure 2.5(a) shows the alignment of array X with regards to the template. In Figure 2.5(a), x1 is the index of the column dimension and x2 is the index of the row dimension of array X. The size of array X is declared as 8 x 8. X elements in the same column (represented by each solid line) are collapsed and mapped (represented by each dash line) to the same element in the template. Some elements are not covered by any solid line since they are out of loop boundary. Note that the X partition in columns and Y partition in off-diagonals are not randomized. They are in fact inducted from the requirement of minimizing inter- processor communication based on the given array subscript structures in Example 1. Such a relationship between the alignment of X and the alignment of Y is called base alignment. Base alignment represents the alignment relationship between various distribution bases of various arrays and the alignment relationship between various collapse bases of various arrays. Depending on the pattern of how array elements are mapped to the template, an array may have more than one distribution base and / or more than one collapse base. Consider the Purdue benchmark loop in Example 2 (Figure 2.6). 
Consider the Purdue benchmark loop in Example 2 (Figure 2.6).

    FORALL(i1 = 0 : n1-1, i2 = 0 : n2-1)
    s1:   W(i1, i2) = Z(i2, i1)
    END FORALL

    Figure 2.6. Example 2: A Purdue benchmark loop

Unlike Example 1, in Example 2 there is no self-reference regarding arrays W and Z. Both W and Z can be partitioned in both the row and column dimensions. Thus, the collapse base of both W and Z degenerates to (0,0). Both W and Z have distribution bases (1,0) and (0,1), which span the entire data space. This implies that there is no requirement that two particular elements in the same array should be mapped to the same template element.

[Figure 2.7. Base alignment in Example 2. Part (a): alignment of array W (distribution bases (1,0) and (0,1)) with the two-dimensional template T; part (b): alignment of array Z (distribution bases (1,0) and (0,1)) with T.]

Figure 2.7 shows the alignment of W and Z. In Figure 2.7, the size of W is declared as 4 x 8 and the size of Z is declared as 8 x 4. Template T is declared as a two-dimensional array whose size is 4 x 8. w1, z1, and t1 represent the column dimension of arrays W, Z, and T, respectively. w2, z2, and t2 represent the row dimension of arrays W, Z, and T, respectively. Figure 2.7(a) shows the alignment of array W: dimension w1 is aligned with dimension t1 and dimension w2 is aligned with dimension t2. Therefore, $D_W = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and alignment function $\delta_W$ is specified as follows.

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_W\!\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}\right) = D_W \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \]

where $W(w_1, w_2)$ is a W element and $T(t_1, t_2)$ is a template element. Figure 2.7(b) shows the alignment of array Z: dimension z2 is aligned with dimension t1 and dimension z1 is aligned with dimension t2. Therefore, $D_Z = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and alignment function $\delta_Z$ is specified as follows.

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_Z\!\left(\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\right) = D_Z \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} z_2 \\ z_1 \end{bmatrix} \]

where $Z(z_1, z_2)$ is a Z element and $T(t_1, t_2)$ is a template element.

For arrays each of which has multiple (linearly independent) distribution bases, it is crucial to decide which distribution base of one array should be aligned with which distribution base of the other. In Example 2 (Figure 2.6), in order to make FORALL assignment s1 free of interprocessor communication, distribution base (1,0) (the column dimension) of W is aligned with distribution base (0,1) (the row dimension) of Z, and distribution base (0,1) (the row dimension) of W is aligned with distribution base (1,0) (the column dimension) of Z. Aligned distribution bases in various arrays comprise an aligned-base group. Example 2 has two distinct aligned-base groups. One consists of distribution base (0,1) of W and distribution base (1,0) of Z. The other consists of distribution base (1,0) of W and distribution base (0,1) of Z. By the definition of the alignment function, each aligned-base group can be uniquely identified by a dimension in the template array.
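As with Example 1, the alignment of Example 2 can be checked directly. The short sketch below (illustrative only; the 4 x 8 bounds are assumptions) verifies that $\delta_W$ and $\delta_Z$ as defined above send W(i1,i2) and Z(i2,i1) to the same two-dimensional template element.

    # Sketch: base alignment of Example 2, D_W = identity, D_Z = row/column swap.
    def delta_W(w1, w2):
        return (w1, w2)

    def delta_Z(z1, z2):
        return (z2, z1)

    n1, n2 = 4, 8
    for i1 in range(n1):
        for i2 in range(n2):
            # s1: W(i1,i2) = Z(i2,i1)
            assert delta_W(i1, i2) == delta_Z(i2, i1)
    print("s1 in Example 2 is free of interprocessor communication")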
Different distribution bases of the same array can never be aligned with each other and can never be included in the same aligned-base group.

2.4.4 The Offset Alignment Phase

As described in the previous section, an array can be partitioned into a family of parallel hyperplanes in each of which the array elements are collapsed and mapped to the same element in the template. However, base alignment only determines the constituent base(s), known as distribution base(s) and collapse base(s), of such a family of parallel hyperplanes. Offset alignment is responsible for the displacement of each hyperplane with regard to the template. In other words, while base alignment determines $D_A$, offset alignment determines $d_A$ in alignment function $\delta_A$ for each array A.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-3-i1)
    s1:   Y(i1, i2) = Y(i2, i1)
    s2:   X(i1, i1 + i2 + 2) = Y(i1, i2)
    END FORALL

    Figure 2.8. Example 3: A NAS benchmark loop

Consider the NAS benchmark loop in Example 3 (Figure 2.8).

[Figure 2.9. Offset alignment in Example 3. Array Y is partitioned in off-diagonals with alignment offset 0, and array X is partitioned in columns with alignment offset 2; both are mapped to the one-dimensional template T.]

Except for the access offset 2 in the array subscript X(i1, i1 + i2 + 2) (Figure 2.8), Example 3 is similar to Example 1. For this reason, the base alignment analysis of Example 3 is identical to that of Example 1: $D_X = (0,1)$ and $D_Y = (1,1)$. Figure 2.9 shows the alignment for arrays X and Y in Example 3. In Figure 2.9, Y is partitioned in off-diagonals and X is partitioned in columns. Unlike Example 1 (Figure 2.5), however, in Example 3 (Figure 2.8) the (k+2)-th column in X should be aligned with the k-th off-diagonal in Y in order to make FORALL assignment s2 free of interprocessor communication. Such an alignment relationship between the column displacement in X and the off-diagonal displacement in Y is called offset alignment. In Example 3, we have $d_X = 2$ and $d_Y = 0$. Alignment function $\delta_X$ can be written as follows:

\[ t = \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + d_X = x_2 + 2 \]

where $X(x_1, x_2)$ is an X element and $T(t)$ is a template element. Alignment function $\delta_Y$ can be written as follows:

\[ t = \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} + d_Y = y_1 + y_2 \]

where $Y(y_1, y_2)$ is a Y element and $T(t)$ is a template element.

Generally speaking, the value of an alignment offset is represented by a constant vector. Offset alignment should determine each component of such an alignment offset vector. In Chapter 4, we will show that the components of the alignment offset vector are independent of one another. Each component can be decided based on a distinct aligned-base group.
2.4.5 The Data Distribution Phase

The data distribution phase determines how to map the template elements to virtual processors. Since both the template and the virtual processor arrangement are represented by multi-dimensional arrays, there are two major decisions to make regarding each dimension of the template: what is the distribution type, and how many virtual processors should be allocated.

We first consider the case in which both the template array and the processor array are one-dimensional. There are numerous ways to distribute template elements across processors. Among them, the cyclic distribution and the block distribution are the most popular. In cyclic distribution, template elements are assigned to processors in a round-robin fashion. In block distribution, template elements are contiguously allocated to each processor.

Figure 2.10 shows the data distribution phase of Example 3. In Figure 2.10, arrays X and Y are declared as 8 x 8. There are two processors available in total, denoted as P(0) and P(1). The template array is distributed cyclically.

[Figure 2.10. Data distribution in Example 3.]

As a result, processor P(0) owns the even-numbered X columns and the even-numbered Y off-diagonals. Processor P(1) owns the odd-numbered X columns and the odd-numbered Y off-diagonals. The data distribution function, $\pi$, can be written as follows:

\[ p = \pi(t) = t \bmod 2 \]

where $P(p)$ is an element in the processor array and $T(t)$ is an element in the template array.

Note that the number of LHS array elements mapped to each template element varies. For example, two Y elements are mapped to T(1) but five Y elements are mapped to T(4). Since the workload assigned to each processor depends on the number of LHS elements owned by that processor, the workload imposed by each template element can vary. For this reason, the cyclic distribution increases the processor workload balance. Another reason for using cyclic distribution is that interprocessor communication has already been fully eliminated by the alignment function. Otherwise, significant communication cost could occur if cyclic distribution were employed.

The template array and processor array can be multi-dimensional. The task of processor allocation is to determine the layout of the multi-dimensional processor array.

    FORALL(i1 = 1 : n1-3, i2 = 1 : n2-3)
    s1:   W(i1, i2) = Z(i1, i2) + Z(i1+1, i2) + Z(i1-1, i2)
                    + Z(i1, i2+1) + Z(i1, i2-1)
    s2:   Z(i1, i2) = W(i1, i2)
    END FORALL

    Figure 2.11. Example 4: A Heatwave benchmark loop

Consider the Heatwave benchmark loop in Example 4 (Figure 2.11). By the construction of the array subscripts in FORALL assignment s1, there is no interprocessor communication as long as W(i1,i2) and Z(i1,i2) are mapped to the same template element. On the other hand, however, since both Z(i1+1,i2) and Z(i1,i2) are referenced, assignment s1 is free of communication only if Z elements are collapsed in the row dimension. Similarly, since both Z(i1,i2+1) and Z(i1,i2) are referenced, assignment s1 is free of communication only if Z elements are collapsed in the column dimension. This conflict implies that interprocessor communication in executing assignment s1 cannot be avoided no matter how W and Z are aligned. For this reason, alignment functions $\delta_W$ and $\delta_Z$ are simply defined as follows:

\[ \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_W\!\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}\right) = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \qquad \text{and} \qquad \begin{bmatrix} t_1 \\ t_2 \end{bmatrix} = \delta_Z\!\left(\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}\right) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \]

where $W(w_1, w_2)$ is a W element, $Z(z_1, z_2)$ is a Z element, and $T(t_1, t_2)$ is a template element.

[Figure: distribution of W and Z onto a two-dimensional processor array P(0,0), P(0,1), P(0,2), ...]
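The two distribution types discussed above can be written as simple index maps from a one-dimensional template to processors. The following sketch is illustrative only (a template of n elements and p processors are assumed); the cyclic map is exactly the $\pi(t) = t \bmod 2$ used for Example 3.

    # Sketch: cyclic and block distribution of a one-dimensional template.
    def cyclic(t, p):
        # round-robin assignment of template element t to one of p processors
        return t % p

    def block(t, p, n):
        # contiguous chunks of ceil(n/p) template elements per processor
        chunk = -(-n // p)
        return t // chunk

    n, p = 8, 2
    print([cyclic(t, p) for t in range(n)])    # [0, 1, 0, 1, 0, 1, 0, 1]
    print([block(t, p, n) for t in range(n)])  # [0, 0, 0, 0, 1, 1, 1, 1]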
[...] where A is an array variable and k is the statement number. Let $F_{A,k}$ be the access matrix for the instance of array A referenced in FORALL assignment $s_k$. In Example 1 (Figure 3.1), we have

\[ F_{Y,3} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad F_{Y,4} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad F_{Y,2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad F_{X,2} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}. \]

In order to make the base alignment results transparent to the single occurrence transformation, the access function of a temporary variable is forced to be the same as that of the data array instance which the temporary variable replaces. In Example 1 (Figure 3.1), $F_{TT,4} = F_{TT,3} = F_{Y,3} = \left[\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. Moreover, the alignment function of the temporary variable is also forced to be the same as that of the data array instance which the temporary variable replaces. For temporary variable TT in Example 1 (Figure 3.1), $\delta_{TT}$ is always equal to $\delta_Y$. It is assumed that any program structure used in this chapter has been pre-processed by the necessary single occurrence transformation and that all base alignment theorems and algorithms are stated based on the denotation of single occurrence statements.

Definition 3.2 If array A is referenced on the LHS and array B is referenced on the RHS in statement $s_k$, the read/write relationship between arrays A and B is called a reference and is represented by the symbolic form "A <- B@s_k".

In Example 1 (Figure 3.1), we have references "TT <- Y@s3", "Y <- TT@s4", and "X <- Y@s2". The notation "A <- B@s_k" is not ambiguous because A and B each have only one type of instance referenced in statement $s_k$ by the assumption of the single occurrence statement.

3.2 Base Alignment for Single Reference

In this section, we use linear space theory to model the inter-relationship between base alignment and the access function with regard to a single reference.

3.2.1 Base Alignment Equation

Consider reference "X <- Y@s2" in Example 1 (Figure 3.1). In iteration $(i_1, i_2)$, assume that $X(x_1, x_2)$ is referenced on the LHS and $Y(y_1, y_2)$ is referenced on the RHS. Using the access function specification, we have

\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.1) \]

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.2) \]

Multiplying both sides of Equation 3.1 by $D_X$ and both sides of Equation 3.2 by $D_Y$, we have

\[ D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = D_X F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.3) \]

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = D_Y F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.4) \]

The purpose of data alignment is to eliminate interprocessor communication. In this case, interprocessor communication is guaranteed to be avoided if both $X(x_1,x_2)$ and $Y(y_1,y_2)$ can be aligned to the same template element. In other words,

\[ \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) \]

Therefore, we search for a solution of $D_X$ and $D_Y$ such that

\[ D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \]

Substituting $D_X (x_1, x_2)^T$ by Equation 3.3 and $D_Y (y_1, y_2)^T$ by Equation 3.4, we get

\[ D_X F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = D_Y F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad (3.5) \]

Equation 3.5 can be re-written as follows:

\[ (D_X F_{X,2} - D_Y F_{Y,2}) \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = 0 \]

Since $(i_1, i_2)$ represents any iteration in the iteration space, Equation 3.5 holds if and only if

\[ D_X F_{X,2} - D_Y F_{Y,2} = 0 \qquad (3.6) \]

Equation 3.6 can be easily extended to the following property for any multi-dimensional arrays A and B.

Proposition 1 Reference "A <- B@s_k" is free of interprocessor communication if the following equation holds.

\[ D_A F_{A,k} - D_B F_{B,k} = 0 \quad \text{or} \quad D_A F_{A,k} = D_B F_{B,k} \qquad (3.7) \]

Alignment matrices $D_A$ and $D_B$ are compatible with respect to statement $s_k$ if Equation 3.7 holds. Otherwise, $D_A$ and $D_B$ are called incompatible with respect to that statement.
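The compatibility test of Proposition 1 is a plain matrix identity. The sketch below (an illustration, not the thesis implementation) checks it with NumPy for the reference "X <- Y@s2" of Example 1.

    # Sketch: D_A and D_B are compatible w.r.t. s_k iff D_A F_{A,k} == D_B F_{B,k}.
    import numpy as np

    def compatible(D_A, F_Ak, D_B, F_Bk):
        return np.array_equal(np.dot(D_A, F_Ak), np.dot(D_B, F_Bk))

    F_X2 = np.array([[1, 0], [1, 1]])   # access matrix of X(i1, i1+i2)
    F_Y2 = np.array([[1, 0], [0, 1]])   # access matrix of Y(i1, i2)
    print(compatible(np.array([0, 1]), F_X2, np.array([1, 1]), F_Y2))   # True
    print(compatible(np.array([1, 0]), F_X2, np.array([1, 1]), F_Y2))   # False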
Consider reference "Y <- TT@s4" in Example 1 (Figure 3.1). Using Proposition 1, we get

\[ D_Y F_{Y,4} = D_{TT} F_{TT,4} \qquad (3.8) \]

Since TT is the temporary variable used to replace $Y(i_2, i_1)$ in statement s3, it is always true that $D_{TT} = D_Y$ and $F_{TT,4} = F_{Y,3}$. Substituting the values of $D_{TT}$ and $F_{TT,4}$ into Equation 3.8, we get $D_Y F_{Y,4} = D_Y F_{Y,3}$, which can be re-written as $D_Y (F_{Y,4} - F_{Y,3}) = 0$. On substitution of $F_{Y,4}$ and $F_{Y,3}$, we have

\[ D_Y \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \right) = 0 \]

\[ D_Y \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = 0 \qquad (3.9) \]

which yields $D_Y = (1,1)$. Substituting the result of $D_Y$ into alignment function $\delta_Y$, we have

\[ \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = y_1 + y_2 \]

Alignment function $\delta_Y$ implies that all elements on an off-diagonal must be collapsed to the same element in the template. This guarantees that $Y(i_1,i_2)$ and $Y(i_2,i_1)$ must be assigned to the same processor, and thus assignment s1 is free of reorganization communication. Substituting the values of $D_Y$, $F_{Y,2}$, and $F_{X,2}$ into Equation 3.6, we have $D_X = (0,1)$ and the alignment function of X can be specified as $\delta_X((x_1, x_2)) = x_2$. This implies that, in order to make assignment s2 free of reorganization communication, X must be partitioned in columns if Y is partitioned in off-diagonals.

3.2.2 A Legitimate Solution of the Alignment Matrix

Note that there are infinitely many solutions of $D_Y$ which satisfy Equation 3.9. The reason we chose $D_Y = (1,1)$ is that the image projected by $\delta_Y$ from $\mathcal{Y}$, the data space of Y, to $\mathcal{T}$, the data space of T, should be compact.

Definition 3.3 A subspace is compact if and only if the subspace includes every integer grid point on any line whose two endpoints are included in the subspace.

An image on the template space $\mathcal{T}$ is compact if and only if the image includes every integer grid point on any line whose two endpoints are included in the image. The concept of compactness is explained in Figure 3.2. In Figure 3.2, the template elements included in the image projected by the alignment function are shown in dark. Figure 3.2(a) shows the compact image projected by $\delta_Y$ defined by $D_Y = (1,1)$. Figure 3.2(b) shows the non-compact image projected by $\delta_Y$ defined by $D_Y = (2,2)$. In Figure 3.2(b), T(3) is an integer grid point on the line incident with T(0) and T(7). However, T(3) is not included in the image, while T(0) and T(7) are.

[Figure 3.2. The compactness of the image on T. Part (a): $D_Y = (1,1)$ and the image on T is compact. Part (b): $D_Y = (2,2)$ and the image on T is not compact.]

Given an n x m matrix A, let $A_r$ be a square submatrix of A such that its determinant $|A_r| \neq 0$ and r is the rank of the original matrix A. The value of $|A_r|$ is defined to be the determinant of the rectangular n x m matrix A, denoted as $|A|$. Array elements which are defined or used in a FORALL structure are called effective elements. The effective domain of an array is composed of all effective elements in the array. Generally speaking, the solution of an alignment matrix $D_A$ is legitimate if the effective domain of A is compact and $|D_A| = 1$.
In this case, the image on the template space projected by $\delta_A$ is guaranteed to be compact. In Example 1 (Figure 2.5), the effective domain of array Y consists of the whole upper triangle and thus is compact. Therefore, $D_Y = (1,1)$ is legitimate since $|D_Y| = 1$.

However, the determinant of a legitimate $D_A$ can be any rational number if the effective domain of array A is not compact. Under such a circumstance, the stride in A's effective domain has to be considered in order to find a correct solution of $D_A$. This can be illustrated using the Whetstone benchmark loop shown in Example 5 (Figure 3.3).

    FORALL(i1 = 0 : n/2-1, i2 = 0 : n/3-1)
    s1:   B(2i1, 3i2) = A(i1, 2i2)
    s2:   A(i1, 2i2) = B(i1, 3i2)
    END FORALL

    Figure 3.3. Example 5: A Whetstone benchmark loop

For Example 5, by Proposition 1, we get

\[ D_B F_{B,1} = D_A F_{A,1} \qquad D_A F_{A,2} = D_B F_{B,2} \]

where $F_{B,1} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$, $F_{B,2} = \begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix}$, and $F_{A,1} = F_{A,2} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$. Since $F_{A,1}$ and $F_{A,2}$ are invertible, the above equations can be re-written as follows:

\[ D_B F_{B,1} F_{A,1}^{-1} = D_A \qquad (3.10) \]

\[ D_A = D_B F_{B,2} F_{A,2}^{-1} \qquad (3.11) \]

By substitution, we have $D_B F_{B,1} F_{A,1}^{-1} = D_B F_{B,2} F_{A,2}^{-1}$, which can be re-written as follows:

\[ D_B (F_{B,1} F_{A,1}^{-1} - F_{B,2} F_{A,2}^{-1}) = 0 \]

\[ D_B \left( \begin{bmatrix} 2 & 0 \\ 0 & \tfrac{3}{2} \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & \tfrac{3}{2} \end{bmatrix} \right) = 0 \]

\[ D_B \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = 0 \qquad (3.12) \]

Any vector (0, h) could be a solution of Equation 3.12. However, there is one legitimate solution of $D_B$ which makes the image on the template space projected by alignment function $\delta_B$ compact. Since B(2i1, 3i2) and B(i1, 3i2) are referenced in Example 5 (Figure 3.3), the stride in B's effective domain with regard to the row dimension is 3. Therefore, we choose $D_B = (0, \tfrac{1}{3})$. Substituting the value of $D_B$ into Equation 3.11, we have $D_A = (0, \tfrac{1}{2})$. Therefore, the alignment functions of A and B can be specified as follows.

\[ \delta_A\!\left(\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}\right) = \tfrac{1}{2} a_2 \qquad \delta_B\!\left(\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\right) = \tfrac{1}{3} b_2 \]

[Figure 3.4. The alignment for A and B in Example 5.]

Figure 3.4 shows the alignment of A and B in Example 5. In Figure 3.4, the effective elements of A and B are covered by solid lines. The effective domain of A only contains elements $(a_1, a_2)$ such that $a_2$ is a multiple of 2. The effective domain of B only contains elements $(b_1, b_2)$ such that $b_2$ is a multiple of 3. Though neither A's effective domain nor B's effective domain is compact, the image on the template space projected by $\delta_A$ and $\delta_B$ is compact.
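The direction constrained by Equation 3.12 can also be obtained numerically: $D_B$ must lie in the left null space of $M = F_{B,1} F_{A,1}^{-1} - F_{B,2} F_{A,2}^{-1}$. The following sketch (illustrative; the SVD route is one of several ways to compute a null space) reproduces that constraint for Example 5.

    # Sketch: compute the admissible direction of D_B for Example 5.
    import numpy as np

    F_B1 = np.array([[2.0, 0.0], [0.0, 3.0]])   # access matrix of B(2*i1, 3*i2) in s1
    F_B2 = np.array([[1.0, 0.0], [0.0, 3.0]])   # access matrix of B(i1, 3*i2)   in s2
    F_A  = np.array([[1.0, 0.0], [0.0, 2.0]])   # access matrix of A(i1, 2*i2)   in s1 and s2

    M = F_B1 @ np.linalg.inv(F_A) - F_B2 @ np.linalg.inv(F_A)
    print(M)                                     # [[1. 0.], [0. 0.]]

    # Left null space of M = null space of M transposed.
    u, s, vt = np.linalg.svd(M.T)
    print(vt[np.isclose(s, 0.0)])                # spans multiples of (0, 1)
    # The thesis then scales this direction by the stride of B's effective domain (3),
    # giving the legitimate solution D_B = (0, 1/3).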
3.2.3 Solving the Base Alignment Equation

If access matrix $F_{A,k}$ is non-singular, Equation 3.7 can be easily transformed into $D_A = D_B F_{B,k} F_{A,k}^{-1}$, and the relationship between $D_A$ and $D_B$ is straightforward. However, solving Equation 3.7 may not be simple when both $F_{A,k}$ and $F_{B,k}$ are singular. For singular matrices $F_{A,k}$ and $F_{B,k}$, let $F_{A,k}^R$ be the right inverse matrix of $F_{A,k}$ and $F_{B,k}^R$ be the right inverse matrix of $F_{B,k}$. Then Equation 3.7 is equivalent to the following two equations

\[ D_A I_{r_A} = D_B F_{B,k} F_{A,k}^R \qquad D_A F_{A,k} F_{B,k}^R = D_B I_{r_B} \]

where $r_A$ is the rank of $D_A$ and $r_B$ is the rank of $D_B$. The method to find a right inverse matrix of a given matrix can be found in [78]. Example 6 (Figure 3.5) is used to illustrate the basic idea.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1)
    s1:   ZZ(i1, i2) = 0
          DO i3 = 0 : n-1
    s2:     ZZ(i1, i2) = ZZ(i1, i2) + XX(i1, i3) * YY(i3, i2)
          END DO
    END FORALL

    Figure 3.5. Example 6: Inner product benchmark loop

In Example 6, the inner DO loop is sequential and thus will not be distributed across the processors. The access matrices are

\[ F_{ZZ,2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad F_{XX,2} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \text{and} \quad F_{YY,2} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}. \]

Consider the equation

\[ D_{ZZ} F_{ZZ,2} = D_{XX} F_{XX,2} \]

for reference "ZZ <- XX@s2". Since $F_{ZZ,2}$ and $F_{XX,2}$ are singular matrices, the above equation is equivalent to

\[ D_{ZZ} I_{r_{ZZ}} = D_{XX} F_{XX,2} F_{ZZ,2}^R \qquad D_{ZZ} F_{ZZ,2} F_{XX,2}^R = D_{XX} I_{r_{XX}} \]

where $r_{ZZ} = r_{XX} = 2$, $F_{XX,2}^R = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}$, and $F_{ZZ,2}^R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$. On substitution of those values, we have

\[ D_{ZZ} = D_{XX} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \qquad D_{ZZ} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = D_{XX} \]

Therefore, the solutions of both $D_{ZZ}$ and $D_{XX}$ have to be of the form (h, 0). A similar approach is used in solving the equation

\[ D_{ZZ} F_{ZZ,2} = D_{YY} F_{YY,2} \]

for reference "ZZ <- YY@s2", since $F_{ZZ,2}$ and $F_{YY,2}$ are singular matrices. The solution specified by this equation requires both $D_{ZZ}$ and $D_{YY}$ to be of the form (0, g). The requirements on $D_{ZZ}$ imposed by the two equations conflict with each other, which implies that interprocessor communication cannot be avoided.

3.3 The Cost of Reorganization Communication

Reorganization communication occurs if Proposition 1 does not hold for a given reference. This section studies the cost of reorganization communication.

3.3.1 Reorganization Communication

Consider reference "X <- Y@s2" in Example 1 (Figure 2.4). In iteration $(i_1, i_2)$, assume that $X(x_1, x_2)$ is referenced on the LHS and $Y(y_1, y_2)$ is referenced on the RHS. Using the access function specification, we have

\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{X,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \qquad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} \]

Since $F_{X,2}$ and $F_{Y,2}$ are invertible, the above equations can be rewritten as

\[ \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad \begin{bmatrix} i_1 \\ i_2 \end{bmatrix} = F_{Y,2}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \]

where $F_{X,2}^{-1}$ is the inverse of $F_{X,2}$ and $F_{Y,2}^{-1}$ is the inverse of $F_{Y,2}$. By substitution, we get

\[ F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = F_{Y,2}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \qquad \text{i.e.} \qquad \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = F_{Y,2} F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

Multiplying both sides of the above equation by $D_Y$, we get

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = D_Y F_{Y,2} F_{X,2}^{-1} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad (3.13) \]

Assume that, given values of $D_Y$ and $D_X$, Proposition 1 does not hold for reference "X <- Y@s2". In other words, $D_X F_{X,2} \neq D_Y F_{Y,2}$. Since $F_{X,2}$ is invertible,

\[ D_X \neq D_Y F_{Y,2} F_{X,2}^{-1} \qquad (3.14) \]

On substitution of $D_Y F_{Y,2} F_{X,2}^{-1}$ from Inequality 3.14, Equation 3.13 can be rewritten as

\[ D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \neq D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

In other words, $\delta_Y((y_1, y_2)) \neq \delta_X((x_1, x_2))$. This means that array elements $(y_1, y_2)$ and $(x_1, x_2)$ are never aligned to the same element in the template. Moreover, by Equation 3.13, we have

\[ \delta_Y\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) - \delta_X\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = D_Y \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} - D_X \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = (D_Y F_{Y,2} F_{X,2}^{-1} - D_X) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \]

By Inequality 3.14, the difference between $D_Y (y_1, y_2)^T$ and $D_X (x_1, x_2)^T$ is an affine function of $(x_1, x_2)$. However, if template elements are distributed in regular patterns, such as block or cyclic, $\delta_Y((y_1, y_2))$ and $\delta_X((x_1, x_2))$ are unlikely to be mapped onto the same processor. This implies that to write an X element, the local processor almost always requires a remote access to the RHS Y element.

Figure 3.6 is used to illustrate the idea. In Figure 3.6, arrays X and Y (in Example 1) are declared as 8 x 8. The template array is one-dimensional and has eight elements. There are four processors, denoted as P(0:3). The template elements are distributed onto these four processors in block. Since $D_X = (1,-1)$ and $D_Y = (1,1)$, X is distributed along the diagonal and Y is distributed along the anti-diagonal.
[Figure 3.6. Reorganization communication for $D_X = (1,-1)$ and $D_Y = (1,1)$ in Example 1.]

Equation 3.7 does not hold for reference "X <- Y@s2". To write an X element on the line $x_1 - x_2 = 1$ (highlighted in dark), a distinct RHS Y element on the line $y_1 = 1$ (highlighted in dark) is accessed. However, all X elements on the line $x_1 - x_2 = 1$ are owned by processor P(0), while six out of seven elements on the line $y_1 = 1$ are owned by the other three processors. Based on the above observation, we have the following proposition.

Proposition 2 Assume that array elements are evenly distributed across the processors. For reference "A <- B@s_k", the cost of reorganization communication is $\frac{n}{p}$ if $D_A F_{A,k} \neq D_B F_{B,k}$, where n is the total number of elements involved in reference "A <- B@s_k" and p is the total number of available processors.

Since p is the total number of available processors and is a fixed constant in the base decomposition analysis, n, the total number of elements involved in reference "A <- B@s_k", determines the cost of reorganization communication. In the rest of this chapter, the cost of reorganization communication is simply measured in n.

Proposition 2 justifies the correctness of the single occurrence transformation with regard to base alignment. In the following loop, the original FORALL assignment s1 references two distinct instances of array Y: Y(i1,i2) and Y(i1, i1+i2).

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s1:   X(i1, i1 + i2) = Y(i1, i2) + Y(i1, i1 + i2)
    END FORALL

In order to distinguish the access matrices of these two instances, let $F_{Y,1} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$ and $F'_{Y,1} = \left[\begin{smallmatrix} 1 & 0 \\ 1 & 1 \end{smallmatrix}\right]$. Assume that $D_X = (1,-1)$ and $D_Y = (1,1)$. Since $D_X F_{X,1} \neq D_Y F_{Y,1}$, by Proposition 2, the number of remote Y elements required to be accessed is $\frac{n}{p}$ with regard to instance Y(i1,i2). Similarly, since $D_X F_{X,1} \neq D_Y F'_{Y,1}$, by Proposition 2, the number of remote Y elements required to be accessed is $\frac{n}{p}$ with regard to instance Y(i1, i1+i2). Moreover, since $D_Y F_{Y,1} \neq D_Y F'_{Y,1}$, the corresponding elements Y(i1,i2) and Y(i1, i1+i2) are allocated to different remote processors. Therefore, the total cost of reorganization communication is $\frac{2n}{p}$. This cost estimation is consistent with the result obtained from the following transformed loop, which has the property of single occurrence.

    FORALL(i1 = 0 : n-1, i2 = 0 : n-1-i1)
    s2:   TT(i1, i2) = Y(i1, i2)
    s3:   X(i1, i1 + i2) = TT(i1, i2) + Y(i1, i1 + i2)
    END FORALL

For this reason, we claim that the single occurrence transformation does not bring any side effect into the base alignment analysis.

3.3.2 The Weighted Cost

In Proposition 2, the total number of elements involved in reference "A <- B@s_k" may not be equal to the number of all elements in array A or B. Typically there are three different cases resulting in different costs of reorganization communication.

First, the number of elements involved in reference "A <- B@s_k" is limited to the effective domain imposed by the loop boundary. In Example 1 (Figure 2.4), suppose that the size of arrays X and Y is $n \times n$. The number of effective elements limited by the loop boundary is only equal to $\frac{1}{2} n^2$.

Second, the number of effective elements involved in reference "A <- B@s_k" can be limited by the probability of the WHERE clause used in a FORALL assignment. For example, in the following code, the total number of elements in array X is 10,000.
    FORALL(i1 = 0 : 100-1, i2 = 0 : 100-1)
          WHERE(X(i1,i2).NE.0)
    s1:     X(i1, i2) = X(i1, i2)**2
          END WHERE
    END FORALL

However, if the probability for the condition in the WHERE clause to be true is only 90%, the cost of reorganization would be only 9,000, provided that the elements whose values are zero are evenly distributed among the processors.

Third, the number of elements involved in reference "A <- B@s_k" can be equal to the number of elements in B, where the size of B is much larger than that of A. For example, in the following code, the inner loop s2 is a sequential loop.

    FORALL(i1 = 0 : 100-1, i2 = 0 : 100-1)
    s1:   X(i1, i2) = Y(i1, i2)
          FOR(i3 = 0 : 100-1)
    s2:     X(i1, i2) = X(i1, i2) + Y(i1, i2, i3)
          END FOR
    END FORALL

Therefore, the cost of reorganization communication is 1,000,000, which equals the number of elements in array Y, if $D_X$ and $D_Y$ are incompatible with respect to statement s2.

3.4 Spanning-Tree Base Alignment Algorithm

3.4.1 Data Reference Graph

The problem of base alignment can be simply modeled by a data reference graph (DRG). Given a program structure, a DRG G = (V, E) is constructed as follows. An array is represented by a node in V. For an array which has multiple instances referenced in one or more FORALL assignments, there is only one corresponding node in the DRG. There is a distinct edge in E connecting two nodes for each reference between the corresponding two arrays. A DRG is undirected.

Figure 3.8(a) shows the DRG for Example 7 (Figure 3.7). In Figure 3.8(a), each array variable is represented by a node labeled by the variable name. Edges are constructed based on the references generated by each FORALL assignment in Example 7. For example, edge (A, X) connects nodes A and X due to reference "A <- X@s2". There is no edge between nodes X and Y because these two arrays are not involved in the same reference. Since there is a one-to-one correspondence between an array and a node, the terms "array" and "node" are used interchangeably in the rest of this chapter. Similarly, since there is a one-to-one correspondence between an edge and a reference, the terms "edge" and "reference" are also used interchangeably in the rest of this chapter.

In Figure 3.8(a), each edge is weighted by the cost of reorganization communication incurred if the alignment matrices of the two arrays connected by the edge are incompatible with respect to the reference represented by the edge. The weight on each edge is decided based on the probability of a true condition in the WHERE clause.

    FORALL(i1 = 0 : 10, i2 = 0 : 10)
    s1:   Y(i1, i2) = A(2i1 + i2, i1 + i2)
          WHERE(A(i1,i2).NE.0)     !HPF the probability of A(i1,i2) != 0 is 83%
    s2:     A(i1, i2) = W(i1, i2) + X(i1 + i2, i1 + 2i2)
          END WHERE
          WHERE(Z(i1,i2).NE.0)     !HPF the probability of Z(i1,i2) != 0 is 83%
    s3:     Z(i1, i2) = B(i1, i2) * Y(i1, i2)
          END WHERE
    s4:   W(i1, i2) = Z(i1 + i2, i1 + 2i2)
          WHERE(B(i1,i2).NE.0)     !HPF the probability of B(i1,i2) != 0 is 53%
    s5:     B(i1, i2) = A(2i1 + i2, i1 + i2)
          END WHERE
          WHERE(X(i1,i2).NE.0)     !HPF the probability of X(i1,i2) != 0 is 53%
    s6:     X(i1, i2) = Z(i1, i2)
          END WHERE
    END FORALL

    Figure 3.7. Example 7: A Lapack benchmark loop

[Figure 3.8. The DRG and base alignment for Example 7. Part (a): the DRG with edge weights 121, 100, and 64; part (b): a max-weight spanning tree alignment and its induced communication subgraph; part (c): an optimal alignment and its induced communication subgraph.]
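The DRG is a simple data structure. The sketch below (not the thesis implementation) builds the weighted graph of Figure 3.8(a) from a list of references; the weights follow the 11 x 11 iteration space of Example 7 scaled by the stated WHERE-clause probabilities (121, 100, and 64).

    # Sketch: constructing the data reference graph of Example 7.
    from collections import namedtuple

    Edge = namedtuple("Edge", "arrays statement weight")
    references = [                      # (LHS, RHS, statement, weight)
        ("Y", "A", "s1", 121), ("A", "W", "s2", 100), ("A", "X", "s2", 100),
        ("Z", "B", "s3", 100), ("Z", "Y", "s3", 100), ("W", "Z", "s4", 121),
        ("B", "A", "s5", 64),  ("X", "Z", "s6", 64),
    ]
    nodes = sorted({a for lhs, rhs, _, _ in references for a in (lhs, rhs)})
    edges = [Edge(frozenset((lhs, rhs)), s, w) for lhs, rhs, s, w in references]

    print("nodes:", nodes)
    for e in sorted(edges, key=lambda e: -e.weight):
        print(sorted(e.arrays), e.statement, e.weight)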
Generally speaking, there may not exist a solution of the alignment matrices such that every reference can be free of reorganization communication. This is because the compatibility requirement imposed by one reference may conflict with that imposed by another reference, in particular when the two edges representing these two references are involved in a cycle. For example, Figure 3.8(a) contains the cycle A -> W -> Z -> Y -> A. By Proposition 1, the four references represented by the four edges in the cycle are free of reorganization communication if and only if the following equations have a non-trivial solution of alignment matrices $D_A$, $D_W$, $D_Z$, and $D_Y$.

\[ D_A F_{A,2} = D_W F_{W,2} \qquad D_W F_{W,4} = D_Z F_{Z,4} \qquad D_Z F_{Z,3} = D_Y F_{Y,3} \qquad D_Y F_{Y,1} = D_A F_{A,1} \]

where $F_{A,2} = F_{W,4} = F_{Z,3} = F_{Y,1} = F_{Y,3} = F_{W,2} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$, $F_{Z,4} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 2 \end{smallmatrix}\right]$, and $F_{A,1} = \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$. Since all access matrices are invertible, the above equations can be re-written as follows.

\[ D_A = D_W F_{W,2} F_{A,2}^{-1} \qquad D_W = D_Z F_{Z,4} F_{W,4}^{-1} \qquad D_Z = D_Y F_{Y,3} F_{Z,3}^{-1} \qquad D_Y = D_A F_{A,1} F_{Y,1}^{-1} \]

By substitution, we get

\[ D_A = D_A F_{A,1} F_{Y,1}^{-1} F_{Y,3} F_{Z,3}^{-1} F_{Z,4} F_{W,4}^{-1} F_{W,2} F_{A,2}^{-1} \]

Using the value of each access matrix, the above equation can be written as

\[ D_A = D_A \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \]

\[ D_A \left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - \begin{bmatrix} 3 & 4 \\ 2 & 3 \end{bmatrix} \right) = 0 \]

\[ D_A \begin{bmatrix} -2 & -4 \\ -2 & -2 \end{bmatrix} = 0 \]

In order to satisfy the above equation, $D_A$ has to be (0,0). This implies that there does not exist a solution of the alignment matrices such that all four references in the cycle are free of reorganization communication. In other words, there exists at least one reference with respect to which any given solution of alignment matrices is incompatible.

3.4.2 Spanning Tree Base Alignment Algorithms

If all access matrices are invertible, a conflict of compatibility requirements can only occur within a cycle of a DRG. Intuitively, such a conflict can be resolved by a spanning tree. Base alignment between two arrays is specified by the tree edge connecting these two arrays using Equation 3.7. As a result, references represented by tree edges are always free of reorganization communication. Each non-tree edge determines a unique fundamental cycle of the DRG with respect to the spanning tree. Reorganization communication may not be avoidable for a non-tree edge. Each edge is weighted. Given a choice between two edges, the tree edge would be chosen as the one with the higher weight. As a result, the spanning tree for base alignment would be chosen as a maximum-weight spanning tree.

For example, Figure 3.8(b) shows a maximum-weight spanning tree for base alignment in Example 7. In Figure 3.8(b), $D_W = D_A$, $D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$, $D_Y = D_A \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$, $D_B = D_A \left[\begin{smallmatrix} 2 & 1 \\ 1 & 1 \end{smallmatrix}\right]$, and $D_Z = D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$. Though (Z, X) is a non-tree edge, $D_X$ and $D_Z$ are compatible with respect to edge (Z, X) because $D_X$ and $D_Z$ satisfy the equation

\[ D_X F_{X,6} = D_Z F_{Z,6} \]

Therefore, non-tree edge (Z, X) is also free of reorganization communication. Given $D_W = D_A$ and $D_X = D_A \left[\begin{smallmatrix} 2 & -1 \\ -1 & 1 \end{smallmatrix}\right]$, edges (W, Z) and (X, Z) are called homogeneous, since the base alignment equations imposed by the two edges are equivalent.

\[ D_X F_{X,6} = D_Z F_{Z,6} \qquad D_W F_{W,4} = D_Z F_{Z,4} \]
Therefore, the total cost of reorganization communication in G is equal to that in G'. The ICS for the maximum—weight spanning tree alignment is shown in Figure 3.8(b). The total cost of reorganization communication in Figure 3.8(b) is equal to 200. The problem of optimizing base alignment is to find a base alignment such that the cost of reorganization communication in the ICS is minimal. Does the maximum-weight spanning tree (MWST) algorithm always minimize the cost of neighboring communication? The answer is no. Figure 3.7(c) shows yet another spanning tree alignment. The cost of reorganization communication in Figure 3.7(c) is only 185 because Dz and Dy are compatible with respect to non- tree edge (X, Y). The failure of the MWST algorithm is due to the assumption that only tree edge is free of reorganization communication. However, in fact not every non-tree edge is necessary included in the ICS. A non—tree edge can also be free of reorganization communication as long as the alignment matrices of two incident arrays are compatible. Next, we introduce a new spanning tree base alignment algorithm which aims to minimize the sum of weights on those non-tree edges included in the ICS. We name it as the minimum-weight ICS (MICS) algorithm. Definition 3.4 Given a DRG G = (V, E), let U be an arbitrary subset of V. Node A has the single-degree connectivity with U if and only ifA is not in U and there is only one node B in U such that edge (A, B) is in E. 70 Definition 3.5 Given a DRG G = (V, E), let U be an arbitrary subset ofV. Node A has the multi-degree connectivity with U if and only if A is not in U and there exist at least two distinct nodes X and Y in U such that edges (A,X) and (A, Y) are in E. Definition 3.6 Given a DRG G = (V, E), let U be an arbitrary subset of V. Node A is a single-degree neighbor ofU ifA has the single-degree connectivity with U. Node B is a multi-degree neighbor of U if B has the multi-degree connectivity with U. For example, in Figure 3.8(a), let U = {A,X,Y}. Thus, W and B are single-degree neighbors of U. Z is a multi-degree neighbor of U since it is incident with both X and Y. Given a DRG G = (V, E), the MICS algorithm can be formalized as follows: (1) T = 43 (2) while T aé V do (3) Let Q1 be the set of single-degree neighbors of T (4) if Q1 # ()5 then (5) Find a node A in Q1 such that edge (A, B) has the maximum weight among all edges incident with one node in Q1 and another in T (6) Define DA such that DAFAJc = DBFBJ; where (A, B) represents reference “A 4— B@sk” (7) T = T U {A} (8) else (9) Let Q2 = V — T (10) Find a node A in Q2 such that the homogeneous edge set, each edge in which is incident with A and a node in T, has the maximum accumulated weight (11) Define DA such that DAFA,’c = DBFBJ; where reference “A «— B@sk” is represented by an edge (A, B) in the above homogeneous edge set (12) T = T U {A} (13) end if (14) end while Figure 3.9. The minimum—weight induced communication algorithm 71 Figure 3.10 illustrates how base alignment is resolved using the MICS algorithm (Figure 3.9). Table 3.1 shows the contents of T, Q1, and Q2 at each step of outmost while loop (lines (2)-( 14)). Initially, T is empty and any node in V is assumed to be a single-degree neighbor of an empty set. In Figure 3.10, the nodes included in T at each step are highlighted. The algorithm starts with X which is arbitrarily selected and T = {X}. In step (b), since the weight of edge (X, A) is greater than that of (X, Z), A is chosen to be a new member of T (line (5)). 
Figure 3.10 illustrates how base alignment is resolved using the MICS algorithm (Figure 3.9). Table 3.1 shows the contents of T, Q1, and Q2 at each step of the outermost while loop (lines (2)-(14)). Initially, T is empty and any node in V is regarded as a single-degree neighbor of the empty set. In Figure 3.10, the nodes included in T at each step are highlighted. The algorithm starts with X, which is arbitrarily selected, so T = {X}. In step (b), since the weight of edge (X, A) is greater than that of (X, Z), A is chosen to be a new member of T (line (5)). Arrays X and A are aligned by $D_A F_{A,2} = D_X F_{X,2}$ (line (6)). For the same reason, in step (c), node Y is included in T because edge (A, Y) has the maximum weight among the edges incident with nodes W, Z, Y, and B, all single-degree neighbors of {A, X} (line (5)). Arrays A and Y are aligned by $D_Y F_{Y,1} = D_A F_{A,1}$ (line (6)). The same procedure repeats in steps (d) and (e): nodes W and B are included in T, respectively. Eventually, node Z becomes a multi-degree neighbor of T.

Figure 3.10. Use of the MICS algorithm to resolve base alignment for Example 7

Table 3.1. Values of T, Q1, and Q2 in executing the MICS algorithm for Example 7

    step      T                       Q1                     Q2     base alignment
    Initial   empty                   {A, W, X, Y, B, Z}     empty
    (a)       {X}                     {A, Z}                 empty
    (b)       {X, A}                  {W, Y, B, Z}           empty  D_A = D_X F_{X,2}
    (c)       {X, A, Y}               {W, B}                 {Z}    D_Y = D_A F_{A,1}
    (d)       {X, A, Y, W}            {B}                    {Z}    D_W = D_A
    (e)       {X, A, Y, W, B}         empty                  {Z}    D_B = D_A F_{A,5}
    (f)       {X, A, Y, W, B, Z}      empty                  empty  D_Z = D_Y

In step (f), Q1 is empty. By line (9), Q2 = V - T = {Z}. Node Z is incident with two homogeneous edge sets. One set includes edges (W, Z) and (X, Z). The other includes edges (Y, Z) and (B, Z). The accumulated weight of the set {(W, Z), (X, Z)} is equal to 185, while the accumulated weight of the set {(Y, Z), (B, Z)} is equal to 200. Since nodes W, X, Y, and B are all included in T, by line (11), $D_Z$ is chosen such that $D_Z F_{Z,3} = D_Y F_{Y,3}$, because edge (Z, Y) is in the homogeneous edge set with the larger weight.

The MICS algorithm (Figure 3.9) is an improvement of the MWST algorithm. If Q1 is not empty, like the MWST algorithm, the tree edge is selected as an edge with the largest weight among all the edges which are incident with one node in Q1 and another node in T (lines (4)-(7)). If Q1 is empty but Q2 is not, the alignment matrix is determined such that every edge in a homogeneous edge set with the maximum accumulated weight is free of reorganization communication. By its construction, the MICS algorithm is superior to the MWST algorithm in general. If the heap-sort algorithm [79] is used in lines (5) and (10), the time complexity of finding the maximum-weight edge or homogeneous edge set can be reduced to O(log|E|), where |E| is the number of edges in DRG G = (V, E). As a result, the time complexity of the MICS algorithm is O(|E| log|E|).

3.4.3 Experimental Results

[Figure 3.11. Comparison of the MWST algorithm and the MICS algorithm on 16-node nCUBE-2 (communication cost in msec versus number of processors; benchmark loop: Example 7).]

Figure 3.11 shows the comparison of communication cost between the MWST algorithm and the MICS algorithm on the 16-node Purdue nCUBE-2. Example 7 (Figure 3.7) is used as the benchmark loop in our experiment. We increase the iteration space in Example 7 by setting the loop boundary to (i1 = 0 : 34, i2 = 0 : 34). In Figure 3.11, when the number of processors is 16, the size of the messages sent out from each processor reaches its minimum and the number of messages reaches its maximum. Since the startup software latency is much more expensive than the network latency in message transmission on the Purdue nCUBE-2, the startup latency dominates the overall communication latency when the message size is relatively small. This explains why the overall communication overhead increases when the number of processors becomes 16.

3.5 Optimizing RHS Expression Evaluation

3.5.1 RHS Expression Evaluation Optimization

In the MICS algorithm, it is assumed that all the operations in a FORALL assignment are performed on the local processor which owns the LHS operand.
If a RHS operand is owned by a remote processor, message passing is invoked in order to first transfer the operand to the local processor. However, this limitation, known as the owner-computes rule, can be exceeded by evaluating different parts of the RHS expression of a FORALL assignment on different processors. Given a FORALL assignment which consists of one kind of associative and commutative operation, data movement can be minimized by an optimal evaluation tree in which an intermediate result may be evaluated by a remote processor rather than the one which owns the LHS operand.

    FORALL(i1 = 0 : 9, i2 = 0 : 9)
    s1:   A(i1, i2) = B(i1, i2) * X(i1, i2) * Y(i1, i2) * Z(i1, i2)
    s2:   B(i1, i2) = A(2i1 + i2, i1 + i2) + X(2i1 + i2, i1 + i2)
                    + Y(2i1 + i2, i1 + i2) + Z(i1, i2)
    END FORALL

    Figure 3.12. Example 8: A Splash benchmark loop

In Example 8 (Figure 3.12), it is assumed that base alignment is pre-determined such that $D_A F_{A,1} = D_B F_{B,1} = D_X F_{X,1} = D_Y F_{Y,1} = D_Z F_{Z,1}$. Therefore, FORALL assignment s1 is free of reorganization communication. Nevertheless, reorganization communication cannot be eliminated in FORALL assignment s2, because $D_B F_{B,2} \neq D_A F_{A,2}$, $D_B F_{B,2} \neq D_X F_{X,2}$, and $D_B F_{B,2} \neq D_Y F_{Y,2}$. If the owner-computes rule is followed, remote elements A(2i1+i2, i1+i2), X(2i1+i2, i1+i2), and Y(2i1+i2, i1+i2) have to be transferred to the local processor which owns the LHS B(i1, i2). As a result, the cost of reorganization communication is equal to 300.

[Figure 3.13. Optimal evaluation trees for Example 8. For statement s1, B(i1,i2), X(i1,i2), Y(i1,i2), and Z(i1,i2) are combined locally into A(i1,i2). For statement s2, A(2i1+i2,i1+i2), X(2i1+i2,i1+i2), and Y(2i1+i2,i1+i2) are first summed into TT(2i1+i2,i1+i2) on a remote processor, which is then combined with Z(i1,i2) into B(i1,i2).]

However, closer inspection reveals that $D_A F_{A,2} = D_X F_{X,2} = D_Y F_{Y,2}$. This implies that the sum of A(2i1+i2, i1+i2), X(2i1+i2, i1+i2), and Y(2i1+i2, i1+i2) can first be calculated on a remote processor without any data movement. Figure 3.13 shows such an optimal evaluation tree for Example 8. As shown in Figure 3.13, TT(2i1+i2, i1+i2), a temporary variable which stores the sum, is then added to Z(i1, i2) by passing a message to the local processor which owns B(i1, i2). In Figure 3.13, each arc represents a reference relationship from one array to the other. Each arc is weighted by the cost of reorganization communication incurred if the alignment matrices of the two corresponding arrays are incompatible with respect to the reference represented by the arc. The optimal evaluation of Example 8 can be rewritten as follows:

    FORALL(i1 = 0 : 9, i2 = 0 : 9)
    s1:   A(i1, i2) = B(i1, i2) * X(i1, i2) * Y(i1, i2) * Z(i1, i2)
    s2.1: TT(2i1 + i2, i1 + i2) = A(2i1 + i2, i1 + i2) + X(2i1 + i2, i1 + i2)
                                + Y(2i1 + i2, i1 + i2)
    s2.2: B(i1, i2) = TT(i1, i2) + Z(i1, i2)
    END FORALL

The original assignment s2 is transformed into two assignments, s2.1 and s2.2. The alignment matrix of TT is determined as $D_{TT} = D_A$. Statement s2.1 is free of reorganization communication. Reorganization communication only occurs in assignment s2.2. The total cost of reorganization communication is reduced to 100.

Generally speaking, given a FORALL assignment, the RHS expression evaluation can be optimized using the following rule.

Proposition 3 Assume that in a given FORALL assignment, the RHS expression is operated on by one kind of associative and commutative operation. Each group of RHS array operands whose alignment matrices are compatible with one another should be evaluated together. Only the intermediate results need to be transmitted to the local processor which owns the LHS operand if the intermediate results are generated on a remote processor.

In order to take advantage of optimal expression evaluation, we assume that the original program has been pre-processed by transforming each original FORALL assignment into an equivalent set of FORALL assignments, each of which has its RHS expression operated on by one kind of associative and commutative operation.

3.5.2 Alignment Graph

An alignment graph (AG) is used to model the alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, $G = (V_a, V_s, E)$. An array is represented by a node in $V_a$. A statement is represented by a node in $V_s$. An undirected edge in E connects an array and a statement if the array is referenced in the statement.
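Because an AG is just a bipartite graph, it can be built directly from the per-statement reference lists, as the small sketch below illustrates for the loop of Example 9, shown next (Figure 3.14); the sketch is illustrative only.

    # Sketch: the bipartite alignment graph of Example 9.
    refs = {"s1": ["A", "B", "X", "Y"],
            "s2": ["B", "A", "X", "Y"],
            "s3": ["X", "B", "A"]}
    Va = sorted({a for arrays in refs.values() for a in arrays})   # array nodes
    Vs = sorted(refs)                                               # statement nodes
    E = [(a, s) for s in Vs for a in refs[s]]                       # one edge per reference
    print(Va, Vs)
    print(E)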
Only the intermediate results needs to be transmitted to the local processor which owns the LHS operand if the intermediate results are generated on a remote processor. In order to take advantage of optimal expression evaluation, we assume that the original program has been pre-processed by transforming each original FORALL as- signment into an equivalent set of FORALL assignments, each of which has the RHS expression operated by one kind of associate and commutative operations. 3.5.2 Alignment Graph An alignment graph (AG) is used to model the alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, G = (Va, V,, E). An array is represented by a node in Va. A statement is represented by a node in V,. A undirected edge in E connects an array and a statement if the array is referenced in the statement. Figure 3.15 shows the AG for Example 9 (Figure 3.14). 77 FORALL(2'1 = 0 : 9, i2 = 0 : 9) 812 A(ihig) = B(i1,i2) * X(i1,i2) * Y(i1, 22) .92: B(i1,i2) = A(i1 +i2,i1) + X(i1 + i2,i1) + Y(i1 + 2'2, 2'1) 833 X(i1,i2) = B(i1,i2) * A(Zl + t2,i1) END FORALL Figure 3.14. Example 9: An Electric benchmark loop Figure 3.15. Alignment graph for Example 9 3.5.3 AG Base Alignment Algorithm The AG-based base alignment (AGBA) algorithm is shown in Figure 3.16. In an AG G = (Va,V,, E), there is a set, denoted as Qk, associated with each node 3,, in V,. Initially, each set Q. is empty. During the execution, each Oh will contain elements of type DAFAJc. Each element of Qk, DAF A1,, is further weighted by the number of effective elements in array A referenced in assignment 3],. By Proposition 3, a group of the RHS array operands with compatible alignment matrices can be evaluated together. Thus, the cost of reorganization communication only depends on the effective elements in the intermediate results, rather than the accumulation of effective elements among all the RHS array operands. For this reason, with respect to a single FORALL assignment, the fewer the number of the distinct groups of compatible alignment matrices among all the RHS array operands, the less 78 (1) for each node A in Va (2) T = <25 (3) for each neighboring node 3;, (in V,) of A (4) for each element DBFBJ: in Q, (5) T = T U DBFB,kFX,i¢ (6) endfor (7) endfor (8) Choose DA = D131"‘153,k1",;}c such that DBFB,]¢FA_’}c is the member in T which has the maximum accumulated-weight (9) for each neighboring node 3;. (in V,) of A (10) If DAFA'k is not in Qk then (11) Q]: = Qk U {DAFAJc} (12) endif (13) endfor (14) endfor Figure 3.16. The alignment graph base alignment algorithm the cost of reorganization communication. In Figure 3.16, Qk records the distinct groups of compatible alignment matrices among all operands in statement 3;, (lines (10)-(12)). When array A is referenced in statement 31,, we hope that DAFA,k can be chosen in the way that it matches to another element DBFBJ, existing in Q, and the size of Q. is not increased. In other words, DA should be compatible with DB. Therefore, no extra reorganization communication occurs since A and B can be evaluated together and the communication cost of moving intermediate results is equal to that of moving the operand B. On the other hand, however, array A may be referenced in multiple statements and the compatibility requirements imposed by different FORALL assignment may conflict with one another. T collects all the types of different DB F MFR}. 
In Figure 3.16, $Q_k$ records the distinct groups of compatible alignment matrices among all operands in statement $s_k$ (lines (10)-(12)). When array A is referenced in statement $s_k$, we hope that $D_A F_{A,k}$ can be chosen in such a way that it matches another element $D_B F_{B,k}$ already present in $Q_k$, so that the size of $Q_k$ is not increased. In other words, $D_A$ should be compatible with $D_B$. Then no extra reorganization communication occurs, since A and B can be evaluated together and the communication cost of moving the intermediate results is equal to that of moving the operand B. On the other hand, however, array A may be referenced in multiple statements, and the compatibility requirements imposed by different FORALL assignments may conflict with one another. T collects all the distinct values of $D_B F_{B,k} F_{A,k}^{-1}$ for every statement $s_k$ in which both array A and array B are referenced. Note that T may contain multiple identical elements, each of which comes from a distinct FORALL assignment. As a result, $D_A$ is chosen to be the member in T which has the maximum accumulated weight (line (8)). The accumulated weight refers to the weight accumulated among all identical elements. Thus, the maximum amount of reorganization communication is eliminated.

Table 3.2 illustrates how the AGBA algorithm works using Example 9. It shows the content of each $Q_k$ at each step of the outermost loop between lines (1)-(14) in Figure 3.16. The algorithm begins with A. After executing lines (9)-(13), Q1, Q2, and Q3 become $\{D_A F_{A,1}\}$, $\{D_A F_{A,2}\}$, and $\{D_A F_{A,3}\}$, respectively, where $F_{A,1} = \left[\begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix}\right]$ and $F_{A,2} = F_{A,3} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. Next, B is selected. After executing lines (3)-(7), $T = \{D_A F_{A,1} F_{B,1}^{-1}, D_A F_{A,2} F_{B,2}^{-1}, D_A F_{A,3} F_{B,3}^{-1}\}$. Since $F_{B,1} = F_{B,2} = F_{B,3} = I$,

\[ T = \left\{ D_A,\; D_A \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix},\; D_A \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \right\} \]

Consequently, $D_A \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$ is the member in T which has the larger accumulated weight. Thus, $D_B = D_A \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$ (line (8)). Continuing the algorithm, $D_X$ and $D_Y$ are obtained likewise.

Table 3.2. Resolving base alignment for Example 9 (M denotes $\left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$)

    step   alignment matrix   Q1              Q2        Q3
    A      D_A                {D_A}           {D_A M}   {D_A M}
    B      D_B = D_A M        {D_A, D_A M}    {D_A M}   {D_A M}
    X      D_X = D_A          {D_A, D_A M}    {D_A M}   {D_A M, D_A}
    Y      D_Y = D_A          {D_A, D_A M}    {D_A M}   {D_A M, D_A}

In Figure 3.16, the order in which $V_a$ nodes are visited (line (1)) is important in minimizing reorganization communication. An efficient heuristic approach can be described as follows. Let U be a subset of $V_a$. A node in $V_s$ is a neighboring node of U if this node is connected to a node in U. Initially, U is empty. The algorithm begins with a node in $V_a$, say A, which has the maximum connectivity degree; U = U + {A}. Then, in each of the remaining steps, a node in $V_a - U$, say B, is selected such that B is connected to a neighboring node of U, say $s_k$, where $s_k$ has the minimum connectivity degree among those neighboring nodes of U; U = U + {B}. This procedure repeats until U contains every node in $V_a$. If the heap-sort algorithm [78] is used in line (8), the time complexity of finding the element with the maximum accumulated weight is reduced to O(log|E|), where |E| is the number of edges in AG G = (V, E). As a result, the time complexity of the AGBA algorithm is O(|E| log|E|).

3.5.4 Experimental Results

[Figure 3.17. Comparison of the MWST, MICS, and AGBA algorithms on 16-node nCUBE-2 (communication cost in msec versus number of processors; benchmark loop: Example 9).]

Figure 3.17 shows the comparison of communication cost among the MWST, MICS, and AGBA algorithms on the 16-node nCUBE-2. Example 9 (Figure 3.14) is used as the benchmark loop in our experiment. We increase the iteration space in Example 9 by setting the loop boundary to (i1 = 0 : 24, i2 = 0 : 24). The AGBA algorithm outperforms the other two algorithms. In Figure 3.17, as the number of processors increases, the number of messages sent out from each processor increases while the size of each message decreases. The startup software latency is much more expensive than the network latency in message transmission on the Purdue nCUBE-2. Therefore, the startup latency dominates the overall communication latency when the message size is relatively small.
Since the amount of communication generated by the proposed MICS and AGBA algorithms is small compared with that of the MWST algorithm, the performance of the MICS algorithm and the performance of the AGBA algorithm are both dominated by the startup latency and become close to each other once the number of processors exceeds 8.

3.6 Avoiding Redundant Communication

3.6.1 Redundant Communication

The same context of array elements may be referenced in more than one FORALL assignment. If these elements reside on remote processors, a local processor should receive a single copy of these elements rather than multiple identical copies. This technique is known as redundant communication avoidance. The identification of redundant communication not only reduces the communication overhead following the data decomposition phase, but also has a great impact on the alignment analysis. In this section, we focus on the issues of avoiding redundant reorganization communication.

In Example 10 (Figure 3.18), assume that $D_B = D_Z = D_A$. As a result, $D_B F_{B,2} \neq D_A F_{A,2}$ and $D_Z F_{Z,3} \neq D_A F_{A,3}$, where $F_{A,2} = F_{A,3} = \left[\begin{smallmatrix} 1 & 1 \\ 1 & 0 \end{smallmatrix}\right]$. In other words, remote element A(i1+i2, i1) is referenced in both statements s2 and s3. However, $D_B = D_Z$ implies that the LHS element B(i1,i2) in statement s2 and the LHS element Z(i1,i2) in statement s3 reside on the same local processor. A(i1+i2, i1) is required to be sent from the same remote processor to the same local processor in executing both statements s2 and s3. Since A is not written between statements s2 and s3, the message transmission of remote element A(i1+i2, i1) in statement s2 is redundant with that of remote element A(i1+i2, i1) in statement s3. These two messages transmitting the same A(i1+i2, i1) comprise redundant communication.

    FORALL(i1 = 0 : 10, i2 = 0 : 10)
          WHERE(X(i1,i2).NE.0)     !HPF the probability is 83%
    s1:     X(i1, i2) = A(i1, i2)
    s2:     B(i1, i2) = A(i1 + i2, i1)
    s3:     Z(i1, i2) = A(i1 + i2, i1) * B(i1, i2)
          END WHERE
    s4:   A(i1, i2) = Z(i1, i2)
    s5:   B(i1, i2) = X(i1, i2) + A(i1, i2)
    END FORALL

    Figure 3.18. Example 10: An Oceanwater benchmark loop

The identification of redundant communication relies on both the location and the context. The location requires that the senders of the redundant communication be the same, and likewise the receivers. The location requirement can be judged by the alignment matrices. In Example 10, since $D_B F_{B,2} = D_Z F_{Z,3}$, elements B(i1,i2) and Z(i1,i2) are
B has two single assignment blocks:{s2,s3} and {.35}. Z has one single assignment block: {33, .94}. The single assignment block is based on the def/use flow for array variables. Therefore, the identification of a single assignment block can be easily obtained using data flow analysis algorithms [63]. The single assignment block has the following important property: Proposition 4 The context of an array element is not changed within the same single assingment block. The contexts of the same array element in difl'erent single assingment blocks are difi'erent. 3.6.2 Enhanced Alignment Graph An alignment graph can be enhanced to model the use/def flow for array variables by using the concept of single assignment block. The definition of an enhanced 84 alignment graph (EAG) G = (Va,V,, E,A,) is similar to that of an alignment graph G = (Va,V,, E) except that an EAG has an extra arc set A,. The definition of Va, V,, and E can be found in Section 3.5. There is an arc in set A, from node 3( to node 3;, if an array defined in statement 3; is used in statement 3],. Figure 3.19 shows the EAG for Example 10 (Figure 3.18). 1,5,1, __ kl, I \ ( X) l“ - ‘~ , " I it 3 l ,._+ ,\ _‘ ,,.-+ as 0. For other cases, see Section 3.3. 99 A closer inspection of many benchmark programs reveals that the density of dif- ferent template elements can be different. The density function, MAJ“ is defined such that the value of wA,k(t_) is the number of effective elements in array A which are mapped to template element T(t) with respect to FORALL assignment 3k. Consider Example 3 (Figure 2.8). Suppose alignment functions of X and Y are defined as follows: 171 i=6x( )=$2+1 $2 yl t=5Y( )=yl+312 312 where T(t) is a template element. Thus, X is partitioned in columns and Y is partitioned along diagonal. Alignment offsets d x = 1 and dy = 0 imply that the (k + 1)-th column in X is aligned with the k—th off-diagonal. By the definity of density function, we have wx,2(t) = max{0,t — 1} wY,2(t) = t Since the density function is not uniform, the amount of neighboring communication between different pairs of neighboring processors is different. In this thesis, neighbor- ing processors refer to processors P(i) and P(i + 1). Figure 4.3 shows the amount of neighboring communication between each pair of neighboring processors. In Figure 4.3, arrays X and Y are declared as 8 x 8 and the number of available processors is 3. Each array element is represented by a cirle. In each array, elements mapped to the same element in the template are connected by the same solid line. Only effective elements are covered by solid lines. By Proposition 6, 100 neigboring communic on Figure 4.3. Neighboring communication in Example 3 the basic cost of neighboring communication is equal to 1. Figure 4.3 gives the real .meaning which the value 1 stands for. In order to define the LHS X elements on a column, each processor requires to access remote Y elements which form one off- diagonal line. For example, in order to write elements X (331,1:2) on line x2 = 5, processor P(2) accesses elements Y(y1,y2) on line y] + y2 = 3 which are owned by remote processor P (1) Therefore, the amount of neighboring communication between processors P(1) and P(2) is equal to the number of elements on line yl + y2 = 3, which is 4. Similarly, the amount of neighboring communication between processors P(O) and P(1) is equal to 2. Note that the amount of neighboring communication between P(O) and P(1) is less than that between P(1) and P(2). 
Since the messages for neighboring communication can be transmitted simultaneously among pairs of neighboring processors, the cost of neighboring communication among all processors Should be equal to the maximum amount of neighboring communication per pair of 101 processors. Thus, the actual cost of neighboring communication is 4 in Figure 4.3. This justifies the reason that the basic cost can be used to measure the actual cost of neighboring communication. Like the cost of reorganization communication, the cost of neighboring commu- nication can be weighed by various array sizes. For example, in the following code segment, loop 32 is a sequential loop and can not be parallelized among the processors. ALIGN Y(j) WITH X(j) ALIGN Z(j) WITH X(j) FORALL(2'1 = 0 : 10) 31: X(i1) =Y(i1+1) FOR(22 = 0 1 10) 321 X(21)=X(21)+Z(21+1,22) END FOR END FORALL Y( j ) is aligned with X (i), so is Z (2) By Proposition 5, the basic cost of neighboring communication involved in reading Y(i1 + 1) is 1, and the basic cost of neighboring communication involved in reading Z (i1 +1, i2) is also 1. However, in the former case, the actual cost of neighboring communication is 1. In the later case, the actual cost of neighboring communication is 10 since the whole column in Z is mapped to the same template element. In the rest of this Chapter, we use the weight of the basic cost to formalize the impact of the density function and different array sizes on the actual amount of neighboring communication imposed by a given offset alignment. The actual amount of neighboring communication will be represented by the so-called weighted basic cost. 102 4.3 The Impact of Access Offset 4.3.1 Piecewise Linear Cost Function The following code pattern is very common in scientific application programs. Multiple instances of array Y referenced on the RHS have the same access matrix FORALL(z'1 = 0 : 10) 51: X(i1) =Y(i1)+Y(i1+1)+Y(i1+2) END FORALL but different access offsets. The access offset of instance X (i1) is 0, the access offset of instance X(i1 + 1) is 1, and the access offset of instance X(i1 + 2) is 2. Using the notation from the previous section, assume that there are totally q processors and template elements T(m, + 1 : m,“ — 1) are owned by the local processor P(i) where0 =mo < m1 < m2 < < mq_1 < m, =n—1.Sincet=6x(x)= x+dx, the X elements owned by processor P(i) are X (m,- — d x : m,+1 — l — d X). Conse- quently, by the construction of statement 31 , processor P (z ) requires to access elements Y(m.- —dx : m,“ —1-dx) with respect to instance Y(i1), Y(m,- —dx +1 :m;+1 —dx) with respect to instance Y(i1 + 1), and Y(m.~ — dx + 2 : m,“ — dx + 1) with respect to instance Y(i1 + 2). On the other hand, since t = 6y(y) = y + dy, the Y elements owned by processor P(i) are Y(m,- — dy : m,“ -- 1 — dy). Consider the remote Y elements accessed by processor P(i). If d X < dy, remote elements Y(m.-+1 — dy : m,“ — dx) are accessed with respect to instance Y(i1). Remote elements Y(m,-+1 — dy : m,“ — dx + 1) are accessed with respect to instance Y(i1 + 1). Remote elements Y(m;+1 — dy : m,“ — d x + 2) are accessed with respect to instance Y(i1 +2). Elements in these three segments overlap. 103 The union of these three segments is Y(m.-+1 — dy : m,“ — dx + 2). Therefore, the cost of neighboring communication is equal to dy — dx + 2 if d X < dy. If dx > dy + 2, remote elements Y(m.~ — dx : m,- — 1 — dy) are accessed with respect to instance Y(i1). Remote elements Y(m,~ — d X + 1 : m, — 1 — dy) are accessed with respect to instance Y(i1 + 1). 
Remote elements Y(m,- — dx + 2 : m.- — 1 — dy) accessed with respect to instance Y(i1 + 2). The union of these three segments is Y(m,~ — d X : m.- — 1 — dy). Therefore, the cost of neighboring communication is equal to dx—dy ifdx >dy+2. In particular, if dx = dy, remote elements Y(m,- — dy) and Y(m,- — dy + 1) are accessed with respect to instances Y(i1 +1) and Y(i1 +2), respectively. If dy +1 2 dx, remote elements Y(m,- —dx - 1) and Y(m,- —dy) are accessed with respect to instances Y(i1) and Y(i1 + 2), respectively. If dy + 2 = dx, remote elements Y(m,- — dx — 2) and Y(m.~ — dx —- 1) and Y(m.- — dy) are accessed with respect to instances Y(i1) and Y(i1 + 1), respectively. Overall, under these three special conditions, the cost of neighboring communication is equal to 2. Let cBJc be the cost of neighboring communication involved in accessing all in- stances of array B referenced on the RHS of statement 3],. In the above example, we have the following result. 2—(dx-Cly) when (dx—dy) <0 CY.1 = 2 when 0 S (dx — dy) S '2 (4.15) (dX—dy) when 2<(dx —dy) Unlike the cost of reorganization communication, Equation 4.15 shows that the cost of neighboring communication cannot be treated independently for each distinct in- stance when there are multiple instances for the same array referenced in the same statement. In the above example, if the cost of neighboring communication regard- ing to instances Y(i1), Y(i1 + 1), and Y(i1 + 2) were estimated independently by 104 Proposition 5, the overall cost would be Idx —dY|+|dx —dY+1I+|dx —dY+2I The value of the above equation would be three times as much as CY’I. For this reason, the single occurrence transformation is not used in the offset alignment analysis since it can change the cost estimation and thus affect the result of offset alignment. For ease of explanation, in the rest of this chapter, we use the symbol “A <— B@sk” to represent the references of all distinct instances of array B in order to write the LHS A elements in statement sk. For the sake of consistency, the symbol “A +- B@sk” is still called a reference. Equation 4.15 can be extended as follows: Proposition 7 Assume that in a FORALL assignment 3],, all instances of array B are referenced on the RHS. These instances are represented by B(i1 + r1), B(i1 + r2), . ., and B(i1 + r5) where r1 < r2 < . .. < rt. The access offset tuple is denoted as up,’‘ = {r1, r2, . . . rt}, which is distinguished by the array variable B and the statement number 1:. c1“, the cost of neighboring communication generated by accessing all B instances in executing statement 3),, can be formalized by the following piecewise linear equation. rt — (dA — d3) when (dA — (13) < r1 CBJ: = 7‘( — 7‘1 when 7‘1 S (dA — dB) S 1‘3 (4.16) (dA — dB) — 1‘1 when 7“: < (61,; — (13) where array A is referenced on the LHS. For simplicity, 63,}; is called the cost function. Proposition 7 can be easily extended to the case where the array referenced on the LHS has different instance(s) referenced on the RHS in the same statement. Assume that A(i1 + r,-) is referenced on the LHS and other instances of A referenced on the RHS are A(i1 + r1), ..., A(i1 + rj-1), A(i1+ n+1), ..., and A(i1 + r;) where r1 < 105 . . . < rJ-_1 < r,- < rj+1 < . . . < r1. By using the similar analysis as in Equation 4.16, the cost of neighboring communication involved in accessing A elements is equal to r; — r1 no matter whatever dA is. In other words, the cost is a constant which is independent with the alignment of array A. 
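The shape described by Equation 4.16 is easy to evaluate directly. The following Python fragment is a minimal sketch of that cost function (the function name and argument layout are ours and are not taken from any compiler described here): given d = d_A - d_B and the access offset tuple u_{B,k}, it returns c_{B,k}.

def neighboring_cost(d, offsets):
    # offsets is the access offset tuple u_{B,k}; d is d_A - d_B.
    # Piecewise linear cost of Equation 4.16: the size of the union of the
    # remote segments accessed for all instances B(i1 + r), r in offsets.
    lo, hi = min(offsets), max(offsets)
    if d < lo:
        return hi - d          # all accessed remote elements lie beyond the right end point
    if d > hi:
        return d - lo          # all accessed remote elements lie beyond the left end point
    return hi - lo             # flat region between the two end points

# Example: for the tuple (0, 1, 2) of statement s1 above,
# neighboring_cost(d, (0, 1, 2)) equals 2 - d for d < 0, 2 for 0 <= d <= 2,
# and d for d > 2, which matches Equation 4.15.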
4.3.2 Properties of Piecewise Linear Cost Function In the rest of this section, we explore more properties of the piecewise linear cost function. Consider the following program structure. Assignment 3; is free of neigh- 313 X(i1)=Y(i1)+Y(i1+1)+Y(i1+2) 32: Z(i1) = X(i1) * *2 33: X(i1) =X(i1)+Y(z'1 +4) +Y(z'1 +5)+Y(z'1 +7) END IF END FORALL boring communication as long as dz = dx. Consider the relationship of dy and dx. Since uy,1 = {0,1, 2}, we have 2 — (dx —- dy) When (dx - dy) < 0 CY,1 = 2 when 0 S (dx — dy) S 2 (dx -— dy) When 2 < (dx — dy) Since uy,3 = {4,5, 7}, we have 7 — (dx — dy) when (dx — dy) < 4 CY,3 : 3 when 4 S (dx —dy) S 7 (dx — Cly) — 4 when 7 < (dx — dy) 106 The task of offset alignment is to find the optimal value of d X — dy such that the sum of cm and Cy,3 is minimal. Figure 4.4 shows the sum of eyJ and ey,3 in a coordinate system. “"[‘— dx’dv _|' I I l i I +i_ — ———++++ .[ 7 Figure 4.4. The sum of CY’l and ey,3 Studying Figure 4.4, we obtain the following results. If the position of dx —- dy is chosen between 2 and 4, such as the postion g, the value of Cy,1 is (g - 2) + 2 and the value of ey,3 is (4 — g) + 3. As a result, the sum of CY'l and ey,3 is equal to (4 - g) + (g — 2) + 2 + 3, or (4 — 2) + 2 + 3. Note that (4 —- 2) is the distance between the maximum element in tuple uy,1 and the minimum element in tuple uy,3. If the position of d X — dy is chosen to the left of 2, such as the position f, the value of cyl is (2 — f) + 2 and the value of ey,3 is (4 — f) +3. Since f extends left to 2, (4 - f), the distance between f and 4, is greater than 2, the distance between 2 and 4. Therefore, 107 the sum of (2 -— f) + 2 and (4 — f) + 3 is greater than (4 —— 2) + 2 + 3. On the other hand, if the position of d X — dy is chosen to the right of 4, such as the position h, the value of eyJ is (h — 2) + 2 and the value of ey,3 is (h — 7) + 3. Since h extends right to 4, (h — 2), the distance between 2 and h, is greater than 2, the distance between 2 and 4. Overall, we conclude that the minimal value of ey,1 + Cy,3 is 7 when dX — dy is chosen as any integer between (including) 2 and (including) 4. We further notice that the middle elements, 1 in uy,1 and 5 in uy,3, do not affect the shape of the piecewise linear function and thus are not relevant to the decision making for the minimal sum of the cost. For this reason, the minimum and maximum elements in an access offset tuple are sufficient to identify a piecewise linear cost function. For this reason, the cost function cBJc is be denoted as 63,]; =< 11;,k,r1_:3,;c > where 13,], is the minimum element and 78,]; is the maximum element in u3,k. Element 13,}; is called the left end point and element uBJc is called then right end point. Both (3,]: and uBJc are called end points for simplicity. Using the similar analysis as in Figure 4.4, we can obtain the general results as follows. Proposition 8 Assume that cBJ- is the cost function for accessing remote B elements in executing reference “A +— B@sj ” and 63,]; is the cost function for accessing remote B elements in executing reference “A (— B@sk”. Let cad =< l3,j,r3,j > and 63,]; =< IB,k,r3,k >. If l3,,- < T3,)” the sum of cBJ- and 63,]: is minimized when dA — dB is chosen as any integer between rBJ- and 131.. The minimum value of the sum of egg- and 03,1: is equal to TB,j — 13$. Figure 4.4 only shows the case where TB’J' < lB,k. Proposition 8 also holds for both case 7‘8,j < (19,;c and case TBJ‘ > 13¢, as shown in Figure 4.5(a) and (b). 
A close inspection of Figure 4.5 reveals that the sum of clay and CBJc can be represented by the piecewise linear function when dA — dB varies between lBJ' and 108 'T dA-dB n l [lll-lll;_i .. _ r8 _ < IKI k ll. I] 18 B. .. C 1 e + s n m _ a _ B. C , ['8']. B d _A d /. k B. m C m l < . a I ..m +,. _ _ a ,_l I. lllll l_ m M I I L. _ m... I ...x C rB.k lB,k 1'8". 13.] Figure 4.5. The sum of cBJ- and C3,]; ‘1" a, 109 T3,]? This can be formalized as follows. Proposition 9 Assume that CB,j is the cost function for accessing remote B elements in executing reference “A (— B@sj ” and CBJc is the cost function for accessing remote B elements in executing reference “A +— B@sk”. Let cad =< IB,J-,r3,j > and C3,]; =< l3,k,r3,k >. If lgd- < TB'k, the sum of CB“,- and C3,]; have the following property. 03,1 + 63,1: = (dA - dB) - ’34 + (Ta/c - Ink) max{TB,j,lB,k} < (dA - 0'3) S. TBJ: T3,, — lg.)c min{r3,,-, lB,lc} S ((1,; — (13) S max{rBJ, lBJc} TBJ: — (dA — 613) + (raj —- 13,1) lB,j S (61.4 — £13) < min{r3,j,13,k} For example, in Figure 4.4, the sum of cm and Cy,3 can be represented as the following piecewise linear function where the value of dx — dy varies between 0 and 7. T—(dx—dy)+2 When 0S(dx—dy)<2 CY,1+CY,3= 7 when 2 S(dx —dy) S4 (dx—dy)+3 When 4((dx—dy)S7 Proposition 9 is important in determining offset alignment between arrays A and B such that the sum of the cost of neighboring communication with regard to multiple FORALL assignments is minimal. More precisely, the problem can be formalized as follows. Assume that there are m distinct references “A +— B@s,-” where ch =< lB,j,1'B’j > for 1 _<_ j g m. The following algorithm can be used to find the value of dA — d3 such that 2;, cBJ- can be minimized. The minimal-cost pairwise offset alignment (MPOA) algorithm is shown in F ig- ure 4.6. In Figure 4.6, piecewise linear cost function Gay is monotonously increasing when dA — dB extends right to r31 or left to 13,5. Therefore, if 130- is the minimal end point and rBJc is the maximum end point among all access offset tuples in T, each 110 (1) T = U311{} (2) while T contains more than one tuple do (3) Find j such that 134- has the minimal value among all end points in T (4) Find k such that r15”c has the maximum value among all end points in T ) if k = j then ) T=T—{< lB,j,T‘B’j >} ) else ) T=T— {< lB,j,TB,j >,< lB,k,7‘B,k >} ) T = T U {< min{r3,j,lg,k},max{rB,J-,13*} >} 0) end if 1) end while 2) dA — d3 can be chosen as any integer number between (including) l3 1 and (including) T3,; where {13,5, rgy} IS the only tuple left 1n T (5 (6 (7 (8 (9 (1 (1 (1 Figure 4.6. The minimal-cost pairwise offset alignment algorithm piecewise linear function cBJ- (1 S j S m) is monotonously increasing when dA - d3 extends left to 13,; or right to r11;c (lines (3)-(4)). This implies that the value of dA — d3 must be between 13,; and ray, such that 2;, cBJ can be minimized. If j = k (line (5)), 13,,- and r3.)c are two end points in the same tuple 113*. By Equation 4.16, the cost function 63,]; is a constant when d A — dB is between 13,; and T'BJc. Therefore, this tuple is no longer considered in the rest steps (line (6)). If j 7’: k (lines (7)-(9)), the sum of c153,,- and 63’]: is considered. By Proposition 9, the sum function can be rep- resented by the piecewise linear function < min{rB,J-,lB,k},max{r3,,-,lg,k} > when dA — dB is between 113,,- and r3,» Finally, when there is one tuple left in step ( 12), the optimal value of dA — d3 can be chosen using Equation 4.16. 
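As an illustration of the merging process, the following Python fragment is a small sketch of the MPOA loop of Figure 4.6 (the data layout and names are ours); it takes the list of end-point pairs <l_{B,j}, r_{B,j}> and returns an interval from which any integer value of d_A - d_B is optimal in the unweighted case.

def mpoa(tuples):
    # tuples: list of (l, r) end-point pairs, one per reference "A <- B@s_j".
    # Returns (lo, hi): any integer d_A - d_B in [lo, hi] minimizes the sum
    # of the piecewise linear cost functions.
    T = list(tuples)
    while len(T) > 1:
        j = min(range(len(T)), key=lambda x: T[x][0])   # minimal left end point
        k = max(range(len(T)), key=lambda x: T[x][1])   # maximal right end point
        if j == k:
            T.pop(j)            # this tuple is constant on the remaining range; drop it
        else:
            lj, rj = T[j]
            lk, rk = T[k]
            for idx in sorted((j, k), reverse=True):    # remove both tuples
                T.pop(idx)
            T.append((min(rj, lk), max(rj, lk)))        # merge them (Proposition 9)
    return T[0]

# For c_{Y,1} = <0, 2> and c_{Y,3} = <4, 7> this returns (2, 4), matching
# the discussion of Figure 4.4.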
The MPOA algorithm terminates in m — 1 steps of the while loop because one tuple is deleted from T at each step (lines (6) and (8)). By its construction, the MPOA algorithm does find the optimal dA — dB. If the heap-sort algorithm is used to search the maximum and minimum end points (lines (3) and (4)), the time complexity of the MPOA algorithm is O(mlog(m)). 111 No new end point can be generated in the MPOA algorithm, though new tuples may be generated (line (9)). This important feature implies that given a set of CB’J' =< IB,j,TB’J‘ > for 1 S j S m, there exists an end point to which dA — d3 can be assigned such that 23-":1 C3“, is minimized. This fact results in a straight forward algorithm to find the minimal value of 237;, 63“,“: assign each end point to dA — dB and find the one which generates the minimal-cost. Proposition 10 Given a set of CB’J' =< (Ba-,rBJ- > for 1 S j S m, there exists an end point to which dA — dB can be assigned such that 237;, cBJ is minimized. As mentioned in the previous section, the access offset can have different weights due to various array sizes. For example, in the following program structure, the cost FORALL(2’1 = 0 : 10) 31: X0.) = Y(i1,0) + Ya, +1,0)+ Y(z‘1 + 2,0) FOR(2'2 = 0 : 10) 32: X(21) = X(21) + Y(21 + 4, 22) + Y(21 + 5, 22) + Y(21 + 7,22) END FOR END FORALL of neighboring communication is cy,1 + 10 x Cy’g where 2 — (dx - dy) when (dx — dy) < 0 CY,1 = 2 when 0 S (dx — dy) S 2 (dx — dy) when 2 < (dx — dy) (7 — (dx — dy)) when (dX — dy) < 4 CY,2 = 3 when 4 S (dx — dy) S 7 ((dx — dy) — 4) When 7 < (dx — dy) The cost eyg is weighted by 10 since a column of remote Y elements are accessed 112 in the innerloop 32. Therefore, the problem of finding the optimal dA — dB in the presence of weighted cost situation can be formalized as follows: Assume that there are m distinct references “A «— B@5j” where cBJ- =< IB,J-,r3,j > for 1 S j S m. Each ch- is weighted by w,. The optimal dA — d3 should minimize the value of 22;, wj x cBJ. One approach to resolve this alignment problem is to use the MPOA algorithm. In line (1), let T consist of 111,- identical copies of tuple < l3,,-,r3,j > for 1 S j S m. Though it may take excessive time if w,- is large, the MPOA algorithm guarantees that Proposition 10 still holds regarding to the weighted cost of neighboring commu- nication. Therefore, we can assign dA — dB to be each distinct end point in all access offset tuples and find the one with the minimal-cost. 4.4 Spanning-Tree Offset Alignment Algorithm 4.4.1 Offset Reference Graph The problem of offset alignment can be modeled by oflset reference graph (ORG). Given a program structure in which the alignment matrix of each array is pre- determined, an ORG G = (V, E) is constructed as follows. An array is represented by a node in V. For an array which has multiple instances referenced in one or more F ORALL assignments, there is only one corresponding node in the ORG. A reference which is free of reorganization communication is presented by an edge in E. The edge connects two nodes which represent two arrays specified by the corresponding reference. There is only one edge connecting the LHS array A and the RHS array B for all B instances which have the same access matrix and are referenced in the same statement. A ORG is undirected. Since offset alignment is studied independently for each aligned-base group, different ORGS are used to model different aligned-base groups. 
For simplicity, the examples used in the rest of this chapter only contain a 113 single aligned-base group and there is only one ORG for each program structure. FORALL01=O:9) 31: A(21)=X(21+2,Z2)+A(21+4) END FOR 82: A(21)= A(21)+ Z(21+ 3,22) + Z(21 + 5,32) END FOR FOR02=02U 34: Y(21) = Y(21) + Z(21 + 2, 22) + Z(21 + 4,22) END FOR END FORALL Figure 4.7. Example 13: A Livermore benchmark loop Figure 4.8 shows the ORG for Example 13 (Figure 4.7). In Figure 4.8, each array variable is represented by a node labeled by the variable name. Edges are constructed based on the references generated by each FORALL assignment in Example 13. For example, edge (A, X) connects nodes A and X due to reference “A (— X@31”. Edge (A, Z) connects nodes A and Z because of reference “A «— Z @32”. Note that reference “A 4— Z @32” represents the access to all distinct instances of array Z in statement 32. Therefore, there is only one edge incident with A and Z. Since there is one- to-one correspondence between an array and a node, terms “array” and “node” are used alternatively in the rest of this chapter. Similarly, since there is one-to—one correspondence between an edge and a reference, terms “edge” and “reference” are also used alternatively in the rest of this chapter. 114 Figure 4.8. The offset reference graph for Example 13 9 Consider reference “A +— B@sk’ with access offset tuple uggc. By Proposi- tion 4.16, the minimum cost of neighboring communication is zero if UBJC only contains one element. The minimum cost of neighboring communication is equal to T3,]: —l 3,], if u 3,}, contains more than one element and 13,1: is the left end point and ray; is the right end point. However, considering multiple references, there may not exist a solution of the alignment offsets such that the minimum cost of neighboring communication can be achieved for every reference. The reason is that the minimum-cost requirement imposed by one reference may conflict with that imposed by another reference, in particular, when the two edges representing these two references are involved in a cycle. For example, Figure 4.8 has a cycle A -—+ X —-> Y —> A. By Equation 4.16, since mm = {2}, the minimum cost for reference “A 4— X @31” is zero if the following equation is satisfied. dA—dx =2 (4.17) Since ux,3 = {—2}, the minimum cost for reference “Y +— X@s3” is zero if the following equation is satisfied. dy — dx = —2 (4.18) Since uA,3 = {—3, —1,1}, the minimum cost for reference “Y <— A@s3” is 4 if the 115 following inequality is satisfied. —3de—dAS1 (4.19) However, subtracting Equation 4.18 from Equation 4.17, we have (1,; — dy = 4 which does not satisfy Inequality 4.19. This implies that the minimum cost of neigh- boring communication cannot be attained for at least one reference among the above three. 4.4.2 Spanning Tree Offset Alignment Algorithms The conflict of minimum-cost requirement can only occur if an ORG consists of cycles. Intuitively, such conflict can be resolved by a spanning tree: offset alignment between two arrays is specified by the tree edge connecting these two arrays such that the minimum cost can be achieved on the reference represented by this tree edge. Each non-tree edge determines a unique fundamental cycle of the ORG with respect to the spanning tree. The minimum cost of neighboring communication may not be achieved for each non-tree edge. Each edge is weighted. 
Since loop 32 in Example 13 is a sequential loop, the cost of neighboring communication for reference “A +— Z” is weighted by 10, the number of elements in a row of two-dimensional array Z. Consequently, edge (A, Z) is weighted by 10. For simplicity, the weight on edge (A, B) is denoted as wA,B. In Figure 4.8, wAz = 10, wa = wy,z = 2, and wxy = wA,y = 1. Given a choice between two edges, the tree edge would be chosen as the one with higher weight. As a result, the spanning tree for offset alignment would be chosen as a maximum-weight spanning tree (MWST). 116 Similar to the analysis in section 3.4 for the MICS algorithm, the MWST algo- rithm may not generate offset alignment which offers the minimum overall cost of neighboring communication regarding to all references in a program structure. For Example 13, the maximum-weight spanning tree is shown in Figure 4.9(a). Alignment offsets are selected such that dA — dz = 3, dy — dz = 2, and d4 — dx = 2. Thus, the minimum cost requirements of neighboring communication imposed by edges (A, Z), (Y, Z), and (A,X) are satisfied, respectively. Without loss of generality, let dA = 0. We have dz = —3, dy = 2, and d X = —2. Therefore, by Proposition 7, the overall cost of neighboring communication is 42 regarding to all five references in Example 13. However, the spanning tree alignment in Figure 4.9(b) generates lower overall cost of neighboring communication. In Figure 4.9(b), alignment offsets are selected such that dA — dz = 3, dy — d4 = 0, and d4 — dX = 2. Thus, the minimum cost requirements of neighboring communication imposed by edges (A,Z), (Y, A), and (A,X) are satisfied, respectively. Without loss of generality, let d4 = 0. We have dz = +3, dy = 0, and dx = —2. Therefore, by Proposition 7, the overall cost of neighboring communication is only 40. /'\ Q 2 \ 2 \ [)6 1 ( i) l X) l ( Z) l 2 l 2 "\fl , _lV / (g (y) (a) Maximum-weight spanning tree (b) Minimum-cost spanning tree Figure 4.9. The offset reference graph for Example 13 In this section, we propose an efficient spanning-tree offset alignment (STOA) 117 algorithm which is an improvement of the MWST algorithm. Given an ORG G = (V, E), the STOA algorithm can be formalized in Figure 4.10. (1) Choose an arbitrary array variable A Let T = {A} (2) while T ¢ V do (3) Let Q1 be the set of single-degree neighbors of T (4) if Q1 76 45 then (5) Find a node B in Q1 such that there exists an edge (B, H) where H E T and 1123,” = min{XeQ1,Y6T} wx,y (6) Let d3 — d” = 1H,): where edge (B, H) represents reference “B (— H@3k” and [H’k is the left end point in 2112],]; (7) T = T U {B} (8) else (9) Let Q2 = V — T (10) for each node X in Q2 do (11) Let wx = 2011710)“)! (12) end for (13) Find a node B in Q2 such that 203 = max{XeQz} wx ( 14) Use Proposition 10 to determine the value of d3 which achieves mlnde-dA} Z:{(B,H)EE,HeT, and (H,B) represents “Ho—B@s;’} CB’J' (15) T = T u {B} (16) end if (17) end while Figure 4.10. The spanning-tree offset alignment algorithm In Figure 4.10, the single-degree neighbor and the multi-degree neighbor is defined as in section 3.4. If Q1 is not empty, like the MWST algorithm, the tree edge is selected as an edge with the largest weight among all the edges which are incident with one node in Q1 and another node in T (lines (4)-(7)). If Q1 is empty but Q2 is not, we find the next tree edge as follows. First, w x, the sum of the weight over all edges connecting X and another node in T, is computed for each node X in Q2 (lines 118 (10)-( 12)). 
Then the node B in Q2 with the maximum sum of the weight is selected to be the next tree node (line (15)). The value of alignment offset (13 is determined by using Proposition 10. By this construction, the STOA algorithm is superior to the MWST algorithm. \Al \5) ( A/l (So 1 (z: (x: 1 <1 2) (3;, 1 z\ 1 2 1 ’ 2 1 2 ,J. _ _ , (’Y) (X) (Y Figure 4.11. Resolving offset alignment for Example 13 Figure 4.11 shows how the STOA algorithm works to resolve offset alignment for Example 13. In Figure 4.11, array A is selected as the template (line(1)). Note that the selection is arbitrary. The values of other alignment offsets will be in the form of : (L; + k where k can be any integer. T = {A}. The first phase is shown in Figure 4.11(a). Since all X, Y, and Z are single-degree neighbors of T, Q1 2 {X, Y, Z}. Since it has the maximum weight among edges (A,Y) and (A, Z), edge (A, Z) is selected as the tree edge (line (5)). Since edge (A, Z) represents reference “A +— Z@32” where 112,2 = {3,5}, (1,4 — dz = 3 (line (6)). Note that line (6) only shows one case. In another case, edge (B, H) may represent reference “H +— B@sk”. As a result, dB should be determined as (1;; —- d3 = 13),. The second phase is shown in Figure 4.11(b). In the second phase, T = {A, Z}. Therefore, node X becomes the only single-degree neighbor of T. Similar to the first phase, we have dA — dx = 2. 119 The last phase is shown in Figure 4.11(c). In the last phase, T = {A, X, Z}. Node Y is multi-degree neighbor of T. The cost functions for references “Y +— X @33”, “Y (— A@33”, and “Y «— Z@s4” can be represented by piecewise linear functions cx,3 =< —2 >, 6,43, =< —3,1 >, and czA =< 2,4 >, respectively. Note that ex; =< —2 > is a function of dy — dx. Since dX = dA — 2 (step (b)), cx,3 can be rewritten as < O > with respect to dy — (1,4. 62,4 =< 2,4 > is a function of dy — dz. Since dA — dz = 3 (step (a)), 62.4 can be rewritten as < —1,1 > with respect to dy — d A. Therefore, using Proposition 10, we conclude that the minimal value of the sum cz,4 + CA3 + cx,3 can be achieved when dy — (1,4 = 0 (lines (14)). The values of T, Q1, and Q2 in each phase are illustrated in Table 4.1. Table 4.1. Values of T, Q1, and Q2 in executing the MICS algorithm for Example 13 phases T Q1 Q2 offst alignment (3) {A} {X,Y,Z} <15 51.4 —dz = 3 (b) {M} {Z} {Y} dx = d. — 2 (C) {A,X,Y} <15 {Y} dY - dA = 0 In the STOA algorithm, each edge in the ORG G is referenced only once. If the heap-sort algorithm [79] is used in lines (5) and (13), the time complexity of finding the maximum-weight edge can be reduced to 0(loglEl) where IEI is the number of edges in DRG G = (V, E). As a result, the time complexity of the STOA algorithm is 0(|E|log|E|). 120 4.5 Optimizing RHS Expression Evaluation 4.5.1 RHS Expression Evaluation Optimization Similar to the base alignment analysis, neighboring communication can be minimized by an optimal evaluation tree in which an intermediate result may be evaluated on a remote processor instead of the local processor which owns the LHS operand. The limitation of the owner-computes rule can be exceeded by executing different parts of a FORALL assignment in distinct processors. We assume the original program has already been pre-processed by transforming each original FORALL assignment into an equivalent set of FORALL assignments, each of which has the RHS expression operated by one kind of associate and commutative operations. FORALL(z'1 = o : 9) 31: 14(21): B(ZI)*X(21)*Y(21) END FORALL Figure 4.12. 
Example 14: A Livermore Kernel 7 loop Consider Example 14 in Figure 4.12. In Example 14, offset alignment are pre- determined such that d A = d3 = d X = dy. By Equation 4.16, the cost of neighboring communication for all references in statement 31 is 0. Therefore, statement .91 is free of neighboring communication. However, such an offset alignment decision makes ev- ery reference in statement .32 not free of neighboring communication. Consider how to minimize the cost of neighboring communication in statement 32. Since the interme- diate result can be generated by a remote processor rather than the one which onws the LHS operand, the key issue is to find what operands should be operated together 121 8(5) 1 m(il+1) 2/’ \\0 :45) rr1(i,+3) A(i,+1) f O i I”! \ B(i,) X(i,) Y(i1) X(il+3) Y(ii+4) statement 51 statement 8 2 Figure 4.13. The optimal evaluation trees for Example 14 on what processor? Figure 4.13 shows an optimal evaluation tree for statement .92. In Figure 4.13, each reference is represented by an arc. The cost of neighboring commu- nication required for each reference is represented by the weight on the corresponding arc. X (i1 + 3) and Y(z'1 + 4) are first added and the result is saved to a temporary variable TT1(i1 + 3) at the remote processor which owns X (i1 + 3). dTTl = dx. The cost of neighboring communication for reference “TTl (— Y” is 1. TT1(i1 + 3) and A(il + 1) are then added and the result is saved to another temporary variable TT2(2’1 + 1) at the remote processor which owns A(z'l + 1). dTT2 = dA. The cost of neighboring communication for reference “TT2 (— TT 1” is 2. Finally, TT2(i1 + 1) is assigned to B(z'1). The cost of neighboring communication for reference “B +— TT2” is 1. Therefore, the optimal evaluation for Example 14 can be written as follows: FORALLUI=0:9) 31: A(i1)= B(z'1) an: X(z'1) * Y(i1) 3232 8(21) = TT2(21 + 1) END FORALL 122 4.5.2 Post-Alignment Optimization The study of Example 14 reveals two important phases in determining the optimal expression evaluation with regard to multiple FORALL assignments. One phase is the offset alignment phase. By Proposition 6, no cost of neighboring communication is involved in a RHS expression evaluation if an appropriate offset alignment is used. For example, in Example 14, assignment 31 is free of neighboring communication when d A = d3 = dx = dy. However, for multiple assignments, neighboring communication may not be fully avoided no matter how good offset alignment is. Therefore, the other phase is to minimize the cost of neighboring communication in each statement after alignment offset for each array is determined. This phase is known as post-alignment optimization. In Example 14, the post-alignment optimization for assignment 32 is shown in Figure 4.13. These two phases are not isolated. By the definition, the post-alignment optimization is carried out after the offset alignment analysis is done. On the other hand, however, the decision of offset alignment depends on the accurate cost estimation provided by post-alignment optimization techniques. We study the post-alignment optimization technique as follows. Proposition 11 Assume that in a F ORALL assignment, the RHS expression is oper- ated by one kind of associate and commutative operations. Ao(i1 + r0) is referenced on the LHS. All instances referenced on the RHS are A1(i1 +r1), A2(i1 +r2), ..., and Am(i1 + rm). Assume that dAo, dA“ dA2, ..., and (1.4". are pre-determined. Assume that the sequence ofro+dAo,r1 + dA, , r2 +dA,, . . . 
, rm + (1.4," is in monotonous (either increase or decrease) order. Therefore, the optimal evaluation of the assignment is as follows: 1. Begin with Am(i1 +rm). Operate Am(i1 +rm) with Am_1(i1 +rm_1) and save the intermediate result in the temporary variable TTm_1(i1 + rm_1) at the processor which owns Am_1(i1 + rm_1). 123 2. Operate TTJ-(il +r,-) with Aj_1(i1+r,-_1) and save the intermediate result in the temporary variable TTJ-_1(i1 + rJ-_1) at the processor which owns Aj_1(i1 + rJ-_1) for 2 Sj S m — 1. 3. Assign TT1(i1 + r1) to A0(i1 + r0). It is easy to prove that Proposition 11 is true. Consider the cost of neighboring communication involved in reading Am(i1 + rm). Since (rm + (1.4m) — (r0 + dAo) > r,- + dAJ. — (r0 + dAo) for (1 Sj S m - 1), the distance between Am(i1 + rm) and the LHS operand is the longest among that between any other RHS operand and the LHS operand. Therefore, we hope that Am(i1 + rm) can be operated with another operand which is closer to the LHS operand. On the other hand, the cost of moving Am(i1 + rm) to such an operand should be as small as possible. Since the closest operand to Am(i1 + rm) is Am_1(i1 + rm_1), Am_1(i1 + rm_1) becomes the best candidate. Repeating the similar approach, the optimism in Proposition 11 can be proved. In Proposition 11, the symbols A0, A1, A2, ..., and Am are not required to be pairwise distinct. Proposition 11 still holds if an arbitrary subset of them represents the same array. For example, in the following FORALL assignment, assume d X = dy = 0. FORALL(2‘1 = 0 : 9) 31: X(i1) = Y(i1 — 3) + Y(z'1 — 1) + Y(z'1 +1) + X(i1 — 2) + X(z'1 + 2) +X(i1 + 6) END FORALL Figure 4.14 shows the result of the post-alignment optimization for the above FORALL assignment. In Figure 4.14, each reference is represented by an arc. The cost of neighboring communication required for each reference is represented by the 124 weight on the corresponding arc. Note that the sequence of instances with respect to the monotonous order of access offset is {Y(i1 — 3), X(i1 — 2), Y(i1 — 1), X(i1), Y(i1 + 1), X(i1 + 2), X(i1 + 6)}. Since the LHS instance X(i1) resides in the middle, the se- quence is separated into two subsequences: {Y(i1 — 3), X(i1 — 2), Y(i1 — 1), X(i1)} and {X(i1), Y(i1 +1), X(i1 +2), X(i1 +6)}. Proposition 11 works on each of subsequences as shown in Figure 4.14. Note that the unit cost of neighboring communication must have the same weight with regard to all arrays referenced in the same FORALL as- signment. Therefore, Proposition 11 holds with the consideration of the weighted cost. i 1 r X( l)‘.\\l Tr2(il “—1) 1730,“) ,//l ‘\\l 0 0/] \-\\1 /. "’ .. , / \\ n10, -2) Y( il —1) Y( i, +1) 'I'I‘4(il+2) I 15" \\ 0 0/“ \ 4 \\ /" \\ Y( H -3) X( i] -2) X( il+2) X( i, +6) Figure 4.14. An example of post-alignment optimization 4.5.3 Alignment Graph Like base alignment, alignment graph (AG) is used to model the offset alignment problem. An AG is a collection of arrays and statements which can be represented as a bipartite graph, G = (Va,V,, E). An array is represented by a node in Va. A statement is represented by a node in V,. A undirected edge in E connects an array A and a statement 3), if A is the LHS array. A undirected edge in E connects an array B and a statement 5), if DBFBJc = DAFAJc where A is referenced on the LHS and B 125 is referenced on the RHS in statement 3),. There is only one edge connecting the LHS array A and the RHS array B for all B instances which have the same access matrix and are referenced in the same statement. 
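To make the bipartite structure concrete, the following Python fragment sketches one possible way to build an AG from a list of references; the record layout and names are assumptions of this sketch, not the representation used by the compiler.

def build_alignment_graph(references):
    # references: iterable of (lhs, rhs, stmt, offsets) records, where
    # `offsets` is the access offset tuple of the RHS array in that
    # statement; only references free of reorganization communication
    # (D_B F_{B,k} = D_A F_{A,k}) are expected here.
    array_nodes, stmt_nodes = set(), set()
    edges = {}                        # (array, stmt) -> merged offset tuple
    for lhs, rhs, stmt, offsets in references:
        array_nodes.update((lhs, rhs))
        stmt_nodes.add(stmt)
        edges.setdefault((lhs, stmt), ())          # edge for the LHS array
        key = (rhs, stmt)                          # one edge per (array, stmt) pair
        edges[key] = tuple(sorted(set(edges.get(key, ())) | set(offsets)))
    return array_nodes, stmt_nodes, edges

# For Example 13 this yields array nodes {A, X, Y, Z}, statement nodes
# {s1, ..., s4}, and edges such as (Z, s2) carrying the merged tuple (3, 5).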
Figure 4.15 shows the AG for Example 13 (Figure 4.7). Figure 4.15. The AG for Example 13 4.5.4 AG-Based Offset Alignment Algorithm The AG-based offset alignment (AGOA) algorithm is shown in Figure 4.16. In an AG G = (Va,V,, E), there is a set, denoted as Qk, associated with each node 31C in V,. Initially, each set Q)c is empty. During the execution, each Q;c will contain elements in access offset tuple for each array which is referenced in statement 31c and whose alignment offset has been determined. In Figure 4.16, g4,” represents an element in the access offset tuple u,”c = {g1,A,k,g1,A,k,...,gh,,4,k} (lines (5) and (15)). The first array selected in line (1), say A, serves as the template. The value of other alignment offset is in the form of i (I); + k where k is any integer. Set T contains all choices of : dA + k for every other array to select as alignment offset. Given an array B, dB is chosen among all 126 elements in T such that the sum of the cost of neighboring communication over each FORALL assignment where B is referenced can be minimized (line (13)). The cost of neighboring communication in executing each FORALL assignment is estimated using Proposition 11. If the heapsort algorithm [79] is used in finding the minimum value, the time complexity in line (13) is 0(109 Z iuBJci) (B,sk)€E (1) for each node B in Va do (2) T = <15 (3) for each neighboring node 3;, (in V,) of B do (5) for each element 943,}; in up)c do 6) for each element q in Q)c do (7) if q — gag), is not in T then (8) T = T U {q - geek} (9) end if (10) end for (11) end for (12) end for (13) Assign dB to be the element in T such that the sum of the cost of neighboring communication over each FORALL assignment where B is referenced is the minimum (14) for each neighboring node 3), (in V,) of B do (15) for each 943,): in 113,}; do (16) if d3 + 943,], is not in Q, then (17) Q]: = Qk U {(13 + 943*} (18) end if (19) end for (20) end for (21) end for Figure 4.16. The alignment graph offset alignment algorithm 127 where IUBJcl is the number of distinct elements in u3,k. Thus, the time complexity of the AGOA algorithm is 0( Z luBJcllOg Z iuBJci) (8,3).) (8,3).) Table 4.2. Resolving offset alignment for Example 13 by using the AGOA algorithm Z Y X Q1 {dA,dA + 4} {dAadA + 4} {dAadA + 4} Q2 {dA} {dAadA +2} {dAidA +2} Q3 {dA—3,dA,dA+l} {dA-3,dA,dA+l} {dA—3,dA,dA+I} Q4 {} {dA-1,dA+1} {dA-1,dA+1} result dzsz—3 dy=dA+1 dX=dA+2 Table 4.2 illustrates how offset alignment in Example 13 is resolved by using the AGOA algorithm (Figure 4.16). The algorithm begins with A since A has the maximum degree of connectivity. Thus, A serves as the template. Arrays Z, Y, and X are chosen in sequential (line (1)). Table 4.2 shows the content of Q1, Q2, Q3, and Q; at the end of each iteration in the outmost for loop, where each of arrays A, Z, Y, and X is selected. When A is first select, sets Q1, Q2, Q3, and Q4 are all empty. Therefore, no operation is done in line ( 13). Since A is referenced in statements 31, 32, and 33, Q1 2 {dA,dA + 4}, Q2 2 {d4}, and Q3 = {dA — 3,dA,dA +1} by lines (14)-(20). Next, Z is selected. Since Z is referenced in statement 82 and ”(12,2 =< +3, +5 >, T = {dA — 3,d,4 — 5}. Note that though Z is referenced in statement 34, Q4 is still empty. Therefore, Q4 makes no contribution to T. By Proposition 11 (line (13)), either dA — 3 or (L4 — 5 can be the value of dz. Thus, we choose dz = dA — 3. Executing lines (14)-(20), Q2 :2 {dA,dA + 2} and Q4 2 {dA —— 1,dA + 1} since Z is referenced in both statements 32 and s4. 
128 Next, Y is selected. Since Y is referenced in statements 33 and 34, for executing lines (2)-(12), T = {dA — 3,dA —- 1, dA,dA +1}. Consider CA3, the cost of neighboring communication in accessing the RHS elements A(i1 — 3), A(il), and A(il + 1) in statement 33. By Proposition 11 (line (13)), 6,43 = 4 if dy is chosen as any element in T. Consider cz,3, the cost of neighboring communication in accessing the RHS elements Z(i1 + 2) and Z(i1 + 4) in statement 34. By Proposition 11 (line (13)), Cy,4 = 8 if dy is selected as dA — 3. CYA = 4 if dy is selected as either dA — 1, or dA, or dA + 1. Thus, we choose dy = dA + 1. Executing lines (14)-(20), Q3 and Q. are not changed. Last, X is selected. Since X is referenced in statements 31 and 33, by executing lines (2)-(12), T = {dA—2, dA—l, dA +2, dA +3}. Consider ch, the cost of neighboring communication in accessing the RHS elements X (i1 + 2) in statement 31. No extra cost of neighboring communication is paid if d X is chosen as either dA — 2 or dA + 2. If dx = dA — 2, elements A(il) and X(i1 + 2) are owned by the same processor. If dx = dA + 2, elements A(il + 4) and X(i1 + 2) are owned by the same processor and the RHS expression can be evaluated on the processor which owns A(il +4). Consider CX,3, the cost of neighboring communication in accessing the RHS elements X (i1 — 2) in statement 33. Similarly, no extra cost of neighboring communication is paid if dx is chosen as either dA — 1 or dA + 2. Therefore, dA + 2 is the only solution of d X (line (13))- 4.5.5 Performance Comparison Table 4.3 shows the comparison of the overall cost of neighboring communication generated by the MWST, STOA, and AGOA algorithms using Example 13. The AGOA algorithm outperforms the other two algorithms. In the AGOA algorithm, each element in access offset tuple is taken count into in performing the best post- alignment optimization. For example, when dx is chosen as dA + 2, in statement 129 Table 4.3. Comparison of the MWST, STOA, and AGOA algorithms using Example 13 Algorithms Overall cost of neighboring communication MWST 42 STOA 40 AGOA 36 33 elements X (i1 — 2, 0) and A(il) are owned by the same processor. However, this information is not utilized by the STOA algorithm since the STOA algorithm only considers the two end points in u A.3 =< —3, 0, l > and ignores the middle element 0. 4.6 Avoiding Redundant Communication When the same context of remote elements are referenced in more than one F DRALL assignment, the local processor should receive a single copy of these elements instead of multiple identical copies. The importance of such redundant communication avoid- ance has been shown in Section 3.6 regarding to the base alignment analysis. In this section, we concentrate on issues of avoiding redundant communication regarding to the offset alignment analysis. 4.6.1 Redundant Communication In Example 15 (Figure 4.17), array A is two-dimensional and other arrays are one- dimensional. A is partitioned in columns such that all elements in the same column are collapsed to the same processor. Only the outmost loop indexed by i1 is distributed across the processors. Assume that offset alignment is pre-determined such that (1,, 2 d3 = dz = 0. Since d3 = dz, the LHS element Z(i1) in statement 33 and the LHS element B ( il) in statement 34 are located to the same local processor. Therefore, 130 the two copies of A(i1+3, 0) referenced in both statements are identical. The messages transmitting remote element A(i1+3, 0) in statement 33 and 34 are redundant. 
Assume that processor P(i) owns elements A(m, : m,“ — 1,0). The remote A elements referenced by P(i) in executing statement 33 and 34 are A(m,+1 : mg+1+2, 0). Consider statement 32. Since dA = d3, the remote A elements referenced by P(i) in executing statement 32 are A(m,+1 : m,-+1 +4, 0). Therefore, the messages transmitting remote A elements in statements 33 and 34 are further redundant with the messages transmitting remote A elements in statements 32. Combining the remote A elements referenced by local processor P(i) in executing statements 32, 33, and 34, we conclude that A(m;+1 : m,“ + 4,0) are only remote A elements required to be transmitted from the remote processor to the local processor. FORALL(i1 = 0 : 9) 812 AU], 0) = X01) 822 3(21) = A(Zl + 4,0) + A(Z] + 5,0) 832 2(21) = A(Zl + 3,0) * 8(21) S42 8(21) =X(21)+A(21 +3,0) FORALL(i2 = O :9) S52 A(il,i2) = A(il,i2 — I) + t2 END FORALL 862 8(21) 2' 8(21) + A(Zl +1,i2) END FOR END FORALL Figure 4.17. Example 15: A Dhrystone benchmark loop As addressed in section 3.6, the identification of redundant neighboring commu- nication depends on both location and the context. The location requires that the senders of redundant communication must be the same, so do the receivers. The 131 context requirement implies that the context of the redundant remote elements must be the same. As introduced in Section 3.6, the concept of single assignment block can be used to identify the distinct contexts with regard to a particular array variable. In Example 15, there are two single assignment blocks with regard to A: {31, 32, s3, s4} and {35,36}. Therefore, the message transmission of A(m;+1,0) in executing state- ment .96 is not redundant with the message transmission of A(m,+1 : mg+1 + 4, 0) in executing statements 33, 33, and 34 because the context of array A is re—defined in statement 35. More accurately, redundant neighboring communication can be classified into two types: fully redundant messages and partially redundant messages. In respect to fully redundant messages, the messages carrying remote elements for executing two distinct FORALL assignments are identical, such as the messages transmitted in executing statements 33 and 34 (Example 15). Only one copy of the fully redundant messages needs to be transmitted and the cost of neighboring communication is equal to the size of the single message. In respect to partially redundant messages, the messages carrying remote elements for executing two distinct FORALL assignments overlap in a subset of common elements. However, the two messages are not identical, such as the messages transmitted in executing statements 32 and 33 (Example 15). In this case, the redundant elements should be only transmitted once. In Example 15, the result of combining remote A elements accessed in executing statements 32 and 33 can be treated as if instances A(il +3, 0), A(i1+4, 0), and A(i1+5, 0) are accessed in the same statement and thus the access offset tuple becomes < 3, 4, 5 >. The understanding of avoiding partially redundant messages is important in correctly estimating the cost of necessary neighboring communication. 132 4.6.2 Enhanced Alignment Graph As addressed in Section 4.6, the enhanced alignment graph (EAG) is used to identify single assignment block. The definition of an enhanced alignment graph (EAG) G = (Va,V,, E, A,) is similar to that of an alignment graph G = (Va,V,, E) except that an EAG has an extra arc set A,. The definition of Va, V3, and E can be found in Section 4.5. 
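The effect of merging partially redundant messages on the cost model can be sketched in a few lines of Python (the function below and its input layout are ours): for one array and one single assignment block, it combines the access offset tuples of the statements in that block into the single tuple that the necessary neighboring communication corresponds to.

def merge_block_offsets(offset_tuples):
    # offset_tuples: access offset tuples of one array within one single
    # assignment block, e.g. [(4, 5), (3,), (3,)] for statements s2, s3,
    # and s4 of Example 15.  Remote elements shared by several statements
    # are counted once, so the block behaves like a single reference.
    return tuple(sorted(set().union(*offset_tuples)))

# merge_block_offsets([(4, 5), (3,), (3,)]) == (3, 4, 5); the corresponding
# cost function is the piecewise linear function <3, 5>.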
There is an arc in A, from node 35 to node 3;, if an array defined in statement 3; is used in statement 3),. A distinct EAG is used for the offset alignment analysis in each aligned-base group. Figure 4.18 shows the EAG for Example 15 (Figure 4.17). In Figure 4.18, edges are weighted 1 if their weights are not labeled. //_.v —T K“ _ 1"]— 31...!” \E‘ , . (I "I - \\ ‘3 All “i i i . / I . . .\\ I \ .~.\ // / x i/ \ ~ ) / / ‘ Z ’ k\§/ / / - ‘ _/' I r 1’ __ ',,4-/—’—“\_\ f—‘x /“s* \~ ”’ .1- . s w 3 K 2 /> / K\ S3 _//) T\",—4f”"/ 10 \FL/ \_ <\\ / f —’—"Q ~“ ['14’,' \ //” I \. 1’/ , \ / ’ /,/ ‘4’ / \ //’ AT“\\~ , / TTV‘\\\T\‘ /’ I,” \ l l/ , f)>-.:'\ /"’;~.< \ I“ . 'V /x\\ // // ,»’ ” .// \\‘ ‘ \\ i / x I" ,/' ,.—.\ I/ ,z/ i ’1 / I i i r// // ,«’/// '(,/' i, ‘(\3§)\~ , ,///,x///-/ "'l :1“, '/ 4 2 4,2,4 (AH—_— ( s ) \\-’/ \\\~§.—// Figure 4.18. Enhanced alignment graph for Example 15 4.6.3 EAG-Based Offset Alignment Algorithm The EAG-based offset alignment (EAGOA) algorithm is shown in Figure 4.19. The EAGOA algorithm is an invariant of AGOA algorithm introduced in Section 4.5. Like the AGOA algorithm, in Figure 4.19, set Q)c is associated with each node 3), in V3. 133 Initially, each set Q)c is empty. During the execution, each Q)c will contain elements in access offset tuple for each array which alignment offset has been determined and is referenced in statement 3),. gm”, represents an element in the access offset tuple uAJ, = {g1,,4,k,g1,,4,k,. . .,gh,A,k} (lines (5) and (15)). The first array selected in line (1), say A, serves as the template. The values of other alignment offsets are in the form of i d A + k where k is any integer. (1) for each node B in Va do (2) T = 45 (3) for each neighboring node s)c (in V,) of B do (5) for each element guy, in 213,), do (6) for each element q in Q6 do (7) if q — gag), is not in T then (8) T = T U {q '- game} (9) end if (10) end for (11) end for (12) end for (13) Assign dB to be the element in T such that the sum of the cost of neighboring communication over each single assignment block with respect to B is the minimum (14) for each neighboring node 3): (in V,) of B do (15) for each 91.8.1: in uBJc do (16) If d3 + gag]; is not in Qk then (17) Q]: = Qk U {dB + 91,8,k} (18) end if (19) end for (20) end for (21) end for Figure 4.19. The enhanced alignment graph offset alignment algorithm The major difference between the EAGOA and AGOA algorithms is how to es- 134 timate the sum of the cost of neighboring communication (line (13)). In the AGOA algorithm, the sum of the cost is estimated based on each FORALL assignment where a particular array is referenced. However, in the EAGOA algorithm, the sum of the cost is estimated based on each single assignment block with respect to that par- ticular array. Therefore, the EAGOA algorithm prevents the selection of alignment offset from the adverse impact of redundant neighboring communication. This feature makes the EAGOA algorithm superior to the AGOA algorithm. Example 15 can be used to illustrate the idea. In Figure 4.18, A has the maximum connectivity degree. Therefore, A is selected first. Executing lines (14)-(20) in the EAGOA algorithm, we have Q1 2 {dA}, Q2 = {dA+4,dA+5}, Q3 = {dA+3}, Q4 = {dA+3}, Q5 = {dA}, and Q5 = {dA+1}. Assume that B is selected next. Since B is referenced in statements 32, s3, s4, and 36, T is only constructed based on Q3, Q3, Q4, and Q6. By lines (3)-(12), we have T = {dA +1, dA + 3, d A + 4, dA + 5}. 
Statements s2, s3, and s4 lie within the same single assignment blocks with regard to arrays A, X, and Z. When redundant neighboring communication is removed, the sum of the cost of neighboring communication in statements s2, s3, and s4 can be modeled by the piecewise linear function <3, 5> with respect to d_B - d_A. On the other hand, the cost of neighboring communication generated by statement s6 is modeled by the piecewise linear function <1> with respect to d_B - d_A. Since statement s6 is not in the same single assignment block in which statements s2, s3, and s4 are (with regard to array A), the sum of the cost in line (13) is equal to <3, 5> + <1> + <1>. Note that the cost of neighboring communication is weighted by 2 for statement s6. Therefore, the minimal value of <3, 5> + <1> + <1> is achieved when d_B = d_A + 1. In contrast, if the AGOA algorithm is used, the sum of the cost of neighboring communication is estimated based on each FORALL assignment. In other words, the cost sum would be c_{B,2} + c_{B,3} + c_{B,4} + c_{B,6}, where c_{B,2} = <4, 5>, c_{B,3} = <3>, c_{B,4} = <3>, and c_{B,6} = <1> + <1> with respect to d_B - d_A. This wrong cost estimation leads to the solution d_B = d_A + 3, which minimizes the value of <4, 5> + <3> + <3> + <1> + <1>.

The extra time complexity in the EAGOA algorithm is spent in identifying single assignment blocks for each array variable. Using the dataflow analysis technique proposed in [64], the extra time complexity is O(|V_a||V_s|). Thus, the overall time complexity is

O(|V_a||V_s| + Σ_{(B,s_k)} |u_{B,k}| log Σ_{(B,s_k)} |u_{B,k}|)

CHAPTER 5

Data Distribution

The task of data distribution is to determine the mapping of the template elements onto the processors. Data elements are assigned to the processor to which the mapped template element is assigned. The goal of data distribution is both to reduce neighboring communication and to increase processor workload balance. In this chapter, segment distribution is introduced as the best distribution pattern for data distribution within a single dimension. An optimal processor allocation algorithm is proposed to further minimize the overall cost of neighboring communication across multiple dimensions of the template array.

5.1 Segment Distribution

In this section, we study data distribution with regard to a single dimension of the template array. The issues involved in data distribution across multiple dimensions of the template array will be addressed in the next section.

5.1.1 The Limitation of Existing Distribution Types

We illustrate the limitation of existing distribution types by using Example 16 (Figure 5.1).

FORALL(i1 = 0 : n - 1, i2 = i1 : n - 2)
s1:   B(i1, i2 + 1) = B(n - 1 + i1 - i2, i2) ** 2
END FORALL

Figure 5.1. Example 16: An Electromagnetic benchmark loop

In Example 16, array B is distributed in columns. In other words, the distribution function δ_B is defined as t = δ_B(b1, b2) = b2, where T(t) is a template element and B(b1, b2) is a B element. By Proposition 1, assignment s1 is free of reorganization communication. However, neighboring communication cannot be avoided. By Proposition 5, the basic cost of neighboring communication generated by assignment s1 is 1. An effective element is a data element that is used or defined in assignment s1. Limited by the loop boundary, the effective elements form the upper triangle in the two-dimensional data space.
Thus, the density function ω_B can be defined as ω_B(t) = t, where T(t) is a template element.

Figure 5.2 shows three different types of data distribution for Example 16: cyclic distribution, block distribution, and segment distribution. In Figure 5.2, array B is declared as 11 x 11. Template T is a one-dimensional array with 11 elements. There are three processors available, denoted P(0), P(1), and P(2). Figure 5.2 shows the distribution patterns of both the template elements and the effective elements in array B.

Figure 5.2. Different types of data distribution for Example 16: (a) cyclic distribution, (b) block distribution, (c) segment distribution

Since the RHS computation is uniform among all the LHS elements, the workload on a processor is proportional to the number of LHS elements that the processor owns. For simplicity, ℓ_{P(i)}, the workload on processor P(i), is represented by the number of LHS elements that processor P(i) owns. The quality of processor workload balance imposed by a particular distribution pattern can be evaluated by the variance of workload,

max_{0 <= i, j <= q-1} { |ℓ_{P(i)} - ℓ_{P(j)}| }

The cost of neighboring communication generated on processor P(i) is denoted c_{P(i)}. The value of c_{P(i)} is determined by the number of remote elements that processor P(i) accesses in executing assignment s1. Since the basic cost of neighboring communication in Example 16 is 1, each processor accesses one solid line owned by a remote processor. However, since the lengths of the solid lines vary, the value of c_{P(i)} differs from one processor to another. Therefore, the cost of neighboring communication is measured by

max_{0 <= i <= q-1} { c_{P(i)} }

Figure 5.2(a) shows cyclic distribution, in which the template elements are distributed to the three processors in a round-robin fashion. Though it attains a good processor workload balance (|ℓ_{P(1)} - ℓ_{P(2)}| = 8), cyclic distribution generates an extremely high cost of neighboring communication (c_{P(0)} = 22). Figure 5.2(b) shows block distribution, in which the template elements are distributed to the three processors in a block fashion. The number of template elements each processor owns is equal. Though it attains a low cost of neighboring communication (c_{P(1)} = 8), the poor quality of processor workload balance (|ℓ_{P(2)} - ℓ_{P(0)}| = 20) is hard to accept. A new distribution pattern, segment distribution, is employed in Figure 5.2(c). In segment distribution, the template elements are consecutively assigned to the processors. However, the number of template elements owned by different processors can be varied.
In Figure 5.2(c), the number of template elements owned by each processor is arranged so that processor workload is balanced (|ℓ_{P(2)} - ℓ_{P(0)}| = 3) and the cost of neighboring communication is small (c_{P(1)} = 9).

The study of Example 16 reveals two important facts. First, as shown in Figure 5.2(a), with cyclic distribution a remote element is read for writing each local element in Example 16. This implies that the cost of neighboring communication generated by an inappropriate distribution pattern can be as large as the cost of reorganization communication generated by mismatched alignment bases. Second, the impact of neighboring communication can be significantly reduced by consecutively assigning template elements to processors. As shown in Figure 5.2(b) and (c), template elements are consecutively assigned to processors in both block distribution and segment distribution. As a result, the cost of neighboring communication is small: c_{P(1)} = 8 in Figure 5.2(b) and c_{P(1)} = 9 in Figure 5.2(c). The relative difference in the cost of neighboring communication between block distribution and segment distribution becomes smaller as the size of the arrays increases.

5.1.2 Segment Distribution

As discussed in the previous section, in segment distribution template elements are consecutively assigned to processors in order to minimize the impact of neighboring communication. The length of the segment assigned to each processor may vary. This feature gives segment distribution great flexibility to exploit the maximum processor workload balance. In data parallel programs, data elements are typically written by the processor which owns them. As a result, the workload assigned to a processor is determined by the LHS elements owned by that processor. Therefore, the distribution strategy for the LHS array elements of each assignment statement determines processor workload balance.

Given q processors in total, template array T(0 : n - 1) is partitioned into q pieces such that each piece T(m_i : m_{i+1} - 1) is owned by processor P(i), where 0 <= i <= q - 1, m_0 = 0, and m_q = n. Our goal is to find the values of the break points m_i such that the workload imposed by the elements in each segment is equal. This can be formalized as follows:

Σ_{t=m_0}^{m_1 - 1} ω_A(t) = Σ_{t=m_1}^{m_2 - 1} ω_A(t) = ... = Σ_{t=m_{q-1}}^{m_q - 1} ω_A(t)    (5.1)

where A serves as the LHS array in a FORALL assignment and ω_A(t) is the density function of A. In Equation 5.1, the density ω_A(t) only counts the number of LHS A elements mapped to the template element T(t). Strictly speaking, the number of LHS elements may not be proportional to the workload imposed by those elements. For example, in the Linpack TQL2 loop (Figure 5.3), array Z is referenced on the LHS. Since Z is a one-dimensional array, there is only one Z element mapped to each template element. However, because of the inner loop indexed by i2, the workload imposed by different Z elements is obviously different: nn - 1 inner loop iterations are executed for element Z(2), while only one inner loop iteration is executed for element Z(nn).

FORALL(i1 = 2 : nn)
s1:   Z(i1) = Z(i1 + 1)
      DO (i2 = i1, nn)
s2:      Z(i1) = Z(i1) + S * (H(i1, i2) + F * H(i1 - 1, i2))
      END DO
END FORALL

Figure 5.3. Linpack TQL2 benchmark loop

In order to precisely model the workload, we modify the definition of the density function ω_A as follows:

ω_A(t) = Σ_{δ_A(a) = t} π_A(a)

where T(t) is a template element, A(a) is a data element, and π_A(a) represents the computation estimate for defining the LHS element A(a).
In most scientific applications, the density function of a LHS array is typically a constant or an affine function of a single variable. Figure 5.4 shows a few common loop patterns extracted from scientific application programs. In Figure 5.4, A is a two-dimensional array and B is a one-dimensional array. The template array is one-dimensional, and A is distributed by columns. In Figures 5.4(d), (e), (f), and (h), the inner loop is sequential. In Figure 5.4(a), $w_A(t) = t$ for element T(t). In Figure 5.4(b), $w_A(t) = n-1-t$ for element T(t). In Figure 5.4(c), $w_A(t) = \min\{t,\, n-1-t\}$ for element T(t). In Figure 5.4(d), $w_B(t) = t$ for element T(t). In Figure 5.4(e), $w_B(t) = n-1-t$ for element T(t). In Figure 5.4(f), $w_B(t) = \min\{t,\, n-1-t\}$ for element T(t). In Figure 5.4(g), $w_B(t) = 1$ for element T(t). In Figure 5.4(h), $w_B(t) = n$ for element T(t). In Figure 5.4(i), $w_A(t) = n$ for element T(t).

For this reason, in this thesis we assume that $w_A(j) = aj + b$ for each template element T(j), where both a and b are fixed integers. Suppose there are a total of q processors. Template elements T(0 : n-1) are partitioned into q segments such that each segment $T(m_i : m_{i+1}-1)$ is owned by processor P(i), where $0 \le i \le q-1$, $m_0 = 0$, and $m_q = n$. The break points $m_i$ ($1 \le i \le q-1$) are chosen such that the workload imposed by each segment is equal. In other words,

$$\sum_{j=m_0}^{m_1-1} (aj+b) \;=\; \sum_{j=m_1}^{m_2-1} (aj+b) \;=\; \cdots \;=\; \sum_{j=m_{q-1}}^{m_q-1} (aj+b)$$

(a) FORALL(i1 = 0 : n-1)
      FORALL(i2 = 0 : i1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(b) FORALL(i1 = 0 : n-1)
      FORALL(i2 = i1 : n-2)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(c) FORALL(i1 = 0 : n-1)
      FORALL(i2 = i1 : n-1-i1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

(d) FORALL(i1 = 0 : n-1)
      FOR(i2 = 0 : i1)
        B(i1) = ...
      END FOR
    END FORALL

(e) FORALL(i1 = 0 : n-1)
      FOR(i2 = i1 : n-2)
        B(i1) = ...
      END FOR
    END FORALL

(f) FORALL(i1 = 0 : n-1)
      FOR(i2 = i1 : n-1-i1)
        B(i1) = ...
      END FOR
    END FORALL

(g) FORALL(i1 = 0 : n-1)
      B(i1) = ...
    END FORALL

(h) FORALL(i1 = 0 : n-1)
      FOR(i2 = 0 : n-2)
        B(i1) = ...
      END FOR
    END FORALL

(i) FORALL(i1 = 0 : n-1)
      FORALL(i2 = 0 : n-1)
        A(i1, i2) = ...
      END FORALL
    END FORALL

Figure 5.4. Some common loop patterns

We find these optimal break points by extending the density function $w_A$ to all real numbers in the range [0, n-1]; the extended density is $w_A(v) = av + b$ for any real v in [0, n-1]. The segment [0, n-1] is partitioned into q pieces $[v_i, v_{i+1}]$, where $0 \le i \le q-1$, $v_0 = 0$, and $v_q = n-1$. We need to find break points $v_i$ ($1 \le i \le q-1$) such that

$$\int_{v_0}^{v_1} (av+b)\,dv \;=\; \int_{v_1}^{v_2} (av+b)\,dv \;=\; \cdots \;=\; \int_{v_{q-1}}^{v_q} (av+b)\,dv$$

This implies

$$\int_{v_0}^{v_i} (av+b)\,dv \;=\; \frac{i}{q} \int_{v_0}^{v_q} (av+b)\,dv$$

that is,

$$\int_{0}^{v_i} (av+b)\,dv \;=\; \frac{i}{q} \int_{0}^{n-1} (av+b)\,dv$$

Solving the integrals, we get

$$\frac{1}{2} a v_i^2 + b v_i \;=\; \frac{i}{q}\left(\frac{1}{2} a (n-1)^2 + b(n-1)\right)$$

or

$$v_i^2 + \frac{2b}{a} v_i \;=\; \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)$$

Solving this quadratic equation for $v_i$ (taking the nonnegative root), we get

$$v_i \;=\; -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \qquad (5.2)$$

Since $v_i$ may not fall on an integer grid point, $m_i$ is chosen as an integer approximation of $v_i$:

$$m_i \;=\; \left\lceil -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \right\rceil$$
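As a concrete illustration of Equation 5.2 and the integer approximation above, the following sketch computes the break points for an affine density with a > 0. It is only a sketch; the function name and the clamping used to keep the break points monotone and in range are my own additions.

# Sketch of the closed-form break points of Equation 5.2 for w(j) = a*j + b, a > 0.
import math

def affine_breaks(a, b, n, q):
    """Return break points [m_0, ..., m_q] for template T(0 : n-1) on q processors."""
    K = (n - 1) ** 2 + (2 * b / a) * (n - 1)        # common constant in Equation 5.2
    breaks = [0]
    for i in range(1, q):
        v_i = -b / a + math.sqrt((b / a) ** 2 + (i / q) * K)
        breaks.append(min(n, max(breaks[-1], math.ceil(v_i))))   # m_i = ceil(v_i)
    breaks.append(n)
    return breaks

# Example: w(j) = j (a = 1, b = 0), 11 template elements, 3 processors.
print(affine_breaks(a=1, b=0, n=11, q=3))           # -> [0, 6, 9, 11]

With w(j) = j, eleven template elements, and three processors, this yields the break points [0, 6, 9, 11], i.e., segments of 6, 3, and 2 template elements carrying workloads of 15, 21, and 19, respectively.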
On the other hand, we also need to determine which processor a particular template element T(j) is assigned to. Suppose that T(j) is owned by processor P(i); then we should have $v_i \le j \le v_{i+1}$. Thus, by Equation 5.2, we have

$$-\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)} \;\le\; j \;\le\; -\frac{b}{a} + \sqrt{\frac{b^2}{a^2} + \frac{i+1}{q}\left((n-1)^2 + \frac{2b}{a}(n-1)\right)}$$

This can be re-written as

$$\frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)} - 1 \;\le\; i \;\le\; \frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)}$$

In other words, the owner of T(j) can be obtained directly as

$$i \;=\; \left\lfloor \frac{q\left(\frac{2b}{a}j + j^2\right)}{(n-1)^2 + \frac{2b}{a}(n-1)} \right\rfloor$$

so the owning processor of any template element can be computed in constant time.
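Read as code, the owner lookup just derived is a one-line formula. The sketch below is an assumed helper (the name and the clamping of the last template element are mine), again for an affine density with a > 0; on the example used above it reproduces the same assignment of elements to processors as the break points [0, 6, 9, 11].

# Sketch of the constant-time owner lookup for w(j) = a*j + b, a > 0.
def owner_of(j, a, b, n, q):
    K = (n - 1) ** 2 + (2 * b / a) * (n - 1)
    i = int(q * ((2 * b / a) * j + j * j) / K)    # floor of the bound derived above
    return min(i, q - 1)                          # clamp j = n-1 onto the last processor

# Example: w(j) = j, 11 template elements, 3 processors.
print([owner_of(j, a=1, b=0, n=11, q=3) for j in range(11)])
# -> [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2]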