SCALABLE DATA REDISTRIBUTION SERVICES FOR DISTRIBUTED-MEMORY MACHINES

By

Edgar T. Kalns

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1995

ABSTRACT

SCALABLE DATA REDISTRIBUTION SERVICES FOR DISTRIBUTED-MEMORY MACHINES

By Edgar T. Kalns

Run-time data redistribution can effect higher algorithm performance in distributed-memory machines. Redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Data-parallel Fortran languages, e.g., HPF (High Performance Fortran), support run-time data redistribution. Data redistribution can be invoked explicitly with primitives or may occur implicitly. One type of implicit data redistribution occurs when the distribution of actual parameters to a subprogram does not match the distribution of the dummy arguments in the subprogram interface. Typically, with distributed-memory machines, redistribution causes data exchange among processor memories, resulting in interprocessor communication overhead. Consequently, there is a performance tradeoff between the higher efficiency of a new data distribution for subsequent computation and the time required to establish it.

This dissertation investigates the pertinent issues that affect the performance of data redistribution on distributed-memory machines, focusing on four primary areas. First, we address the partitioning (or mapping) of data onto processor memories. A technique that facilitates the minimal amount of data exchange among processor memories during redistribution between HPF's regular patterns is proposed. Second, we present the design of a portable and communication-efficient data redistribution library whose implementation is portable among a large class of distributed-memory machines. Portability is enhanced through MPI (Message Passing Interface) communication primitives. Third, we develop a framework for quantifying the scalability of parallel algorithms together with the machines upon which they execute. Fourth, we apply the framework to quantify the scalability of the data redistribution library for a large range of processor configurations and data set sizes on selected distributed-memory machines.

© Copyright 1995 by Edgar T. Kalns
All Rights Reserved

To my parents

ACKNOWLEDGMENTS

My deepest gratitude to my parents, whose continual love and encouragement helped me achieve my academic goals. For as long as I can remember, they have told me, and shown me by example, that life's most gratifying accomplishments require industriousness, perseverance, tenacity, and an ability to find enjoyment not only in the achievement itself, but in the progression toward the goal.
To my dissertation advisor, Professor Lionel Ni, I extend my warmest thanks for his guidance, inspiration, and patience. His door was always open to me, his consultation forthcoming whenever I asked for it. With his help, I honed my abilities to think critically as a researcher, and I heightened my satisfaction of pursuing solutions to difficult problems in Computer Science. I will always be indebted to him for his contribution to my professional development. Additionally, I would like to thank the remaining members of my Ph.D. guidance committee, Professors Li, McKinley, Mutka, and Rover, for their numerous helpful comments and suggestions.

I would like to thank a number of fellow graduate students with whom I have developed close friendships. I am particularly grateful to Barbara Birchler, who with her love, kindness, and encouragement, helped me immensely. She critiqued my ideas and proofread my technical reports and dissertation chapters. With her ever-present smile and cheerful nature, she helped me through the inevitable ups and downs. To my "lunch crowd" friends: Reid Baldwin, Barb Birchler, Kevin Bradtke, Eddie Burris, Dave Chesney, Marie-Pierre Dubuisson, Jon Engelsma, Maureen Galsterer, Dan Meyers, Michele Morin, Tim Newman, Ron Sass, Steve Turner, Christian Trefftz, and Steve Walsh, I extend my thanks for many fond, and oftentimes boisterous, noontime get-togethers. A special thanks to Ron Sass for his careful preparation of the document style used to format this dissertation in compliance with the Graduate School's regulations.

To Frank Northrup and to all the past and present Computer Science systems administrators with whom I have worked, I extend my gratitude for the camaraderie and team spirit we shared in managing one of the largest computing facilities in the country. I would especially like to thank Frank for his understanding and friendship.

Lastly, I extend my gratitude to Michigan State University as a whole for providing me the highest quality graduate education. As a Michigan native, I have been extremely fortunate that this institution, attended by students the world over, exists in "my backyard."

TABLE OF CONTENTS

LIST OF TABLES x
LIST OF FIGURES xi
1 Introduction 1
1.1 HPF Example 4
1.2 Motivation 6
1.3 Organization of Dissertation 9
2 Data Decomposition 11
2.1 Static: Align and Distribute 12
2.2 Dynamic: Realign and Redistribute 16
2.3 Example of Redistribution in HPF 18
2.4 Rudimentary Operation of Data Redistribution 20
2.4.1 Determination of Processor Send and Receive Sets 21
2.4.2 Data Exchange with Message-Passing 23
2.5 Implicit Redistribution 25
2.6 Redistribution Optimizations 28
3 Data Mapping Optimization 32
3.1 One-dimensional Logical Processor Mapping 33
3.1.1 Processor Mapping Technique 35
3.1.2 Mapping Function 38
3.1.3 Optimality of Mapping Function 39
3.2 Multidimensional Logical Processor Mapping 46
3.2.1 m-dimensional Redistribution 46
3.2.2 Two-dimensional to One-dimensional Redistribution 48
3.2.3 One-dimensional to Two-dimensional Redistribution 53
4 Data Redistribution Library 59
4.1 Library Design Issues 60
4.1.1 Advantages of Software Libraries 60
4.1.2 Role of HPF Compiler 61
4.1.3 Use of MPI for Libraries 63
4.1.4 Scalability 64
4.2 DaReL Design Overview 65
4.2.1 DaReL Interface 65
4.2.2 Compute Data Exchange Sets (CDES) 69
4.2.3 Partitioning CDES Functionality 72
4.2.4 Exchange Data (ED) 73
4.2.5 Standard and Derived Datatypes 76
4.2.6 Scalability 80
5 Quantifying Scalability 82
5.1 Examples of Ambiguity in Scalability 84
5.2 Goal-directed Scalability Metrics 87
5.3 Metrics for Quantifying Scalability 89
5.3.1 Taxonomy of Scalability Measures 92
5.4 Mathematical Framework 94
5.4.1 Algorithm and Machine Parameterization 94
5.4.2 Fixed-work Problems 95
5.4.3 Scaled-work Problems 98
5.4.4 Deriving AM Pair Parameters 99
5.5 A Three-dimensional Illustration of AM Pair Scalability 100
5.6 Case Study: Matrix Multiplication 104
5.6.1 Algorithm-Machine Pair a1M 105
5.6.2 Algorithm-Machine Pair a2M 106
5.6.3 Algorithm-Machine Pair a3M 108
5.6.4 Computing AM Pair Parameters 108
5.6.5 Scalability Quantification of a1M, a2M, and a3M 113
5.7 Inclusion of Other Scalability Models 115
6 Performance and Scalability 120
6.1 Processor Mapping Technique Analysis 122
6.1.1 Effect of Optimal Mapping on the Programmer 122
6.1.2 Effect of Optimal Mapping on the Compiler 125
6.1.3 Performance Comparison with Traditional Technique 126
6.2 DaReL Performance and Scalability 127
6.2.1 CDES and ED Execution-time Performance 129
6.2.2 Algorithm Choices for ED 134
6.2.3 Scalability 141
7 Conclusion and Future Work 147
7.1 Research Contribution 147
7.2 Future Work 150
BIBLIOGRAPHY 152

LIST OF TABLES

4.1 MPI Communication Primitive Choices 77
5.1 AM parameters for a1M, a2M, and a3M 112
5.2 S(λD) measure for a1M, a2M, and a3M 113
6.1 Redistribution test cases for DaReL 129

LIST OF FIGURES

1.1 HPF Linear System Solver
2.1 Example of HPF alignment
2.2 One- and two-dimensional HPF data distributions
2.3 Example of !HPF$ REDISTRIBUTE
2.4 Redistribution from (*,BLOCK,CYCLIC) to (CYCLIC,CYCLIC,BLOCK)
2.5 Collective communication patterns
2.6 Realignment of x causes redistribution
2.7 Redistribution at subprogram interface
3.1 Data Distribution and Redistribution onto physical nodes
3.2 Logical processor to data mapping alternatives
3.3 Mapping lpids to place holders
3.4 lpid_i is mapped to first element of its data block
3.5 z ≤ p
3.6 z > p
3.7 (BLOCK3,BLOCK4) to (CYCLIC3(3),CYCLIC4(2)) redistribution
3.8 (BLOCK3,BLOCK2) to (CYCLIC6(2),*) redistribution
3.9 (BLOCK4,BLOCK3) to (CYCLIC12(1),*) redistribution
3.10 (BLOCK6,*) to (CYCLIC(2),BLOCK) redistribution
3.11 (BLOCK12,*) to (CYCLIC4,BLOCK3) redistribution
4.1 Distinct scatter-gather sets
4.2 C-based interface for DaReL
4.3 Parameters for the invocation of DaReL
4.4 Different shaped process configurations
4.5 CDES example for process (1,1)
4.6 Partitioning CDES
4.7 ED using MPI point-to-point communication
4.8 ED using MPI collective communication
4.9 Limitation of derived types and collective communication
5.1 Comparison of three AM pairs
5.2 Scaled work vs. execution time of two AM pairs
5.3 Quality of Solution for sin(x)
5.4 Execution time vs. number of processors tradeoff
5.5 Data distribution of a1
5.6 Algorithm a1
5.7 Data distribution and operation of a2
5.8 Communication pattern of a2
5.9 Algorithm a2
5.10 Data distribution and operation of a3
5.11 Algorithm a3
5.12 Tree-based MPI Reduction used in a3M
5.13 A generic AM pair 3D surface illustration
5.14 Scalability metrics illustrated on AM surface
5.15 S(λD) comparison of a2M and a3M where λD = 20
5.16 S(λD) comparison of a1M and a2M where λD = 5
5.17 Incorporating speed metric
5.18 Incorporating efficiency metric
6.1 Mappings when physical node mapping is static
6.2 Mappings when physical node mapping is dynamic
6.3 (BLOCK,*) to (CYCLIC(c),*) on 8 x 1 processors
6.4 (BLOCK,BLOCK) to (CYCLIC(c),CYCLIC(c)) on 4 x 3 processors
6.5 (BLOCK,BLOCK) to (CYCLIC(c),CYCLIC(c)) on 6 x 4 processors
6.6 CDES performance on 20 processor configurations
6.7 ED performance on 20 processor configurations
6.8 Local data set for (BLOCK,BLOCK) to (CYCLIC,CYCLIC)
6.9 Local data set for (BLOCK,CYCLIC) to (CYCLIC,BLOCK)
6.10 CDES performance on 100 processor configurations
6.11 ED performance on 100 processor configurations
6.12 (BLOCK,BLOCK) to (CYCLIC,CYCLIC) redistributions
6.13 Communication paradigm using Scatter (top) and Gather (bottom)
6.14 1000 x 1000 redistribution with ...
6.15 DaReL scalability measured by NegSlope(N, N')
Chapter 1

Introduction

Only in the past several years have Scalable Parallel Computers (SPCs) come to be viewed not only as indispensable for large-scale engineering and scientific applications, but also as viable platforms for commercial and financial applications. Many SPCs, such as the IBM Scalable POWERparallel (SP-1 and SP-2), Intel Paragon, Cray T3D, nCUBE-2/3, and Networks of Workstations (NOW), have demonstrated scalable performance in these domains. Distributed-memory machines comprise an ensemble of nodes where each node consists of a processor, local memory, and other supporting devices. The nodes of the machine are interconnected by a point-to-point (direct) or switch-based (indirect) network. These distributed-memory systems offer proportional performance increases as the processor configuration size, network and I/O bandwidth, and memory capacity and bandwidth are increased without changing the basic machine architecture.

In order to fully utilize SPCs, the application software must also be scalable to exploit increased computing capacity as it becomes available with larger machine configurations. Programming SPCs has been a major challenge impeding the greater success of such systems. Unfortunately, the traditional message-passing programming paradigm for these machines, based on separate name spaces, is tedious, time consuming, and error-prone for programmers. In contrast, the data-parallel or SPMD programming model provides an easier and more familiar programming style for users. The SPMD model is based on a single name space and loosely synchronous parallel computation with a distinct data set for each processor. In order to provide high-level language support for data-parallel programming, several data-parallel Fortran languages have been proposed, such as Fortran D [1] and Vienna Fortran [2]. The High Performance Fortran Forum, composed of over forty academic, industrial, and governmental agencies, has developed HPF (High Performance Fortran) [3] in an effort to standardize data-parallel Fortran programming for distributed-memory machines. Many of the concepts originally proposed in Fortran D, Vienna Fortran, and other data-parallel Fortran languages have been incorporated in HPF.[1] Dataparallel C [4] and pC++ [5] represent language design efforts focusing on C-based data-parallel extensions.

[1] The research contained in this dissertation is presented in the context of HPF due to its emerging acceptance as a standard. With minor syntactic changes, the concepts proposed herein could be applied to other data-parallel Fortran languages as well.

An essential part of HPF is the specification of the decomposition of data arrays through compiler directives. Due to the non-uniform memory access times characteristic of distributed-memory machines, determining an appropriate data decomposition is critical to the performance of data-parallel programs on these machines. The data decomposition problem involves data distribution, which deals with how data arrays should be distributed among processor memories, and data alignment, which specifies the collocation of data arrays. The goal of data decomposition is to maximize system performance by balancing the computational load among the processors and by minimizing remote memory accesses (or communication messages).
To facilitate data distribution onto an application-dependent, rather than a machine-dependent, topology, an HPF programmer declares a multi-dimensional mesh of logical (or virtual) processors. The mapping of the logical processors to the physical machine is determined by the compiler or run-time system.

HPF provides flexible primitives for specifying data decompositions; however, it does not provide the programmer any guidance in selecting a suitable data decomposition. Given the large number of distribution and alignment possibilities, it is inherently difficult for programmers to select an appropriate decomposition which maximizes program performance. Mace [6] proved that determining an optimal data distribution, even for one- and two-dimensional arrays, is NP-complete. Li and Chen [7] proved that finding a set of alignments for the indices of multiple program arrays that minimizes data movement among the processors is also NP-complete. A number of heuristics for determining suitable distributions and alignments have been proposed in the literature.

Assuming a viable technique for selecting an appropriate data decomposition, the best possible performance may not be achieved with only a single data distribution or alignment for the entire program. Depending upon the algorithm and its data-parallel implementation, a particular data decomposition that is well-suited for one phase of an algorithm may not be good, in terms of performance, for a subsequent phase. Therefore, many data-parallel languages define mechanisms for explicit run-time data redistribution and realignment.

1.1 HPF Example

We illustrate the above data decomposition constructs in a small HPF code example for solving a linear system of equations. Figure 1.1 illustrates an HPF program for solving the system Ax = b using Gaussian elimination and subsequent backward substitution [8]. The basic idea of the algorithm is to reduce the matrix A to upper triangular form in the first phase and then to use backward substitution to diagonalize the matrix. In the program, A is declared as an n x (n+1) array, where the (n+1)st column of A stores the constant vector b. Following the array declarations are a set of compiler directives, including the alignment and distribution primitives described earlier. Statements s7-s8 demonstrate the alignment of two different data vectors to the rows of A, while statement s9 shows the alignment of another array to the (n+1)st column of A. Statement s10 declares a vector of n logical processors. The matrix, along with its aligned data, is distributed by contiguous rows in statement s11 onto the logical processor array. Statement s27 specifies a run-time redistribution of data between the Gaussian elimination phase (s13-s26) and the backward substitution phase (s28-s33). Note that this code segment merely illustrates HPF constructs and is by no means the most efficient algorithm.
s1:        PROGRAM LINSYS(A,x,n)
s2:        REAL, INTENT(IN)  :: A(:,:)
s3:        REAL, INTENT(OUT) :: x(:)
s4:        INTEGER, INTENT(IN) :: n
s5:        INTEGER xindx(n)
s6:        REAL intrm(n), temp(n)
s7:  !HPF$ ALIGN x(:) WITH A(*,:)
s8:  !HPF$ ALIGN xindx(:) WITH A(*,:)
s9:  !HPF$ ALIGN intrm(:) WITH A(:,n+1)
s10: !HPF$ PROCESSORS P(n)
s11: !HPF$ DYNAMIC, DISTRIBUTE A(BLOCK,*) ONTO P
s12:       FORALL (i = 1:n) xindx(i) = i
s13:       DO i = 1, n
s14:         maxloc = MAXLOC(A(i,i:n))
s15:         maxval = A(i,maxloc)
s16:         temp = A(:,maxloc)
s17:         A(:,maxloc) = A(:,i)
s18:         A(:,i) = temp
s19:         tempx = xindx(maxloc)
s20:         xindx(maxloc) = xindx(i)
s21:         xindx(i) = tempx
s22:         A(i,i:n+1) = A(i,i:n+1) / maxval
s23: !HPF$   INDEPENDENT (j,k)
s24:         FORALL (j = i+1:n, k = i+1:n+1)
s25:           A(j,k) = A(j,k) - A(j,i) * A(i,k)
s26:       END DO
s27: !HPF$ REDISTRIBUTE A(CYCLIC,*)
s28:       intrm(n) = 0
s29:       DO i = n, 1, -1
s30:         x(xindx(i)) = (A(i,n+1) - intrm(i)) / A(i,i)
s31:         FORALL (j = i-1:1:-1)
s32:           intrm(j) = intrm(j) + (A(j,i) * x(xindx(i)))
s33:       END DO

Figure 1.1: HPF Linear System Solver

1.2 Motivation

Two-dimensional and three-dimensional FFT (Fast Fourier Transform) [9] and ADI (Alternating Direction Implicit method) [10] are frequently cited examples for which efficient data redistribution between algorithm phases results in increased performance. Use of redistribution mechanisms, however, involves communicating program data arrays among the nodes of the machine, resulting in interprocessor communication overhead. Data redistribution may require a temporary interruption of application-useful computation while processors await their new data. Consequently, there is a performance tradeoff between the higher efficiency of a new data decomposition for a subsequent algorithm phase and the time required to establish it.

Many researchers have espoused the necessity of incorporating a redistribution capability into Fortran- and C-based data-parallel languages. The Kali language [11] was one of the first to incorporate run-time data redistribution mechanisms. DINO [12] addresses the implicit redistribution of data at procedure boundaries. The Hypertasking compiler [13] for data-parallel C programs incorporated a run-time redistribution facility. Both Vienna Fortran [14] and Fortran D [15] specify data redistribution primitives as well.

While data-parallel languages provide run-time primitives that facilitate an explicit change in data decomposition, implicit data redistribution is possible as well. Array assignment statements, subprogram boundaries, and array realignment are implicit sources of redistribution in data-parallel programs. For instance, consider two arrays, A and B, whose distributions are different. The assignment statement A=B causes an implicit redistribution of array B to match the distribution of array A. At subprogram boundaries, data redistribution occurs when the distribution of actual parameters to a subprogram does not match the distribution of the dummy arguments in the subprogram interface. Finally, realignment of data can require the redistribution of data to satisfy the chosen new alignment.

Many data-parallel languages define dozens of language intrinsics that could be implemented as software libraries, significantly simplifying the role of the language compiler. Data redistribution is one type of operation whose implementation can be provided as a software library. Providing a rich set of libraries to data-parallel programmers is becoming increasingly popular because libraries offer some unique advantages [8].
For instance:

• A uniform interface provided by the library enhances the portability of programs among various distributed-memory architectures.

• Although a library provides a uniform interface to programs, the design and implementation of the library can explicitly exploit machine-specific features to provide better performance.

• The program development cycle for the consumer of the library can be shortened, as the correctness of libraries can be independently verified.

• The existence of libraries can simplify the tasks that data-parallel compilers would otherwise have to perform.

As there are multiple sources of redistribution in a data-parallel program, even when it is not explicitly invoked by the programmer, we contend that research in the area of data redistribution in distributed-memory machines has particular merit. Changing array data decompositions within a program may yield substantial performance gains, but only if the overhead of such operations is mitigated. Therefore, the efficient and scalable implementation of data redistribution mechanisms is important to the overall performance of data-parallel programs on SPC architectures.

The term scalability has been used extensively in the parallel processing community to characterize the ability of parallel architectures and algorithms to exhibit greater performance as more processors are employed to solve a problem. However, the term "performance" is equivocal, as it may be used to identify widely varying algorithmic or architectural properties: speedup, execution time, processor speed or efficiency, or the quality (accuracy) of a solution. Many of these properties have been used either separately or in conjunction to describe the scalability of parallel algorithms and architectures. The ambiguity regarding scalability leads us to ask several questions: How are the relative scalabilities of parallel algorithms and machines related? Can a universal definition of scalability be applicable to all scenarios? How can scalability be quantified? If scalability can be quantified, which property best describes it, or is scalability a combination of factors? Answers to these questions are relevant to providing scalable redistribution methods.

Efficient and scalable data redistribution necessitates consideration of many issues. First, redistribution is a communication-dominant task; thus, the efficiency of the communication mechanisms used is of great importance. For instance, the relative merits of point-to-point or collective communication for redistribution must be considered, along with the tradeoffs of using blocking or non-blocking message-passing. Second, the amount of data communicated between processors during redistribution is a significant factor in the total execution time of the operation. Consequently, methods for reducing the amount of data that must be exchanged via message-passing may benefit redistribution performance. Third, to be viable to potential data-parallel programmers and compilers alike, efficient redistribution must consider the obtained performance as data and processor configuration sizes are scaled. Quantification of redistribution scalability using relevant performance metrics facilitates comparison of different techniques. Fourth, to the extent possible, efficient redistribution mechanisms should be portable among distributed-memory machines to enhance their utility in today's heterogeneous high-performance computing environments.
Problem Statement

The efficient operation of implicit and explicit data redistribution on distributed-memory architectures can greatly impact the overall performance of data-parallel programs. This dissertation investigates language, data mapping, message-passing, and scalability issues, all of which affect the performance of data redistribution. This dissertation is distinguished as the earliest known research to propose a portable and scalable redistribution library, processor-data mapping techniques for optimizing data exchange, and a framework for quantifying the scalability of data redistribution on SPC platforms.

1.3 Organization of Dissertation

Chapter 2 discusses static (distribute and align) and dynamic (redistribute and realign) data decomposition. Salient data redistribution issues are discussed and related work is summarized. Chapter 3 presents a technique for partitioning (or mapping) data onto processor memories that facilitates the minimal amount of data exchange among processor memories during redistribution between a large class of regular HPF distribution patterns. The technique is independent of the underlying machine architecture, and thus it is portable among distributed-memory platforms. We prove the optimality of the technique with respect to minimizing the amount of data exchanged. Chapter 4 proposes a data redistribution library, DaReL, for distributed-memory machines that utilizes the emerging message-passing standard, MPI [16]. The library supports multi-dimensional data redistribution among arbitrary regular HPF distribution patterns. A uniform interface to DaReL is defined, facilitating its potential incorporation into a data-parallel compiler. Chapter 5 proposes a framework for quantifying scalability that incorporates end-user requirements to enable quantification of the scalability of parallel algorithm-machine combinations, or pairs. We illustrate the framework with a matrix multiplication case study to assess the relative performance of several different algorithm-machine pairs. Chapter 6 presents performance and scalability results. We demonstrate redistribution performance improvements of up to 40% using the processor mapping technique (Chapter 3) on a distributed-memory machine, an IBM SP-x.[2] We address the impacts of the mapping technique on the data-parallel programmer and compiler, respectively. We extend the scalability framework (Chapter 5) to quantify DaReL's performance as data and processor configuration sizes are scaled. Chapter 7 concludes the dissertation with a summary of the major contributions and suggests potential future research directions.

[2] SP-x denotes an IBM SP multicomputer consisting of SP-1 processors and an SP-2 network switch.

Chapter 2

Data Decomposition

Due to the non-uniform memory access times of distributed-memory machines, determining an appropriate decomposition of data among their memories is critical to the performance of data-parallel programs on these machines. The goal of data decomposition is to maximize system performance by balancing the computational load among the processors and by minimizing remote memory accesses. Data-parallel languages, e.g., HPF, support compiler directives that statically specify the decomposition of data to processor memories prior to program execution. Additionally, these languages support modification of data decompositions during run-time in favor of a different one that, presumably, is better suited to the ensuing computation.
Existing data-parallel Fortran languages, however, provide no inherent capability for determining static data placement or run-time modification of data decomposition. This chapter surveys research in the area of data decomposition with particular emphasis on data redistribution issues.

2.1 Static: Align and Distribute

In HPF, data arrays are aligned relative to one another and then, as a group, they are distributed onto an abstract, or logical, processor configuration.[1] This process of static data decomposition is accomplished with the !HPF$ ALIGN, !HPF$ DISTRIBUTE, and !HPF$ PROCESSORS compiler directives. These directives do not affect the result of the program; rather, they suggest an implementation strategy to the compiler. Therefore, these directives may only appear in the declaration part of an HPF program.

[1] The mapping of logical processors onto the physical nodes of the machine is not within the scope of HPF. Henceforth, we assume logical processor when we write processor, unless otherwise stated.

The !HPF$ ALIGN directive specifies the collocation of data array elements from distinct arrays. Figure 2.1 illustrates a possible alignment of an array x to the m-th column of an n x m matrix A. For aligning multiple arrays to one another, it is sometimes convenient to define an HPF template. The !HPF$ PROCESSORS directive defines a processor configuration as a multi-dimensional mesh of processors. Data arrays, or templates, are mapped onto a processor configuration with the !HPF$ DISTRIBUTE directive. The programmer applies an HPF distribution pattern to each dimension of a data array, specifying the mapping of the data dimension onto a dimension of the processor configuration. The number of data dimensions must equal the number of processor dimensions. BLOCK, CYCLIC, BLOCK(b), and CYCLIC(c) comprise the set of regular data distribution patterns in HPF; * denotes the absence of a distribution pattern for a dimension, i.e., the entire dimension is allocated to the processor. [17] discusses possible HPF language extensions to permit user-defined array distribution patterns.[2] With these mechanisms, the program data is aligned and distributed among the processors. By HPF's owner-computes rule, the processor that stores the data element on the left-hand side of a program statement is responsible for computing its update. Thus, the distribution of computation is determined by the decomposition of data.

[2] This dissertation focuses on regular distribution patterns as defined in the original language specification.

[Figure 2.1: Example of HPF alignment. The vector x is aligned with the m-th column of an n x m matrix A.]

With the BLOCK specification, contiguous blocks of an array are distributed to each processor. If n is the data size along a given dimension and p denotes the number of processors, the block size is ⌈n/p⌉ for the first p-1 processors, the last processor receiving the residual if it does not receive a full block. With the CYCLIC distribution, elements of a dimension of an array are assigned to each processor in a round-robin fashion. An extension of BLOCK and CYCLIC are the BLOCK(b) and CYCLIC(c) distribution patterns. With BLOCK(b), each processor is assigned a data segment of size b, with the restriction that n ≤ p x b, while CYCLIC(c) distributes round-robin contiguous blocks of size c to each processor. By definition, CYCLIC is equivalent to CYCLIC(1).
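The ownership rules just described can be made concrete with a short sketch. The following C fragment is illustrative only: it is not part of HPF or of any library discussed in this dissertation, the function names are invented for the example, and it assumes 0-based global indices and 0-based logical processor IDs (HPF arrays are 1-based).

    #include <stdio.h>

    /* BLOCK: contiguous blocks of size ceil(n/p). */
    int owner_block(int i, int n, int p) {
        int b = (n + p - 1) / p;          /* block size ceil(n/p)        */
        return i / b;
    }

    /* CYCLIC: element i goes to processor i mod p. */
    int owner_cyclic(int i, int p) {
        return i % p;
    }

    /* CYCLIC(c): blocks of c elements dealt round-robin. */
    int owner_cyclic_c(int i, int p, int c) {
        return (i / c) % p;
    }

    int main(void) {
        /* Reproduce Figure 2.2 (a)-(c): an eight-element vector. */
        int n = 8;
        for (int i = 0; i < n; i++)
            printf("i=%d  BLOCK,p=4 -> %d  CYCLIC,p=4 -> %d  CYCLIC(2),p=2 -> %d\n",
                   i, owner_block(i, n, 4), owner_cyclic(i, 4),
                   owner_cyclic_c(i, 2, 2));
        return 0;
    }

Run on the eight-element vector of Figure 2.2, the three functions reproduce the ownership shown in panels (a), (b), and (c), respectively.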
BLOCK(b) and CYCLIC(c) imply the same distribution when b = c and n ≤ p x b; however, BLOCK(b) additionally asserts that there is no wrap-around. The potential round-robin effect of CYCLIC(c), when n > p x c, may be difficult to discern at compile-time if c is an expression.

Figure 2.2 illustrates various data distribution examples: (a) and (b) show an eight-element vector distributed BLOCK and CYCLIC, respectively, on four processors; (c) illustrates CYCLIC(c) where c = 2 on two processors; (d), (e), and (f) show various possible distributions for an 8 x 8 data matrix. The distributions for (d)-(f) are shown as an ordered pair: the first pattern applies to the row dimension while the second pattern applies to the column dimension. The processor IDs, indicating data ownership, are shown in italics: in (a)-(c) they are superimposed over the data, while in (d)-(f) they are shown as Cartesian coordinates. Processor IDs are numbered beginning with 0. Recall that the * denotes the absence of a distribution pattern for a dimension. The row dimension is not distributed in (f).

[Figure 2.2: One- and two-dimensional HPF data distributions. (a) (BLOCK), p = 4; (b) (CYCLIC), p = 4; (c) (CYCLIC(2)), p = 2; (d) (CYCLIC,BLOCK), p = 4 x 2; (e) (BLOCK,CYCLIC(2)), p = 2 x 2; (f) (*,CYCLIC), p = 1 x 4.]

Figure 2.2 shows only a few of the many possible distributions for the one- and two-dimensional data. Recall that, depending upon the application algorithm, the choice of data distribution will affect the computational load balance and the size and number of remote memory accesses when implemented on a distributed-memory machine. Selecting the optimal distribution is NP-complete, as stated in Chapter 1. The situation is further complicated when one considers the possible data array alignments. Automatically determining array alignments is an active area of research [18].

Several heuristics have been proposed to automatically determine an appropriate data decomposition based on the application code. A survey of some of these heuristics and their limitations and drawbacks can be found in [19]. Anderson and Lam [20] present a mathematical framework for systematically determining data decompositions. A first-order model for reducing the search space of possible distributions to a select few, based on architecture-specific communication and computation parameters, can be found in [21]. Xu and Ni [22] propose low-complexity algorithms for determining array alignments in an effort to maximize processor load balance while minimizing interprocessor communication. Ponnusamy et al. [23] focus on automatic selection of data distributions for irregular problems, i.e., programs for which the data access patterns are input-dependent. Lee and Tsai [24] propose a dynamic programming technique for determining data distributions.
The HPF programmer specifies an initial distribution of data onto a set of logical processors as discussed previously. The data mapping, from the programmer's perspective, is independent of the underlying machine topology and physical processor IDs. The mapping of the logical processor IDs to the physical nodes of the machine is relegated to the compiler. To facilitate efficient program execution, the logical to physical processor mapping must consider the machine topology, communication subsystem, memory capacity, and number of nodes [25]. Chittor and Enbody [26, 27] investigate logical to physical processor mapping for wormhole-routed architectures.

2.2 Dynamic: Realign and Redistribute

Run-time data redistribution has been shown to improve the performance of FFT and ADI. HPF defines run-time primitives for changing data alignment (!HPF$ REALIGN) and distribution (!HPF$ REDISTRIBUTE). These constructs may be inserted anywhere in the executable portion of a program to change the alignment or distribution of the global data arrays or templates. Henceforth, we shall refer to the aggregate program data as the global data and the data owned by each processor, from the perspective of the processor, as its local data.

To determine the appropriateness of utilizing these run-time primitives, an HPF programmer must consider the tradeoff between the anticipated increased performance of a new distribution or alignment and the cost of changing the data decomposition, which typically involves interprocessor communication. Given the difficulty of estimating the execution time of a phase of an algorithm with a particular decomposition, the effective use of realignment and redistribution is an important research issue. Chatterjee et al. [28] propose a model for assessing realignment cost, and they formulate the alignment problem as an optimization of realignment cost. Algorithms for automatically determining good mobile alignments, i.e., alignments which are a function of the loop induction variable, are presented. Kunchithapadam and Miller [29] present a graph-coloring technique for recording data movement among nodes during program execution. They propose a simple algorithm to optimize the graph coloring to describe new distributions which would result in reduced communication. Thus, the technique identifies places within a program's execution where redistribution can effect greater program performance.

In contrast to HPF's approach to user-specified static and dynamic data decompositions, techniques for automatic data decomposition are proposed in [20]. The compiler, rather than the programmer, determines the decomposition of data for each program loop. Redistribution is viewed as an implicit activity to be undertaken when the decomposition of an array in a loop differs from the decomposition of the same array in another (subsequent) loop. They propose a graph model, where nodes represent loops and edges represent redistribution communication costs between loops. A greedy algorithm for eliminating edges with large weights, thereby eliminating costly redistributions, is presented. The elimination of the redistributions deemed to be too costly, however, is achieved at the cost of reduced parallelism in the resulting code. Chase and Reeves [30] propose a similar approach for automatically performing array redistribution. Arrays are grouped into compatibility classes, i.e., aligned together and distributed. The compiler inserts assertions about the classes into the code, and these assertions are checked during run-time to discern whether a redistribution is necessary. Should a compatibility relationship change, redistribution is performed. Unfortunately, the cost/benefit tradeoff of redistribution is not addressed.

Many researchers have espoused the necessity of incorporating a redistribution capability into data-parallel languages.
The Kali language [11] was one of the first to incorporate run-time data redistribution mechanisms. DINO [12] addresses the implicit redistribution of data at procedure boundaries. Hall et al. [15] discuss global optimizations that can be applied to a set of redistribution calls in a Fortran D program. The Hypertasking compiler [13] for data-parallel C programs incorporated a run-time redistribution facility. To avoid packing and unpacking of data and ownership calculation in some instances, the Hypertasking compiler performs data redistribution by moving an entire data set around to all processors, each processor removing its portion. Chapman et al. [14] introduce Vienna Fortran's dynamic data distribution capability and discuss its high-level implementation in the Vienna Fortran Engine (VFE).

2.3 Example of Redistribution in HPF

Figure 2.3 presents a segment of HPF code illustrating the use of data redistribution in HPF. A 20 x 12 x 10 array of real numbers, A, is initially distributed onto a three-dimensional processor configuration, P, with distribution patterns *, BLOCK, and CYCLIC applied to the three dimensions, respectively. P contains forty processors and each processor owns (20 x 12 x 10)/40 = 60 data elements. Following some amount of computation, A is redistributed with a new set of patterns, CYCLIC, CYCLIC, BLOCK, onto a different-shaped logical processor configuration, Q. Figure 2.4 illustrates the initial distribution of A (left) and its subsequent redistribution (right). The shaded portions of the figure depict the data elements of A owned by processor (0,0,0). Whereas (0,0,0) owns data in globally contiguous locations initially, it owns data in globally non-contiguous locations following redistribution. Note that processor configuration Q also consists of forty processors; however, P and Q vary in shape.

          REAL, DIMENSION(20,12,10) :: A
    !HPF$ PROCESSORS P(4,1,10)
    !HPF$ PROCESSORS Q(5,4,2)
    !HPF$ DISTRIBUTE A (*,BLOCK,CYCLIC) ONTO P
          ... (computation)
    !HPF$ REDISTRIBUTE A (CYCLIC,CYCLIC,BLOCK) ONTO Q
          ... (computation)

Figure 2.3: Example of !HPF$ REDISTRIBUTE.

[Figure 2.4: Redistribution from (*,BLOCK,CYCLIC) to (CYCLIC,CYCLIC,BLOCK). The shaded regions mark the elements of A owned by processor (0,0,0) before (left) and after (right) the redistribution.]

The example illustrates one redistribution possibility; many more distinct combinations of source and destination distribution patterns are possible. The HPF programmer has great flexibility in declaring data and processor configurations of arbitrary dimension. Recall, however, that HPF restricts the programmer to have the same dimensionality for the data, processor, and distribution pattern declarations; e.g., in Fig. 2.3, the data, processor, and pattern declarations are all three-dimensional. HPF does not require that the source and target processor configurations have equal size.

2.4 Rudimentary Operation of Data Redistribution

Henceforth, we denote the initial (source) distribution pattern(s) of a program as D_s and the destination (target) distribution pattern(s) as D_t. Similarly, we shall refer to the source and target processor configurations as P_s and P_t, respectively. Note that D_t and P_t are embedded in the !HPF$ REDISTRIBUTE primitive in Fig. 2.3. The previous example illustrates that data redistribution results in changing the mapping of data to processors, thus necessitating data exchange among the processors.
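To make this change of ownership concrete, the following C sketch computes the owning processor coordinates of a few elements of A under the source and target decompositions of Fig. 2.3. It is illustrative only: it assumes 0-based indices, assumes the k-th pattern is applied along the k-th dimension of the processor mesh, and, for the source, pairs the * dimension with the processor-mesh extent of 1 (i.e., it treats P as a 1 x 4 x 10 mesh), which is consistent with the 60 elements per processor quoted above.

    #include <stdio.h>

    /* Owner of global index i (0-based) along one dimension. */
    static int dim_owner(int i, int n, int p, char pat, int c) {
        switch (pat) {
        case '*': return 0;                      /* dimension not distributed   */
        case 'B': return i / ((n + p - 1) / p);  /* BLOCK: blocks of ceil(n/p)  */
        case 'C': return (i / c) % p;            /* CYCLIC(c); c = 1 for CYCLIC */
        }
        return -1;
    }

    int main(void) {
        int n[3]  = {20, 12, 10};
        int ps[3] = {1, 4, 10};  char pats[3] = {'*', 'B', 'C'};  /* source */
        int pt[3] = {5, 4, 2};   char patt[3] = {'C', 'C', 'B'};  /* target */
        int sample[3][3] = { {0, 0, 0}, {7, 5, 3}, {19, 11, 9} };

        for (int s = 0; s < 3; s++) {
            int src[3], dst[3];
            for (int d = 0; d < 3; d++) {
                src[d] = dim_owner(sample[s][d], n[d], ps[d], pats[d], 1);
                dst[d] = dim_owner(sample[s][d], n[d], pt[d], patt[d], 1);
            }
            printf("A(%2d,%2d,%2d): source owner (%d,%d,%d) -> target owner (%d,%d,%d)\n",
                   sample[s][0], sample[s][1], sample[s][2],
                   src[0], src[1], src[2], dst[0], dst[1], dst[2]);
        }
        return 0;
    }

Every element whose source and target owners differ must be communicated during the redistribution.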
From the perspective of a sending processor, the data exchange is viewed as a scattering of data elements to the processors in P_t, i.e., the sending of distinct data elements to each processor in the target processor configuration. This paradigm is illustrated in the top portion of Fig. 2.5. From the perspective of a receiving processor, the data exchange is viewed as a gathering of data elements from the processors in P_s, i.e., the receiving of distinct data elements from each processor in the source processor configuration. This paradigm is illustrated in the middle portion of Fig. 2.5. When P_s = P_t, data exchange is viewed as an all-to-all personalized communication, as illustrated in the bottom portion of Fig. 2.5.

[Figure 2.5: Collective communication patterns. Three panels (scatter, gather, and all-to-all) show processors P1-P4 and their send and receive buffers before and after each operation.]

In order to perform data redistribution between D_s and D_t, a processor must determine the identity of the processors from which it is to receive data as well as the identity of the processors to which it must send data. Once these sets are computed, processors exchange their data. Recall that we have defined redistribution between logical processors; therefore, the logical to physical processor mapping determines the actual data movement among the memories of the machine. For example, logical processors mapped to the same physical node would obviously not require message-passing to exchange data.

2.4.1 Determination of Processor Send and Receive Sets

Several research efforts have focused on efficient methods for determining the send and receive processor sets for redistribution. Gupta et al. [31] derive closed-form expressions for these communication sets based on Fortran D's (BLOCK), (CYCLIC), and (BLOCK-CYCLIC) distribution patterns. In this approach, global array (or template) indices combined with knowledge of D_s and D_t are used to compute the send and receive sets. An alternative approach [32] is for each node to scan its local data array once, determine the destination processor for each element, and place the element in a message packet bound for that processor. In [32], one-dimensional data redistribution is performed by distinct algorithms for different combinations of distribution patterns for D_s and D_t. Distinct algorithms are proposed in order to introduce optimizations that apply only to the given case. Multi-dimensional redistributions are implemented as a series of (sequential) one-dimensional redistributions. This approach to multidimensional redistributions is unnecessarily costly and does not scale well. Stichnoth et al. [33] propose methods for computing ownership sets for array assignment statements. Due to the similarity of determining send and receive sets, they advocate computing these together on the sending processor and communicating the information together with the data to a receiver. While this approach is chiefly intended for communicating right-hand side operands, it can be incorporated into data redistribution. Ramaswamy and Banerjee [34] propose a mathematical representation for regular distributions called PITFALLS, which facilitates determining the processor sets for data redistribution. PITFALLS robustly handles arbitrary source and target processor configurations and an arbitrary number of data array dimensions in a scalable manner. PITFALLS is being developed for inclusion in the PARADIGM [35] compiler project at the University of Illinois. The research presented in [31, 32, 33, 34] focuses on the efficiency of computing send/receive processor sets, excluding the actual data exchange portion of redistribution, which can be several orders of magnitude more costly, in terms of execution time, than send and receive set determination; see Chapter 6.
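As an illustration of the local-scan strategy attributed to [32] above, the following C sketch builds per-destination send packets for a one-dimensional BLOCK to CYCLIC(c) redistribution on p processors. It is a sketch only, with invented names, 0-based indexing, and the surrounding message-passing layer omitted; it is not the implementation of [32] or of DaReL.

    #include <stdio.h>

    /*
     * One pass over the local BLOCK-owned segment: for every element, compute
     * the global index, its CYCLIC(c) destination, and append the element to
     * the packet bound for that destination.  counts[q] ends up holding the
     * size of the send set for processor q.
     */
    void build_send_packets(const double *local, int n, int p, int c,
                            int my_lpid, double **packets, int *counts)
    {
        int b     = (n + p - 1) / p;          /* BLOCK size ceil(n/p)      */
        int first = my_lpid * b;              /* first global index owned  */
        int last  = first + b; if (last > n) last = n;

        for (int q = 0; q < p; q++) counts[q] = 0;

        for (int g = first; g < last; g++) {
            int dest = (g / c) % p;           /* owner under CYCLIC(c)     */
            packets[dest][counts[dest]++] = local[g - first];
        }
    }

    int main(void) {
        enum { N = 16, P = 4 };
        double local[4], store[P][4], *packets[P];
        int counts[P], me = 1;
        for (int i = 0; i < 4; i++) local[i] = me * 4 + i;  /* globals 4..7 */
        for (int q = 0; q < P; q++) packets[q] = store[q];
        build_send_packets(local, N, P, 1, me, packets, counts);
        for (int q = 0; q < P; q++) printf("to %d: %d element(s)\n", q, counts[q]);
        return 0;
    }

The caller would allocate each packet with room for at most ⌈n/p⌉ elements and hand the non-empty packets to the message-passing layer, one send per destination.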
2.4.2 Data Exchange with Message-Passing

Data exchange on a distributed-memory machine is performed with either point-to-point or collective communication message-passing primitives. In an effort to standardize message-passing primitives on distributed-memory platforms, the MPIF (Message Passing Interface Forum) [16] was founded. MPI defines a host of point-to-point and collective primitives. Collective communication primitives such as MPI_Scatter, MPI_Gather, or MPI_Alltoall may achieve lower communication latencies and less network contention than straightforward point-to-point routines if implemented to effect parallelization in communication. Techniques for efficient multiple scatter and gather operations are presented in [36]. To achieve higher performance, the scatter and gather operations may be optimized for the specific architecture. Portability of a redistribution implementation that utilizes MPI primitives would be ensured among platforms which conform to the MPI standard. A data redistribution library utilizing MPI is described in Chapter 4. McKinley et al. [37] survey issues related to efficient collective communication in wormhole-routed machines. Of particular relevance to data redistribution, they review techniques for all-to-all personalized communication.

The efficiency of the simultaneous redistribution of data among physical processors is affected by the topology, routing, and switching mechanisms of the underlying machine. The routing mechanism, such as dimension-order routing [38, 39] for hypercube and mesh-connected machines, affects communication latency. A technique for communication-efficient data redistribution which addresses message contention for certain topologies is presented in [40]. The authors propose a data redistribution communication cost model which parameterizes the number of messages and their sizes. Network contention is modeled by expressing the communication as a sequence of permutations which may be executed in a fixed number of (contention-free) steps. Multi-phase redistribution is defined as redistributing data to intermediate distribution patterns, eventually arriving at the destination distribution. The model is used in conjunction with multi-phase redistribution to show that lower overall cost can be achieved as compared to single-phase redistribution. Redistributing perfect power-of-two sized arrays on hypercubes is discussed in [41].

As static distribution of a data array or template causes the distribution of all aligned data, redistribution of those arrays or templates causes the redistribution of aligned data as well. Depending upon the data alignment(s), there may be more data distributed to some processors than others. Thus, during redistribution, the amount of data to be exchanged among processor memories could vary significantly, resulting in an unbalanced communication load. This situation could increase the time of the redistribution operation for some physical nodes.
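One possible realization of the data-exchange step, assuming the per-destination packets have already been packed contiguously by destination rank (for example, as in the previous sketch), is MPI's all-to-all personalized primitive. The following C sketch is illustrative only and is not DaReL's implementation, which is described in Chapter 4; the casts are present because the classical MPI-1 C bindings do not declare these arguments const.

    #include <mpi.h>
    #include <stdlib.h>

    /*
     * Exchange step of a redistribution: every process has packed the elements
     * destined for rank q contiguously in sendbuf starting at sdispls[q], with
     * sendcounts[q] elements.  The receive layout is computed by first
     * exchanging the counts; the data then moves with a single MPI_Alltoallv.
     */
    void exchange_data(const double *sendbuf, const int *sendcounts,
                       const int *sdispls, MPI_Comm comm,
                       double **recvbuf_out, int **recvcounts_out)
    {
        int p;
        MPI_Comm_size(comm, &p);

        int *recvcounts = malloc(p * sizeof *recvcounts);
        int *rdispls    = malloc(p * sizeof *rdispls);

        /* Every process learns how much it will receive from every other one. */
        MPI_Alltoall((void *)sendcounts, 1, MPI_INT,
                     recvcounts, 1, MPI_INT, comm);

        int total = 0;
        for (int q = 0; q < p; q++) { rdispls[q] = total; total += recvcounts[q]; }
        double *recvbuf = malloc((total > 0 ? total : 1) * sizeof *recvbuf);

        /* The all-to-all personalized exchange itself (bottom of Fig. 2.5). */
        MPI_Alltoallv((void *)sendbuf, (int *)sendcounts, (int *)sdispls,
                      MPI_DOUBLE, recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);

        free(rdispls);
        *recvbuf_out    = recvbuf;
        *recvcounts_out = recvcounts;
    }

Whether such a collective outperforms a schedule of point-to-point sends depends on the MPI implementation and the underlying network, which is exactly the tradeoff discussed above.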
2.5 Implicit Redistribution

While the example in Figs. 2.3 and 2.4 illustrates explicit run-time redistribution of data arrays, other program statements or HPF constructs can cause the implicit redistribution of data. For instance, array assignment statements, !HPF$ REALIGN, and subprogram interfaces may cause implicit data redistribution in data-parallel programs.

Consider two arrays A and B that are distributed BLOCK and CYCLIC, respectively. The assignment statement A=B causes an implicit redistribution of array B to the distribution of array A. While such an assignment is not the same as a redistribution of B, the former operation must perform the same functions as the latter, e.g., computing processor sets and exchange of data. This type of redistribution may occur frequently in data-parallel programs with many arrays that have dissimilar distributions.

Figure 2.6 illustrates how realignment can cause implicit data redistribution. Initially, x is aligned to the last column of A; then it is shown realigned to the first row of A. With the (BLOCK,*) distribution, this would cause the implicit redistribution of a portion of x from processors 1, 2, and 3 to processor 0. Processor 0 must remap its own local data set to conform to the new alignment. Realignment, like static alignment, may cause an unbalanced communication load since processors may not own equal amounts of data.

[Figure 2.6: Realignment of x causes redistribution. Left: !HPF$ ALIGN x(:) WITH A(:,n). Right: !HPF$ REALIGN x(:) WITH A(1,:). The matrix A is distributed (BLOCK,*) over processors 0-3.]

Implicit redistribution can occur at subprogram (subroutine) boundaries. If a distribute or align directive is applied to a dummy argument specified in a subprogram interface, then HPF requires that the corresponding actual parameter to the subprogram conform with the dummy argument's decomposition. If the dummy argument specifies a decomposition that differs from the actual parameter, implicit data redistribution and/or realignment is necessary. Additionally, upon return to the main program, the original decomposition of the parameter data arrays must be re-established. Thus, there are two redistributions and/or realignments necessary for each parameter.
These issues are explored in the design of 28 FFT and matrix-vector multiplication subroutines. Hall et al. [15] propose insertion of redistribution primitives in the caller, rather than the callee. This approach is shown to lead to compiler optimizations, for instance, the elimination of unnecessary redistributions. 2.6 Redistribution Optimizations Due to the overhead cost that data redistribution represents, it can be beneficial to explore optimizations that can reduce the execution time of data redistribution. For instance, data that is to be redistributed at a certain point in program execution may not be referenced in subsequent computation. Certainly, for correct program execu- tion, such data need not be redistributed. For example, suppose the redistribution of a template T with aligned arrays A and B is called for, but only B is used in subsequent computation; redistribution of A would be unnecessary for correct pro- gram semantics. If the compiler is able to recognize such a situation, then the data could be left in place, reducing the redistribution cost. If such data represented a significant portion of the whole and could be recognized by the compiler as unused in later computation, then this optimization may prove useful. Less straightforward situations, however, may prove more difficult to optimize. For instance, say that A is a large matrix which is to be redistributed, but only a fraction of its elements will be accessed subsequently. A desirable optimization would be to only redistribute data elements which will be referenced later. Pipelined redistribution of data is another possible optimization considered by 29 several researchers. Strip Mining Redistribution [43] is a technique whereby the data to be communicated in a redistribution is temporally overlapped with the subsequent computation phase. Rather than redistribute all the data prior to resuming algorith- mic computation, data is redistributed in a pipelined manner. A key issue in this technique is the ability to identify the computation pattern under the destination distribution pattern and to arrange the communication schedule to enable significant overlapping with computation. The technique is shown to have speedups of up to 1.7 over “massive redistribution”, i.e., non-pipelined, when the data sizes are relatively large compared to message start-up latencies. Strip Mining Redistribution is limited to two-to-one and one-to—two dimension redistributions so that the size of the indi- vidual messages fed into the pipe are large enough to amortize message startup cost. Explicit Data Placement (XDP) [44, 45] is similar to Strip Mining Redistribution in that it facilitates pipelined data redistribution. A novel feature of this approach is the unified treatment of data and ownership transfer3 to allow for compiler-recognizable optimizations. Using a 3D FFT application as an example, XDP is shown to facili- tate overlapped data redistribution with algorithm computation. Unfortunately, no performance data illustrating the performance of the proposed method are given. Various message-level optimization techniques, such as message coalescing, mes- sage aggregation, and message pipelining [25] may prove useful in data redistribution. Message coalescing involves the combining of separate references to the same array. Message aggregation combines messages from distinct arrays to the same processor. Message pipelining attempts to hide communication latency through the separation 3In contrast to HPF’s owner-computes rule. 
of send and receive primitives. Message aggregation, for example, could be utilized to combine aligned data in a common message buffer for redistribution to another processor. Techniques for applying these optimizations to the data movement problem, when the owner-computes rule is relaxed, are presented in [46].

Prior to data redistribution, whether invoked with !HPF$ REDISTRIBUTE or at a subprogram interface, the logical to physical processor mapping is already determined for the initial data distribution, i.e., D_s; however, it is possible to manipulate the data element to logical processor mapping in the target, i.e., the mapping of P_t onto D_t, so long as the semantics of the target distribution pattern are not violated. For instance, in Fig. 2.2 (b), the mapping of processors to data elements is in increasing processor number order, i.e., 0, 1, 2, 3. This mapping, however, need not be in increasing order; the mapping order could be permuted to be 0, 2, 1, 3. The alternative mapping does not violate HPF semantics and can lead to possible run-time optimizations. In Chapter 3, we present a theory for permuting the processor-to-data mapping for HPF patterns.

Wakatani and Wolfe [47] propose a technique for mapping logical to physical processors in a data redistribution library that can reduce communication overhead. This technique assumes an underlying torus topology and maps the data in such a way that communicating processors are partitioned into non-overlapping sets, i.e., there is no channel contention between sets. Thus, by keeping communication local to a group, message contention on the physical network, presumably, is reduced. The drawback of this approach is that the logical to physical data mapping is based on "local" information. In other words, the technique imparts a mapping based solely on what is best to optimize the redistribution itself. However, since the logical to physical data mapping must remain consistent throughout the execution of a data-parallel program, the chosen mapping may increase communication cost elsewhere. We assert that the compiler, privy to global information, i.e., the entire data-parallel program, ought to determine the logical to physical processor mapping.

Chapter 3

Data Mapping Optimization

Since interprocessor communication is the predominant source of redistribution execution time, minimizing the total amount of data movement among processor memories may increase redistribution performance. This chapter presents a technique, based on logical processor to data element mapping, that minimizes the total amount of data movement among processor memories for BLOCK to CYCLIC(c), and vice versa, redistributions. We present mapping functions for one-to-one, m-to-m, one-to-two, and two-to-one dimension data redistributions, and we prove their optimality. The proposed methodology is architecture-independent, facilitating its potential integration into distinct redistribution implementations for different distributed-memory architectures.

In Chapter 6, we discuss the possible impacts on the programmer and compiler of using the optimal mapping technique. We show that the technique offers the programmer extra flexibility in determining data placement in some instances. Additionally, the technique could be used in a straightforward manner by a compiler. We demonstrate redistribution performance improvements over the traditional data-processor mapping of up to 40% using the optimal technique on an IBM SP-x.
3.1 One-dimensional Logical Processor Mapping

We begin by illustrating the utility of the mapping technique for one-dimensional data. Figure 3.1 illustrates an initial BLOCK distribution of a sixteen-element data array, A, onto eight processors and a subsequent redistribution of the array using the CYCLIC pattern. The mapping of the initial distribution of data onto the physical nodes of the machine is established at program initialization; see the arrow labeled (1). Subsequent redistribution among processors must retain a consistent logical to physical processor mapping; see the arrow labeled (2). Logical processor IDs (lpids) are in bold italics and are superimposed over the data. Each processor owns two contiguous elements under BLOCK, but two non-contiguous elements, with a stride of p, the number of processors, under CYCLIC. The lpids are mapped in increasing numerical order as specified in [3]. We claim that this is an unnecessary restriction, as any permutation of the lpids, 0..7, conforms to the semantics of the CYCLIC pattern, since data distribution to logical processors in HPF has no concept of the "ordering" of processors. CYCLIC requires only that the global data elements owned by a processor have global indices that are separated by a stride of p. For clearer presentation, we view the data as being static, and we manipulate the processor mapping to the data.

[Figure 3.1: Data distribution and redistribution onto physical nodes. DISTRIBUTE A(BLOCK) ONTO P(8), performed by the compiler (1), followed by REDISTRIBUTE A(CYCLIC(1)) ONTO P(8), performed at run time (2); logical processor IDs are shown in bold italics.]

Figure 3.2 illustrates the benefit of permuting the lpids when mapping them to data elements. Using the same sixteen-element array of Fig. 3.1, we show two alternatives for redistributing the array cyclically. Choice 1 shows the conventional cyclic mapping of lpids to data. This results in only two of sixteen data elements (marked with filled rectangles) remaining on the same processor following redistribution. We call this the number of data hits among all processors. Choice 2 shows an alternative cyclic mapping with the lpids permuted, 0, 4, 1, 5, 2, 6, 3, 7, which results in a total of eight data elements, one per processor, remaining in their original memories. This eliminates the exchange of six data elements among processors. Another advantage is that processors 1,...,6 must send data to two processors when using Choice 1, but to only one processor each when using Choice 2, thus reducing the number of destinations. Each of these factors reduces interprocessor communication overhead. The reduction in redistribution cost for the example in Fig. 3.2 is small given the size of the example. If we extend A to sixteen million data elements with the same permutation of lpids, we eliminate the exchange of six million data elements across processors (assuming the number of processors is kept the same, the block size in BLOCK is two million, and the block size in CYCLIC(c) is one million).
[Figure 3.2: Logical processor to data mapping alternatives. Choice 1, the conventional cyclic mapping, yields 2 data hits; Choice 2, the permuted mapping, yields 8 data hits.]

3.1.1 Processor Mapping Technique

To consistently minimize the size of data transfer for arbitrary data block and processor set sizes, we develop a systematic method for determining a permutation of lpids to map to the data. The technique ensures that each processor retains the maximum amount of data possible while conforming to the semantics of the source and target distribution patterns, D_s and D_t.

We establish the upper bound on the ratio of the amount of data that can be retained on p processors to the total number of data elements, n. We present a function for determining an lpid to data mapping for redistribution from BLOCK to CYCLIC(c) that achieves the upper bound. (Redistribution is symmetric in terms of the amount of data movement among processors, i.e., redistribution from BLOCK to CYCLIC(c) or from CYCLIC(c) to BLOCK results in an equal amount of data movement.) We make the following assumptions:

1. Let p be the number of processors, numbered 0..p − 1, and n be the total number of data elements, numbered 0..n − 1, distributed over p. We assume each processor owns b elements, thus b = n/p.

2. We consider redistribution between the BLOCK and CYCLIC(c) patterns; the BLOCK(b) pattern, where b is a variable, is not considered; b = n/p by assumption 1. For CYCLIC(c), we assume that c divides b, i.e., b = cz for an integer z.

3. If the data is initially distributed among p processors, then we assume that the data is redistributed among p processors.

Extensions of the above assumptions to the general case will be discussed in Chapter 7.

Definition 1  Define r as the number of data elements of the global array that remain (data hits) on their original processors following redistribution between BLOCK and CYCLIC(c). Define HitRatio as r : n and MaxHitRatio as the upper bound on HitRatio.

Lemma 1  Let b and c be the block sizes in the BLOCK and CYCLIC(c) patterns, respectively. If n = bp and z = b/c, where z is an integer, then MaxHitRatio = ⌈b/(cp)⌉cp : n and HitRatio ≤ MaxHitRatio.

Proof:  Case 1: z = ip for some integer i ≥ 0. For every cycle of cp data elements, each lpid_j, 0 ≤ j ≤ p − 1, maps to c contiguous data elements with CYCLIC(c). Since b = icp, there are i complete cycles of lpids that map to one complete data block owned by lpid_i under the BLOCK pattern. Processor lpid_i will map to exactly ic of the elements, i.e., there are ic data hits. For p processors, the number of data hits is icp = ⌈b/(cp)⌉cp. HitRatio is ⌈b/(cp)⌉cp : n.

Case 2: (i − 1)p < z < ip for some integer i > 0.
Since b < icp, there cannot be i complete cycles of lpids mapped to lpid_i's original block of data under BLOCK. Thus, the number of hits can be no greater than ic for each lpid; consequently, the number of hits across all processors can be no greater than icp. The remainder of the proof follows as in Case 1. □

MaxHitRatio is achieved with any permutation of lpids when z is an integer multiple of the number of processors, i.e., when z = ip. However, our goal is to achieve the upper bound for all values of z with (i − 1)p < z < ip. To satisfy this aim, we must consider different permutations of the lpids to maximize the number of data hits. Figure 3.2 demonstrates that not all permutations of lpids yield MaxHitRatio. Next, using the semantics of D_t, we define a function for determining a permutation of lpids to map to the data array.

3.1.2 Mapping Function

Define a p-tuple, (q_0, q_1, q_2, ..., q_{p−1}), as p place holders. Let f : i → j be a function that maps lpid_i to place holder q_j for 0 ≤ i, j ≤ p − 1. An assignment of each lpid to a place holder specifies a permutation of the lpids and represents a mapping of lpids to data elements for the CYCLIC(c) distribution pattern. There are p! possible permutations for p processors, and many of the permutations may yield a ratio of MaxHitRatio. However, exhaustively testing each permutation to determine whether it produces the ratio would be impractical, since this would require an exponential amount of computation for general p. Therefore, we present a function for determining a permutation that achieves MaxHitRatio for given b, p, and c. Each lpid_i, 0 ≤ i ≤ p − 1, maps to a unique q_{f(i)}. Equation (3.1) specifies the function that maps lpid_i to q_{f(i)}.

    f(i) = (iz) mod p,   0 ≤ i ≤ p − 1,   z = b/c.     (3.1)

The intuition behind the mapping function is to first view the place holders, q_j, as a circular list. The function maps lpid_0 to place holder q_0, maps lpid_1 z places from lpid_0, maps lpid_2 z places from lpid_1, and so on. In general, we map lpid_{i+1} (z mod p) places from lpid_i. Figure 3.3 illustrates the behavior of f applied to different values of z and p. For simplicity, we choose c = 1, which reduces D_t to CYCLIC. Part (a) shows an example for p = 7 and z = 4, while part (b) illustrates the case for p = 6 and z = 4. The mapping is broken into rows to better illustrate the distance z = b/1 between consecutive lpids. The distinction between the two examples is that f : i → j is one-to-one in part (a), but f is not one-to-one in part (b); that is, in part (b), more than one lpid maps to some locations, while no lpids map to other place holders. More formally, depending on the values of z and p, it is possible that f(i) = f(j), i ≠ j. While f yields one permutation for part (a), it produces six possible permutations for part (b), all optimal in terms of MaxHitRatio: since lpid_0 and lpid_3 map to q_0, it turns out that we can arbitrarily map the two lpids to place holders q_0 and q_1. The same holds true for lpid_2 and lpid_5 to place holders q_2 and q_3, and for lpid_1 and lpid_4 to place holders q_4 and q_5. We shall prove this result in a later lemma.

[Figure 3.3: Mapping lpids to place holders. (a) p = 7, z = 4; (b) p = 6, z = 4.]

3.1.3 Optimality of Mapping Function

Given arbitrary z and p, the value of gcd(z,p) determines whether or not f is one-to-one. Lemmata 2 and 3 establish this. Lemmata 4 and 5 establish that f achieves MaxHitRatio whether or not it is one-to-one.
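For concreteness, the following C fragment (an illustrative sketch, not part of any library described in this dissertation) builds the permutation defined by Equation (3.1), resolving collisions, which arise when gcd(z,p) > 1, by taking the next free place holder, one of the optimal resolutions allowed by Lemma 5 below, and then counts data hits by brute force. For the parameters of Fig. 3.2 (n = 16, p = 8, c = 1) it reports eight hits, matching Choice 2.

#include <stdio.h>

/* Illustration of the lpid permutation of Equation (3.1) for a
 * BLOCK -> CYCLIC(c) redistribution of n elements over p processors.
 * Collisions of f (possible when gcd(z,p) > 1) are resolved by taking
 * the next free place holder, one of the optimal choices allowed by
 * Lemma 5.  The sizes below reproduce the example of Fig. 3.2.        */
int main(void)
{
    const int n = 16, p = 8, c = 1;
    const int b = n / p;        /* BLOCK block size, assumption 1      */
    const int z = b / c;        /* z = b/c, assumed to be an integer   */
    int perm[64];               /* perm[j] = lpid placed at holder q_j */
    int used[64] = {0};

    for (int i = 0; i < p; i++) {        /* Equation (3.1): f(i) = iz mod p */
        int j = (i * z) % p;
        while (used[j]) j = (j + 1) % p; /* next free holder on collision   */
        perm[j] = i;  used[j] = 1;
    }

    int hits = 0;
    for (int g = 0; g < n; g++) {
        int owner_block  = g / b;              /* owner under BLOCK          */
        int owner_cyclic = perm[(g / c) % p];  /* owner under permuted
                                                  CYCLIC(c)                  */
        if (owner_block == owner_cyclic) hits++;
    }
    printf("permutation:");
    for (int j = 0; j < p; j++) printf(" %d", perm[j]);
    printf("\ndata hits: %d of %d\n", hits, n);   /* prints 8 of 16 */
    return 0;
}

With these values the computed permutation is 0, 4, 1, 5, 2, 6, 3, 7, and the hit count equals the bound ⌈b/(cp)⌉cp = ⌈2/8⌉·1·8 = 8.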
Lemma 2  Let p and z be natural numbers and z = b/c. Let gcd(z,p) be the greatest common divisor of z and p. If gcd(z,p) = 1, then f : i → j establishes a one-to-one mapping between lpid_i, 0 ≤ i ≤ p − 1, and place holder q_j, 0 ≤ j ≤ p − 1. In other words, if gcd(z,p) = 1, then f(j) ≠ f(k) for all j, k with 0 ≤ j, k ≤ p − 1 and j ≠ k.

Proof:  Proof by contradiction. Assume gcd(z,p) = 1 and f(j) = f(k) for some j, k; choose k > j. Recall that the mapping function, f, maps each lpid_{i+1} (z mod p) places from lpid_i. Let r = k − j; then lpid_k is mapped a distance of r(z mod p) place holders from lpid_j. Since f(j) = f(k), lpid_j and lpid_k map to the same place holder, so their distance, mod p, is zero, i.e., rz ≡ 0 (mod p). Since gcd(z,p) = 1, p must divide r; but 0 < r < p, a contradiction. Thus, our assumption that f(j) = f(k) when gcd(z,p) = 1 is false. □

Lemma 3  Let p and z be natural numbers and z = b/c. If gcd(z,p) = k, then f maps lpid_{i+j(p/k)} to q_{f(i)} for 0 ≤ i ≤ p/k − 1 and 0 ≤ j ≤ k − 1.

Proof:  We show that f(i) = f(i + j(p/k)) for 0 ≤ i ≤ p/k − 1 and 0 ≤ j ≤ k − 1:

    f(i + j(p/k)) = ((i + j(p/k))z) mod p
                  = (iz + jz(p/k)) mod p
                  = ((iz) mod p + (jz(p/k)) mod p) mod p.

Since k divides z, jz(p/k) = mp for some integer m, and mp mod p = 0 for arbitrary m. Thus the second term of the sum vanishes, and we are left with f(i + j(p/k)) = (iz) mod p = f(i). □

We have established two lemmata that capture the behavior of f in Equation (3.1) for natural numbers z = b/c and p and shown the relationship of gcd(z,p) to the function f. In Fig. 3.3 (a), gcd(z,p) = 1 and thus f is one-to-one, while in part (b), gcd(z,p) = 2 and thus f maps two lpids each to place holders q_0, q_2, and q_4. Next we establish two lemmata that show f produces permutations that always yield MaxHitRatio.

Lemma 4  Let p and z be natural numbers and z = b/c. If gcd(z,p) = 1, then f determines a single permutation of lpids that achieves MaxHitRatio.

Proof:  Since f maps lpid_{i+1} z places from lpid_i, each lpid will map, under CYCLIC, to the first data element of its data block under BLOCK. Figure 3.4 illustrates the situation for arbitrary lpid_i. Thus, there is always at least one data hit per lpid.

Case 1: If z ≤ p, then there are exactly c data hits per lpid. The mapping cycle begins with lpid_i, ensuring c data hits. Since z ≤ p, which is equivalent to b ≤ cp, lpid_i cannot map to another c data elements of the b-sized data block; if it did, this would violate the semantics of a cyclic mapping. Thus, there are exactly c hits per processor, and cp hits over all processors. It follows that cp : n = ⌈b/(cp)⌉cp : n since b ≤ cp.

Case 2: If z > p, then there are at least c data hits per lpid as established above. Since lpid_i maps to the first element of its data block, it also maps to the (cp + 1)-th element, since z > p; see Fig. 3.4. Let j = ⌈z/p⌉ = ⌈b/(cp)⌉; then lpid_i maps to the elements numbered kcp, kcp + 1, ..., kcp + c − 1 of its block for 0 ≤ k ≤ j − 1 (data elements are numbered 0..b − 1 within the data block owned by lpid_i), a total of cj elements. Thus, there are jc = ⌈b/(cp)⌉c data hits per lpid and ⌈b/(cp)⌉cp data hits over all lpids. Again, ⌈b/(cp)⌉cp : n is achieved. □

[Figure 3.4: lpid_i is mapped to the first element of its data block.]

Lemma 5  Let p and z be natural numbers and z = b/c. If k = gcd(z,p) > 1, then the k lpids that map to q_{f(i)} under f can be remapped, in any of k!
ways, to the place holders q_{f(i)}, q_{f(i)+1}, q_{f(i)+2}, ..., q_{f(i)+k−1}, for 0 ≤ i ≤ p/k − 1. All (p/k)·k! such permutations yield MaxHitRatio.

Proof:  In Lemma 3, we established that if k > 1, then k lpids, namely lpid_{i+j(p/k)} for 0 ≤ j ≤ k − 1, map to the same place holder, q_{f(i)}. Consequently, k lpids map to the first data element of lpid_i's data block under BLOCK.

Case 1: If z ≤ p, then b ≤ cp, and there are exactly c data hits per lpid. Figure 3.5 illustrates the situation. If z ≤ p, then k ≤ z. Thus, ck ≤ cz and cz = b. Furthermore, the k lpids with c data elements each can "fit" into lpid_i's b-sized data block. Each lpid that mapped to place holder q_{f(i)} can be remapped to one of the first ck data elements regardless of the permutation of the k lpids. Since lpid_i is in this group, there are c data hits for it. lpid_i cannot appear again within the data block since b ≤ cp. Therefore, there are exactly c data hits for lpid_i, and cp = ⌈b/(cp)⌉cp data hits in total among all p processors.

[Figure 3.5: The case z ≤ p.]

Case 2: If z > p, then b > cp and there are at least c data hits per lpid as established above. For some integer m, if mcp < b < (m + 1)cp, then there are at least mc data hits for lpid_i, since all p lpids map at least m cycles to a b-sized data block. We claim that there are (m + 1)c data hits, even though b < (m + 1)cp. Figure 3.6 illustrates the situation. There are m groups of p processors mapping c-sized blocks to lpid_i's data block under BLOCK. We must show that lpid_i maps to one of the last b − mcp elements of the data block; in other words, we show that ck ≤ b − mcp. Proof by contradiction: assume ck > b − mcp. There exist two integers, r and s, such that z = rk and p = sk. Then

    ck > b − mcp  implies  k > z − mp = k(r − ms).

Since mcp < b, we have z − mp > 0, so r − ms ≥ 1 and hence z − mp = k(r − ms) ≥ k, contradicting k > z − mp. Thus c·gcd(z,p) ≤ b − mcp. Given this result, lpid_i's c-sized block must appear within the last b − mcp elements of the data block regardless of the permutation order. Therefore we have (m + 1)c data hits per lpid, and (m + 1)cp for p processors. □

[Figure 3.6: The case z > p.]

Lemmata 1 through 5 prove the following result.

Theorem 1  For redistribution from BLOCK to CYCLIC(c), where b and c are the respective block sizes, z = b/c for an integer z, p is the number of processors, and n = bp is the number of global data array elements, the logical processor mapping function in Equation (3.1) achieves MaxHitRatio = ⌈b/(cp)⌉cp : n.

3.2 Multidimensional Logical Processor Mapping

This section extends the logical processor mapping technique presented in Section 3.1 to m-dimensional data arrays. Specifically, we extend the technique to optimizing the logical processor mapping when redistributing from (BLOCK, BLOCK, ..., BLOCK) to (CYCLIC(c_0), CYCLIC(c_1), ..., CYCLIC(c_{m−1})). Additionally, we demonstrate the approach for redistribution of two-dimensional data that is (1) initially distributed in both dimensions and subsequently redistributed in only one dimension, e.g., (BLOCK,BLOCK) to (CYCLIC,*), and (2) initially distributed in one dimension and then redistributed in both dimensions, e.g., (BLOCK,*) to (CYCLIC(c_1),CYCLIC(c_2)).
3.2.1 m-dimensional Redistribution

We extend the technique by applying the one-dimensional lpid mapping to each dimension of the global data array. Let A be an m-dimensional data array with n_0 × n_1 × n_2 × ... × n_{m−1} data elements. The PROCESSORS directive in HPF declares a processor arrangement specifying the name, rank, and extent in each dimension. Let P = (p_0, p_1, ..., p_{m−1}) be the processor arrangement over which A is distributed. Let B = (b_0, b_1, ..., b_{m−1}) and C = (c_0, c_1, ..., c_{m−1}) be the sets of block sizes and cyclic block sizes, respectively, for each dimension of A. We extend Equation (3.1) to Equation (3.2) to map the rectilinear set of p_j lpids to the n_j data elements in the j-th dimension for 0 ≤ j ≤ m − 1.

    g(i) = (i·z_j) mod p_j,   z_j = b_j/c_j,   0 ≤ i ≤ p_j − 1,   0 ≤ j ≤ m − 1.     (3.2)

For m-dimensional data redistribution, MaxHitRatio is very similar to the ratio presented in Section 3.1; it is the product of the MaxHitRatios for each of the m dimensions. We present a lemma to substantiate this result.

Lemma 6  MaxHitRatio for an m-dimensional global array A as defined above is (⌈b_0/(c_0p_0)⌉c_0p_0 × ... × ⌈b_{m−1}/(c_{m−1}p_{m−1})⌉c_{m−1}p_{m−1}) : (n_0 × ... × n_{m−1}), and the mapping function in Equation (3.2) achieves the upper bound.

Proof:  Lemmata 1 through 5 established that ⌈b/(cp)⌉cp : n is the MaxHitRatio for one-dimensional data and that the mapping function, f, achieves the upper bound. For m-dimensional data, the product of the upper bounds in each dimension yields the maximum data hit ratio for the m-dimensional array. The function g is a generalization of f. By applying g to each dimension in turn, the upper bounds are achieved. □

Figure 3.7 illustrates g applied to an 18 × 16 data matrix distributed across a 3 × 4 processor grid; lpids are in bold italics, and subscripts in the figure label denote the number of processors in the given dimension. We redistribute the data matrix from (BLOCK,BLOCK) to (CYCLIC(3),CYCLIC(2)). The lpid to data mappings for D_s in both dimensions are marked Block. The traditional cyclic mapping of lpids to data is indicated with Trad, while the optimized technique using g is shown with Opt. The key to the right of the figure indicates data hits following redistribution for the traditional and optimized mappings. The traditional mapping yields HitRatio = 1 : 12. The mapping using g results in HitRatio = (⌈6/9⌉(3×3) × ⌈4/8⌉(2×4)) : (18)(16), which reduces to 1 : 4, a three-fold increase in the number of data hits over the traditional mapping.

[Figure 3.7: (BLOCK_3,BLOCK_4) to (CYCLIC_3(3),CYCLIC_4(2)) redistribution.]

Theorem 1 and Lemma 6 prove the following result.

Theorem 2  For redistribution from (BLOCK, BLOCK, ..., BLOCK) to (CYCLIC(c_0), CYCLIC(c_1), ..., CYCLIC(c_{m−1})), where b_i and c_i are the respective block sizes, z_i = b_i/c_i for an integer z_i, p_i is the number of processors, and n_i = b_i·p_i is the number of data elements in dimension i, 0 ≤ i ≤ m − 1, the logical processor mapping function in Equation (3.2) achieves MaxHitRatio = (⌈b_0/(c_0p_0)⌉c_0p_0 × ... × ⌈b_{m−1}/(c_{m−1}p_{m−1})⌉c_{m−1}p_{m−1}) : (n_0 × ... × n_{m−1}).

3.2.2 Two-dimensional to One-dimensional Redistribution

Redistribution of an m-dimensional array need not necessarily involve redistribution in all m dimensions. For instance, a data matrix may initially be distributed (BLOCK,BLOCK) and redistributed to (CYCLIC,*), in which only the row dimension is distributed. We extend the mapping technique to these cases. We define a vector of p = p_0p_1 place holders, q_0, q_1, ..., q_{p−1}. Equation (3.3) defines a new function that maps a Cartesian lpid to a place holder.

    h(r, s) = (rz) mod (p_0 × p_1),   z = b_0/c_0.     (3.3)
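As a small illustration of Equation (3.3) (an illustrative sketch only; the parameter values p_0 = 3, p_1 = 2, and z = 4 are chosen for demonstration), the fragment below computes h(r,s) for every Cartesian lpid of a p_0 × p_1 source grid and shows which lpids share a place holder; Lemmata 8 and 9 below show that the colliding lpids may be spread, in any order, over the place holders that follow theirs without losing data hits.

#include <stdio.h>

/* Illustration of Equation (3.3): each Cartesian lpid (r,s) of a
 * p0 x p1 source grid is assigned a place holder in the flattened
 * destination vector by h(r,s) = (r*z) mod (p0*p1).  Lpids sharing a
 * row coordinate collide; by Lemmata 8 and 9 they may be ordered
 * arbitrarily among adjacent place holders.  Example values only.   */
int main(void)
{
    const int p0 = 3, p1 = 2, z = 4;   /* z = b0/c0 */

    for (int r = 0; r < p0; r++)
        for (int s = 0; s < p1; s++) {
            int h = (r * z) % (p0 * p1);   /* Equation (3.3) */
            printf("lpid(%d,%d) -> place holder q_%d\n", r, s, h);
        }
    return 0;
}

For these values the output groups lpid(0,0) and lpid(0,1) at q_0, lpid(2,0) and lpid(2,1) at q_2, and lpid(1,0) and lpid(1,1) at q_4, i.e., exactly p_0 place holders are mapped with p_1 lpids each, as Lemma 8 states for gcd(z,p_0) = 1.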
Lemma 7 establishes MaxHitRatio for two-dimensional to one-dimensional redistribution. Based on the value of gcd(z, p_0), Lemmata 8 and 9 establish that the mapping function h achieves MaxHitRatio.

Lemma 7  For redistribution between (BLOCK,#) and (CYCLIC(c_0),*), where # denotes any HPF distribution pattern, let b_0 be the block size in the row dimension of D_s and c_0 be the cyclic block size in D_t. MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1) for an n_0 × n_1 data matrix.

Proof:  The proof is quite similar to the proof of Lemma 1. Here, the two-dimensional grid of processors is remapped into a vector of place holders, since D_t is distributed in only one dimension. We substitute p_0p_1 for p in Lemma 1, and the remainder of the proof follows from that point. Thus ⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1 counts the number of hits in the row dimension. This number is then multiplied by the block size in the column dimension, b_1, to obtain the total number of data hits. □

We prove that the mapping function, h, yields MaxHitRatio by first establishing the correspondence between lpids and place holders and then showing that the mapping yields the given ratio. Since the column coordinate, s, of an lpid is not relevant to the value computed by h, all lpids with the same row coordinate, r, map to the same place holder, q_{h(r,s)}. Since 0 ≤ s ≤ p_1 − 1, there are exactly p_1 lpids for each r, 0 ≤ r ≤ p_0 − 1, that map to the same place holder. Thus, no more than p_0 place holders are mapped to under h. Lemma 8 establishes that h achieves MaxHitRatio when only lpids with the same row coordinate r map to the same place holder, while Lemma 9 establishes the result when kp_1 lpids map to the same place holder.

Lemma 8  Let z be a natural number such that z = b_0/c_0. If gcd(z,p_0) = 1, then exactly p_0 place holders are mapped by h(r,s). The p_1 lpids that map to place holder q_{h(r,s)} can be remapped in any of p_1! ways to place holders q_{h(r,s)}, q_{h(r,s)+1}, q_{h(r,s)+2}, ..., q_{h(r,s)+p_1−1}. All (p_0)·p_1! such permutations yield MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).

Proof:  The first conjecture of the lemma follows directly from Lemma 2. The proof of the second part of the lemma is similar to the proof of Lemma 5. Instead of k = gcd(z,p) lpids mapping to the same place holder as in Lemma 5, we have p_1 lpids mapping to the same place holder, as established previously. The remainder of the proof follows after making this substitution. MaxHitRatio is the same result as in Lemma 5 except that b_1 must appear as a factor, since the number of data elements in a row contributes to the number of data hits. □

Lemma 9  Let z be a natural number such that z = b_0/c_0. If gcd(z,p_0) = k, then p_0/k place holders are mapped by h(r,s); h(r,s) maps lpid_{r+j(p_0/k),s} to place holder q_{h(r,s)} for 0 ≤ r ≤ (p_0/k) − 1 and 0 ≤ j ≤ k − 1. The p_1k lpids that map to place holder q_{h(r,s)} can be remapped in any of (p_1k)! ways to place holders q_{h(r,s)}, q_{h(r,s)+1}, q_{h(r,s)+2}, ..., q_{h(r,s)+p_1k−1}. All (p_0/k)·(p_1k)! such permutations yield MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).
Proof:  The first portion of the lemma follows directly from Lemma 3. The proof of the second part of the lemma is similar to the proofs of Lemmata 5 and 8. In the current situation, we have p_1k lpids that map to the same place holder. The remainder of the proof follows after making this substitution. MaxHitRatio is the same as in Lemma 8. □

Figure 3.8 demonstrates the mapping function, h, applied to a 24 × 16 matrix. The matrix is originally distributed (BLOCK,BLOCK) on a 3 × 2 processor grid and is redistributed (CYCLIC(2),*) on a six-processor vector. lpids are in bold italics as before; lpid_{r,s} identifies a processor in the Cartesian system, 0 ≤ r ≤ p_0 − 1 and 0 ≤ s ≤ p_1 − 1. The block size in the row dimension of (BLOCK,BLOCK) is eight (b_0 = 8); the cyclic block size in (CYCLIC(2),*) is two. In Fig. 3.8, we do not show a "traditional" mapping, since mapping from a set of Cartesian lpids to a vector of lpids is undefined in HPF. Using h to map the lpids produces a HitRatio of 96 : 384 = 1 : 4. The distribution of the columns of D_s is inconsequential to the number of data hits that can be achieved, since only rows are distributed in D_t. In other words, redistribution from (BLOCK,CYCLIC(c)) to (CYCLIC,*) would result in the same number of data hits as the redistribution of (BLOCK,BLOCK) to (CYCLIC,*). Additionally, any of the lpids with the same row index, r, could be permuted and the same data hit ratio would be achieved, e.g., lpid_{0,0} and lpid_{0,1}. Figure 3.9 shows an example of when gcd(z, p_0) > 1. Processors lpid_{0,s} and lpid_{2,s}, for 0 ≤ s ≤ 2, could be permuted in any of the 6! possible ways and the optimal data hit ratio would still be obtained.

Lemmata 7 through 9 prove the following result.

Theorem 3  For redistribution from (BLOCK,#) to (CYCLIC(c_0),*), where b_0 and c_0 are the respective block sizes in the row dimension, z = b_0/c_0 for an integer z, b_1 is the block size in the column dimension, p_0 × p_1 defines the grid of processors for D_s, and n_0 × n_1 is the number of global data array elements, where n_i = b_ip_i for 0 ≤ i ≤ 1, the logical processor mapping function in Equation (3.3) achieves MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).

[Figure 3.8: (BLOCK_3,BLOCK_2) to (CYCLIC_6(2),*) redistribution.]

[Figure 3.9: (BLOCK_4,BLOCK_3) to (CYCLIC_12(1),*) redistribution.]
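As a check, the 96 : 384 ratio quoted above for Fig. 3.8 is exactly Lemma 7's bound instantiated with the parameters given in the text (b_0 = 8, c_0 = 2, p_0p_1 = 6, b_1 = 16/2 = 8, and an n_0 × n_1 = 24 × 16 matrix); the worked instance is:

\[
\left\lceil \frac{b_0}{c_0\,p_0\,p_1} \right\rceil c_0\,p_0\,p_1\,b_1 : (n_0 \times n_1)
  = \left\lceil \frac{8}{2\cdot 3\cdot 2} \right\rceil (2)(3)(2)(8) : (24)(16)
  = 96 : 384 = 1 : 4 .
\]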
3.2.3 One-dimensional to Two-dimensional Redistribution

Another possibility is to have a data matrix initially distributed in one dimension and then redistributed in two dimensions. We derive a new mapping function, Equation (3.4), for redistribution from (BLOCK,*) to (CYCLIC(c_0),#). Since D_t specifies a two-dimensional data distribution, we declare a two-dimensional grid of place holders, q_[r,s], 0 ≤ r ≤ p_0 − 1, 0 ≤ s ≤ p_1 − 1. The function ℓ maps a processor lpid_i, 0 ≤ i ≤ p − 1, to a place holder q_[r,s]; thus, ℓ(i) computes an ordered pair. Lemma 10 establishes MaxHitRatio for one-dimensional to two-dimensional redistribution. Based on the value of gcd(z, p_0), Lemmata 11 and 12 prove that ℓ achieves MaxHitRatio.

    ℓ(i) = ((iz) mod p_0, ⌊i/p_0⌋),   0 ≤ i ≤ p − 1,   z = b_0/c_0.     (3.4)

Lemma 10  For redistribution between (BLOCK,*) and (CYCLIC(c_0),#), let b_0 be the block size in the row dimension of D_s and c_0 be the cyclic block size in D_t. MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1) for an n_0 × n_1 data matrix.

Proof:  The proof is quite similar to the proofs of Lemmata 1 and 7. In the single-dimension case of Lemma 1, the redistribution between BLOCK and CYCLIC(c) is considered. In the current situation, the first dimension of D_s and D_t is BLOCK and CYCLIC(c_0), respectively; thus, the proof of Lemma 1 can be applied. As with Lemma 7, we must include the factor b_1 for the second dimension. □

Lemma 11  Let z be a natural number such that z = b_0/c_0. If gcd(z,p_0) = 1, then all p place holders are mapped by ℓ(i) and the mapping yields MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).

Proof:  The proof is similar to the proof of Lemma 4: since ℓ maps lpid_{i+1} z places from lpid_i, each lpid will be mapped to the first row of its data block under the (BLOCK,*) pattern. Since lpid_i owns b_1 elements in the second dimension, this factor contributes to MaxHitRatio. □

Lemma 12  Let z be a natural number such that z = b_0/c_0 and p = p_0p_1. If gcd(z,p_0) = k, then (p_0/k)p_1 place holders are mapped by ℓ(i).
All place holders in the p_1 dimension are mapped, while only p_0/k place holders in the p_0 dimension are mapped. Exactly k lpids map to each place holder. The k lpids that map to place holder q_[r,s] can be remapped to place holders q_[r,s], q_[r+1,s], ..., q_[r+k−1,s] in any of k! ways and achieve MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).

Proof:  The proof is similar to the proofs of Lemma 5 and Lemma 9. In the current situation, there is a second, p_1, processor dimension, and all place holders in this dimension are mapped. The p_0 dimension in this case is the same as the one-dimensional mapping in Lemma 5. Since lpid_i owns b_1 elements in the second dimension, this factor contributes to MaxHitRatio. □

Figure 3.10 illustrates the lpid mapping function, ℓ, applied to a 24 × 24 matrix. The matrix is initially distributed (BLOCK,*) across six processors, and redistributed (CYCLIC(2),BLOCK) across a 3 × 2 processor grid. lpids (in bold italics) are superimposed on the matrix. The data hit ratio using the mapping function is ⌈b_0/(c_0p_0p_1)⌉(2)(3)(2)(12) : 24 × 24 = 144 : 576 = 1 : 4. Figure 3.10 is an example of when gcd(z, p_0) = 1, while Fig. 3.11 illustrates a situation where gcd(z, p_0) = 2. The latter figure demonstrates the flexibility of permuting lpids. For instance, processors lpid_0 and lpid_2 could be permuted and the same data hit ratio would be achieved.

[Figure 3.10: (BLOCK_6,*) to (CYCLIC_3(2),BLOCK_2) redistribution.]

[Figure 3.11: (BLOCK_12,*) to (CYCLIC_4,BLOCK_3) redistribution.]

Lemmata 10 through 12 prove the following result.

Theorem 4  For redistribution from (BLOCK,*) to (CYCLIC(c_0),#), where b_0 and c_0 are the respective block sizes in the row dimension, z = b_0/c_0 for an integer z, b_1 is the block size in the column dimension, p_0 × p_1 defines the grid of processors for D_t, and n_0 × n_1 is the number of global data array elements, where n_i = b_ip_i for 0 ≤ i ≤ 1, the logical processor mapping function in Equation (3.4) achieves MaxHitRatio = (⌈b_0/(c_0p_0p_1)⌉c_0p_0p_1b_1) : (n_0 × n_1).
Chapter 4

Data Redistribution Library

HPF defines dozens of language intrinsics that could be implemented as libraries, significantly simplifying the role of HPF compilers. In this chapter, we propose a library, DaReL, for HPF-style redistribution. Specifically, DaReL supports multi-dimensional data redistribution for HPF's regular distribution patterns, BLOCK, CYCLIC, and *. The library is designed for !HPF$ REDISTRIBUTE; however, we envision its applicability to implicit redistributions that can occur within HPF programs; see Section 2.5. We discuss a number of salient issues affecting the design of a redistribution library for HPF, including scalability and the respective roles of the compiler and the library.

We present a detailed overview of DaReL's design and motivate the critical design choices. In contrast to other approaches, e.g., [32], DaReL decouples processor send/receive set calculation from data exchange. We assert that this decoupling simplifies library design and facilitates multiple data exchange algorithm options. Data exchange is performed with MPI primitives, enhancing DaReL's portability among distributed-memory platforms that utilize the emerging message-passing standard. We discuss the advantages of using MPI for data redistribution, and we give a detailed description of DaReL's use of MPI point-to-point and collective communication, process topology, and derived datatype constructs. Performance and scalability analysis of DaReL is presented in Chapter 6.

4.1 Library Design Issues

There are a number of issues to consider in the design of a redistribution library for distributed-memory machines. For example, what are the advantages of implementing data redistribution as a library? What is the role of the HPF compiler vis-a-vis DaReL? How do we provide a flexible yet concise interface between the compiler and DaReL? What is the appropriate choice of message-passing mechanisms for implementing data movement among processors? What are the advantages of using MPI? How can we ensure DaReL's scalability? These issues are examined next.

4.1.1 Advantages of Software Libraries

The functionality of !HPF$ REDISTRIBUTE may be provided as a library that could be utilized by an eventual HPF compiler or programmer. Providing a rich set of libraries to data-parallel programmers is becoming increasingly popular because libraries offer some unique advantages [8]. For instance,

• A uniform interface provided by the library enhances the portability of programs among various distributed-memory architectures.

• Although a library provides a uniform interface to programs, the design and implementation of the library can explicitly exploit machine-specific features to provide better performance.

• The program development cycle for the consumer of the library can be shortened, as the correctness of libraries can be independently verified.
• The existence of libraries can simplify the tasks that data-parallel compilers would otherwise have to perform.

4.1.2 Role of HPF Compiler

The compiler usually translates data-parallel Fortran code into a set of message-passing SPMD node programs [25]. Correspondingly, the implementation of a redistribution library is message-passing code to be linked with the node programs generated by the compiler. The message-passing programs are in turn compiled by the native machine compiler (it is also possible that HPF compilers may generate object code directly from HPF source code). Given the number of possible data distributions, it would be impractical to provide different redistribution code for every possible HPF source/destination distribution scenario. A redistribution library ought to provide a generic redistribution capability for an arbitrary number of data and processor configurations and HPF distribution pattern possibilities. This is the approach taken in the design of DaReL.

Since redistribution does not contribute to the result of the computation being performed by the HPF program, its execution time must be minimized. If some of the functionality of !HPF$ REDISTRIBUTE can be performed at compile- rather than run-time, then the execution time of !HPF$ REDISTRIBUTE can be reduced, albeit at a cost of increased compile time. This, however, is typically preferable for data-parallel applications. In order to redistribute data as called for by a specific invocation of !HPF$ REDISTRIBUTE, a node program must know the identity of the processors from which it is to receive data (gather set) as well as the identity of the processors to which it must send data (scatter set), as explained in Section 2.4. It is possible for these sets to be distinct. Figure 4.1 shows 15 data elements distributed across 5 processors under the BLOCK and CYCLIC patterns, respectively. Processor 0 owns data elements 0, 1, 2 under BLOCK, but it owns elements 0, 5, 10 under CYCLIC. Redistribution from BLOCK to CYCLIC causes processor 0 to scatter to processors 0, 1, 2 while it gathers from 0, 1, 3; the scatter and gather sets are distinct. In addition to calculating the scatter/gather sets, the data elements which are to be sent, kept, and replaced must be identified.

[Figure 4.1: Distinct scatter and gather sets.]

Following processor and data set identification, the data must be exchanged among processors using message-passing primitives. The node programs use the communication services provided by the target machine to perform the data transfer. One could use unicast (point-to-point) or collective communication primitives, depending upon the services offered. The relative performance and scalability of the routines will be critical factors in deciding which set of primitives is most appropriate. We discuss communication primitive selection in greater detail in Section 4.2.

4.1.3 Use of MPI for Libraries

Data redistribution may utilize MPI to perform the necessary data exchanges. The primary advantages to using MPI are its portability, ease of use, and, in particular, its support for parallel library development [48]. MPI uses communication contexts, first proposed in Zipcode [49], to provide safe communication spaces for processes.
Communication contexts shield processes from unrelated, ongoing communication which otherwise may be interpreted, erroneously, as related to the current communication due to the asynchronous behavior of SPMD programs. This is particularly important in the context of libraries, as different processes may enter a library asynchronously [50]. MPI also facilitates the specification of arbitrary logical process topologies, with automatic support for Cartesian topologies. This mechanism frees the MPI programmer from defining communication in terms of hardware-dependent processor IDs. Process topologies are particularly important in the context of !HPF$ REDISTRIBUTE, as the programmer has similar flexibility in specifying logical Cartesian processor configurations in HPF with the !HPF$ PROCESSORS directive. (We use HPF's "processor" and MPI's "process" interchangeably since they are functionally equivalent.)

MPI specifies a number of point-to-point and collective communication primitives that can operate in conjunction with communication contexts and process topologies. The collective communication primitives that are of particular relevance to DaReL are the MPI_Scatter, MPI_Gather, and MPI_Alltoall primitives.

4.1.4 Scalability

To be viable for SPC architectures, scalability is an important design criterion for parallel software libraries. A redistribution library ought to achieve good performance within a reasonably large range of physical processor configuration sizes and for various sizes of data that are to be redistributed. The scalability of data redistribution will in large part be dictated by the scalability of the MPI communication primitives used; see Section 6.2.3. The scalability of MPI communication routines may vary across different distributed-memory platforms, as vendors are free to implement MPI routines in the most efficient manner possible and thus will attempt to optimize the primitives for their system. One can track the most efficient MPI routines as they are developed and incorporate them in the library. Additionally, factors such as the distribution patterns, the size of the data, and the number of processors used are the primary factors affecting redistribution performance.

Another factor which may influence the scalability of a redistribution library is the size of the data to be redistributed. Since the library must determine and store data ownership, the size of the data can become an important issue if the overhead for performing this function grows as a multiple of the data size.

4.2 DaReL Design Overview

DaReL is proposed as a scalable and portable MPI-based library to perform data redistribution. DaReL comprises two major modules: Compute Data Exchange Sets (CDES) and Exchange Data (ED). CDES determines the identity of the processes to (from) which a process sends (receives) data and which data elements are involved in the exchange. Using the ownership information computed by CDES, ED performs the actual data exchange. This section presents the design of these modules and DaReL's interface. We present a detailed explanation of DaReL's use of MPI process topologies, derived datatypes, and point-to-point and collective communication primitives.

4.2.1 DaReL Interface

Figure 4.2 presents the C-based interface between DaReL's calling environment (i.e., the SPMD application message-passing program created from HPF source by the compiler) and the library. All parameters, except the data to be redistributed, are unchanged by the library.
dim_size is the number of data and process dimensions. src_pconf and dest_pconf are the source and destination MPI process configurations (P_s and P_t), respectively. src_ptrns and dest_ptrns describe the source and destination patterns (D_s and D_t), chosen from BLOCK(b), CYCLIC(c), or *. src_global and dest_global specify the global indices owned by a process under D_s and D_t. local_data is the process' application data which is to be redistributed; this parameter is modified by the library and returned to the calling environment.

redistribute (
    short                  dim_size,     {IN}
    MPI_Comm               src_pconf,    {IN}
    MPI_Comm               dest_pconf,   {IN}
    struct distribution    src_ptrns,    {IN}
    struct distribution    dest_ptrns,   {IN}
    struct global_indices  src_global,   {IN}
    struct global_indices  dest_global,  {IN}
    datatype               local_data    {IN-OUT}
)

Figure 4.2: C-based interface for DaReL.

Figure 4.3 gives an explanation of each parameter in the interface. Each parameter is marked either IN or IN-OUT; the former indicates a value that the calling environment provides to the library, while the latter represents a value that is provided by the calling environment and is modified by the library upon return to the calling program. Though many of the parameters are passed as pointers, we use the IN, OUT, IN-OUT formalism of Fortran 90 [51] to indicate which variables are changed by DaReL's operation and which are not.

Figure 4.4 illustrates a Cartesian to process rank mapping. The source topology, P_s, is a (3 × 2) grid and P_t is a (2 × 3) grid. The figure demonstrates the association of a process in the source, s(2,1), and destination, d(1,2), grids and its rank within the process group to which it is mapped. The process' rank is then mapped to a physical processor ID of the machine. Since MPI message-passing routines use rank to identify processes, the association of process rank to physical processor ID is not relevant to DaReL, i.e., DaReL performs its function in MPI process topology space. The two Cartesian topologies in Fig. 4.4 have different shapes but the same size.

dim_size      The number of data and process dimensions; these must be the same to avoid ambiguity, although it is legal to have data dimensions which remain undistributed; such dimensions must be denoted by the special symbol * in the corresponding placeholder.

src_pconf     An MPI communicator which contains information regarding the source distribution process configuration. It stores the number of process dimensions, the number of processes in each dimension, i.e., P in the HPF example in Fig. 2.3, and the identity of the local process.

dest_pconf    An MPI communicator which contains information regarding the destination distribution process configuration. It stores the number of process dimensions, the number of processes in each dimension, i.e., Q in the HPF example in Fig. 2.3, and the identity of the local process.

src_ptrns     array[0..dim_size−1] of distribution structures; stores the pattern and block size information for the source distribution; each array element corresponds to the distribution for that dimension.

dest_ptrns    array[0..dim_size−1] of distribution structures; stores the pattern and block size information for the destination distribution; each array element corresponds to the distribution for that dimension.

src_global    array[0..dim_size−1] of global_indices structures; stores the global indices corresponding to the local process' data under the source pattern.
dest_global   array[0..dim_size−1] of global_indices structures; stores the global indices corresponding to the local process' data under the destination pattern.

local_data    array[0..dim_size−1] of type datatype; the actual data to be redistributed. The type of data, e.g., integer or real, is interpreted as a byte stream. Note: the data prior to redistribution could be passed as an IN and the redistributed data returned by another parameter as an OUT; we choose to provide one parameter and make it IN-OUT.

Figure 4.3: Parameters for the invocation of DaReL.

It is possible in HPF to have different sized configurations for the source and destination. With different sized configurations, a process may be a member of only one of the two configurations. The association between Cartesian coordinates and process rank must be known within DaReL for both configurations. For instance, if a process has a rank defined in the source topology, but not in the destination topology, then it will scatter data but not gather data.

[Figure 4.4: Different shaped process configurations. A process's position in the source and destination Cartesian grids is associated with its rank in the group [0..5], which in turn is mapped to a physical processor ID.]

To create a Cartesian topology, MPI provides MPI_Make_cart. In functionality, it is very similar to HPF's PROCESSORS construct, since it defines a logical topology in terms of the number of dimensions and the extent in each dimension. Since the compiler has a "global view" of the entire application and DaReL does not, it is reasonable for the compiler to determine global performance optimizations, such as the logical to physical processor mapping, as discussed in Section 2.6 (alternatively, command-line arguments or the run-time system may determine the best mapping [3]). DaReL inherits the processor mapping as determined by the calling environment. Thus, it follows that the creation of the Cartesian topologies, to be used within DaReL, should occur within the calling environment as well. There are two reasons for this. First, MPI_Make_cart facilitates choosing a specific logical to physical mapping with the reorder parameter; and second, the Cartesian topologies are most likely used in the calling environment already for message-passing that occurs prior to, and after, the invocation of !HPF$ REDISTRIBUTE. DaReL assumes that src_pconf and dest_pconf have been created with MPI_Make_cart in the calling environment.

4.2.2 Compute Data Exchange Sets (CDES)

CDES utilizes the MPI Cartesian topology routines to manipulate the process sets, since these routines naturally correspond to logical processor topologies as defined in HPF. Similar to [31], each process computes the process sets for data exchange based on the global data elements it "owns" (HPF's owner-computes rule) under each of D_s and D_t, respectively. DaReL relies on the calling environment to furnish global index information, which could be provided either by the compiler or the run-time system. To determine the scatter set, a process must identify the ownership of its "current" data under the destination distribution; in other words, by mapping D_t over the initial global indices, src_global, the scatter set is computed. To determine the gather set, the process identifies the ownership of its "future" data under the source distribution; in other words, by mapping D_s over the destination global indices, dest_global, the gather set is computed. Given the global indices owned by a process under each of D_s and D_t, CDES performs these computations for each data dimension as follows.
Consider one dimension of data. Given n data elements distributed across p processes, Equations (4.1) and (4.2) compute the process ID, p_i, that owns the data element with global index j for BLOCK(b) and CYCLIC(c), respectively, where 0 ≤ i ≤ p − 1 and 0 ≤ j ≤ n − 1. For *, a process owns all data in the corresponding dimension.

    p_i = ⌊j/b⌋     (4.1)

    p_i = ⌊j/c⌋ mod p     (4.2)

Since each dimension of data must have a corresponding distribution pattern and number of processes onto which the data is mapped, Equations (4.1) and (4.2) are applied to each dimension independently. Note that a process need not compute ownership for each data element it owns; rather, process ownership along each dimension suffices. Thus, for multidimensional data, a Cartesian process ID is obtained. The Cartesian process ID is subsequently used in an MPI message-passing primitive in ED. The identity of the source and destination distribution patterns is inconsequential to the operation of ED. Our approach alleviates the need for numerous closed-form expressions and algorithms [31, 32] for different source/destination pattern combinations. The computational complexity of CDES is O(n_0 + n_1 + ... + n_{m−1}), where n_0, n_1, ..., n_{m−1} are the data extents in each dimension of an m-dimensional redistribution. If CDES were to compute data ownership for each data element, rather than by dimension, the complexity would be O(n_0 × n_1 × ... × n_{m−1}).

Figure 4.5 illustrates the use of CDES by process (1,1) in a (BLOCK,BLOCK) to (CYCLIC,CYCLIC) redistribution to determine the set of processes to which it must scatter data. The source and destination process configurations are the same, 4 × 2. Under D_s = (BLOCK,BLOCK), process (1,1) owns global indices j_r, where 4 ≤ j_r ≤ 7, in the row dimension, and global indices j_c, where 0 ≤ j_c ≤ 1, in the column dimension. Since D_t is CYCLIC in each dimension, Equation (4.2) is applied to the row dimension, i.e., p_i = ⌊j_r/1⌋ mod 4 for 4 ≤ j_r ≤ 7, and to the column dimension, i.e., p_i = ⌊j_c/1⌋ mod 2 for 0 ≤ j_c ≤ 1. The ownership for scattering data in each dimension is shown in the right portion of the figure in bold italics.

[Figure 4.5: CDES example for process (1,1).]

Storing ownership information for each local data item is unnecessarily expensive in terms of memory cost. Thus, CDES builds an ownership record for each recipient of local data, and it builds a record for each sender of data for which the local process is a receiver. The description of ownership is done with displacement, block size, and stride arguments. The displacement specifies the location of the first element, the block size indicates the number of contiguous elements, and the stride specifies the stride between blocks. This ownership information is stored on a per-dimension basis, e.g., for a two-dimensional array, there are ownership records for the row and column dimensions, respectively. This method of storing ownership information is minimal in terms of storage cost and maps well to MPI derived datatypes, which will be explained in detail in Section 4.2.5.

4.2.3 Partitioning CDES Functionality

Figure 4.6 illustrates two high-level approaches to implementing CDES that differ, not in their functionality, but in their partitioning of tasks between compile- and run-time. In Fig.
4.2.3 Partitioning CDES Functionality

Figure 4.6 illustrates two high-level approaches to implementing CDES that differ, not in their functionality, but in their partitioning of tasks between compile- and run-time. In Fig. 4.6(a), all of CDES is performed at run-time, while in Fig. 4.6(b), CDES is partitioned between compile- and run-time. The motivation for "elevating" part of CDES to compile-time is to reduce program execution time at the expense of increased compile time. The compile-time portion of CDES involves partially computing the data exchange sets. Equations 4.1 and 4.2 could be evaluated at compile time if all pertinent values are known at compile time. The run-time portion of CDES completes the computation, using a process's own ID. The CDES compile-time service would be provided as a library to the HPF compiler, thus designated hpf.a. All run-time library services are found in node.a, so designated since they are linked by the node programs.

One would expect the option in Fig. 4.6(b) to be preferable in all situations, since it decreases program execution time. Unfortunately, the CDES compile-time module can only be applied if all relevant arguments, e.g., data size and number of processes, are known at compile time, as they are in the example code segment in Fig. 2.3. Since HPF allows these arguments to be variables whose values remain unknown until run-time, the CDES variant in Fig. 4.6(a) is necessary. The extra run-time computation needed is of little consequence, as CDES execution time is small relative to ED execution time; see Chapter 6. DaReL is currently implemented with the CDES version of Fig. 4.6(a).

Figure 4.6: Partitioning CDES. (a) Run-time only: both the process-independent and process-dependent portions of Compute Data Exchange Set (CDES) execute at run-time. (b) Compile- and run-time: the process-independent portion is computed at compile time and the process-dependent portion at run-time.

4.2.4 Exchange Data (ED)

ED performs the data exchange among the set of processes computed in CDES. ED utilizes MPI point-to-point or collective communication primitives depending upon the services offered on a particular system and their relative scalabilities. Additionally, ED uses MPI derived datatype facilities to avoid the overhead of packing data into a message buffer at the sender and unpacking data at the receiver. This section presents a more detailed explanation of the use of MPI in the ED module.

Given the communication pattern required by data redistribution, it is natural to consider the MPI collective communication services MPI_Scatter, MPI_Gather, and MPI_Alltoall in ED. Additionally, MPI point-to-point routines are viable choices as well. Broadcast, i.e., MPI_Bcast, is not a valid choice for redistribution since it sends the same data to each process. In some cases, redistribution performance using point-to-point message-passing may be better than that of redistribution using collective routines if the latter are not optimized for the target architecture. See Chapter 6 for more detail regarding performance.

Figure 4.7 illustrates the algorithm used by ED to perform data exchange using MPI point-to-point message-passing. In the point-to-point algorithm, each process sends Scount data elements of type Stype from an offset of Sdispls in Sbuf to all other processes. Stype is a derived datatype; see Section 4.2.5 for a discussion of DaReL's use of MPI derived datatypes. Each process receives data from all other processes, where the placement of the data in Rbuf is described by Rcount, Rtype, and Rdispls. The boolean arrays send and recv indicate whether the indicated process is a recipient or a sender of data, respectively.[5] [Footnote 5: From the local process's perspective, depending upon the data computed by CDES.] Recall from Fig. 4.1 that the scatter and gather sets may be distinct.
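As an illustration of the point-to-point scheme of Fig. 4.7, the following C sketch shows one way such an exchange loop might look with MPI-1 calls. It is a simplified rendering, not DaReL source: the arrays send, recv, Sdispls, Scount, Stype, Rdispls, Rcount, and Rtype are assumed to have been filled in by a CDES-like step, a fixed message tag is used, and error handling and the local copy are omitted.

    #include <mpi.h>

    void exchange_point_to_point(void *Sbuf, void *Rbuf,
                                 const int *send, const int *recv,
                                 const MPI_Aint *Sdispls, const int *Scount,
                                 MPI_Datatype *Stype,
                                 const MPI_Aint *Rdispls, const int *Rcount,
                                 MPI_Datatype *Rtype,
                                 MPI_Comm comm)
    {
        int my_rank, group_size, root, rank;
        MPI_Status status;
        const int TAG = 99;                   /* arbitrary tag for this sketch */

        MPI_Comm_rank(comm, &my_rank);
        MPI_Comm_size(comm, &group_size);

        for (root = 0; root < group_size; root++) {
            if (root == my_rank) {
                /* Send a derived-datatype view of Sbuf to each peer in the scatter set. */
                for (rank = 0; rank < group_size; rank++) {
                    if (rank != my_rank && send[rank])
                        MPI_Send((char *)Sbuf + Sdispls[rank], Scount[rank],
                                 Stype[rank], rank, TAG, comm);
                }
                /* Local contribution: copy the relevant portion of Sbuf to Rbuf
                   (copy routine omitted in this sketch). */
            } else if (recv[root]) {
                MPI_Recv((char *)Rbuf + Rdispls[root], Rcount[root],
                         Rtype[root], root, TAG, comm, &status);
            }
        }
    }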
Input: Sbuf
Output: Rbuf
For root = 0 To group_size − 1
    If (root == my_rank)
        For rank = 0 To group_size − 1
            If ((rank != my_rank) AND (send[rank] == TRUE))
                MPI_Send( Sbuf, Sdispls[rank], Scount[rank], Stype[rank] )
            End If
        End For
        If (send[root] == TRUE)
            Copy relevant portion of Sbuf to Rbuf
        End If
    Else If (recv[root] == TRUE)
        MPI_Recv( Rbuf, Rdispls[root], Rcount[root], Rtype[root] )
    End If
End For

Figure 4.7: ED using MPI point-to-point communication

Figure 4.8 illustrates the algorithm when one of the collective communication paradigms is used to exchange data. If MPI_Scatter or MPI_Gather is used, then a loop over all group members, where each group member is the root of one operation, is needed. For MPI_Alltoall, a single invocation suffices. Note that each of the collective communication routines in Fig. 4.8 uses the variant of the base operation with "v" appended, e.g., MPI_Gatherv. The "v" extends the primitive to allow for the specification of displacements from the starting address of a buffer, and it facilitates different sized data buffers to be specified for the distinct processes in the collective group. This flexibility is necessary for generic data redistribution as provided by DaReL. Each of the MPI point-to-point and collective routines takes a single communicator as an argument (not shown), and all participants in a collective operation must specify consistent arguments, e.g., all members must specify the same value for root.

Input: Sbuf
Output: Rbuf
Switch (collective_type)
    Case Scatter:
        For root = 0 To group_size − 1
            MPI_Scatterv( Sbuf, Sdispls, Scount, Stype, Rbuf, Rdispls, Rcount, Rtype, root )
        End For
    Case Gather:
        For root = 0 To group_size − 1
            MPI_Gatherv( Sbuf, Sdispls, Scount, Stype, Rbuf, Rdispls, Rcount, Rtype, root )
        End For
    Case All2all:
        MPI_Alltoallv( Sbuf, Sdispls, Scount, Stype, Rbuf, Rdispls, Rcount, Rtype )
End Switch

Figure 4.8: ED using MPI collective communication

When the communicators (process topologies) corresponding to P_s and P_t in a data redistribution are equivalent, then any of the MPI routines discussed above may be used. If P_s and P_t are the same size but have different shapes, one communicator suffices so long as an association between each process in the source communicator and each process in the destination communicator can be made when the topologies are created. When the source and destination topologies have different sizes, however, there is a problem using MPI_Alltoall, since at least one process is in either the source or the destination, but not in both. MPI_Scatter or MPI_Gather may be used, but only if a new group communicator is created for each scatter or gather. Creating groups would contribute to the overhead of the entire operation and does not scale well as the group size increases. MPIX [52] proposes extensions to MPI that would overcome the above limitation of collective communication. Table 4.1 summarizes whether or not an MPI routine can be used, based on the relationship between the source and destination process topologies.

Table 4.1: MPI Communication Primitive Choices

    Primitive     |P_s| = |P_t|, same shape    |P_s| = |P_t|, distinct shapes    |P_s| != |P_t|, distinct shapes
    Pt-to-pt      Yes                          Yes                               Yes
    Scatter       Yes                          Yes                               Yes
    Gather        Yes                          Yes                               Yes
    All-to-all    Yes                          Yes                               No
    Broadcast     No                           No                                No
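To make the all-to-all case of Fig. 4.8 concrete, the following C fragment sketches a single MPI_Alltoallv call over one communicator covering both configurations. It is illustrative only, not DaReL source: unlike the pseudocode, MPI-1's MPI_Alltoallv accepts a single send datatype and a single receive datatype (here MPI_FLOAT), so the per-peer counts and displacements are expressed in units of that element type, and they are assumed to have been computed by a CDES-like step.

    #include <mpi.h>

    /* One-shot redistribution over a single communicator that contains both the
     * source and destination configurations.  A process with nothing to send to
     * (or receive from) peer i simply has sendcounts[i] (recvcounts[i]) = 0. */
    void exchange_alltoallv(float *Sbuf, float *Rbuf,
                            int *sendcounts, int *sdispls,
                            int *recvcounts, int *rdispls,
                            MPI_Comm comm)
    {
        MPI_Alltoallv(Sbuf, sendcounts, sdispls, MPI_FLOAT,
                      Rbuf, recvcounts, rdispls, MPI_FLOAT,
                      comm);
    }

As Table 4.1 notes, this single-communicator form applies only when the source and destination configurations have the same size.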
4.2.5 Standard and Derived Datatypes

To avoid costly packing of data at the sender and unpacking of data at the receiver, DaReL utilizes MPI's derived datatype facility. MPI has great flexibility in specifying datatypes for message passing. MPI manages the system resources for buffering messages and storing their internal representations. The buffers are not directly user-accessible; thus, MPI provides "handles" in user space to access system objects. This design hides the internal representations of MPI data structures, allowing for similar calls between dissimilar languages (C and Fortran), and also avoids typing-rule conflicts between these languages. MPI defines a number of standard datatypes, e.g., MPI_INT, MPI_FLOAT, MPI_CHAR, MPI_BYTE.

To specify the complex data mappings necessary for efficient redistribution, DaReL utilizes MPI's facility for derived datatypes. MPI standard datatypes facilitate specifying the data content of a message when the data is contiguous and of the same datatype. Derived types extend MPI datatype specification by allowing for non-contiguous and mixed-type data in a message buffer. DaReL makes use of the former feature (non-contiguous data) of derived types.

In the redistribution example of Fig. 2.3, redistribution requires processes to send multiple data items to a particular destination from locally non-contiguous locations. For large data sizes, it is impractical to send data items individually due to sending-latency overhead; thus, DaReL's approach is to aggregate data destined for a specific process. Data can be aggregated into the sending buffer, and retrieved into non-contiguous data locations at the receiver, by means of the MPI_Pack and MPI_Unpack facility, respectively. This approach, however, introduces extra overhead at both the sender and the receiver. At the sender, data is copied from the original non-contiguous locations in user space to the buffer for sending, also in user space; the inverse copying is performed at the receiver. The flexibility of MPI derived datatypes alleviates this costly packing/unpacking overhead.

DaReL uses the MPI_Type_vector and MPI_Type_struct primitives to specify the message buffers for the exchange of data. The former primitive facilitates replication of a datatype mapping onto non-contiguous memory locations. The ownership data computed by CDES is used to form the MPI_Type_vector arguments for each dimension of data. There is a limitation of MPI_Type_vector, however, when used in conjunction with collective communication primitives, necessitating the use of MPI's most general derived type constructor, MPI_Type_struct. We illustrate this limitation with an example involving the scatter primitive.

Figure 4.9 shows a BLOCK to CYCLIC redistribution in one dimension among three processes. We focus on the operation to be performed by process 1, which scatters its local data set to the other two processes (shaded) and retains one element of data itself. Using MPI_Type_vector, process 1 would create datatypes for each of processes 2 and 0, specifying a block size of 1, a stride of 3, and a total count of 2. The displacement from the beginning of the local data is not an argument to MPI_Type_vector, but it is an argument to MPI_Scatterv. Thus, process 1 would specify displacements of 0 and 1 for processes 2 and 0, respectively.
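For illustration, the MPI_Type_vector construction implied by this example might look as follows in C. This is a sketch under the stated assumptions (block size 1, stride 3, count 2, single-precision elements), not DaReL code; the name send_type is hypothetical.

    #include <mpi.h>

    /* Build the derived datatype process 1 would use to describe the elements
     * of its local BLOCK destined for one remote process under CYCLIC:
     * 2 blocks of 1 element each, 3 elements apart. */
    void build_scatter_type(MPI_Datatype *send_type)
    {
        MPI_Type_vector(2 /* count */, 1 /* blocklength */, 3 /* stride */,
                        MPI_FLOAT, send_type);
        MPI_Type_commit(send_type);
    }

The same datatype describes the data bound for processes 2 and 0; only the starting displacement passed to MPI_Scatterv differs (0 and 1), which is exactly where the extent problem discussed next arises.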
The difficulty arises in that MPI_Scatterv interprets the displacement as

    send_buf + displs[i] × extent(sendtype[i])

where send_buf is the sending buffer, displs[i] is the displacement for process i, and extent(sendtype[i]) represents the size of the datatype sendtype[i]. The extent of a datatype generated by MPI_Type_vector is calculated to be count × block size × stride. In the current example, the extent would be 6. Thus, the actual displacements as used by MPI_Scatterv would be 0 for process 2 and 6 for process 0. This, of course, is erroneous, since process 0 ought to start receiving data from displacement 1, as explained earlier. This problem occurs regardless of the choice of collective communication primitive. One alternative is to use the pack/unpack facilities described earlier. Of course, we wish to avoid the overhead that this alternative implies.

Figure 4.9: Limitation of derived types and collective communication (one-dimensional BLOCK to CYCLIC redistribution among three processes, showing block owners, local indices, and cyclic owners).

MPI provides a facility that allows the programmer to override the default datatype extent calculation. MPI_Type_struct together with MPI_UB provides the necessary functionality. MPI_UB is a "pseudo-datatype" used to mark the upper bound of a datatype. We use this facility to specify the extent of a datatype to be correct for use in a collective communication operation. MPI_Type_struct allows the most general datatype construction for this purpose. For the above example, we would define the upper bound of the datatype send_type to be 1. Thus, the displacements for processes 2 and 0 would be 0 and 1, respectively, as is needed for correct operation.

4.2.6 Scalability

Beyond the basic redistribution operation described in this chapter, DaReL ought to provide scalable performance over a large range of data and process configuration sizes. As alluded to earlier in this chapter, the performance of the MPI point-to-point and collective message-passing primitives may vary for different sizes of data and numbers of processes. Additionally, the choice of distribution patterns for D_s and D_t may affect redistribution performance. We assert that DaReL must provide the best performance possible for any given situation. Therefore, DaReL shall provide the ability to select, at run-time, which of the primitives from Table 4.1 is best suited to the given situation. While DaReL may not be privy to the mapping of the process topologies onto the physical processors,[6] it can determine the number of physical processors and the shape of the configuration onto which the logical topology is mapped. HPF supports two intrinsics, NUMBER_OF_PROCESSORS and PROCESSORS_SHAPE [3], for inquiring about the physical nodes. If the logical topology is larger than the number of physical processors in the underlying hardware, then more than one logical processor will be mapped to some or all of the physical nodes. If the aforementioned HPF intrinsics are supported in the environment within which DaReL is operating, then they may provide useful performance information for selecting the appropriate primitive.

[Footnote 6: The calling environment may provide this information.]

Chapter 5

Quantifying Scalability

The term scalability has been used extensively in the parallel processing community to characterize the ability of parallel architectures and algorithms to exhibit greater performance as more processors are employed to solve a problem.
However, the term “performance” is equivocal as it may be used to identify widely varying algorithmic or architectural properties: speedup, execution time, processor speed or efficiency, or the quality (accuracy) of a solution. Many of these properties have been used either separately or in conjunction to describe the scalability of parallel algorithms and architectures. The ambiguity regarding scalability leads us to ask several questions: What is a scalable algorithm and a scalable architecture, and how are their relative scalabilities related? Can a universal definition of scalability be applicable to all scenarios? How can scalability be quantified? If scalability can be quantified, which property best describes it, or is scalability a combination of factors? This chapter shall attempt to address these questions. A scalable architecture is one which is designed to offer a pr0portional performance 82 83 increase with a balanced increase in the number of processors, memory capacity and bandwidth, and network and I/ O bandwidth of the machine without changing the ba- sic underlying design. Typically, architectures based on a distributed-memory model have been characterized as scalable due to their ability to provide such balanced in- creases. While the above definition captures the essence of scalability as it relates to parallel architectures, it does not provide guidance for assessing relative machine scalabilities. It does not suffice to simply compare the peak performance of machines since sustained performance typically provides a more accurate measure of the per— formance seen by the end-user. Sustained performance, however, is a function of the parallel algorithm used. In fact, not only the algorithm, but its implementation af- fects sustained performance. The implementation of an algorithm includes its data decomposition, message-passing paradigm, degree of parallelism, and processor syn- chronization, all of which affect performance [8]. Henceforth, when we use the term algorithm, we mean parallel algorithm and its implementation detail. A scalable algorithm can utilize additional machine resources, e. 9., additional num- ber of processors, to increase algorithm performance. Increased algorithm perfor- mance may be measured as producing a more accurate result or producing an equiva- lent result in less time. As with the definition for scalable architecture, this definition describes the property of scalable algorithms but does not assist us in comparing relative scalabilities of parallel algorithms. The performance of an algorithm will be greatly influenced by the underlying machine. For instance, if an algorithm frequently performs barrier synchronization, the efficiency of the barrier implementation on the underlying machine will greatly affect the execution time of the algorithm. Similarly, 84 the memory capacity of the machine will influence the size of problem, or accuracy, that can be solved with an algorithm. We conclude that scalability quantification necessarily involves both the parallel algorithm and the parallel target machine. Thus, we shall examine the scalability of algorithm-machine (AM) pairs. A number of researchers have formulated scala- bility models based on algorithm—machine pairings, some of which we shall survey in Section 5.3. 5.1 Examples of Ambiguity in Scalability We demonstrate the equivocal nature of the term scalability and demonstrate the difficulty in quantifying the scalability of AM pairs with two examples. 
Our goal in the following examples is to assess which among several AM pairs is "the most scalable" for the application. In the first example, suppose we have two algorithms, A and B, and two machines, m1 and m2, from which to choose to solve a hypothetical problem. We define an end-user to be a consumer of the result(s) generated by an algorithm on a particular machine. We compare the performance of algorithms A and B executing a constant amount of work on m1, and the relative performance of machines m1 and m2 with the same algorithm, A. Figure 5.1 shows the execution times of pairs Am1, Bm1, and Am2 as a function of different sized processor configurations. On m1, B is superior to A up to approximately 35 processors, while A achieves lower execution times between 35 and 70 processors.[1] The execution times of Bm1 begin to increase after only 60 processors. Executing algorithm A, m2 is superior to m1 up to about 55 processors; m1 is preferable between 55 and 70 processors. The best execution time is achieved on 70 processors with Am1.

[Footnote 1: We are not interested in execution times beyond 70 processors since they begin to increase. We shall elaborate on this phenomenon in Section 5.5.]

Figure 5.1: Comparison of three AM pairs (execution time vs. number of processors, 0 to 100 processors).

The example clearly demonstrates the importance of considering the performance of an algorithm together with the parallel machine upon which it is executed. Even in this simple example, however, it is unclear which AM pair is the most scalable; to a large extent it depends upon the goals and constraints of the end-user. To determine which AM pair is the "most scalable," we need to answer the following questions. Does the end-user have access to as many processors of m1 or m2 as desired? Is the end-user's goal to achieve minimum execution time for constant work, as depicted in Fig. 5.1? Or is the user concerned with scaling work to achieve better solution accuracy? The answers to these questions will affect the relative "scalabilities" of the AM pairs.

In the second example, we compare two real matrix multiplication algorithms, a1 and a2, executing on the same machine M, an SP-x. Let the base amount of work be to compute the product of two 100 × 100 matrices of single-precision (4-byte) real numbers. We assume two hypothetical end-users, User1 and User2, each of whom wants to choose "the most scalable" matrix multiplication algorithm among a1 and a2 for his/her application. The difficulty in choosing "the most scalable" AM pair arises when the goals of each end-user differ. User1's application requires computing the product of a set of 100 × 100 matrices, and User1 thus wants to choose the fastest AM pair to perform these multiplications. User2 wants to increase (scale) the base work performed to achieve a more accurate result for his/her application and thus requires matrix multiplications up to a maximum size of 800 × 800. Thus, User2 is less concerned with execution time and more concerned with the quality of the solution. Figure 5.2 plots the execution time (vertical axis) as the matrix size is increased (horizontal axis) for the two algorithms a1 and a2 using 16-processor configurations. a1 is faster than a2 on a 100 × 100 matrix (dashed line). Even as matrix size is increased, a1 has consistently lower execution times than a2. This is due to an implementation detail of a1: one of the two matrix operands is replicated among all processors, thus avoiding interprocessor communication.
Such replication, however, limits the size of the matrix that can be computed, as the aggregate memory requirements grow quickly with increasing matrix size. Algorithm a1's large memory requirements limit its ability to scale the problem size beyond a 540 × 540 matrix computation. Algorithm a2, while slower, can scale up to an 800 × 800 matrix product. In a2, the matrix operands are distributed, thus reducing the per-node memory requirement but incurring interprocessor communication. Clearly, a1 is unable to satisfy the needs of User2; thus, a2 would be the most scalable choice. Algorithm a1, however, best suits the needs of User1 since it can execute sets of 100 × 100 products in less time.

Figure 5.2: Scaled work vs. execution time of two AM pairs (execution time vs. matrix scale factor, 16-processor configurations).

5.2 Goal-directed Scalability Metrics

The above example illustrates the difficulty in quantifying scalability, due primarily to the inability of a single metric to adequately capture variations in end-user requirements. We assert that a framework for quantifying scalability must account for end-user constraints and requirements. Rather than achieving higher speedups or processor efficiencies, our thesis is that the end-user is chiefly concerned with achieving the following criteria with an AM pair:

• By scaling the amount of work, increase the quality (accuracy) of the generated solution given more time and/or machine resources. The amount of acceptable execution time may be constrained by the end-user. The extent to which quality can be increased is a measure of the scalability of the AM pair.

• For a fixed solution quality, provide a large range of execution times and processor configuration sizes (number of processors) that can achieve the indicated solution quality. Large ranges will facilitate greater choices for the end-user. The size of these ranges is a measure of the AM pair's scalability.

This chapter presents a simple end-user-oriented framework for quantifying the scalability of AM pairs, together with a facility to aid in visualizing AM pair behavior. Specifically, we propose several metrics for quantifying the scalability of AM pairs based on the needs of the end-user as captured by the above criteria. We develop a mathematical framework to characterize the properties of AM pairs, e.g., communication overhead, and we propose a framework for visually illustrating an AM pair's performance over ranges of numbers of processors and scaled work. The benefit of a three-dimensional illustration is that a number of critical parameters, e.g., memory capacity and communication overhead, can be viewed together in a unified manner. Additionally, the end-user's chosen scalability metric(s) can be visualized as distinct points or lines along the generated surface. We present a detailed case study that illustrates the utility of our framework in comparing the relative scalabilities of distinct AM pairs for performing matrix multiplication. In Chapter 6, we examine extensions to this framework geared specifically to quantifying the scalability of data redistribution.

5.3 Metrics for Quantifying Scalability

Perhaps the most widely accepted metric for measuring the performance increase gained by parallel systems is speedup. Speedup is usually computed as the ratio of execution time on a single processor to execution time on a parallel machine. There are different categories of speedup: fixed-size, fixed-time, and memory-bounded speedup.
Using fixed-size speedup, where the amount of work remains constant, Ware [53] defined Amdahl's Law based on Amdahl's earlier research [54]. Gustafson et al. introduce the concept of scaled speedup [55], where the amount of work is increased, as the number of processors increases, to achieve a more accurate result. In this context, Gustafson defined fixed-time speedup [56], where execution time is fixed and the ability to scale work is measured. While Gustafson's scalability metric constrains execution time, the amount of aggregate memory may also impose a limitation on the ability to scale work. Memory-bounded speedup [57] measures the ability to scale work as the memory capacity increases with the number of processors.

A number of models for quantifying the scalability of algorithm-machine pairs have been proposed. Isoefficiency [58] measures scalability by assessing an AM pair's ability to maintain a constant processor efficiency as the problem size and number of processors are increased. In this framework, efficiency is defined to be the ratio of speedup to the number of processors. An extension to isoefficiency that accounts for the relation between the amount of memory used by the algorithm and the size of available machine memory is proposed in [59]. As a problem is scaled, however, it can be problematic to obtain execution time on a single processor due to lack of memory capacity. Obtaining single-processor execution time is a necessary component for measuring speedup.[2] The isospeed metric [57] removes the dependence on speedup by measuring an AM pair's ability to maintain average speed as the problem size and number of processors are increased. In this context, speed is defined to be the ratio of work to execution time. A comprehensive study of various scalability measures is presented in [60].

[Footnote 2: Systems that support virtual memory may obtain superlinear speedups.]

While measuring scalability in terms of efficiency or speed may be a viable approach, we believe that the primary focus ought to be the quality of the computed solution. Other researchers have developed benchmarks which measure a system's ability to generate quality solutions for a given fixed time (SLALOM [61]) or as a function of increasing time (HINT [62]). We strive to apply this concept of solution quality to measuring the scalability of an AM pair.

Definition 2. The Quality of Solution (QoS) is a measure of the accuracy or precision of a computed result. It is a function of the amount of work performed by an algorithm-machine pair.

We illustrate QoS with a simple example. Let the objective be to compute sin(x) to a specified accuracy. Let the range of needed accuracies be [5,35] significant digits. The higher the accuracy, the larger the discretization of the sin(x) curve. Thus, obtaining a higher accuracy may require more computation time and memory resources. For sin(x), we measure QoS by the number of digits of accuracy that are computable by an AM pair. Thus, QoS_min = 5 and QoS_max = 35. For four hypothetical AM pairs, Fig. 5.3 illustrates the maximum achievable QoS of each pair: 7.5, 20, 40, and 47.5, respectively. The dashed lines denote QoS_min and QoS_max, respectively. All AM pairs achieve QoS_min, while only pairs 3 and 4 achieve the maximum desired accuracy.

Figure 5.3: Quality of Solution for sin(x) (maximum achievable QoS of four hypothetical AM pairs).
If the end-user is concerned primarily with the solution accuracy, then processor speedup or efficiency may be of little use, since there is no guarantee of a correlation between those metrics and solution quality. However, these metrics may provide a secondary measure for assessing which of pairs 3 or 4 ought to be chosen, given their ability to provide the highest accuracies desired. Later in this section, we propose a metric to aid an end-user in choosing between these pairs.

We contend that end-users of scalable systems are primarily concerned with solution quality. How QoS is measured is application-dependent; however, as QoS is a function of work, we can use the amount of scaled work to determine an AM pair's scalability. Additionally, we can rank different AM pairs based on the magnitude of the achievable scaling.

5.3.1 Taxonomy of Scalability Measures

To quantify the absolute scalability of an AM pair, or compare the relative scalabilities of several AM pairs, we derive three scalability metrics. One or more metrics may apply to an end-user's situation, and the relative importance among different metrics is problem dependent. The first two scalability metrics quantify, respectively, the maximum work computable and the maximum work achievable with time constraints. The third metric is for work fixed at a level desired by the end-user; we observe the range of numbers of processors and execution times that can compute the fixed level of work. Let W represent the base amount of work.

Λ_∞ is the largest factor by which W can be scaled, i.e., Λ_∞ W represents the maximum amount of scaled-up work achievable on the AM pair. This value is bounded by either the aggregate memory capacity of the machine or QoS_max, whichever is smaller.

Λ_T is the largest factor by which W can be scaled on the AM pair when the execution time, T, is fixed by the end-user. Implicitly, Λ_T ≤ Λ_∞.

S(Λ_D) is the set of ordered pairs (N_j, T_j) that can compute Λ_D W, where the scale factor, Λ_D, is set by the end-user. Each pair (N_j, T_j) is the number of processors and the corresponding amount of execution time needed to compute Λ_D W on the AM pair.

Figure 5.4 is a generic illustration of this quantifier for an arbitrary amount of fixed work. Typically, as the processor configuration size increases, execution time decreases. However, the cost of additional communication may outweigh the benefit of splitting the work among more processors, resulting in increasing execution time. Thus, in Fig. 5.4, each point on the curve to the left of the dashed line represents one (N_j, T_j) pair in the set S(Λ_D). The number of pairs in the set is determined, in part, by the granularity by which the processor configuration size can be increased, e.g., for hypercubes the increment is a power of 2. Implicitly, Λ_D ≤ Λ_∞.

Figure 5.4: Execution time vs. number of processors tradeoff

For the first two metrics, the larger the value of the metric, the more scalable the AM pair. Determining the maximum work for a fixed time (Λ_T) is similar to the SLALOM [61] benchmark, which fixes computation time at one minute and determines how accurately an answer can be generated in that time. For the third metric, S(Λ_D), scalability is measured by the number of (N_j, T_j) pairs and the magnitude of the respective N_j and T_j ranges. One could use S(Λ_D) to determine the most scalable AM pair in the sin(x) example given earlier, where QoS_max is a function of Λ_D.
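As a minimal illustration of how the S(Λ_D) measure might be extracted from measurements, the following C sketch filters a table of measured (N, T) points for work Λ_D W, keeping only the configurations up to the minimum-time point (the solid portion of the curve in Fig. 5.4). The structure and helper names are hypothetical and not part of the dissertation's framework; the measured points are assumed to be ordered by increasing processor count.

    typedef struct { int nprocs; double time; } nt_pair;

    /* Keep the (N_j, T_j) pairs up to and including the global minimum execution
     * time; beyond that point, adding processors only increases time (the dashed
     * segment in Fig. 5.4) and is of no interest to the end-user. */
    static int s_lambda_d(const nt_pair *measured, int count, nt_pair *out)
    {
        int i, kept = 0, min_idx = 0;
        for (i = 1; i < count; i++)
            if (measured[i].time < measured[min_idx].time)
                min_idx = i;
        for (i = 0; i <= min_idx; i++)
            out[kept++] = measured[i];   /* the set S(Lambda_D) */
        return kept;                     /* number of (N_j, T_j) pairs */
    }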
In contrast to the sin(x) example, many applications, particularly in image and signal processing, have physical limitations on data set sizes and, therefore, the work does not scale [63]. The S(Λ_D) metric could be utilized by end-users of such applications, where Λ_D = 1.

5.4 Mathematical Framework

To quantify AM pair scalability with the metrics introduced in Section 5.3, the various properties that characterize algorithms and machines must be parameterized. We introduce a simple mathematical framework, similar to the model used in Amdahl's Law [54, 53], for this purpose. This model is intended to facilitate AM pair scalability quantification and is not intended as an absolute predictor of performance. A number of factors inhibit a precise prediction of algorithm performance on a parallel machine, e.g., overlap of computation and communication, cache effects, network contention, etc.

5.4.1 Algorithm and Machine Parameterization

Let W = W_1 + W_∞ represent the amount of work to be performed for an implementation of an algorithm, where W_1 is the amount of work that can only be performed sequentially and W_∞ is the amount of work that can be fully parallelized, i.e., W_∞ can take advantage of whatever number of processors is available. This model obviously provides the most optimistic case of an algorithm in terms of the degree of parallelism. For many applications, the amount of work can be characterized in terms of the data size of an application. For example, the amount of work for solving a system of n equations with n unknowns using Gaussian elimination is a function of n, i.e., (2/3)n³.

To characterize the machine upon which an algorithm executes, we assume a distributed-memory architecture composed of N homogeneous nodes. Each node consists minimally of a processor with computation capacity Δ, local memory of size m, and a communication router. The aggregate computation capacity and memory of the machine are ΔN and mN, respectively. Henceforth, we shall use the terms node and processor interchangeably. We define M(W, N) as the amount of memory capacity required to execute W work on N processors. Depending upon the algorithm, M(W, N) is a function of W or N or both. For instance, if the data size requirements of an algorithm increase with added computation, then the memory capacity requirement is a function of W. An algorithm where data is replicated among processors would have a memory capacity that is a function of N.

5.4.2 Fixed-work Problems

With the above parameters, we can formulate execution time in terms of computed work. Let t_1 = W_1/Δ and t_∞ = W_∞/Δ be the respective execution times of W_1 and W_∞ on one processor. Equation (5.1) defines the execution time of W work on N processors; it is the theoretical lower bound on execution time since it ignores communication overhead and does not account for the memory constraints of the machine.

    $T_N(W) = t_1 + \frac{t_\infty}{N}$    (5.1)
TN(W) =t1+ ’fi + h(W, N) (5.2) Similar to the discussion on memory capacity in Section 5.4.1, communication overhead h(W, N), may be a function of either the amount of work or the number of processors or both. In a five-point stencil application [64], nodes communicate with a fixed number of other nodes; thus, total communication overhead is not a function of N, but may be a function of W, e.g., it may depend on the number of loop iterations. In contrast, assume a particular matrix multiplication algorithm where the columns of one of the matrix operands are distributed evenly to the nodes of the machine. In the process of computing the matrix product, columns are shifted around the nodes 97 of the machine until each column has been to each node. Such a communication paradigm is dependent on W (or matrix size in this case) and on N. Furthermore, the implementation of communication primitives used in an algorithm may also affect communication overhead. For example, a naive implementation of broadcast, where the source node uses separate messages sent sequentially to inform all destinations, i.e., separate addressing, is an example of communication cost that is a linear function of N. In contrast, a tree-based implementation of broadcast would have logarithmic complexity, i.e., log2(N) [39]. In general, modeling communication overhead can be complicated since there are a number of issues to consider. Sending, receiving, and network latencies impact mes- sage transmission time. Second, different messages can contend for the same physical channels in a network causing increased network latency as message sizes increase [39]. Third, communication may overlap with algorithm computation. Message contention and communication/computation overlap can be very difficult to model precisely. We assert that modeling communication—overhead requires thorough analysis of the algorithm and the implementation of its message-passing primitives on the chosen machine. We elaborate on this process in Section 5.4.4. In the computation of TN(W), there is an implicit assumption that M (W, N) is not greater than the aggregate memory of the machine. If the memory requirement were greater than the aggregate memory capacity of the machine, then TN(W) would increase substantially due to the overhead of paging, if it is supported. Equation (5.3), therefore, defines the memory bound of an AM pair. Therefore, TN(W) is defined for 98 an AM pair if and only if Equation (5.3) holds. M(W, N) g mN (5-3) 5.4.3 Scaled-work Problems Scaling up the initial work can contribute to a higher QoS as discussed in Section 5.3. A higher QoS may include an increase in computation, e.g., increased number of iterations, and/or an increase in memory requirements. We define scaled work as WU“) 2 W1 + kW00 if the fully parallelizable portion of the algorithm, Woo, can be scaled up by a factor of k and the sequential portion of the algorithm remains constantf’. Obviously, expecting the sequential portion of work to remain constant as work is scaled up is an optimistic assumption. The model could be extended to account for a corresponding increase in the sequential portion of the algorithm. We omit such an extension at this time. The execution time of scaled work is given in Equation (5.4). TN(W(")) = t1 + “T” + h(Wl"), N) (5.4) Consistent with the formula for scaled work, Equation (5.4) multiplies the scaling factor, k, by the parallel portion of the algorithm. Communication overhead is ex- tended similarly to account for scaled work. 
As discussed previously, the derivation of communication overhead, h(W^(k), N), when k > 1, is necessarily algorithm-specific. Scaling the work may also increase the amount of memory required. Equation (5.5) specifies the increased memory as a function of k and the original work, where a is a real number.

    $M(W^{(k)}, N) = k^{a}\,M(W, N)$    (5.5)

When a = 0, scaling the work does not increase the memory required to perform the work. For instance, in a particular molecular dynamics application [65], force calculations are performed for a fixed amount of time; increasing the number of iterations requires additional work (computation) but not additional memory. In general, if a = 1, the memory increase is linear; if a = 2, the increase is quadratic, etc. In Section 5.6, we show an example where a = 2/3. Equation (5.6) specifies the scaled-work memory bound; it is a generalization of the inequality in Equation (5.3), i.e., in that case k = 1.

    $M(W^{(k)}, N) \leq mN$    (5.6)

AM pair a2M in Fig. 5.2 illustrates the potentially significant impact of the scaled-work memory bound. We shall explore more examples in greater detail in Section 5.6.

5.4.4 Deriving AM Pair Parameters

As our goal is to quantify the scalability of an AM pair, we must not only determine the appropriate formulas that characterize the AM pair but also determine the appropriate values for the chosen parameters. We assert that the determination of these values can be largely performed by algorithm analysis and empirical testing. We illustrate the determination of these parameters for several AM pairs in Section 5.6. The work parameters, W_1 and W_∞, can typically be determined by evaluating the major loop bounds of an algorithm together with its data decomposition among the nodes of the machine. The sizes of these bounds may vary with program data size. Modeling communication can be difficult, as discussed earlier. However, it is typically straightforward to determine the number of communication invocations, the size(s) of messages, and the number of processors involved. Parameterizing the complexity of the implementation of a communication primitive, i.e., a linear or logarithmic function of N, may be accomplished given access to vendor source code or by performing measurements.

5.5 A Three-dimensional Illustration of AM Pair Scalability

Typically, the work of a parallel application cannot be scaled without bound. As work is scaled, the memory size requirements of the algorithm may exceed the aggregate capacity of the machine, as expressed in Equation (5.6). Additionally, communication overhead may grow with an increase in the number of processors, as shown in Fig. 5.4. Given that we can characterize an AM pair with the mathematical framework of Section 5.4, we assert that it may be difficult to fully visualize the scalability of an AM pair or the relative scalabilities of a number of AM pairs. AM pairs may be characterized by a number of independent and complex formulas. A three-dimensional view of an AM pair, based on its mathematical characterization, may facilitate scalability analysis.

To illustrate our AM visualization, let us suppose a hypothetical AM pair whose scaled-work execution time and memory capacity requirements are given by Equations (5.7) and (5.8), respectively. These formulas were arbitrarily chosen for the purpose of illustration. Let n = 10 represent the data size of the problem to be solved by the AM pair.
Furthermore, let the parallelizable work be a function of the data size and let the sequential component be constant, i.e., W = W_1 + W_∞ = c_0 + n². Scaled work is expressed as W^(k) = c_0 + kn². Communication overhead, which is a function of k and n, is given by the last operand of Equation (5.7); c_0, c_1, and c_2 are small constants.

    $T_N(W^{(k)}) = c_0 + k\left(\frac{n^2}{\Delta N}\right) + h(k, n)$    (5.7)

The annotations on the surfaces of Figs. 5.13 and 5.14 are as follows:

• Lightest to darkest shadings represent execution time intervals: (0,2], (2,3], (3,4], (4,6). The shadings facilitate mapping the maximum work that can be accomplished in time T, i.e., the Λ_T measure. For example, suppose we want to find Λ_T when T = 3. The border between the (2,3] and (3,4] shadings depicts this fixed time value for the entire range of processors. The yellow point at the "top" of the (2,3] shading depicts the point at which maximum work is achieved when T = 3, occurring on 64 processors. Here, Λ_T = 4.8.

• Magenta: the upper bound on scaled work occurs with N_ub processors. The maximum work is Λ_∞ W. This point is labeled "Inf."

• Red: this line plots execution time over a range of processors for fixed work, Λ_D W, and thus illustrates the S(Λ_D) measure. In Fig. 5.14, the amount of work performed is Λ_D W, where Λ_D = 3. Points along the solid portion show the range of decreasing execution times on increasingly larger processor configurations, i.e., the set of (N_j, T_j) that achieve the desired level of work. The dashed segment illustrates increasing execution time due to the communication overhead of additional processors. The end-user would not be interested in the dashed segment, since equivalent execution times are obtainable on smaller processor configurations. Alternatively, the entire red line segment may denote QoS_max, the upper bound on the end-user's range of desired solution quality. Under this end-user constraint, the upper bound on the scaled-work axis would be replaced by 3.

Figures 5.13 and 5.14 are generic illustrations of an AM pair's performance. Different AM pair surfaces may vary greatly depending upon their T_N(W^(k)) and M(W^(k), N) functions and the range of available processors. It is precisely for this reason that we assert that presenting an AM pair in this manner is of particular importance. In the next section, we demonstrate the utility of our approach in comparing the relative scalabilities of three different AM pairs for performing matrix multiplication.

5.6 Case Study: Matrix Multiplication

We compare the relative scalabilities of three AM pairs for performing matrix multiplication. We choose this application for simplicity of illustration, i.e., the straightforward algorithms allow us to focus our attention on presenting the utility of our scalability quantification framework. We plot the three-dimensional surfaces of the AM pairs and compare their respective S(Λ_D) metrics for fixed values of Λ_D. We select three parallel algorithms, a1, a2, and a3, executing on the same machine, M, an IBM SP-x at Argonne National Laboratory [66]. The allowable processor configuration sizes are [N_lb, N_ub] = [1,128]. Each AM pair computes C, the product of two n × n matrices A and B. The algorithms differ in basic operation and in the distribution of matrices across the nodes of the machine. The major distinctions of a1, a2, and a3 are summarized next.
Thus, there is no interproces- sor communication needed. The memory cost, however, grows with increasing processor configuration size due to the replication of B. Matrices A and C are distributed by rows among nodes, and B is distributed by columns. The total memory capacity requirement is reduced in comparison to al as it is not a function of the number of processors; however, point-to—point communication is used to shift columns of B among the processors. The total number of messages exchanged is a function of N, and the message size is based on the ratio of n to N. Matrices B and C are distributed by rows among nodes, and A is distributed by columns. Total memory cost is the same as for a2. Data exchange is performed with a collective sum reduction operation for every data element of C. Next, we present the pseudo-code of each algorithm and describe their operation in more detail. 5.6.1 Algorithm-Machine Pair alM Figure 5.5 depicts the distribution of the three matrices in CI, for four processors numbered 0,1 ,2,3. Subscripts denote the processor number to which the data is mapped.5 Figure 5.6 shows the operation of al from the perspective of each processor. 5The distributions of A and C follow the (BLOCK,*) pattern of HPF [3]. 106 A0 Co Al C1 B : A) X a,1,2,3 C2 A3 C3 Figure 5.5: Data distribution of a1 Note that the bounds on the indices in statements 51-83 are relative to the node’s local data set. a1 adheres to the owner-computes rule [3] that stipulates that the processor that stores the data element on the left—hand side of a program statement is responsible for computing its update. Thus, each node is responsible for computing its respective portion of C, an 1% X 12 data block. The outermost loop, 81, iterates over the node’s local block C owned by the node. The degree of parallelism of a1 is N S Nab. Sl: For rows = 0T0 %—1 S2: For cols = 0 To n — 1 S3: Forj=0Ton—l S4: C[rows,cols] = C[rows,cols] + A[rows,j] x B[i,cols] S5: End For S6: End For S7: End For Figure 5.6: Algorithm a1 5.6.2 Algorithm-Machine Pair 0.2M In algorithm a2, contiguous rows of A and C are partitioned among processors, just as in al, but a; partitions contiguous columns of B among nodes. The operation of a2 is described for four processors as follows. Computation begins with the diagonal 107 data blocks of C as depicted by the darkest shaded boxes of Fig. 5.7. The diagonal f,- x fly-sized blocks of C are owned by different processors and thus can be computed in parallel. Next, each node proceeds with an off-diagonal block as illustrated with progressively lighter shading in Fig. 5.7. Each shading shows the block set that can be computed in parallel. Computing off-diagonal blocks, however, requires access to off—processor data. For example, to compute block Cm, processor 0 requires row A0 (which is local) and column B1 which is owned by processor 1. In general, to perform the latter steps, the columns of B are circulated one step to the left around the processors arranged in a ring as shown in Fig. 5.8 using the MPI [67] send-and- receive primitive which combines the two message—passing operations in one call. AI 42 execution order —’ Figure 5.7: Data distribution and operation of a; Figure 5.9 presents the a; algorithm. The number of steps in the outermost loop (SI) is determined by the number of processors since each processor shifts its locally-owned column to all other processors. All processors perform a barrier synchronization at statement S4. 
Figure 5.8: Communication pattern of a2 (each processor holds one of the column blocks B_0-B_3 at each of steps 1-4, rotating one position per step).

5.6.3 Algorithm-Machine Pair a3M

In algorithm a3, A is distributed by contiguous columns, and B and C are distributed by contiguous rows, as shown in Fig. 5.10 for four processors. In contrast to the previous two algorithms, a3 uses all processors in the computation of each data element of C. Figure 5.10 illustrates the computation of element C_{4,3}, where each processor computes a partial sum of products of row 4 of A and column 3 of B. Processor 2, to which C_{4,3} is distributed, performs a sum-reduction operation to obtain the final result. Figure 5.11 illustrates the general operation of a3. Statements S1 and S2 iterate over all row and column indices bounded by n. In statement S3, processors compute their respective partial sums. The operation of S6 is implemented as a tree-based reduction in MPICH [68].

Figure 5.10: Data distribution and operation of a3 (computation of element C_{4,3}).

5.6.4 Computing AM Pair Parameters

Scalability quantification begins with the determination of the values of the parameters that characterize AM pair behavior, e.g., W_∞ and Δ. The derivation of the parameters is based on the matrix sizes and distributions, and on measured execution times on machine M. Clearly, the amount of work in matrix multiplication is a function of the data size, or n. We fix the base size to be n = 1000. To determine the amount of work (computation), we examine the number of floating-point operations (FLOPs) in the three algorithms. As presented, the algorithms perform n multiplications and n additions to obtain each element of C. Since there are n² elements of C, a rough estimate of the base work is W = 2n³ = 2 GFLOPs, for n = 1000. As W can be fully parallelized (limited by N_ub), W = W_∞.
Each 111 single-precision real data element displaces four bytes on machine M. The derivation of several AM pair parameters is gathered by empirical testing. We execute alM, 02M, and a3M to obtain values of TN(W(’)) for various i and N. Using Equation (5.9) and execution times achieved with a; M, we solve for A. Equation (5.9) has no h component since there is no communication in algorithm a1. To obtain a stable value for A, we perform a number of tests with varying i and N and then take the mean. 2(in)3 (5.9) Using this approach, we obtain a per-node sustained computing capacity estimation on the SP-a: of A = 56 MFLOPS. Algorithm a2 contains a communication component which comprises a sending and receiving latency, co, and network latency as a function of the message size, cfifl). The number of communication steps in a2 is N — 1. N The total execution time of a2 is given in Equation (5.10). TN(wm) = 2%?“ [co+el ((’”)2)] (N— 1) (5.10) Using a system of equations based on measured execution time of a2, we determine the values of co = 0.075 and cl = 7.94‘5. In a3, the communication component is based on its use of the MPLReduce operation implemented with a tree-based paradigm [69]. Figure 5.12 illustrates the message-passing sequence for eight nodes. Based on the (binary) rank of the node, starting from the right least significant bit, if the bit is 1, send to the node with that bit 0 and exit; if the bit is 0, receive from the node 112 with that bit set and combine with the addition operator. MPLReduce combines summation and communication, however, we Shall not separate this primitive into distinct work and communication components. We model the execution time of the operation with c2 [log(N)], where C; is some constant. The message size of each constituent message sent in the collective reduction is one data element, regardless of matrix size. Reduction is invoked for each data element in C; thus, the total cost of reduction is (in)2(c2 [log(N)]). Equation (5.11) captures the total execution time for Step 3 4//—\ 1 A A A A [000] [001] [010] [011] [100] [101] £10.] [111] Figure 5.12: Tree—based MPI Reduction used in a3M a3M. We determine the value of c2 to be 1.33-5. T~(W<‘)) = 2%?— + (in)’(62l10g(N)l) (5.11) Table 5.1 summarizes the parameter values for the three AM pairs. Next, we quantify the relative scalabilities of alM, (12M, and a3M based on these parameter values. AM pair W“) M(W“), N) A h alM 2(in)3 (in)2(N + 2) 56MFLOPS n/a azM 2(in)3 3(in)2 56MFLOPS [cO + c,(£‘7’31)](N — 1) a3M 2(in)3 3(in)2 56MFLOPS (in)2(c2 [log(N)]) Table 5.1: AM parameters for alM, 02M, and a3M 113 5.6.5 Scalability Quantification of alM, 02M, and a3M Let us assume two end-users for matrix multiplication, User) and USCI‘z, who are interested in determining the most scalable AM pair based on the S (A D) measure for AD = 5 and AD = 20, respectively. Due to its memory constraint (see Table 5.1), alM can only scale i from 1 to 5.6. In contrast, (12M and a3M can obtain matrix scaling on the range, 1 S i S 36.6. The measure AD represents a fixed value of i. Thus, for AD 2 20, we can only compare the latter two AM pairs. Figure 5.15 illustrates the respective behaviors of 0.2M (left) and a3M (right) for 1 S i S 36.6 and 1 S N S 128. Base work is shown in green as explained in Section 5.5. The red lines in Fig. 5.15 illustrate AD = 20 for 02M and a3M, respectively. The sets (NJ-,Tj) for a2M and a3M are given in Table 5.2. 
Note that the processor ranges, N], are given from smallest to largest and the corresponding execution times range from largest to smallest. As explained in Section 5.5, execution time values which do not satisfy Equation (5.6) are obscured. AD a1 M a2M a3M N ,- T,- N ,- T,- N ,- T,- 5 [9,128] [464,32] [5,128] [2363,1935] n / a n / a 20 n / a n / a [45,128] [41576,37329] [45,128] [189094,184334] Table 5.2: SOD) measure for alM, (12M, and a3M When AD = 20, (12M and a3M have equivalent processor ranges, but 02M achieves much faster execution times. The larger execution times for a3M are due to the number of communication reduction operations required. Each reduction has a small message size as only one data element is contributed by each node. Each participant, 114 however, must incur a costly sending latency for each invocation. Due to the large number of reductions, sending latencies result in a large aggregate overhead as data sizes grow. The “flatness” of the surfaces with respect to increasing N in Fig. 5.15, pictorially illustrates that increasing the number of processors has a minimal impact on execution time. This phenomenon is explained by the communication overhead components of Equations (5.10) and (5.11). In each equation, (in)2 contributes far more to the overall cost than the respective N and log(N) factors. The surfaces in Fig. 5.16 illustrate the respective behaviors of a1M (left) and a2M (right) over 1 S i S 5.6 and l S N S 128. Again, base work is shown in green.6 Figure 5.16 (left) illustrates the full range of matrix size scale-up for a1M . Figure 5.16 (right) illustrates agM’s matrix scale range up to only 5.6 to facilitate comparison with al M. We omit a3M from the AD = 5 comparison since Fig. 5.15 clearly establishes that (12M achieves better relative performance over the entire range of i from 1 to 36.6. The red lines in Fig. 5.16 illustrate performance for the fixed level of work corresponding to AD = 5 for alM and (12M, respectively. Again, the precise values for (N j, Tj) are summarized in Table 5.2. We conclude that alM is the most scalable AM pair for User; who restricts AD to 5. This conclusion is based on alM’s relatively lower execution times; however, it can only perform the required work on no fewer than 9 processors, compared to 5 processors for a2 M. Presumably, this limitation would be of minimal importance to User). For AD = 20, both (12M and a3M require the same number of (NJ-,Tj) pairs, 6Note that the higher resolution for (12M reveals that base work execution time increases as the processor configuration size grows. 115 but 02M achieves significantly better execution times over the entire range. Thus, a2M is the most scalable AM pair for USCI‘z. 5.7 Inclusion of Other Scalability Models We favor a scalability framework that is primarily concerned with an AM pair’s ability to increase solution quality through increased work over the approach taken by other models which measure scalability mainly in terms of processor speedup, efficiency, or speed. However, our model does not preclude using these latter metrics as additional parameters. For instance in Fig. 5.17, the shading within the blue region is based on the ability to maintain average speed, the quotient of work over the product of number of processors and execution time [57]. Speed values are normalized, i.e., the base work on a single processor has average speed equal to 1.0. The borders between shadings from darkest to lightest are given by average speed values of [1.0,0.9,0.8,0.7,0.6,0.2). 
Similarly, efficiency [58], measured as the quotient of speedup over the number of processors, may be incorporated in an AM surface. Since speedup is required to measure efficiency, we are limited to scaling the work up to the memory bound of a single processor. Figure 5.18 illustrates efficiencies using the same hypothetical AM pair and the same shading values as given above for average speed. Note that the upper bound on the scaled-work axis of the efficiency plot is 10; thus, the efficiency surface represents a subset of the average-speed surface (Fig. 5.17).

[Figure 5.14: Scalability metrics illustrated on an AM surface (axes: Scaled Work, Processors (N), Time (sec))]

[Figure 5.15: S(AD) comparison of a2M and a3M where AD = 20 (axes: Matrix Scale Factor, Processors (N), Time (sec))]

[Figure 5.16: S(AD) comparison of a1M and a2M where AD = 5 (axes: Matrix Scale Factor, Processors (N), Time (sec))]

[Figure 5.17: Incorporating the speed metric (axes: Scaled Work, Processors (N))]

[Figure 5.18: Incorporating the efficiency metric (axes: Scaled Work, Processors (N))]

Chapter 6

Performance and Scalability

The goal of data redistribution is to enhance data-parallel program performance by providing a capability to rearrange data among the processor memories of a machine so that the arrangement better suits subsequent computation. Data redistribution does not contribute to the data-parallel program's computed result; thus, the ability to scale work or to scale the quality of the solution is not a relevant factor for analyzing redistribution. We assert that the viability of the optimal processor mapping technique (Chapter 3) and of the redistribution library, DaReL (Chapter 4), ought to be measured in terms of execution-time performance, as that is the most relevant criterion for redistribution. For any particular program data size, selected distribution patterns, and processor configuration, the time required to perform data redistribution should be minimized.

Beyond providing adequate performance for a particular redistribution scenario, we emphasize the importance of providing long-term, i.e., scalable, redistribution performance as data and processor-configuration sizes increase. Chapter 5 presents a model for quantifying algorithm-machine (AM) pair scalability using end-user-defined criteria. Central to the quantification framework is measuring an AM pair's ability to scale the amount of work performed. Since data redistribution scalability assessment focuses on the efficiency of data exchange, and not computation, we shall address the question: by what criteria is data redistribution scalability to be quantified? We assert that the trend in the slope of the redistribution execution-time curve over a range of processors is a viable means for quantifying redistribution scalability. The concept of slope over a range of processors is similar to our earlier work, where we coined range of scalability [8] as a means for assessing scalable library design.

In this chapter, we present data redistribution performance using DaReL on an IBM SP-x at Argonne National Laboratory [66] and demonstrate that DaReL is highly communication-dominant.
The performance of redistribution using the optimal processor mapping technique is compared with redistribution performance using the traditional mapping technique. We review a number of MPI point-to-point and collective communication algorithm options for performing data exchange, and we determine the best implementations, in terms of performance, as a function of data size and number of processors. Lastly, using the scalability quantification framework as a basis, we propose extensions for quantifying the scalability of DaReL, and we present results based on these extensions.

6.1 Processor Mapping Technique Analysis

In this section, we discuss the possible impacts on the data-parallel programmer and the compiler, respectively, of using the optimal processor mapping technique. We present the execution-time benefit of the technique compared to the traditional mapping technique for a number of data redistribution cases performed on an IBM SP-x.

6.1.1 Effect of Optimal Mapping on the Programmer

Permuting lpids with the optimal mapping functions in Sections 3.1 and 3.2 may incur undesired side effects and thus may not be suitable in all redistribution instances. For instance, the programmer may lose "neighboring data" relationships. In Fig. 3.2, under the traditional cyclic mapping, lpids 0 and 1 are neighbors,¹ while under the optimized mapping, lpids 0 and 4 are neighbors. Suppose a programmer imparts a particular logical-to-physical processor mapping for a given program and utilizes this information for maintaining neighboring data relationships on physical processors. In order to preserve these relationships, the programmer may favor the traditional processor-data mapping over the optimal mapping in a call to data redistribution.

Figure 6.1 illustrates static, i.e., not varying from one program execution to another, logical-to-physical processor mappings: logical processor i always maps to physical node pi. In (a), the traditional mapping, i.e., 0,1,2,3, preserves the neighboring processor relationships, while in (b) neighboring data relationships on the physical nodes are corrupted when using the optimal mapping, i.e., 0,2,1,3: p0 and p2 become neighbors where previously p0 and p1 were neighbors.

[Figure 6.1: Mappings when physical node mapping is static. (a) Traditional: logical 0 -> p0, 1 -> p1, 2 -> p2, 3 -> p3. (b) Optimal: 0 -> p0, 2 -> p1, 1 -> p2, 3 -> p3.]

¹A processor is a neighbor if it owns data elements that are adjacent from the perspective of the global data.

When a compiler or run-time system uniquely determines the logical-to-physical processor mappings, however, the programmer is unable to maintain neighboring data relationships on the physical nodes. Indeed, indirect-network multicomputers, e.g., the IBM SP-x, and most workstation clusters have no concept of "neighboring nodes." Furthermore, the allocation of parallel jobs, and thus of data, may vary between distinct executions of the program, since different physical processors may be allocated to each job. Figure 6.2 illustrates a number of program executions where the allocation of logical processors to physical nodes differs for each invocation. The top part of Fig. 6.2 illustrates the traditional processor-data mapping for three distinct program executions; the bottom portion of Fig. 6.2 illustrates the optimal processor-data mapping for three distinct program executions.
[Figure 6.2: Mappings when physical node mapping is dynamic. Traditional: (a) first execution 0 -> p0, 1 -> p1, 2 -> p2, 3 -> p3; (b) second execution 0 -> p2, 1 -> p0, 2 -> p1, 3 -> p3; (c) third execution 0 -> p1, 1 -> p2, 2 -> p3, 3 -> p0. Optimal: (a) first execution 0 -> p0, 2 -> p1, 1 -> p2, 3 -> p3; (b) second execution 0 -> p2, 2 -> p0, 1 -> p1, 3 -> p3; (c) third execution 0 -> p1, 2 -> p2, 1 -> p3, 3 -> p0.]

From the perspective of the physical nodes of the machine, the permutation of the lpids, whether traditional or optimal, is transparent to the programmer. We argue that in these situations the use of the optimal mapping is always justified, since neighboring data relationships on the physical nodes cannot be maintained, or exploited, by the programmer.

Permuting lpids can also facilitate greater flexibility: the programmer may want to influence which data elements are redistributed to other processors and which remain on-processor. Under the traditional mapping technique, the programmer has but one choice, i.e., lpids are mapped in increasing numerical order. With the optimized mapping technique, there are often several options, since a number of lpids may map to the same place holder; see Fig. 3.3(b). If the compiler supports programmer specification of lpid permutations, then the user has inherently greater control over the data-to-processor mapping. Such flexibility may become even more significant when data alignments are introduced. In such a situation, a number of data elements may map to some local indices while fewer data elements map to other indices. Therefore, being able to influence which indices are redistributed and which indices remain on-processor could enhance overall performance.

6.1.2 Effect of Optimal Mapping on the Compiler

The loss of neighboring data relationships may complicate the role of the compiler in generating SPMD node programs from the HPF source code. Let us examine a simple example. Let A be a 16-element array that is initially distributed BLOCK and then redistributed CYCLIC; see Fig. 3.2. Assume that the following reference pattern appears in the compiler-generated SPMD code following the call redistributing A to CYCLIC:

A(i) = A(i-1) + A(i+1)

Using the traditional mapping technique, the compiler generates the following communication paradigm to obtain off-processor elements: excluding the boundary processors, each lpid_i communicates with lpid_{i-1} and lpid_{i+1}. Under the optimal mapping technique, neighboring processor relationships are not maintained, so the above communication paradigm cannot be used. For example, lpid3 communicates with lpid6 and lpid7, while lpid4 communicates with lpid0 and lpid1. This problem can be easily overcome, however, if the compiler utilizes the place-holder mapping information determined by the optimal mapping function. Let q = 0, 4, 1, 5, 2, 6, 3, 7 be the array of place holders of the lpids as generated by Equation (3.1). Under the optimal mapping, the compiler would specify neighboring communication for non-boundary processors as follows: for lpid_i in place holder q_j, communicate with the lpids in q_{j-1} and q_{j+1}. Essentially, the compiler performs a table lookup to determine neighboring lpid relationships (a sketch of such a lookup follows below). The extension to boundary processors is straightforward and is excluded from the present discussion.
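As an illustration of that table lookup (our sketch, not the dissertation's compiler code; the place-holder array below is the q generated by Equation (3.1) for the eight-processor example), the neighbors of a non-boundary lpid can be found by locating its place holder and reading the adjacent entries:

```c
#include <stdio.h>

#define NPROCS 8

/* Place-holder array q from Equation (3.1) for the 8-processor
 * BLOCK -> CYCLIC example: q[j] is the lpid occupying place holder j. */
static const int q[NPROCS] = { 0, 4, 1, 5, 2, 6, 3, 7 };

/* Return the place-holder index j such that q[j] == lpid. */
static int place_of(int lpid) {
    for (int j = 0; j < NPROCS; j++)
        if (q[j] == lpid) return j;
    return -1;
}

/* Neighboring lpids of a non-boundary lpid under the optimal mapping:
 * the lpids sitting in place holders j-1 and j+1. */
static void neighbors(int lpid, int *left, int *right) {
    int j = place_of(lpid);
    *left  = q[j - 1];
    *right = q[j + 1];
}

int main(void) {
    int l, r;
    neighbors(3, &l, &r);   /* lpid 3 sits in place holder 6 */
    printf("lpid 3 talks to lpid %d and lpid %d\n", l, r);   /* 6 and 7 */
    neighbors(4, &l, &r);   /* lpid 4 sits in place holder 1 */
    printf("lpid 4 talks to lpid %d and lpid %d\n", l, r);   /* 0 and 1 */
    return 0;
}
```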
6.1.3 Performance Comparison with the Traditional Technique

The optimal mapping technique has been integrated into DaReL. Either the traditional or the optimal mapping technique can be selected as a run-time option without recompilation. To assess the run-time performance of the optimal mapping technique, we compare data redistribution execution times using the optimal technique with redistribution execution times using the traditional mapping. An experimental research version of MPI, MPI-F [70, 71], was used to obtain the performance results on the Argonne IBM SP-x.

Figures 6.3 through 6.5 illustrate various performance comparisons of the two mapping techniques. Figure 6.3 shows (BLOCK,*) to (CYCLIC(c),*) redistributions on 8 nodes over a range of matrix sizes: 32 thousand to 34 million 4-byte floating-point elements. The block size of the CYCLIC(c) pattern (row dimension) was maintained at one-half the block size, b, of the BLOCK distribution. The optimal mapping technique, applied only to the row dimension of the matrix, demonstrates significantly lower execution times than the traditional mapping: it achieves the same redistribution result in roughly 60% of the time needed under the traditional technique. The larger the value of c with respect to b, the greater the effect of the optimized mapping technique on overall execution time, because significantly more data hits occur relative to the traditional mapping. With smaller values of c relative to b, the benefit of the optimal technique is lessened, since the relative difference in the number of data hits is reduced. For all redistribution instances, regardless of block sizes (b and c) or global data size, we find that the optimal technique outperforms or equals the traditional mapping.

Figures 6.4 and 6.5 demonstrate (BLOCK,BLOCK) to (CYCLIC(c),CYCLIC(c)) redistributions on 12 (logically 4 x 3) and 24 (logically 6 x 4) processor configurations, respectively. Matrix sizes ranged up to 91 million floating-point numbers. In these performance plots, the mapping technique is applied to both dimensions of the matrices. As before, the optimal-mapping redistributions outperform the traditional mapping in all cases. The block size c was maintained at one-sixth and one-eighth of the respective block sizes in the row and column dimensions of the matrices. The execution-time advantage of the optimal mapping technique remains consistent for larger (> 24) processor configurations as well.

The computation of the send and receive processor sets, CDES, is included in all of the execution-time plots shown. Execution time attributable to these calculations represented a small fraction of the overall total, i.e., on the order of hundreds of microseconds for small data sizes and on the order of tens of milliseconds for the largest data sizes plotted. See Section 6.2.1 for more details.

6.2 DaReL Performance and Scalability

In this section, we examine in detail DaReL's performance for several types of data redistribution scenarios covering a large range of data shapes, sizes, and processor configurations. We assert that scalability in the case of DaReL ought to be defined by its ability, for a fixed data size, to improve performance with additional processors. To this end, we propose a simple metric for quantifying DaReL scalability.
[Figure 6.3: (BLOCK,*) to (CYCLIC(c),*) on 8 x 1 processors: traditional vs. optimal mapping execution time vs. matrix size (4-byte floats)]

[Figure 6.4: (BLOCK,BLOCK) to (CYCLIC(c),CYCLIC(c)) on 4 x 3 processors: traditional vs. optimal mapping execution time vs. matrix size (4-byte floats)]

[Figure 6.5: (BLOCK,BLOCK) to (CYCLIC(c),CYCLIC(c)) on 6 x 4 processors: traditional vs. optimal mapping execution time vs. matrix size (4-byte floats)]

The metric is based on the slope of the execution-time curve with respect to the number of processors used.

6.2.1 CDES and ED Execution-time Performance

The performance of DaReL's main components, CDES and ED, is presented for a number of distinct redistribution cases. The results were obtained using the public-domain MPI implementation, MPICH [68], on the Argonne IBM SP-x. Eight different redistribution cases were performed; see Table 6.1. CASES 1-4 represent different kinds of redistributions in terms of data shape and selected patterns, executed on 20-processor configurations of the SP-x; CASES 5-8 are the same as CASES 1-4 except that they were performed on 100-processor configurations. The individual cases within the groups 1-4 and 5-8 are dissimilar in order to demonstrate DaReL's flexibility in dealing with arbitrary patterns, data shapes, and sizes. We explore the execution-time effects of these factors. All data points were obtained using the traditional processor mapping technique.

Table 6.1: Redistribution test cases for DaReL

  CASE   D_s              D_t                P_s = P_t   Data configuration
  1      (BLOCK,BLOCK)    (CYCLIC,CYCLIC)    5 x 4       1000 x 1000 - 4000 x 4000
  2      (BLOCK,CYCLIC)   (CYCLIC,BLOCK)     5 x 4       1000 x 1000 - 4000 x 4000
  3      (BLOCK,*)        (CYCLIC,*)         20 x 1      400 x (2500 - 40000)
  4      (*,CYCLIC)       (*,BLOCK)          1 x 20      (1250 - 20000) x 800
  5      (BLOCK,BLOCK)    (CYCLIC,CYCLIC)    10 x 10     1000 x 1000 - 4000 x 4000
  6      (BLOCK,CYCLIC)   (CYCLIC,BLOCK)     10 x 10     1000 x 1000 - 4000 x 4000
  7      (BLOCK,*)        (CYCLIC,*)         100 x 1     400 x (2500 - 40000)
  8      (*,CYCLIC)       (*,BLOCK)          1 x 100     (1250 - 20000) x 800

Figures 6.6 and 6.7 show the execution times of the CDES and ED modules, respectively, on 20-processor configurations, CASES 1-4. The range of data sizes is shown on the horizontal axis, corresponding to the last column of Table 6.1; the vertical axis shows the execution time of the respective module. For all four cases, the execution time of CDES is approximately two orders of magnitude smaller than the execution time of ED. The execution-time data were obtained with the point-to-point algorithm shown in Fig. 4.7. Later in the chapter, we discuss the significance of the selected ED algorithm on the overall performance of redistribution.

[Figure 6.6: CDES performance on 20-processor configurations: execution time vs. data size (4-byte floats) for the four redistribution cases]

CDES execution time depends upon the size and the shape of the data to be redistributed. Recall from Chapter 4 that the complexity of CDES is O(n0 + n1 + ... + n_{m-1}), where m is the number of dimensions and n_i is the number of elements in dimension i. For CASES 1-4, CDES complexity is O(n0 + n1).
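To illustrate why this per-dimension cost is linear, the simplified 1-D sketch below (ours, not DaReL's CDES code; the ceiling-based BLOCK block size and the variable names are assumptions) walks this processor's indices of one dimension once, counting how many local elements are owed to each target processor under the new distribution; one such pass per dimension gives the O(n0 + n1 + ... + n_{m-1}) behavior:

```c
#include <stdio.h>

/* For one dimension of length n, distributed BLOCK over P processors
 * (block size b = ceil(n/P)) and redistributed CYCLIC(c), count how many
 * of this processor's local elements must be sent to each target. */
static void count_destinations(int n, int P, int c, int my_lpid, int *send_count) {
    int b = (n + P - 1) / P;                 /* BLOCK block size              */
    for (int p = 0; p < P; p++)
        send_count[p] = 0;

    int lo = my_lpid * b;                    /* my first global index         */
    int hi = (lo + b < n) ? lo + b : n;      /* one past my last global index */
    for (int g = lo; g < hi; g++) {          /* single pass over local data   */
        int owner = (g / c) % P;             /* owner under CYCLIC(c)         */
        send_count[owner]++;
    }
}

int main(void) {
    int P = 4, n = 16, c = 1;                /* illustrative sizes            */
    int counts[4];
    count_destinations(n, P, c, /*my_lpid=*/0, counts);
    for (int p = 0; p < P; p++)
        printf("lpid 0 sends %d element(s) to lpid %d\n", counts[p], p);
    return 0;
}
```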
Thus, CASES 3-4, whose data matrices are not square in shape, exhibit higher complexities than CASES 1-2, whose shapes are perfect squares.² The measured execution times of the four cases are consistent with the CDES complexity estimations.

²See the last column of Table 6.1.

[Figure 6.7: ED performance on 20-processor configurations: execution time vs. data size (4-byte floats) for the four redistribution cases]

Since ED constitutes the majority of redistribution execution time, we concentrate on assessing its performance among CASES 1-4 in greater detail. Redistribution CASES 1 and 4 exhibit much higher execution times, up to a factor of 5 on 20 processors, than the remaining two redistribution cases. CASES 1 and 4 do not distinguish themselves by exchanging more data; indeed, the amount of data exchanged in all four cases is exactly the same. The discrepancy in execution time is explained by whether the data to be sent to other processors is taken from contiguous or non-contiguous locations in the local processors' memory.

We compare CASES 1 and 2, which appear quite similar but whose measured execution times on 20 (logically 5 x 4) processors vary significantly. We discuss their execution for the largest data-size redistributions, i.e., sixteen million elements. In both cases, each processor stores an 800 x 1000 block of 4-byte floating-point numbers in row-major order. Figures 6.8 and 6.9 illustrate the local data sets of processor (0,0) for CASE 1 and CASE 2, respectively. The 0,1,2,3 numbering applied to the columns and the 0,1,2,3,4 pattern applied to the rows of Fig. 6.8 illustrate the ownership of (0,0)'s elements under the D_t pattern, i.e., (CYCLIC,CYCLIC). The shaded blocks indicate the local elements that are sent to processor (0,1); there are forty thousand such elements. Each of these elements must be obtained from non-contiguous memory locations. MPI's derived datatype facility, discussed in Section 4.2.5, allows the collection of non-contiguous data to be done by the underlying MPI implementation rather than by the user. Collecting data from non-contiguous locations, however, impacts the overall execution time of a message-passing operation [72]. For CASE 1, every processor sends forty-thousand-element messages to the nineteen remaining processors, with the elements taken from non-contiguous locations.

Contrast the situation of CASE 1 with that of CASE 2, shown in Fig. 6.9. The same 0,1,2,3 and 0,1,2,3,4 patterns are applied to the columns and rows, respectively, as in CASE 1. However, due to the semantics of D_t in the column dimension, contiguous data blocks of this dimension are destined for individual processors. The semantics of D_t in the row dimension precludes the entire data block (shown shaded) from being obtained from fully contiguous locations; nevertheless, the data destined to (0,1) is obtained from 160 non-contiguous locations of 250-element blocks. We assert that the lower overhead of collecting data for CASE 2 accounts for its better execution time relative to CASE 1. Similar distinctions can be shown when comparing redistribution CASES 3 and 4: for CASE 3, processors send data from largely contiguous locations, while for CASE 4, all data elements are taken from non-contiguous locations.
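To make the contiguity difference concrete, the sketch below (ours; Section 4.2.5 describes DaReL's use of derived datatypes, but this particular construction is only an illustration of the CASE 1 and CASE 2 layouts just described, with the 800 x 1000 local block, strides, and starting offsets taken from that discussion) builds MPI derived datatypes for the two send patterns: 40,000 isolated elements for CASE 1 versus 160 blocks of 250 contiguous elements for CASE 2. The MPI implementation then performs the gathering of non-contiguous data:

```c
#include <mpi.h>

/* Local block on processor (0,0): 800 x 1000 floats, row-major. */
#define LROWS 800
#define LCOLS 1000

/* CASE 1, D_t = (CYCLIC,CYCLIC) on 5 x 4: elements bound for (0,1) are every
 * 5th row and every 4th column (starting at column 1), i.e. 160 x 250 = 40,000
 * isolated elements. */
static MPI_Datatype case1_type(void) {
    MPI_Datatype one_row, full;
    /* 250 single elements per selected row, 4 columns apart */
    MPI_Type_vector(250, 1, 4, MPI_FLOAT, &one_row);
    /* 160 such rows, 5 local rows (5 * LCOLS * sizeof(float) bytes) apart */
    MPI_Type_create_hvector(160, 1, (MPI_Aint)(5 * LCOLS * sizeof(float)),
                            one_row, &full);
    MPI_Type_commit(&full);
    MPI_Type_free(&one_row);   /* 'full' keeps its own reference */
    return full;
}

/* CASE 2, D_t = (CYCLIC,BLOCK): elements bound for (0,1) are every 5th row,
 * columns 250..499, i.e. 160 blocks of 250 contiguous elements. */
static MPI_Datatype case2_type(void) {
    MPI_Datatype full;
    MPI_Type_vector(160, 250, 5 * LCOLS, MPI_FLOAT, &full);
    MPI_Type_commit(&full);
    return full;
}

/* Usage (illustrative): MPI_Send(&local[0][1],   1, case1_type(), dest, 0, comm);
 *                       MPI_Send(&local[0][250], 1, case2_type(), dest, 0, comm); */
```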
[Figure 6.8: Local data set of processor (0,0) for (BLOCK,BLOCK) to (CYCLIC,CYCLIC): an 800 x 1000 block with column ownership cycling 0,1,2,3 and row ownership cycling 0,1,2,3,4]

[Figure 6.9: Local data set of processor (0,0) for (BLOCK,CYCLIC) to (CYCLIC,BLOCK): an 800 x 1000 block with four contiguous 250-column panels owned by processors 0-3 and row ownership cycling 0,1,2,3,4]

We turn our attention to the 100-processor redistributions, CASES 5-8 in Table 6.1. The performance of CDES, shown in Fig. 6.10, corresponds for each case almost exactly to the performance exhibited by the equivalent type of redistribution in the 20-processor case; e.g., the CDES performance of CASES 4 and 8 is almost exactly equal. Thus, we conclude that the number of processors is of minimal consequence to CDES performance, and data size is the primary performance factor. This conclusion is consistent with our earlier CDES complexity analysis. ED performance for CASES 7 and 8 is significantly lower relative to their counterparts, CASES 3 and 4, respectively. This phenomenon is attributable to the small number of destinations that a processor sends to in the former cases as compared to the latter cases. For instance, each processor in CASE 7 sends to only three other destinations. In CASES 7 and 8, the data shapes are highly non-square; if the data shapes were more nearly square, each processor would need to send to a greater number of processors, most likely resulting in increased execution time. The execution times of CASES 5 and 6 are similar to those of CASES 1 and 2, respectively.

[Figure 6.10: CDES performance on 100-processor configurations: execution time vs. data size (4-byte floats) for the four redistribution cases]

[Figure 6.11: ED performance on 100-processor configurations: execution time vs. data size (4-byte floats) for the four redistribution cases]

6.2.2 Algorithm Choices for ED

Due to the predominance of ED execution time, in this section we examine the different MPI-based implementations of the ED module discussed in Chapter 4; see Figs. 4.7 and 4.8. We define four algorithm-machine (AM) pairs (see Chapter 5) for performing ED. The algorithms are distinguished by their use of MPI primitives; below, N represents the number of participating processors.

1. Point-to-point: uses N - 1 calls to MPI_Send and N - 1 calls to MPI_Recv.
2. Gather: uses N calls to MPI_Gather.
3. Scatter: uses N calls to MPI_Scatter.
4. All-to-all: uses one call to MPI_Alltoall (sketched below).

Figure 6.12 shows the execution-time performance of the four AM pairs over a range of processor configuration sizes, [4, 100], for (BLOCK,BLOCK) to (CYCLIC,CYCLIC) redistributions. This type of redistribution was selected since it exhibited the worst-case redistribution performance on the 100-processor configuration and the second-worst performance on the 20-processor configuration. We assert that the scalability analysis applied to this case can be extrapolated to the other redistribution cases as well.
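The all-to-all option listed above can be sketched in a few lines. This is our illustration, not DaReL's ED implementation, and it uses MPI_Alltoallv, the vector variant, so that the per-destination counts produced by a CDES-style computation may differ from processor to processor:

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative all-to-all data exchange: every process contributes a packed
 * send buffer partitioned by destination and receives the pieces destined
 * for it.  Counts and displacements are in units of MPI_FLOAT. */
void exchange_all_to_all(float *sendbuf, int *sendcounts,
                         float *recvbuf, int *recvcounts, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    /* Displacements are running sums of the counts. */
    sdispls[0] = rdispls[0] = 0;
    for (int p = 1; p < nprocs; p++) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }

    /* One collective call replaces N-1 sends and N-1 receives per process. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_FLOAT,
                  recvbuf, recvcounts, rdispls, MPI_FLOAT, comm);

    free(sdispls);
    free(rdispls);
}
```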
The four plots in Fig. 6.12 range from a data size of 1000 x 1000 (4-byte) floating-point numbers (upper left) to 4000 x 4000 floating-point numbers (lower right), and they show that execution time increases with increasing data size. Consistent throughout the four plots is the relative performance of the four AM pairs: in all plots, the Scatter algorithm exhibits the worst performance, followed by the Point-to-point algorithm, while significantly better performance is seen with the Gather and All-to-all algorithms.

[Figure 6.12: (BLOCK,BLOCK) to (CYCLIC,CYCLIC) redistributions: execution time of the Point-to-point, Gather, Scatter, and All-to-all algorithms vs. number of processors, for four data sizes from 1000 x 1000 to 4000 x 4000 4-byte floats]

To analyze the performance data, we examine the implementation details of the above collective communication primitives in MPICH.³ Each of the collective operations is implemented with MPI blocking and non-blocking send and receive primitives. MPI_Alltoall is implemented exclusively with non-blocking sends and receives. This approach is particularly advantageous when the number of processors is large, as messages can be received from processors in the order of arrival. Using blocking receives imparts a static order on the reception of messages into user space from distinct senders; the imposed order may not mirror the actual order of message arrival, leading to inefficiencies.

³MPICH is public-domain software; thus, we can analyze the source code.

Perhaps the most surprising performance result is the clear advantage of the Gather algorithm over the Scatter algorithm. Intuitively, both ought to exhibit comparable performance, since the former is a sequence of N many-to-one operations while the latter is a sequence of N one-to-many operations. An MPI_Gatherv is implemented as a sequence of blocking MPI_Recv calls at the root process and a single blocking MPI_Send call at each non-root process. To gather data from its send buffer into its receive buffer, the root process performs a non-blocking send to itself that is eventually received with a blocking receive. An MPI_Scatterv is implemented as a sequence of blocking MPI_Send calls at the root process and a single blocking MPI_Recv call at each non-root process. The root process performs a blocking MPI_Sendrecv to scatter data from its send buffer into its receive buffer.

In the context of MPI, a blocking send is defined as one that blocks the process until its communication buffer can be reused. Thus, returning from a call to MPI_Send does not imply that the intended recipient has received the message. On the other hand, returning from a blocking MPI_Recv means that the message has been successfully received into the user's communication buffer.
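The arrival-order point about MPI_Alltoall can be seen in miniature below (our sketch, not MPICH source code): posting non-blocking receives and completing them with MPI_Waitany drains messages in whatever order they arrive, whereas a loop of blocking MPI_Recv calls fixes the reception order in advance.

```c
#include <mpi.h>
#include <stdlib.h>

/* Receive one message from every other rank, processing them in arrival
 * order.  recvbufs[p] must point to a buffer of 'count' floats for rank p. */
void receive_in_arrival_order(float **recvbufs, int count,
                              int myrank, int nprocs, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(nprocs * sizeof(MPI_Request));
    int posted = 0;

    for (int p = 0; p < nprocs; p++) {
        if (p == myrank) continue;
        MPI_Irecv(recvbufs[p], count, MPI_FLOAT, p, /*tag=*/0, comm,
                  &reqs[posted++]);
    }

    /* Complete the receives as the messages arrive, not in rank order. */
    for (int done = 0; done < posted; done++) {
        int idx;
        MPI_Waitany(posted, reqs, &idx, MPI_STATUS_IGNORE);
        /* ... unpack or place the data from the request that just finished ... */
    }
    free(reqs);
}
```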
To explain the performance difference of the Gather and Scatter algorithms, we examine the interaction of communicating processes under the two algorithms, combined with message-passing benchmark data for the SP-x. The time consumed by a point-to-point message is commonly subdivided into sending, network, and receiving latencies. The sending latency, t_send, is the time required to inject a message into the network; the network latency, t_net, is the time consumed traversing the network; and the receiving latency, t_recv, is the time needed to retrieve the message at the recipient. Nupairoj and Ni [73] report the following values for these parameters on the SP-x, where m represents the message size in bytes:

t_send = 20 + (0.02)m microseconds
t_net  = (0.03)m microseconds
t_recv = 35 + (0.02)m microseconds

We examine the specific case in which the redistribution data size is 1000 x 1000 on four processors. For these data points, the relative proportions t_send : t_net : t_recv are approximately 1.0 : 1.5 : 1.0. Using these relative proportions, Fig. 6.13 illustrates the communication paradigm of the Scatter (top) and Gather (bottom) algorithms for performing the ED function with the processors numbered 0, 1, 2, 3. For both algorithms, processor 0 assumes the role of root for the first MPI_Scatter (MPI_Gather) operation; subsequently, the other processors, in increasing numeric order, become the roots of sequential scatters (gathers). Data is communicated using the blocking MPI_Send and MPI_Recv primitives. The line segments labelled S and R represent sending and receiving latencies, respectively; diagonal lines indicate network latency. Note that in both portions of the figure the first MPI_Scatter and the first MPI_Gather complete in the same number of time steps, 5.5. The Gather algorithm, however, has already initiated all four MPI_Gather operations and completes them in 16 time steps.⁴ In contrast, Scatter completes the same number of MPI_Scatter operations in 19 time steps. The Gather algorithm has the advantage that more sending, network, and receiving latencies are completely overlapped, i.e., they occur in parallel. The current analysis does not entirely account for the difference between Gather and Scatter execution times; for instance, factors such as message packetization and message contention are not considered. The analysis does, however, provide some insight into the non-intuitive advantage of Gather over Scatter for performing ED.

⁴The diagram assumes that all four processes initiate the respective collective communication calls simultaneously. In SPMD execution this may not always be the case; we assume simultaneous behavior for the purpose of this analysis.

[Figure 6.13: Communication paradigm using Scatter (top) and Gather (bottom)]

The advantage of the Point-to-point algorithm over the Scatter algorithm is explained by the implementation of the Scatter algorithm. As described earlier, the MPICH implementation of Scatter is a succession of calls to the blocking point-to-point routines; in other words, the Scatter algorithm looks very much like the Point-to-point algorithm of Fig. 4.7. The distinction is that Scatter uses an MPI_Sendrecv operation to scatter data from its sending buffer into its receiving buffer, whereas the Point-to-point algorithm used in ED employs MPI_Pack and MPI_Unpack to reshuffle the data that stays local to a process. This results in better execution-time performance, as shown in the four plots of Fig. 6.12.
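Returning to the latency parameters quoted above, a short calculation (ours; the 1 MB message size is an assumption corresponding roughly to one piece of the 1000 x 1000 float case on four processors, and for any message this large the fixed 20 and 35 microsecond terms are negligible) reproduces the roughly 1.0 : 1.5 : 1.0 proportion used in Fig. 6.13:

```c
#include <stdio.h>

/* SP-x point-to-point latency model from Nupairoj and Ni [73];
 * m is the message size in bytes, times are in microseconds. */
static double t_send(double m) { return 20.0 + 0.02 * m; }
static double t_net (double m) { return        0.03 * m; }
static double t_recv(double m) { return 35.0 + 0.02 * m; }

int main(void) {
    double m = 1.0e6;   /* assumed message size in bytes */
    double s = t_send(m), n = t_net(m), r = t_recv(m);
    printf("t_send = %.0f us, t_net = %.0f us, t_recv = %.0f us\n", s, n, r);
    printf("proportions: %.2f : %.2f : %.2f\n", s / s, n / s, r / s);
    return 0;
}
```

For m = 1,000,000 bytes this prints 20020, 30000, and 20035 microseconds, i.e., proportions of 1.00 : 1.50 : 1.00.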
6.2.3 Scalability

In Sections 6.2.1 and 6.2.2, we discussed DaReL's performance for various redistribution cases using different algorithmic approaches to perform data exchange. In this section, we address DaReL's scalability. Specifically, we quantify DaReL's scalability as the number of processors participating in redistribution is increased. Scalability, in the context of data redistribution, measures the long-term trend in execution time as the number of processors used increases. To quantify scalability, we shall observe the slope of the execution time over a range of processors [N, N'], where N' > N.

In Chapter 5, we declared that an AM pair whose execution time increased with additional processors performing the same amount of work was inherently unscalable. We use this idea to define scalability in terms of slope: as we increase from a processor configuration of size N to one of size N', we expect increased performance. In the context of redistribution, increased performance is measured by reduced execution time. To measure increased performance, it suffices to observe the slope of the execution-time curve over the range [N, N']. We specify slope formally in Equation (6.1).

Slope(N, N') = \frac{T_{N'} - T_N}{N' - N}    (6.1)

We define negative slope, NegSlope(N, N'), as the negation of Slope(N, N') in Equation (6.2).

NegSlope(N, N') = -Slope(N, N')    (6.2)

When NegSlope(N, N') < 0, execution time is increasing with additional processors; when NegSlope(N, N') > 0, execution time is decreasing with additional processors. We define the scalability or unscalability of a data redistribution AM pair based on whether NegSlope(N, N') is positive or negative. If NegSlope(N, N') is positive, its magnitude is used to quantify scalability.

Definition 3 An AM pair is unscalable on [N, N'] when NegSlope(N, N') < 0.

Definition 4 An AM pair is scalable on [N, N'] when NegSlope(N, N') > 0.

The larger the magnitude of a negative slope, the more scalable the AM pair. We examine the use of the quantifier NegSlope(N, N') by calculating the slopes of the four AM pairs in the 1000 x 1000 plot of Fig. 6.12. Figure 6.14 repeats the plot with the computed values of NegSlope(N, N') shown directly below it. The line segments illustrate NegSlope(N, N') values on a specific range of processors for a given AM pair. Note that the All-to-all AM pair consistently achieves the highest NegSlope(N, N') values. Certainly, the trend in slope can be seen in either the top or bottom portion of Fig. 6.14; the additional information contributed by the bottom plot is the relative magnitude of the scalability quantifier NegSlope(N, N') for the distinct AM pairs. For instance, Scatter's NegSlope(4, 10) is less than All-to-all's NegSlope(10, 25). Additionally, one can clearly see for which processor ranges the NegSlope(N, N') values drop below zero and by what magnitude. Note that none of the AM pairs succeeds in having a positive NegSlope(50, 100). This result is attributable to the ratio of data size to the number of processors participating in the redistribution: the data set size is a constant 1000 x 1000 floating-point numbers, and beyond 50 processors the size of the local data set, and thus the message size, is too small to amortize the cost of sending and receiving latencies.
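Expressed directly in code (our helper; the T values are measured execution times in seconds, as in the tables of this chapter), the two quantities of Equations (6.1) and (6.2) are:

```c
/* Slope and negative slope of the execution-time curve over [N, N'],
 * Equations (6.1) and (6.2).  t_n and t_nprime are the measured times
 * (seconds) on n and nprime processors, with nprime > n. */
double slope(int n, double t_n, int nprime, double t_nprime) {
    return (t_nprime - t_n) / (double)(nprime - n);
}

double neg_slope(int n, double t_n, int nprime, double t_nprime) {
    return -slope(n, t_n, nprime, t_nprime);
}

/* neg_slope > 0: execution time falls as processors are added (scalable);
 * neg_slope < 0: execution time rises (unscalable on that range).        */
```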
[Figure 6.14: 1000 x 1000 (4-byte float) redistribution: execution times of the four AM pairs (top) and the corresponding NegSlope(N, N') values (bottom) vs. number of processors]

As shown in Section 6.2.2, DaReL has a number of algorithms at its disposal for performing the exchange of data. Given the gathered performance data of Fig. 6.12, we are able to fine-tune ED to select the highest performer among the AM pairs for a particular data size and processor configuration. The selection can be done in ED by means of a switch/case-type statement based on run-time information provided to the library; see Section 4.2.6. For example, in the case of the 1000 x 1000 redistribution (Fig. 6.14), the Gather AM pair is used on the range [4, 20] and the All-to-all AM pair is used on the range [21, 100]. The Point-to-point and Scatter AM pairs are not considered, since their execution times are not competitive with Gather and All-to-all. Thus, DaReL's performance for any given combination of processor count and data size is determined by the best-performing AM pair.
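A minimal sketch of such a run-time dispatch follows; it is ours, not DaReL's interface from Section 4.2.6, the enum and function names are illustrative, only the 1000 x 1000 thresholds read off Fig. 6.14 are filled in, and the default fallback is an assumption:

```c
/* ED exchange-algorithm choices (names illustrative). */
typedef enum { ED_POINT_TO_POINT, ED_GATHER, ED_SCATTER, ED_ALLTOALL } ed_alg_t;

/* Pick the best-performing exchange algorithm from the measured data.
 * Additional data sizes would add further cases as more measurements
 * become available. */
ed_alg_t select_ed_algorithm(long nelems, int nprocs) {
    switch (nelems) {
    case 1000L * 1000L:
        return (nprocs <= 20) ? ED_GATHER : ED_ALLTOALL;
    default:
        /* Assumed fallback when no tuning data exists for this size. */
        return ED_ALLTOALL;
    }
}
```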
Thus, the relative 146 Xx 1- as» . , , ‘>’<‘~-;1smipli¢n , § 0.6% ° 9mm”? : § "')<..,§milllon 5 20.4‘ P 2 ; i. 0). g o . 2 x 107 Data size (4-byte floats) Number of processors Figure 6.15: DaReL scalability measured by N egSlope(N , N’) values for NegSlope(N, N’) rather than absolute values are important for scalability quantification. If time units were measured in nanoseconds, then the vertical axis in Fig. 6.15 would change, but the relative distinctions between the plotted lines would not. Chapter 7 Conclusion and Future Work Changing array distributions within a data-parallel program can substantially affect overall program performance. Explicit redistribution is intended to optimize subse- quent computation while implicit redistribution can be a common occurrence in a program. Implicit redistribution can occur with array assignment statements where the distribution of the operands are not equivalent, through the use of data realign- ment, or through the use of subroutine libraries whose operands require specific dis- tributions of data. This dissertation has examined the critical issues for performing efficient and scalable data redistribution in distributed-memory machines, and it has proposed methods to mitigate the cost of redistribution in terms of execution time. 7 .1 Research Contribution Since interprocessor communication is the predominant source of redistribution execu- tion time, minimizing the total amount of data movement among processor memories 147 148 may increase redistribution performance. This dissertation presented a technique, based on logical processor to data element mapping, that minimizes the total amount of data movement among processor memories for BLOCK to CYCLIC(c) , and vice-versa, redistributions. Mapping functions for one-to—one, m-to—m, one-to—two, and two-to- one dimension data redistributions were presented and their optimality was proven. The proposed methodology is architecture-independent, facilitating its potential inte- gration into distinct redistribution implementations for different distributed-memory architectures. We discussed the possible impacts on the programmer and compiler of using the mapping technique. We showed that the technique offers the programmer extra flexibility in determining data placement in some instances. Additionally, the use of the technique in a straightforward manner by the compiler is discussed. We demonstrated redistribution execution time improvements over the traditional data- processor mapping of up to 40% using the optimal mapping technique on an IBM SP-sc. Portions of this work have appeared in [74] and shall appear in [75]. We presented a simple end-user-oriented framework for quantifying the scalability of algorithm-machine (AM) combinations, or pairs, together with a facility to aid in visualizing their scalability in three dimensions. We developed a mathematical framework to characterize the properties of AM pairs, e.g., communication overhead, and we proposed a model for visually illustrating an AM pair’s execution-time per- formance over user-defined ranges of processor configurations and scaled work. The benefit of a three-dimensional illustration is that a number of critical parameters, e. g., memory capacity and communication overhead, can be viewed together in a unified manner, together with the end-user’s chosen scalability metric(s). 
We presented a de- 149 tailed case study that illustrates the utility of our framework in comparing the relative scalabilities of distinct AM pairs for performing matrix multiplication. We demon- strated the flexibility of the framework in its ability to incorporate other scalability metrics, such as isospeed [57] and isoefficiency [58]. The scalability quantification framework can be found in [76] which has been submitted for publication. We proposed the design of a portable, MPI-based library, DaReL, for explicit data redistributions, and we motivated the critical design choices. DaReL supports multi-dimensional data redistribution for HPF’s regular distribution patterns, BLOCK, CYCLIC, and *. It is designed for !HPF$ REDISTRIBUTE; however, we envision its applicability to implicit redistributions that can occur within HPF programs. We discuss a number of salient issues affecting the design of a redistribution library for HPF including scalability and the respective roles of the compiler and the library. In contrast to other approaches, e. g. , [32], DaReL decouples processor send/ receive set calculation from data exchange. This decoupling simplifies library design and facil- itates multiple data exchange algorithm options. Data exchange is performed with MP1 primitives, enhancing DaReL’s portability among distributed-memory platforms that utilize the emerging message-passing standard. We discussed the advantages of using MP1 for data redistribution, and we detailed DaReL’s use of MPI point-to—point, collective communication, process topology, and derived datatype constructs. We demonstrated the execution time performance of DaReL’s implementation on an SP-a: and focused our attention on optimizing the data exchange portion of DaReL due to its execution-time predominance. We examined a number of MP1 point-to— point and collective (scatter, gather, and all-to—all) communication algorithms for 150 performing data exchange, selecting the best performer for a given data size and number of processors. The clear advantages of using the collective gather and all-to- all primitives for data redistribution were demonstrated. Portions of this work have appeared in [77]. We extended the framework for quantifying scalability specifically for data redis- tribution algorithm-machine pairs. We proposed a metric based on slope, which is measured as the ratio of change in execution time to the change in the number of pro- cessors. We quantified the scalability of DaReL using this metric and demonstrated its scalability for a large range of data and processor configuration sizes. 7 .2 Future Work There are a number of areas pertaining to efficient data redistribution services that could be pursued in the future. The impact of data alignment and realignment on redistribution represents one such area. Data alignments may cause an unbalanced amount of data exchange between processor memories during data redistribution. Such imbalances may lead to increased redistribution execution time. Another area of future investigation would be a general theory for minimizing data exchange among processors using HPF regular distribution patterns. For instance, optimal processor- data mappings for CYCLIC(c1) to CYCLICch) redistributions where c1 5£ e; have not been addressed. Furthermore, the impact of data alignments may also influence a general theory. Further investigation of DaReL’s scalability and performance on other distributed—memory platforms supporting MPI would be desirable. 
First, it would validate the claim of portability; and second, it would be beneficial to investigate whether the relative merits of the different message-passing algorithms are consistent on other hardware platforms. Perhaps other, more efficient, algorithms can be found. Further study on different hardware platforms may suggest new scalability criteria for data redistribution as well. Finally, we anticipate that DaReL can be used not only for explicit redistribution, but also in instances of implicit redistribution. The fundamental operation of DaReL would remain unchanged; however, the interface to the compiler may change. Also, optimizations of certain library components unneeded for implicit redistribution may be found.

Bibliography

[1] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C.-W. Tseng, and M.-Y. Wu, "Fortran D language specification," Tech. Rep. COMP TR90-141, Rice University, Department of Computer Science, Dec. 1990.

[2] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald, Vienna Fortran: A Language Specification (Version 1.1), 1991.

[3] High Performance Fortran Forum, "High Performance Fortran Language Specification (version 1.0, draft)," May 1993.

[4] P. J. Hatcher and M. J. Quinn, Data-Parallel Programming on MIMD Computers. Cambridge, Massachusetts: MIT Press, 1991.

[5] F. Bodin, P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr, "Implementing a parallel C++ runtime system for scalable parallel systems," in Proceedings of Supercomputing '93, pp. 588-597, Nov. 1993.

[6] M. Mace, Memory Storage Patterns in Parallel Processing. Kluwer Academic, 1987.

[7] J. Li and M. Chen, "The data alignment phase in compiling programs for distributed-memory machines," Journal of Parallel and Distributed Computing, vol. 13, pp. 213-221, Oct. 1991.

[8] L. M. Ni, H. Xu, and E. T. Kalns, "Issues in scalable library design for massively parallel computers," in Supercomputing '93, pp. 181-190, Nov. 1993.

[9] U. Kremer, "Automatic data layout for distributed-memory machines," Tech. Rep. CRPC-TR93299-S, Rice University, 1993.

[10] W. Press, Numerical Recipes in C. Cambridge, 1988.

[11] P. Mehrotra and J. V. Rosendale, Programming Distributed Memory Architectures Using Kali, ch. 19, pp. 364-384. MIT Press, 1991.

[12] M. Rosing, R. B. Schnabel, and R. P. Weaver, "The DINO parallel programming language," Journal of Parallel and Distributed Computing, vol. 13, pp. 30-42, Sept. 1991.

[13] M. Baber, "Hypertasking support for dynamically redistributable and resizeable arrays on the iPSC," in Proceedings of the Sixth Distributed Memory Computing Conference, pp. 59-66, Apr. 1991.

[14] B. Chapman, P. Mehrotra, H. Moritsch, and H. Zima, "Dynamic data distributions in Vienna Fortran," in Proceedings of Supercomputing '93, pp. 284-293, Nov. 1993.

[15] M. W. Hall, S. Hiranandani, K. Kennedy, and C.-W. Tseng, "Interprocedural compilation of Fortran D for MIMD machines," in Proceedings of Supercomputing '92, pp. 522-534, Nov. 1992.

[16] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," tech. rep., University of Tennessee, Mar. 1994.

[17] High Performance Fortran Forum, "HPF-2 scope of activities and motivating applications," Nov. 1994.

[18] M. Ikei and M. Wolfe, "Automatic array alignment for distributed memory multicomputers," in Proceedings of the Hawaii International Conference on Systems Sciences, Jan. 1994.
[19] M. Gupta and P. Banerjee, "Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 179-193, Mar. 1992.

[20] J. M. Anderson and M. S. Lam, "Global optimizations for parallelism and locality on scalable parallel machines," in ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, vol. 28, pp. 112-125, June 1993.

[21] E. T. Kalns, H. Xu, and L. M. Ni, "Evaluation of data distribution patterns in distributed-memory machines," in Proceedings of the 1993 International Conference on Parallel Processing, vol. II, pp. 175-183, Aug. 1993.

[22] H. Xu and L. M. Ni, "Optimizing data decomposition for data parallel programs," in Proceedings of the 1994 International Conference on Parallel Processing, vol. II, pp. 225-232, Aug. 1994.

[23] R. Ponnusamy, J. Saltz, R. Das, C. Koelbel, and A. Choudhary, "Embedding data mappers with distributed memory machine compilers," in ACM SIGPLAN Notices, Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, vol. 28, pp. 52-55, Jan. 1993.

[24] P. Z. Lee and T.-B. Tsai, "Compiling efficient programs for tightly-coupled distributed memory computers," in Proceedings of the 1993 International Conference on Parallel Processing, vol. 2, pp. 161-165, Aug. 1993.

[25] S. Hiranandani, K. Kennedy, and C.-W. Tseng, "Compiling Fortran D for MIMD distributed-memory machines," Communications of the ACM, vol. 35, pp. 66-80, Aug. 1992.

[26] S. Chittor and R. Enbody, "Performance evaluation of mesh-connected wormhole-routed networks for interprocessor communication in multicomputers," in Proceedings of the 1990 Supercomputing Conference, pp. 647-656, Nov. 1990.

[27] S. Chittor and R. Enbody, "Predicting the effect of mapping on the communication performance of large multicomputers," in Proceedings of the 1991 International Conference on Parallel Processing, vol. 2, pp. 1-4, Aug. 1991.

[28] S. Chatterjee, J. R. Gilbert, and R. Schreiber, "Mobile and replicated alignment of arrays in data-parallel programs," in Proceedings of Supercomputing '93, Nov. 1993.

[29] K. Kunchithapadam and B. P. Miller, "Optimizing array distributions in data-parallel programs," 7th Annual Workshop on Languages and Compilers for Parallel Computing, August 1994.

[30] C. M. Chase and A. P. Reeves, "Data remapping for distributed-memory multicomputers," in Proceedings of the 1992 Scalable High Performance Computing Conference, pp. 137-144, Apr. 1992.

[31] S. Gupta, S. Kaushik, S. Mufti, S. Sharma, C.-H. Huang, and P. Sadayappan, "On the generation of efficient data communications for distributed-memory machines," in Proceedings of the 1992 International Computer Symposium, (Taichung, Taiwan), pp. 504-513, Dec. 1992.

[32] R. Thakur, A. Choudhary, and G. Fox, "Runtime array redistribution in HPF programs," in Proceedings of the 1994 Scalable High Performance Computing Conference, pp. 309-316, May 1994.

[33] J. Stichnoth, D. O'Hallaron, and T. Gross, "Generating communication for array statements: Design, implementation, and evaluation," Journal of Parallel and Distributed Computing, vol. 21, no. 1, pp. 150-159, 1994.
[34] S. Ramaswamy and P. Banerjee, "Automatic generation of efficient array redistribution routines for distributed memory multicomputers," Tech. Rep. CRHC-94-09, Center for Reliable and High Performance Computing, Computer Systems and Research Laboratory, University of Illinois, 1994.

[35] P. Banerjee, J. A. Chandy, M. Gupta, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su, "The PARADIGM compiler for distributed-memory message passing," in First International Workshop on Parallel Processing, (Bangalore, India), Dec. 1994.

[36] J. Bruck, R. Cypher, C.-T. Ho, and S. Kipnis, "Efficient algorithms for the index operation in message-passing systems," Tech. Rep. RJ 9300 (82230), IBM Research Division, Almaden Research Center, San Jose, CA and T.J. Watson Research Center, Yorktown Heights, NY, Apr. 1993.

[37] P. K. McKinley, Y.-J. Tsai, and D. F. Robinson, "A survey of collective communication in wormhole-routed massively parallel computers," Tech. Rep. MSU-CPS-94-35, Department of Computer Science, Michigan State University, 1994.

[38] C.-T. Ho and S. L. Johnsson, "Distributed routing algorithms for broadcasting and personalized communication in hypercubes," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 640-648, Aug. 1986.

[39] P. K. McKinley, H. Xu, A.-H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed networks," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 1252-1265, Dec. 1994.

[40] S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An approach to communication-efficient data redistribution," in Proceedings of the 1994 International Conference on Supercomputing, pp. 364-373, July 1994.

[41] S. L. Johnsson and C.-T. Ho, "The complexity of reshaping arrays on boolean cubes," in Proceedings of the Fifth Distributed Memory Computing Conference, vol. 1, pp. 370-377, Apr. 1990.

[42] S. L. Johnsson, "Language and compiler issues in scalable high performance scientific libraries," 1992. (Tutorial Notes at Frontiers '92).

[43] A. Wakatani and M. Wolfe, "A new approach to array redistribution: Strip mining redistribution," in PARLE '94 Parallel Architectures and Languages Europe, 6th International PARLE Conference, Athens, Greece, pp. 323-335, July 1994.

[44] V. Bala and J. Ferrante, "Explicit data placement (XDP): A methodology for explicit compile-time representation and optimization of data movement (extended abstract)," in ACM SIGPLAN Notices, Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, vol. 28, pp. 36-39, Jan. 1993.

[45] V. Bala, J. Ferrante, and L. Carter, "Explicit data placement (XDP): A methodology for explicit compile-time representation and optimization of data movement," in Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM SIGPLAN Notices, vol. 28, pp. 139-148, July 1993.

[46] S. Amarasinghe and M. Lam, "Communication optimization and code generation for distributed memory machines," in ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, vol. 28, pp. 126-138, June 1993.

[47] A. Wakatani and M. Wolfe, "Optimization of the redistribution of arrays for distributed memory multicomputers," tech. rep., Matsushita Electric Industrial Co., Ltd., Osaka, Japan and Dept. of Computer Science and Engineering, Oregon Graduate Institute, Portland, OR, Dec. 1993.

[48] A. Skjellum, N. E. Doss, and P. V. Bangalore, "Writing libraries in MPI," in Proceedings of the 1993 Scalable Parallel Libraries Conference, pp. 166-173, Oct. 1993.
[49] A. Skjellum and A. P. Leung, "Zipcode: A portable multicomputer communication library atop the reactive kernel," in Proceedings of the Fifth Distributed Memory Computing Conference (DMCC5), pp. 767-776, Apr. 1990.

[50] D. W. Walker, "The design of a standard message passing interface for distributed memory concurrent computers," Tech. Rep. ORNL/TM-12512, Oak Ridge National Laboratory, Oak Ridge, TN, Oct. 1993.

[51] J. C. Adams, W. S. Brainerd, J. T. Martin, B. T. Smith, and J. L. Wagener, Fortran 90 Handbook. 1221 Avenue of the Americas, New York, NY 10020: Intertext Publications, 1992.

[52] A. Skjellum, N. E. Doss, and K. Viswanathan, "Inter-communicator extensions to MPI in the MPIX (MPI eXtension) library," tech. rep., Department of Computer Science and NSF Engineering Research Center for Computational Field Simulation, Mississippi State University, Mississippi State, MS, Aug. 1994.

[53] W. Ware, "The ultimate computer," IEEE Spectrum, vol. 9, pp. 84-91, 1972.

[54] G. Amdahl, "Validity of the single-processor approach to achieving large scale computing capabilities," in Proceedings AFIPS Conference, pp. 483-485, 1967.

[55] J. L. Gustafson, G. R. Montry, and R. E. Benner, "Development of parallel methods for a 1024-processor hypercube," SIAM Journal on Scientific and Statistical Computing, vol. 9, pp. 609-638, July 1988.

[56] J. L. Gustafson, "Reevaluating Amdahl's Law," CACM, vol. 31, pp. 532-533, May 1988.

[57] X.-H. Sun and D. T. Rover, "Scalability of parallel algorithm-machine combinations," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 599-613, June 1994.

[58] A. Grama, A. Gupta, and V. Kumar, "Isoefficiency: Measuring the scalability of parallel algorithms and architectures," IEEE Parallel and Distributed Technology, vol. 1, pp. 12-21, Aug. 1993.

[59] C.-C. Lin, "Isoefficiency analysis of parallel systems with limited memory," in Contributed Papers of the IPPS 93 Workshop on Analyzing Scalability of Parallel Algorithms and Architectures, pp. 11-15, Apr. 1993.

[60] V. Kumar and A. Gupta, "Analyzing scalability of parallel algorithms and architectures," Journal of Parallel and Distributed Computing, Special Issue on Scalability of Parallel Algorithms and Architectures, vol. 22, pp. 379-391, Sept. 1994.

[61] J. L. Gustafson, "The design of a scalable, fixed-time computer benchmark," Journal of Parallel and Distributed Computing, vol. 12, pp. 388-401, 1991.

[62] J. L. Gustafson and Q. O. Snell, "HINT: A new way to measure computer performance," Tech. Rep. IS-5109, UC-405, Ames Laboratory, Ames, Iowa, 50011-3020, 1994.

[63] P. Dinda, T. Gross, D. O'Hallaron, E. Segall, J. Stichnoth, J. Subhlok, J. Webb, and B. Yang, "The CMU task parallel program suite," Tech. Rep. CMU-CS-94-131, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1994.

[64] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume I: General Techniques and Regular Problems, Chapter 7. Prentice Hall, 1988.

[65] S. Sharma and High Performance Fortran Forum, "HPF-2 scope of activities and motivating applications," Nov. 1994. (Molecular Dynamics application, Section 4.3).

[66] W. Gropp and E. Lusk, "User's guide for the ANL IBM SPx system." URL http://www.mcs.anl.gov/Projects/sp1/guide-r2.html, 1995.

[67] Message Passing Interface Forum (MPIF), "MPI: A Message-Passing Interface Standard," tech. rep., University of Tennessee, May 1994.
[68] P. Bridges, N. Doss, W. Gropp, E. Karrels, E. Lusk, and A. Skjellum, "Users' guide to MPICH, a portable implementation of MPI," tech. rep., Argonne National Laboratory and Mississippi State University, 1994.

[69] Argonne National Laboratory and Mississippi State University, "Comments from C-based source code of MPICH reduction primitive," 1994.

[70] H. Franke, P. Hochschild, P. Pattnaik, and M. Snir, "MPI-F: An efficient implementation of MPI on IBM SP-1," in Proceedings of the 1994 International Conference on Parallel Processing, vol. 3, pp. 197-201, Aug. 1994.

[71] H. Franke, P. Hochschild, P. Pattnaik, J.-P. Prost, and M. Snir, "MPI on IBM SP1/SP2: Current status and future directions," in Proceedings of the 1994 Scalable Parallel Libraries Conference, Oct. 1994.

[72] H. Franke, "MPI-F: An implementation for IBM SP-1/SP-2," Tech. Rep. Version 1.39, IBM T. J. Watson Research Center, Yorktown Heights, NY, Feb. 1995.

[73] N. Nupairoj and L. Ni, "Benchmarking of multicast communication services," Tech. Rep. MSU-CPS-ACS-103, Department of Computer Science, Michigan State University, Apr. 1995.

[74] E. T. Kalns and L. M. Ni, "Processor mapping techniques toward efficient data redistribution," in Proceedings of the 8th International Parallel Processing Symposium, pp. 469-476, Apr. 1994.

[75] E. T. Kalns and L. M. Ni, "Processor mapping techniques toward efficient data redistribution," IEEE Transactions on Parallel and Distributed Systems. (Accepted to appear).

[76] E. T. Kalns and L. M. Ni, "A framework for quantifying scalability," Tech. Rep. MSU-CPS-ACS-100, Department of Computer Science, Michigan State University, Mar. 1995. Submitted to Supercomputing '95.

[77] E. T. Kalns and L. M. Ni, "DaReL: A portable data redistribution library for distributed-memory machines," in Proceedings of the 1994 Scalable Parallel Libraries Conference II, Oct. 1994.