PIPELINED DATA PARALLEL ALGORITHMS — CONCEPT, DESIGN, AND MODELING

By

Chung-Ta King

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science
Michigan State University
East Lansing, MI 48824
1988

ABSTRACT

PIPELINED DATA PARALLEL ALGORITHMS — CONCEPT, DESIGN, AND MODELING

By Chung-Ta King

As multiprocessors become more and more popular, the need for efficient parallel algorithms becomes more and more pressing. In this thesis, the concept, design, implementation, and modeling of a class of parallel algorithms, called pipelined data parallel algorithms, is discussed. When running on distributed-memory multiprocessors, pipelined data parallel algorithms exploit processor-level macro-pipelining to achieve a balance between computation and communication.

The concept of pipelined data parallel computations is first introduced by contrasting it with other styles of parallel computations. Various considerations in designing efficient parallel algorithms, especially data parallel algorithms, are discussed. A new paradigm, called large-grain pipelining, is introduced to facilitate the design of pipelined data parallel algorithms on distributed-memory multiprocessors.

To estimate the performance of pipelined data parallel algorithms, an analytic model is introduced. The accuracy of the analytic model is studied by comparing the analytic results with experimental results from a 64-node NCUBE multiprocessor. The close match between these two sets of results indicates that the analytic model can also be used to determine the optimal design parameters, such as the granularity, for a given problem instance.

A systematic approach for designing pipelined data parallel algorithms on distributed-memory multiprocessors is presented. Starting from a nested-loop program, the approach restructures the program through a series of transformations. Based on the data dependencies between the loops, data and operations in the program are grouped together. The grouping technique allows control over the granularity so that the generated pipelined data parallel program achieves a balance between computation and communication.

Finally, the major contributions of this thesis are summarized and future work is outlined.

To my parents and my wife

ACKNOWLEDGEMENTS

I wish to thank my advisor, Dr. Lionel M. Ni, for his guidance over the years. Without the research environment that he provided and his intellectual advice and inspiration, this dissertation would not have been possible. I would also like to thank Dr. Anthony S. Wojcik for his valuable suggestions and comments on this dissertation and for his careful reading of the manuscript. I am grateful to Dr. A. Esfahanian, who was always willing to answer my questions and provide me with valuable references. I also wish to express my appreciation to Dr. D. Feldman for his encouragement and support.

A person cannot accomplish anything without help from others. I would like to thank all the people who helped me during my stay at Michigan State. In particular, I wish to acknowledge my friends T. Gendreau, E. Wu, Y.H. Liu, S.W. Chen, J. Liao, W.H. Chou, T.H. Pan, B. McMillin, Youran Lan, D. Ra, X.H. Sun, F. Fotouhi, and T. Znati.
Finally, I would like to thank my parents and my wife for their constant support, patience, and love. TABLE OF CONTENTS List of Tables ............................................................................................................. viii List of Figures ............................................................................................................ ix Chapter 1. Introduction ........................................................................................... 1 1.1. Shared versus Distributed Memory Multiprocessors ............................. 2 1.2. A Model of Distributed-Memory Multiprocessors ................................ 4 1.3. Styles of Parallel Computations ............................................................. 8 1.4. Problem Statement ................................................................................. 15 1.5. Thesis Overview .................................................................................... 16 Chapter 2. Considerations for Designing Parallel Algorithms ............................ 18 2.1. A Representation of Parallelism — P-nets ............................................ 19 2.1.1. Basic Definition of P-nets ............................................................ 19 2.1.2. Variations and Analysis of P-nets ................................................ 25 2.2. Considerations for Partitioning and Mapping ........................................ 27 2.2.1. Number of Nodes and Partitions .................................................. 27 2.2.2. Mapping Considerations .............................................................. 28 2.2.3. Inter-Processor Scheduling .......................................................... 30 2.3. Considerations for Scheduling within Processors ................................. 31 Chapter 3. Designing and Analyzing Data Parallel Algorithms .......................... 35 3.1. NCUBE Multiprocessors ....................................................................... 36 3.1.1. The Hardware .............................................................................. 36 3.1.2. The Software ................................................................................ 38 3.2. Modeling Computation and Communication on DMMs ....................... 39 3.3. Case Study — Array Summation ........................................................... 45 vi 3.3.1. The Algorithms ............................................................................ 46 3.3.2. Analytic Modeling ....................................................................... 48 3.3.3. Performance Analysis .................................................................. 51 3.4. Case Study — Matrix Multiplication .................................................... 54 3.4.1. The Algorithms ............................................................................ 55 3.4.2. Analytic Modeling ....................................................................... 59 3.4.3. Performance Analysis .................................................................. 60 Chapter 4. Large-Grain Pipelining ........................................................................ 64 4.1. Parallel Programming Paradigms .......................................................... 65 4.2. Basic Concept of Large-Grain Pipelining .............................................. 68 4.3. Characteristics of Large-Grain Pipelining ............................................. 73 4.4. 
Performance of Pipelined Matrix Multiplication ................................... 76 Chapter 5. Modeling Pipelined Data Parallel Algorithms ................................... 80 5.1. A Model of Pipelined Computation ....................................................... 81 5.2. Throughput Analysis of a Computational Unit ..................................... 83 5.3. Throughput Analysis of Pipeline Arrays ............................................... 89 5.4. Analysis of Pipelined Matrix Multiplication ......................................... 95 5.5. A Comparative Study of the Analytic Model ........................................ 99 Chapter 6. Designing Pipelined Data Parallel Algorithms .................................. 103 6.1. Synthesizing Systolic Arrays ................................................................. 104 6.2. The Design Procedure ............................................................................ 110 6.3. The Grouping Problem .......................................................................... 111 6.4. Grouping with One or Two Dependence Vectors ................................. 117 6.5. Grouping with Three Dependence Vectors ........................................... 122 6.6. Grouping with Four or More Dependence Vectors ............................... 131 6.7. An Example of Grouping ....................................................................... 134 Chapter 7. Conclusion and Future Work .............................................................. 140 7.1. Summary ................................................................................................ 140 7.2. Future Work ........................................................................................... 143 Bibliography .............................................................................................................. 146 Appendix .................................................................................................................... 150 vii LIST OF TABLES Tabel 1.1. Performance figures for three DMMs ....................................................... 7 Table 1.2. Different styles of parallel computation ................................................... 10 Table 3.1. System parameters on the NCUBE .......................................................... 45 Table 5.1. Optimal partitions for pipelined algorithm ............................................... 100 viii LIST OF FIGURES Figure 1.1. The block diagram of a generic DMM ................................................. 5 Figure 1.2. A concurrent function parallel computation ......................................... 11 Figure 1.3. A concurrent data parallel computation ................................................ 12 Figure 1.4. A pipelined function parallel computation ........................................... 12 Figure 1.5. A pipelined data parallel computation .......................................... -. ....... 14 Figure 1.6. Another pipelined data parallel computation ........................................ 14 Figure 2.1. Matrix multiplication and its corresponding P-net ............................... 21 Figure 2.2. Execution of a P-net at various time instants ........................................ 23 Figure 2.3. Partitioning a P-net among three processors ......................................... 24 Figure 2.4. The D-net corresponds to the P-net in Figure 2.1 ................................. 26 Figure 2.5. 
A schedule for the D-net in Figure 2.4 ................................................. 32 Figure 3.1. Gathering/scattering operations in the NCUBE .................................... 42 Figure 3.2. Measured host overhead in sending a message ..................................... 43 Figure 3.3. Measured node to node communication time ....................................... 44 Figure 3.4. Array summation with centralized accumulation method .................... 46 Figure 3.5. Array summation with tree-structured accumulation ........................... 47 Figure 3.6. Timing diagram of array summation using centralized accumulation ........................................................................................ 48 Figure 3.7. Timing diagram of array summation using tree-structured accumulation ........................................................................................ 51 Figure 3.8. Comparison of ratioed and equal partition ........................................... 52 Figure 3.9. Comparison of tree-structured and centralized accumulation .............. 53 Figure 3.10. Predicted performance of various methods ........................................ 54 Figure 3.11. Matrix Multiplication with strip partition ........................................... 56 Figure 3.12. Matrix multiplication with block partition ......................................... 58 Figure 3.13. The algorithm for matrix multiplication ............................................. 59 Figure 3.14. Execution time of the matrix multiplication ....................................... 61 ix Figure 3.15. Predicted speedup of the matrix multiplication .................................. 62 Figure 4.1. Algorithm 4.1 for pipelined matrix multiplication ............................... 69 Figure 4.2. Data flows in Algorithm 4.1 .................................................................. 70 Figure 4.3. Algorithm 4.2 for pipelined matrix multiplication ............................... 72 Figure 4.4. Data flows in Algorithm 4.2 .................................................................. 73 Figure 4.5. Algorithm 4.3 for pipelined matrix multiplication ............................... 74 Figure 4.6. Data flows in Algorithm 4.3 .................................................................. 75 Figure 4.7. Comparison of pipelined matrix multiplication .................................... 77 Figure 4.8. Effects of partition sizes ........................................................................ 78 Figure 4.9. Comparison of speedups ....................................................................... 79 Figure 5.1. Model of a computational unit .............................................................. 82 Figure 5.2. Functional description of the computational unit ................................. 82 Figure 5.3. Timed Petri-net of a computational unit ............................................... 85 Figure 5.4. A linear array with single input stream ................................................. 87 Figure 5.5. A linear array with two input streams ................................................... 89 Figure 5.6. The Timed Petri-net of a linear array with two stages .......................... 90 Figure 5.7. Computational units connected in a 2-dimensional mesh .................... 93 Figure 5.8. The Timed Petri-net of a 2-dimensional mesh ...................................... 94 Figure 5.9. 
Measured and analyzed performance for various pipeline configurations ....................................................................................... 99 Figure 5.10. Measured and analyzed performance for various partition sizes ........................................................................................ 100 Figure 5.11. Predicted speedup for pipelined matrix multiplication ....................... 101 Figure 6.1. The computational structure of the matrix multiplication .................... 107 Figure 6.2. Projected computational structures of matrix multiplication ............... 109 Figure 6.3. Grouping of a computational structure along [1 0] .............................. 114 Figure 6.4. A computational structure with possible cycles ................................... 116 Figure 6.5. A computational structure with one dependence vector ....................... 118 Figure 6.6. A computational structure with two dependence vector ....................... 119 Figure 6.7 . Grouping along two directions .............................................................. 122 Figure 6.8. A computational structure with three dependence vectors ................... 123 Figure 6.9. A grouping which generates a non-dependence-preservin g structure ................................................................................................ 125 Figure 6.10. Dependence relationships between groups ......................................... 127 Figure 6.11. Relationship between groups in a universal planner array ................. 133 Figure 6.12. The computational structure of the example program ........................ 135 X Figure 6.13. The projected computational structure of the example ....................... 136 Figure 6.14. The grouping along [1 0] with a size of 2 .......................................... 138 xi CHAPTER 1 INTRODUCTION A multiprocessor is a computer system containing multiple processors which are capable of communicating and cooperating at different levels in order to solve a given problem [HwBr84]. A distributed-memory multiprocessor (DMM) is a multiprocessor in which each memory module (not cache memory) is physically associated with each processor and an interprocessor communication network provides a mechanism for communication between processors. Centered around the development of efficient parallel algorithms on DMMs, this thesis focuses on the design, implementation, and modeling of a special class of parallel algorithms, called pipelined data parallel algo- rithms, on DMMs. DMMs are different from shared-memory multiprocessors (SMMs), in which all memory modules are equally accessible to all processors through a processor-memory interconnection network. Both architectures have commercial implementations [Kwan87]. A brief review and comparison of SMMs and DMMs are presented in Sec- tion 1.1. This discussion serves to motivate our study of DMMs. A model of DMMs is given in Section 1.2. Characteristics of DMMs are discussed in more detail, which form the basis for the following discussion. As will be seen in Section 1.2, the major bottleneck in current DMMs is communication. The problems caused by the non- negligible communication overhead are discussed, and, from a user’s point of view, the best way to get around with the communication problem is to carefully design the 1 algorithm. The work presented in this thesis is devoted entirely to achieving this end. 
Pipelined data parallel algorithms are a class of parallel algorithms which uses pipelining and data level partitioning to achieve a very high degree of parallelism. Spe- cial features of pipelined data parallel algorithms make them very suitable for execution on DMMs. As a result, pipelined data parallel algorithms are the primary subject of this thesis. In Section 1.3, the basic concept of pipelined data parallel algorithms is intro- duced by contrasting them with other styles of parallel computations. Finally, a formal statement of the problems addressed in this thesis is given in Sec- tion 1.4, followed by an overview of the thesis in Section 1.5. 1.1. SHARED VERSUS DISTRIBUTED MEMORY MULTIPROCESSORS The multiprocessors discussed in this thesis are MIMD machines, in which multi- ple instruction streams simultaneously operate on multiple data streams. As previously mentioned, to provide such a concurrent processing environment, SMMs treat the memory as a shared resource and use a processor-memory interconnection network to allow the memory to be accessed equally from any processor. The simplest such inter- connection network is a single high speed bus to which multiple processors, memory modules, network interfaces, and device controllers are all tied. Most minisupercomput- ers, such as Alliant’s FX series, Encore’s Multimax, or Sequent’s Balance and Sym- metry series [Hwan87], choose this kind of connection. The advantage is, of course, the simple architecture which performs very well for up to a few tens of processors. How- ever, as the number of processors increases, the contention to access the bus increases, which, in turn, increases memory latency. This limits the scalability and expandability of bus-structured SMMs. At the other end of the spectrum is the SMMs with a crossbar interconnection net- work. The crossbar provides a one-to-one link from any processor to any memory module. The only contention that may occur is when two processors attempt to access the same memory module. The major disadvantage of the crossbar is the high cost in building the network, which, again, limits the scalability of the SMM. Another direct connect architecture [Hwan87] is the multistage interconnection network, which is a compromise between the bus and the crossbar. One of the problems with the multistage network is the difficulty in achieving cache coherence and in synchronizing parallel activities. Popular processor interconnection networks for DMMs include multistage and point-to—point networks. For example, BBN’s Butterfly uses a butterfly multistage net— work to allow processors to access each other’s local memories [BBN87]. Among point-to-point interprocessor connections, the hypercube is the most popular topology, as is found in NCUBE’s NCUBE/10, Intel’s iPSC, and AMTEK’s S-14 [Hwan87, HaMS86]. The reason for this is because the hypercube topology has a uniform struc- ture with logzN diameter, where N is the number of nodes in the hypercube, and a rich set of other topologies, such as ring and mesh can be embedded in it [SaSh85]. One of the most significant advantages of DMMs is their modularity and scalabil- ity. For example, in the hypercube, when the number of processors doubles (from N to 2N), the diameter of the network only increases by one, and the number of neighbors a processor has to connect to is also increased by one. Currently, DMMs with a few hun- dred processors are common. Thus, DMMs hold the promising potential for massive parallelism. 
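These structural properties follow directly from the binary encoding of the node numbers: two nodes are neighbors exactly when their numbers differ in one bit, so an N-node cube has log2 N neighbors per node and a diameter of log2 N. The short C sketch below is only an illustration of this arithmetic; the dimension and node numbers are arbitrary choices, not details taken from any particular machine.

    /* Illustrative sketch only: hypercube adjacency and distance follow from
     * the binary node numbers.  Nodes i and j are neighbors when i XOR j has
     * exactly one 1-bit; the hop count between them is the number of 1-bits
     * in i XOR j; the diameter of a 2^n-node cube is n. */
    #include <stdio.h>

    static int popcount(unsigned x)        /* number of 1-bits in x */
    {
        int c = 0;
        while (x) { c += x & 1u; x >>= 1u; }
        return c;
    }

    int main(void)
    {
        const int n = 6;                   /* a 64-node cube, chosen arbitrarily */
        unsigned node = 21;                /* an arbitrary node number           */

        printf("neighbors of node %u:", node);
        for (int d = 0; d < n; d++)        /* flipping each bit gives the n neighbors */
            printf(" %u", node ^ (1u << d));
        printf("\n");

        printf("hops from node 21 to node 42: %d\n", popcount(21u ^ 42u));
        printf("diameter of a %d-node cube: %d\n", 1 << n, n);
        return 0;
    }

Doubling the number of nodes adds one bit to every node number, which is why both the node degree and the diameter grow by exactly one.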
It is for this reason that DMMs are studied in this thesis. However, DMMs are not without problems. The main problem is the need for communication between processors to exchange data held in each other's local memories. In general, there are two possible synchronization and communication methods for DMMs: shared-variable and message-passing. In the shared-variable approach, processes communicate by sharing common variables. Though the shared-variable approach provides a unified and friendly environment for users, it takes a great deal of effort to support the logical view of a shared, global memory on top of distributed, local memories. Very high speed communication is necessary to close the performance gap between accessing local and remote memory modules.

In the message-passing approach, processes communicate by passing messages explicitly. Message-passing is much simpler to implement on DMMs, because it only uses simple primitives such as send and receive. More importantly, to achieve real massive parallelism, it is necessary to connect hundreds or even thousands of processors together. For such a complex system, message-passing is a natural choice for interprocess(or) communication. Nevertheless, message-passing requires a completely different style of algorithm design and programming than the conventional shared-variable approach. Therefore, in this thesis, we will concentrate on developing efficient parallel algorithms on message-passing DMMs.

1.2. A MODEL OF DISTRIBUTED-MEMORY MULTIPROCESSORS

Figure 1.1 is the block diagram of a typical DMM. A set of node processors are interconnected through a point-to-point network, and each can communicate with a frontend host. The words processors and nodes will be used interchangeably in the following discussion.

[Figure 1.1. The block diagram of a typical DMM: a host with I/O processors connected to nodes 0 through N-1, which are joined by a point-to-point interprocessor connection.]

(1) Node structure:

Nodes have homogeneous structures and run in an asynchronous fashion. They communicate with each other by passing messages. Each node consists of a computation unit, a communication unit, and a local memory. The computation is usually performed by the node CPU with some floating-point accelerators. Optional vector processors may be used, such as in Intel's iPSC-VX [Hwan87]. The communication unit is in charge of the communication with the host and the other nodes. The task of communication may be performed by the CPU or by dedicated processors through DMA (Direct Memory Access). For example, in many second generation DMMs, nodes contain intelligent routers dedicated to communication [ShFi87]. These routers not only reduce the time for initiating and receiving a message, but also reduce the time spent relaying messages. More about communication will be discussed shortly.

Local memory in each node contains space for the node operating system, user processes, and communication buffers. For many systems, the size of local memory is still a limiting factor. For example, each node in the NCUBE has only 512 Kbytes of local memory. In early DMMs, virtual memory is not available at the nodes, because there is no direct access from the nodes to any system peripherals other than through the host. The situation is changing in the second generation DMMs, in which special I/O subsystems are incorporated into the nodes to allow direct access to I/O devices [WiCM88].

(2) Host:

One or more frontend hosts are connected to the nodes to manage the whole system.
Popular host to node interconnection topology include one-to-one direct connec- tion, as in NCUBE’s NCUBE/10, or bus connection, as in Intel’s iPSC/l. In the first generation DMMs, all peripherals, such as disks, tape drivers, terminals, and network controllers, are all connected to the host. Therefore, hosts provide the primary user interface and perform such functions as program development, performance monitor- ing, program and data downloading, file system maintenance, and program debugging. With new concurrent I/O subsystems, the role of hosts in future DMMs may be changed or replaced by the nodes. (3) Node interconnection network: Most DDMs use a static point-to-point network for node interconnection, such as a hypercube, mesh, tree, or ring [Hwan87]. In earlier implementations, messages are routed through the network using the store-and-forward approach. That is, a message or a packet of a message is stored completely at each intermediate node before it is sent out to the next node. This not only introduces a very high communication delay, which is proportional to the distance of the communication path, but also increases the degree of interference with the intermediate nodes, e.g., the need to allocate communication buffers in these nodes. In the second generation DMMs, intelligent routers utilizing new techniques, such as wormhole routing [DaSe87], are used to reduce the overhead in routing messages. In worrnhole routing, as soon as a node examines the header of a message, it selects the next channel and begins forwarding the message down the channel. Thus, only a few control bits are buffered at each intermediate node. As a result, less interference is experienced in the intermediate nodes and message latency remains almost constant, regardless of the distance between the source and destination. The implication is that the network becomes virtually fully connected [ShFi87]. (4) System software: In a message-passing DMM, the only means of communication and synchroniza- tion at the system level, just as at the machine level, is by passing messages. There are no shared variables nor centralized sequence control. One node is dedicated to one job at a time. A user must gain exclusive access to a set of nodes before the user can use the nodes. Again, this situation will be changed in future generation DMMs. Conventional programming languages are supported with enhanced library routines to handle mes- sages and node management. A more friendly and advance programming environment is needed, which provides intelligent compilers and debuggers for DMMs. As mentioned earlier, new router designs reduce the latency in routing messages. However, there is still a non-negligible overhead in setting up and receiving a message. Most overhead comes from the system software when invoking send/receive routines and maintaining message buffers. Thus, even though DMA channels and routers can reduce the communication overhead, it is still up to the software — both operating sys- tem and algorithm — to minimize the overall communication delay. From the above discussion, one can see that DMMs achieve modularity and exten- dibility by minimizing the coupling between nodes. Nevertheless, this also introduces a non-negligible cost of communication. Table 1.1 lists performance figures for three representative first generation DMMs [Duni87]. Table 1.1. 
Performance figures for three DMMs

                                   AMETEK S-14    Intel iPSC     NCUBE
    8-byte transfer time (µs)            640.0        1120.0     470.0
    8-byte multiply time (µs)             33.9          43.0      14.7
    Communication/Computation               19            26        32

It can be seen in Table 1.1 that transferring a datum (a 64-bit floating-point number) takes a much longer time than performing a computation on it. This imbalance is expected to improve in the second generation DMMs through streamlined communication primitives and intelligent routers. However, as discussed, the distributed nature of DMMs introduces a non-uniform access time in referencing local and remote memories. Therefore, it is essential not only to minimize the number of messages exchanged in the algorithm, but also to choose an appropriate message size in order to balance the computation with the communication.

In general, there are two possible sources of communication bottlenecks in a DMM. One is the communication bandwidth between the nodes, which has been discussed above. The other is the I/O bandwidth between the nodes and the external world, e.g., disks. For example, in the current generation of DMMs, only the host has access to disks. Thus, at the initialization and summing-up stages of the computation, the host must download data to the nodes and the nodes must upload data back to the host, perhaps sequentially. It follows that the host and the host-to-node interconnection (or I/O bandwidth) also tend to be system bottlenecks.

Putting all these considerations together, it can be seen that, from the algorithm designer's point of view, the best way to minimize the communication effect is to carefully partition the algorithm and schedule the operations — a topic which will be further discussed.

1.3. STYLES OF PARALLEL COMPUTATIONS

In general, a parallel computation can be classified from two perspectives: one is according to the way the computation is partitioned and distributed, and the other is based on the way the computation is executed. From the first perspective, we can distinguish between computations that are data parallel or function parallel [Oste87]. A function parallel computation decomposes the program into segments. Different processors execute different segments in parallel. Function parallelism is suitable for applications that can be implemented using programs with many unique subroutines, e.g., flight simulation. A data parallel computation divides data among the processors. Processors may run the same program but work on different subsets of the data. Algorithms using this technique are called data parallel algorithms [HiSt86]. Data parallelism is appropriate for applications that perform the same operations repeatedly and independently on a large set of data. In terms of programming, programs with loops that handle static and regular data structures are suitable for data parallel decomposition. Typical applications of data parallel algorithms include circuit simulation, network simulation, matrix operations, signal processing, and ray tracing.

From the perspective of scheduling, a parallel computation can be characterized as either concurrent or pipelined [HwBr84, Squi86]. Concurrency exploits spatial parallelism by utilizing several processors to execute multiple independent tasks simultaneously. Tasks may be data parallel or function parallel. Pipelining exploits temporal parallelism, in which each processor (called a stage) behaves like a filter or transformer that operates on its input data and passes output data to the succeeding processor.
Through pipelining, communication occurs only between fixed, neighboring stages, and a datum, once it enters the system, is used or modified repeatedly along the pipeline. Thus, a very high ratio of computation rate to I/O rate results. One of the most important characteristics of pipelined computation is the notion of data flows — data flowing through the pipeline as they are processed stage by stage. To see whether a computation contains data flows, the following flow test suggested in [NeSn87] is very helpful:

Flow test: Select an arbitrary edge (not necessarily at the boundary) and "radioactively tag" a single transmission across that edge. As the computation progresses, let all values computed with one or more radioactive values become radioactive. Then, define the "contaminated region" as the set of processors and edges touched by some radioactive values.

An algorithm passes the flow test if the edge over which the initial value was transmitted is on the boundary of the contaminated region. Note that the flow test is not perfect, in the sense that there are algorithms generally called pipelined or systolic which fail the test, e.g., the systolic array for matrix multiplication proposed in [KuLe78]. However, it does capture the essence of the flow concept and serves well in most cases.

From the above discussion, we can see that there are four different styles of parallel computations, as shown in Table 1.2. In the remainder of this section, examples will be used to further explain the basic idea of each style.

Table 1.2. Different styles of parallel computation

                                       Partition
    Scheduling       Function                          Data
    Concurrent       Concurrent function parallel      Concurrent data parallel
    Pipelined        Pipelined function parallel       Pipelined data parallel

(1) Concurrent function parallel computation

In this style of computation, efforts are devoted to identifying concurrently executable operations or program modules. Consider the following program segment:

    l := 10;
    for i := 1 to k do                                          (1.1)
        d_i := l × a_{i-1} + b_i × c_{i+1};

Note that in the program segment above, l × a_{i-1} is independent of b_i × c_{i+1}. Therefore, these two operations can be executed in parallel, as shown in Figure 1.2. Here, two processors are used, and it is easy to see that the workloads are not always balanced. Note that concurrent function parallel computation can also be applied to other program constructs, such as a program statement, a function block, or a subroutine. Most scheduling and load balancing algorithms concentrate on scheduling these program modules to achieve concurrent function parallelism [Gonz77].

[Figure 1.2. A concurrent function parallel computation: processor 0 computes tmpa := l × a_{i-1}, processor 1 computes tmpb := b_i × c_{i+1}, and d_i := tmpa + tmpb.]

(2) Concurrent data parallel computation

If data parallelism is used in decomposing the program segment (1.1), we obtain an execution scheme similar to that in Figure 1.3. Since there is no data dependency between loop iterations, the iterations in (1.1) can be executed on multiple processors in parallel — the processors perform the same task but use different data. A DMM can be used to execute the computation shown in Figure 1.3, in which all array elements are downloaded into the corresponding processors beforehand, and then processor 0 (or the host) broadcasts the value of l (10 in (1.1)) to the other processors. Note that the number of processors is not necessarily the same as the loop count. We can use n (

Consider the multiplication of two matrices A and B, A × B = C.
Figure 2.1(a) describes the task, where circles represent data vertices (the matrices) and squares represent operation vertices (the multiplication). One way of performing the multiplication is to partition A and B into four submatrices as follows:

    A × B = [ A_00  A_01 ] × [ B_00  B_01 ] = [ C_00  C_01 ]
            [ A_10  A_11 ]   [ B_10  B_11 ]   [ C_10  C_11 ]

where C_ij = C_ij^(0) + C_ij^(1) = A_i0 B_0j + A_i1 B_1j, 0 ≤ i, j ≤ 1. The P-net shown in Figure 2.1(b) defines the operations necessary to calculate the matrix product in this case. The set of initial data vertices is U_i = { A_ij, B_ij | 0 ≤ i, j ≤ 1 }, and the set of resultant data vertices is U_r = { C_ij | 0 ≤ i, j ≤ 1 }. Suppose that both matrices A and B have size 64×64, and that scalar multiplications and additions both take 1 unit of time. Then each data vertex represents a 32×32 submatrix and has size f_u = 1024. Also, f_v = 65,536 for each multiplication vertex and f_v = 1024 for each addition vertex. □

[Figure 2.1. Matrix multiplication and its corresponding P-net: (a) the original task with data vertices A, B, C and a single multiplication vertex; (b) the P-net over the submatrices A_ij and B_ij, with multiplication and addition vertices producing C_00, C_01, C_10, and C_11.]

The dynamic behavior of a P-net is described by token markings. A token in a data vertex u_i indicates the existence of a data element, which occupies a memory space of size f_u(u_i). Data vertices can be viewed as FIFO queues for tokens. A token has two states: busy and ready. A token is busy when it is first created and its data elements are being generated by the preceding operation vertices. A token becomes ready when its data elements are ready for use by the succeeding operation vertices. Also associated with a token is a variable tag, which is initialized to the out-degree of the corresponding data vertex and is decremented by 1 whenever one succeeding operation finishes referencing the token. The basic firing rules for an operation vertex of a P-net are as follows:

(1) An operation is enabled if and only if each of its input data vertices holds at least one ready token.
(2) An operation can fire if it is enabled.
(3) When an operation v_i fires, one busy token is deposited in each of its output data vertices, with the tag initialized accordingly.
(4) The operation continues firing for f_v(v_i) units of time.
(5) At the end of the firing, the busy tokens in all output vertices become ready, and the tags of the ready tokens in all input vertices are decremented by 1. If a tag is reduced to 0, the corresponding ready token is removed.

Figure 2.2 illustrates the execution of the P-net shown in Figure 2.1 at various instants of time, assuming maximal parallelism. It can be seen from Figure 2.2 that the total job execution time is 66,560 and the maximum memory space required is 16,384. Note that for a pure sequential system (e.g., a uniprocessor system) the computation time becomes 528,384, which is simply the sum of all the f_v values.

The execution of a P-net on a DMM can be characterized from two perspectives: one is time (when should an operation vertex be fired?) and the other is space (where should a data vertex reside?). A P-net can be modified to incorporate the effect of partitioning and communication on DMMs. Consider the following example:

Suppose the matrix multiplication in Example 2.1 is to be scheduled on a DMM with three processors. A possible allocation of the P-net among these three processors is shown in Figure 2.3. The effect of communication delay is represented by introducing new operation vertices, called communication vertices and denoted by v_c.
The buffers for received data in the receiving processors are represented by buffer vertices, which are the outputs of communication vertices. The value of f_v(v_c) is the time to transmit the corresponding data elements from the source processor to the destination processor. Note that, given a P-net and a mapping of the vertices to processors, it is straightforward to incorporate communication delay and derive the P-net shown in Figure 2.3. □

In summary, data generation and consumption in a P-net are represented by operation firings and token movements. Thus, data vertices in a P-net serve as templates of data — a token in a data vertex denotes the existence of a data element. Communication and buffer vertices are introduced to represent communication and partitioning. Space and time parameters are incorporated into the model, which allows the system status and resource requirements to be traced. Thus, P-nets are more suitable for algorithm development and performance analysis on DMMs.

[Figure 2.2. Execution of a P-net at various time instants, plotted against space (peaking at 16,384) and time, with fired operations, busy tokens, and ready tokens marked.]

[Figure 2.3. Partitioning a P-net among three processors (processor 0, processor 1, and processor 2), with communication and buffer vertices inserted where data cross processor boundaries.]

Note that basic Petri nets [Pete77] are a special case of P-nets in which f_v = f_u = 0 for all vertices. This implies that P-nets have at least the same modeling and computing power as Petri nets.

2.1.2. Variations and Analysis of P-nets

Based on the characteristics of data parallel algorithms, we can simplify the discussion by considering a subset of P-nets with the following properties:

(1) acyclic, i.e., the P-net does not contain a loop,
(2) decision-free, i.e., every operation vertex has only one output data vertex,
(3) safe, i.e., any data vertex can contain at most one token at any time, and
(4) functional, i.e., every data vertex is generated by at most one operation vertex.

From these properties, a much simpler representation can be derived by noting that there is a one-to-one correspondence between a data vertex and an operation vertex. Thus, corresponding to each such P-net, G = (U, V, E, f_u, f_v), a simplified graph, called a D-net, can be defined. A D-net is a four-tuple, D_G = (U, E', f_u, f'_u), where

    (u_i, u_j) ∈ E'  if and only if there exists a v ∈ V such that (u_i, v) ∈ E and (v, u_j) ∈ E
    f'_u(u_j) = f_v(v)  if (v, u_j) ∈ E

That is, f'_u(u_j) is the time to generate the data vertex u_j. Figure 2.4 depicts the corresponding D-net for the P-net shown in Figure 2.1.

If there is only one set of input data to be processed, i.e., there is no pipelining, then each data vertex will be fired only once and will contain at most one token. This case may occur if, for example, we consider only one iteration of a loop program or unfold the whole loop. It follows that the execution of a D-net on a DMM can be characterized by two state functions.
The state function T_u(u_i) denotes the time at which data vertex u_i becomes ready, and the function S_u(u_i) denotes the processor on which u_i is to be generated. Given these two functions, other state information can be deduced.

[Figure 2.4. The D-net corresponding to the P-net in Figure 2.1.]

For example, the total job execution time, T, is given by

    T = max_{u_i ∈ U_r} T_u(u_i)                                        (2.1)

Similarly, the time at which a data vertex u_i becomes busy is given by

    T_u1(u_i) = T_u(u_i) - f'_u(u_i)

and the time at which u_i can release its token is given by

    T_u2(u_i) = max_{(u_i, u_j) ∈ E'} T_u(u_j)

It follows that the memory required for the execution is

    S = max_t Σ_{u_i ∈ U} K(u_i, t) f_u(u_i)                            (2.2)

where

    K(u_i, t) = 1 if T_u1(u_i) ≤ t ≤ T_u2(u_i), and 0 otherwise.

2.2. CONSIDERATIONS FOR PARTITIONING AND MAPPING

Scheduling in a multiprocessor involves the assignment of operations or data elements to particular processors for execution at particular instants. The multiprocessor scheduling problem has been studied extensively for over 20 years [Gonz77]. The list scheduling or critical path scheduling algorithm [AhCh74, Hu61, PoBa87] is the most well known. Unfortunately, this kind of scheduling algorithm does not consider communication delay, and is difficult to extend to incorporate it; thus, it is mainly applied to SMMs.

On the other hand, scheduling algorithms for DMMs emphasize minimizing inter-processor communication for a given problem. In general, the problem is solved by the min-cut algorithm [RaSH79] or is formulated as a variation of the quadratic assignment problem [MaLT82]. One drawback of these scheduling methods is that they ignore data dependence relationships. This may result in processor idleness due to synchronization.

To take into account both communication delay and data dependency, the scheduling problem should be studied at two levels — inter-processor and intra-processor. Inter-processor scheduling is concerned with the partitioning and mapping of a given algorithm onto a multiprocessor. Given a mapping, intra-processor scheduling is concerned with the execution sequencing within each processor. Scheduling decisions at one level will affect those made at the other level.

In this section, general considerations for partitioning and mapping algorithms are studied first, in Sections 2.2.1 and 2.2.2, respectively. Then we concentrate on partitioning and mapping methods for algorithms represented by P-nets in Section 2.2.3.

2.2.1. Number of Nodes and Partitions

Given an application and the associated D-net, the first problem facing an algorithm designer is the number of processors and the algorithm partitioning scheme that should be used. The primary goal of algorithm partitioning is to partition the algorithm (i.e., the corresponding D-net) so that the processors have an equal amount of load to work with. The "load" of a processor should include not only the computation time of the assigned task but also the communication time with other processors and the idle time due to synchronization (see Example 2.1). Therefore, if the communication takes more time than the computation, the communication delay between two communicating vertices residing in two different processors will negate the gain due to parallelism. In this case, we should use fewer processors and group several related vertices together to increase the computation-to-communication ratio. We refer to the vertices assigned to the same processor as a group or partition.
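To make the notion of load concrete, the following sketch (an illustration only, not an algorithm from this thesis) computes, for a candidate partition of a small D-net, each group's total computation time and the volume of data that must be communicated between groups. The example D-net, its f'_u and f_u values, and the partition are all invented for illustration, and idle time due to synchronization is not modeled.

    /* Illustrative sketch: per-group computation load and inter-group traffic
     * for a candidate partition of a D-net.  fprime[v] plays the role of
     * f'_u(u_v) (time to generate vertex v) and fsize[v] the role of f_u(u_v)
     * (size of the data element).  All values are invented for the example. */
    #include <stdio.h>

    #define NV 6                               /* data vertices           */
    #define NE 6                               /* edges of E'             */
    #define NG 2                               /* groups in the partition */

    static const double fprime[NV] = { 4, 4, 4, 4, 2, 2 };
    static const double fsize[NV]  = { 1024, 1024, 1024, 1024, 1024, 1024 };
    static const int edge[NE][2]   = { {0,4}, {1,4}, {2,5}, {3,5}, {4,5}, {1,5} };
    static const int group[NV]     = { 0, 0, 1, 1, 0, 1 };   /* vertex -> group */

    int main(void)
    {
        double load[NG] = { 0.0, 0.0 };
        double traffic = 0.0;

        for (int v = 0; v < NV; v++)           /* computation assigned to each group */
            load[group[v]] += fprime[v];

        for (int e = 0; e < NE; e++)           /* data whose producer and consumer   */
            if (group[edge[e][0]] != group[edge[e][1]])   /* sit in different groups */
                traffic += fsize[edge[e][0]];             /* must cross the network  */

        for (int g = 0; g < NG; g++)
            printf("group %d: computation load = %.1f\n", g, load[g]);
        printf("inter-group communication volume = %.1f bytes\n", traffic);
        return 0;
    }

Grouping more vertices into the same partition raises the computation term of that group while removing edges from the cut, which is exactly the trade-off described above.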
It follows that a good partitioning scheme, in addition to distributing the workload, should also minimize the communication traffic between groups and match the number of groups to the capacity of the host machine. At the end of the partitioning phase, an application contracted graph is obtained. A contracted graph, M = (V_M, E_M), has k = |V_M| vertices, where k ≤ |V_H|. Each vertex corresponds to one group, and there is an edge between two vertices in M if communication exists between their corresponding groups. The partitioning phase is sometimes referred to as contraction [Berm87].

2.2.2. Mapping Considerations

The multiprocessor on which the algorithm is to be executed can be represented by an undirected graph, called the host graph, which is denoted by H = (V_H, E_H). Processor i in the multiprocessor is represented by a vertex v_i ∈ V_H, and a communication link between processor i and processor j is represented as an edge (v_i, v_j) ∈ E_H.

Based on the contracted graph, M, of an algorithm, a suitable processor interconnection pattern is chosen and each group in M is assigned to a different processor in H. The mapping procedure is trivial if there is no inter-group communication in the contracted graph, or if the host graph is completely connected. However, if the communication cost is non-uniform across the system and is a function of the distance between two communicating processors, then one must ensure that the communication takes place along the shortest path. For machines such as hypercube multiprocessors, which embed a large set of topologies, the choice of a suitable interconnection pattern is relatively easy and flexible.

In general, mapping is a two-phase operation. One phase, the group mapping phase, may be defined as g(v_i) = v_j, for each v_i ∈ V_M and some v_j ∈ V_H, which is a one-to-one mapping between vertices in M and processors in the host graph H. Two types of paths in a mapping may be identified: a logical path, which is an edge in the contracted graph, and a physical link, which is an edge in the host graph. The second phase of the mapping, the path mapping phase, is to define a function, h, that maps the logical paths onto the physical links. The whole mapping function, f, is then defined as f = g ∘ h.

The number of physical links that a logical path has to traverse defines the dilation of that logical path. The number of logical paths which traverse a particular physical link is referred to as the sharing of that link. Since a physical link allows one message to pass at a time, the higher the sharing, the longer the communication delay will be. Also, the more links a message must traverse, the greater the communication traffic will be. Thus, a good mapping function, f = g ∘ h, should try to minimize the average dilation cost and sharing cost.

It can be shown that, for an arbitrary problem topology, both component partitioning and group mapping are NP-complete problems [GaJo79]. Therefore, a good heuristic approach should be used, such as simulated annealing [Berm87]. It is believed that the partitioning and mapping problem must take into account the characteristics of the underlying architecture as well as the behavior of the algorithms running on such a system.

2.2.3. Inter-Processor Scheduling

As mentioned earlier, the load of a processor includes the computation, communication, and idle time of the processor. Partitioning and mapping use this timing information to distribute the workload evenly among the processors.
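As an illustration of the dilation and sharing costs defined in Section 2.2.2, the sketch below evaluates a candidate group mapping onto a small hypercube host graph. The contracted graph, the mapping, and the use of dimension-order routing to decide which physical links a logical path occupies are all assumptions made for this example, not details taken from the thesis.

    /* Illustrative sketch: dilation and sharing of a candidate mapping.
     * The contracted graph M, the mapping g (group -> hypercube node), and
     * dimension-order routing are assumptions chosen for the example. */
    #include <stdio.h>

    #define DIM 3                                    /* 8-node hypercube host graph */
    #define NM  4                                    /* groups in M                 */
    #define EM  4                                    /* logical paths (edges of M)  */

    static const int medge[EM][2] = { {0,1}, {1,2}, {2,3}, {3,0} };
    static const int g[NM]        = { 0, 3, 5, 6 };
    static int sharing[1 << DIM][DIM];               /* traffic on link (v, v ^ (1<<d)) */

    static int hamming(int a, int b)                 /* links a shortest path must      */
    {                                                /* traverse = number of differing  */
        int c = 0;                                   /* bits between the two node ids   */
        for (int x = a ^ b; x; x >>= 1) c += x & 1;
        return c;
    }

    int main(void)
    {
        int total = 0;

        for (int e = 0; e < EM; e++) {
            int src = g[medge[e][0]], dst = g[medge[e][1]];
            int dil = hamming(src, dst);
            total += dil;
            printf("logical path %d-%d on nodes %d-%d: dilation %d\n",
                   medge[e][0], medge[e][1], src, dst, dil);

            for (int d = 0, cur = src; d < DIM; d++)       /* dimension-order route    */
                if (((cur ^ dst) >> d) & 1) {
                    int next = cur ^ (1 << d);
                    sharing[cur < next ? cur : next][d]++; /* charge the physical link */
                    cur = next;
                }
        }
        printf("average dilation = %.2f\n", (double)total / EM);

        for (int v = 0; v < (1 << DIM); v++)
            for (int d = 0; d < DIM; d++)
                if (sharing[v][d] > 1)                     /* links carrying several paths */
                    printf("link %d-%d shared by %d logical paths\n",
                           v, v ^ (1 << d), sharing[v][d]);
        return 0;
    }

A mapping heuristic such as simulated annealing would repeatedly perturb g and keep the changes that reduce these two costs.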
Processor idle time, however, cannot be determined until the intra-processor execution sequence is known. The intra-processor schedule, in turn, determines the inter-processor synchronization. On the other hand, an inter-processor partitioning and mapping scheme must be decided before the intra-processor schedule can be determined. One approach to breaking this dilemma is to perform partitioning and mapping without concern for the processor idle time. Then, given a D-net, (U, E', f_u, f'_u), the inter-processor scheduling problem can be formulated as a quadratic assignment problem as follows.

Let d_kl denote the distance between processors k and l in H, and let q_k f'_u(u_i) be the computation cost of vertex u_i on processor k, where q_k is the speed factor of processor k. Define the assignment matrix X to be such that x_ik = 1 if vertex u_i is mapped to processor k, and 0 otherwise. Then, the cost of an inter-processor schedule is

    Cost(X) = Σ_i Σ_k x_ik q_k f'_u(u_i) + Σ_{(u_i, u_j) ∈ E'} Σ_k Σ_l x_ik x_jl c f_u(u_i) d_kl

where c is the cost of transmitting one byte of data between two adjacent processors. The memory limitation requires that, for each processor k,

    Σ_i x_ik f_u(u_i) ≤ M_k

where M_k is the maximum size of the available memory on processor k.

The above formulation can be solved by iterative improvement, which, starting from an arbitrary allocation, improves the allocation iteratively through such techniques as the one proposed in [KeLi70], simulated annealing [KiGV83], or neural networks [HoTa85]. Using the iterative method, we may apply the inter- and intra-processor scheduling alternately in each iteration to obtain a more accurate cost estimate. Note that, through intra-processor scheduling, it is easy to identify the critical path(s) of the computation. New configurations which involve moving vertices on these critical paths have a better chance of reducing the total execution time of the algorithm. The problem then becomes that of identifying, on the critical paths, the vertex whose reallocation will reduce the cost the most, and determining where the candidate vertex should be moved.

2.3. CONSIDERATIONS FOR SCHEDULING WITHIN PROCESSORS

In this section, issues in intra-processor scheduling are discussed. Given a D-net D = (U, E', f_u, f'_u) and a partition (U^(1), U^(2), ..., U^(k)), where U^(i) ∩ U^(j) = ∅ for all i ≠ j and U^(1) ∪ U^(2) ∪ ... ∪ U^(k) = U, intra-processor scheduling attempts to develop an execution schedule, X, for all u ∈ U. The partition U^(i), 1 ≤ i ≤ k, is assigned to processor i, where k is the number of processors used. Figure 2.5 shows a possible schedule for the D-net shown in Figure 2.4. Dashed arcs are added by the schedule to enforce the total ordering. In the following discussion, we will not consider the scheduling of initial data vertices, U_i, or buffer vertices, U_b. Initial data vertices are assumed to have been loaded beforehand, while the execution time of a buffer vertex can follow that of its predecessor vertex, if remote data are assumed to be transferred to local nodes as soon as they are generated.

A lower bound for the total execution time T of an algorithm is given by

    T ≥ max_{1 ≤ i ≤ k} Σ_{u ∈ U^(i)} f'_u(u)

However, due to data dependence relationships, a processor may idle while waiting for data from other processors. Thus, the above lower bound may not be achievable. To obtain the optimal schedule, we must first determine the cost of a given schedule. Recall that the state of a D-net can be completely described by the state functions T_u and S_u (see Section 2.1.2).
[Figure 2.5. A schedule for the D-net in Figure 2.4, with the vertices distributed among processors 0, 1, and 2.]

S_u is given by the partition (i.e., by the inter-processor scheduling), and it is easy to see that

    T_u(u_i) = f'_u(u_i) + max_{u_j ∈ R(u_i, X)} T_u(u_j)   if u_i ∉ U_i
    T_u(u_i) = 0                                            if u_i ∈ U_i            (2.3)

where, as defined in Section 2.1.2, f'_u(u_i) is the time to compute u_i, and R(u_i, X) is the set of predecessor vertices of u_i in the schedule X. A vertex u_j is a predecessor of another vertex u_i if either (u_j, u_i) ∈ E' or u_i is generated immediately after u_j in the schedule. From T_u and Equations (2.1) and (2.2), one can obtain the total execution time of the schedule and the maximum memory needed for each partition. If a schedule's memory requirement does not exceed the maximum available, then the schedule is a feasible schedule.

One way to determine an optimal or near-optimal intra-processor schedule for a given D-net is by means of the branch-and-bound method. The cost of a particular search path is expressed as the sum of the cost of the partial solution found so far and the cost from the present partial solution to the final solution (the future cost). Possible candidates for expansion in a search path are those data vertices in the D-net that are enabled at that point. For intra-processor scheduling, a simple estimate of the future cost is

    max_{1 ≤ i ≤ k} Σ_{u ∈ Ū^(i)} f'_u(u)

where Ū^(i) is the set of data vertices in partition U^(i) that have not yet been scheduled. A particular search path can be pruned if it violates space or precedence constraints.

Though branch-and-bound is a powerful way to search for feasible solutions, its complexity still grows exponentially with the problem size. One heuristic, adapted from critical-path scheduling [AhCh74, PoBa87], works by associating two level values with each vertex in a D-net. The forward level of a vertex u, α_f(u), denotes the maximum "distance" from u to a resultant vertex and is defined as

    α_f(u) = max_{P_f} Σ_{u_i ∈ P_f} f'_u(u_i)

where P_f is a path from the vertex u to a resultant data vertex. The backward level of a vertex u, α_b(u), denotes the maximum "distance" from u to an initial vertex and is defined similarly. An intra-processor schedule can thus be found by the following steps:

(1) For each partition U^(i), 1 ≤ i ≤ k, arrange all vertices in that partition in nonincreasing order into a linear list L_i, according to α_f first and then α_b.
(2) For each partition U^(i), 1 ≤ i ≤ k, check the space requirement of L_i using Equations (2.1) and (2.2).
(3) For each partition U^(i), 1 ≤ i ≤ k, modify L_i if the space constraint is violated.
(4) Report the resultant L_i's as the schedule if step (3) succeeded. Otherwise, there is no feasible solution for the given set of constraints.

A modification in step (3) may involve swapping the order of generation of data vertices within a processor so that the time interval during which a piece of data occupies memory is minimized. Another way of increasing memory efficiency is to control the time at which a data vertex is transferred from one processor to another. If a schedule cannot be obtained at the end of step (4), then we might want to increase the number of processors or use another optimization method, such as the branch-and-bound method introduced earlier.

CHAPTER 3

DESIGNING AND ANALYZING DATA PARALLEL ALGORITHMS

In the previous chapter, we discussed general considerations in designing parallel algorithms. In this chapter, we turn our focus to designing data parallel algorithms for DMMs, with emphasis on implementation and analysis issues.
Designing parallel algorithms cannot be isolated completely from the architectural aspect of the underlying multiprocessor. Design and implementation of a parallel algo- rithm on a multiprocessor involves the knowledge of the specific language, operating system, and multiprocessor that are used. On DMMs, this may imply the knowledge of how send and receive routines are specified in the language and invoked in the operat- ing system, how messages are maintained and transmitted, and what are the rates of performing computation and communication. The more knowledge one has, the more efficient the resultant program will be. The same is true for analyzing algorithms. The purposes of analyzing algorithms include estimating algorithm performance to identify inefficiencies in the design and suggesting better scheduling schemes. Again, accurate models of the underlying archi- tecture and algorithm behaviors are needed. These models dictate what timing informa- tion is relevant and how timing information can be combined to provide an overall per- formance figure. 35 36 In this chapter, the design, programming, and analysis of data parallel algorithms for two typical examples, array summation and matrix multiplication, are discussed [NiKP87]. These algorithms were implemented on a 64-node NCUBE multiprocessor. Therefore, in Section 3.1, important characteristics of the NCUBE are reviewed first. Then, modeling computation and communication on DMMs is discussed in Section 3.2, including the measured timings on the NCUBE. Based on the model, data parallel algorithms for array summation and matrix multiplication are presented in Section 3.3 and 3.4, respectively. 3.1. NCUBE Multiprocessors The algorithms and experiments discussed in this thesis are implemented on the NCUBE multiprocessor. Therefore, in this section, the architecture of the N CUBE mul- tiprocessor is reviewed. Section 3.1.1 will concentrate on the hardware aspect of the NCUBE, followed by an introduction in Section 3.1.2 to the software aspect of the NCUBE. 3.1.1. The Hardware The NCUBE multiprocessor is a DMM with a hypercube interprocessor connec- tion which is capable of supporting up to 1024 nodes [HaMS86]. A hypercube of dimension 11 consists of N=2" nodes, each with a direct connection to n other nodes. There is an edge between nodes i and j if i and j have exactly one bit difference in their binary representation, where 0 S i, j SN -1. The most attractive feature of the hypercube tOpology is the rich set of topologies, including meshes and rings [SaSh85], that can be embedded inside the hypercube. Generally speaking, the hypercube offers a good bal- ance between node connectivity, communication diameter, and algorithm embeddabil- ity. That is the reason why most DMMs available today use the hypercube as the 37 underlying interprocessor connection network [Hwan87]. The processor in each node of the NCUBE is a 32-bit proprietary processor which includes a float-point arithmetic unit, memory management logic, and interprocessor communication mechanism, all on a single VLSI chip. Combined with six memory chips for a total of 512 Kbytes, a node consists of only 7 chips. The instruction set is very similar to that of the VAX, which include arithmetic and logic operations, shift, and jump. 
Using a 10 MHz clock, nonarithmetic instructions can be executed at about 2 MIPS (million instructions per second), single-precision floating-point operations at 0.5 MFLOPS (million floating-point operations per second), and double-precision operations at 0.3 MFLOPS. The NCUBE used in this thesis research has 64 nodes with a clock rate of 7 MHz. Each node has 22 bit-serial I/O lines which are paired into 11 bidirectional channels. In this way, a node can communicate with up to 10 other neighbors to form a 10-dimensional hypercube and has one more link to connect to the host. Communication with other nodes is handled through asynchronous DMA.

The host board in the NCUBE multiprocessor uses an Intel 80286 as the CPU and has support for various peripherals. Within each host board, there are 16 NCUBE node processors (called the I/O processors) that provide 128 channels to the hypercube; each I/O processor is connected to 8 nodes in the hypercube. The memory spaces of these I/O processors are part of the 4-Mbyte memory space of the host CPU. In this way, memory in the host board is shared between the 80286 CPU and the 16 I/O processors.

The hardware design of the NCUBE multiprocessor is excellent. A 1024-node NCUBE has a potential execution rate of 2000 MIPS or about 500 MFLOPS. However, it is up to the software, that is, the operating system and the algorithm design, to realize this massive computing power.

3.1.2. The Software

The operating system running on the host of the NCUBE multiprocessor is called AXIS, which is UNIX-like but lacks many of the features of a mature UNIX. To AXIS, the hypercube is nothing more than a device. As with ordinary files, the hypercube can be opened, written to, read from, and closed. Also, AXIS allows a user to allocate disjoint subcubes. Thus, more than one user or application can share the hypercube. Programming on the NCUBE can be done using conventional C or FORTRAN augmented with system primitives to handle messages and cubes.

The node operating system is called VERTEX, which is a small nucleus (less than 4 Kbytes) residing in each node. Its primary function is to provide communication and process management. System primitives available from VERTEX include:

    nread()    receive messages
    nwrite()   send messages
    ntest()    test for message arrivals
    ntime()    return the time since node initialization
    whoami()   return node parameters such as the node id and subcube dimension

From a node program, nwrite() can be called as follows:

    nwrite(msg, msglen, dest, type, status, flag)

where msg is a linear array holding the message, msglen is the length of the message in bytes, dest is the logical id of the receiving node, type indicates the type of the message, and status and flag are for error checking purposes. Therefore, if the data in a message are scattered in different places, the program must first gather the data into a linear array before nwrite() can be called.

A pool of system buffers is maintained by VERTEX for communication. These buffers are used on both the sending and receiving nodes to support non-blocking send and blocking receive operations [MuBA87]. Thus, the caller of nwrite() can resume after a communication buffer has been allocated and the message has been copied into it. Buffers also serve the purpose of queuing messages when the communication channel is busy. When the DMA-controlled message transfer has been completed, an interrupt is signaled. The corresponding interrupt service routine will release the communication buffers.
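To make the gathering requirement concrete, the following minimal C sketch shows a program copying a scattered submatrix into a linear buffer and passing it to nwrite(). Only the nwrite() argument order follows the description above; the matrix dimensions, the destination id, the message type, and the exact prototype (including how status and flag are passed) are illustrative assumptions.

    /* VERTEX supplies nwrite(); this prototype is an assumption for the sketch. */
    extern int nwrite(char *msg, int msglen, int dest, int type,
                      int *status, int *flag);

    #define M        64     /* full matrix is M x M (assumed)           */
    #define ROWS     16     /* submatrix is ROWS x COLS (assumed)       */
    #define COLS     16
    #define DEST     1      /* logical id of the receiving node         */
    #define MSG_TYPE 10     /* application-chosen message type          */

    float a[M][M];          /* matrix stored in row-major order         */
    float buf[ROWS*COLS];   /* linear array required by nwrite()        */

    void send_submatrix(int r0, int c0)
    {
        int i, j, k = 0;
        int status, flag;

        /* gather: copy the scattered submatrix elements into one linear array */
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                buf[k++] = a[r0 + i][c0 + j];

        /* send the gathered block; argument order as given in the text above */
        nwrite((char *) buf, (int) sizeof(buf), DEST, MSG_TYPE, &status, &flag);
    }

The reverse operation on the receiving side would scatter the linear buffer returned by nread() back into the corresponding matrix locations.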
The arguments to nread() are similar to those of nwrite(). nread() examines the unclaimed receiving buffers for a message specified in its arguments. When the requested message has arrived and is located, nread() will copy the message to the user process space and release the message buffer. Otherwise, it will wait until an appropriate message arrives. Thus, nread() supports a blocking receive.

Note that, even though the low-level hardware implementation of an NCUBE node supports broadcasting, there is no broadcast primitive provided in VERTEX. Note also that VERTEX uses a store-and-forward mechanism to relay messages. This mechanism has been shown to be inefficient and has been abandoned in second-generation DMMs [DaSe87, ShFi87].

Using the current AXIS and VERTEX, not only is the programming more difficult than using the shared-variable model, but a very high communication overhead is induced by the inefficiency of the software. It follows that appropriate parallel programming tools and algorithm development methodologies are critical for efficiently utilizing the NCUBE and other DMMs.

3.2. MODELING COMPUTATION AND COMMUNICATION ON DMMs

Modeling an algorithm on a computer system means identifying the most essential characteristics of the algorithm, with the goal of predicting the behavior of the algorithm when executed on the machine. To achieve this goal, we need to define design parameters, such as the size of the partitions, to characterize the behavior of the algorithm. Furthermore, we need to obtain system parameters, such as the message transmission time and the primitive computation time, of the underlying architecture to characterize the behavior of the machine. In this section we will show how to model communication and computation in a DMM, and what the measured system parameters on the NCUBE are.

From the viewpoint of programming, communication is nothing more than invoking system calls to send and receive messages. In general, when invoking these communication primitives, the overhead involved can be classified as overhead that can be overlapped with the computation processor and overhead that cannot. For each category, one can further distinguish between the overhead that is incurred per message and the overhead that is paid on a per-byte basis. Thus, for overhead that cannot be overlapped with the computation, we have the following estimates [GrRe86]:

$$\sigma_h + \tau_h s \quad \text{for communication initiated by the host} \tag{3.1}$$
$$\sigma_n + \tau_n s \quad \text{for communication initiated by the nodes} \tag{3.2}$$

where
$\sigma_h$ = message startup delay in the host,
$\tau_h$ = single-byte transmission time for messages initiated by the host,
$\sigma_n$ = message startup delay in the node,
$\tau_n$ = single-byte transmission time for messages initiated by the nodes, and
$s$ = size of the message (in bytes).

On the NCUBE, for example, $\sigma_h$ and $\sigma_n$ will include such overhead as invoking the send and receive primitives, nwrite() and nread(), and setting up the DMA channels [MuBA87]. On the other hand, $\tau_h$ and $\tau_n$ will include the overhead of copying data to and from the system buffers. Similarly, we can define $\sigma_h'$, $\tau_h'$, $\sigma_n'$, and $\tau_n'$ to characterize the overhead that can be overlapped with the computation processor. Specifically, $\tau_h'$ and $\tau_n'$ will include the time to transmit the message over the communication link.
It follows that, to estimate the time to transmit a message of $s$ bytes from the host to a node, we can use the following formula:

$$(\sigma_h + \tau_h s) + (\sigma_h' + \tau_h' s) + \tau_n s \tag{3.3}$$

The first term accounts for the time to invoke the nwrite() primitive and to copy the data to the system buffer, the second term takes care of the overhead of transmitting the data over the link, and the final term is the time for the destination node to copy the message to the receiving process. It is assumed here that the receiving process has already issued the nread() call and is waiting for the message. Thus, the time $\sigma_n$ is not included.

There is another important factor which will affect the performance of the resultant program, especially when matrices are involved. Note that matrices are usually stored in memory in a row-major or column-major fashion. However, when sending a submatrix, the host process must first gather all elements of the submatrix into a consecutive linear array before it passes the starting address of that array to the send primitive (see Figure 3.1(a)). Similarly, when receiving a submatrix, the system must copy the message into a linear array before the receive primitive returns. Then, the receiving process scatters the data to the corresponding locations in the matrix (see Figure 3.1(b)). Note that the gathering/scattering operation is performed on the host sequentially and cannot be parallelized. Therefore, it is a very expensive operation.

Figure 3.1. Gathering/scattering operations in the NCUBE ((a) gathering: matrix elements in the process space are copied into a linear message buffer and then into a communication buffer; (b) scattering: the reverse sequence on the receiving side).

The host communication overheads with and without data gathering/scattering are both measured. To measure the host overhead in sending a message, the host sends r rounds of s-byte messages to n nodes. Similarly, the host overhead in receiving a message is measured by using n nodes, each of which sends r messages of s bytes to the host. The time, T, to perform these operations is measured on the NCUBE using the VERTEX system call ntime(). To do this, the host designates one node, say node 0, as the time server. Immediately before the operation begins, the host sends one message to inform node 0 to start counting the time. Then, immediately after the operation finishes, the host sends another message to node 0 to stop the clock, and node 0 sends back the elapsed time. When the number of messages in the experiment is large, the overhead involved in sending these control messages can be ignored. The value returned by ntime() is measured in ticks, where 1 tick equals 1024/(processor clock rate in Hz). Thus, if the processor clock rate is 7 MHz, then 1 tick equals 146.2 usec.

From the above experiments, the host overhead in sending or receiving a message is given by T/(nr). The average time to send one message from the host to a node is depicted in Figure 3.2. Data gathering/scattering is simulated in the experiments by copying the message from one array to another in the host before it is sent or received. It is easy to see from Figure 3.2 that the measured data can be fitted with straight lines.
In fact, if data gathering/scattering is included, then the measured host communication overhead can be modeled by

$$\sigma_h + \tau_h s = 8.20 + 0.068\,s \quad \text{for sending a message} \tag{3.4}$$
$$\sigma_{hr} + \tau_h s = 4.55 + 0.068\,s \quad \text{for receiving a message} \tag{3.5}$$

Figure 3.2. Measured host overhead in sending a message (message transmission time in ticks versus message length in bytes, with and without gather/scatter).

On the other hand, if the computation does not involve data gathering/scattering, then the host communication time can be modeled by

$$\sigma_h + \tau_h s = 8.08 + 0.007\,s \quad \text{for sending a message} \tag{3.6}$$
$$\sigma_{hr} + \tau_h s = 4.21 + 0.007\,s \quad \text{for receiving a message} \tag{3.7}$$

Note that (3.4), (3.5), (3.6), and (3.7) all conform to (3.1). Also, from the experiments, we observe that, although there is some overlapping between computation and communication (e.g., the average message transmission time, T/(nr), decreases as the number of rounds, r, increases), the net effect is too small to matter. Therefore, in analyzing algorithms run on the NCUBE, we will ignore the parameters $\sigma_h'$ and $\tau_h'$, which represent the communication overhead that can be overlapped with the computation processor.

Node communication overhead on the NCUBE can be measured by arranging n nodes into a ring. Again, r rounds of message exchanges are performed. In each round, a node sends one message of s bytes to the next node in the ring and receives one message from the previous node. The measured average message transmission time, T/(2nr), is plotted in Figure 3.3 against the message size. Again, the message transmission time can be modeled linearly by

$$\sigma_n + \tau_n s = 3.52 + 0.017\,s \tag{3.8}$$

Figure 3.3. Measured node-to-node communication time (message transmission time in ticks versus message length in bytes).

Note that (3.8) applies to both send and receive operations. Also, the overlapping between computation and communication is small; e.g., (3.8) is independent of the number of receiving nodes. Thus, $\sigma_n'$ and $\tau_n'$ can be ignored in the modeling. Note also that Expression (3.8) does not include data gathering/scattering operations, because in the algorithms studied in this thesis the nodes do not perform any gathering/scattering operations.

Finally, the time ($\tau_c$) to execute the operation $a \times b + c$ and the associated loop manipulation overhead is measured, where a, b, and c are floating-point numbers. Similarly, the time ($\tau_b$) for $a + b$ and the associated overhead is measured. The following table summarizes the system parameters measured on the NCUBE:

Table 3.1. System parameters on the NCUBE (all quantities are in ticks)

              sigma_h   sigma_hr   sigma_n   tau_h    tau_n    tau_c   tau_b
    with G/S   8.20      4.55       3.52      0.068    0.017    0.24    0.15
    w/o  G/S   8.08      4.21       3.52      0.007    0.013    0.24    0.15
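As an illustration of how the measured parameters are used, the following C sketch plugs the values of Table 3.1 into Equations (3.1) through (3.3). The helper names are ours, and, following the observation above, the overlappable terms $\sigma_h'$ and $\tau_h'$ are dropped.

    /* Plugging the measured NCUBE parameters of Table 3.1 into the linear
     * cost model of Section 3.2.  Times are in ticks (1 tick = 146.2 usec
     * at 7 MHz).  The constants below are the with-gather/scatter values. */
    #include <stdio.h>

    #define SIGMA_H   8.20   /* host send startup        */
    #define SIGMA_HR  4.55   /* host receive startup     */
    #define SIGMA_N   3.52   /* node startup             */
    #define TAU_H     0.068  /* host per-byte time       */
    #define TAU_N     0.017  /* node per-byte time       */

    /* Equation (3.1): host-initiated transfer of an s-byte message. */
    double host_msg_time(int s)  { return SIGMA_H + TAU_H * s; }

    /* Equation (3.2): node-initiated transfer of an s-byte message. */
    double node_msg_time(int s)  { return SIGMA_N + TAU_N * s; }

    /* Equation (3.3) with the overlappable terms dropped: host-to-node time
     * for an s-byte message, assuming the receiver has already posted nread(). */
    double host_to_node_time(int s) { return host_msg_time(s) + TAU_N * s; }

    int main(void)
    {
        int s = 2048;   /* a 2-Kbyte message, e.g., 512 four-byte elements */
        printf("host send  : %.1f ticks\n", host_msg_time(s));
        printf("node send  : %.1f ticks\n", node_msg_time(s));
        printf("host->node : %.1f ticks\n", host_to_node_time(s));
        return 0;
    }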
3.3. CASE STUDY — ARRAY SUMMATION

In this and the next section, we present two examples, array summation and matrix multiplication, to see how different factors, such as computation, communication, time, and space, may affect the performance of the algorithms. The algorithms were implemented on a 64-node NCUBE, which allows the discussion to relate to a more realistic environment.

Given a large set of numbers, $a_1, a_2, \ldots, a_L$, the array summation problem is to find the sum $A = a_1 + a_2 + \cdots + a_L$. To parallelize this operation, the array is partitioned into k subarrays, which are allocated to k processors of a hypercube computer. Each processor will receive one subarray and calculate a partial sum. The partial sums are then accumulated to give the total. The primary considerations here are the optimal number of processors to be used and the size of each partition.

Based on the way the partial sums are accumulated, we consider two different array summation schemes in Section 3.3.1. In the centralized accumulation method, the host collects all the partial sums from the processors and evaluates A. Another approach, the tree-structured accumulation method, uses a binary-tree reduction among all the nodes to accumulate the partial sums. In Section 3.3.2, the performance of the algorithms will be modeled using the analytic model introduced in Section 3.2, and, in Section 3.3.3, the analytic results will be studied and compared with experimental results from the NCUBE.

3.3.1. The Algorithms

In the following discussion, let L denote the size of the array and $x_i$ denote the size of the subarray loaded to node i, where $L = x_0 + \cdots + x_{k-1}$. The execution time (including local computation and communication) at node i is denoted by $t(x_i)$ and the size of the resultant data is denoted by $s(x_i)$. Suppose further that each data element takes b bytes. The P-net describing array summation using the centralized accumulation method is shown in Figure 3.4, where $X_i$, $0 \le i \le k-1$, is the set of array elements that are sent to processor i.

Figure 3.4. Array summation with centralized accumulation

Since there is no communication between nodes, any one-to-one mapping from the subarrays to the processors will generate the same result. A straightforward way of partitioning the array is the equal partitioning scheme, in which all subarrays have the same size. However, this scheme does not take into consideration the balance between computation and communication. Thus, another partitioning scheme, ratioed partitioning, is proposed, in which the array is divided in such a way that, when node i has just finished its uploading, node i+1 is ready to upload. By doing so, we can prevent the host from being idle between two adjacent data uploadings. In Section 3.3.2, we will show how to use the analytic model to obtain the optimal partition sizes for ratioed partitioning.

The P-net which describes array summation using the tree-structured accumulation method is shown in Figure 3.5, where k = 4 nodes are used with node 0 as the root of the reduction tree. The root of the reduction tree is responsible for returning A to the host. Note that this tree-structured reduction operation can be executed on a hypercube in such a way that the data movements between nodes in each step are only one hop away [SaSh85], i.e., all nodes only have to communicate with their direct neighbors. Again, either the equal partitioning scheme or the ratioed partitioning scheme can be used.

Figure 3.5. Array summation with tree-structured accumulation

3.3.2. Analytic Modeling

In this subsection, we will illustrate how to derive the execution time (T) for array summation. Consider first the centralized accumulation method using the ratioed partitioning scheme. The timing diagram of this method is shown in Figure 3.6, where C is the host idle time.
From the timing relationship in Figure 3.6, any pair $x_i$ and $x_{i+1}$, $i = 0, \ldots, k-2$, can be related as follows:

$$(\tau_n b\,x_i + \tau_b\,x_i) + (\sigma_n + \tau_n b + \tau_h b) + \tau_b + \sigma_{hr} = (\sigma_h + \tau_h b\,x_{i+1} + \tau_n b\,x_{i+1}) + \tau_b\,x_{i+1} + (\sigma_n + \tau_n b)$$

Figure 3.6. Timing diagram of array summation using centralized accumulation (the host downloads $x_0, \ldots, x_{k-1}$ and then uploads $s(x_0), \ldots, s(x_{k-1})$; node i computes for $t(x_i)$; C is the host idle time between downloading and uploading).

On the left-hand side, $t(x_i) = \tau_n b\,x_i + \tau_b\,x_i$ comprises the local execution time in node i, including copying data from the message buffer and calculating the partial sum, $(\sigma_n + \tau_n b + \tau_h b)$ is the uploading time (see Equation (3.3)), and $\tau_b$ and $\sigma_{hr}$ are the times during which the host accumulates the sum and invokes the next receive primitive, respectively. The quantities on the right-hand side can be interpreted similarly. Define

$$\alpha = \frac{\tau_n b + \tau_b}{\tau_h b + \tau_n b + \tau_b} \qquad \text{and} \qquad \beta = \frac{\tau_h b + \tau_b + \sigma_{hr} - \sigma_h}{\tau_h b + \tau_n b + \tau_b}$$

Then we have

$$x_{i+1} = \alpha x_i + \beta, \qquad i = 0, \ldots, k-2 \tag{3.9}$$

It is easy to derive from (3.9) that

$$x_i = \alpha^i x_0 + \frac{1 - \alpha^i}{1 - \alpha}\beta, \qquad i = 0, \ldots, k-1$$

Combined with the fact that $L = x_0 + \cdots + x_{k-1}$, we have

$$x_0 = \frac{(1-\alpha)L - k\beta}{1 - \alpha^k} + \frac{\beta}{1-\alpha} \tag{3.10}$$

Let $T_d$ denote the downloading time of the last k-1 subarrays, i.e.,

$$T_d = \sum_{i=1}^{k-1}(\sigma_h + \tau_h b\,x_i) = (k-1)\sigma_h + \tau_h b\,(L - x_0)$$

Then, the host waiting time C can be obtained from the timing relationship between the host and node 0 in Figure 3.6 as

$$C = \rho(\tau_n b\,x_0 + \tau_b\,x_0 + \sigma_n + \tau_n b - T_d + \sigma_{hr})$$

where

$$\rho(y) = \begin{cases} y & \text{if } y > 0 \\ 0 & \text{otherwise} \end{cases}$$

Note that C = 0 implies that the host is a bottleneck, because, before the host finishes downloading to node k-1, node 0 has already generated the first partial sum and is waiting to upload. In other words, the host cannot keep up with the computing speed of the nodes. This also implies that we are using so many processors that the communication dominates the computation. From the above expressions, the whole job execution time can be found as:

$$T = \sum_{i=0}^{k-1}(\sigma_h + \tau_h b\,x_i) + C + \sum_{i=0}^{k-1}(\sigma_{hr} + \tau_h b + \tau_b) = k(\sigma_h + \sigma_{hr} + b\tau_h + \tau_b) + b\tau_h L + C \tag{3.11}$$

For the equal partitioning scheme, the total execution time can be approximated as follows. If the host is the bottleneck, then we need only consider the operations in the host, which take

$$T_1 = k\left(\sigma_h + \tau_h b\frac{L}{k}\right) + k(\sigma_{hr} + \tau_h b + \tau_b)$$

On the other hand, if node computation dominates the array summation, then we need only consider the operations in node k-1, which take

$$T_2 = k\left(\sigma_h + \tau_h b\frac{L}{k}\right) + (\tau_n b + \tau_b)\frac{L}{k} + (\sigma_n + \tau_n b + \tau_h b) + \tau_b$$

It follows that the total execution time can be approximated by

$$T = \max\{T_1, T_2\} \tag{3.12}$$
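The following C sketch shows one way the ratioed partition sizes and the execution time of the centralized accumulation method could be computed from Equations (3.9) through (3.11). The function names, the choice of the no-gather/scatter parameters from Table 3.1, the element size of 4 bytes, and the treatment of non-integer partition sizes are illustrative assumptions.

    #include <stdio.h>
    #include <math.h>

    #define SIGMA_H  8.08   /* host send startup (no gather/scatter) */
    #define SIGMA_HR 4.21   /* host receive startup                  */
    #define SIGMA_N  3.52   /* node startup                          */
    #define TAU_H    0.007  /* host per-byte time                    */
    #define TAU_N    0.013  /* node per-byte time                    */
    #define TAU_B    0.15   /* time for one addition a+b             */

    /* Ratioed partition sizes x[0..k-1] and total time T for an array of
     * L elements of b bytes, summed on k nodes (centralized accumulation). */
    double exec_time(double L, int k, double b, double x[])
    {
        double denom = TAU_H*b + TAU_N*b + TAU_B;
        double a     = (TAU_N*b + TAU_B) / denom;
        double beta  = (TAU_H*b + TAU_B + SIGMA_HR - SIGMA_H) / denom;
        double Td, C, T;
        int i;

        /* Equation (3.10): size of the first subarray. */
        x[0] = ((1.0 - a)*L - k*beta) / (1.0 - pow(a, k)) + beta/(1.0 - a);
        /* Equation (3.9): remaining subarray sizes. */
        for (i = 1; i < k; i++)
            x[i] = a*x[i-1] + beta;

        /* Host idle time C and Equation (3.11). */
        Td = (k - 1)*SIGMA_H + TAU_H*b*(L - x[0]);
        C  = TAU_N*b*x[0] + TAU_B*x[0] + SIGMA_N + TAU_N*b - Td + SIGMA_HR;
        if (C < 0.0) C = 0.0;
        T  = k*(SIGMA_H + SIGMA_HR + b*TAU_H + TAU_B) + b*TAU_H*L + C;
        return T;
    }

    int main(void)
    {
        double x[64];
        printf("T(k=12) = %.1f ticks\n", exec_time(8192.0, 12, 4.0, x));
        return 0;
    }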
The derivations for the tree-structured accumulation method are similar. The timing diagram of the tree-structured accumulation method with ratioed partitioning is shown in Figure 3.7. Using the ratioed partitioning scheme, we can identify the following relationship between adjacent nodes from Figure 3.7:

$$\tau_n b\,x_i + \tau_b\,x_i = (\sigma_h + \tau_h b\,x_{i+1} + \tau_n b\,x_{i+1}) + \tau_b\,x_{i+1}$$

Defining

$$\alpha = \frac{\tau_n b + \tau_b}{\tau_h b + \tau_n b + \tau_b} \qquad \text{and} \qquad \beta = \frac{\sigma_h}{\tau_h b + \tau_n b + \tau_b}$$

we have

$$x_{i+1} = \alpha x_i - \beta \tag{3.13}$$

Figure 3.7. Timing diagram of array summation using tree-structured accumulation

From the timing relationships we also obtain

$$C = (\tau_n b + \tau_b)x_{k-1} + \log_2 k\,(\sigma_n + 2\tau_n b + \tau_b) + (\sigma_n + \tau_n b)$$
$$T = k\sigma_h + (L+1)\tau_h b + C \tag{3.14}$$

Similarly, the execution time for the equal partitioning scheme is given by

$$T = (k+1)\sigma_h + (L+1)\tau_h b + (\tau_n b + \tau_b)\frac{L}{k} + \tau_n b + \log_2 k\,(\sigma_n + 2\tau_n b + \tau_b) \tag{3.15}$$

From (3.11), (3.12), (3.14), and (3.15), the performance of the algorithms running on the NCUBE can be estimated using the system parameters given in Table 3.1. The optimal design parameters, k and $x_i$, $0 \le i \le k-1$, can also be determined by optimizing these expressions.

3.3.3. Performance Analysis

In this subsection, the performance of the array summation algorithms presented in Section 3.3.1 is studied. First, the centralized accumulation method using the ratioed and equal partitioning schemes are compared (see Figure 3.8). The size of the array is 8192 elements. The partition sizes, $x_i$, $0 \le i \le k-1$, for the ratioed partitioning scheme are obtained from Equations (3.9) and (3.10), and the analyzed execution times are obtained from (3.11) and (3.12).

Figure 3.8. Comparison of ratioed and equal partition (execution time T versus the number of processors k; measured and analyzed curves for both partitioning schemes).

The accuracy of the analytic models is evident from Figure 3.8 by observing the close match between the curves representing the analytic and the experimental results. From Figure 3.8, the optimal number of processors for each scheme can be obtained at the "valley" of the corresponding curve: k = 12 for both schemes. Note that the optimal number of processors is the dividing point between host-bound and node-bound execution. Beyond this point, the host becomes the bottleneck. It can also be seen from Figure 3.8 that the ratioed partitioning scheme improves the performance, because a better balance between communication and computation can be achieved by adjusting the data partition sizes.

A comparison of centralized and tree-structured accumulation is presented in Figure 3.9 for both the equal and the ratioed partitioning schemes. Again, the partition sizes of the ratioed partitioning schemes are derived from the analysis.

Figure 3.9. Comparison of tree-structured and centralized accumulation (execution time T versus the number of processors k for the four combinations of accumulation method and partitioning scheme).

It can be seen from Figure 3.9 that the performance of the tree-structured approach is almost identical to that of the centralized approach, for all k up to the optimal value. This is because, for array summation, the accumulation phase involves only the transmission and addition of a single data element, which incurs only a small portion of the total execution time. However, when k becomes large, the difference between these two approaches becomes apparent. This is evident from the curve corresponding to the equal partitioning, tree-structured accumulation method in Figure 3.9, which extends below the curve of the ratioed partitioning, centralized accumulation method when k >= 27. For the tree-structured accumulation method, the optimal partition sizes obtained from (3.13) will become negative when k becomes large. This implies that the given number of processors is too large for efficient execution.
This is also the reason why, in Figure 3.9, there is no data beyond k = 16 for the tree-structured accumulation method using ratioed partitioning.

Figure 3.10. Predicted performance of various methods (execution time T versus the number of processors k for L = 8192 and sigma_h = sigma_hr = sigma_n = 1.0).

Next, let us consider the influence of the system parameters on algorithm design. Suppose the message startup delay could be improved from 8.08 ticks on the host and from 4.21 ticks on the nodes to 1 tick. Figure 3.10 illustrates the resultant total execution time, predicted by means of the analytic models. Again, tree-structured and centralized accumulation perform almost the same. Compared with Figure 3.9, the improvement due to the decreased message startup delay is about 25% for the optimal execution time. In addition, the host can support more processors to improve the performance.

3.4. CASE STUDY — MATRIX MULTIPLICATION

Matrix multiplication on a hypercube multiprocessor has been studied in different contexts [ChSm87, FoOH87]. In this section, the same problem is studied again. The purpose is not to propose any novel algorithm, but to show the tradeoffs between different partitioning and mapping schemes and to illustrate ways of arriving at good design decisions. It is assumed that the multiplication of two $m \times m$ matrices, $A \times B = C$, is to be performed on a hypercube with a fixed dimension, $n = \log_2 N$. Each element in the matrices has a size of b bytes. Matrices A and B are initially stored in the host and C will be stored back to the host. It follows that the overhead involved in data downloading and uploading must also be taken into account.

Various considerations in designing the matrix multiplication algorithm will be discussed in Section 3.4.1, followed by analytic modeling of the algorithm in Section 3.4.2. Using both experimental and analytic results of the matrix multiplication algorithm, we then discuss in Section 3.4.3 different tradeoffs in arriving at efficient parallel algorithm designs.

3.4.1. The Algorithms

A straightforward approach to performing the matrix multiplication on a hypercube machine is to partition matrix B into N $m \times (m/N)$ submatrices, $B_0, B_1, \ldots, B_{N-1}$. Each node is loaded with one submatrix $B_i$ together with the whole matrix A. The resultant products $C_i = A \times B_i$ are then transferred back to the host to form the complete matrix C. The mapping is trivial because there is no interprocessor communication. However, each node will require a large amount of local memory to hold the matrices A, $B_i$, and $C_i$. With matrices of size 512x512 running on a 5-cube and 4 bytes per element, each node will require more than 1 Mbyte of local memory.

To relax the memory requirement, we can partition both matrices A and B. Consider the strip partition shown in Figure 3.11(a), in which matrix A is partitioned along the rows and B along the columns. Each processor is allocated one submatrix of A and one submatrix of B. In each iteration, the processor performs a multiplication on the two local submatrices, and then shifts the B submatrix to the next processor in a ring fashion (as indicated by the arrows in Figure 3.11(b)).
Repeating this circulation and multiplication, we can evaluate all submatrices of C. Note that, by using the gray code sequence as shown in Figure 3.11(b), all B submatrix circulations are only one hop away. Also, all submatrices of C can be sent back to the host as soon as they are generated.

Another strip partition method is to divide matrix A along the columns and B along the rows, as shown in Figure 3.11(c). Again, each pair of submatrices $A_i$ and $B_i$ is loaded into one node, $0 \le i \le 3$. In this case, we obtain at each node a portion of the whole matrix C, $C^{(i)} = A_i \times B_i$. One way to accumulate C, where $C = C^{(0)} + C^{(1)} + C^{(2)} + C^{(3)}$, from all the nodes is the tree-structured accumulation method (see Figure 3.11(d)). The root of the reduction tree is responsible for uploading the resultant matrix C back to the host. Note that all data transfers in the reduction tree are only one hop away.

Figure 3.11. Matrix multiplication with strip partition ((a) A partitioned along rows and B along columns; (b) ring circulation of the B submatrices among the nodes; (c) A partitioned along columns and B along rows; (d) tree-structured accumulation of the partial results toward the host).

In fact, the above two methods are only special cases of the block partition scheme. In general, we can partition the matrices along both columns and rows, and use two parameters, $n_1$ and $n_2$, to denote the number of partitions along the rows and columns, respectively (see Figure 3.12 for $n_1 = 3$ and $n_2 = 4$). In each iteration, every node performs a submatrix multiplication and accumulates the resultant C submatrices with all nodes in the same row using a binary-tree reduction. The B submatrices are then transmitted down the columns in a ring fashion.

Assume that $n_1$ and $n_2$ are integer powers of two. Let node_ij denote the processor located at row i and column j of the mesh. Then the matrix multiplication algorithm can be taken as a combination of the ring and tree operations presented above. The algorithm is given in Figure 3.13. There are $n_2$ rings involved in the multiplication. Each ring contains $n_1$ nodes. Tree reduction is performed across the rings between corresponding nodes. Note that in Algorithm 3.1 the roots of the reduction trees are shifted at each iteration so that no processor is overloaded by having to upload every submatrix of C.

Figure 3.12. Matrix multiplication with block partition (A is partitioned into n1 x n2 blocks, B into n2 x n1 blocks, and C into n1 x n1 blocks; B blocks circulate down the columns while C blocks are accumulated across the rows and returned to the host).

    Host:
        load A_ij and B_ji to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
        receive C_ij from node_ik, 0 <= i,j <= n1-1, k = j mod n2;
        C <- [C_ij], 0 <= i,j <= n1-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
        receive A_ij and B_ji from the host;
        for l <- i to n1-1, then 0 to i-1 do
            C_il^(j) <- A_ij x B_jl;
            perform a tree reduction with node_ik, 0 <= k <= n2-1,
                using node_ir, r = l mod n2, as the root;
            if (j = l mod n2) then send C_il to the host;
            if (l != i-1) then
                send B_jl to node_kj, k = (i-1) mod n1;
                receive B_j,l+1 from node_kj, k = (i+1) mod n1;

Figure 3.13. The algorithm (Algorithm 3.1) for matrix multiplication

The program for performing matrix multiplication using Algorithm 3.1 is listed in the Appendix.
3.4.2. Analytic Modeling

We now proceed to analyze the performance of Algorithm 3.1. Before this can be done, it should be noted that in Figure 3.12 the host has to gather data before it downloads submatrices and scatter data after it receives submatrices. Therefore, we should use (3.4) and (3.5) to model the host communication overhead.

Now consider the matrix multiplication algorithm. Due to the operation of ring shifting, the nodes in the same column will be synchronized by the slowest one, namely the one that is loaded last by the host. Downloading the submatrices of A and B from the host to all the corresponding nodes will take time

$$T_d = 2N\left(\sigma_h + \tau_h b \frac{m^2}{N}\right)$$

At each iteration, the following operations will be performed in the nodes:

(1) Receive a submatrix of B, which takes time $\tau_n b\,m^2/N$;
(2) Perform a submatrix multiplication, which takes time $\tau_c m^3/(n_1^2 n_2)$;
(3) Perform a tree reduction with all nodes in the same row, which takes time $\log_2 n_2\,(\sigma_n + (2\tau_n b + \tau_b)\,m^2/n_1^2)$;
(4) The root of the reduction tree sends the resultant submatrix of C to the host, which takes time $\sigma_n + \tau_n b\,m^2/n_1^2$;
(5) Send the submatrix of B to the next node, which takes time $\sigma_n + \tau_n b\,m^2/N$;
(6) Prepare to receive the next submatrix of B, which takes time $\sigma_n$.

It follows that the iteration time at each node is equal to

$$T_i = 2\left(\sigma_n + \tau_n b\frac{m^2}{N}\right) + \tau_c\frac{m^3}{n_1^2 n_2} + \log_2 n_2\left(\sigma_n + (2\tau_n b + \tau_b)\frac{m^2}{n_1^2}\right) + \left(\sigma_n + \tau_n b\frac{m^2}{n_1^2}\right)$$

Now consider the host. During each iteration the host will receive $n_1$ submatrices of C, i.e.,

$$T_u = n_1\left(\sigma_{hr} + \tau_h b\frac{m^2}{n_1^2}\right)$$

The total execution time of the algorithm depends on which event takes longer. Let

$$T_m = 2\sigma_n + \tau_n b\frac{m^2}{N}$$

We have

$$T = T_d + (T_i - T_m) + T_u + (n_1 - 1)\max\{T_i, T_u\} \tag{3.16}$$

3.4.3. Performance Study

In this subsection, the analytic results for Algorithm 3.1 are compared with the experimental results obtained from the NCUBE to gain more insight into the design considerations. Note that the execution time given in (3.16) is a function of both $n_1$ and $n_2$. However, if the problem is run on a particular hypercube, the size of the hypercube, N, is known. Then we need only consider $n_1$ ($N = n_1 n_2$).

Figure 3.14 illustrates the relationship between $n_1$ and the execution time (T) for various sizes of the hypercube. The experiment uses 64x64 matrices. Again, the results from the analytic model (Equation (3.16)) match closely with those measured from the NCUBE. Therefore, the analytic model can be used to predict the optimal partition, $n_1$ and $n_2$.

Figure 3.14. Execution time of the matrix multiplication (execution time T versus the number of column partitions n1, measured and analyzed, for 64x64 matrices on hypercubes with N = 2, 4, 8, and 16 nodes).

From Figure 3.14 it can be seen that performance degrades when $n_1$ is small. Recall that there are $n_1$ tree reductions ($\log_2 n_2$ steps each) and $n_1 - 1$ ring circulations. The smaller $n_1$ is, the fewer data transfers there are. However, each step in a tree reduction requires the transmission of a partial C submatrix, which has $(m/n_1)^2$ elements. Thus, the total size of the data to be transferred in the tree reduction phase is

$$n_1\left(\frac{m}{n_1}\right)^2 \log_2 n_2 = \frac{m^2}{n_1}\log_2\frac{N}{n_1} \tag{3.17}$$

The smaller $n_1$ is, the larger the data size given in (3.17) will be, and the longer it will take to transmit the data. This results in a longer execution time, as shown in Figure 3.14. The analytic model presented in Section 3.4.2 is a very useful tool in choosing the optimal $n_1$ and arriving at a balanced design.
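A small program like the following sketch can evaluate Equation (3.16) for every feasible $n_1$ on an N-node hypercube and pick the partition with the smallest predicted execution time; this is the kind of use the model is put to in Figure 3.14. The parameter values are the with-gather/scatter entries of Table 3.1, and the function and variable names are our own.

    #include <stdio.h>
    #include <math.h>

    #define SIGMA_H  8.20
    #define SIGMA_HR 4.55
    #define SIGMA_N  3.52
    #define TAU_H    0.068
    #define TAU_N    0.017
    #define TAU_C    0.24
    #define TAU_B    0.15

    /* Equation (3.16): predicted execution time of Algorithm 3.1 for
     * m x m matrices of b-byte elements on N = n1*n2 nodes. */
    double model_T(double m, double b, int N, int n1)
    {
        int    n2 = N / n1;
        double Td = 2.0*N*(SIGMA_H + TAU_H*b*m*m/N);
        double Ti = 2.0*(SIGMA_N + TAU_N*b*m*m/N)
                  + TAU_C*m*m*m/((double)n1*n1*n2)
                  + log2((double)n2)*(SIGMA_N + (2.0*TAU_N*b + TAU_B)*m*m/((double)n1*n1))
                  + (SIGMA_N + TAU_N*b*m*m/((double)n1*n1));
        double Tu = n1*(SIGMA_HR + TAU_H*b*m*m/((double)n1*n1));
        double Tm = 2.0*SIGMA_N + TAU_N*b*m*m/N;
        double Tmax = (Ti > Tu) ? Ti : Tu;
        return Td + (Ti - Tm) + Tu + (n1 - 1)*Tmax;
    }

    int main(void)
    {
        int N = 16, n1;
        for (n1 = 1; n1 <= N; n1 *= 2)   /* n1 and n2 are powers of two */
            printf("n1=%2d n2=%2d  T=%.0f ticks\n",
                   n1, N/n1, model_T(64.0, 4.0, N, n1));
        return 0;
    }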
The speedup of an algorithm is defined as

$$\text{speedup} = \frac{\text{execution time using one node}}{\text{execution time using } N \text{ nodes}} \tag{3.18}$$

Figure 3.15. Predicted speedup of the matrix multiplication (speedup versus the number of processors N for 64x64 matrices: with gather/scatter, without gather/scatter, and with gather/scatter but no downloading/uploading).

The optimal speedups for different sizes of the hypercube are depicted by the solid line in Figure 3.15. It can be seen in Figure 3.15 that the speedup levels off very quickly. Now suppose the gathering/scattering operations could be eliminated by, say, allowing the send primitive to access the matrix directly; then the speedup would nearly double (see the dotted line in Figure 3.15). Furthermore, if the matrices reside in the nodes before and after the multiplication [ChSm87, FoOH87], then there is no data downloading and uploading at all. The dashed curve in Figure 3.15 shows that the performance in this case is much better than the original one (almost triple). Therefore, it is desirable to keep operations at the nodes in the hypercube as much as possible, avoiding unnecessary data downloading and uploading.

Finally, assuming $n_1 \le n_2$, the memory requirement at each node for Algorithm 3.1 is

$$\frac{2bm^2}{N} + b\left(\frac{m}{n_1}\right)^2$$

Again, considering the case where the matrices have a size of 512x512 running on a 5-cube, the local memory required is about 131 Kbytes using the partition $n_1 = 4$ and $n_2 = 8$, which improves over the 1 Mbyte of memory needed in the first approach described in Section 3.4.1.

CHAPTER 4
LARGE-GRAIN PIPELINING

One of the reasons why parallel algorithm design is considered difficult is the lack of design guidelines. Without general methodologies, it is very hard for a beginner to start with parallel algorithm development. Over the years, as experience with parallel programming has accumulated, several problem-solving strategies have been identified as effective means of structuring efficient parallel algorithms. These strategies are called programming paradigms. As pointed out in [NeSn87], programming paradigms not only provide a means for programmers to begin when designing a new algorithm, but they also relate solutions of other related problems to the problem at hand. Nevertheless, paradigms for parallel programming are different from those for conventional serial computations. Parallel programming paradigms must address the issues of partitioning, cooperation, sharing, and communication.

In Section 4.1, we will first review important parallel programming paradigms. Based on the characteristics of DMMs and pipelined data parallel computations, a new programming paradigm, called large-grain pipelining [KiCN88a, KiNi88], is introduced in Section 4.2. Large-grain pipelining supports the development of pipelined data parallel algorithms on DMMs. Using processor-level macro-pipelining in the DMM, the resultant algorithm can achieve a computation rate that balances the communication rate. As an example, a pipelined matrix multiplication is presented. Again, the purpose is not to introduce any new algorithm, but to illustrate the basic idea behind large-grain pipelining. In Section 4.3, important characteristics of large-grain pipelining will be summarized and compared with other programming paradigms. Finally, in Section 4.4, experimental results of the pipelined matrix multiplication algorithm on the NCUBE are presented. Various factors that might affect the performance are discussed.
4.1. PARALLEL PROGRAMMING PARADIGMS

In general, there is no precise definition of what a paradigm should or should not be. In fact, different researchers might use different names to describe similar ideas and fall short, in one way or another, in accounting for some other aspects of the ideas [NeSn87, Fink87, Quin87]. Therefore, the discussion in this section is subjective and is by no means exhaustive or complete. Note that different paradigms may sometimes overlap with each other. Note also that a parallel algorithm might use a combination of several paradigms.

(1) Divide-and-conquer

In a broad sense, all parallel problem-solving strategies use some sort of divide-and-conquer technique. However, the divide-and-conquer paradigm discussed here is more oriented toward a recursive and usually logarithmic style of computation. The problem is divided into smaller subproblems. The subproblems are further divided into even smaller subproblems until the sizes of the subproblems are manageable, at which point the subproblems are solved independently. The solutions are then combined recursively until the final solution is obtained.

Due to its recursive nature, the divide-and-conquer paradigm can easily generate parallel algorithms with a logarithmic time complexity. It is also obvious that divide-and-conquer is most suitable for developing concurrent data parallel algorithms, in which data are partitioned recursively and processed on multiple processors. Several algorithms belong to this category, including bitonic-merge sort and the fast Fourier transform (FFT).

(2) Master-slave

In this paradigm, one or more masters generate and deal out tasks to a set of slaves. The slaves carry out the tasks (usually independently), and the results are gathered at the masters. The master-slave paradigm is suitable for developing concurrent algorithms, either data or function parallel.

A variation of the master-slave paradigm is the compute-aggregate-broadcast (CAB) paradigm [NeSn87], in which the processors perform some computations independently, the resultant data are combined into one or a few global values, and these global values are broadcast back to each processor. The processors that aggregate and broadcast the global values play the role of masters. Algorithms using the CAB paradigm are usually organized as a loop, with each sequence of compute, aggregate, and broadcast constituting one iteration. The CAB paradigm is more suitable for concurrent data parallel algorithms because of the broadcast operation.

Another variation of the master-slave paradigm is the client-server style of computation found in many distributed systems. Clients (the masters) direct their requests to a number of servers (the slaves) which perform services on behalf of the clients. Many parallel program development tools for SMMs, such as Force [Jord87, Oste87], adopt this paradigm in scheduling the computation. One advantage of the client-server paradigm is dynamic load balancing among the servers. Also, the computation is independent of the number of servers available. This paradigm is suitable for developing concurrent algorithms.

(3) Election

Election is the operation in which one processor from among a group of processors is singled out to perform some special function. Election finds its application mostly in distributed systems, including replicated data updating, crash recovery, and mutual exclusion [KiGN88].
Parallel algorithms on multiprocessors also exhibit the behavior of election, e.g., finding the pivot column in a parallelized simplex method [Fink87]. Sometimes, election is the only way to perform reliable computation in an unreliable environment.

(4) Relaxation

In relaxation, each processor works with the most recently available data [Quin87]; no processor has to wait for another processor to provide it with data. The computation proceeds in an asynchronous and iterative fashion. Usually, during each iteration, a processor needs only exchange its most recent values with neighboring processors. A typical example of relaxation is solving partial differential equations (PDEs). However, the asynchronous nature of this paradigm causes difficulties in performance prediction and requires extra effort to detect the termination conditions. In general, relaxation is very suitable for data parallel algorithms.

(5) Pipelined and systolic computation

Pipelining, in its most common sense, refers to an operation involving an ordered set of stages in which the output of one stage is the input of the next stage. Each stage behaves like a filter, and data, when passed along the pipeline, are modified along the way. Traditionally, pipelining is used to develop pipelined function parallel algorithms (see Section 1.3). Examples include multi-pass transformers such as compilers, scene analyzers, and vector processors.

Systolic algorithms are a special kind of pipelined data parallel algorithm centered around VLSI computations [FoWa87, KuLe78]. Systolic algorithms are characterized by regular and rhythmic data flows and a set of identical and simple processors. Communication with the outside world occurs only at the boundary processors. Thus, whenever a datum enters the system, it is used repetitively in the pipelines.

In general, systolic algorithms are synchronous (except for wavefront arrays [KuLJ87]). Therefore, timing is very important in systolic arrays. Furthermore, since the processing cells in a systolic array are simple and primitive, systolic algorithms use a very fine granularity (at the level of single data elements). Systolic algorithms are suitable for developing data parallel algorithms and have applications in signal processing, matrix arithmetic, image processing, pattern matching, and dynamic programming.

From the above discussion, one can see that most of the parallel programming paradigms focus on concurrent computation. In view of the high communication overhead in DMMs, pipelining seems to be a more attractive programming paradigm for these machines. However, traditional pipelining techniques concentrate on function parallelism, while systolic algorithms aim at VLSI computation, which uses a very fine granularity. Neither is oriented toward developing efficient data parallel algorithms on DMMs. In the next section, a new paradigm, called large-grain pipelining (LGP), is introduced. LGP is specifically directed at developing pipelined data parallel algorithms on DMMs.

4.2. BASIC CONCEPT OF LARGE-GRAIN PIPELINING

The basic idea of LGP is to use node-level macro-pipelining in a DMM to regulate the information flows in the system so that data can be processed and transmitted in a pipelined fashion. The nodes in the multiprocessor are organized into pipelines, and the data are partitioned into blocks to form data streams. The sizes of the data blocks determine the partition sizes of the problem, and the directions of the data flows determine the execution sequence.
Note that, although each node may be equipped with pipelined arithmetic units, the major concern of LGP is the pipelining between nodes. Through LGP, the degree of overlapping between computation and communication can be maximized, which minimizes the effect of the communication overhead. The net result is a balance between the computation rate and the communication bandwidth.

As an example of LGP, consider again the multiplication of two matrices, $A \times B = C$, where A, B, and C are $m \times m$ matrices. As in Section 3.4.1, A and B are assumed to be stored in the host initially, and C will be stored back to the host. Thus, the input data must be downloaded from the host to the nodes and the results must be uploaded back to the host. Assume that the multiplication is to be performed on a hypercube DMM with N nodes. The following partitioning scheme will be used: the hypercube is configured into an $n_1 \times n_2 = N$ mesh, and matrices A and B are partitioned along columns and rows into $n_1 \times n_2$ and $n_2 \times n_3$ submatrices, respectively. One way to perform the multiplication is to load the submatrices of A into the corresponding processors in the mesh, and then pipe the submatrices of B into the hypercube from the host. Let node_ij denote the processor located at row i and column j of the mesh. The algorithm is described in Figure 4.1. The corresponding data flows in the algorithm are shown in Figure 4.2.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_i from node_{i,n2-1}, 0 <= i <= n1-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7. perform a binary-tree reduction with node_il, 0 <= l <= n2-1, to obtain
         C_i := C_i^(0) + ... + C_i^(n2-1) in node_{i,n2-1},
         where C_i^(j) := [C_i0^(j), ..., C_{i,n3-1}^(j)];
    2.8. if (j = n2-1) then send C_i to the host.

Figure 4.1. Algorithm 4.1 for pipelined matrix multiplication

Figure 4.2. Data flows in Algorithm 4.1 ((a) submatrix multiplication: streams of B submatrices (n3 = 4 blocks per stream) flow from the host down the n2 = 4 columns of the n1 x n2 mesh of A submatrices (n1 = 3); (b) submatrix addition: the partial results C_i^(0), ..., C_i^(n2-1) in each row are combined by a binary-tree reduction into C_i and sent to the host).

It can be seen from Figure 4.2(a) that each pipeline will receive a stream of B submatrices from the host, and each stream consists of $n_3$ data blocks, i.e., submatrices of B. After all submatrices of B have passed through the pipelines, each processor will have collected a portion of a submatrix of C of size $(m/n_1) \times m$. To obtain the final result, a binary-tree reduction is performed among the processors in the same row (Figure 4.2(b)).

Thus, there are two phases in each iteration: submatrix multiplication for evaluating $A_{ij} \times B_{jk}$, and submatrix addition for accumulating $C_{ik}$, where $0 \le i \le n_1-1$ and $0 \le k \le n_3-1$. If the addition phase is performed in every iteration, as in Algorithm 4.2 shown in Figure 4.3, only the current submatrices of C, rather than entire strips of partial results, need to be stored in the processors.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_ik from node_{i,n2-1}, 0 <= i <= n1-1, 0 <= k <= n3-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7.     perform a binary-tree reduction with node_il, 0 <= l <= n2-1,
              to obtain C_ik := C_ik^(0) + ... + C_ik^(n2-1) in node_{i,n2-1};
    2.8.     if (j = n2-1) then send C_ik to the host.
Figure 4.3. Algorithm 4.2 for pipelined matrix multiplication

The corresponding data flows are described in Figure 4.4.

Figure 4.4. Data flows in Algorithm 4.2 (streams of B submatrices flow from the host down the columns of the mesh of A submatrices; in every iteration the C submatrices are reduced across each row and the rightmost column sends the results to the host).

Note that, in Algorithm 4.2, the size of the data blocks involved in the tree reduction is smaller than that in Algorithm 4.1. Therefore, the balance between granularity and communication overhead becomes even more important. If the communication cost is too high, then we can combine two or more submatrices of C into a larger data block in the reduction. In other words, the tree reduction is performed every two or more iterations instead of every iteration. By increasing the granularity, we can reduce the effect of the communication overhead of setting up messages.

Binary-tree reduction is not the only way to accumulate the submatrices of C. An algorithm using a linear reduction scheme is described in Figure 4.5. After calculating a C submatrix, each processor sends that C submatrix to its right neighbor. The accumulation progresses from left to right in a linear fashion until the complete C submatrix is obtained in the rightmost processor, from which it is uploaded back to the host. The corresponding data flows are illustrated in Figure 4.6.

    Host:
    1.1. send A_ij to node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1;
    1.2. send B_jk to node_0j, 0 <= j <= n2-1, 0 <= k <= n3-1;
    1.3. receive C_ik from node_{i,n2-1}, 0 <= i <= n1-1, 0 <= k <= n3-1.

    Node: (node_ij, 0 <= i <= n1-1, 0 <= j <= n2-1)
    2.1. receive A_ij from the host;
    2.2. for k := 0 to n3-1 do
    2.3.     if (i = 0) then receive B_jk from the host
    2.4.     else receive B_jk from node_{i-1,j};
    2.5.     if (i != n1-1) then send B_jk to node_{i+1,j};
    2.6.     C_ik^(j) := A_ij x B_jk;
    2.7.     if (j != 0) then
    2.8.         receive C_ik^(j-1) from node_{i,j-1};
    2.9.         C_ik^(j) := C_ik^(j) + C_ik^(j-1);
    2.10.    if (j = n2-1) then send C_ik^(j) to the host
    2.11.    else send C_ik^(j) to node_{i,j+1}.

Figure 4.5. Algorithm 4.3 for pipelined matrix multiplication

Note that in Algorithm 4.3 a processor performs its submatrix multiplication before it pauses to wait for the C submatrix from its left neighbor (Statements 2.6 through 2.9 in Figure 4.5). This is because submatrix multiplication is a computation-intensive operation ($O(n^3)$). If every processor waited until it received the C submatrix from its left neighbor before performing its multiplication, then the delay within each stage would be very long. Performance analyses of these algorithms are given in Section 4.4, where considerations for designing pipelined data parallel algorithms will be discussed further.

Figure 4.6. Data flows in Algorithm 4.3 (streams of B submatrices flow down the columns; the C submatrices are accumulated linearly from left to right along each row and the rightmost column sends the results to the host).

4.3. CHARACTERISTICS OF LARGE-GRAIN PIPELINING

The most important characteristic of pipelined data parallel computation on DMMs is the concept of large-grain data flows. Through this concept, LGP supports pipelined data parallel computation in the following two ways.

First, data flows: the data are partitioned into blocks. The flows of data blocks along the processors form data streams. The size of the data blocks determines the granularity of the algorithm and is usually large on DMMs to offset the communication overhead. The data blocks are the most basic unit of processing. In Figure 4.2, each submatrix of B or C forms a data block.

Second, the flows of data connect the nodes in the system into a network. The network consists of multiple pipelines operating on multiple data streams. Each node may participate in several pipelines during the course of the computation. The basic mode of operation in a processor is to receive one data block from each of the input streams and generate one data block for each of the output streams. This multiple pipelining scheme results in both spatial (multiple units) and temporal (pipelines) overlapping. In Figure 4.2, there are two kinds of data streams flowing in the system: one along the columns to disseminate the submatrices of B, and the other along the rows to accumulate the submatrices of C.
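The following C sketch outlines the receive-compute-send skeleton that one stage of such a pipeline might follow on the NCUBE, written against the VERTEX primitives of Section 3.1.2. The message types, block sizes, neighbor ids, the nread() prototype (the text only states that its arguments are similar to those of nwrite()), and the compute_block() routine are illustrative assumptions, not part of the algorithms above.

    #define NBLOCKS   8           /* data blocks per input stream (n3)  */
    #define BLKBYTES  2048        /* bytes per data block (assumed)     */
    #define IN_TYPE   1           /* message type for incoming blocks   */
    #define OUT_TYPE  2           /* message type for result blocks     */

    /* Prototypes assumed for this sketch; VERTEX supplies the routines. */
    extern int nread(char *msg, int msglen, int *src, int *type,
                     int *status, int *flag);
    extern int nwrite(char *msg, int msglen, int dest, int type,
                      int *status, int *flag);
    extern void compute_block(char *in, char *out);  /* node-local work */

    void pipeline_stage(int pred, int succ, int last_stage)
    {
        char in_blk[BLKBYTES], out_blk[BLKBYTES];
        int  k, src, type, status, flag;

        for (k = 0; k < NBLOCKS; k++) {
            /* blocking receive of the next data block from the predecessor */
            src = pred; type = IN_TYPE;
            nread(in_blk, BLKBYTES, &src, &type, &status, &flag);

            /* pass the block downstream before computing, so the
             * successor stage can start as early as possible */
            if (!last_stage)
                nwrite(in_blk, BLKBYTES, succ, IN_TYPE, &status, &flag);

            /* local computation on the block (e.g., a submatrix product) */
            compute_block(in_blk, out_blk);

            /* emit one block on the output stream */
            nwrite(out_blk, BLKBYTES, succ, OUT_TYPE, &status, &flag);
        }
    }

Since nwrite() is non-blocking and nread() is blocking, each stage is driven purely by the arrival of data blocks, which is exactly the data-driven behavior summarized in the next section.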
From the viewpoint of data flows and the characteristics of DMMs, we can summarize the following important features of pipelined data parallel computations:

(1) Optimized performance: The performance of pipelined data parallel algorithms can be fine-tuned by controlling such design parameters as the number of pipelines, the number of stages in the pipelines, and the size of the data blocks. In the next chapter, an analytic model will be introduced to serve this optimization purpose. Note that, due to the communication overhead in DMMs, the granularity in pipelined data parallel algorithms is usually large.

(2) Regular and local communication: Pipelined data parallel algorithms usually have regular data flows. Each node in the system only has to interact with its neighbors (i.e., its predecessors and successors in the pipeline).

(3) Asynchronous and data-driven operation: Due to the autonomous nature of DMMs, no global synchronization should be used in pipelined data parallel algorithms. Operations in the system are initiated by the arrival of data blocks. In this regard, pipelined data parallel algorithms on DMMs are very similar to wavefront arrays [KuLJ87]. Note that, to control the flow between two successive nodes (stages) in the pipeline, the local memory in the nodes can serve as the buffer between the producer and the consumer.

(4) Problem size independence: In DMMs, each node, being a general-purpose processor with local memory, is able to handle data sets of different sizes. Thus, given a fixed node configuration, one can change the size of the data blocks to accommodate variable problem sizes.

(5) Reduced I/O bandwidth: The number of active channels that the host has to maintain during data downloading and uploading is reduced by using pipelining. For example, in Figure 4.4, the host only has to communicate with the nodes in the top row when downloading and with the nodes in the rightmost column when uploading.

(6) Identical node programs: The nodes in a pipelined data parallel algorithm usually perform the same function, but on different data sets. This makes the development of pipelined programs easy.

Given the basic idea of LGP, we will study in the next section the performance of the pipelined matrix multiplication algorithms when they are executed on the NCUBE.

4.4. PERFORMANCE OF PIPELINED MATRIX MULTIPLICATION

The design parameters $n_1$ and $n_2$ determine the configuration of the pipelines, where $n_2$ gives the number of pipelines in the submatrix multiplication phase and $n_1$ gives the number of stages in each pipeline. If the number of nodes (N) is fixed, then we only have to consider either $n_1$ or $n_2$, because $N = n_1 n_2$.
Figure 4.7 illustrates the relationship between the column partition ($n_1$) and the execution time (T) for the three algorithms introduced in Section 4.2. Multiplications of 64x64 matrices are studied using 2 to 32 nodes. The optimal configuration for a given N can be found from the corresponding curve by choosing its minimum point. It can be seen that Algorithms 4.2 and 4.3 have the same performance around the optimal regions, and both are superior to Algorithm 4.1. This is because the latter has a reduction tree which involves large submatrices.

Note that Algorithm 4.3 is expected to perform worse when $n_2$ is large, due to the long linear path for accumulating the C submatrices. However, from Figure 4.7, we can see that Algorithm 4.3 performs as well as Algorithm 4.2 does. A reasonable explanation is that the operations in the processors can be fully overlapped in Algorithm 4.3.

Figure 4.7. Comparison of pipelined matrix multiplication (execution time T versus the number of column partitions n1 for Algorithms 4.1, 4.2, and 4.3; 64x64 matrices, n3 = 8, and N = 2 to 32 nodes).

It can also be seen in Figure 4.7 that a close-to-square partition always results in good performance [FoOH87]. However, due to the host downloading and uploading operations, the algorithm favors partitions with few pipelines, i.e., a large $n_1$. Note that, when $n_1$ is small, the C submatrices will be large, and so will be the reduction tree. In this case, the communication overhead is too high for efficient execution.

The parameter $n_3$ determines the size of the data blocks ($B_{jk}$). The effect of the block size on the performance of the algorithm is illustrated in Figure 4.8 using Algorithm 4.2. It is evident from Figure 4.8 that granularity does have a bearing on the algorithm performance. A large granularity forces the operations in the nodes to be executed nearly sequentially, which reduces the effect of pipelining between nodes. On the other hand, a small granularity can increase the degree of pipelining, but, due to the fixed overhead in transmitting messages, the communication cost will also rise. Though it is not shown, the best partition size in this case is around $n_3 = 16$.

Figure 4.8. Effects of partition sizes (execution time T versus the partition pattern n1 x n2, from 1x16 to 16x1, for n3 = 2, 8, and 32; 64x64 matrices on 16 nodes).

Next, we compare the pipelined matrix multiplication algorithm (Algorithm 4.2) with Algorithm 3.1 introduced in Section 3.4, which was developed without using the concept of pipelining. In Algorithm 3.1, the submatrices of both A and B are initially loaded into the nodes. Due to the operation of the ring shifting in Figure 3.12, the nodes in the same column are synchronized to the slowest one. The measured speedups are plotted in Figure 4.9, where the speedup of an algorithm is defined as in (3.18).
The major reason is the communication overhead in current-generation hypercube multiprocessors, e.g., the NCUBE. The overhead in setting up a message is still too high, during which time the host cannot perform any useful computation (see the discussion in Section 3.2).

Given the basic concept of pipelined data parallel computation, the most important issue, then, is to characterize its performance. An analytic model of pipelined data parallel algorithms will be introduced in Chapter 5, including a performance analysis of the pipelined matrix multiplication from the analytic perspective.

CHAPTER 5

MODELING PIPELINED DATA PARALLEL ALGORITHMS

The primary strength of large-grain pipelining lies in its ability, through pipelining, to balance computation with communication in the resultant pipelined data parallel algorithms. Nevertheless, this strength cannot be unleashed unless there is a way of deciding the optimal design parameters of the given algorithm, such as

• The number of pipelines used
• The number of stages in each pipeline
• The number of data blocks in each data stream
• The size of data blocks

In general, analytic modeling of an algorithm serves two purposes. First, through the model, the performance of the algorithm can be predicted and studied under various environments. Second, by optimizing the estimated performance, one can determine the optimal set of design parameters. In this way, no expensive trial-and-error is required in determining the parameters, and the design can cope with changes in the problem size or even in the underlying machine architecture.

We have presented in Chapter 3 a concise model for DMMs in general and the NCUBE multiprocessor in particular. The important system parameters of the NCUBE are given in Table 3.1. Therefore, in this chapter we will concentrate on modeling the logical behavior of a pipelined data parallel algorithm.

In Section 5.1, a model of pipelined computation is presented which focuses on a single stage in the pipeline [KiCN88b]. The throughput of a stage is analyzed in Section 5.2, and the model is extended in Section 5.3 to analyze the throughput of linear arrays and 2-dimensional meshes. As an illustration, in Section 5.4, the model is used to analyze the performance of the pipelined matrix multiplication algorithm (Algorithm 4.2). The accuracy of the analytic model is studied by comparing the predicted performance with that measured on the NCUBE.

5.1. A MODEL OF PIPELINED COMPUTATION

Multiple pipelining is a general pattern of computation in pipelined data parallel algorithms. Following the flow concept, a node in a pipelined data parallel computation can be modeled as a computational unit which has multiple input and output streams (see Figure 5.1). Each data stream consists of many data blocks. From this point of view, a pipelined data parallel algorithm can thus be modeled as a network of computational units linked together by data streams, as exemplified in Figure 4.6. To analyze such a network, the relationship between the input and output streams has to be studied first.

An m-input computational unit in a pipelined data parallel computation can be viewed as a process which takes inputs from m data streams, ids_i (0 ≤ i ≤ m-1), performs a function, ods = h(ids_0, ..., ids_{m-1}), and generates another data stream, ods, as an output. Here only a single output data stream is considered; the result can easily be applied to the case of multiple output data streams by adding constant offsets. Suppose each input stream contains n data blocks.
Then, the computational unit has to repeat a receive-compute-send sequence n times to complete the task. This iterative process is described in Figure 5.2.

Figure 5.1. Model of a computational unit (a unit with m input data streams, each characterized by (μ_i, δ_i), and one output data stream characterized by (μ, δ))

    function h( )
    {
                                        /* I: begin of computation */
        start_up( );                    /* initialization time: α */
        for j := 0 to n-1 do {
            recv (ids_0, data_0, .....);
            P_0,j:    data_handling (data_0);            /* handling time: f_0 */
            .....
            recv (ids_m-1, data_m-1, .....);
            P_m-1,j:  data_handling (data_m-1);          /* handling time: f_m-1 */
            G_j:      data := computation (data_0, ..., data_m-1);   /* computation time: c */
            send (ods, data, .....);
        }
                                        /* E: end of computation */
    }

Figure 5.2. Functional description of the computational unit

Given the functional description of a computational unit, we are interested in the evaluation of its throughput. Before we can proceed, the following notions should be clarified.

Definition 5.1. A data stream, ds, is characterized by a two-tuple (μ, δ), where
    μ = the arrival time of the first data block
    δ = the interarrival time between two consecutive data blocks.  □

μ is an absolute time value measured from a fixed reference time instant, usually the instant the whole computation begins. Note that, although one can identify other parameters of data streams, only μ and δ affect the modeling discussed here. Note also that δ is, in general, a function of the size of the data blocks and of the delay induced by the nodes upstream. Again, for the purpose of modeling, δ is taken to be a constant by, say, averaging over the interarrival times between each pair of consecutive data blocks.

Definition 5.2. A computational unit, cu, is characterized by
    α, the initialization time;
    f_i, 0 ≤ i ≤ m-1, the time required to handle a data block from input stream ids_i;
    c, the time spent in computation and in invoking the send() request in each iteration.  □

Note that α is an absolute time value, while f_i and c are time intervals. Given the set of parameters of the m input streams, (μ_i, δ_i), 0 ≤ i ≤ m-1, define φ_i = f_i + f_{i+1} + ··· + f_{m-1}, the handling time that remains in an iteration once the block from stream ids_i has been received, and let

    K(x) = x if x > 0, and K(x) = 0 otherwise.

Consequently, we have the following lemma.

Lemma 5.1. The time at which the place G_j, 0 ≤ j ≤ n-1, will be reached is equal to

    T_j = Max_{0≤i≤m-1} { α+φ_0, μ_i+φ_i + j·K(δ_i − (φ_0+c)) } + j(φ_0+c)        (5.1)  □

From Lemma 5.1, the parameters (μ, δ) of the output data stream can be determined. For μ, we have the following theorem: the time that the subsequent computational unit will receive the j-th output data block is T_j + c + ψ, where ψ is the communication time for a data block to reach that unit. From (5.1) we can see that, when j is 0,

    T_0 = Max_{0≤i≤m-1} { α+φ_0, μ_i+φ_i }

Example 5.1. Assume that there are N stages in the linear array. The single input stream to stage 0 has an arrival time of μ^(0) and a block interarrival time of δ^(0). There are n data blocks in the stream. Each stage of the array is viewed as a computational unit which has parameters α, f_0, and c, for the process initialization time, block handling time, and block computation time, respectively (see Definition 5.2). Consider stage 0. From (5.1), we have

    T_0^(0)   = Max{ α+φ_0, μ^(0)+φ_0 }
    T_n-1^(0) = Max{ α+φ_0, μ^(0)+φ_0+(n−1)·K(δ^(0)−(φ_0+c)) } + (n−1)(φ_0+c)

The throughput of this stage can be derived from (5.2) and (5.3), which give the following parameters for the output stream:

    μ^(1) = T_0^(0)+c+ψ = Max{α, μ^(0)} + φ_0+c+ψ
    δ^(1) = Max{ φ_0+c, δ^(0) − K(α−μ^(0))/(n−1) }

The analysis for the other stages is similar. In general, for stage i, 0 ≤ i ≤ N−1, we have

    T_0^(i)   = Max{ α+φ_0, μ^(i)+φ_0 } = Max{α, μ^(0)} + i(φ_0+c+ψ) + φ_0
    T_n-1^(i) = Max{ α+φ_0, μ^(i)+φ_0+(n−1)·K(δ^(i)−(φ_0+c)) } + (n−1)(φ_0+c)
              = T_0^(i) + (n−1)δ^(i)

because δ^(i) ≥ (φ_0+c).
Thus, the output stream from stage i has the following parameters 118'“) = Maxla. 1101 + 58'“) = 68') = 58" It follows that the first output data block will be generated by stage N —l (and, thus, by this linear array) at MaXIa, 110} + (N -1)(¢0+C +ll!) + (¢0+C) with an output interval of 881). The last output block will be generated by stage N —l at Maxie. 110} + (N-1)(¢0+C +111) + (¢0+C) + (rt—1183" which equals the total execution of this linear array. The above expression conforms to the general execution time of a pipeline [KwBr84], (N +n -1)xclock time where N is the number of stages in the pipeline and n is the number of data elements passed through the pipeline. El Note that the model presented in this section only describes the behavior of a sin- gle computational unit. However, as can be seen in Example 5.1, the behavior of com- putational units in regular arrays are very similar. Thus, it is possible to derive a 89 closed-form to characterize the performance of such regular arrays. The next section expands the analytic model presented here to study the throughput of linear arrays and 2-dimensional meshes. 5.3. THROUGHPUT ANALYSIS OF PIPELINE ARRAYS Suppose computational units are connected in a linear array or 2-dimensional mesh. In this section, we are interested in obtaining closed-form expressions for the per- formance of the whole array. Note that the throughput at each single stage can be derived using the techniques presented in the previous section. The throughput of a linear array with only one input stream has been studied in Example 5.1. Let us con- sider linear arrays with two input streams, as shown in Figure 5.5. 018‘”. 88‘”) (118”. at") 018”“. 88“”) (111°).810’) ___., 0 1 _. ......__.N-1__. Figure 5.5. A linear array with two input streams (1) Linear arrays with two input streams Again, there are N stages in the array. Each stage has two input streams: one from the outside world and one from the previous stage (except stage 0). Assume that the arrival times of data streams have the following relation: ugl=ug‘1)+8, and that 885880), for 0+(k —j)<¢o+c>1 The first two terms are obtained directly from (5.1). The remaining terms in (5.4) can be simplified by noting that T §02T§i21+¢0+a This is true because from stage 0 in Figure 5.6, it can be seen that T80)+¢0+c is just the length of one of the paths incident to the place labeled T10). However, T10) is equal to the maximum of the lengths of all paths incident to its corresponding place. It follows that TI,” can be rewritten as follows: T1": Max1a+¢o+k (¢0+C)i 1180):” 5+¢0+kMax{580).(¢0+6)}. rt‘”+} where 1.18) is replaced by 1150M: 8. To simplify the expressions, we introduce the fol- lowing symbols: 9 Z =01+¢0 +k(¢o+c) U = pho)+¢o+k MaXISIi’). ¢o+c1 X 2 0h + C + \II Thus, we have TS,” = Max{Z, U+i8, rf'1>+X} = Max{Z+X, U+i8, U+(i-1)8+X, rfjrzl+2m = Max {Z+(i—1)X, U+(i-j)8+jX, mount} OSlSi-l However, note that I can only take two values in order to maximize T5,”: 0 or i—l. Also, from (5.1), we have 92 It” = Maxia+¢o+k<¢o+c>. ut°l+¢o+kMaxist°>r¢o+c>h ul°l+¢i+kMax151°h<¢o+c>11 Combined together, we obtain an expression for 751' ): r51) = Max{Z+iX, U+iMax{8,X],1110)+¢1+kMax{810),(00+c)}} = Max{a+¢o, 113°>+¢0+k «ohm-(owner K(8—(¢1+c +11», (5.5) ul°l+¢i+k K(5l0)-(¢0+6))} + k (¢o+C) + i<¢i+c +110 Notice the similarity between (5.5) and (5.1). In fact, if i=0, then Equations (5.1) and (5.5) are identical. From (5.5), the tow] execution time of the whole array is equal to Tfiflqluc. 
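To make the closed forms above concrete, the recurrence of Lemma 5.1 and the stage-to-stage propagation of Example 5.1 can be evaluated numerically. The sketch below is only an illustration: the function names and the sample parameter values are invented for this purpose, and the output interval of a stage is taken to be the average (T_n-1 − T_0)/(n−1), consistent with the single-input analysis in Section 5.1.

    # A small numerical rendering of the single-input stage recurrence (5.1)
    # and of the stage-to-stage propagation used in Example 5.1.  The symbols
    # (alpha, f0, c, psi, mu, delta) follow Definitions 5.1 and 5.2; the
    # concrete values below are made up for illustration only.
    def K(x):
        return x if x > 0 else 0

    def stage(alpha, f0, c, psi, mu_in, delta_in, n):
        """Return (mu_out, delta_out, t_last) for one single-input stage."""
        T0  = max(alpha + f0, mu_in + f0)
        Tn1 = max(alpha + f0,
                  mu_in + f0 + (n - 1) * K(delta_in - (f0 + c))) + (n - 1) * (f0 + c)
        mu_out    = T0 + c + psi                 # first block reaches the next stage
        delta_out = (Tn1 - T0) / (n - 1)         # average output interarrival time
        return mu_out, delta_out, Tn1 + c        # last block is produced at Tn1 + c

    def linear_array(N, n, alpha, f0, c, psi, mu0, delta0):
        mu, delta, t_last = mu0, delta0, 0.0
        for _ in range(N):
            mu, delta, t_last = stage(alpha, f0, c, psi, mu, delta, n)
        return t_last                            # completion time of the whole array

    if __name__ == "__main__":
        # With delta0 = f0 + c the array behaves like a classic pipeline:
        # roughly (N + n - 1) iterations of length f0 + c, plus the per-stage sends.
        print(linear_array(N=4, n=8, alpha=0.0, f0=1.0, c=3.0, psi=0.5, mu0=0.0, delta0=4.0))

For the sample values the printed completion time is 45.5, which agrees with the closed form Max{α, μ^(0)} + (N−1)(φ_0+c+ψ) + (φ_0+c) + (n−1)δ^(1) derived above.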
(2) 2-dimensional mesh Figure 5.7 shows the computational units that are connected into a 2-dimensional mesh. Each stage has two inputs, and input streams have the following relations: 118°” = 118“” +1821 88"” = 580°) 111“” = 11100) + i 8'1 51“” = 510°) The corresponding Timed Petri-net is shown in Figure 5.8. From Figure 5.8, we can see that Tit" ’ = Maxia+¢o+k (¢o+C). Tl! ‘l'f’+(¢o+c +111). Till-”+01% +10} (5.6) The reason that T 5"”) and TIM-1), 0S1 +¢i+kMax18f>. ¢n+c1 93 01800). 880‘”) 018°”. 88°”) 01602). 6802)) l (1110”). 8500)) ——-~ 00 01 02 ___. (11110), 5110)) ___. 1o 11 12 ___. Figure 5.7. Computational units connected in a 2-dimensional mesh Thus, we have Tiff) = Max{Z, firmer, Tif'j'l)+l’} = Max{Z+Y, fli‘l'j)+X, Tlf‘l'j'l)+X+Y, TEJ‘Z)+2Y} = Max {Z+(i—1)Y, Tf‘l'j‘ll-hXHY, Tij°>+jn OSlSj-l However, T530) = Max{Z, T5,"1v°>+x, U1+i8'1} Thus, we have 7517) = Max{Z+jY, Tf‘l'j")+X+lY, U1+i81+jY} OSISj = 0M?{Z+X+1Y, rfj-Zf-‘uzxw, U1+i81+jY, U1+(i-1)81+X+1'Y } - J = Gigi-ax1zm—1)X+jr, TIO'j")+iX +lY, U1+81+(i—1)K(81-X )+(i-1)X +1Y} - S} However, T101): Max{Z, U0+j821, TIO'j‘1)+Y} 94 node 1' -1 ' j T a c nodeij f1 OOO s» Q c O. f0 OO f1 0 c Figure 5.8. The Timed Petri-net of the 2-dimensional mesh 95 - and T100) = Max[Z, U0, U1} It follows that T)? ) can be expressed as 1511' l = Maximo. 1100)+¢1+kK(5I)O)-(¢0+C))+jK(50-(C+\V+¢1)), (5.7) ui°’+¢i+k K(510)-(¢o+6))iK(5i-(C+\ll+¢o))} + k(¢0+C) + i<¢o+c +1) + j (¢1+C +10 The total execution time of the 2-dimensional array is then given by T9111” '1). For computational units with other interconnection structures, the technique developed in this section can also be applied to analyze their performance. 5.4. ANALYSIS OF PIPELINED MATRIX MULTIPLICATION The analytic model presented in the previous section can be directly applied to Algorithm 4.3, which has a 2-dimensional mesh structure. However, due to the tree reduction performed in Algorithm 4.2, the analysis derived in Section 5.3 cannot be used directly. Therefore, in this section, the execution time (T) of Algorithm 4.2 (see Figure 4.3) is derived using the analysis techniques for a single computational unit. Assume that matrices A, B, and C are all square matrices with size M xM. The analysis can be easily extended to non-square matrices. Matrices A and B are parti- tioned into n1xn2 and nzxn3 submatrices, respectively. Assume that n1, n2, and 113 can divide M evenly, and that N =n 1 11; nodes are used to perform the matrix multiplica- tion. Thus, we have (see Figure 4.4) 96 2 size ofAU is bM for0Si(o.+2v23) and (2) when it receives the A submatrix and is ready for 8,, 2.110 , i.e., or“) = (i+1)n2(0h+912) + W12 + 0n Then, T8), 1.1.“), and 50') can be Obtained from (5.1), (5.2), and (5.3), respectively. Note that the interarrival time (88)) of adjacent input data blocks to node1,,,2_1 is equal to the iteration time of node,-_1.,,2_1 (804)). Now consider the host. The host will finish downloading submatrices of A and B and be ready for the uploading at 01 = N(01t+912) + n2n3(01t+923) + 0111 When uploading, the host can also be viewed as a computational unit, which has n1 input streams of C submatrices with parameters (“(1“). 8(0), 0Si Sn1-l. After receiving one C submatrix, the host needs an interval of f,-=013 + 0),, to copy the submatrix from the system buffer and set up the next receive routine. 
From (5.1) we can see that the host will finish the uploading (which is also the time the job finishes) at

    T = Max_{0≤i≤n1−1} { α+φ_0, μ^(i)+φ_i+(n3−1)(δ^(i)−n1(θ_13+θ_hr)) } + (n3−1)·n1(θ_13+θ_hr)

where φ_i = (n1−1)(θ_13+θ_hr) + θ_13 and c = θ_hr.

5.5. A COMPARATIVE STUDY OF THE ANALYTIC MODEL

In this section, the execution time derived in Section 5.4 is validated against experimental results obtained from the NCUBE. The close match between the analytic and experimental results indicates the accuracy of the model. Note that the analytic results are obtained by using the system parameters listed in Table 3.1.

Figure 5.9. Measured and analyzed performance for various pipeline configurations (execution time T versus the number of column partitions n1 for N = 2 to 32; 64x64 matrices, n3 = 8)

In Figure 5.9, the total execution times of the pipelined matrix multiplication algorithm (Algorithm 4.2), both measured on the NCUBE and predicted by the analytic model, are plotted. Again, multiplications of 64x64 matrices are studied using 2 to 32 nodes. The block size of the submatrices of B is fixed (by setting n3 = 8), and the pipeline configuration is changed by choosing the parameter n1. It can be seen from Figure 5.9 that our analytic model predicts the execution of the pipelined matrix multiplication correctly and that the errors between the analyzed and measured data are within 5%. The same accuracy of the analytic model can also be observed if we change the partition size, as shown in Figure 5.10.

Figure 5.10. Measured and analyzed performance for various partition sizes (execution time T versus the partition pattern n1 x n2, measured and analyzed, for n3 = 2 and 32; 64x64 matrices, 16 nodes)

The optimal partitions predicted by the analytic model and those obtained from the NCUBE are listed in Table 5.1. Again, we can observe a perfect match between these two sets of data. This indicates that the analytic model can be used to choose the best partition for a given problem instance.

Table 5.1. Optimal partitions (n1, n2, n3) for the pipelined algorithm

    N          2           4           8           16          32
    measured   (2,1,16)    (4,1,16)    (8,1,16)    (8,2,16)    (8,4,8)
    analyzed   (2,1,16)    (4,1,16)    (8,1,16)    (8,2,16)    (8,4,8)

Having established the accuracy of the analytic model, we then use the model to predict the performance of Algorithm 4.2 using 1024x1024 matrices and up to 1024 nodes. Figure 5.11 shows the predicted speedups for various environments. From Figure 5.11, it can be seen that, if the underlying architecture remains unchanged, i.e., we use the set of system parameters in Table 3.1, then the speedup of the algorithm levels off when the number of processors increases to more than 600 (see the dashed line in Figure 5.11). As discussed, the communication overhead between nodes and between the host and the nodes all contribute to this inefficiency.

Figure 5.11. Predicted speedups of pipelined matrix multiplication (predicted speedup versus the number of processors N for 1024x1024 matrices; curves: gathering/scattering with no overlap, 50% overlap, without gathering/scattering, and the ideal speedup)

Now suppose that 50% of the communication overhead can be overlapped with the computation processor. Thus,
for example, the single-byte transmission time on the host, which is t_h = 0.068 and t_h' = 0 in Table 3.1, now becomes t_h = t_h' = 0.034, where t_h' is the portion of the transmission time that can be overlapped with the computation processor. The dotted line in Figure 5.11 shows the speedup corresponding to this case. It can be seen that, by increasing the overlapping between the computation and communication processors, the speedup of the algorithm can be doubled (at N = 1024 nodes). Using the new techniques described in Section 1.2, such a degree of overlapping is not difficult to achieve in future-generation DMMs.

In Section 3.2, we discussed the gathering/scattering operation in the host and noted that this operation is very expensive. The solid line in Figure 5.11 shows the speedup that would be obtained if the gathering/scattering operation took no time. We can see that the speedup becomes almost linear in this case. The major reason is that we eliminate the sequential gathering/scattering operation, which cannot be parallelized. Therefore, it is worth the effort to reduce the overhead involved in this kind of operation, either through system design (e.g., a better compiler or host hardware) or through programming techniques (e.g., using pointers instead of indices to reference matrix elements).

CHAPTER 6

DESIGNING PIPELINED DATA PARALLEL ALGORITHMS

For many, designing algorithms is an art. However, as the technology advances, systematic approaches for developing algorithms have been discovered for certain classes of problems. This is possible because, in essence, the process of program development is nothing more than a series of transformations: from the problem specification, through algorithm design, down to program coding and testing.

In the previous chapters, we presented the basic concept of large-grain pipelining and the modeling techniques to analyze pipelined data parallel algorithms. However, the power of large-grain pipelining cannot be fully exploited if the development of pipelined data parallel algorithms is very difficult. In this chapter, a systematic approach for designing pipelined data parallel algorithms on DMMs is described.

The key to such a design procedure lies in the realization that systolic algorithms are a special class of pipelined data parallel algorithms (see Section 1.3). Therefore, many techniques for synthesizing systolic arrays can be adopted here [FoFW85, FoWa87]. In general, these techniques start with a nested-loop program or a set of linear recurrence equations, go through a series of transformations, and end up with a systolic array design. A brief review of these techniques will be presented in Section 6.1.

The major difference between pipelined data parallel algorithms running on DMMs and systolic algorithms is the granularity. The processing cells of systolic arrays operate on and exchange one data element at a time. However, this mode of operation is inefficient on DMMs due to the communication overhead in invoking messages. It follows that several data items or operations should be grouped together to run on one processor of the DMM, and data exchange between processors should use one group as a unit. The grouping problem thus becomes the unique feature and primary consideration in the design procedure. In particular, one has to address the following issues:

• Which data and operations can be grouped together?
• What is the size of a group?
• How is the communication from one group to other groups minimized?
A general procedure for designing pipelined data parallel algorithms on DMMs is outlined in Section 6.2. Then, in the remainder of this chapter, the grouping problem is defined, assumptions are stated, and techniques for grouping data and operations are elaborated. An example is given at the end of the chapter (Section 6.7) to illustrate the application of the grouping techniques.

6.1. SYNTHESIZING SYSTOLIC ARRAYS

Over the past few years, there has been a significant amount of work on synthesizing systolic arrays [FaSh87, FoFW85, MoFo86]. These techniques target problems which can be expressed as shift-invariant nested-loop programs, i.e., programs whose loop data dependencies do not change with the loop indices [KuLJ87]. Without being stated explicitly, all programs referenced in the following discussion will be assumed to be shift-invariant. In this section, the basic ideas of these synthesizing techniques, which form the foundation of our design procedure, are reviewed.

Underlying all these synthesizing techniques is a representation which expresses the data dependencies between the loops of the given program. In general, a nested-loop program with n levels is usually expressed as follows:

    for q_0 := l_0 to u_0 do
      for q_1 := l_1 to u_1 do
        ...
          for q_n-1 := l_n-1 to u_n-1 do
            (loop body)

The above program can be transformed into a directed graph, Q, in an n-dimensional space. Each vertex in Q represents one loop instance and has the coordinate (q_0, ..., q_n-1) in the space if the corresponding loop instance has the loop index (q_0, ..., q_n-1). There is an arc from one vertex v_i to another vertex v_j if the loop corresponding to v_j references a variable which is generated in the loop corresponding to v_i. Also, there is an arc from an input variable to every loop that references that variable. A variable is an input variable if it is referenced before or without being assigned a value in the program. Such a graph will be called the computational structure of the loop program [MiWi84]. The arcs are expressed as the difference vectors of their two end vertices and will be called the dependence vectors of the structure.

However, the derivation of a computational structure from a given loop program is not straightforward. Consider the program for matrix multiplication:

    for i := 0 to M-1 do
      for j := 0 to M-1 do                          (6.1)
        for k := 0 to M-1 do
          c_ij := c_ij + a_ik x b_kj ;

There is an implicit data dependency between adjacent iterations of the inner-most loop via the variable c_ij, because c_ij is updated in each of the M iterations and its value depends on its own value in the previous iteration. To enforce explicit dependencies between the iterations, we should require that the single-assignment rule be followed. Note also that a_ik will be used in each of the M iterations with the same j index. If one iteration is executed on one processor (as in systolic arrays), then this is equivalent to a broadcast of a_ik to all the processors corresponding to these M vertices. However, broadcast is undesirable and is difficult to handle in systolic arrays [FoMo84].

Thus, to eliminate broadcast and to enforce the single-assignment rule, we must rename the variables in program (6.1) as follows (with the declarations and initializations omitted for clarity):

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do
          a_ijk := a_i,j-1,k ;                      (6.2)
          b_ijk := b_i-1,j,k ;
          c_ijk := c_i,j,k-1 + a_ijk x b_ijk ;

The corresponding computational structure is shown in Figure 6.1.
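As a quick check on the renaming, the single-assignment program (6.2) can be executed directly and compared with the ordinary triple loop (6.1). The sketch below is only an illustration: it uses 0-based indexing, stores every a_ijk, b_ijk, and c_ijk explicitly, and supplies the boundary initializations that are omitted above (a_i,-1,k = a_ik, b_-1,j,k = b_kj, c_i,j,-1 = 0); the comments mark the dependence vector introduced by each renamed variable.

    # Check that the renamed, single-assignment program (6.2) computes the same
    # product as the plain triple loop (6.1).  M is kept small for illustration.
    import random

    def matmul_plain(A, B, M):
        C = [[0] * M for _ in range(M)]
        for i in range(M):
            for j in range(M):
                for k in range(M):
                    C[i][j] += A[i][k] * B[k][j]
        return C

    def matmul_single_assignment(A, B, M):
        a, b, c = {}, {}, {}
        for i in range(M):
            for j in range(M):
                for k in range(M):
                    a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]   # d = [0 1 0]
                    b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]   # d = [1 0 0]
                    prev = c[i, j, k - 1] if k > 0 else 0                # d = [0 0 1]
                    c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
        return [[c[i, j, M - 1] for j in range(M)] for i in range(M)]

    if __name__ == "__main__":
        M = 4
        A = [[random.randint(-3, 3) for _ in range(M)] for _ in range(M)]
        B = [[random.randint(-3, 3) for _ in range(M)] for _ in range(M)]
        assert matmul_plain(A, B, M) == matmul_single_assignment(A, B, M)
        print("programs (6.1) and (6.2) agree for M =", M)

The vectors noted in the comments are exactly the dependence set of the structure discussed next.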
In Figure 6.1, the numbers in the circles indicate the loop indices (i, j, k) of the corresponding iterations. Note that, since the program is shift-invariant, the set of dependence vectors out of each vertex is the same for all vertices. In Figure 6.1, there are three dependence vectors for each vertex:

    D = {d_0, d_1, d_2} = {[0 1 0], [1 0 0], [0 0 1]}

where D is called the dependence set of the computational structure. The process of eliminating broadcast and enforcing the single-assignment rule will eventually transform each variable in the loop into a pipelined variable, which depends explicitly on all the loop indices [MiWi84].

Given the computational structure, Q, of a program, we can map Q onto an array of processors by assigning each vertex (a loop instance) in Q to one processor. The dependence vectors in Q define the necessary interprocessor communication. However, in this implementation, each processor only executes one iteration and idles otherwise. As a result, there are no continuous data flows in the system, and the advantages of systolic arrays are not fully exploited. It follows that a second-level transformation needs to be done.

Figure 6.1. The computational structure of the matrix multiplication (the grid of loop instances (i, j, k) for M = 3, with the input variables a and b feeding the boundary vertices)

The basic idea is to project all the vertices in Q along a particular direction, and thus transform Q from an n-dimensional structure into an (n-1)-dimensional structure Q'. The projection direction represents the time axis, which indicates the progress of time. The projected structure Q' represents the final spatial layout of the systolic array. Given a projection direction z, an index transformation matrix H can be derived [FaSh87]. Each vertex v in Q is projected onto a vertex Hv^T in Q', and each arc d in Q is projected onto an arc Hd^T in Q'. Thus, if the vertices of the computational structure in Figure 6.1 are projected along [i j k] = [1 1 1], we will have the index transformation matrix

    H = [ -1   1   0 ]
        [ -1  -1   2 ]

The corresponding projected systolic array is depicted in Figure 6.2(a), which is the same as that proposed in [KuLe78]. The numbers in the circles of Figure 6.2(a) denote the indices of the loops that are projected onto them. Similarly, a projection along [0 1 0] will generate the index transformation matrix

    H = [ -1   0   0 ]
        [  0   0   1 ]

The corresponding systolic array is shown in Figure 6.2(b), which has the same structure as that in Figure 4.6.

In summary, systolic array synthesizing techniques adopt an intelligent representation to express the data dependencies within a nested-loop program. From this representation, a systematic transformation of the program into a systolic array is possible. Nevertheless, since VLSI circuits have only 2-dimensional layouts, these techniques usually concentrate on computational structures in 2- or 3-dimensional spaces, i.e., n equal to 2 or 3. Although much of the work was developed with n as a general parameter, less is known about the situations in which n is greater than 3 or there is more than one time axis.

Figure 6.2. Projected computational structures of the matrix multiplication ((a) the hexagonal array obtained by projecting along [1 1 1]; (b) the array obtained by projecting along [0 1 0])

6.2. THE DESIGN PROCEDURE

Based on the discussion presented in the previous section,
a general procedure for designing pipelined data parallel algorithms on DMMs is outlined in this section. Again, we start with shift-invariant nested-loop programs. The initial stages of this design procedure are very similar to those of systolic array synthesizing techniques. However, due to the communication overhead on DMMs, the design procedure requires additional stages to group data and operations so as to control the granularity.

(1) Derive the computational structure of the program

Transformations are necessary to restructure the program, for instance, to eliminate broadcast and to enforce the single-assignment rule. Then, each iteration in the program is represented by one vertex in the computational structure, Q. Data dependencies between loops are identified and are represented as arcs in Q.

(2) Map the computational structure onto the space-time coordinate system

One direction is designated as the time axis, along which all vertices in Q are projected. The purpose of the projection is to increase processor utilization and to introduce systolic effects. However, the projection along a particular direction is equivalent to the grouping of all vertices along that direction. Thus, this step is optional and can be combined with the next step.

(3) Group vertices in the computational structure

In this step, adjacent vertices in Q' are merged together to form larger vertices. By controlling the size of the groups, the granularity of the algorithm can be adjusted to balance the computation and communication on a DMM. Nevertheless, the grouping problem is non-trivial. For many computational structures, certain grouping schemes will introduce extra dependence vectors between the groups. This implies extra communication in the resultant algorithm. Details of the grouping will be discussed in the following sections.

(4) Assign the groups to the processors of the DMM

In this step, we allocate all the vertices (i.e., loop instances) of a group to one processor of the DMM. Since the resultant computational structure has simple interconnections, the mapping process is straightforward. Besides, as discussed in Chapter 1, using next-generation DMMs [ShFi88], the mapping problem is no longer critical.

(5) Determine the optimal design parameters

Design parameters such as the exact size of the groups and the number of pipelines are determined. The analytic model introduced in Chapter 5 can be used here. The accuracy of the model and its associated parameters have a bearing on the final performance.

Note that, due to the asynchronous nature of large-grain pipelining, the design procedure presented here does not have to consider timing. On the contrary, since systolic arrays are synchronous, timing is very important in synthesizing systolic arrays in order to decide the time at which a datum should be sent to another processor. In this regard, the concept of large-grain pipelining is similar to that of wavefront arrays [KuAG82].

In the remaining sections, the core of the above design procedure, namely grouping, will be discussed more formally and in detail.

6.3. THE GROUPING PROBLEM

Let Z, I, and I+ denote the sets of integers, non-negative integers, and positive integers, respectively. The input to the grouping problem is a computational structure obtained from a shift-invariant nested-loop program, with or without projection. Therefore, each vertex in the structure represents one loop instance (in the unprojected case) or a set of instances (in the projected case).
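Before the formal definitions that follow, the grouping step can be pictured operationally: carve the (projected) vertex set into blocks and see which dependence vectors survive between the blocks. The sketch below does this for a 2-dimensional structure; the axis-aligned r0-by-r1 blocks and the exhaustive check are illustrative simplifications rather than the formal treatment of the next sections.

    # Group the vertices of a 2-D computational structure into r0-by-r1 blocks
    # and report the dependence vectors induced between the blocks.  If no new
    # vectors appear, the grouping does not increase the number of neighbors a
    # processor must talk to; opposite induced vectors (d and -d) would signal
    # a cycle between neighboring groups.
    def contracted_dependences(V, D, r0, r1):
        group = lambda v: (v[0] // r0, v[1] // r1)
        induced = set()
        for (x, y) in V:
            for (dx, dy) in D:
                w = (x + dx, y + dy)
                if w in V and group(w) != group((x, y)):
                    g0, g1 = group((x, y)), group(w)
                    induced.add((g1[0] - g0[0], g1[1] - g0[1]))
        return induced

    if __name__ == "__main__":
        M = 8
        V = {(i, j) for i in range(M) for j in range(M)}
        D = {(1, 0), (0, 1)}
        print(sorted(contracted_dependences(V, D, 2, 2)))    # [(0, 1), (1, 0)]
        D2 = {(1, 0), (0, 1), (1, 1)}
        print(sorted(contracted_dependences(V, D2, 2, 2)))   # [(0, 1), (1, 0), (1, 1)]

Both example groupings leave the dependence set unchanged; the sections below characterize exactly when this is guaranteed.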
The computational structure, Q, is a two-tuple. Q =(V,D). where V={(q0. ..., q,,_1) I l,-Sq,-Su,-, 0S1'Sn—1} is the set of vertices in Q. l.- and u,-. 0SiSn-1, 112 are the lower and upper bounds of the i-th coordinate. and D={d0. dm_1} is the set of dependence vectors in Q. El Suppose that V is a subset of Z ", the set of integer n-tuples. Thus. all coordinates referred to will be integers. Also, without loss of generality. the upper and lower bounds in V, i.e., u,- and l,-. 0Si Sn-l, are assumed to be constant, and l,-S0 and u,->0 so that the vertex with coordinate (0. ..., 0) is defined in Q. In this study. only acyclic computational structures are considered. In other words. if there exist do. ..., d1,_1eD and a0. ..., ak-1 e I, kSm, such that dodo +a1d1 + ‘ ‘ ' +ak_1dk_1 =0 then a0: ...=a,,-1=0. Cycles will result in a lock-step synchronization between adjacent nodes for each iteration. which is undesirable for DMMs. More details of acyclic com- putational structures will be given later. Let Q (V,D) be a computational structure. (1) A vertex v15 V is dependent on a vertex V16 V along a vector (1 if V] -v1=d (2) A vertex v15 V is reachable from a vertex v16 V via the set of vectors D’=[do. ..., d1,_1} if v1 = v1+ aodo + ...+ ak_1dk.1, where aieZ. 0SiSk—l. (3) The set of vertices. U (V,D’), which is spanned from a vertex vs V by a set of vectors, D’={d0. ..., dk_1}, is the set of vertices in Q that are reachable from v via D’, i.e., U(v.D’) = [w l we V, w=v+aod0+ ...+ ak_1dk_1, a,-eZ. 0SiSk-l} [:1 Note that any vertex in U (V,D’) is reachable from any other vertex in U (V,D’). Therefore. it is sometimes convenient to omit where the spanning starts and to speak only of the set of vertices that is spanned by D’, which is denoted U (D’). 113 The contracted structure, Q’, of a computation structure Q (V,D) with respect to the grouping Gd.,(Q) is a directed graph, where ' (1) Each vertex in Q’ corresponds to one group in G4,,(Q); (2) If P,- and PJ- are the groups in Gd,,(Q) which correspond to the vertices v1’ and v1’ in Q’, respectively, then there is an are from v1’ to v1 ’, if P,- depends on P,. Q’ is dependence preserving if Q’ is an acyclic computational structure. (V’,D’), with ID’ISID |. E] The graph shown in Figure 6.3(b) is a contracted structure of that shown in Figure 6.3(a). The idea of grouping is very similar to that of emulation in [FiFi82]. Note that, for a given grouping. the corresponding contracted structure may not be unique. depending 115 on how the base vertices are chosen. The significance of dependence preservance is that there is no extra dependence relationship introduced in the grouping. In other words. the number of processors with which a processor needs to communicate does not increase when loops are grouped. This is particularly important in DMMs with non-negligible communication overhead. The contracted structure shown in Figure 6.3(b) is depen- dence preserving. In fact. it has the same structure as the original computational struc- ture. As mentioned earlier, we only consider acyclic computational structures in this study. The implication of acyclic computational structures needs further explanation. In general, any computational structure derived directly from a nested-loop program is acyclic. Otherwise. there will be a deadlock in the execution of the program —— a loop needs a value from another loop which is waiting for the value generated in the first loop. Cycles will be introduced only after a computational structure is projected or grouped. 
For example. consider the following nested-loop program: f0ri:=0 to M—l do forj := 0 to M-l do bij I= bi.j-1 + bi-1.j+1 3 The corresponding computational structure for M =4 is shown in Figure 6.4(a). The pro- jection of the computational structure along the direction [1 0] will results in a con- tracted structure as shown in Figure 6.4(b). Cycles are introduced. Note that each ver- tex. k, 0Sk S3, in Figure 6.4(b) represents a program segment as follows: fori:=0t03do but I= biJc-l + bt-1,1t+1 3 Suppose each vertex in Figure 6.4(b) is executed on one processor. Then, each processor must exchange data with its two neighbors at every iteration. In other words. processor k cannot execute two consecutive loops and produce, say. both by‘ and b1), together without communicating with others. It follows that, even though several ver- tices are grouped together, the granularity of the algorithm does not increase at all! 116 along j W (C) I project Figure 6.4. A computational structure with possible cycles On the other hand, the grouping of the computational structure in Figure 6.4(a) along [0 1] will result in an acyclic contracted structure as shown in Figure 6.4(c). Each vertex, k, 0Sk S3, in Figure 6.4(c) represents a program segment as follows: for] := 0 to 3 do bkj == bk.j-1 + bk—1.j+1 ; Again. if each vertex is assigned to one processor, then processors do not have to exchange data at every iteration. The execution in processor k depends only on the out- puts from processor k—l. Thus. for example, processor k can execute two loops, j=0. 1. before it sends bko and b“ together to processor k+1. In this way. the granularity can be adjusted by changing the grouping size. Thus. cyclic computational structures may exist in systolic arrays (see Figure 6.2(a)), but they are not desirable in pipelined data parallel algorithms when executed on DMMs. 117 From what we have discussed so far. the grouping problem considered in this paper can thus be stated as follows: Given a computational structure Q. the grouping problem is to determine those groupings of Q which will result in dependence preserving contracted structures. [II In the following discussions, we will concentrate on computational structures in 2-dimensional spaces. For higher dimensional spaces. the results obtained here may be extended. We will consider computational structures with one, two. three, and more than three dependence vectors. respectively. For each case. we study the conditions under which the computational structure can be partitioned into independent sets of ver- tices. We also present the necessary conditions that a grouping will result in depen- dence preserving contracted structures. 6.4. GROUPING WITH ONE OR TWO DEPENDENCE VECTORS Suppose the computational structure Q (V,D) has only one dependence vector. i.e., D={do). Figure 6.5(a) depicts such a computational structure with do=[0 1]. Since there is only one dependence vector in a 2-dimensional space, vertices in Q are divided into a number of independent sets. Each independent set is spanned by the vector do, and vertices in different independent sets cannot reach each other. i.e., there is no data dependency between these set. Thus. these sets can be executed independently of each other. The grouping Gd”, rel+ of a computational structure Q (V,D) with D=[d0} is dependence preserving. 118 eoee oooe comes ’0 o e it: Figure 6.5. A computational structure with one dependence vector I'""""'-‘I 99 I'”""""""I Consider any group P in Gd“... 
Since there is only one dependence vector in the system. there exists a linear ordering of all vertices in P, say. (v0. ..., v,_1). where v1+1-v1=do. 0S j In a computational structure Q (V,D) with D={do, d1 }. the number of indepen- dent sets spanned by do and d1 is equal to IW I. Let v16 V be a vertex in Q such that the vertices v1=v1+d0+d1, vk=v1+do, and v1=v1+d1, are all in Q. Define the set W, = [ w I weZz, w=v1+xdo+yd1. 0Sx,y<1} Then, obviously, IW I: l W, l. Note that vertices in W, cannot reach each other via do and d1. To prove the theorem. we have to show first that Wgi. This implies that there are at least IW l independent sets. Next. we have to prove that all vertices in V are 120 reachable from a vertex in Wx. In other words, the number of independent sets in Q is at most I W I. It follows that vertices in Q are divided into exactly I W l independent sets. Now, recall the definition of V (Definition 6.1). On a 2-dimensional space, we have V = {(40.41) I 1054090. 115415141} Since v1, V1, V11, V16 V, all points with integer coordinates and inside the parallelogram enclosed by v1, V1, V1,, and v1 must also be in V. It follows that Wgi. On the Other hand, since Q is acyclic, do and d1 are independent. Any vertex vs V in Q can be expressed as v = v1+(ao+x)do + (a1+y)d1 = v1+ aodo + a 1d1 + w’ where do, 0162 and w’=xdo+yd1eZZ, OSx,y <1. By choosing appropriate do and an, we can make w’e W. This implies that v can be reached from a vertex, v1+w’ in W,. Thus, the theorem is proved. El From Figure 6.6, it can be seen that W=[(0,0)_, (0,1)}. Therefore, there are two independent sets in the computational structure. The significance of independent sets is not only that each set can be executed independently, but also that the grouping can be formed across the independent sets in any arbitrary manner. Let us now consider the grouping along either do or d1. The grouping Go,,(Q) of a computational structure Q (V,D) with D={do, d1} is dependence preserving if de D, r21, and the base vertices of the groups are chosen along do and d1. Since the grouping in one independent set will not affect that in other sets, we can consider each independent set separately. Thus, without loss of generality, we will 121 assume IW I=l, d=do, and d’=d1. Let P,- and Pi be two groups in Go”. Order vertices in P1- and Pj according to their dependencies in do as follows: For Pi: (no, ..., Ill-.1), WhCI‘C uk, 111-.16P1' and uk+1-llk=do, 03k The grouping of a computational structure Q (V,D), where D={do, d1}, along do with a size ro21 and along d1 with a size r121 is dependence preserving if the base vertices are chosen along do and d1. D Figure 6.7 illustrates a grouping along both directions, where groups are enclosed in the dashed lines. It is easy to check that the contracted structure Q’ is dependence preserving. In fact, Q’ has the same structure as the original computation structure. Again, the grouping along d2, where d2¢aodo and d2¢a 1d1, do, 0162, is equivalent to that on computational structures with three dependence vectors, D={do, d1, d2 }. The latter will be discussed in the next section. 122 Figure 6.7. Grouping along two directions 6.5. GROUPING WITH THREE DEPENDENCE VECTORS In this section, the grouping on computational structures with three dependence vectors is discussed. Let Q(V,D) be a computational structure with D={do, d1, d2}. Assume further that 02d2 =aodo +a1d1 (6.3) where do, a 1, and a; are the smallest positive integers to satisfy (6.3). 
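For small integer dependence vectors, the coefficients in (6.3) can be found mechanically by a bounded search over positive integers. The helper below is such a brute-force sketch (the search bound is an arbitrary choice made for illustration); it reproduces the relation used in the example that follows.

    # Find positive integers (a0, a1, a2), with a2 as small as possible, such
    # that a2*d2 = a0*d0 + a1*d1.  Returns None if no solution exists within
    # the bound.
    def smallest_coefficients(d0, d1, d2, bound=16):
        for a2 in range(1, bound):
            for a0 in range(1, bound):
                for a1 in range(1, bound):
                    if all(a2 * d2[t] == a0 * d0[t] + a1 * d1[t] for t in range(len(d2))):
                        return a0, a1, a2
        return None

    if __name__ == "__main__":
        # The structure of Figure 6.8: d0 = [1 0], d1 = [1 2], d2 = [1 1].
        print(smallest_coefficients((1, 0), (1, 2), (1, 1)))   # (1, 1, 2)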
A typical exam- ple is shown in Figure 6.8(a), where do=[1 O], d1=[1 2], d2=[l 1], and 2[1 1] = [1 O] + [1 2]. Again, we start with the study of independent sets in Q. It turns out that do plays a very important role in determining the independent sets. Let W be the base set with respect to do and d1 in Q. Then, I W l2a2. Since IW l21, the case for a2=1 is trivial. Now, suppose a221. Due to the fact that do, 01, and a2 are the smallest positive integers to satisfy (6.3), there does not exist any integer solution, x,-, y,-, 1.<.i 502-1, for each of the following az-l equations: 123 Figure 6.8. A computational structure with three dependence vectors idz =xido + Yidl In other words, for any vertex vs V the vertices v1=d2+v, ..., v33..1=(a 2—1)d2+v, are not reachable from v via do and d1. Furthermore, none of these vertices can reach each other via do and d1. It follows that each of v, v1 , ..., v1.2-1 lies in a different independent set spanned by do and d1. Thus, IW IZaz. Cl Let W’ denote the base set with respect to do, d1, and do. Then, W’ can be defined as W’= {w I weZz, w=xdo+yd1, 09c,y Given a computational structure Q(V,D) with D={do, d1, d2}, the number of independent sets spanned by do, d1, and d; is equal to IW’I and I W I: I W’ l xaz. 124 The proof of the first part is very similar to the proof of Theorem 6.2. Let v be a vertex in V such that all vertices in Wx’= {w 1 W622, w=v+xdo+yd1, 05x,y<1, and W¢v+zd2, 262} are defined in V. Then, lWx’ l =| W’ I. Note that Wx'=Wx-U (v. {(12 }) where W, is defined as in the proof of Theorem 6.2. Now, Wx contains vertices in V that cannot reach each other via do and d1. U (v,{d2}) is the set of vertices which can be reached from v via d2. It follows that the vertices in W,’ cannot reach each other via do, d1, as well as d2. In addition, any vertex ue V is reachable from a vertex in W,’, because ll = a ado-+0 1d1-1'0 2d2+W+V where aie Z, 05132, we W’. Thus, IW’I gives the total number of such independent sets. Now consider each independent set spanned by do, d1, and d2. Following the proof in Lemma 6.1, it can easily be seen that, within each set, there are a2 independent sets spanned only by do and d1. It follows that IW I: I W’ l xaz. D In Figure 6.8, since a2=2, there are two independent sets spanned by [l 0] and [1 2]. Furthermore, we have W={(0,0), (0,1)} and IW I=2. This implies that IW’|=1 and that all vertices can be reached from (0,0) via those three dependence vectors. As mentioned in Section 6.4, independent sets spanned by do and d1 can be con- sidered as forming independent "planes". Then, the third vector, d2, can be viewed as forming a thread from one plane to the other. From Theorem 6.4, we can see intuitively that the grouping along (1; with a size r2=c12 will be dependence preserving, because we are linking a2 independent sets together. However, any grouping along (12 with a 125 size larger than a2 will include vertices which belong to the same independent set into one group. As a result, cycles are introduced into the contracted structure, and the resul- tant structure is not dependence preserving. Consider the example shown in Figure 6.9(a) where where do=[1 0], d1=[0 l], d2=[l l], and [1 l] = [1 0]+[0 1]. (a) (b) Figure 6.9. A grouping which generates a non-dependence-preserving structure In Figure 6.9(a), the grouping along d2=[l l] is depicted in dashed boxes. The labels besides the boxes denote the group ids. Figure 6.9(b) shows the corresponding contracted structure with the group ids indicated in the circles. 
It can be seen that an extra dependence vector, [0 -1], is introduced. Thus, the contracted structure is not dependence preserving. To prove the above points in a more formal way, we need the following Lemma: Let Q(V,D) be a computational structure with D={do, d1, d2}, where azdz =aodo + a1d1. Let P,- and Pj- be two groups in the grouping Go”, rel+, such 126 that P} is dependent on P1 along de {do, d1}. If G111, is dependence preserving, then the base vertex of P 1- must depend on the base vertex of P,- along (I. Let no and vo be the base vertex of P,- and P j, respectively. If vo depends on no along (I, then there is a one-to—one dependence between the vertices in P,- and P ,- along d. As a result, no extra dependence vector will be generated along (I in the contracted structure. On the other hand, assume that PJ- depends on P,- along d, but vo—uo¢d. Sup- pose further that P j also depends on P1 along d2. The relationship between P1, P j, and P1 is illustrated in Figure 6.10. Let the vertices in P1, Pj, and P, be ordered according to the dependencies in d; as follows: For P1: (uo, ..., ur_1), where uk, u,_1eP,- and uk+1-uk=d2, 03k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where azdz = aodo + a1d1. Then, the grouping Gdz,,(Q) with a size r>a2 is not dependence preserving. Suppose the group is dependence preserving, and P is a group in the grouping. If we move from the base vertex of P along do for do steps, then along d1 for a1 steps, then in each step we should visit a base vertex of a group (Lemma 6.2). However, since r>a2 and the relation in (6.3) holds, in the last step, we will visit a vertex which belongs to P and is not a base vertex. Thus, from Lemma 6.2, we arrive at a contradic- tion and prove that the grouping is not dependence preserving. 128 El Following the same argument, we can arrive at the following result: Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where (12d; = aodo + a1d1. Then, the grouping Go”, with r The proof is similar to that of Theorem 6.5. Let P be a group in Go”. Then, as we move from the base vertex vo of P along do for do steps and then along d1 for a1 steps, we should visit base vertices along the way until the last step where we arrive at a ver- tex vk. From (6.3), the following relation holds: vk-vo=aodo+a1d1=a2d2. However, Lemma 6.2 requires that vk be a base vertex in order for Go”, to be dependence preserving. This condition will be satisfied only when the group size r is a factor of a2. C! Now, let us consider the grouping along do and d1 and see how (12 may affect this grouping. We study the dependence relationships within a contracted structure first. Let Q (V,D) be a computational structure with D =(do, d1, d2 }, where 02d; =aodo + a1d1. Then, there exists a dependence preserving grouping along d. with size (11, where is {0,1,2}. Furthermore, the dependence vectors in the resultant contracted structure Q’(V’,D’) have a relationship of az’dz’ = ao’do’ + a 1 ’d1’ (6.4) where D’={do’, d1’, dz’}, and aj’=aj, ifd1¢d1; aj’=1, otherwise, 05]“ $2. Let d, and (1,, x,ye {0,1,2} denote the two vectors in D other than d1, and W’ be the base set with respect to do, d1, and d2. The grouping which is dependence preserv- ing along d1 can be formed as follows: 129 For each vertex in W’ perform the following operations: (1) Form a group along d1 with size 01-; (2) From the base vertex of the current group, choose the base vertex of the next group along a1d1, dx, or (1,. 
until all vertices in the current independent set are grouped. First, it is necessary to show that the above procedure does produce a grouping which is dependence preserving. Consider any independent set spanned by do, d1, and d2, which contains we W’. Let U (w, A) be the set of vertices which are spanned from w via A={a,~d1, dx, dy }. Note that vertices in U (W,A) can reach each other via A. From step (2) above, it can be seen that all base vertices in this independent set are contained in U (w, A). In addition, U (w, A) contains only base vertices. Suppose that this is not the case, and that uke U (w,A) is not a base vertex. Assume further that uke P,- and no is the base vertex of P1. Then, uk = w + za1d1+xdx +ydy = no + kd1 -> no = w + (za,--k)d1+xdx +ydy where x, y, 26 Z. However, k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where d; = do + d1. Any grouping along de {do, d1} with a size rel + is dependence preserv- ing. Without loss of generality, assume d=do. Consider any group P in the grouping. Order all vertices in P according to the dependencies in do as follows: (vo, v,_1), where v11, v,_1eP and vk+1-vk=do, 03k Let Q (V,D) be a computational structure with D ={do, d1, d2 }, where 02d; = aodo + a1d1. Then, the grouping along do with a size of iao, along d1 with a size of ja 1 , and along do with a size of a2, where i, j e I +, is dependence preserving. From Lemma 6.3, we can see that, if Q is grouped along do with a size ao, d1 with a size 01, and d; with a size a2, then the resultant contracted structure Q’ is depen- dence preserving and has a set of dependence vectors with the following relationship d2, = do, + d1, (6.5) Therefore, the grouping of vertices in Q’ along do’ with a size 1' and along d1’ with a size j, i, je I + is dependence preserving (Lemma 6.4). However, the grouping in Q’ along do’ with a size 1' is equivalent to the grouping along do with a size iao in Q. 131 Similarly, the grouping in Q’ along d1’ with a size j is equivalent to the grouping along d1 with a size ja1 in Q. Thus, the theorem follows. [:1 The strategy used in Theorem 6.6 is to transform the computational structure into the one with a hexagonal structure (described by (6.5)). This kind of structure is referred to as the universal planner array, which is the most general systolic array in 2-dimensional spaces [MiWi84]. The same structure will be used in the next section to construct grouping schemes for computational structures with three or more depen- dence vectors. 6.6. GROUPING WITH FOUR OR MORE DEPENDENCE VECTORS When computational structures have four or more dependence vectors, the rela- tionships between the dependence vectors become very complicated, which make it difficult to analyze individual vectors separately. Furthermore, the chance that there are independent sets spanned by all vectors is rare. As a result, we concentrate in this sec- tion only on grouping schemes but not on independent sets. 
Given an acyclic computational structure Q (V,D) on a 2-dimensional space with D: {do, ..., dm_1], where m>2, we can always find (from linear algebra theory) two vectors, say, do and d1, such that Cid} = aido + bid] (6.6) where a1, b1, (2161+, 251' 132 Let Q(V,D) be a computational structure with D={do, ..., dm_1}, where m>2, _ c1d1 = aido + b1d1, and 2Si For any vertex ue V, if u is dependent on v, then there exists a dependence vector d1, 09' Let Q(V,D) be a computational structure with D={do, ..., dm_1}, where m>2, c1d1 = agdo + b1d1, and 251' .amax and along d1 with a size r12bmx is dependence preserving. Consider any group P with base vertex vo. Let P;, Pj, and Pk be the three neigh- boring groups of P as shown in Figure 6.11. It suffices to show that, for any vertex v in P, X (v) is covered by P, P1, P1, and Po. The immediate consequence is that there will only be three dependence vectors in the resultant contracted structure, and these three dependence vectors satisfy the relation in (6.6). In other words, the contracted structure has a structure of the universal planner array. Any vertex w falls within P, P1, P j, and Pk, can be expressed as follows: w = vo +xamaxdo +ybmaxd1 0Sx,y <2 (6.7) Any vertex v in P can be expressed as 133 Now, from the definition of X (v), any vertex v’eX (v) can be expressed as v’ = v + xvtamudo +yv'bmaxd1 OSxo',y,/.<_l = V0 + (xv+xv’)d0 + (yv+yv’)dl where 05(xv+xv'), (yv+yv:)<2. Thus, v’ satisfies (6.7) and is in the region covered by P, P1, P j, and Pk. It follows that, for any v in P, X (v) falls within the above four groups (see the dashed box in Figure 6.11). Therefore, the grouping is dependence preserving. D do l vo l.— amax ——'I Figure 6.11. Relationship between groups in a universal planner array Theorem 6.7 presents a powerful grouping method for computational structures. Specifically, as long as the group size is large enough, then there always exist depen- dence preserving groupings along do and d1. In fact, do and d1 need not be dependence vectors of the computational structure, as long as the relations in (6.6) are satisfied. 134 By applying Theorem 6.7, the resultant contracted structure has only three depen- dence vectors. It follows that, in the final implementation, each processor only has to communicate with at most three other processors. The communication bandwidth is reduced, and the partition task is simplified, because any partition with a size greater than anm and hm, will be dependence preserving. The remaining task is to match the number of partitions (i.e., the number of groups) with the number of processors avail- able. Note that Theorem 6.7 can be extended to computational structures on higher dimensional spaces. In a 3-dimensional space, for example, a cube is adjacent to 7 other cubes in the first octant. Thus, any computational structure in 3-dimensional space can be grouped along three directions. As long as the group size is large enough, the group- ing is dependence preserving and each resultant group is adjacent to at most 7 other groups. 6.7. AN EXAMPLE OF GROUPING An example is given in this section to illustrate the application of grouping and to summarize what has been discussed so far. 
Consider the development of a pipelined data parallel algorithm from the following "artificial" nested-loop program:

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do                        (6.8)
          c_ij := c_ij + b_ik x e_j,k-2 / g_i-1,k ;

(1) Derive the computational structure of the program

It can be seen from the above program that the b_ik's and g_i-1,k's, 0 ≤ i,k ≤ M-1, are broadcast along the loops indexed by j, and the e_j,k-2's, 0 ≤ j,k ≤ M-1, are broadcast along i. By eliminating broadcast and enforcing the single-assignment rule, the above program can be transformed into the following program:

    for i := 0 to M-1 do
      for j := 0 to M-1 do
        for k := 0 to M-1 do
          b_ijk := b_i,j-1,k ;                      (6.9)
          e_ijk := e_i-1,j,k-2 ;
          g_ijk := g_i-1,j-1,k ;
          c_ijk := c_i,j,k-1 + b_ijk x e_ijk / g_ijk ;

The necessary initializations of b_ijk, e_ijk, and g_ijk are omitted for clarity. From the transformed program, the dependence vectors can be identified as follows: D = {[0 1 0], [1 0 2], [1 1 0], [0 0 1]}. The variables which introduce these dependence vectors are listed below:

    b_ijk → [0 1 0]
    e_ijk → [1 0 2]
    g_ijk → [1 1 0]
    c_ijk → [0 0 1]

Figure 6.12 depicts the computational structure of the program in (6.9). Due to the complexity of the figure, we only show the projected vertices on each of the three coordinate planes. Thus, for example, the plane labeled (1) in Figure 6.12 contains the 16 vertices obtained when the computational structure is projected onto the ij-plane. Each vertex in Figure 6.12 represents one loop instance of the original program.

Figure 6.12. The computational structure of the example program

(2) Map the computational structure onto the space-time coordinate system

To take advantage of pipelining and to further increase the granularity, a projection is sought to transform the 3-dimensional computational structure into a 2-dimensional one. For example, if we project the vertices along the j-axis, we obtain the projected computational structure shown in Figure 6.13. Now, each vertex in the projected structure represents M loops. In fact, the vertex at position (i', k') represents the following program segment:

    for j := 0 to M-1 do
      c_i'j := c_i'j + b_i'k' x e_j,k'-2 / g_i'-1,k' ;        (6.10)

Figure 6.13. The projected computational structure of the example

Due to this projection, the variables b_i'jk' (for fixed i' and k'), 0 ≤ j ≤ M-1, are projected onto the same vertex in the projected computational structure. This means that these variables are kept stationary in the corresponding processor, and that they have to be downloaded to the corresponding processors before the execution can start.

(3) Group vertices in the computational structure

Note that the resultant dependence vectors in the projected structure are d_0 = [1 0], d_1 = [0 1], and d_2 = [1 2], where

    a_2 d_2 = [1 2] = [1 0] + 2 x [0 1] = a_0 d_0 + a_1 d_1        (6.11)

Thus, we have a_0 = a_2 = 1 and a_1 = 2. Since a_2 = 1, from Theorem 6.5, we can conclude that any grouping along d_2 = [1 2] is not dependence preserving. To derive groupings which are dependence preserving, either Theorem 6.6 or Theorem 6.7 can be used. For example, if Theorem 6.7 is used, then note that

    a_max = a_0 / a_2 = 1   and   b_max = a_1 / a_2 = 2

Thus, any grouping along [1 0] with a size r_0 ≥ 1 and along [0 1] with a size r_1 ≥ 2 is dependence preserving. The dashed boxes in Figure 6.13 enclose the groups of vertices corresponding to the grouping with r_0 = r_1 = 2. On the other hand, if Theorem 6.6 is used, then we can first group the vertices in Figure 6.13 along [0 1] with a size of 2. From Lemma 6.3, we can see that the resultant contracted structure will have a set of dependence vectors which satisfies

    d_2' = d_0' + d_1'

Figure 6.14(a) depicts the grouping.
On the other hand, if Theorem 6.6 is used, then we can first group vertices in Figure 6.13 along [0 1] with a size 2. From Lemma 6.3, we can see that the resultant contracted structure will have a set of dependence vectors which satisfies

    d2' = d0' + d1'

Figure 6.14(a) depicts the grouping. The corresponding contracted structure is shown in Figure 6.14(b). The contracted structure has d0' = [1 0], d1' = [0 1], and d2' = [1 1]. Now, from Lemma 6.4, any grouping along d0' and d1' is dependence preserving. Thus, for example, the grouping along [1 0] with a size 2 will generate a dependence preserving grouping which is the same as that shown in Figure 6.13.

Figure 6.14. The grouping along [0 1] with a size of 2

Finally, as noted in Section 4.3, we can exploit pipelining in pipelined data parallel algorithms not only from multiple pipelines but also from partitioning data to form data streams. Thus, we can further partition the M loops indexed by j in program (6.10) into a number of blocks, each having r2 ≤ M loops. Assume that r2 divides M evenly. Then, the program executed in the vertex (i',k') of the resultant contracted structure becomes the following:

    for l := 0 to M-1 by r2 do
      /* communicate with neighboring processors */
      for j := l to l+r2-1 do
        for i := i'×r0 to (i'+1)×r0-1 do                        (6.12)
          for k := k'×r1 to (k'+1)×r1-1 do
            c_ij := c_ij + b_ik × e_{j,k-2} / g_{i-1,k};

where r0 is the grouping size along [1 0] and r1 is the grouping size along [0 1]. The implicit assumption here is that the loops can be interchanged [PaWo86]. Thus, the loops indexed by j are moved to the outermost level. In general, loop interchanging is possible if there is no cycle in the computational structure. Thus, each processor can execute large chunks of data (e.g., the three inner-most loops) before it needs to communicate.

(4) Assign the groups to the processors of the DMM

Suppose the underlying DMM has a hypercube interprocessor connection topology. Then, there does not seem to be a perfect embedding of the hexagonal mesh onto a hypercube with dilation 1. A simple embedding with dilation 2 is to configure the hypercube into a 2-dimensional mesh, and map the hexagonal mesh onto it, with each diagonal link, [1 1], in the hexagonal mesh going through a pair of edges, [1 0] and [0 1], in the mesh. In new generation DMMs, this mapping should not induce too much communication delay.

(5) Determine the optimal design parameters

The exact values of r0, r1, and r2 can be determined by the analytic model presented in Chapter 5. Note that, from a grouping, the following parameters can be obtained:

  •  With which other processors does a processor need to communicate?
  •  Which data does the processor need to exchange?
  •  How large should each message be?
  •  How often does the processor need to communicate?

Referring back to the program in (6.12), we can see that the processor (i',k') needs to communicate every r2 iterations (indexed by j) with its neighbors. In each data exchange, there will be r1 elements of g_ijk sent to processor (i'+1,k'), r1 elements of e_ijk and r0 elements of c_ij sent to processor (i',k'+1), and r1 elements of e_ijk sent to processor (i'+1,k'+1). The same amount of data should also be received from processors (i'-1,k'), (i'-1,k'-1), and (i',k'-1). Finally, a total of r0 × r1 elements of b_ik should be loaded into the processor before program (6.12) can start.
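The design parameters listed above translate directly into the skeleton of a node program. The following C fragment is a minimal sketch under stated assumptions rather than the thesis implementation: the array layouts and the empty exchange_boundaries() stub are placeholders introduced here for the NCUBE nread()/nwrite() calls that would move the r1 elements of g and e and the r0 elements of c between neighboring groups every r2 iterations of j.

    /* Minimal sketch (not the thesis code) of the per-processor structure in
     * program (6.12).  A single group, taken to be (i',k') = (0,0), is
     * simulated; the halo exchange that a real node would perform with
     * nread()/nwrite() every r2 iterations of j is reduced to an empty stub
     * so that the sketch stays self-contained.                              */
    #include <stdio.h>

    #define R0 2                 /* grouping size along [1 0]              */
    #define R1 2                 /* grouping size along [0 1]              */
    #define R2 4                 /* j iterations between communications    */
    #define M  8                 /* problem size; R2 divides M evenly      */

    static float c[R0][M];       /* block of c(i,j) owned by this group           */
    static float b[R0][R1];      /* stationary block of b(i,k)                    */
    static float e[M][R1];       /* operands e(j,k-2), stored already shifted in k */
    static float g[R0][R1];      /* operands g(i-1,k), stored already shifted in i */

    /* Placeholder: on the NCUBE this would issue the nwrite()/nread() calls
     * that exchange r1 elements of g and e and r0 elements of c with the
     * neighboring groups (i'+1,k'), (i',k'+1), and (i'+1,k'+1).             */
    static void exchange_boundaries(void) { /* message passing omitted */ }

    int main(void)
    {
        int i, j, k, l;

        /* b is stationary; in the real program its r0*r1 elements are
         * downloaded from the host before execution starts.             */
        for (i = 0; i < R0; i++)
            for (k = 0; k < R1; k++)
                b[i][k] = 1.0f;
        for (j = 0; j < M; j++)
            for (k = 0; k < R1; k++)
                e[j][k] = 1.0f;
        for (i = 0; i < R0; i++)
            for (k = 0; k < R1; k++)
                g[i][k] = 1.0f;

        for (l = 0; l < M; l += R2) {        /* outer blocking loop on j     */
            exchange_boundaries();           /* communicate every R2 steps   */
            for (j = l; j < l + R2; j++)     /* the three inner-most loops   */
                for (i = 0; i < R0; i++)     /* run without communication    */
                    for (k = 0; k < R1; k++)
                        c[i][j] += b[i][k] * e[j][k] / g[i][k];
        }

        printf("c[0][%d] = %f\n", M - 1, c[0][M - 1]);
        return 0;
    }

On the NCUBE system described in Chapter 3, the stub would be replaced by actual message-passing calls, and the block sizes R0, R1, and R2 would be chosen with the analytic model of Chapter 5.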
CHAPTER 7

CONCLUSION AND FUTURE WORK

We have presented in this thesis the design, implementation, and modeling of pipelined data parallel algorithms on DMMs. In this chapter, we summarize what we have discussed and highlight the major contributions of the work reported here. Next, improvements to the current work will be suggested, impacts of new techniques and architectures will be assessed, and plans for future work will be outlined.

7.1. SUMMARY

In Chapter 1, features of current DMMs were reviewed, from which important system bottlenecks were identified. Special considerations are required when one attempts to design efficient parallel algorithms on DMMs. Different styles of parallel computations were discussed, and pipelined data parallel algorithms were singled out as the primary subject of this thesis. Then, in Chapter 2, starting with a representation of parallel computation, the P-nets, we discussed various considerations in designing parallel algorithms on DMMs. The implementation and modeling of two data parallel algorithms were given in Chapter 3 to illustrate the tradeoffs in designing parallel algorithms in terms of computation and communication as well as time and space.

With these tradeoffs in mind, a new parallel algorithm design paradigm called large-grain pipelining was introduced in Chapter 4. Large-grain pipelining exploits node level macro-pipelining to increase the degree of overlapping between computation and communication. Examples of pipelined matrix multiplication algorithms were given to illustrate the application of large-grain pipelining.

As discussed in Section 4.3, the unique feature of large-grain pipelining is the focus on balancing computation with communication by forming data blocks with appropriate sizes. These data blocks in turn form data streams which flow in the system to produce pipeline effects. Operations in the system are scheduled according to the flows of data, i.e., data-driven scheduling. Furthermore, the macro-pipelining in pipelined data parallel algorithms increases the degree of overlapping between nodes while reducing the I/O bandwidth in accessing data. The net effect is a highly parallel computation.

In Chapter 5, a model for analyzing pipelined data parallel computation on DMMs was presented. The model views a pipelined computation as consisting of elementary computational units, which are interconnected through data flows. The throughput of individual computational units as well as linear arrays and 2-dimensional meshes is analyzed by studying the time events represented as Timed Petri nets. The analytic results for the pipelined matrix multiplication algorithms were compared with experimental results obtained from a 64-node NCUBE multiprocessor. The close match between these two sets of data indicates the accuracy of the analytic model. The accuracy stems from two sources: (1) the capability of the analytic model in describing the behavior of a computational unit, and in modeling asynchronous and deterministic events; and (2) the accuracy in estimating system parameters, such as the message transmission time and primitive computation time. Note that the analytic model can be automated and applied to a more general class of computations.

In Chapter 6, a systematic approach for designing pipelined data parallel algorithms on DMMs was introduced. The design procedure takes a shift-invariant nested-loop program as its input and produces a pipelined data parallel algorithm through a series of transformations. From the relations between loop dependence vectors, we studied properties of grouping vertices in the computational structure. The major result obtained is that it is possible to group vertices in a 2-dimensional computational structure in such a way that the resultant contracted structure is a universal planar array (see Section 6.6). The extension to computational structures of higher dimension is possible.
The major contribution of the grouping concept and related theorems is, of course, to lead the way to a systematic approach for designing pipelined data parallel algorithms on DMMs. The grouping concept is unique in that it takes a totally different view of algorithm transformation from that in synthesizing systolic arrays, i.e., grouping versus projection. However, the results obtained in this thesis are also applicable to the latter, since projection is a special case of grouping. For example, the knowledge of grouping may assist in predicting the performance of systolic arrays derived from different projections.

Overall, it can be seen that we have presented a systematic and thorough study of pipelined data parallel algorithms, from the basic concept and performance modeling to design and synthesis. Through the techniques introduced in this thesis, the resultant algorithms can achieve a balance between computation and communication. Furthermore, the systematic approaches used in these techniques hold the key to automated algorithm and program development. Various possibilities that stem from the research developed in this thesis will be probed further in the next section.

7.2. FUTURE WORK

From the discussion throughout this thesis, we can identify at least the following areas that need to be studied further:

(1) Study other applications of large-grain pipelining

We have presented pipelined matrix multiplication algorithms in this thesis. Due to their generality, it is expected that many problems can be solved by directly applying the techniques developed here, especially those problems which have systolic array solutions. It is interesting to see how well the resultant algorithms perform compared to the existing algorithms. Another interesting topic would be the application of large-grain pipelining to problems which do not have a static and regular data structure, such as those found in artificial intelligence.

(2) Study pipelined data parallel algorithms under different environments

We have studied pipelined data parallel algorithms mainly under the environment of message-passing DMMs. It is interesting to see how and in what form pipelined data parallel algorithms can be run on a shared-memory multiprocessor. More importantly, since the shared-variable approach is a more natural way of programming, we should also study how large-grain pipelining can be incorporated into such an environment, perhaps in terms of reducing the amount of remote memory accesses.

(3) Assess the impact of future generation DMMs

Important trends in future generation DMMs have been discussed in Chapter 1, including larger memory, higher computing capability, improved computation to communication ratio, and concurrent I/O subsystems. It is expected that large-grain pipelining can benefit from these novel designs. For example, increasing the communication speed will decrease the block size and increase the number of blocks in a data stream. This is equivalent to increasing the vector length in a pipelined vector processor, which results in higher pipeline efficiency [HwBr84].

With reduced communication overhead, large-grain pipelining can fully exploit the overlapping between computation and communication, which is not possible in current implementations. In addition, more sophisticated communication primitives can be supported, which alleviate, say, the overhead in data gathering/scattering operations (see Section 3.2).
Examples include the modification of send and receive primitives to allow the user to specify the access patterns within a block of data, or the implementation of the broadcast primitive.

Pipelined data parallel algorithms can also take advantage of concurrent I/O subsystems, through which several streams of data can be directed into the system simultaneously. This can further reduce the bottleneck induced by the host. However, depending on how nodes access the disks, new issues may arise concerning how data is distributed on the disks and how access conflicts are resolved.

(4) Refine the grouping techniques

Most theorems presented in Chapter 6 are aimed at characterizing the properties of grouping, e.g., grouping along which direction will result in a dependence preserving grouping. We have not yet addressed the issue of choosing which data elements to pack together into one message, that of defining and determining the optimal grouping scheme, and the issue of implementing the ideas discussed. Mathematical formulations are also needed to determine the computation and communication requirements of a particular grouping. This information can be used to determine the granularity of the algorithm.

One important issue is to study the grouping of computational structures in higher dimensional spaces. Intuitively, the results presented in this thesis can be applied to higher dimensional spaces. However, a unified and rigorous mathematical treatment of all cases is desirable, which should allow the theorems to be extended more easily.

It is also interesting to study other program constructs which differ from shift-invariant nested-loop programs. Examples include the IF-THEN-ELSE construct and statements not enclosed within a loop. We should also consider the problem of how to incorporate pipelined data parallel computations into other styles of computation. This includes the interface with other nested loops in the program.

Finally, it is important to have a complete understanding of the relations between grouping and projection. The idea is to use grouping techniques to assist in synthesizing systolic arrays, and the goal is to predict the performance of the final systolic array from the knowledge of grouping.

(5) Implement the techniques

Most techniques presented in this thesis can be implemented on a computer. These tools should facilitate the development of parallel algorithms on DMMs.

(6) Develop an intelligent compiler for DMMs

The ultimate goal of the work presented in this thesis is to develop intelligent compilers for DMMs. The intelligent compiler takes a sequential program as its input and generates a parallel version suitable for DMMs. As a first step, a pre-processor to the compiler may be developed. The user provides information necessary for the parallelization as directives. The pre-processor attempts to parallelize the given program as much as possible and generates a transformed source program for compilation. Using the grouping techniques, many nested loops in the program can be parallelized and transformed into pipelined data parallel forms.

Intelligent compilers are complex and difficult to implement. However, the techniques presented in this thesis point to a new direction in achieving this goal. Additional research is needed to design more efficient parallel algorithms and to further advance our knowledge and capability in parallel processing.

BIBLIOGRAPHY
[AdCD74] T.L. Adam, K.M. Chandy, J.R. Dickson, "A Comparison of List Schedules for Parallel Processing Systems," Comm. ACM, Dec. 1974, pp. 685-690.

[BBN87] BBN Advanced Computers, Inc., Butterfly Products Overview, Oct. 1987.

[Berm87] F. Berman, "Experience with an Automatic Solution to the Mapping Problem," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[ChSm87] V. Cherkassky, R. Smith, "Efficient Mapping and Implementation of Matrix Algorithms on a Hypercube," Technical Report, Department of Electrical Engineering, Univ. of Minn., 1987.

[DaSe87] W.J. Dally, C.L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Trans. on Computers, May 1987, pp. 547-553.

[Duni87] T.H. Dunigan, "Hypercube Performance," Hypercube Multiprocessors 1987, M.T. Heath, Ed., SIAM, 1987, pp. 178-191.

[FaSh87] N. Faroughi, M.A. Shanblatt, "Systematic Generation and Enumeration of Systolic Arrays from Algorithms," Proc. of the 1987 Int'l Conf. on Parallel Processing, Aug. 1987, pp. 844-847.

[Fink87] R.A. Finkel, "Large-Grain Parallelism - Three Case Studies," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[FiFi82] J.P. Fishburn, R.A. Finkel, "Quotient Networks," IEEE Trans. on Computers, Apr. 1982, pp. 288-295.

[FoFW85] J.A.B. Fortes, K.S. Fu, B.W. Wah, "Systematic Approaches to the Design of Algorithmically Specified Systolic Arrays," Proc. of 1985 Int'l Conf. on Acoustics, Speech, and Signal Processing, 1985, pp. 300-303.

[FoMo84] J.A.B. Fortes, D.I. Moldovan, "Data Broadcasting in Linearly Scheduled Array Processors," Proc. of Symp. on Computer Architecture, 1984, pp. 224-231.

[FoWa87] J. Fortes, B.W. Wah, "Systolic Arrays - From Concept to Implementation," IEEE Computer, July 1987, pp. 12-17.

[FoOH87] G.C. Fox, S.W. Otto, A.J. Hey, "Matrix Algorithms on a Hypercube I: Matrix Multiplication," Parallel Computing, Jan. 1987, pp. 17-31.

[GaJo79] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman & Company, 1979.

[Gonz77] M.J. Gonzalez, Jr., "Deterministic Processor Scheduling," Computing Surveys, Sep. 1977, pp. 173-204.

[GrRe86] D.C. Grunwald, D.A. Reed, "Benchmarking Hypercube Hardware and Software," Technical Report UIUCDCS-R-86-1303, Department of Computer Science, University of Illinois at Urbana-Champaign, 1986.

[HaMS86] J.P. Hayes, T.N. Mudge, Q.F. Stout, S. Colley, J. Palmer, "Architecture of a Hypercube Supercomputer," Proc. Int'l Conf. on Parallel Processing, Aug. 1986, pp. 653-660.

[HiSt86] W.D. Hillis, G.L. Steele, Jr., "Data Parallel Algorithms," Comm. ACM, Dec. 1986, pp. 1170-1183.

[HoTa85] J.J. Hopfield, D.W. Tank, "Neural Computation of Decisions in Optimization Problems," Biological Cybernetics, 1985, pp. 141-152.

[Hu61] T.C. Hu, "Parallel Sequencing and Assembly Line Problems," Management Science, Nov. 1961, pp. 841-848.

[HwBr84] K. Hwang, F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill Book Co., 1984.

[Hwan87] K. Hwang, "Advanced Parallel Processing with Supercomputer Architectures," Proc. of the IEEE, Oct. 1987, pp. 1348-1379.

[Jord87] H.F. Jordan, "The Force," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[KeLi70] B.W. Kernighan, S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs," Bell System Technical Journal, Feb. 1970, pp. 291-307.
[KiCN88a] C.T. King, W.H. Chou, L.M. Ni, "Large-Grain Pipelining on Distributed-Memory Multiprocessors," Proc. of the Third Int'l Conf. on Supercomputing, May 1988.

[KiCN88b] C.T. King, W.H. Chou, L.M. Ni, "Pipelined Data Parallel Algorithms - Concept and Modeling," Proc. of the 1988 ACM Int'l Conf. on Supercomputing, July 1988.

[KiGN88] C.T. King, T.B. Gendreau, L.M. Ni, "Reliable Elections in Broadcast Networks," to appear in Journal of Parallel and Distributed Computing.

[KiNi88] C.T. King, L.M. Ni, "Large-Grain Pipelining on Hypercube Computers," Proc. of the Third Conf. on Hypercube Concurrent Computers and Applications, 1988.

[KiGV83] S. Kirkpatrick, C.D. Gelatt, Jr., M.P. Vecchi, "Optimization by Simulated Annealing," Science, May 1983, pp. 671-680.

[KuLe78] H.T. Kung, C.E. Leiserson, "Systolic Arrays (for VLSI)," Sparse Matrix Proc., 1978, pp. 32-63.

[KuLJ87] S.Y. Kung, S.C. Lo, S.N. Jean, J.N. Hwang, "Wavefront Array Processors - Concept to Implementation," IEEE Computer, July 1987, pp. 18-33.

[MaLT82] P.R. Ma, E.Y. Lee, M. Tsuchiya, "A Task Allocation Model for Distributed Computing Systems," IEEE Trans. Computers, Jan. 1982, pp. 41-47.

[MiWi84] W.L. Miranker, A. Winkler, "Spacetime Representations of Computational Structures," Computing, 1984, pp. 93-114.

[MoFo86] D.I. Moldovan, J.A.B. Fortes, "Partitioning and Mapping Algorithms into Fixed-Size Systolic Arrays," IEEE Trans. on Computers, Jan. 1986, pp. 1-12.

[MuBA87] T.N. Mudge, G.D. Buzzard, T.S. Abdel-Rahman, "A High Performance Operating System for the NCUBE," Hypercube Multiprocessors 1987, M.T. Heath, Ed., SIAM, 1987.

[NeSn87] F.A. Nelson, L. Snyder, "Programming Paradigms for Nonshared Memory Parallel Computers," in The Characteristics of Parallel Algorithms, L.H. Jamieson, D.B. Gannon, R.J. Douglass, Eds., MIT Press, 1987.

[NiKP87] L.M. Ni, C.T. King, P. Prins, "Parallel Algorithm Design Considerations for Hypercube Multiprocessors," Proc. of 1987 Int'l Conf. on Parallel Processing, 1987, pp. 717-720.

[Oste87] A. Osterhaug, Guide to Parallel Programming on Sequent Computer Systems, Sequent Computer Systems, Inc., 1987.

[Pete77] J.L. Peterson, "Petri Nets," Computing Surveys, Sep. 1977, pp. 223-252.

[PaWo86] D.A. Padua, M.J. Wolfe, "Advanced Compiler Optimizations for Supercomputers," Comm. of ACM, Dec. 1986, pp. 1184-1201.

[PoBa87] C.D. Polychronopoulos, U. Banerjee, "Processor Allocation for Horizontal and Vertical Parallelism and Related Speedup Bounds," IEEE Trans. Computers, Apr. 1987, pp. 410-420.

[Quin87] M.J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-Hill Book Co., 1987.

[Ramc74] C. Ramchandani, "Analysis of Asynchronous Concurrent Systems by Timed Petri Nets," Ph.D. Dissertation, Project MAC, MAC-TR-120, MIT, 1974.

[RaSH79] G.S. Rao, H.S. Stone, T.C. Hu, "Assignment of Tasks in a Distributed Processor System with Limited Memory," IEEE Trans. Computers, Apr. 1979, pp. 291-298.

[SaSc85] Y. Saad, M.H. Schultz, "Topological Properties of Hypercubes," Technical Report YALEU/DCS/RR-389, Department of Computer Science, Yale University, June 1985.

[ShFi87] Y. Shih, J. Fier, "Hypercube Systems and Key Applications," in Parallel Processing for Supercomputing and AI, K. Hwang, D. DeGroot, Eds., 1987.

[Squi86] M.L. Squires, "Applying Parallel Processing," COMPCON Spring 86, 1986, pp. 394-396.

[WiCM88] A. Witkowski, K. Chandrakumar, G. Macchio, "Concurrent I/O System for the Hypercube Multiprocessor," The Third Conf. on Hypercube Concurrent Computers and Applications, Jan. 1988.
APPENDIX

/*
 * Array summation using tree-structured accumulation method with ratioed partitioning
 */
#include <stdio.h>

#define MAXNODE 64              /* Maximum number of processors used */
#define MAXARRAY 100000         /* Maximum size of the array */

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=1;
    n--;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)                 /* The program running in the host */
int argc;
char *argv[];
{
    int i,j,k,*ipt;
    int m;                              /* Size of the array */
    int n,d;                            /* Number of nodes used and the dimension */
    float a[MAXARRAY];                  /* The array */
    float sum,time,*fpt;
    int x[MAXNODE];                     /* The size of the subarray in each processor */
    int ch,type,nd,status,nflags;       /* Communication parameters */
    FILE *fopen(),*fp1,*fp2;

    /* Initialization */
    m=atoi(argv[1]); n=atoi(argv[2]); d=log2(n);

    /* Input the partition sizes from the file 'part' */
    fp1=fopen("part","r");
    for(i=0,j=0,ipt=x; i

    ...

    while(ll>diff) {
        nd=(-1); type=30;
        nread(&tmp,sizeof(float),&nd,&type,&flag);
        sum += tmp;
        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(&sum,sizeof(float),nd,30,&flag);
    }
    else                                /* I am the root, send back to the host */
        nwrite(&sum,sizeof(float),host,40,&flag);

    /* If processor 0, then stop counting the time */
    if (id == 0) {
        type=95;
        nread(ai,sizeof(float),&host,&type,&flag);
        ai[0]=ntime() - time;
        nwrite(ai,sizeof(float),host,50,&flag);
    }
}

/*
 * Matrix multiplication using Algorithm 3.1
 */
#include <stdio.h>

#define MAXMTX 64               /* Maximum size of the matrices */

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse of the Gray code n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)
int argc;
char *argv[];
{
    int i,j,k,l,u,v,x,y;
    int m,n;                            /* Size of the matrices and number of processors */
    int n1,n2,n3;                       /* Number of partitions */
    int d1,d2;                          /* Dimension of the subcubes */
    int s12,s13,s1,s2,s3;               /* Size of submatrices */
    int ch,type,nd,status,nflags;       /* Communication parameters */
    float a[MAXMTX][MAXMTX],b[MAXMTX][MAXMTX];      /* Matrices A and B */
    float buf[4097];                    /* Buffer for loading */
    FILE *fopen(),*fp1;

    /* Initialization */
    m=atoi(argv[1]); n1=atoi(argv[2]); n2=atoi(argv[3]); n3=atoi(argv[4]);
    n=n1*n2; d1=log2(n1); d2=log2(n2);
    s1=m/n1;                            /* Number of rows in one partition of A */
    s2=m/n2;                            /* Number of columns in one partition of A */
    s3=m/n3;
    s12=s1*s2;                          /* Size of the submatrix in A: (n1,n2) partition */
    s13=s1*s1;                          /* Size of the submatrix in C: (n1,n1) partition */

    /* Read the input matrices */
    fp1=fopen("data","r");
    for(i=0; i

    ...

    x=((buf[s13]>>d2)+1)*s1;
    y=(buf[s13]+1)*s3;
    for(i=x-s1,l=0; j

    ...

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse Gray code of n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main()
{
    int i,j,k,l,ll,idd,diff,root,max;
    int m,n1,n2,n3,d,d1,d2,s12,s13,s1,s2;
    int id,id1,id2,su,nd,pid,host,type,flag;
    float ai[2048],bi[2048],ci[2049],buf[2048];     /* Buffers */

    /* Receive the initialization parameters from the host */
    whoami(&id,&pid,&host,&d);                      /* Get environment info */
    type=10;
    nread(ai,4*sizeof(float),&host,&type,&flag);
    m=ai[0]; n1=ai[1]; n2=ai[2]; n3=ai[3];
    d1=log2(n1); d2=log2(n2);
    s1=m/n1; s2=m/n2; s12=s1*s2; s13=s1*s1;

    /* Find predecessor and successor nodes in the ring */
    id1=id>>d2; id2=id & (n2-1);
    k = id1 ? gray(ginv(id1)-1) : gray(n1-1);
    su=(k<

    ...

    >> 1;
    while(ll>diff) {
        nd=(-1); type=30+l;
        nread(buf,s13*sizeof(float),&nd,&type,&flag);
        for(i=0; i

    ...

        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(ci,s13*sizeof(float),nd,30+l,&flag);
    }
    else {                              /* I am the root, send back to the host */
        ci[s13] = pid = (pid+1) % n1;
        nwrite(ci,(s13+1)*sizeof(float),host,40,&flag);
    }

    /* Processor 0 sends the execution time back to the host */
    if (id == 0) {
        type=95;
        nread(ai,sizeof(float),&host,&type,&flag);
        buf[0]=ntime() - max;
        nwrite(buf,sizeof(float),host,50,&flag);
    }
}

/*
 * Matrix multiplication using Algorithm 4.2
 */
#include <stdio.h>

#define MAXMTX 64               /* Maximum size of the matrices */

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main(argc,argv)
int argc;
char *argv[];
{
    int i,j,k,l,u,v,x,y;
    int ch,type,nd,nflags;              /* Communication parameters */
    int m,n;                            /* Size of the matrices and number of processors */
    int n1,n2,n3;                       /* Number of partitions */
    int d1,d2;                          /* Dimension of the subcubes for n1, n2 */
    int s12,s13,s23,s1,s2,s3;           /* Size of submatrices */
    float buf[4096];                    /* Message buffer */
    float a[MAXMTX][MAXMTX],b[MAXMTX][MAXMTX];      /* Matrices A and B */
    FILE *fopen(),*fp1;

    /* Initialization */
    m=atoi(argv[1]); n1=atoi(argv[2]); n2=atoi(argv[3]); n3=atoi(argv[4]);
    n=n1*n2; d1=log2(n1); d2=log2(n2);
    s1=m/n1;                            /* Number of rows in one partition of A */
    s2=m/n2;                            /* Number of columns in one partition of A */
    s3=m/n3;
    s12=s1*s2;                          /* Size of the submatrix in A */
    s13=s1*s3;
    s23=s2*s3;                          /* Size of the submatrix in B */

    /* Read the input matrices */
    fp1=fopen("data","r");
    for(i=0; i

    ...

    x=((buf[s13]>>d2)+1)*s1;
    y=(buf[s13]+1)*s3;
    for(i=y-s1,l=0; j

    ...

gray(n)                         /* Gray code of n */
int n;
{
    return((n>>1)^n);
}

ginv(n)                         /* Inverse Gray code of n */
int n;
{
    int x=n;
    while(n = n>>1) x = x^n;
    return(x);
}

log2(n)                         /* Calculate the integer logarithm of an integer */
int n;
{
    int x=0;
    while(n = n>>1) x++;
    return(x);
}

main()
{
    int i,j,k,l,ll;
    int m,n1,n2,n3,s12,s13,s23,s1,s2,s3,d1,d2;
    int nd,pid,host,type,flag,id,id1,id2,idg,diff,su,pr,maxt;
    float ai[4096],bi[4096],ci[4097];               /* Buffers */

    /* Receive the initialization parameters from the host */
    whoami(&id,&pid,&host,&diff);                   /* Get environment info */
    type=10;
    nread(ai,4*sizeof(float),&host,&type,&flag);
    m=ai[0]; n1=ai[1]; n2=ai[2]; n3=ai[3];
    d1=log2(n1); d2=log2(n2);
    s1=m/n1; s2=m/n2; s3=m/n3;
    s12=s1*s2; s13=s1*s3; s23=s2*s3;

    id1=id>>d2; id2=id & (n2-1);
    su = (id2 == n2-1) ? host : id+1;
    idg=ginv(id1);
    k=gray(idg+1);
    pr=(k<

    ...

    >> 1;
    while(ll>diff) {
        nd=(-1); type=30+l;
        nread(bi,s13*sizeof(float),&nd,&type,&flag);
        for(i=0; i

    ...

        ll = ll>>1;
    }
    if(ll != 0) {
        nd=id^ll;
        nwrite(ci,s13*sizeof(float),nd,30+l,&flag);
    }
    else {                              /* I am the root, send back to the host */
        ci[s13] = 1;
        nwrite(ci,(s13+1)*sizeof(float),host,40,&flag);
    }

    /* Processor 0 stops counting the time */
    if (id == 0) {
        type=95;
        nread(ai,4,&host,&type,&flag);
        ai[0]=ntime()-maxt;
        nwrite(ai,sizeof(float),host,50,&flag);
    }
}