PARALLEL COMPUTATION MODELS: REPRESENTATION, ANALYSIS AND APPLICATIONS

By

Xian-He Sun

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science

1990

ABSTRACT

PARALLEL COMPUTATION MODELS: REPRESENTATION, ANALYSIS AND APPLICATIONS

By Xian-He Sun

Although there are various multiprocessors available, both software support and algorithm development for these machines are far behind their hardware counterpart. This is especially true for multicomputers, in which each processor has its own local memory. A new concept, compute-exchange computation, is proposed for efficient parallel algorithm design in multicomputers. Based on this new concept, the most frequently used scientific and engineering applications can be described by a simple structured representation. Structured design and rethinking, therefore, become possible. The basic building blocks of these structures, called parallel computation models, are identified and studied. A performance metric of parallel processing, parallel speedup, is carefully examined. A new model of speedup is presented which considers the memory capacity as an influential factor and provides the quantum of the tradeoff between time and space.

These research results are useful in many areas of computer science. The dissertation focuses on two areas: efficient algorithm design and performance prediction. Examples chosen from real applications will be used in both areas to illustrate the concept and usefulness of the computation models.
The applications used for efficient algorithm design are solving large scale tridiagonal systems and solving electrical power flow problems. Several novel algorithms are developed for these two applications based on combinations of computation models. Implementation results confirm that these algorithms are competitive with any existing algorithm in the field.

Copyright by Xian-He Sun, 1990

Dedicated to my parents Yu Lin and Chang-Xiang Sun and to my family.

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to all of the faculty members who have helped and contributed to make this work a reality. This includes Professor Lionel M. Ni, my thesis advisor, for his support, guidance and inspiration, and Professors Richard Enbody, Moon Jung Chung, and Richard Hill for their insightful comments and for serving on my Ph.D. committee, and Professors Nabil Kamel, Abdol Esfahanian and Fathi Salam for their encouragement. I would like to acknowledge all of my fellow students who have given me help and friendship during my stay at MSU. In particular, I wish to thank Dr. Chung-Ta King, Jayashree Ramanathan, Dongyul Ra, Paul Wolberg, Xian-La Lin, and Honda Shing for their help and valuable discussions, and David Robinson, Dr. Shixiong Guo and Ten Hwan Tzen for their cooperation and wonderful work. I am very grateful to my parents and parents-in-law for their years of support, concern and understanding. I am indebted to my wife Hong Zhang-Sun for her constant encouragement, assistance and care during the course of my graduate studies. This research has been supported in part by the National Science Foundation under grant ECS-88-14027.

Table of Contents

List of Tables
List of Figures
1 INTRODUCTION
1.1 Multicomputers
1.2 Parallel Algorithm Design Considerations
1.2.1 Sources of Degradation
1.2.2 Algorithm Characteristics
1.3 Motivation and Problem Statement
1.4 Thesis Organization
2 PERFORMANCE METRIC OF PARALLEL ALGORITHMS
2.1 Preliminary
2.2 Models of Speedup
2.3 Simplified Models of Speedup
2.4 Comparison Study
3 PARALLEL COMPUTATION MODELS FOR SCIENTIFIC COMPUTING
3.1 Structured Representation
3.2 Parallel Computation Models
3.2.1 Local Computation Model
3.2.2 Global-Exchange Computation Model
3.2.3 Compute-Aggregate-Broadcast Computation Model
3.2.4 Divide-and-Conquer Computation Model
3.2.5 Domain Decomposition Model
3.2.6 Pipeline (Ring) Computation Model
3.2.7 Recursive Doubling Computation Model
3.3 Application Considerations
4 APPLICATION-DRIVEN ALGORITHM DESIGN
4.1 Preliminary
4.2 Power Flow Application
4.3 Homotopy Method
4.4 Implementation Issues and Results
5 ARCHITECTURE-DRIVEN ALGORITHM DESIGN
5.1 Linear Tridiagonal Solvers
5.1.1 The LU Decomposition Method
5.1.2 The Parallel Prefix Method
5.1.3 A Novel Matrix Partitioning Technique
5.2 A Compute-Aggregate-Broadcast Computation Solver
5.3 A Global-Exchange Computation Model
5.4 A Solver With More Than One Computation Model
5.5 Comparison and Experimental Results
6 PREDICTION OF PERFORMANCE
6.1 Performance Formulations
6.2 Structured Prediction
6.3 The Influence of Problem Size on Speedup
7 CONCLUSION AND FUTURE RESEARCH
7.1 Summary and Major Contributions
7.2 Future Research Directions
Bibliography

List of Tables

5.1 Computation and Communication Counts of Tridiagonal Solvers

List of Figures

1.1 A Generic Multicomputer Architecture
2.1 Parallelism Profile of an Application
2.2 Shape of the Application
2.3 Amdahl's Law
2.4 Gustafson's Scaled Speedup
2.5 Simplified Memory-Bounded Scaled Speedup
2.6 Matrix Multiplication with Local Computation
2.7 Matrix Multiplication with Global Computation
2.8 Amdahl's law, Gustafson's speedup and SMB speedup
3.1 Compute-Exchange Computation
3.2 Multicast Data Exchange
3.3 Conjunctive Data Exchange
3.4 Partitioning of Graphs
3.5 FFT (Butterfly) Computation
3.6 Odd-Even Cyclic Reduction
3.7 Local Computation Model
3.8 Global Exchange Computation Model
3.9 CAB Computation Model
3.10 Divide-and-Conquer Compute-Exchange Computation
3.11 Domain Decomposition Computation Model
3.12 Pipelined Computation Model
3.13 Recursive Doubling Computation Model
4.1 The Main Components of Power System
4.2 The Homotopy Method
4.3 Global-Exchange Model for Merge Checking
4.4 Compute-Aggregate-Broadcast Model for Merge Checking
4.5 Speedup Versus Different Number of Processors
5.1 The Communication Pattern of the PPT Algorithm
5.2 Tree Reduction
5.3 The Communication Pattern of the GE Computation
5.4 The Communication Pattern of the PPH Algorithm
5.5 The Speedup Over The LU Decomposition Method
5.6 Efficiency Over The LU Decomposition Method
6.1 Parallel Prefix Method
6.2 Measured and Estimated Speedup of Parallel Prefix Method

Chapter 1

INTRODUCTION

Powerful computer systems able to handle computationally intensive problems are increasingly in demand in many scientific and engineering areas. To cope with these computational requirements, a natural trend is to construct systems consisting of multiple processors.
This approach has been shown in recent years to be the most straightforward and cost-effective approach for achieving high performance. One system consisting of many processors commu- nicating through an interconnection mechanism is called a multiprocessor system [28]. These processors are usually homogeneous, and the communication latency between processors is relatively small but non-negligible. The communication latency, along with other degrada- tions, makes parallel computation much more complex than sequential computation. Much effort has been exerted and is currently being exerted to obtain highly efficient parallel com- putation. Although there are various multiprocessors available, software support and efficient. algorithm development for these machines are far behind their hardware counterpart. In scientific computing, there are several issues related to parallelization that do not arise in a serial context. The first issue is task partitioning, i.e., the breakdown of the total work— load into smaller tasks, some or all of which, can be processed concurrently. This includes the proper sequencing of the tasks when some tasks are interdependent and cannot be executed simultaneously. The second issue is the synchronization of concurrent processes. In some methods, processes must wait at predetermined points for the completion of certain computa- tions or for the arrival of certain data. The mechanisms used to enforce such synchronization have an important effect on algorithmic performance. A third issue is the communication of interim results between the processes. The objective here is to carry out the communica- tion efficiently, which is problem dependent as well as machine dependent. The fourth issue is to obtain a good match between algorithmic requirements and architectural capabilities, i.e., the pairing of applications and architectures. From a parallel algorithm design point of view, the issue is how to explore all the execution potential of the underlying architecture, which includes the determination of the number of processors needed as well as task and data allocation across the processors. From the architecture point of view, architectures can be tailored for the execution of a fixed set of algorithms if the architectural requirement of . the algorithms are understood. All of the above issues are problem dependent. It is dificult, if not impossible, to derive any acceptable solution for parallel computation as a whole. For this reason the issues have been traditionally studied problem by problem and solved by brute force methods. How- ever, the issues are important in a general context. A more general study is needed that does not reference a special problem. In this dissertation, a new concept, compute-exchange computation, is proposed. Based on this new concept, most of the frequently used scien- tific and engineering applications can be represented by a simple structured representation. Structured design and rethinking, therefore, become possible. The basic building blocks of these structures, called parallel computation models, are identified and studied. Issues of performance metrics for parallel computation are also examined. The results of this research are useful in many areas of computer science. The study presented will focus on two areas, efficient algorithm design and performance prediction. This chapter is organized as follows. Section 1.1 will address the parallel systems con- sidered in this study. 
Section 1.2 will discuss the parallel programming considerations. The motivation of the research will be given in Section 1.3. Finally, an overview of the dissertation will be found in Section 1.4. 1.1 Multicomputers The parallel systems considered in this study are multicomputers. Multicomputers are mes- sage passing distributed-memory multiprocessors [28]. They are organized as an ensemble of small programmable computers, called nodes, and communicate through a point-to-point interprocessor communication network. Each memory module is physically associated with each processor. When the number of processors increases, the memory capacity also in- creases. Multicomputers with hundreds or thousands of processors are available commer- cially. Examples of first generation multicomputers include J PL’s MARK III, Intel’s iPSC (up to 128 processors), Ncube’s NCUBE (up to 1024 processors), Ametek’s S/ 14 (up to 256 processors) and FPS’s T series (up to 2“ processors) [55] [23] [21]. Note that the Connection Machine [25], though hypercube interconnected, is a bit-serial SIMD machine. New multi- computers are also emerging. Both Intel’s iPSC-2 [46] and Ametek’s 2010 [3] are classified as second generation multicomputers. The iWARP [6], being developed by Intel and CMU, is also an example of a high performance multicomputer. All first generation multicomputers adopt the store-and-forward communication mechanism and the hypercube topology. Second generation multicomputers have more advanced communication mechanisms. Although the processors are physically interconnected as a hypercube or a 2D-mesh, they are logically fully connected [3]. The structured representation proposed in this study aims at these new com— munication mechanisms. Multicomputers hold a promising potential for providing massive parallelism. In addition to architectural enhancement through an increase in computational capability and a decrease in communication latency, the full potential of a multicomputer cannot be exploited without a careful design of algorithms. The results of this research will benefit this careful design and should be useful in the design of future multicomputers to help accommodate those frequently used parallel computation models. A generic architecture of multicomputers is depicted in Figure 1.1. Processor Processor Processor Memory Memory Memory T I I 1 I J I I ' I ' hirerprocessor Communication Network ] Figure 1.1: A Generic Multicomputer Architecture 1.2 Parallel Algorithm Design Considerations During the past two decades, the computing community has witnessed a rapid growth of interest in parallel processing. The motivation behind this increasing interest is not that parallel computing is easier to program, but that there are physical and technological limits in uniprocessors. Modern VLSI technologies enable multiprocessors to be built and imple- mented at a very cost-effective rate for solving computationally intensive problems. However, the expected superior performance can be achieved only when all the advantages of the un- derlying architecture are exploited. These include partitioning an application in such a way that load balancing can be maximized and allocating the tasks in such a way that the communication overhead can be minimized. Neither this maximum nor minimum can be achieved easily. In fact, it has been shown that implementing parallel algorithms is much more difficult than most computer scientists had expected [30]. 
Parallelism degradations exist that could degrade the performance of parallel algorithms. Depending on their char- acteristics, different algorithms may be influenced by different degradations. In this section we shall address possible degradations and characteristics of parallel algorithms. 1.2.1 Sources of Degradation Parallel algorithms rarely achieve linear speedup for several reasons [18]. In designing efficient parallel algorithms, we have to know the sources of degradation and how to reduce their influence or remove them completely whenever possible. In this section sources of degradation are discussed. In the next section, parallel algorithms are characterized according to their inherent degradations. a Communication delay. From a computation point of view, in addition to the transmis- sion and propagation delay, there are two other communication delays [5]: - Communication processing time. This is the time required to prepare information for transmission. - Queuing time. Once information is assembled into packets for transmission on some communication link, it must wait in a queue prior to the start of its trans- mission. Queuing time is generally difficult to quantify precisely, but simplified models are often adequate to obtain valuable qualitative and quantitative conclu- sions [34]. 4 Communication overhead can be reduced by overlapping communication with compu- tation. The queuing, transmission, and propagation delays can possibly be overlapped with computations, but communication processing cannot. Communication cost is a determining factor for the complexity of parallel algorithms. Various techniques have been developed to reduce communication overhead. These include designing efficient networks, overlapping communication with computation, and developing algorithms carefully. In second generation multicomputers, the number of hops from a source node to a destination node is not a major factor in determining the message delay times. However, there are three classical ways to reduce communication delay by restructuring al- gorithms which still hold true [31]: 1) reducing the quantity of information that must be communicated, 2) reducing the frequency with which messages must be sent, 3) overlap- ping communication with computation. In practice the above goals may not be mutually compatible. 0 Uneven Allocation. It is desirable to distribute the computation load evenly so that all processors have an identical amount of work. However, load balancing is very difficult to achieve. High level module load balancing may lead to unbalanced computation due to the underlying data structure. Those processors that finish a step of computa- tion earlier may have to wait until others are complete before continuing. Thus, the processing power of those processors is wasted. Communication delay and uneven allocation are the main sources of degradation in par- allel algorithms. Reducing these two degradations is difficult and may sometimes raise other degradations. Such tradeoffs between degradations can be used as a design technique to improve the overall performance. 0 Redundant Work. Transferring intermediate results is more costly than computation for certain problems. Instead of transferring the results, we may wish to recompute the results. This kind of redundant work is adopted in solving eigenvalue problems [40] where eigenvalues are transferred and corresponding eigenvectors are recomputed. 
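To make the combined effect of these two degradations concrete, the following small sketch (an illustration added here, not part of the original text; the workloads and cost constants are hypothetical) models the elapsed parallel time as the load of the most heavily loaded processor plus a simple per-message communication charge.

# Illustrative model of the two main degradations discussed above:
# uneven allocation (the most heavily loaded processor sets the pace) and
# communication delay (a fixed processing cost per message plus a
# per-byte transmission cost).  All numbers are hypothetical.

def parallel_time(task_loads, msgs_per_proc, bytes_per_msg,
                  startup=1e-4, per_byte=1e-8, rate=1e8):
    """Estimated elapsed time along one processor's critical path.

    task_loads    -- list of operation counts, one entry per processor
    msgs_per_proc -- number of messages each processor sends
    bytes_per_msg -- size of each message in bytes
    startup       -- communication processing time per message (seconds)
    per_byte      -- transmission time per byte (seconds)
    rate          -- operations per second of one processor
    """
    compute = max(task_loads) / rate                 # uneven allocation
    communicate = msgs_per_proc * (startup + bytes_per_msg * per_byte)
    return compute + communicate

if __name__ == "__main__":
    total_work = 8.0e8                               # operations
    serial_time = total_work / 1e8

    # Perfectly balanced vs. mildly unbalanced partitions on 8 processors.
    balanced = [total_work / 8] * 8
    unbalanced = [1.5e8] + [(total_work - 1.5e8) / 7] * 7

    for label, loads in [("balanced", balanced), ("unbalanced", unbalanced)]:
        t = parallel_time(loads, msgs_per_proc=100, bytes_per_msg=8_000)
        print(f"{label:10s}: time = {t:.3f} s, speedup = {serial_time / t:.2f}")

In this simple model, packing the same data into fewer, larger messages lowers the start-up term, which corresponds to the second of the three classical remedies listed above.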
Another kind of redundant work comes from the load balancing: To reach agreement on shared data, an algorithm can either let one processor compute it while all the other processors wait or let all the processors compute it concurrently. Since the former involves waiting and receiving, the later is more favorable. This kind of redundant work is used in solving linear tridiagonal systems [60]. Redundant work decreases the communication overhead by increasing the computation requirement. 0 Reduced Knowledge. If performing calculations improves the knowledge base of the al- gorithm, simultaneous calculations cannot make use of the most recent knowledge but must rely on information that is to some extent out of date. This is often the case in iter- ative methods where each iteration advances the knowledge base. For example, Jacobi algorithms are widely used in multicomputers, while Gauss-Seidel algorithms, which use the latest available information, usually converge faster than the corresponding Jacobi algorithm. Receiving the newest information requires frequent communication and synchronization, making Gauss-Seidel algorithms difficult to implement efficiently on multicomputers. o Non-optimal. With more than one processor computing concurrently, the computa- tional complexity might be misleading, since an algorithm could do more work and still be fast. The algorithm with the lowest computation count is very often sequential in nature. To achieve a high degree of parallelism, we may have to develop parallel al- gorithms which require more computation than the optimal sequential algorithms. We reduce the uneven allocation degradation by increasing the computation requirement. This increase in computation requirement will lower the performance gain of parallel processing, at least in the sense of absolute speedup (see Chapter 2). 1.2.2 Algorithm Characteristics Parallel algorithm characteristics have been studied for mapping parallel algorithms into parallel or distributed architectures [30]. In this section, parallel algorithms are characterized from a different point of view. We study how these characteristics degrade the performance of parallel algorithms. 0 Data Parallelism versus functional parallelism. Parallelism can be achieved by di- viding the data among the processors, or by decomposing the algorithm into segments that can be assigned to different processors. We call the former data parallelism and the latter functional parallelism. Data parallelism is the parallelism which comes from simultaneous operations across a set of data [26]. Thus, for data parallel algorithms each processor executes the same instructions on a different data set. Task partitioning is more flexible and load balancing is relatively easy to achieve. Algorithms based on functional parallelism typically use a small number of processors. 0 Granularity. The granularity or grain size deals with average subtask size, which is measured in the number of instructions executed. It has a bearing on data alloca- tion, communication requirements, processing capability, and memory requirements. Fine-grain algorithms often require frequent communication. Large-grain algorithms typically reduce the communication / computation ratio and have less communication penalty. This communication penalty reduction makes some computation models, that may be intolerable for fine-grain algorithms, favorable for large-grain algorithms. 
For example, with fine-grain parallelism, both pipelined Jacobi iteration and Gauss-Seidel iteration are unacceptable. With large-grain parallelism, these methods can achieve a near linear speedup [49], [5]. 0 Degree of Parallelism. We call the number of ’ processes that can operate indepen- dently the degree of parallelism. Load balancing will be much more easily achieved for algorithms with a static degree of parallelism. Unfortunately, parallel algorithms with a dynamic degree of parallelism are not uncommon. An example is parallel Gaussian elimination algorithm. To eliminate the lower triangular elements, n — 1 independent processes are available for elimination in the first column, n - 2 independent processes are available for elimination in the second column, ..., and finally one process is avail- able for elimination in the n — 1th column. Another example is parallel Given Rotation [47]. For an n dimensional matrix, the degree of parallelism of Given Rotation goes up from 1 to n/2 and then goes down from n/2 to 1. o Concurrency and Pipelining. From the vieWpoint of scheduling, a parallel computation can be characterized as either concurrent or pipelined [29]. Concurrency exploits spatial parallelism by utilizing several processors executing multiple independent tasks simul- taneously. Pipelining exploits temporal parallelism in which each processor (called a stage) behaves like a filter or transformer which operates on its input data and passes its output data to the succeeding processor as the later’s input. A systolic array is a typical example of a pipelined computation in a synchronized environment. Through pipelining, communication occurs only between fixed and neighboring stages. Thus a very high ratio of computation to I/O rate will result. A more detailed study on pipeline computation can be found in [34] [49]. 0 Data Dependencies. Data dependencies have a major impact on the design of parallel algorithms. They influence data granularity, degree of parallelism and synchronization. Data dependency is so important that two nontraditional parallel machines, namely graph reduction machines and dataflow machines, have been proposed in which the order of instruction executions is implied by data dependencies [61]. By representing the operations as vertices and using directed edges to indicate the dependencies, data dependency can be represented by a directed dependency graph. Much information can be obtained by coloring the dependency graph. Operations with different colors cannot execute concurrently. From a design point of view, wewould like to distribute operations with the different color to the same processor. The multicoloring method [5], which has been successfully employed in the solution of partial differential equations and in image processing, is a good example of grouping the operations according to data dependency. 0 Methodology. There are a variety of methods for solving systems of equations, usually classified as direct and iterative methods. Direct methods follow fixed procedures and find the exact solution with a finite number of operations. Iterative methods do not obtain an exact solution in finite steps, they converge to a solution asymptotically. Nevertheless, iterative methods often yield a solution, within acceptable precision, af- ter a relatively small number of iterations. In this case, they are preferred to direct methods. 
This is usually the case for linear systems when n is very large and the coefficient matrix is sparse, such as the system A1: = b which arises in the discretiza- tion-of a linear partial differential equation and in many other applications. Iterative methods may also have smaller storage requirements than direct methods when the matrix A is sparse. The computation count of iterative methods depends on the num- ber of iterations, which is problem dependent. Curve tracing is a special method for obtaining solutions, in which we start at one end of the curve and want to get to the other end point. The intermediate results along the curve are unimportant. A curve tracing method for solving power flow problems will be presented in Chapter 4. Several newly developed direct methods for solving tridiagonal linear systems will be given in Chapter 5. 1.3 Motivation and Problem Statement The complex interaction of the many architectural, hardware, and software features of par- allel systems results in a larger space of possible performance behavior and an increase in algorithm design complexity. Designing efficient parallel algorithms requires that users to un- derstand the performance characteristics of parallel machines and to modify their algorithm accordingly. These modifications are problem dependent. Therefore, parallel algorithms have had to be fine-tuned case by case to achieve higher performance. The painful, elusive design process has excluded casual users and restricted parallel computers to a rather small professional community. This situation needs to be changed to make parallel computers usable for other scientists. We would like to reduce the burden of parallel algorithm design and make the design process more systematic. This raises the obvious questions: What are the techniques for developing eflicient parallel algorithms? How could these techniques be used on a given application? To answer the first question, Nelson and Snyder [44] have proposed the concept of parallel paradigm and identified several paradigms. Paradigms are recognized high level 9 methodologies common to many of the effective parallel algorithms. They are important, because they provide good examples and may help users understand parallel computation. However, these paradigms are described verbally and are isolated from each other. How to connect these paradigms with general applications is unknown. In this dissertation, I approach these two questions from a different angle. I study parallel algorithm design from a general point of view. First, a representation methodology, struc- tured representation, is developed. With this representation, most of the frequently used scientific and engineering applications can be represented by simple formulations. These formulations are combinations of some simple data structures, called parallel computation models, and provide adequate information about performance degradations. Parallel compu- tation models are the basic building blocks of structured representations. Since both parallel computation models and parallel paradigms are commonly used data structures, it is not surprising that they share some similarities. Some of them even share the same name. A major advantage of parallel computation models over parallel paradigms is that computation models are based on mathematical formulations, and they are the constructing components of general scientific applications. 
The structured representation of most frequently used sci- entific and engineering applications consists of computation models. The design techniques used in computation models are the techniques needed for general scientific algorithm design, and the design techniques are used in a similar way. A mathematical foundation is necessary in moving toward a systematic design methodol- ogy for parallel algorithms. Based on that foundation, different applications can be classified and manipulated. Forming structured representation and acquiring computation models are the first step in developing such a foundation. After computation models have been well identified and studied, the performance of an application can be predicted during its design stage, and systematically mapping algorithms onto architectures becomes possible. Structured representation and computation models are important. They provide a fast way of estimating and understanding the performance of parallel programs in a parallel system. They lead to efficient parallel algorithm design for scientific and engineering appli- cations. They provide guidelines for programming tools and systematic parallel algorithm design methodology. Also, they suggest how each multicomputer should be used in applica- tions and which multicomputer is the best for a given application. This study focuses on scientific computations. The models are built for scientific appli- cations and examples are chosen from scientific applications. We are interested in scientific applications since scientific and engineering applications are the major driving force behind 10 parallel processing and the data structures of scientific applications are relatively simple. With some effort, structured representation and computation models could possibly be ex- tended to general non—numerical applications. 1.4 Thesis Organization This dissertation is organized as follows. In Chapter 2, we study the performance issues of parallel processing. The study focuses on one performance metric, parallel speedup. Three models of speedup are studied. They are fired-size speedup, fired-time speedup and memory- bounded speedup. Memory-bounded speedup considers memory capacity as an influential factor and provides the quantum of the tradeoff between time and space. Two sets of speedup formulations are derived for these three models. One set requires more information and provides more accurate estimations. Another set considers a simplified case and provides a clear picture of the possible performance gain of parallel processing. Structured representation is proposed in Chapter 3. Two different classes of data- exchange are identified. One is regular data-exchange. Another is conjunctive regular data-ezchange. Notation is developed to represent these two classes of data-exchanges. A new concept, compute-exchange computation, is introduced. By this concept, every parallel computation can be divided into two parts, compute and data-exchange. There- fore, a parallel algorithm can be written in terms of data exchanges and computations, which is called structured representation. Some basic components of structured representa- tion, called computation models, are also studied in Chapter 3. They are Local Computa- tion Model, Global-Exchange Computation Model, Compute-Aggregate-Broadcast Computa- tion Model, Divide-and-Conquer Computation Model, Domain Decomposition Computation Model, Pipeline (Ring) Computation Model, and Recursive Doubling Computation Model. 
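As a concrete taste of the compute-exchange style, the short sketch below is added for illustration; it is not taken from the dissertation and only simulates, on a single machine, one common form of the recursive doubling pattern named above. Each of N = 2^d nodes alternates a local combine step with a pairwise data exchange, and after d steps every node holds the global result.

# Minimal simulation of a recursive doubling compute-exchange pattern:
# in step k, node i exchanges its partial result with node i XOR 2**k and
# combines the two values.  After log2(N) compute-exchange steps every
# node holds the same global result.  The node values below are arbitrary.

def recursive_doubling(values, combine=lambda a, b: a + b):
    n = len(values)
    assert n & (n - 1) == 0, "node count must be a power of two"
    vals = list(values)
    step = 1
    while step < n:
        # Exchange phase: every node i pairs with partner i XOR step.
        exchanged = [vals[i ^ step] for i in range(n)]
        # Compute phase: combine the local value with the received one.
        vals = [combine(vals[i], exchanged[i]) for i in range(n)]
        step <<= 1
    return vals

if __name__ == "__main__":
    partial_results = [3, 1, 4, 1, 5, 9, 2, 6]      # one value per node
    print(recursive_doubling(partial_results))       # every node ends with 31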
Parallel algorithms may be modified for different reasons to achieve high performance. The modifications, however, mainly follow two approaches: explore the inherent parallelism of the application to increase the concurrency and take advantage of the underlying architec- ture to reduce the communication overhead. Two applications are studied to illustrate how structured representation and computation models can be used in efficient algorithm design. The power flow problem is studied in Chapter 4 to demonstrate how structured representa- tion can help in application-driven algorithm design. In Chapter 5, using the design process of linear tridiagonal solvers, we show how the parallel architecture will influence the choice of computation models. The power flow problem determines how electrical power generation 11 and distribution should be changed with customer demand, so that the power system can be operated safely. We adopt the homotopy mathematical method to solve the power flow problem. Our first approach is an algorithm using the local computation model. Then, we change to a global-exchange computation. After learning more about the homotopy method, we modify our algorithm into a compute-aggregate—broadcast computation. Solving linear tridiagonal systems is a fundamental problem of scientific computation. We have developed tridiagonal solvers on a first generation NCUBE/ 7 hypercube multicomputer. To take‘advan- tage of the hypercube architecture, we change our design from compute-aggregate-broadcast computation to global-exchange computation. Then we change to a hybrid computation, which is a combination of two computation models. The last one gives the best perfor- mance. All the algorithms are implemented on an NCUBE multicomputer and the results can be found in the chapters that follow. A new performance prediction methodology is presented in Chapter 6. This methodology adopts a divide-and-combine strategy. For a given application we write down its structured representation and find out the contained computation models. We then use the well-studied computation models to predict the performance of the given application. Examples are given and the performance measurements of identified computation models are studied. Instead of studying the performance on a given architecture, the performance is studied from a dynamic point of view. We study the influence of problem size on performance. Chapter 7 gives a summary of my major contributions and the direction of future research. Chapter 2 PERFORMANCE METRIC OF PARALLEL ALGORITHMS To study efficient parallel algorithm design we first have to understand the performance metrics. For sequential machines, elapsed time and memory space are the metrics by which algorithms are measured. With parallel machines, the performance measurement becomes more difficult. In additionito time and space, the performance parameters for parallel algo- rithms include the number of processors available, communication overhead and the inherent parallelism of the given application. For this reason, despite the fact that parallel processing has become a common approach for achieving high performance, performance evaluation techniques of parallel processing are weak. There is no well-established metric to measure the performance gain. This weakness is one of the reason which leads to confusion and limits the growth of parallel computation. In this chapter, we shall study issues of performance ' metrics and seek a better understanding of one metric, parallel speedup. 
The most commonly used performance metric for parallel processing is speedup, which gives the performance gain of parallel processing versus sequential processing. However, with different emphases, speedup has been defined differently. One definition focuses on how much faster a problem can be solved with N processors. Thus, it compares the best sequential algorithm with the parallel algorithm under consideration. This definition is referred to as absolute speedup. Absolute speedup has two different definitions, one which considers machine resources and one which does not. In the first definition, speedup is defined as the ratio of the elapsed time of the best sequential algorithm on one processor over the elapsed time of the parallel algorithm on N processors. In the second definition, the absolute speedup is defined as the ratio of the elapsed time of the best sequential algorithm on the fastest sequential machine over the elapsed time of the parallel algorithm on the parallel machine [47]. Another type of speedup, called relative speedup, deals with the inherent parallelism of the parallel algorithm under consideration. It is defined as the ratio of the elapsed time of the parallel algorithm on one processor over the elapsed time of the parallel algorithm on N processors. Relative speedup is used because the performance of parallel algorithms varies with the number of available processors. Comparing the algorithm itself with different numbers of processors gives information on the variations and the degradations of parallelism. Absolute speedup and relative speedup are two commonly used speedup measures. Other definitions of speedup also exist. For instance, considering the construction costs of processors, a cost-related speedup is defined in [4].

Among all of the defined speedups, relative speedup is probably the one which has had the most influence on parallel processing. Two well known speedup formulations have been proposed based on relative speedup. One is Amdahl's law [2] and another is Gustafson's scaled speedup [20]. Both of these speedup formulations use a single parameter, the sequential portion of a parallel algorithm. They are simple and give much insight into the degradations of parallelism. Amdahl's law assumes a fixed problem size and is interested in how fast the response time could be. It suggests that massively parallel processing may not gain high speedup. Under the influence of Amdahl's law, many parallel computers have been built with a small number of processors. Gustafson [20] approaches the problem from another point of view. He fixes the response time and is interested in how large a problem could be solved within this time. The argument of Gustafson is that the problem size should be increased to meet the available computation power for better results. Experimental results based on his argument show that speedup can increase linearly with the number of processors available [22].

In this chapter we shall provide a careful study of relative speedup and reserve the word speedup for relative speedup, unless explicitly stated otherwise. We first study three models of speedup: fixed-size speedup, fixed-time speedup and memory-bounded speedup. All three models are based on relative speedup. With both uneven allocation and communication overhead considered, general speedup formulations will be derived for all three models.
When communication overhead is not considered and the workload only consists of sequential and perfectly parallel portions, the simplified fixed-size speedup is Amdahl's law; the simplified fixed-time speedup is Gustafson's scaled speedup; and, with one more parameter, the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's speedup as its special cases. Therefore, from different points of view, the three models of speedup are unified.

2.1 Preliminary

The parallelism in an application can be characterized in different ways for different purposes, such as the data dependency graph [61], the task precedence graph [11], Petri nets [41] and average parallelism [15]. For simplicity, speedup formulations generally use very few parameters and consider very high level characterizations of the parallelism. In our study we consider two main degradations of parallelism, uneven allocation and communication latency. The former degradation is application dependent. The latter degradation depends on both the application and the parallel computer under consideration. To give an accurate estimate, both of the degradations need to be considered. Uneven allocation is measured by degree of parallelism.

Definition 1 The degree of parallelism of a program is an integer which indicates the number of processors that are busy computing at a particular instant in time, given an unbounded number of available processors.

The degree of parallelism is a function of time. By drawing the degree of parallelism over the execution time of an application, a graph can be obtained. We refer to this graph as the parallelism profile. Software tools are available to determine the parallelism profile of large scientific and engineering applications [36]. Figure 2.1 is the parallelism profile of a hypothetical divide-and-conquer computation (see Section 3.2). By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form the shape (see Figure 2.2) of the application [53].

Figure 2.1: Parallelism Profile of an Application

Figure 2.2: Shape of the Application

Definition 2 The average parallelism is the average number of processors that are busy during the execution of the program in question, given an unbounded number of available processors.

By definition, average parallelism is the ratio of the total service demand to the execution time with an unbounded number of available processors. This is equal to the speedup, given an unbounded number of available processors and without considering communication latency. Therefore, average parallelism can be defined equivalently as follows [15].

Definition 3 The average parallelism is the speedup, given an unbounded number of available processors and without considering communication latency.

Let W be the amount of work (computation) of an application. Let $W_i$ be the amount of work executed with degree of parallelism i, and let m be the maximum degree of parallelism. Thus, $W = \sum_{i=1}^{m} W_i$. The execution time for computing $W_i$ with a single processor will be

$$t_i(1) = \frac{W_i}{\Delta}, \qquad (2.1)$$

where $\Delta$ is the computing capacity of each processor. If there are i processors available, the execution time will be

$$t_i(i) = \frac{W_i}{i\Delta}.$$

With an infinite number of processors available, the execution time will be

$$t_i = t_i(\infty) = \frac{W_i}{i\Delta} \quad \text{for } 1 \le i \le m.$$

Therefore, without considering communication latency, the response times on a single processor and on an infinite number of processors will be

$$T(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}, \qquad (2.2)$$

$$T(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}. \qquad (2.3)$$

The speedup will be

$$S_{\infty} = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}}. \qquad (2.4)$$

The average parallelism, A, can be computed in terms of $t_i$,

$$A = \frac{\sum_{i=1}^{m} i\, t_i}{\sum_{i=1}^{m} t_i}. \qquad (2.5)$$

Notice that $t_i$ is the time for executing $W_i$. When an unbounded number of processors are available, $t_i = \frac{W_i}{i\Delta}$. Substituting $t_i = \frac{W_i}{i\Delta}$ into Eq. (2.5), we have

$$A = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}} = S_{\infty}. \qquad (2.6)$$

This gives a formal proof of the equivalence of Definition 2 and Definition 3. Average parallelism is a very important factor for speedup and efficiency. It has been carefully examined in [15].

$S_{\infty}$ gives the best possible speedup based on the inherent parallelism of an algorithm; no machine dependent factors are considered. With only a limited number of available processors and with communication latency considered, the speedup will be less than the best speedup, $S_{\infty}$. If there are N processors available and N < i, then some processors have to do $\lceil i/N \rceil$ pieces of work of size $W_i/i$ and the rest of the processors will do $\lfloor i/N \rfloor$ such pieces. In this case, assuming $W_i$ and $W_j$ cannot be solved simultaneously for $i \ne j$, the elapsed time will be

$$t_i(N) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil$$

and

$$T(N) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil. \qquad (2.7)$$

The speedup is

$$S_N = \frac{T(1)}{T(N)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil}. \qquad (2.8)$$

Communication latency is an important factor contributing to the complexity of a parallel algorithm. Unlike degree of parallelism, communication latency is machine dependent. It depends on the communication network, the routing scheme, and the adopted switching technique. For instance, the switching technique used in first generation multicomputers is store-and-forward. Second generation multicomputers adopt circuit switching or wormhole routing switching techniques. These new switching techniques reduce the communication cost considerably. Let $Q_N$ be the communication overhead when N processors are used in parallel processing; the general speedup becomes

$$S_N = \frac{T(1)}{T(N)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N}. \qquad (2.9)$$

2.2 Models of Speedup

In the last section we developed a general speedup formula and showed how the number of processors and the degradation parameters influence the performance. However, speedup is not only dependent on these parameters. It is also dependent on how we view the problem. With different points of view, we will get different models of speedup and different speedup formulations.

One viewpoint emphasizes shortening the time it takes to solve a problem by parallel processing. With more and more computation power available, the problem can be solved in less and less time. With more processors available, the system will provide a fast turnaround time and the user will have a shorter waiting time. A speedup formulation based on this philosophy is called fixed-size speedup. In the previous section, we adopted fixed-size speedup implicitly. Equation (2.9) is the general speedup formula for fixed-size speedup. Fixed-size speedup is suitable for many applications.
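The following short sketch (added for illustration; the parallelism profile and overhead value are hypothetical) evaluates the average parallelism of Eqs. (2.5)-(2.6) and the bounded-processor speedups of Eqs. (2.8) and (2.9) for a given workload profile $W_1, \ldots, W_m$, taking $\Delta = 1$ so that work and time units coincide.

from math import ceil

# W[i] is the amount of work executed with degree of parallelism i+1,
# i.e. the parallelism profile of Figure 2.2 in histogram form.
# The profile and the overhead figure below are hypothetical; Delta = 1.

def average_parallelism(W):
    """A = sum(W_i) / sum(W_i / i), Eqs. (2.5)-(2.6)."""
    return sum(W) / sum(w / i for i, w in enumerate(W, start=1))

def speedup(W, N, Q_N=0.0):
    """S_N of Eqs. (2.8)-(2.9): each W_i needs ceil(i/N) rounds of W_i/i
    work, plus a lumped communication overhead Q_N (same units as work)."""
    T1 = sum(W)
    TN = sum((w / i) * ceil(i / N) for i, w in enumerate(W, start=1)) + Q_N
    return T1 / TN

if __name__ == "__main__":
    W = [10.0, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0, 80.0]   # W_1, ..., W_8
    print(f"A   = {average_parallelism(W):.2f}")
    for N in (2, 4, 8):
        print(f"S_{N} = {speedup(W, N):.2f}  "
              f"(with overhead Q_N = 5: {speedup(W, N, Q_N=5.0):.2f})")

With N equal to the maximum degree of parallelism and no overhead, the computed speedup coincides with the average parallelism, as Eq. (2.6) predicts.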
For some applications we may have a time limitation, but we may not want to obtain the solution in the quickest way. If we have more computation power, we may want to increase the problem size, carry out more operations, get a more accurate solution, and keep the execution time unchanged. This viewpoint leads to a new model of speedup, called fixed-time speedup. Many scientific and engineering applications can be represented by partial differential equations, which can be discretized for different choices of grid spacing. Coarser grids demand less computation, but finer grids give more accurate solutions. If more accurate solutions are desired, this kind of application fits the fixed-time speedup model. One good example is weather forecasting. With more computation power, we may not want to give the forecast earlier. Rather, we may wish to add more factors into the weather model, increasing the problem size and getting a more accurate solution, thus providing a more precise forecast.

For fixed-time speedup the workload is scaled up with the number of processors available. Let $W_i'$ be the amount of scaled up work executed with degree of parallelism i and let $m'$ be the maximum degree of parallelism of the scaled up problem when N processors are available. In order to keep the same turnaround time as the sequential version, we must have

$$\sum_{i=1}^{m} W_i = \sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N.$$

Thus, the general speedup formula for fixed-time speedup is

$$S_N' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N} = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i}. \qquad (2.10)$$

From our experience in using multicomputers, we have found that memory capacity plays a very important role in performance. Existing multicomputers do not support virtual memory, and memory is distributed and associated with each node. The memory associated with each node is relatively small. When solving an application with one processor, the problem size is more often bounded by the memory limitation than by the execution time limitation. With more nodes available, instead of keeping the execution time fixed, we may want to meet the memory capacity and increase the execution time. In general, the question is: if you want to increase the problem size, do you have enough memory for the size increase? If you do have adequate memory space for the size increase, and after the problem size is increased to meet the time limit you still have memory space available, do you want to increase the problem size further by using this unused memory space and get an even better solution? For memory-bounded speedup the answer is yes. Like fixed-time speedup, memory-bounded speedup is a scaled speedup. The problem size is scaled up with the system size. The difference is that in fixed-time speedup the execution time is the dominant factor, and in memory-bounded speedup the memory capacity is the dominant factor. Most of the applications which fit fixed-time speedup will fit memory-bounded speedup when more accurate solutions are the goal. A good application for memory-bounded speedup is simulation. If we simulate a nuclear power plant, obtaining an accurate solution will probably be the highest priority.

With memory capacity considered as a factor of performance, the requirement of solving an application contains two parts. One is the computation requirement, which is the workload, and the other is the memory requirement. For a given application, these two requirements are related to each other, and the workload can be seen as a function of the memory requirement. Let M represent the memory requirement and let g represent the function; we have W = g(M), or M = g^{-1}(W), where g^{-1} is the inverse function of g. Under different architectures the memory capacity will change differently with the number of processors available. For multicomputers, the memory capacity increases linearly with the number of nodes available.
If $W = \sum_{i=1}^{m} W_i$ is the workload for sequential execution and $W^* = \sum_{i=1}^{m^*} W_i^*$ is the scaled workload when N processors are available, where $m^*$ is the maximum degree of parallelism of the scaled problem, then the memory limitation for multicomputers can be stated as: the memory requirement for any active node is less than or equal to $g^{-1}\left(\sum_{i=1}^{m} W_i\right)$. Here the main point is that the memory occupation on each node is fixed. Equation (2.11) is the general speedup formula for memory-bounded speedup.

$$S_N^* = \frac{\sum_{i=1}^{m^*} W_i^*}{\sum_{i=1}^{m^*} \frac{W_i^*}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N}. \qquad (2.11)$$

2.3 Simplified Models of Speedup

Three general speedup formulations have been proposed for the three models of speedup. These formulations contain both the uneven allocation and the communication latency degradations. They are closer to actual speedup and give better upper bounds on the performance of parallel algorithms. On the other hand, these formulations are problem dependent and difficult to understand. They give more detailed information for each application, but lose the global view of the possible performance gain. In this section, we study a simplified case for speedup, which is the special case studied by Amdahl and Gustafson. We do not consider the communication overhead, so $Q_N = 0$, and we assume that the allocation only contains two parts, a sequential part and a perfectly parallel part. That is, $W_i = 0$ for $i \ne 1$ and $i \ne N$. We also assume that the sequential part is independent of the system size, i.e., $W_1 = W_1' = W_1^*$. Under this simplified case, the general fixed-size speedup formulation Eq.(2.9) becomes

$$S_N = \frac{W_1 + W_N}{W_1 + \frac{W_N}{N}}, \qquad (2.12)$$

which is known as Amdahl's law. From Eq.(2.12) and Fig. 2.3 we can see that when the number of processors increases, the load on each processor decreases. Eventually, the sequential part will dominate the performance and the speedup is bounded by $\frac{W_1 + W_N}{W_1}$.

Figure 2.3: Amdahl's Law

Under the simplified conditions, $\sum_{i=1}^{m'} W_i' = W_1 + W_N'$ and $\sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N = W_1 + \frac{W_N'}{N}$. Therefore, for fixed-time speedup, we have $W_1 + W_N = W_1 + \frac{W_N'}{N}$. Since $W_1 = W_1'$, we have $W_N = \frac{W_N'}{N}$, that is, $W_N' = N W_N$. Equation (2.10) becomes

$$S_N' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i} = \frac{W_1' + W_N'}{W_1 + W_N} = \frac{W_1 + N W_N}{W_1 + W_N}. \qquad (2.13)$$

The simplified fixed-time speedup formula Eq.(2.13) is Gustafson's scaled speedup, which was proposed by Gustafson in 1988 [20]. From Eq.(2.13) we can see that the parallel portion of the application is scaled up linearly with the system size, and the speedup increases linearly with the system size. The relation of workload and elapsed time for Gustafson's scaled speedup is depicted in Figure 2.4, where $T_1$ is the execution time for the sequential portion of the work and $T_N$ is the execution time for the parallel portion of the work.

Figure 2.4: Gustafson's Scaled Speedup

We need some preparation before deriving the simplified formulation for memory-bounded speedup.

Definition 4 A function g is a semihomomorphism if there exists a function $\bar{g}$ such that for any real number c and any variable x, $g(cx) = \bar{g}(c)\, g(x)$.

One class of semihomomorphism functions is the power function $g(x) = x^b$, where b is a rational number. In this case, $\bar{g}$ is the same as the function g. Another class of semihomomorphism functions is the single-term polynomial $g(x) = a x^b$, where a is a real constant and b is a rational number. For this kind of semihomomorphism function, $\bar{g}(x) = x^b$, which is not the same as $g(x)$.
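A quick numerical check of Definition 4 for these two classes follows (added here only as an illustration; the exponent, coefficient and test values are arbitrary).

# Numerical check of the semihomomorphism property g(c*x) = gbar(c)*g(x)
# for the two classes mentioned above.  The exponent b, coefficient a and
# the test values of c and x are arbitrary.

from math import isclose

b, a = 1.5, 2.0

def g_power(x):  return x ** b         # g(x) = x^b,   gbar = g
def g_poly(x):   return a * x ** b     # g(x) = a*x^b, gbar(x) = x^b
def gbar(x):     return x ** b         # gbar for both classes

for c in (2.0, 3.0, 10.0):
    for x in (1.0, 4.0, 7.5):
        assert isclose(g_power(c * x), gbar(c) * g_power(x))
        assert isclose(g_poly(c * x),  gbar(c) * g_poly(x))
print("g(c*x) = gbar(c) * g(x) holds for both classes at all test points")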
The sequential portion of the workload, $W_1$, is independent of the system size. If we do not consider the influence of memory on the sequential portion, we have the following theorem:

Theorem 1 If W = g(M) for some semihomomorphism function g, $g(cx) = \bar{g}(c)\, g(x)$, then, with all data being shared by all the available processors and using all the available memory space, the simplified memory-bounded speedup is

$$S_N^* = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \qquad (2.14)$$

Proof: As mentioned before, $W_N$ is the parallel portion of the workload when one processor is used. Let the memory requirement of $W_N$ be M, $W_N = g(M)$. M is the memory requirement when one node is available. With N nodes available, the memory capacity will increase to NM. Using all of the available memory, for the scaled parallel portion $W_N^*$, $W_N^* = g(NM) = \bar{g}(N)\, g(M)$. Therefore, $W_N^* = \bar{g}(N)\, W_N$ and

$$S_N^* = \frac{W_1 + W_N^*}{W_1 + \frac{W_N^*}{N}} = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \qquad (2.15) \quad \square$$

In the proof of Theorem 1, we claimed that $W_N^* = g(NM) = \bar{g}(N)\, g(M)$. This claim is true under two assumptions: 1) the data is shared by all available processors, and 2) all the available memory space is used for better solutions. A computation with the first property is called global computation. Equation (2.14) is the simplified memory-bounded speedup for global computation when memory is fully used.

Figure 2.5: Simplified Memory-Bounded Scaled Speedup

In general, data may be duplicated on different nodes and the available memory may not be fully used for increased problem size. Replacing the function g by a general function G, that is $W_N^* = G(N)\, W_N$, a more generalized theorem will be

Theorem 2 If $W_N^* = G(N)\, W_N$ for some function G, then

$$S_N^* = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N} W_N}. \qquad (2.16)$$

Proof: If $W_N^* = G(N)\, W_N$ for some integer N, then

$$S_N^* = \frac{W_1 + W_N^*}{W_1 + \frac{W_N^*}{N}} = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N} W_N}. \qquad (2.17) \quad \square$$

Equation (2.16) will be referred to as simplified memory-bounded (SMB) scaled speedup. SMB scaled speedup is determined by the function G(N), which represents how the change of memory will influence the change of problem size. When the problem size is independent of the system size, the problem size is fixed, G(N) = 1. In this case, SMB scaled speedup is the same as Amdahl's law, i.e., Eq.(2.16) and Eq.(2.12) are equivalent. The local computation model is one computation model studied in Section 3.2. In the local computation model, when more processors are available, work will be replicated on these available processors. Computation is done locally on each node, and communication between nodes is not required. In this case, when memory is increased N times, the workload also increases N times, i.e., G(N) = N. In this case, SMB scaled speedup is the same as Gustafson's scaled speedup. SMB scaled speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. For most scientific and engineering applications, the computation requirement increases faster than the memory requirement. For these applications, $\bar{g}(N) > N$ and memory-bounded speedup will likely give a higher speedup than fixed-time speedup.

The proposed scaled speedup formulation, Eq.(2.16), may not be easy to fully understand at first glance. Here we use matrix multiplication to illustrate it.
Enlarging the matrix size generally leads to a more accurate solution for the continuum. For matrices of dimension n, the computation requirement of matrix multiplication is $2n^3$ and the memory requirement is roughly $3n^2$. Thus

$$W_N = 2\left(\frac{M}{3}\right)^{3/2}.$$

This means that

$$W_N = g(M) = 2\left(\frac{M}{3}\right)^{3/2}, \qquad \bar{g}(N) = N^{3/2}. \qquad (2.18)$$

The simplified memory-bounded speedup for global computation is therefore

$$S_N^* = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N}. \qquad (2.19)$$

Global computation uses the distributed local memories as one large shared memory. All the data is distributed and shared. When these local memories are used locally without sharing, the computation is local computation and $W_N^* = N g(M)$. This means that G(N) = N, and the speedup is

$$S_N^* = \frac{W_1 + N W_N}{W_1 + W_N},$$

which is Gustafson's scaled speedup.

For matrix multiplication C = AB, let $A_i$ be the i-th row of A, i = 1, ..., n, and let $B_j$ be the j-th column of B, j = 1, ..., n. The local computation and global computation of the matrix multiplication are shown in Figures 2.6 and 2.7, respectively. In global computation, the rows of matrix A are rotated after each row-column multiply.

Figure 2.6: Matrix Multiplication with Local Computation (node i computes $A B_i$)

Figure 2.7: Matrix Multiplication with Global Computation (node i computes $A_i B_i$, with the rows of A rotated among nodes)

We have studied two cases of memory-bounded scaled speedup, global computation and local computation. Most applications are some combination of these two computation styles: data are distributed in one part and duplicated in the other part. The duplication may be required by inherent properties of the given application, or may be added deliberately to reduce communication. Speedup formulations for these applications depend on the ratio of the global and the local computation. Deriving a speedup formulation for these combined applications is difficult, not only because we face a more complicated situation, but also because of the uncertainty of the ratio. The duplicated part might not increase with system size. It might increase, but at a rate different from that of the global part. Also, an application may start as global computation, but, when the computation power increases, duplication may be added as part of the effort to obtain a better solution. In general, G(N) is application dependent. We derive G(N) for a special case as an example. The structure of this derivation can be used as a guideline for general applications.

Lemma 1 If the function g is a semihomomorphism, $g(cx) = \bar{g}(c)g(x)$, and $g^{-1}$ exists and is also a semihomomorphism, $g^{-1}(cx) = h(c)g^{-1}(x)$ for some function h, then $\bar{g}$ has an inverse and $\bar{g}^{-1} = h$.

Proof: Since $cy = g[g^{-1}(cy)] = g[h(c)g^{-1}(y)] = \bar{g}[h(c)]\,g[g^{-1}(y)] = \bar{g}[h(c)]\,y$, we have

$$\bar{g}[h(c)] = c \quad \text{for any real number } c. \qquad (2.20)$$

Also, since $cy = g^{-1}[g(cy)] = g^{-1}[\bar{g}(c)g(y)] = h[\bar{g}(c)]\,y$, we have

$$h[\bar{g}(c)] = c \quad \text{for any real number } c. \qquad (2.21)$$

By Eq. (2.20) and Eq. (2.21), the function $\bar{g}$ has an inverse and $\bar{g}^{-1} = h$. □

Theorem 3 Assume $W = g(M)$ for some semihomomorphism function g, where $g(cM) = \bar{g}(c)g(M)$, and the inverse of g exists and is a semihomomorphism. If the workload is first scaled up to meet the time limitation with global computation, and the rest of the unused memory space is then used to increase the problem size further with local computation, we have

$$G(N) = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]. \qquad (2.22)$$

Proof: By the fixed-time speedup, after the number of nodes changes from 1 to N, the parallel portion of the work increases from $W_N$ to $N W_N$ (see Figure 2.4). The storage requirement is given by the function $g^{-1}$.
For operation requirement $N W_N$, the memory requirement is $g^{-1}(N W_N) = \bar{g}^{-1}(N)g^{-1}(W_N)$. Let M represent the size of the memory associated with each node which can be used for parallel processing. Then, when the number of nodes equals 1, the total memory available is M, which is equal to $g^{-1}(W_N)$. When the number of nodes equals N, the total memory available changes to NM. We first fix the execution time and increase the problem size to meet the time limitation. After the fixed-time scale-up, the unused memory space is the difference between the currently available memory and the current memory requirement, which equals

$$NM - g^{-1}(N W_N) = NM - \bar{g}^{-1}(N)g^{-1}(W_N) = NM - \bar{g}^{-1}(N)M = \left(N - \bar{g}^{-1}(N)\right)M.$$

The unused space at each node is $\left[N - \bar{g}^{-1}(N)\right]M/N = \left[1 - \frac{\bar{g}^{-1}(N)}{N}\right]M$. The problem size can be further scaled by using this unused memory space. The further scaled computation on each node is given by the function g, and it is equal to

$$g\!\left(\left[1 - \frac{\bar{g}^{-1}(N)}{N}\right]M\right) = \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)g(M) = \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)W_N. \qquad (2.23)$$

Therefore, the computation on each node becomes the original operations on the node plus the added operations on the node,

$$W_N + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)W_N = \left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]W_N, \qquad (2.24)$$

and, for the scaled parallel computation $W_N^*$,

$$W_N^* = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]W_N.$$

Thus, we have

$$G(N) = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]. \qquad (2.25)$$

□

Figure 2.8 depicts the speedup difference among the fixed-size model, the fixed-time model and the memory-bounded model. The function $\bar{g}$ used in Figure 2.8 is the one for matrix multiplication, $\bar{g}(N) = N^{3/2}$. As most matrix computations have the same function $\bar{g}(N) = N^{3/2}$, the speedup relation depicted by Figure 2.8 is in general true for a large class of applications.

Figure 2.8: Amdahl's law, Gustafson's speedup and SMB speedup, where $W_1 = 0.3$ and $\bar{g}(N) = N^{3/2}$ (speedup versus number of nodes; curves shown are SMB-Global, SMB-Combined, Gustafson, and Amdahl's law)
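The curves of Figure 2.8 can be reproduced numerically from Eqs. (2.12), (2.13), (2.16), (2.19) and (2.25). The sketch below assumes the workload is normalized so that $W_1 + W_N = 1$ with $W_1 = 0.3$, as in the figure; it is a simple illustration, not part of the original analysis.

    # speedup models of Chapter 2 with W1 = 0.3, WN = 0.7 and g_bar(N) = N**1.5
    W1, WN = 0.3, 0.7
    g_bar     = lambda N: N ** 1.5            # matrix-computation scaling, Eq. (2.18)
    g_bar_inv = lambda N: N ** (2.0 / 3.0)

    def smb(N, G):                            # Eq. (2.16), SMB scaled speedup
        return (W1 + G * WN) / (W1 + G / N * WN)

    for N in (4, 64, 1000):
        amdahl    = (W1 + WN) / (W1 + WN / N)                        # Eq. (2.12), G(N) = 1
        gustafson = (W1 + N * WN) / (W1 + WN)                        # Eq. (2.13), G(N) = N
        smb_glob  = smb(N, g_bar(N))                                 # Eq. (2.19), global computation
        smb_comb  = smb(N, N * (1 + g_bar(1 - g_bar_inv(N) / N)))    # Eq. (2.25), combined case
        print(N, round(amdahl, 2), round(gustafson, 1),
              round(smb_comb, 1), round(smb_glob, 1))

At N = 1000 this prints roughly 3.3, 700, 812 and 988, matching the ordering of the four curves in Figure 2.8.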
Amdhal’s law suggests that the sequential portion of the workload will dominate the per- formance when the number of processors is large. Gustafson’ scaled speedup claims that the influence of the sequential portion is independent of system size. Simplified memory- bounded scaled speedup declares that the sequential fraction will change with the system size. Since the computation requirement increases faster than the memory requirement for most applications, the sequential fraction could be reduced when the number of processors increases. The three models of speedup, fixed-size speedup, fixed-time speedup and memory-bounded speedup, are based on different vieWpoints and are suitable for different classes of applica- tions. Applications exist which do not fit any of the models of speedup, but satisfy some combination of the models. Chapter 3 PARALLEL COMPUTATION MODELS FOR SCIENTIFIC COMPUTING With more than one processor operating concurrently, parallel processing enlarges the range of possible performance and makes efficient algorithm design more important. We have studied performance metrics in the previous chapter. In this chapter, we focus on designing ' parallel algorithms, with emphasis on methodology and analysis. One of the factors that makes parallel algorithm design difficult is the lack of guidelines. Different applications are rarely related to each other in the design process. Therefore, they have to be handled one by one in an ad hoc fashion. The experience of experts hardly benefits general users. In Section 3.1, we propose a representation system, called structured representation, for scientific and engineering applications. With this representation, the sim- ilarities between applications become obvious and applications can be compared, classified, and manipulated based on their structures. Developing structured representation is the first step toward the ultimate goal, replacing the ad hoc development of parallel applications with a rational methodology based on analysis and measurement. A parallel algorithm cannot be efficient without considering the architectural aspects of the underlying multiprocessor. This is especially true for multicomputers where communi- cation overhead is a major consideration. Mapping an application onto an architecture is both application dependent and architecture dependent. To lead to a systematic mapping, the basic building blocks of structured representation are identified and studied in Section 3.2. With structured representation, an application can be decomposed into a set of com- putational models which can be mapped onto the architecture using predefined strategies. We can modify our design by changing certain computation models. In this way, structured design and rethinking become possible. Casual users can develop efficient algorithms based on experts’ experience. A general guideline for efficient algorithm design is provided. 29 30 3.1 Structured Representation Parallelism can be achieved by dividing a given application into pieces, called subtasks, and solving these pieces concurrently. Ideally, these subtasks can be solved independently, where the exchange of intermediate result is negligible. Some scientific and engineering applications have this nice, easy parallelism property. For these applications, it is natural to solve each of the subtasks locally on a different processor. This model of computation is called a local computation model. IDmExchanae' . > I_. 
Figure 3.1: Compute-Exchange Computation The design of local computation algorithms is straightforward. Unfortunately, most appli- cations do not have this easy parallelism property [1]. For most applications communication is necessary for exchanging data and coordinating activities. Although various asynchronous techniques have been designed to reduce the communication overhead, most communication must be achieved in a synchronous fashion, that is the receiving node must receive the com- municated message before continuing. This synchronous communication requirement makes eficient algorithm design very difficult. The load needs to be balanced between synchroniza- tions and special care has to be taken to reduce the communication overhead. Figure 3.1 depicts the general parallel computation pattern with synchronous communication consid- ered. It shows that the solving process consists of two phases, compute and data-exchange. These two phases occur alternatively and repetitively, and, therefore, form the compute— ezchange computation. The data-exchange phase involves communication between compute phases. The communication patterns vary largely from application to application, and may be represented by a notation. To simplify this notation, we restrict ourselves to certain 31 classes of communication, which are large enough for our purposes - describing the most frequently used scientific and engineering applications. A processor sending a message in a communication is called a sender. A processor receiving message in a communication is called a receiver. A processor could be both sender and receiver in a communication. A graph G(V, E) is a structure which consists of a set of vertices V = {v1,vg,...} and a set of edges E = {81,62,...} [17]. If we let processors in a communication bevertices in a graph and add directed edge (v, w) from v to w if processor v sends a message to processor w; a digraph (directed graph) is formed. This digraph is called the communication digraph. Following the notations of graph theory [17], the outdegree of a vertex v is the number of edges which have v as their start-vertex. In other words, the outdegree of a vertex v is equal to the number of destinations that v sends its message to. For this reason we also call the outdegree of a vertex the degree of a sender. The indegree of a vertex and the degree of a receiver are defined similarly. The degree of a receiver is the number of sources from which it receives messages. Definition 5 A regular communication is a communication in which all senders have the same degree and all receivers have the same degree. For a given undirected graph, if for every two vertices u and v there exists a path whose starting vertex is u and whose ending vertex is v, then the graph is connected. A con- nected subgraph G'( V’, E’) is a connected component if there is no other connected subgraph containing C(V’, E’) as its proper subgraph. The underlying (undirected) graph Of a digraph is the graph resulting from the digraph if the direction of the edges is ignored. A connected component of a digraph is the corre- sponding subdigraph of the connected component of its underlying graph. A connected component of the communication digraph is called a pattern of the commu- nication. Definition 6 A regular communication is a regular data-exchange if it consists of one or more copies of the same pattern. By our definition, the communication requirement of a regular data-exchange is given by the number of patterns it contains. 
The pattern of a communication is described by the number of senders and the number of receivers in this pattern. The complexity of each sender and of each receiver is given by its degree. Thus, a regular data-exchange can be represented using five parameters as 32 Pl5(D), 13(4)], (3-1) - where P is the number of instance of the pattern, S is the number of senders in each instance of the pattern, and D is the degree of the senders. Similarly, R is the number of receivers in each instance of the pattern, and d is the degree of each receiver. An example of using this notation for presenting communication is given in Figure 3.2. Notation (3.1) describes a communication by five parameters. Since broadcast is not provided in neither first generation nor second generation multicomputers, messages must be sent one at a time. The number of times messages are sent and received is the dominant factor in communication cost. Notation (3.1) indicates the characteristics of a communication. More information may be needed when implementation is considered. '9 '0 \ \‘s ‘ \ \‘ \‘~~ \ \‘s‘ \ \‘s‘ \‘ \ “ \‘ ‘ “ \ ‘ “ ‘~ ‘ “ 2[1(3)3(1>1 ‘\ \‘ ‘s \\ \ ‘~ ' \ \ ‘ \ \ s \‘ \‘ s‘ \‘ \\ “ \ i a ‘3 1 \__‘a. Figure 3.2: Multicast Data Exchange The second class of data-exchange which we want to identify is called conjunctive reg- ular data-exchange. We use the same five parameters to identify conjunctive regular data- exchange. The difference between regular data-exchange and conjunctive regular data- exchange is that in conjunctive regular data-exchange the patterns are not disjoint, they conjoin one another. Consider two special cases: conjunction at the sender side only and conjunction at the receiver side only. We have two general notations, ”5(0). R.(d)], (3-2) and P [51(0), 3(4)], (3-3) 33 where the subscript, c, points out which side has conjunctions. An example of conjunctive regular data-exchange is given in Figure 3.3, in which the receiver side has conjunction. Q I A " In\ I“\ H ’ \ I \ I \ ’ ‘ \ I \ ”I X ’I \‘\ ’I \\ I, “ I ~ I t ,' X I ~ mm, 2 (1)] I \ , I \ ’ ‘ ’ ‘ ’ ‘v ’ \ Figure 3.3: Conjunctive Data Exchange A graph G'(V’,E’) is a partition graph of G(V, E) if G’(V’, E’) is formed by splitting a subset of vertices {v1,vg,...v,.} E V into two subsets of vertices as {v;,v;,...v:,} and {v;’, v.1! , ...vz}, where v}, v,” are the vertices formed by splitting vertex v;, and having edges (v£,v;-) and (vf’,v§’ when edge (v;,v,-) exists in graph G(V, E) These'divided vertices are called partition vertices. Figure 3.4 shows two partition graphs of the given graphs. With this terminology, conjunctive regular data-exchange can be defined more mathematically as follows. Definition 7 A regular communication is a conjunctive regular data-exchange if one of its partition graphs consists of one or more copies of the same pattern. If all partition vertices are senders, the conjunctive regular data-exchange is a sender conjunctive regular data-exchange. If all partition vertices are receivers, the conjunctive regular data-exchange is a receiver conjunctive regular data-exchange. Since a regular communication patterns could have more than one partition graph which consists of one or more copies of the same pattern, a conjunctive regular data-exchange could have more than one notation. Once the data-exchange phase has been identified in an application, we can describe the application in terms of data-exchange. An application might have different data-exchange phases. 
Writing these data-exchange phases together in order by using the Z: symbol and adding in the compute phases, we have a formula, called a structured representation, for each 34 Original Graph Partitoned Graph Partition Vertices Original Graph Partitioned Graph Case Two Figure 3.4: Partitioning of Graphs 35 application. Figure 3.5 shows how to represent the Fast Fourier Transform (FFT) compu- tation in terms of regular data-exchange. The communication divides the computation into layers. The total computation depends on the number of layers as well as the computation requirement at each layer. X ,- is the computation work on each processor between data- exchange phase i - 1 and i, if we have even allocation. X.- is the computation work of the processor which has the largest workload among all the working processors in the compute phase i, if we have uneven allocation. Notice that X; is not equal to W,- which is defined in Chapter 2. W,- is the total workload with degree of parallelism i. E!!!!!!! x: :x: 3‘ ‘ :x: 4[2(1),2(1)] 223?:1'. :3: s: _, ~. 4[2(1).2(1)] 4l2(1).2(1)l i(2k'1[2(1),2(1)] + Xi) lc = log(N) = 3 i=1 Figure 3.5: FFT (Butterfly) Computation 36 We can see from Figure 3.5 that different communications could have the same data- exchange representation. The reason is that our notation. is a high level notation. It pro- vides the communication complexity for. the data-exchanges. It does not contain detailed information about how the communication takes place. We assume that the processors which participate in the data communication also partic- ipate in the computation, and the processors which do not participate in the data communi- cation do not participate in the computation. With this assumption, the second application, odd—even cyclic reduction (see Figure 3.6), shows how the structured representation illustrats the uneven allocation degradation. Using the data-exchange information P[S' (D), R(d)] at data-exchange phase i, we can see that there are P x S processors working at compute phase i — 1 and there are P x R processors working at compute phase i. If the data-exchange is a conjunctive regular data-exchange, P[SC(D), R(d)], the number of working processors at compute phase i -— 1 will be P x (S — l) + 1. These can be confirmed by observing Figure 3.6. Similarly, for conjunctive regular data-exchange P[S (D), Rc(d)], the number of working processors at compute phase i will be P x (R - 1) + 1. Structured representation provides information about uneven allocation at an abstract level. It provides the number of proces- sors which participate in the computation at each compute phase. It does not provide the workload information for each of the working processors. Odd-even cyclic reduction is a commonly used method for scientific applications. A well known parallel algorithm for tridiagonal linear systems is based on odd-even cyclic reduction [33]. From Figure 3.6 we can see that the odd-even cyclic reduction application contains two different structures. The upper half of Figure 3.6 is one structure and the lower half is another structure. This is a common phenomenon of scientific applications. Most of the frequently used scientific applications are combinations of a few simple structures, which we call computation models. The information in computation models can be used in general applications. Studying computation models will lead to a general algorithm design guideline for scientific applications. 
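The bookkeeping described above — reading off the number of working processors from a data-exchange term — is mechanical enough to encode directly, as shown in the sketch below. The encoding, its field names, and the example term (read here as the first reduction step 7[3_c(1), 1(3)] of Figure 3.6) are assumptions of this illustration, not a notation defined in the text.

    # one possible encoding of the data-exchange notation P[S(D), R(d)];
    # sender_conj / receiver_conj mark the conjunctive cases P[S_c(D), R(d)] and P[S(D), R_c(d)]
    from dataclasses import dataclass

    @dataclass
    class DataExchange:
        P: int            # number of pattern instances
        S: int            # senders per pattern, each of degree D
        D: int
        R: int            # receivers per pattern, each of degree d
        d: int
        sender_conj: bool = False
        receiver_conj: bool = False

        def working_before(self):   # processors active in the compute phase before the exchange
            return self.P * (self.S - 1) + 1 if self.sender_conj else self.P * self.S

        def working_after(self):    # processors active in the compute phase after the exchange
            return self.P * (self.R - 1) + 1 if self.receiver_conj else self.P * self.R

    step = DataExchange(P=7, S=3, D=1, R=1, d=3, sender_conj=True)
    print(step.working_before(), step.working_after())   # 15 processors before, 7 after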
In the next section we will present some of the identified computation models. 3.2 Parallel Computation Models Structured representation is a precise and perspicuous medium to express computation mod- els. With structured representation, computation models can be defined at different levels for different purposes. In our study, we follow the conventional research of parallel paradigms 37 7[3 (1) 1(3)] I [3cm 1(3)] 1 [1(2).2(1)] * 2[1(2).2(1)] “1(2). 2(1)] E((2*-‘-1)I3.(1).1(3)1+ X.)+ 2?;1-"(2‘-*[1(2).2(1)1+ x.) k = [109(Nll = 4 Figure 3.6: Odd-Even Cyclic Reduction 38 and define the computation models from a coarse point of view. We identify the computa- tion models based on design techniques, communication patterns, characteristies, sources of degradation, and define them precisely with structured representations. 3.2.1 Local Computation Model Parallelism can be achieved by dividing a given problem into pieces and solving these pieces concurrently. In general, these pieces need to communicate in order to exchange data and coordinate their activates. However sometimes the problem can be divided into almost com- pletely independent subtasks where the exchange of the intermediate results is negligible. Some computational problems have this ideal parallel property. The use of the Local Com- putation Model is natural in solving these problems on a multicomputer. The characteristic of the local model is asynchrony. _ILJLJLJIIULI Figure 3.7: Local Computation Model Local computation for algorithms with data parallelism can be illustrated by matrix multiplication. Given n x n matrices A and B, we want to compute their product, C. Assuming N is less than n, we partition matrix B by columns evenly into N submatrices Bo, Bl, ..., B~-1. Each node is loaded with one submatrix B.- (i = 0, ..., N - 1) together with the whole matrix A. The resultant products C.- = A x B,- are then obtained on each processor independently. The common input A shows another characteristic of the local computation model. The processors give different parts of the output, but they may share some input. The Homotopy method [9] (see also Section 4.3) solves systems of equations by curve tracing. To solve a non-trivial equation P(Z) = 0, a homotopy function H (Z,t) is defined as H(Z,t) = (1 — t)Q(Z) + tP(Z) t 6 [0,1] (3.4) It can be seen from Eq.(3.5) that when t = 0, H(Z, 0) = Q(Z) and when t = l, H(Z,1) = P(Z). Mathematical results have shown that if P(Z) 6 C", Q(Z) E C’, then P(Z) = 0 can be solved by tracing the solution set of the the following equation H(Z,t) = o, (3.5) i.e., the equation P(Z) = 0 can be solved by 39 o solving the equation Q(Z) = 0, a following the solution curves of Eq.(3.5) from t = 0 to t = l. Each’solution of Q(Z) will lead to a solution curve of H (Z,t). By broadcasting the homotopy function to each of the processors and by starting them from different solutions of Q(Z), the solution curves of Eq. 3.5 can be traced locally on each processor. There is no communication needed for coordination. The curves are traced by following discrete points. The amount of computation on each point and the points themselves is determined at run time. The local computation model is a special case of compute-exchange computation where the data-exchange phase does not exist. 3.2.2 Global-Exchange Computation Model In the global-exchange computation model each processor executes a computational phase and then performs a total data-exchange. 
At each data-exchange phase, each processor sends messages to all the other processors and receives messages from all the other processors. The communication requirement at each data-exchange phase is high. The Jacobi iterative method for solving linear systems is an application which is based on global-exchange computation. The characteristic of global-exchange computation is synchronism. If one processor does not finish its computation in the current compute phase, then no other processor can enter the next compute phase. This synchronization requirement and the high communication overhead make designing efficient algorithms for global-exchange computation applications very difficult. Figure 3.8 depicts the global-exchange computation model and gives the structured representation for this model.

Figure 3.8: Global Exchange Computation Model (structured representation $\sum_i \left([N(N), N(N)] + X_i\right)$)

3.2.3 Compute-Aggregate-Broadcast Computation Model

The Compute-Aggregate-Broadcast (CAB) computation model was first proposed in [44]. It is composed of three basic phases bearing those names: a compute phase which performs the computation, an aggregation phase which combines local data into one or a few global values, and a broadcast phase which returns global information back to each processor. This model is characterized by synchronous communication. The synchronism requirement may be caused by data dependency or may be demanded by global control for better performance. The compute phase varies widely from application to application. For a multicomputer we would like the compute phase to be dominant. The aggregation phase can be a tree-based computation that combines data from the processors and produces a single piece of global information, such as the convergence signal for solving PDE problems, or it can be a data-gathering communication pattern, in which all the computed results are sent to a special node that processes them and coordinates the next round of computation.

Both the global-exchange computation model and the compute-aggregate-broadcast computation model are commonly used for iterative methods. Comparing the two, the CAB computation model has a lower communication requirement. However, it also suffers degradation caused by uneven allocation: at some compute phases only one processor works and all the others are idle.

Figure 3.9: CAB Computation Model
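A serial sketch of a CAB-style iteration is given below: each logical node performs a Jacobi-type sweep on its slice of a 1-D grid (compute), the local residuals are combined into one global value (aggregate), and the resulting continue/stop decision is returned to every node (broadcast). The model problem, the node count, and the use of numpy are assumptions of this illustration only.

    import numpy as np

    def cab_jacobi(u, nodes=4, tol=1e-8, max_iter=10_000):
        n = len(u)
        chunks = np.array_split(np.arange(1, n - 1), nodes)   # interior points per logical node
        for _ in range(max_iter):
            new = u.copy()
            local_res = []
            for idx in chunks:                                 # compute phase: X_i on each node
                new[idx] = 0.5 * (u[idx - 1] + u[idx + 1])
                local_res.append(np.max(np.abs(new[idx] - u[idx])))
            done = max(local_res) < tol                        # aggregate phase: one global value
            u = new                                            # broadcast phase: decision to all nodes
            if done:
                break
        return u

    u0 = np.zeros(33); u0[-1] = 1.0          # fixed boundary values, zero interior guess
    print(round(cab_jacobi(u0)[16], 4))      # converges toward the linear profile, about 0.5

In a multicomputer implementation the aggregation would be a tree reduction or a gather to a special node, and the broadcast would return the single flag; the compute phase is the only part that scales with the problem size.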
This model is also the basis for several classes of parallel algorithms, including a number of sorting and searching algorithms [48]. In general the recursive divide—and-conquer computation model has two shortcomings for multicomputers: l. Synchronization: The algorithm requires synchronization at each merge stage. 2. Degree of Parallelism: The algorithm has a dynamic degree of parallelism. I I \ I \‘ I \ I , \ ,’ ‘s I \ ‘ I \ I \ I ‘ I \ I ‘ I \ I ‘ I \ I ‘t‘ I \ I. s ”’ Ic-l . 2k-1 . 2(2'I1(2),2(1)1+ x.) + 2 (2"‘"'+"[2(1),1(2)1 + x.), k = IogN i=0 i=k Figure 3.10: Divide-and-Conquer Compute-Exchange Computation The following result shows the beauty of recursive divide-and-conquer technique [5]. Proposition 1 A non-singular triangular linear system A: = b can be solved in 0(log2 n ) time, using n3 processors, where n is the dimension of A. 43 Proof: It is sufficient to prove that A"1 can be found in 0(log2 n) time with n3 processors. We partition the matrix A into blocks: A; 0 A: [A2 As], i (3.6) where A; is of size [n/ 2] x [rt/2]. Since A, and A3 are lower triangular matrices. Moreover, it is easily shown that A?! 0 ] (3.7) A'1 = [ —A;‘A2A," .4? Based on the above decomposition, we obtain the following algorithm. Given an n x n triangular matrix A: 1. If n = 1, then A'1 can be found directly. 2. If n > 1, then partition A as indicated above and do the following: (a) Find the inverse of A1 and A3 concurrently. (Notice that A1 and A3 are lower triangular matrices, they can be inverted by using the same algorithm recursively.) (b) Multiply A; by A5,“ on the left to obtain A; 1A2. (c) Right-multiply the result of (b) by Afl. Both steps (b) and (c) take 0(log n) time using n3 processors. Thus, if T(n) denotes the time required by the algorithm for inverting a matrix of dimensions n x n, we have T(n) = T( [n/2]) + 0(log n), which yields T(n) = 0(log2 n) computation using n3 proces- sors. C] The method presented here does not make things simpler than the presumably easier task of solving the system A2: = b. Furthermore, if communication overhead is properly taken into account, the time requirement can be much higher than 0(log2 n) for certain cOmputer architectures. For this reason, and in view of its requirement for a large number of processors, the algorithm is theoretically interesting but impractical. A more practical method will be described later in the pipeline model. Partition models are more practical than recursive divide-and-conquer models, especially for N < n. Three partition methods for solving tridiagonal linear systems on multicomputers will be given in Chapter 5. 44 3.2.5 Domain Decomposition Model Domain decomposition reduces the original large problem into a set of subproblems through decomposition of the problem domain. This technique was originally developed to reduce overall computation time and storage requirements on sequential machines. However, it becomes much more important in parallel algorithm design. The decomposed subproblems may be independent of each other or may be dependent in some way. The goal of the design is to reduce the dependence of the subproblems in order to achieve maximal parallelism. The domain decomposition model is well suited to the finite difference method for solving PDEs where each subproblem is dependent on a fixed number of others [19], [31]. V s I w \ I V V I“ ’I \ IA\ I, ‘s\ I s I“ I \\‘ 0‘ ”(4)33” ‘ I \ ’,’ ’I \ '3' ‘2." 
3.2.5 Domain Decomposition Model

Domain decomposition reduces the original large problem into a set of subproblems through decomposition of the problem domain. This technique was originally developed to reduce overall computation time and storage requirements on sequential machines. However, it becomes much more important in parallel algorithm design. The decomposed subproblems may be independent of each other or may be dependent in some way. The goal of the design is to reduce the dependence of the subproblems in order to achieve maximal parallelism. The domain decomposition model is well suited to the finite difference method for solving PDEs, where each subproblem depends on a fixed number of others [19], [31].

Figure 3.11: Domain Decomposition Computation Model (structured representation $\sum_i\left([N(4), N(4)] + X_i\right)$)

One problem faced by domain decomposition is how to partition the domain. In order to solve Laplace's equation,

$$\frac{\partial^2 z}{\partial x^2} + \frac{\partial^2 z}{\partial y^2} = 0, \qquad (3.8)$$

by the finite difference method, we decompose the xy domain into subregions and assign one region to each processor. The regions can be chosen either as strips or as rectangles. With strip decomposition, each processor communicates with two neighbors. With rectangular regions, each processor has to communicate with three or four neighbors, but has shorter messages. The optimum decomposition depends on the computation/communication ratio of the underlying machine and the number of grid points in each region. Domain decomposition computation is characterized by easily achieved load balance and by neighboring communication. However, if global convergence checking is adopted for the finite difference method, then global communication is needed.

Problems can also be solved in parallel by decomposing the solution domain. The bisection method for solving eigenvalue and polynomial problems belongs to this category. This kind of decomposition has a different communication pattern, and we do not consider it part of the domain decomposition computation model.

3.2.6 Pipeline (Ring) Computation Model

Pipeline (ring) computation increases concurrency by dividing a computation into a number of steps and allowing a number of tasks in various stages to be executed at the same time. The characteristic of this model is that the intermediate result moves in a fixed direction and the final result emerges in the last stage. Communication is automatically overlapped with computation; this is the main advantage of this model [34]. One primary consideration in pipelined computation is to keep the processing speeds of all stages roughly equal; otherwise, the slowest stage becomes the bottleneck of the pipeline.

There are two types of pipeline computation models, the functional pipeline model and the data pipeline model. Traditionally, the term pipelining refers to function pipelining: different stages in the pipeline perform different functions, and, as data flows through the stages, it is modified along the way. In the data pipelining model, processors in the pipeline perform the same function in a synchronous fashion. Back substitution is a well-accepted pipeline paradigm for solving triangular linear systems. Under the assumption that A is a lower triangular matrix, the i-th equation of the system Ax = b is

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{ii}x_i = b_i. \qquad (3.9)$$

The following parallel version of the back substitution algorithm employs n processors. Suppose that at the beginning of the i-th stage, the values of the variables $x_1, \ldots, x_{i-1}$ and the expressions $a_{j1}x_1 + \cdots + a_{j,i-1}x_{i-1}$ for each $j \ge i$ are available. Then the i-th processor evaluates $x_i$ by solving Eq. (3.9):

$$x_i = \frac{1}{a_{ii}}\left(b_i - a_{i1}x_1 - \cdots - a_{i,i-1}x_{i-1}\right). \qquad (3.10)$$

Finally, each processor j, with $j \ge i+1$, evaluates the expression $a_{j1}x_1 + \cdots + a_{ji}x_i$ by adding $a_{ji}x_i$ to the previously available expression $a_{j1}x_1 + \cdots + a_{j,i-1}x_{i-1}$. The algorithm terminates at the end of the n-th stage, when all the variables $x_1, \ldots, x_n$ have been computed. Clearly, the parallel time required for each stage is constant. Therefore, the total time required by this version of back substitution is O(n) using n processors. Here the communication cost is excluded.
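A serial simulation of this data pipeline is sketched below (numpy assumed): the loop index plays the role of the pipeline stage, and the array named partial holds the running expressions that each later processor would keep locally.

    import numpy as np

    def pipelined_forward_solve(A, b):
        """Serial simulation of the n-stage data pipeline of Eqs. (3.9)-(3.10)
        for a lower triangular system Ax = b."""
        n = len(b)
        x = np.zeros(n)
        partial = np.zeros(n)              # partial[j] = a_j1*x_1 + ... accumulated so far
        for i in range(n):                 # stage i of the pipeline
            x[i] = (b[i] - partial[i]) / A[i, i]
            # each processor j > i adds a_ji * x_i to its running expression
            partial[i + 1:] += A[i + 1:, i] * x[i]
        return x

    A = np.tril(np.random.rand(5, 5)) + np.eye(5)
    b = np.random.rand(5)
    assert np.allclose(A @ pipelined_forward_solve(A, b), b)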
Connecting the output of a later stage to the input of an earlier stage of a pipeline model, we get the ring computation model. The ring model is commonly used in iterative methods; wrapping the pipeline around allows successive iterations to continue.

Figure 3.12: Pipelined Computation Model

3.2.7 Recursive Doubling Computation Model

The best known example of parallelism is the computation of the summation $a_0 + a_1 + \cdots + a_{n-1}$ in $O(\log n)$ time by n/2 processors. It follows the process of recursive doubling, which consists of the evaluation of subterms of size $2^i$ for $i = 0, \ldots, \log(n/2)$. Computation of the sum and of all the partial sums leads to two different communication patterns. Pattern 1 can be seen as the conquer part of the recursive divide-and-conquer model, known as tree reduction. Pattern 2 is quite different (see Figure 3.13); it is called a prefix computation by some authors. Recursive doubling is applicable to a large number of applications. The following result is due to Kogge and Stone [35].

Proposition 2 The recursive function

$$x_1 = b_1, \qquad x_i = F(x_{i-1}) = f\!\left(b_i, g(a_i, x_{i-1})\right), \quad 2 \le i \le n \qquad (3.11)$$

can be computed by a recursive doubling parallel algorithm in $O(\log n)$ time, where $b_i$ and $a_i$ are arbitrary constants and f and g are index-independent functions that satisfy the following three restrictions:

1. f is associative: $f(x, f(y, z)) = f(f(x, y), z)$.

2. g distributes over f: $g(x, f(y, z)) = f(g(x, y), g(x, z))$.

3. g is semiassociative, that is, there exists some function h such that $g(x, g(y, z)) = g(h(x, y), z)$.

A large number of linear recursive functions can be reduced to the above form by using a matrix representation. However, this reduction very often increases the sequential computation count. Functions suitable for recursive doubling are given in [57].

Proposition 3 The LU decomposition of an $n \times n$ tridiagonal linear system can be computed in $O(\log n)$ time, using n processors.

Proof: Let A be a tridiagonal matrix of order n:

$$A = \begin{bmatrix} b_0 & c_0 & & & \\ a_1 & b_1 & c_1 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-2} & b_{n-2} & c_{n-2} \\ & & & a_{n-1} & b_{n-1} \end{bmatrix}. \qquad (3.12)$$

When A = LU and L is chosen to have its diagonal elements equal to unity, the matrices L and U take the forms

$$L = \begin{bmatrix} 1 & & & \\ m_1 & 1 & & \\ & \ddots & \ddots & \\ & & m_{n-1} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} u_0 & c_0 & & \\ & u_1 & c_1 & \\ & & \ddots & \ddots \\ & & & u_{n-1} \end{bmatrix}.$$

Note that the superdiagonal elements of U are identical to those of A. Thus we only need to compute $m_i$ and $u_i$ to obtain this factorization. Observing the equation A = LU, we can find the recursive relation

$$u_0 = b_0, \qquad u_i = b_i - \frac{a_i c_{i-1}}{u_{i-1}}, \qquad m_i = \frac{a_i}{u_{i-1}}, \qquad 1 \le i \le n-1. \qquad (3.13)$$

To solve for $u_i$ and $m_i$ in parallel, Stone introduced the quantities $q_i$ ($-1 \le i \le n-1$) that satisfy the recurrence equations

$$q_{-1} = 1, \qquad q_0 = b_0, \qquad q_i = b_i q_{i-1} - a_i c_{i-1} q_{i-2}. \qquad (3.14)$$

Let $u_i = q_i / q_{i-1}$ for $i \ge 0$ and write Eq. (3.14) in matrix form:

$$\begin{bmatrix} q_i \\ q_{i-1} \end{bmatrix} = \begin{bmatrix} b_i & -a_i c_{i-1} \\ 1 & 0 \end{bmatrix}\begin{bmatrix} q_{i-1} \\ q_{i-2} \end{bmatrix} \qquad (3.15)$$

or

$$Q_i = B_i Q_{i-1}, \qquad 1 \le i \le n-1, \qquad (3.16)$$

where $Q_i = (q_i, q_{i-1})^T$ and $Q_0 = (q_0, q_{-1})^T = (b_0, 1)^T$. In this form all $Q_i$ ($1 \le i \le n-1$) can be expressed in terms of $Q_0$:

$$Q_i = B_i B_{i-1} \cdots B_2 B_1 Q_0, \qquad i = 1, 2, \ldots, n-1. \qquad (3.17)$$

Thus each $Q_i$ can be computed by evaluating the product of $2 \times 2$ matrices in parallel, and it is not necessary to compute $Q_i$ only after $Q_{i-1}$ has been computed. Figure 3.13 indicates the computation and communication process for n = 8. □

Figure 3.13: Recursive Doubling Computation Model (structured representation $\sum_{i=1}^{k}\left((N - 2^{i-1})[1(1), 1(1)] + X_i\right)$, k = log N)
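The sketch below evaluates the recurrence of Eqs. (3.14)-(3.17) by organizing the 2x2 products as a recursive-doubling prefix computation, the pattern of Figure 3.13. It runs serially; on N nodes each doubling step would become one data exchange. The sample coefficients are arbitrary, and numpy is assumed.

    import numpy as np

    def recursive_doubling_prefix(mats):
        """All left prefix products P_i = B_i @ B_{i-1} @ ... @ B_1, computed
        with the recursive-doubling pattern (serial simulation)."""
        pref = list(mats)                        # span-1 products
        n, offset = len(mats), 1
        while offset < n:
            nxt = list(pref)
            for i in range(offset, n):           # these products double their span
                nxt[i] = pref[i] @ pref[i - offset]
            pref, offset = nxt, offset * 2
        return pref

    # tridiagonal LU via Eq. (3.14): q_i = b_i q_{i-1} - a_i c_{i-1} q_{i-2}
    a = np.array([0.0, 1.0, 1.0, 1.0, 1.0])      # sub-diagonal (a_0 unused)
    b = np.array([4.0, 4.0, 4.0, 4.0, 4.0])      # diagonal
    c = np.array([1.0, 1.0, 1.0, 1.0, 0.0])      # super-diagonal
    B = [np.array([[b[i], -a[i] * c[i - 1]], [1.0, 0.0]]) for i in range(1, len(b))]
    Q0 = np.array([b[0], 1.0])                   # (q_0, q_-1)
    Q = [P @ Q0 for P in recursive_doubling_prefix(B)]
    u = [b[0]] + [qi[0] / qi[1] for qi in Q]     # u_i = q_i / q_{i-1}, Eq. (3.13)
    print(np.round(u, 6))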
In the hypercube connection topology, the communication pattern shown in Figure 3.13 can be mapped such that the dilation cost is no greater than 2 [33]. This means that the communication is either neighboring communication or passes through one intermediate node.

3.3 Application Considerations

We have identified seven computation models for parallel computing: the local computation model, global-exchange computation model, compute-aggregate-broadcast computation model, divide-and-conquer computation model, domain decomposition computation model, pipelined computation model and recursive doubling computation model. We have found that various scientific applications are combinations of these models. Most of the examples given in this chapter are simple applications; complicated applications may involve more than one computation model. For designing efficient parallel algorithms we have to understand the problem under consideration, which includes understanding the given application, the applied mathematical method, and the underlying computer architecture. The computation model concept provides a way to gain an insightful view and to understand the relation between applications and architectures. Based on the performance information of computation models, parallel algorithms may be deliberately switched from one model to another to achieve better performance.

The performance of computation models is problem dependent as well as machine dependent. It depends on the topology, the switching technique and the routing scheme of the underlying multicomputer. Topology determines which nodes are directly connected by physical links. The hypercube is the popular topology for first generation multicomputers. The 2-D mesh topology has become a strong alternative to the hypercube in second generation multicomputers. The switching technique determines how two non-adjacent nodes communicate. First generation multicomputers adopted the store-and-forward switching technique, while second generation multicomputers use circuit switching or the wormhole routing switching technique. The routing scheme decides the sequence of channels (called paths) used for communication between any pair of nodes. In first generation multicomputers, the routing scheme is the shortest path. In second generation multicomputers, a fixed-path routing scheme is used. Different multicomputers may adopt different topologies, switching techniques and routing schemes, so performance considerations may differ from machine to machine, and computation models should be studied on different multicomputers. Most of my experience is on a first generation NCUBE multicomputer. It is a hypercube multicomputer with store-and-forward switching and shortest path routing.

Chapter 4

APPLICATION-DRIVEN ALGORITHM DESIGN

Designing efficient parallel algorithms is difficult. The structured representation and computation model concepts provide a guideline for efficient algorithm design. With structured representation, structured rethinking becomes possible. The bottlenecks of an algorithm are easier to find and high parallelism is easier to explore. The performance information of the computation models suggests which structure is better for a given multicomputer. Therefore, structured design becomes possible.
We have used structured representation and computation models on several applications. These applications include solving electrical power flow problem, solving tridiagonal linear systems and solving dense linear systems. The first application is one of the most import problems in the electrical power field. The second application is a fundamental problem of scientific computing. The structured representation and computation model concept helps us in changing our design from one structure to another. For solving the power flow problem, we started with a local computation design, and changed to a global-exchange computation design. After studying the structure of the global-exchange algorithm, we adopted a compute-aggregate- broadcast design which gives the best performance [59]. The algorithm design for the power flow application is mainly influenced by the chosen mathematical method, the homotopy method. This kind of algorithm design is called application-driven algorithm design. We also developed three algorithms for solving tridiagonal systems. They are all based on the same matrix modification formula. The algorithm design for solving tridiagonal systems is mainly influenced by the topology of the underlying multicomputer [60]. This kind of algorithm design is called architecture-driven algorithm design. Some of the design changes which we have made are only suitable for the hypercube topology. More information concerning the power flow application can be found in the rest of this chapter and a detailed study of solving tridiagonal systems can be found in the next chapter. 52 53 4.1 Preliminary Determining steady state solutions of large-scale interconnected power systems is by far one of the most important problems facing theoretical as well as applied researchers in the field. It has been formulated mathematically [64] and commonly referred to as the power flow, or load flow, problem. The most frequent and acceptable numerical procedures for finding solutions to the load flow problem are variants of the Newton method. It is well known, however, that the Newton method does not always converge to a solution, and if it does, there is no guarantee that it would find all possible solutions to the load flow problem. The Newton method is a local algorithm because it depends on the initial guess and, therefore, it is dependent on the type of nonlinearity of the governing equations, i.e., the load flow equations. Our work aims at solving the load flow problem using the globally convergent probability- one homotopy method originally proposed in [9]. The homotopy method has been shown to have the following distinct features [52], [45], [51], [10], [37], [38]. First, numerical compu- tation of the homotopy method can be systematically implemented. Second, the homotopy method is globally convergent, i.e., one may choose any initial guesses and the homotopy method is guaranteed to converge to all solutions with probability one. However, for solv- ing large-scale load flow problems, the computation requirements of the homotopy method increases exponentially. The exponential increasing of computation requirements is the case for most scientific and engineering applications. Large-scale scientific computing is one of the major driving forces behind parallel computation, but the move into parallel territory requires new conceptual strategies for formulating a problem, and new algorithms to shape the problem for parallel computers. 
This implies that a brute-force approach to solving a large—scale problem, such as the load flow of power systems, on parallel computers does not render the problem tractable. The power flow problem is a good example for illustrating the design process of large scale scientific computing and how the structured representation and computation model concept can be used to help the design. The design process of the power flow application consists of four parts - finding an appropriate mathematical model for the application, choosing a suitable numerical method to solve this model, modifying the mathematical method to reduce the computation, and changing the algorithm design to achieve high performance. 54 4.2 Power Flow Application The power flow, or load flow, problem has been modeled on the sinusoidal steady state nodal analysis of circuit theory. Figure 4.1 depicts the general model of a power system. It consists of three main components: generators (suppliers), loads (consumers), and a transmission network that connects generators and loads. Consumers demand electrical power to be supplied. Suppliers generate the power and transmit the power to the consumers. The demand changes with time. The suppliers, therefore, have to vary their production to fit the demand. Changing supply according to demand is done not only for economic reasons, but, more importantly, for safety considerations. In order to operate a power system safely, power generation must satisfy a set of inequality constraints. A solution which satisfies this set of constraints is called a steady state solution. The power flow problem is a matter of determining the steady state solutions. 1 l 1? Generators: 0 (Pv'busesl: 'Il‘ansmission Loads (PQ-buses) 0 Network 0 o o - +1 Slack Bus—l- '- Figure 4.1: The Main Components of Power System Let the power system have N +1 buses or nodes with a voltage reference (or datum). Let the load buses be subscripted from 1 to NL, and let the generator buses be subscripted from N1,“ to N. We choow to subscript the slack bus by 0 and denote the N x N node or bus complex admittance matrix by [Y], where its lei“ component is y)..- = G)..- + j E)..- (j = \/-—1). Let E). denote the complex voltage of bus k. In polar coordinates it is 55 E). = Vpeje‘, (4.1) where V). represents its voltage amplitude and 6). represents its phase angle. In rectangular coordinates, it becomes E). = VwosOk + ijsinOk = X]: + in, (4.2) where X). represents the real part of the complex voltage E), and Y), represents its imaginary part. Let S), = p). + jq). be the complex injected power at node 1:, then the injected complex power balance equation may be written as [13-] [Y] E — 5: = 0 (4.3) or N N E); Z yin-E: + E; Emu-E; — 2P]: = 0, k = I, 2, ..., N (4.4) i=0 i=0 N N El: 2 yziE: - E; Z ykiEi — 2lec z 0) k =1: 2, "'3 NL (4'5) i=0 i=0 as; - v: = o, k = NL+1,...,N, (4.6) where the superscript * denotes the complex conjugate. The above system of equations is a polynomial system. They are the full load equations in complex form. The full load equations can be represented in other forms. They can be written in rectangular coordinates as, N Z[G,.,-(X,.X,- + my.) + B.,(X,-Y,. — X).Y.-)] - P). = 0, k = 1, 2, N (4.7) i=0 xi + Y: - v: = o, k = 1,2, ...,NL (4.8) N Banana. - in’.) — B),,-(X.-X,, + 15.14)] — Q), = o, k = NW, N. 
(4.9) i=0 Or, it can be written in equivalent rectangular coordinate equations as N N XI: 2(GHXI: - BkiYi) + Yk 2(BkiXi - Gaye) " P1: = 0, k = 1,2, mr N (4-10) i=0 i=0 xi + Y: — V,2 = o, k = 1,2, ...,NL (4.11) N N Y): 2(Gkixi - Buys) — XI: 2(Blcixi + Guys) - Q]: = 0, k = NL+1, ..., N, (4-12) i=0 i=0 56 where the additional equations (4.8) and (4.11) represent the constraints on the amplitude of the voltage of the PV-buses. The full load equations also can be represented in terms of trigonometric functions, called standard power flow equations, as N 2 V;,V.-(G'k.-cosOk,- + BkgsinOkg) — P). = 0, I: = 1, 2, ..., N (4.13) i=0 N Z VLVXGIn-sinOh- - thosOH) -- Q); = 0, lc = NL+1, ..., N. (4.14) i=0 There are three types of buses in a power system network: 1. PQ bus: a bus where the real and reactive powers are specified. 2. PV bus: a bus where the real power and the voltage amplitude are specified. 3. A Slack bus: a fictitious concept whereby one of the generator buses has only its complex voltage specified. One purpose of this bus is to guarantee that the total power injection into the network equals the total power consumed. It is conventional to model loads as PQ-buses, one generator as a slack bus, and the rest of the generators as PV buses, as depicted in Figure 4.1. With some assumptions, two simplified models are also proposed for power flow problems [51], which can be used to obtain approximate solutions with reduced computation. [4.3 Homotopy Method The homotopy method [9] (see Section 3.2.1) is able to find all the complex roots of a system of polynomials with probability one. Let P(x) = 0 denote a system of n equations in n unknowns. ’ P1021, 22, ...Zn) P(Z)= P’(z""""’") =0, (4.15) _ P..(21,zg, ...zn) 4 where Z is an n-dimensional complex vector (21, ..., z“). The homotopy method solves this system by solving a trivial system 57 0(2) = 0. (4.16) and then follows the solution curves of H(Z,t) = (1 — t)Q(Z) + tP(Z) = 0, (4.17) from t = 0 to t = 1. If Q(Z) = 0 is chosen correctly, the following properties hold [39]: o Triviality The solutions of Q(Z) = 0 are known. a Smoothness The solution set of H (Z,t) = 0 for 0 S t < 1 consists of a finite number of smooth paths, each parameterized by t in [0,1). 0 Accessibility Every isolated solution of H (Z, 1) = P(Z) = 0 is reached by some path originating at t = 0. It follows that this path starts at a solution of H (Z,0) = Q(Z) = 0. Let the degree of P,- be d, for 1 S i S n. In general, Eq.(4.15) has at most d = d; x d; x x d,, isolated roots. It has been proven [9] with random complex numbers b1, b2, ..., bn that 2;“ —- b1 62(2) = = o (4.18) 2:" — b,, can be used to find all the solutions of (4.15) by tracing the solution curves of equation (4.17) from t = 0 to t = 1 with probability 1. The robustness, stability, and accuracy properties of the homotopy method make it the best known choice for power flow problems. The number d = 2 x 2 x x 2 gives the upper bound of the possible solutions of the power load system (4.4)—(4.6). In practice, power flow systems have many fewer less solutions than this upper bound. Many of the homotopy curves cannot reach t=1, because they go to infinity when they approach t=l (see Figure 4.2). These divergent curves generally require more computation than convergent curves and consume most of the computation time. Not only does the system of equations (4.4)-—(4.6) have many fewer solutions than the upper bound, but also, in practical consequences, the majority of polynomial systems have the same property. 
These kinds of systems are called deficient by Li et a1. [37] and called m-homogeneous by Morgan and Sommese [43]. 58 Based on [43], we conclude that equation (4.17) is 2-homogeneous and, instead of 22’" solutions, it has at most [N : 2N ] (i.e., N out of 2N) solutions. The following modified initial function is proposed to trace the [N : 2N ] possible solutions, and the correctness of the function is followed by the theoretical results of the general deficient and m-homogeneous systems [37] [51]. For the general complex form (4.4) - (4.6), the initial polynomial system is chosen as (E). + a.)(z£§'__, y;,E: + 3).), k = 1, 2...N, Q(Z) = (2.1-:1 ykiEki + aN+k)(E£ + .BN+I:), ’6 = 1,2. ..., NL, (4-19) (B). + 0N+Nn+k)(Ei + BNsl-NL'H‘)’ 1‘ = NL+1, "'1 N where (1;, ,6;,i = 1, ...,2N are random complex scalars. For “almost all” (01, ..., am) 6 CZN, (fir. "'3162N) 6 CzN, Eq. (4.19) has [N : 2N] solutions. The equations of the polynomial system Q(Z) are products of linear equations. Q(Z) is easy to solve. It has [N : 2N ] distinct complex roots and all the solutions of equations (4.4) — (4.6) can be obtained by tracing the solutions curves of equation (4.17). With the specified initial system of equations (4.19), the computation time has been reduced considerably. That is one reason for choosing the complex full flowed model for our study. 4.4 Implementation Issues and Results With the homotopy method, the solution process for system (4.4) - (4.6) is the process of curve tracing. We need to trace the solution curves, which are also called homotopy curves, of the homotopy function (4.17), starting from t = 0 and gradually increasing t until it becomes 1. Let to,t1, ...,tk be the set of sampling instants, where to = 0 and t), = 1. Define s.- = t,- — t,--1 as the stepping distance from t;-1 to the next sampling instant t,- for 1 S i S k. Given Z,(t,-_1), we have to solve Z,-(t.-) for t,- = t,-_1 + s,-. Several different techniques, such as Newton’s iterative method, the ODE-based method, and normal flow method [63] may be used for the curve tracing. Theoretically, all the homotopy curves can be traced concurrently and independently. Therefore, the power flow problem is readily solved by the local computation model. Though homotopy curves have been proven disjoint, it is possible that one curve merges to another curve [8] in the numerical implementation of curve tracing. This curve merging will lead to the Occurrence of curve missing, and consequently, solutions being missed. The stepping distances have to be very small to reduce the possibility of curve merging. Small 59 stepping distances increase the computation time significantly. The performance of the local computation approach is not very encouraging. H H(z,t) = (1-t)Q(Z)+ tP(Z> ‘ Some homotopy curves diverge End . 0 Points Q(Z)=o ° 2 P(Z)=O Start 9 9 Points : Some curves may erge 3 O / "V Figure 4.2: The Homotopy Method To reduce the computation cost of curve tracing, we adopt a dynamic stepping strategy [10]. Run time information is used to adjust the stepping distance to make it relatively large. A predictor-corrector method is adopted for curve tracing. This curve tracing method contains two steps, first predict the value Z (t1) based on the value and derivative at t.--1, then use Newton’s iterative method to correct the predicted value until it satisfies Eq. (4.17). If after a certain number (threshold) of iterations, the predicted value does not converge to a solution of equation (4.17), we say the tracing at t,_1 has failed. 
When this occurs, the stepping distance 3,- = t,- - t,-_1 is reduced and the procedure is repeated. If the predicted value converges to a solution within the threshold, the tracing is successful. The stepping distance is increased for the next step. The iteration threshold is an important parameter for the dynamic curve tracing method. Curve merging checks are also performed. We introduce a series of c merge checking instants, 0 = to < t; < t; < < t; < l, where these c merge 60 checking instants are a subset of the sampling instants. Note that tracing different homotopy curves dynamically may lead to different sets of sampling instants and a different number of sampling instants, lc. However, for the purpose of merge checking, these c merge checking instants must be included in any set of sampling instants. PE- .." ,’I I I ' \ she fig, :1’ \ ” Q‘ : ”333% “ WE‘} y 4 i. . [8(8).8(8)] "5‘J’ve‘g? $3; ‘gé £53.: ‘:\\\‘ , t ' ’9 It ‘ \ I‘ 559! .9cf‘ k1 fé’uifle‘s" . \ V“ a a¢.gg:‘e‘: : : :t’so 8 ‘ W5: I ’9: ”92‘ ‘ fla‘lzu \ 53: s: ‘4‘ I A‘ } I" T J aaagfi ,v‘fi [861.8(8)] I ‘a 1;. Figure 4.3: Global-Exchange Model for Merge Checking Merge checking requires global information. Communication becomes unavoidable at merge checking instants. This communication requirement changes our computation from a local computation to a compute-exchange computation. Initially, this compute-exchange computation is implemented in a synchronous fashion, which corresponds to the global- exchange computation model (see Figure 4.3). All the nodes synchronize at merge checking instants, and merge sort is used at these points to sort the intermediate values Z(t;'). If more than one curve has the same value, curve merging has occurred. The merged curves will then be retraced from t‘ _, with a more conservative iteration threshold. With the merge checking strategy, the stepping distance can be relatively large. If only a few retracings have been done, performance should be improved. This is true when the synchronous merge checking algorithm is implemented with a single processor. However, with more processors, the merge checking algorithm does not show any superiority over conservative tracing without merge checking. To analyze degradations of parallelism we use the notation introduced in Chapter 3 to write down the synchronous merge checking algorithm. With eight processors, it is 61 an +‘Z([8(8),8(8)1+X.-). (4.20) i=1 when no curve merging happens. When curve merging does occur at j of the c merge checking points, it becomes Xai +§([8(8),8(8)1+X.). (4.21) i=1 where X,- is the workload of the node with the heaviest load during the two contiguous synchronization. Note that we use dynamic stepping. Each curve, depending on its shape, may require a different number of sampling instances from t:_, to t:. X,- is the workload of the curve with the biggest number of sampling instances between t;_, and t2“. Equations (4.20) and (4.21) also show that when curve merging occurs, some nodes are busy doing retracing while all the others are waiting. This reveals that the load is not balanced between synchronization, and that curve retracing makes the load imbalance even worse. ’ 7 7 v v 3' 3' I I " a" ,o ’-o I ' a v ’ ’ I’ ’’’’’’ I I ’ I ooooo I I ' at o I ’ I ” """"" I [’I’a'o=:::" I I ’a"o’ D — )_1 E D — —. . — ‘ \\s~“ \ s \ “ \ \~ \ s‘ \ “ T} shjjih- Figure 4.4: Compute-Aggregate—Broadcast Model for Merge Checking To overcome load imbalancing, the algorithm is redesigned to be asynchronous (see Fig- ure 4.4). 
We choose one node as the checking node. All the other nodes work as the computing nodes. Computing nodes do the curve tracing and send the computed value at t; to the checking node. The checking node does the merge checking. When curve merging 62 45 I l I l l l ....... > 1. The PDD algorithm is a fast algorithm for a special case. We will not study the PDD algorithm in this chapter. Communication mechanisms have a great impact on the performance of multicomputers. Thus, the communication pattern of parallel algorithms should be carefully designed to reduce the communication complexity. The communication consideration may lead to design modification. In this study, the algorithms are evaluated based on both computation and communication complexities. 5.1 Linear Tridiagonal Solvers Two known methods will be used in our algorithms. For completeness, they are briefly presented in this section. 5.1.1 The LU Decomposition Method The LU decomposition method is also called the Gaussian elimination method without piv- oting. The LUP decomposition method is the Gaussian elimination method with pivoting. They are the most commonly accepted sequential algorithms and are used in the linear sys- tem package, LINPACK [13]. The pivoting referred to in this paper is partial pivoting. The LINPACK routine SGTSL solves Eq. (5.1) using the LUP decomposition method. 66 The algorithm first factors the matrix A into a product form A = LU (or PA = LU when pivoting is involved), (5.3) where L and U are lower and upper triangular matrices and P is the product of all of the row permutations. Then A1: = d (or PAx = Pd) can be solved by solving both Ly = d (or Ly = Pd) and Us: = y. (5.4) It can be easily verified that the LU and LUP algorithms can solve Eq. (5.1) in 8n — 7 and lln - 12 arithmetic operations for the non-pivoting and pivoting cases, respectively. 5.1.2 The Parallel Prefix Method The RCD method uses 0(log n) parallel computation time to solve Eq. (5.1) on a parallel computer with n processors (see Section 3.2). The PP method [16] modifies the RCD method for N processors, where N < n. Equation (5.1) can be written as a,-:c.--1 + b,-:c,- + car,“ = d,- , 0 S i _<_ n — 1 , (5.5) where 3.; = 2,, = 0. Solving for mg“, we have $i+1 = (--:i)$r + (-§)3;-1 + (Ei) = 051'; + fli-‘Pi—l + 7i- (55) Here c,- aé 0 is assumed. Equation (5.6) can be written in matrix form as $i+1 Ge ,3; 7i 1'; '1'; = 1 0 0 25.1 (5.7) 1 0 0 1 1 Define $i+1 X,“ = z,- with $4 = 23,. = 0 , (5.8) 1 J 0i [3; 7i - B.- = 1 0 o (5.9) 0 o 1, 67 Equation (5.6) becomes X5.“ = B;X,' , 0 S i S n—l , (5.10) and X.- (1 S i S n) can be expressed in terms of X0 as X; = B;_1Bg_2...BlBoXo , 1 S 2 S n . (5.11) Now solving Eq. (5.1) reduces to finding the partial products of matrices B; for 0 S i S n — 1. If N < n, evenly distribute matrices B.- to N processors, perform sequential matrix multiplication on each processor, then use the prefix method on N pro- cessors. There are (log N) + 1 parallel communication steps for implementing the prefix method with N processors. Let C? = B,-B,-+1...Bj. Then 0,1 is a 3 x 3 matrix. Since the last row of C? always equals ' [0,0,1], only six entries of C? need to be transferred at each parallel communication. Parallel recursive doubling computation can be used to obtain all the partial matrix products. The actual communication complexity of the PP method depends on the mapping of computation units to processors, the. interconnection topology of the multicomputer, and the underlying communication mechanism. 
In the case of the hypercube topology, the communication pattern shown in Figure (3.13) can be mapped such that the dilation cost is no greater than 2 [33]. 5.1.3 A Novel Matrix Partitioning Technique Our parallel algorithms are based on the divide and conquer model of parallel computation. In the divide part, matrix A is partitioned into submatrices. For convenience we assume that n = Nm. Matrix A in Eq. (5.1) can be written as A = A + AA, (5.12) where the submatrices, A.- (i = 0, ..., N — 1), are m x m tridiagonal matrices. Let e,- be a - column vector with its i“ (0 S i S n - 1) element being 1 and all the other entries being zero. We have 68 — I I I — — I I I -'l I I I I I .- --L-—-—---L-q ---Q~Jl ------- I.-. . I I I a I I ”4"". ------- Lu r-q--l6m---L-. . i i ' l : al'fi- : A: : : : AA: : :'x : I I I I I ' I I I I I I ‘ I I I ' I I I ' ~--.--r-------r-- "nut ------- Hahn-1 _ : : :AA-I _ ' : “1“?" F 55—1 63. AA = amema cm—lem—la a2me2m9 C2m—le2m-lan-I C(N—1)m—1€(N-1)m-1] - = VET, C(IV-im-l CT (N—l)m where both V and E are n x 2(N - 1) matrices. Thus we have A = A + VET. (5.13) Based on the matrix modification formula originally defined by Sherman and Morrison [54] for rank-one changes and generalized by Woodbury [14], Eq. (5.1) can be solved by z = A-ld = (A + VET)-1d = A-Id — A-IV( I + ETA-1V )‘1ETA-‘d . Note that I is a 2(N—1)x2(N—1) identity matrix and I + ETA" V is a 2(N—1)x2(N—1) band matrix with bandwidth 5. We introduce a 2(N—1)x2(N—1) elementary transformation matrix P such that (5.14) P z = (21, zo, 23, 22,..., 22N-3, 22”,-” )T for all z E RAN-1). (5.15) Based on the property that P'1 = P, Eq. (5.14) becomes 5 = A-ld — A“VP( P + ETA"VP)“ETA"1d. (5.16) 69 Note that Z = (P + ETA-IVP) is a 2(N — 1) x 2(N — 1) tridiagonal matrix. Let A5 = d, (5.17) AV = VP, (5.18) h = E7513, (5.19) Z = P + ETY, (5.20) Zy = h , (5.21) A2: = Yy. (5.22) From Eqs. (5.17)—(5.22), (5.16) becomes :1: = 51': - A3, (5.23) where A is an n x n matrix, Y, V, and E are n x 2(N— 1) matrices, Z and P are 2(N — 1) x 2(N — 1) matrices, h and y are 2(N - 1) X 1 vectors, and A2, 5:, and d are n x 1 vectors. Based on Eqs. (5.17)-(5.23), the computation required to solve Eq. (5.1) in a sequential computer is described below. In Eqs. (5.17) and (5.18), :‘E and Y are solved by the LU decomposition method. By the structure of A and VP, this is equivalent to solving A,- [§:(‘), v“), mm] = [d(‘), aimeo, c(,-+1),,,_1em-1] , i = 0,..., N—l . (5.24) Here we have a0 = c,._1 = 0, e0, em-1 6 R’”, 5:“) and d“) are the i‘h block of 5: and d, respectively, and v“), w“) are possible non-zero column vectors of the i“ row block of Y. Equation (5.24) implies that we only need to solve three linear systems of order m with the same LU decomposition for each i (i = 0, ..., N — 1). In addition, we can skip the first m - 1 forward substitutions for the third system since the first m — 1 components of the vector at the right hand side are all zeros. Equation (5.19) only picks 2(N — 1) specified components from the vector 5:. There is no computation or communication involved. The evaluation of Eq. (5.20) uses those possible non-zero entries of specified 2(N - 1) rows of Y together with P to form matrix Z. Again there is no computation or communication involved. 70 Equation (5.21) solves a 2(N —— l) x 2(N - 1) tridiagonal system, which is the major computation involved in the conquer part of our algorithms. Since Y has at most two non- zero entries at each row, the evaluation of Eq. (5.22) takes four arithmetic operations per row. 
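Before stating the complexity result, it may help to see the partitioning in action. The following NumPy sketch builds a small tridiagonal system, forms the splitting A = Ā + V Eᵀ at the block interfaces, and solves it through the sequence (5.17)-(5.23) in a dense, single-processor form (the permutation P is folded into a dense solve, so Z is not kept tridiagonal here); it then checks the result against a direct solve. All function and variable names are illustrative, and the dense linear algebra stands in for the block LU solves an actual implementation would use.

```python
import numpy as np

def partition_solve(a, b, c, d, N):
    """Solve a tridiagonal system via the A = A_bar + V E^T splitting.

    a, b, c are the sub-, main- and super-diagonals (a[0] = c[-1] = 0),
    d is the right-hand side, N the number of blocks with n = N*m.
    Dense illustration of Eqs. (5.17)-(5.23), not the distributed code.
    """
    n = len(b)
    m = n // N
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)

    # A_bar: block-diagonal part, obtained by zeroing the 2(N-1) couplings.
    A_bar = A.copy()
    V = np.zeros((n, 2 * (N - 1)))
    E = np.zeros((n, 2 * (N - 1)))
    for i in range(1, N):
        k = i * m
        A_bar[k, k - 1] = 0.0
        A_bar[k - 1, k] = 0.0
        V[k, 2 * (i - 1)] = A[k, k - 1]          # a_{im} e_{im}
        E[k - 1, 2 * (i - 1)] = 1.0
        V[k - 1, 2 * (i - 1) + 1] = A[k - 1, k]  # c_{im-1} e_{im-1}
        E[k, 2 * (i - 1) + 1] = 1.0
    assert np.allclose(A, A_bar + V @ E.T)

    x_tilde = np.linalg.solve(A_bar, d)          # A_bar x_tilde = d
    Y = np.linalg.solve(A_bar, V)                # A_bar Y = V
    h = E.T @ x_tilde
    Z = np.eye(2 * (N - 1)) + E.T @ Y            # I + E^T A_bar^{-1} V
    y = np.linalg.solve(Z, h)
    return x_tilde - Y @ y                       # Eq. (5.14)

rng = np.random.default_rng(0)
n, N = 24, 4
b = 4.0 + rng.random(n); a = rng.random(n); c = rng.random(n)
a[0] = c[-1] = 0.0
d = rng.random(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(partition_solve(a, b, c, d, N), np.linalg.solve(A, d)))
```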
Based on the above observations, together with a careful scaling process, the compu- tational complexity required to solve Eq. (5.1) in a sequential processor is stated in the following theorem. Theorem 4 Equation (5.1) can be solved using Eqs. (5.17)-(5.23) with 1711 -— 6N — 23 operations without pivoting and 24n - 13N — 34 operations with pivoting. 5.2 A Compute-Aggregate-Broadcast Computation Solver Based on the matrix partitioning technique described previously, a parallel algorithm can be easily developed with a compute-aggregate-broadcast computation. This algorithm is called the Parallel Partition LU (PPT) algorithm and is described in this section. Using N processors, the PPT algorithm to solve Eq. (5.1) consists of the following steps: Step 0. Allocate A,-, d“) and elements aim and c(,-+1),,,_1 to the 1"“ node, where 0 S i S N —1. Step 1. Use the LU decomposition method described in Section 5.1 to solve Eq. (5.24). All N computations can be executed in parallel and independently on N processors. Step 2. Send 5:9), 5:51,, v9), v31“ my), 1125,11 (0 S i S n — l) to a special node to form matrix Z and vector h (Eqs. (5.19) and (5.20)). Hereafter the subindex indicates the component of the vector. Step 3. Use the LU decomposition method to solve Zy = h (Eq. (5.21)) on that special node. Note that Z is a 2(N — 1) dimensional tridiagonal matrix. Step 4. Send yg;_1 and yg; to the i“ node, (i = 0, ..., N— 1). Here we set y-1 = 31207-1) = 0. Step 5. Compute Eqs. (5.22) and (5.23). We have Ax“) = [5“), mm] [mi-ll ’ (5 25) V2; 30') = 50') _ Ax“) , 71 This step is executed in parallel on N processors. As mentioned in Section 1.2, the underlying communication mechanism of multicomput- era has a great impact on the performance of parallel algorithms. Before we address the communication complexity issue, let’s take a close look at different communication mecha- nisms. In the store-and-forward mechanism, which is used in all first generation hypercube multicomputers, the message transfer time between two adjacent processors can be expressed as a + ,33, where a is the communication latency, fl is the transmission time per byte, and S is the number of bytes in the message. If a message is delivered to a node h hops away, the message transfer time can be roughly estimated to be h(a + 65). In second genera- tion multicomputers, advanced communication mechanisms are adopted, such as the circuit switching used in iPSC-2 [46] and the wormhole routing used in Ametek 2010 [3]. In these new communication paradigms, the message transfer time is almost independent of the dis- tance (number of hops) between processors. Thus, in second generation multicomputers, the transfer time of a message with 3 bytes can always be expressed as o: + 35 regardless of the distance the message has to traverse. The PPT algorithm has two communication patterns. The data gathering communica- tion required in Step 2 has to transfer 6 data elements per processor. The data scattering communication required in Step 4 has 2 data elements per processor. Figure 5.1 shows these two communication patterns for the case when N = 8 and processor 0 is the special node for solving Eq. (5.21). The best way to handle the data gathering and scattering communications is to use the tree reduction technique, as shown in Figure 5.2 for the case of data scattering when N = 8. The data gathering communication is simply an inverse of the data scattering. It has been shown that the spanning tree can be perfectly embedded in a hypercube [50]. 
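The log N structure of this tree-based scattering can be made explicit with a small counting model. The sketch below simulates one common realization, recursive halving on a hypercube of N = 2^k nodes, and returns the resulting cost in the α, β model introduced above. It performs no actual message passing; the parameter values in the example are only illustrative (α and β are the NCUBE measurements reported later in Section 5.5).

```python
import math

def hypercube_scatter_cost(N, S, alpha, beta):
    """Model the cost of scattering S bytes to each of N = 2**k hypercube nodes.

    Recursive halving: at step j, every node that still holds data destined
    for other nodes forwards half of it to its neighbour across dimension j.
    The root therefore sends N/2, N/4, ..., 1 items over log N steps: it pays
    the start-up alpha once per step and forwards (N-1)*S bytes in total.
    """
    k = int(math.log2(N))
    holding = {0: N}                     # node -> items (own + to forward)
    root_bytes = 0
    for j in range(k):
        nxt = {}
        for node, items in holding.items():
            send = items // 2            # half crosses dimension j
            keep = items - send
            partner = node ^ (1 << j)    # hypercube neighbour
            nxt[node] = keep
            nxt[partner] = send
            if node == 0:
                root_bytes += send * S
        holding = nxt
    assert root_bytes == (N - 1) * S
    return alpha * k + beta * root_bytes # alpha*logN + (N-1)*beta*S

# 8 nodes, 8 bytes per node's share, NCUBE-like alpha and beta
print(hypercube_scatter_cost(N=8, S=8, alpha=5.0, beta=0.013))
```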
Thus the communication time required to scatter messages is logN units for N processors. Note that in the first data transfer, the message has to contain the data for the rest of the processors to be visited. Each processor, upon receiving a message, has to strip its own data and forward the rest of the data to the following processors. As a result, the communication time required in data scattering (and also in data gathering) can be estimated to be alog N + (N —1)55. (5.26) Equation (5.26) can be applied to all second generation multicomputers and to first genera- tion hypercube multicomputers using the embedding scheme described in [50]. Now we are ready to state the computation and communication complexities of the PPT algorithm. 72 WWW? DataGthering ) IDDDDDDD A3 5555 E Figure 5.1: The Communication Pattern of the PPT Algorithm Theorem 5 The PPT algorithm solves Eq. (5.1) in 17(n/N) + 16N — 45 and 24(n/N) + 22N — 69 parallel arithmetic operations for non-pivoting and pivoting, respectively. With 4 bytes per data element, it requires 2(alogN + 16(N — 1);?) communication time units, where N is the number of processors and n = mN. Let Tw, T5”, and TppT be the time required to solve Eq. (5.1) using the sequential LU decomposition algorithm, the sequential partitioning algorithm (Eqs. (5.17)-(5.23)), and the PPT algorithm, respectively. Let rm, represent the unit of a computation operation normalized to the communication time. From those theorems shown previously, we have T LU = (8n -— 7km, without pivoting (5.27) Twp = (11n — 12km, with pivoting (5.28) Tspr = (l7n — 6N — 23km, without pivoting (5.29) TSPT = (24n — 13N - 34)rm, with pivoting (5.30) TPPT = (17% + 16N — 45) Team, + 2(alogN + 16(N — 1);?) without pivoting (5.31) T PPT = (24% + 22N — 69) Team, + 2(alogN + 16(N — 1);?) with pivoting (5.32) 73 Dividing the time required for a sequential algorithm by the time required to execute the PPT algorithm, an estimated speedup of the PPT algorithm over other sequential algorithms can be obtained. In Section 5.5, the theoretical results obtained here will be compared with the measured results obtained from experiments on a 64-node NCUBE. 5.3 A Global-Exchange Computation Model From Figure 5.1 we can see that the PPT method is a compute-aggregate—broadcast compu- tation. Step 1 of PPT is the first parallel computation phase. Step 2 is the data aggregation phase. Step 3 is the single node computation phase and step 4 is the broadcast phase. Step 5 is the second parallel computation phase. On hypercube multicomputers, the data aggrega- tion (gathering) and data broadcast (scattering) are achieved by tree reduction and inverse tree reduction. Both tree reduction and inverse tree reduction take 0(logN) steps, where N is the number of nodes involved in the computation, on hypercube architectures (see Figure 1391:9999 Ix“. \‘x \ 411(1),1(1)] E3 Q r W l \‘x, [{1ij 211mm» éfibfisifid l ............ [1(1).1(1)] 111111 DUDE] FD Figure 5.2: Tree Reduction 74 Using the communication pattern shown in Figure 3.5, the total data-exchange also can be achieved in 0(logN) time on hypercube architectures. The 0(logN) time communication suggests that we can add redundant computation, reduce the communication overhead and, therefore, improve the overall performance. Step 2 of the PPT method can be changed to total data exchange and step 4 of the PPT method can be removed. Step 3 will be redundant on every working node. This redundant computation combined with step 5 of the PPT method forms a computation phase. 
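These timing expressions are straightforward to evaluate. The short sketch below does so for the non-pivoting case, using the machine parameters reported in Section 5.5 (α = 5.0, β = 0.013, τ_comp = 0.08) purely as illustrative inputs; the helper names are not part of any library or of the implementation described here.

```python
import math

def t_lu(n, tau, pivot=False):
    """Sequential LU / LUP time, Eqs. (5.27)-(5.28)."""
    return (11 * n - 12) * tau if pivot else (8 * n - 7) * tau

def t_ppt(n, N, tau, alpha, beta, pivot=False):
    """PPT time on N processors, Eqs. (5.31)-(5.32)."""
    comp = (24 * n / N + 22 * N - 69) if pivot else (17 * n / N + 16 * N - 45)
    comm = 2 * (alpha * math.log2(N) + 16 * (N - 1) * beta)
    return comp * tau + comm

tau, alpha, beta, n = 0.08, 5.0, 0.013, 6400
for N in (2, 4, 8, 16, 32, 64):
    print(N, round(t_lu(n, tau) / t_ppt(n, N, tau, alpha, beta), 2))
```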
With this total data-exchange strategy, we achieve a global-exchange computation method which is called the Parallel Partition Global-exchange (PPG) method (see Figure 5.3). gamma; [N(N) N(N)] ddibihhh Figure 5.3: The Communication Pattern of the G-E Computation Using P processors, the PPG algorithm for solving Eq. (5.1) consists of the following steps: Step 0. Allocate Ag, (1“) and elements a;,,,, c(,+1)m_1 to the 2"“ node, where 0 S i S N - 1. Step 1. Use the LU decomposition method described in Section 5.1 to solve Eq. (5.24). All the N computations can be executed in parallel and independently on N processors. Step 2. Send 30') , $5,111, v35), v,(,',)_,, 109), w(,,',)_ (0 S i S N — 1) to every other node to form matrix Z and vector h (Eqs. (5.19) and (5.20)). Step 3. Use the LU decomposition method to solve Zy = h (Eq. (5.21)) on each node and compute Eqs. (5.22) and (5.23) in parallel on N processors. The computation requirement of the PPG algorithm is the same as the PPT method while the communication requirement is reduced from 2(alogN + 16(N - l)fi) to (01le + 16(N - 1))6). This reduction is not significant for our machine where N = 64, but it may have a larger influence when more processors become available. 75 5.4 A Solver With More Than One Computation Model The PPT and PPG algorithms discussed above provide better performance than most of the existing parallel algorithms for solving Eq. (5.1) when N < n. However, the efficiency of the PPT algorithm decreases as the number of processors, N, increases, because the major computation in the conquer part (Step 3 in Section 5.2) of the PPT algorithm is sequential. For this reason, we use the PP method (see Section 5.1), the limited processor version of the RCD method, to solve Eq. (5.21). In order to apply the PP method, all the superdiagonal elements of the coefficient matrix must be non-zero. The following theorem is needed to apply the PP method to solve Eq. (5.21). Theorem 6 If all the superdiagonal elements of the matrix A are non-zero, the tridiagonal matrix Z = P + ETA-1 VP has non-zero superdiagonal elements. Proof: The superdiagonal elements of Z are either equal to l or the first components of the solutions A55“) = c(,-+1),,,-1e,,,_1, i = 1,..., N—2, (5.33) where w“), em-1 6 12'". Suppose ing) = 0, then to?) = 0 for j = 1,..., m — 1, since the superdiagonal elements of A,- are non-zero. Therefore, we have Agw“) = 0, which is a contradiction to Eq. (5.33). D The new algorithm, namely the parallel partition hybrid (PPH) algorithm, consists of two computation models, the recursive doubling computation model and the compute-aggregate- broadcast computation model. The PPH algorithm is similar to the PPT algorithm except for the following changes. After Step 1 of the PPT algorithm, the (2i — I)“ and 2i“, for i = 0, ..., N - 1, rows of the tridiagonal matrix Z are stored in the i“ node (here the (--1)‘h and 2(N - 1)“ rows of Z are assumed to be zero). Step 2 in the PPT algorithm which performs data gathering is eliminated. Step 3 of the PPT algorithm is then replaced by the PP method as described in Section 2. Step 4 performs the following data transfer. Step 4. Send ygg.” from the i“ node to the (i + 1)“ node for i = 0, ..., N - 2. In Step 3 the PP method has two communication patterns. The first communication pattern computes all partial matrix products (see Figure 3.13). The second communication pattern is a broadcast which broadcasts the computed to (or X0 with another two known components, 0,1) to all other nodes. 
Broadcast has a communication pattern similar to data scattering, as shown in Figure 5.2; in a broadcast, however, all nodes receive the same data. Here we assume each floating point number has 4 bytes. For second generation multicomputers, based on the communication model discussed in Section 5.2, the communication time required to obtain all partial products of the B_i (Figure 3.13) is (α + 24β)(1 + log N). Here S = 24 because each message transfer includes six data elements. The broadcast communication takes log N steps and the message is one data element, which takes (α + 4β)log N time. The shift communication required in Step 4 takes α + 4β time. The communication pattern of the PPH algorithm is shown in Figure 5.4 for N = 8. Counting the arithmetic operations and communication times, we have the following theorem.

Theorem 7 With n = Nm, the PPH algorithm solves Eq. (5.1) in 17(n/N) + 20 log N + 17 and 24(n/N) + 20 log N + 4 parallel arithmetic operations for non-pivoting and pivoting, respectively. It requires ((α + 4β) + (α + 24β))(1 + log N) communication time.

Figure 5.4: The Communication Pattern of the PPH Algorithm

Let T_SPH be the time required to execute the PPH algorithm on a sequential processor and T_PPH be the time required to execute the PPH algorithm on N processors. We then have

T_SPH = (17n + 8N - 41) \tau_{comp}    without pivoting
T_SPH = (24n - 5N - 41) \tau_{comp}    with pivoting
T_PPH = (17\frac{n}{N} + 20\log N + 17) \tau_{comp} + ((\alpha + 4\beta) + (\alpha + 24\beta))(1 + \log N)    without pivoting
T_PPH = (24\frac{n}{N} + 20\log N + 4) \tau_{comp} + ((\alpha + 4\beta) + (\alpha + 24\beta))(1 + \log N)    with pivoting

It is interesting to note that the PPH algorithm can reach a speedup of 2 over the PP algorithm for 1 < N < n. When N = n, no matrix partitioning is needed and the PPH algorithm is virtually the RCD algorithm. When N = 1, there is no conquer part and, since the LU decomposition method is used in the divide part, the algorithm becomes the LU decomposition algorithm.

5.5 Comparison and Experimental Results

In this section we compare our methods with some existing methods for solving Eq. (5.1). The arithmetic operation counts and communication steps of those methods are summarized in Table 5.1.

Algorithm               | Computation             | Communication
Wang's Partition        | 21n/N - 12 - 8/N *      | 2(N-1)(α + 10β) + (α + 16β)
Recursive Doubling (PP) | 35n/N + 20 log N - 29   | (1 + log N)(α + 24β) + log N (α + 4β)
Cyclic Reduction        | 20n/N                   | O(log n) **
PPT (non-pivoting)      | 17n/N + 16N - 45        | 2(α log N + 16(N-1)β)
PPT (pivoting)          | 24n/N + 22N - 69        | 2(α log N + 16(N-1)β)
PPH (non-pivoting)      | 17n/N + 20 log N + 17   | ((α + 4β) + (α + 24β))(1 + log N)
PPH (pivoting)          | 24n/N + 20 log N + 4    | ((α + 4β) + (α + 24β))(1 + log N)

* scalar count of Table V of [62], divided by N
** the exact number of communication steps for the cyclic reduction method is not known, but it is at least of complexity O(log n)

Table 5.1: Computation and Communication Counts of Tridiagonal Solvers

Wang's partition method and the cyclic reduction method are well known. However, it is difficult to compare these two methods with our methods on multicomputers. Although we believe that limited-processor versions of these two methods will increase the computation complexity, we list their computation counts in Table 5.1 assuming they can be perfectly partitioned, i.e., the computation count is scaled down by a factor of N. The communication cost of cyclic reduction is on the order of O(log n) steps, and each step takes at least α + βS time.
Michielse and Vorst modified Wang's algorithm for local memory systems. The communication count is based on their result [42]. However, the computation count of their modification is slightly greater than that of Wang's algorithm and thus is not listed in the table.

As shown in Table 5.1, the computation counts of all our methods are better than those of the other methods. In terms of communication counts, the PPH method has the same order as the PP method. This is reasonable because they both use the communication pattern shown in Figure 3.13. The cyclic reduction method has a high communication count and is therefore better suited to vector machines than to multicomputers. For our other methods, the communication complexity is less than that of Wang's method.

In the following discussion, we compare our methods with respect to the number of processors. Figure 5.5 shows the estimated and measured speedup of the PP, PPT, and PPH algorithms with respect to the SGTSL routine (see Section 5.1). These algorithms are implemented on a 64-node NCUBE multicomputer. NCUBE is a first generation multicomputer adopting the store-and-forward communication mechanism. As indicated in Section 5.2, the communication pattern of the parallel prefix method shown in Figure 3.13 cannot be perfectly embedded in a hypercube, and the worst dilation cost is 2. Thus, in the communication counts of the PP and PPH algorithms, the factor (α + 24β) is doubled. On our NCUBE machine, the following system parameters are measured: α = 5.0, β = 0.013, and τ_comp = 0.08 (without normalization). The dimension of matrix A is chosen as n = 6400. This value is limited by the memory constraint of the individual processors. As N increases, the PPH algorithm will outperform the PPT algorithm, because the dimension of the matrix Z (Eq. (5.21)) increases with N, which favors the parallel approach used in the PPH algorithm. Again, due to the limited number of processors available in our NCUBE, the maximum number of processors used is N = 64. The performance of the PPH algorithm appears to be underestimated compared with the measured results. This is mainly caused by assuming a dilation of 2 for all communications occurring in Figure 3.13; in the actual implementation, some communications have a dilation of 1.

As n goes to infinity, the asymptotic speedup of all our methods, compared with the LU decomposition method, is 0.471N. The asymptotic speedup of the PP method is 0.229N. Dividing the speedup by the number of processors, N, Figure 5.6 shows the efficiency of our methods. For all methods, the efficiency decreases as the number of processors N increases. This is mainly the result of the increasing ratio of communication to computation.

Figure 5.5: The Speedup Over the LU Decomposition Method when n = 6400 ((a) estimated speedup, (b) measured speedup; speedup versus number of nodes)

As mentioned in Chapter 2, there are two commonly accepted measures for the speedup and the efficiency of parallel algorithms [47]. One focuses on how much faster a problem can be solved with N processors.
Thus, it compares the best serial algorithm with the parallel algorithm under consideration, and is defined as

S_N = \frac{\text{execution time using the fastest sequential algorithm on one processor}}{\text{execution time using the parallel algorithm on } N \text{ processors}}.    (5.34)

Both Figures 5.5 and 5.6 are based on the above measure, and the best sequential algorithm chosen is the LINPACK subroutine SGTSL. Another measure is interested in the parallelism inherent in an algorithm and is defined as

S'_N = \frac{\text{execution time using one processor}}{\text{execution time using } N \text{ processors}}.    (5.35)

The latter is the relative speedup. Since each node of the NCUBE multicomputer has only 512K bytes of memory, single node execution for a tridiagonal system of dimension 6400 is impossible. The estimated self speedup and efficiency of our methods can be found in [60].

Figure 5.6: Efficiency Over the LU Decomposition Method when n = 6400 ((a) estimated efficiency, (b) measured efficiency; efficiency versus number of nodes)

One of the advantages of our approach is its diversity. In the divide part, pivoting may or may not be used. There has been a tacit assumption for the first three methods listed in Table 5.1 that no pivoting is required. In fact, it does not appear to be feasible to incorporate a pivoting strategy into those algorithms (Wang's algorithm allows very limited pivoting). However, pivoting can be used by the PPT, PPG and PPH algorithms with slightly higher operation counts. Thus, the PPT, PPG and PPH algorithms are more stable than the others in cases where pivoting is required. In the conquer part, the major computation requirement is devoted to solving Eq. (5.21). The methods used in this part are independent of the methods used in the divide part, and other strategies may also be used to solve Eq. (5.21).

Unless the matrix A is positive definite or diagonally dominant, there is no general rule to guarantee that all the submatrices A_i (i = 0, ..., N - 1) of A (see Section 5.1) are non-singular. The small matrix Z in the conquer part may be singular too, although the latter case is unlikely to happen. However, for a given class of matrices, very often we can find a criterion to avoid those singularities and make the algorithms work.

In this chapter, using the solution of tridiagonal linear systems as an example, we have shown how to change the algorithm design to fit the architecture and to reduce communication overhead. We have changed our design from compute-aggregate-broadcast computation to global-exchange computation and to hybrid computation for better performance. In general, we may have different approaches for a given application. The performance of the computation models suggests which approach is best for a given architecture, so that we can achieve a more efficient algorithm design. Another new parallel algorithm, the Parallel Diagonal Dominant (PDD) algorithm, is also proposed for solving diagonally dominant tridiagonal systems. Detailed information about the PDD algorithm can be found in [60].

Chapter 6

PREDICTION OF PERFORMANCE

Performance measurement techniques can be divided into three categories: pre-run, run-time and post-run [12].
Pre-run performance measurement is also called performance prediction. It identifies the computationally intensive parts of an algorithm and predicts the perfor- mance of an algorithm. Run-time performance measurement collects execution data that can be analyzed on line. The post-run performance environment provides the measured im- plementation results. Comparing with run-time and post-run performance measurements, performance prediction is more economical for algorithm evaluation, and, without actual implementation, it is not restricted by physical limitations which may be imposed by the number of available processors and / or the maximum memory capacity. The predicted results not only can be used for evaluating algorithms, they also can be compared with the imple- mentation results. The prediction and implementation results comparison often leads to a better understanding of the algorithm and to a possible performance enhancement. Further, performance prediction is very important for real time systems. In real time systems, the machines have to respond within a given deadline. Within the time limitation,'we would like to obtain the most accurate solution possible. Knowing the turnaround time before execution is the key to achieving this goal. However, providing an accurate prediction of the performance of parallel algorithms is difficult. As we have learned from previous chap- ters, the performance of a parallel algorithm is dependent on the application, the number of processors available and the problem size. Execution time can be predicted by using a parallelism profile (see Figure 2.1). A par- allelism profile provides information about uneven allocation degradation. The amount of work executed with degree of parallelism i, VVg, can be obtained from this information and a lower bound of elapsed time can be derived based on I’Vg. Equation 2.7 T(N) = 55% 731 (6.1) i=1 is the predicted elapsed time based on the parallelism profile. T(N) is a lower bound of the 81 82 execution time, but it certainly is not a greatest lower bound. Communication latency is a very important factor influencing the performance. By considering communication overhead, a better lower bound for execution time is T(N) = ‘=‘ ‘ 2"] + Q" , (6.2) where QN is the communication overhead when N processors are used (see Eq.2.9). Structured representation provides adequate information for both uneven allocation and communication latency degradations. With some measured data, the performance of an application can be predicted based on its structured representation and Eq.(6.2). This performance prediction can be obtained by collecting information from each compute and data-exchange phase. In this approach, the parallel computation requirement of each com- pute phase should be known and the performance of each communication primitive should be premeasured. This performance prediction can also be obtained through a modular approach by using the performance information of computation models. This modular approach has many advantages over the detailed approach, which we will discuss in the next section. This chapter is organized as follows. In Section 6.1, we will show the performance formu- lations of two computation models, the domain decomposition computation model and the recursive doubling computation model. An example is presented in Section 6.2 to illustrate how the computation models and the memory-bounded speedup model can be used in per- formance prediction. 
The influence of problem size on the speedup of computation models is analyzed in Section 6.3. 6.1 Performance Formulations The most commonly used performance metrics for parallel algorithms are elapsed time and speedup. In this section, the performance formulations of two parallel computation models are derived based on their structured representations, in terms of elapsed time and in terms of speedup. Performance formulations for other computation models can be derived by following similar arguments. These performance formulations of computation models can be used to predict the performance of a large class of scientific applications. Using the performance of computation models to predict the performance of a given ap- plication has advantages over predicting the performance of the given application directly through its structured representation. With the computation model approach, an application is divided into parts. The performance of these parts, or computation models, will suggest 83 which section of your algorithm is the bottleneck. Second, the performance of an application is not only dependent on the application and the machine architecture, but also dependent on how the application is mapped onto the architecture. The performance information of com- putation models contains the mapping information. It provides a more accurate estimate. Also, the computation models are naturally associated with commonly used mathematical methods. With the modular approach, when a mathematical method is chosen, the struc- tured representation of a given application will be known, and the performance prediction of the given application is available. The computation model based modular approach is easy to understand, has less error accumulation, requires less communication and less architectural knowledge. Figure 3.11 in Section 3.2 shows a structured representation for a domain decomposi- tion application. In general, the structured representation of the domain decomposition computation model is 20W), N001 + X0, i=1 where N is the number of processors available, I: is an integer, which sometime depends on the convergence check, and d is the degree of each sender and each receiver. Studying the domain decomposition computation more carefully, we find that it has the following characteristics. 0 Communication overhead is independent of system size The degrees of senders and receivers are fixed for domain decomposition applications. It is independent of problem size and system size. The communication pattern is also independent of problem size and system size. For second generation multicomputers, the communication latency of the domain decomposition computation model is fixed. In fact, even for first generation multicomputers we generally can manage to map a domain decomposition onto an architecture such that the communication delay is independent of the system size. We use Q to denote the common communication latency. 0 Perfect degree of parallelism The domain decomposition applications do not have uneven allocation degradation. All the processors do useful work at each compute phase and the loads are evenly distributed. Domain decomposition computations are good candidates for parallel processing. 84 0 With a fixed number of iterations at each compute phase, the computation requirement at each compute phase is the same, X = X; for i = 1, ..., It. This is the case for most domain decomposition applications. 
The execution time of domain decomposition applications can be derived from the struc- tured representation directly. The sequential execution time is N(Zf=1 Xi) A 3 where A is the computing capacity of each processor (see Section 2.1). Notice that, as T(1) = defined in Section 2.1, WN = N (E?=1 X,-) for domain decomposition computations. The parallel execution time for domain decomposition computations will be Xg)+Q A . Three different speedup formulations can be derived for the domain decomposition model by following the three different models of speedup given in Section 2.2. For the fixed-size speedup model the speedup is WN S = . 6.3 N 57%: ‘ ) For the fixed-time speedup model, we have WN = #vk + Q Wiv =N(WN-Q) where Wiv is the scaled workload for fixed-time speedup (see Section 2.1). Thus, the fixed- time speedup is Wiv ___ N(WN-Q) J + Q W” (6.4) For the memory-bounded speedup model, if we let the integer k be fixed, we have W12, = N W”, where W5, is the scaled workload (see Section 2.1). The memory-bounded speedup is _ NWN - Q'l'WN. 5;. (6.5) 85 The recursive doubling computation model is another computation model presented in Section 3.2 (see Figure 3.13). The structured representation of the recursive doubling com- putation model is 33“?" - 2")[1(1).1(1)] + Xe). k = logN. (6.6) i=0 The recursive doubling computation model has interesting properties. It has a dynamic degree of parallelism and has a fixed computation requirement at each compute phase while the number of compute phases increases with the system size. More precisely, this model has the following characteristics. 0 The computation requirements at each compute phase are the same, X = X,- for i = 1, ..., k. 0 Communication at data-exchange phases is regular data-exchange. The communication patterns at each data-exchange phase are the same. Although the number of disjoint patterns varies, communication latency at each data-exchange phase is the same. 0 The number of compute and data-exchange phases will increase as a function of log(system size). 0 In both fixed-size and fixed-time speedup models, if Q is fixed, X will decrease when N increases. 0 The degree of parallelism varies from phase to phase. P x R gives the degree of parallelism at each compute phase, which is equal to 2" - 2‘ at the i“ compute phase. This means that W2h_gi = (2" —2‘)X = (2" —2‘)W1, and IV,- = 0 for j at 2" - 2‘, where iemnvwk—n. Based on the structured representation (6.6), the sequential execution time of the recur- sive doubling computation is 15-! I: _ i T(l) = i=0 (2A 2 )X The parallel execution time is k(X+Q). T(N) = A 86 Just as with the domain decomposition computation model, three speedup formulations can be obtained corresponding to the three models of speedup previously discussed. Follow- ing the notation defined in Section 2.1, we let W: 21-"! W,- be the work load when the application is running on a single processor, and define W,’ and IV: correspondingly. For the fixed-size speedup model, we have W i=7} [1? + 01'], W W hollWI + Q] =le1 + Q] W ”#53,"; + Q] SN = j=2k—2‘ Thus, notice here that 21:3 (2" -- 2‘) = lcN - (N — 1), _ _ N—l iw 5” — (N 1: )w+oIZ:i-':.‘(2*-2-'n (6'7) = (N " NI:- 1)W+[(k-W)N+I]Q’ (6'8) By definition, the speedup formulation of the recursive doubling model for the fixed-time speedup is Ic-l WI . 51" = zi=0 j = 2k _ 2s ’-‘:‘ 5+ 01-] ;:1(2"— 2')W’ N -1 Wi Ic-l =(N - ) I ‘ .0[W1 + Q] k W1 + Q Notice that k—l W' I W= Zl<-’+Q]= k(W +0) and W kQ W; = l: , we have N l—W Wk =(N - —) Q. 
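Written out, Eqs. (6.3)-(6.5) are S_N = W_N/(W_N/N + Q), S'_N = N(W_N - Q)/W_N, and S*_N = N W_N/(W_N + Q). The following sketch evaluates the three models side by side for one set of numbers; the variable names and the values of W and Q are assumptions made only for this illustration.

```python
def speedup_fixed_size(W, Q, N):
    """Eq. (6.3): fixed problem size W (= W_N), total exchange cost Q."""
    return W / (W / N + Q)

def speedup_fixed_time(W, Q, N):
    """Eq. (6.4): workload scaled so the parallel time stays fixed, W'_N = N(W - Q)."""
    return N * (W - Q) / W

def speedup_memory_bounded(W, Q, N):
    """Eq. (6.5): workload scaled with the available memory, W*_N = N * W."""
    return N * W / (W + Q)

W, Q = 1000.0, 10.0            # illustrative work and communication costs
for N in (4, 16, 64):
    print(N,
          round(speedup_fixed_size(W, Q, N), 1),
          round(speedup_fixed_time(W, Q, N), 1),
          round(speedup_memory_bounded(W, Q, N), 1))
```

For these values the fixed-size model gives the smallest and the memory-bounded model the largest speedup, in line with the discussion in Chapter 2.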
(6.9) Similarly, for memory-bounded speedup we also have N-l W,- 3,1,:(1v— k )WHQ' 87 The workload on each processor will not change with the number of processors available in memory-bounded speedup. Thus, the memory-bounded speedup is N - 1) W k W + 62' From Eq. (6.8) we can see that with the degree of parallelism considered, the speedup 5;, = (N — (6.10) formulation can be completely different from the simplified speedup formulas presented in Section 2.3. The structured representation of a given application contains adequate information about degradations, and provides more information than a parallelism profile. By using the domain decomposition and recursive doubling computation models, we have shown how elapsed time and speedup can be predicted based on structured representations. We also have shown how to derive speedup formulations based on structured representations for different models of speedup. Performance formulations of other computation models can be derived similarly. The performance formulations of computation models can be used to predict the performance of general scientific applications. 6.2 Structured Prediction We have identified computation models and have shown performance formulations for some of the identified computation models. We would like to predict the performance of general scientific applications by using the performance information of parallel computation models. The idea is that a given application can be seen as a combination of computation models. These computation models can be found by studying the structured representation of the application. The performance information of these computation models can then be com- bined to provide a performance prediction of this application. If the measured performance data of computation models are applied to a performance prediction, performance data can be estimated for this application. If the performance formulations of computation models are applied, a performance formula can be obtained for the application which shows how different parts of the application will vary with system size and problem size. Therefore, difl'erent parts can be modified for different situations to achieve the best performance. We call this kind of analysis structured analysis. The combination of elapsed times is straightforward, but the combination of speedups is somewhat more complex. The combination formulas for the elapsed times and speedups are shown in Eq.(6.11) and Eq.(6.12) respectively for the combination of two models. The 88 combination formulas in the case of n models can be easily obtained by using Eq. (6.11) and Eq.(6.12) recursively. Execution time = Execution time of part one + Execution time of part two . (6.11) Speedup = Speedup of part one x Ratio of part one (6 12) + Speedup of part two x Ratio of part two , where the ratio is the ratio over total execution time. Speedup can be derived from the execution time. We would like to have speedup as a function of the speedups of the com- putation models, since the variations of each model are easier to observe. Also, determining speedup from Eq. (6.12) is simpler than determining speedup from Eq. (6.11), when the ratio is independent of problem size and system size. We use one simple example, the parallel prefix method for solving tridiagonal systems (see Figure 6.1), to illustrate this performance prediction approach. The parallel prefix method has the following structured representation x. 
+ Ea" - 2‘)([1(1).1(1)1+ x...)+[1(N).N(1)1+ 26..., (6.13) k- Recursitnoubling 4 where the conpute phases are counted from 0 to k + 1. From this representation we can see that the parallel prefix method consists of three parts, the preprocessing part, the recursive doubling part and the post processing part. The preprocessing part and the post processing part are local computations in which the workload is evenly distributed among the available processors. The middle part is a recursive doubling computation. The degree of parallelism of the postprocessing part is equal to P x S, where P is the number of patterns in the broadcast data-exchange phase, and S is the number of senders in each of the patterns. Since P = 1 and S' = N in this case, P x S = N. The degree of parallelism of the preprocessing part is also equal to N. Since recursive doubling computation models have equal load at each compute phase, X1 = X,,i = 1, ..., k - 1. Also, since recursive doubling computations do not achieve perfect parallelism at any compute phase, WN = N (X0 + Xk+l)- Using the recursive part as a basic structure, the sequential execution time of the prefix method is [NXo + 2319" - 2")(Xx + Q) + QN + NXml A 9 (6.14) 89 E QE—Q—Q—Qu -- .-'-"-'"-'-'-"""'|-""-'|"'-. Iii l_LT (fini- Parallel Prefix Method Figure 6.1 90 and the predicted parallel execution time of the prefix method is [X0 + k(X1 + Q) 'l' QN + Xk-Hl A , where A is the computing capacity of each processor, Q is the communication overhead (6.15) of data-exchange [1(1), 1(1)] and QN is the communication overhead of data-exchange [1(N ),N (1)] Q is independent of N, the number of processors available, and QN is a function of N. After removing the computing capacity A, Eq.(6.15) becomes [X0 + k(Xl + Q) ‘l' QN + XH—ll- (6.16) Using Eq.(6.12) the speedup of the prefix method is NX x N- x, kx+Q Nx- x- #[quamfl + (N - 7:1)an Eq.16.16 + XL: [Eq.(6.i6)] (6'17) _ X H! - N-l x, k X + "' Nr_J]qu.(6.I6) + (N " T)X,+o Eq.(6.16) (6'18) where X1 is fixed. It is independent of system size and problem size. Equation (6.18) shows that the prefix method contains two parts, the local computation part and the recursive doubling computation part. The local computation part has speedup N and the recursive doubling part has speedup (N - Efllx—fifi' The local computation part has a higher speedup. We would like the local computation part to dominate the total computation. From Eq.(6.18) we can see that when the problem size increases so does the ratio of local computation, and when the system size increases the ratio of local computation will decrease. Figure 6.2 shows the estimated and measured speedup of the prefix method. We can see that they are very close to each other. By the properties of fixed computation at each compute phase, the speedup formulation used for the recursive doubling part is the memory-bounded speedup. Thus, using one simple example, we have shown how the computation models and memory-bounded speedup can be used in performance predictions. 6.3 The Influence of Problem Size on Speedup In the last two sections we showed how the performance of general applications can be predicted using the performance information of computation models. We also derived per- formance formulations for some computation models. In this section we will study the 91 25 I r l I r I 20 ~ .. 15 *- _ Speedup M d 10 - 933% O 1111 5 _ 09 """ Estimated 4 :::::3g: ..... 0 Q L 4 L l l 1 0 10 20 30 40 50 60 No. 
of Nodes Figure 6.2: Measured and Estimate Speedup of Parallel Prefix Method when n = 6400 performance of computation models in more depth. We will study how problem size influ- ences performance. The study starts with general formulas and then special cases will be noted. These special cases are those that are suitable for some computation models. As shown in Chapter 2, speedup is a function of both problem size and the number of processors available. The study is three fold. First, we only consider the simplified condition discussed in Section 2.3. Then, communication overhead is considered. Finally, uneven allocation degradation is added in to give a more accurate analysis. According to Eq. (2.13), the simplified fixedcsize speedup is W1 + WN S = . 6.19 N m ‘ ) Notice that by Eq. (2.1) W,- t.‘(1) = K. We can rewrite Eq. (6.19) as _ t1(1) +tN(1) (6.20) N- t1(1)+3”7$9' In general, we have 92 ti=t,(1)=% forlsigm, where m is the maximum degree of parallelism (see Section 2.1). When ti = 0 for i 75 1 and i aé N, the simplified speedup Eq. (6.19) can be rewritten as a function of workload W, T1(W) _ t1(W)+tN(W) T~(W) ’ t1(W) +tN(W)/N' Equation (6.21) is equivalent to Eq. (6.19), but execution time has been written as a function SN(W) = (6.21 ) of problem size. Therefore, Eq. (6.21) is easier to use to analyze the influence of problem size. In this section, instead of writing speedup in terms of workload, we write speedup formulations in terms of execution time t,-(1), where the subscript i indicates the execution of workload W,- and 1 indicates that only one processor is used in the execution. If the sequential execution time t1 and the perfectly parallel portion of the execution time tN increase with the problem size W at the same rate, then when the problem size increases c times T1(cW) = c1t1(W) + cltN(W) TN(cW) c1t1(W) + cltN(W)/N Thus, in this case, speedup Eq. (6.21) is independent of problem size. SN(cW) = = SN(W). (6.22) The problem becomes more complicated when communication overhead is considered and the communication overhead is a function of problem size. In this case we could let the sequential execution time, t1, be independent of the problem size or we could let it vary with the problem size. In the former case, we have 151 + thwl t1 + tN(Wl/N + Q(W)' Assume that tN is proportional to W with coefficient c1 and Q is proportional to W with SN(W) = (6.23) coefficient c3, then after the problem size is changed from W to cW, we have t1 + ctN(W) t1 + CtN(W)/N + cQ(W)' The variation of speedup is difficult to understand through Eq. (6.24). We need to use a new approach, the derivative of SN, to study the variation. By Eq. (6.23), the derivative of SN on W will be 5N(CW) = (6.24) 61(t1 + Q + tN/N) ‘ (03 + Cl/Nth + tN). WW) = (t. + mm + Q)’ (6.25) 93 The sign of S}, is determined by its numerator. Notice that under our assumption, we have le = catN. Thus the numerator 61(t1 + Q + tN/N) - (Ca + c1/N)(t1 + tN) = c1(t1(1—1/N)+ Q) - c3(t1 + tN) = t1[61(1—1/N)— C3]. When c1 > 171%63’ 51“, > 0. Therefore, SN will increase with the problem size. If c1 < Nix—leg, Sj‘l, < 0, speedup will decrease as the problem size increases. If e; = FIX—163, S}, = 0, speedup is independent of the problem size. This gives the following results. Theorem 8 If the sequential execution time is independent of the problem size, and the parallel execution time and the communication overhead are proportional to the problem size with coeflicient c1,c3 respectively, then, the speedup Eq. 
(6.23) will increase with problem size when c1 > 171%1'63' It will decrease with an increase in problem size when c1 < N—IZTC3 and will be independent of problem size when cl = NIX—{cw The following theorem gives an upper bound of the influence of problem size. Theorem 9 Under the same assumptions of Theorem 8, the speedup is bounded by SN(I’V)+ tN/ Q Proof: _ h+tN(cW) _ 81+C8N SN(CW) - tl+Q(CW)+tN(cW)/N — 81+OQ+ctNIN = t £1 + “N < ‘1+‘N in (6.26) 1 +CQ+ctN /N 81 +OQ+ctN IN 81 +OQ+ctN [N cQ < SN(W) + tN/Q with the assumption c 2 1 It is easy to see that if the sequential execution time is also proportional to the problem size then the speedup (6.23) is independent of the problem size. The performance model (6.23) 'fits the domain decomposition computation model (see Section 3.2) well. In the domain decomposition model, there is no uneven allocation degra- dation. The problem can be perfectly partitioned and the degree of parallelism will be N with N processors available. Each node has a fixed number of communication requirements. The communication latency is independent of system size and is a function of problem size. The sequential workload, W1, may or may not change with the problem size depending on the loading of the initial data. However, the communication overhead is proportional to the problem size. Commonly, it is a rational power function of the problem size. For instance, 94 using the finite difference method to solve Laplace’s equation, Eq. (3.8), each node will solve a sublinear system Ax, = b,- and send the result to its four neighbors. If A,- is an n by n matrix, then the length of the result is n, and, with the Gaussian elimination method, the solving process has a complexity 0(n3). Theorem 10 gives a prediction of the speedup when communication overhead is a power function of problem size. Theorem 10 If the sequential execution time, t1, is independent of the problem size, the parallel operation is proportional with the problem size, tN = Cl W, and the communication overhead is a power function of the problem size, Q = aW”, then when c3 < “121:1;111331351 , the speedup (6.23) will increase with the problem size; when 63 > [t1([i-I:I{::)l)gltN , the speedup will decrease with an increase in problem size; and the speedup is independent of the problem . __ [t1(1-1/N)+Q]tn size when C3 — “N“llq . Proof: We prove the theorem by observing the sign of the derivative of SN(W) on the variable W. The sign is determined by the numerator of Sfi(W), which is equal to c1(t1 + tN/N + Q) - (61/N + caawa-lxtl + tN) = ~36- (C1W(t1 + tN/N + Q) — («W/N + caaWs-‘Wxn + tn» = 5% (tN(t1 + tN/N + Q) - (tN/N + caQ)(t1 + tNll = 5 (1,1,,(1 — 1/N) + tNQ - c2061 + tn))- Therefore, Sj‘v(W) will be less than zero when c3 > [‘1°2:;I/+1:)14)'31h~1. va(W) will be greater than zero when c3 < “1003173: It” and Sj‘(,(W) will be equal to zero if C3 = [_tfil - l/N) + QltN D (tN'l'tllo Notice here that when the sequential workload is equal to zero, the sequential execution time t1 is also equal to zero. In this case, [t1(1-1/N)+ QltN =1 “N + tllQ ’ and Theorem 10 becomes easier to understand. Theorem 11 gives the upper bound of the influence of problem size when the communication latency Q is a rational power function of the problem size. Theorem 11 Under the assumptions of Theorem 10, for any real number c 2 1, we have _c t w 5,,(cW) 5 SN(W)+c‘ a-ggT,’ 95 Proof: t «H (CW) 5~(6W) = W ‘L‘N‘Wl = t1+cc3Q(“;)+cl(iw(W )7N + t1+c‘30(W)-I;ct%9;V)/N < SN(W) +—N—— é,q(w)_ -SN(W) + chm-315,7. 
If we let the sequential execution time, t1, vary with the problem size, by similar argu- ments we have the following corresponding theorems. Theorem 12 Under the assumptions in Theorem 10, if we also let the sequential execution time be proportional to the problem size, with coefl‘icient c2, then when C3 < 1, the speedup (6.23) will increase with the problem size; when c3 > 1, the speedup will decrease with an increase in problem size; and when c3 = 1 the speedup is independent of the problem size. Theorem 13 Under the assumptions of Theorem 12, for any real number c 2 1, when 63 ..>. 1, 5N(CW) S S'N(W); and when C3 < 1, S'N(cW) S cl’QSN(W). Proof: By Theorem 12, we only need to prove the c3 < 1 case. When c3 < 1, _ ct1+ctu SN(CW) - ctl-l-c‘3Q-l-ctN/N t1+tN = c‘3 “(a-ca t1+Q+c1-¢a tN/N) 5 cl":3 SN(W). D All of the above six theorems, Theorem 8 to Theorem 13, are based on speedup (6.23). With the consideration of the sequential execution and communication latency degradations, they show how the speedup is influenced by workload and what is the limitation of this influ- ence. They are useful prediction formulations for domain decomposition computations and for other scientific and engineering applications. In the remaining portion of this subsection, we study the influence of problem size with degree of parallelism considered. Recall that t; is the execution time for workload IV,- when one processor is available. Assuming that each t;, i = 1, ...,m is proportional to the problem size, then the speedup will be independent of the problem size from the degree of parallelism point of view. This is confirmed by the following equation, €i=1flti(cwl Em ti(W) 51v(6W)= (cw)/22,-";1 t1(Wl/ =5~(W). (6.27) 96 Most of the frequently used applications do not distribute load evenly. For achieving high performance, we would like workload with a high degree of parallelism to increase with the problem size, and workload with lower degrees of parallelism to be fixed, independent of problem size. In general, we can divide the integer set 1, 2, ..., m into two disjoint subsets i1, i2, ..., im, j1,j3, ..., jm, such that m1 + m2 = m and that workload in degree of parallelism it, I: = 1, ..., m1 is independent of the problem size, and the workload in degree of parallelism jk, lc = 1, ...,mg increases with the problem size. By this partition, we have _ m,-,t;(W) 221.11 t1,+222'_1t5 (W) S”(W)’ .:.t.(W)/i =2?..t.,/i.+zr:.t1:(W)/j, Extending the average parallelism concept introduced in Section 2.1, we have t1*+tn‘(W) . t . +tN.(W)/A. S A . (6.28) where t1 = 216:1 tik/ika tN'(W) =X:,,_1 tjk(W)’ and A‘= 2,51 tjk/[zkgl tjk/jkl is the extended average parallelism. Thus, when the number of processors is fixed, we return to Eq. (6.21) and the results for Eq. (6.21) can be directly used for Eq. (6.28). Both the super sequential execution time t1' and the extended average parallelism A‘ of Eq. (6.28) SN(W) = are independent of problem size. However, both of them are functions of system size. The prediction formulation Eq. (6.28) is useful for divide-and-conquer computations. For certain divide-and-conquer computations the divide-and-conquer parts have a fixed number of operations which are not influenced by the problem size. Only the computation part scales up with the problem size. Since the computation part has N as the degree of parallelism, by Eq. (6.28), near perfect speedup can be achieved when the problem size is large enough. 
Chapter 7 CONCLUSION AND FUTURE RESEARCH The design, representation and performance considerations of parallel algorithms on mul- ticomputers have been presented in this thesis. In this chapter, the work reported in this thesis is summarized and the major contributions of this dissertation are highlighted. Next, improvements to the current work are suggested, and future plans and research directions will also be discussed. 7.1 Summary and Major Contributions The central issues in the development of efficient parallel algorithms involve exploring the inherent parallelism in applications and the matching of applications to architectures. The present study is an attempt to better understand of these issues. The contributions of this study include developing a notation for describing the structures of the most frequently used scientific and engineering applications, identifying the basic building blocks of these structures, providing a guideline for the design of efficient parallel algorithms, developing and implementing efficient parallel algorithms for several applications, proposing a new approach for predicting the performance of parallel algorithms, and presenting a more general speedup model, namely, memory-bounded speedup. Traditionally, parallel algorithms have been designed by brute force methods and fine tuned on each architecture to achieve high performance. Rather than studying the devel- opment and matching case by case, a systematic approach was proposed in my research by studying the basic building blocks, called computation models, of frequently used ap- plications. Anotation was first developed in Chapter 3. Using this notation, most of the frequently used scientific and engineering applications can be represented by simple formulas. These formulas constitute the structured representation of the corresponding applications. The structured representations are simple, adequate and easy to understand. They also contain sufficient information about uneven allocation and communication latency degra- dations. With the structured representations, applications can be compared, classified and 97 98 partitioned. There are innumerable applications. However, most of the applications are combinations of some computation models. Structured representations relate general ap- plications to computation models. Studying computation models leads to a guideline for efficient parallel algorithm design for general applications. Seven models have been identified and presented in Chapter 3. They are the Local Com- putation Model, the Global-Exchange Computation Model, the Compute-Aggregate-Broadcast Computation Model, the Divide-and-Conquer Computation Model, the Domain Decompo- sition Computation Model, the Pipelined Computation model and the Recursive Doubling Computation Model. The structured representation of each of the computation models was presented, and the computation and communication patterns of the computation models were studied. With the computation model concept, the matching of application to archi- tectures becomes transparent. The performance information of computation models suggests which structure is best suited for a specific multicomputer architecture, and which part of an algorithm is the performance bottleneck. Therefore, structured design and rethinking become possible. Structured representation and computation models have been used on the development of parallel algorithms for several scientific and engineering applications. 
These applications include solving the electrical power flow problem [52], solving tridiagonal linear systems [60], solving dense linear systems [49] and solving tridiagonal symmetric algebraic eigenvalue problems. (The work on the last two applications, published in [49] and [40], is not included in this dissertation.) We used different computation models for different applications. The algorithm for solving dense linear systems is based on the pipelined computation model. The tridiagonal symmetric algebraic eigenvalue problems were solved with local computation and divide-and-conquer computation approaches. The newly proposed divide-and-conquer algorithm for solving eigenvalue problems is very efficient; it competes well with any existing parallel algorithm for solving eigenvalue problems. Solving tridiagonal linear systems is a fundamental problem of scientific computing, and solving power flow problems is, by far, one of the most important problems facing researchers in the electrical power field. Structured representation and computation model concepts helped us in changing from one design to another. Four new algorithms were developed for solving tridiagonal linear systems. Three of the four algorithms are presented in this thesis: the partition LU, the partition global-exchange and the partition hybrid algorithms. The partition LU algorithm is based on compute-aggregate-broadcast computation. The partition global-exchange algorithm uses global-exchange computation. The partition hybrid algorithm is a combination of compute-aggregate-broadcast computation and recursive doubling computation. All of the developed parallel algorithms have been implemented on an NCUBE/7 multicomputer. Our algorithms yield a higher speedup and efficiency compared to existing algorithms. Using structured representation, we modified our design for solving the power flow problem from the local computation model to the global-exchange model, and then to the compute-aggregate-broadcast computation model. The last one gave the best performance. Based on the compute-aggregate-broadcast algorithm, we have developed a package, called PowerCube [45], which adopts the most rigorous mathematical techniques and is able to obtain all the steady-state solutions of power systems. The package currently runs on NCUBE and BBN GP-1000 multicomputers. The related details of solving electrical power flow problems and tridiagonal linear systems can be found in Chapter 4 and Chapter 5, respectively.

In association with the study of efficient algorithm design, the performance issues of parallel algorithms were also studied [58]. Three models of speedup were discussed and analyzed in Chapter 2: fixed-size speedup, fixed-time speedup and memory-bounded speedup. Two sets of speedup formulations were derived for these three models. One set requires more information and gives more accurate estimates. The other set considers a simplified case and provides a clearer picture of the possible performance gain with parallel processing. The simplified fixed-size speedup constitutes Amdahl's law, whereas the simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. This study proposes a new metric for performance evaluation and leads to a better understanding of parallel processing.
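For reference, the three simplified models summarized above can be written compactly in the serial-fraction notation common in the speedup literature. This is only a restatement under assumed notation (the symbols used in Chapter 2 may differ): f is the sequential fraction of the uniprocessor workload, N the number of processors, and G(N) the factor by which the parallel portion of the workload grows when the memory of N processors is filled.

\[
S_{\text{fixed-size}} = \frac{1}{f + (1-f)/N}, \qquad
S_{\text{fixed-time}} = f + (1-f)\,N, \qquad
S_{\text{memory-bounded}} = \frac{f + (1-f)\,G(N)}{f + (1-f)\,G(N)/N}.
\]

Setting G(N) = 1 recovers the fixed-size form (Amdahl's law) and G(N) = N recovers the fixed-time form (Gustafson's scaled speedup), which is the sense in which the simplified memory-bounded speedup contains both as special cases.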
Performance prediction techniques have been developed based on structured representation and on some precollected performance data. The performance predictions consider both communication overhead and uneven allocation as degradations of parallelism, and give a good bound on performance. Performance prediction is very important in real-time systems, where we want to obtain the most accurate solution within a time limit. The predicted performance is also very helpful for efficient algorithm design. Using structured representation, performance can be predicted at different levels. The lower level accumulates the predicted performance of each compute and data-exchange phase. The higher level uses computation models as the basic modules and provides a more accurate and automatic approach. The performance formulations of two of the computation models were derived in Chapter 6. The performance of a large class of applications could be predicted by combining these formulations and by using the higher-level approach. The analysis process also showed how we could derive the performance formulations of a given application through the lower-level approach. An example was given to show the performance prediction techniques. Results obtained from our implementation of this example matched the predicted performance well.

7.2 Future Research Directions

In this research, the structured representation method was developed to describe the most frequently used scientific and engineering applications. The basic data structures of these applications, called computation models, were identified and studied, and a new metric for the performance of parallel algorithms was proposed. Applying the structured representation, the identified computation models and the new metric of performance, the issues in efficient algorithm design and in performance prediction were studied. All these topics, namely, the representation, the basic data structure and the metric of performance, are fundamental problems in computer science. Because of their fundamental properties, the results presented in this dissertation can be applied to other areas of computer science. Further, these problems are difficult to solve perfectly because of their fundamental nature. Future research can be identified along three directions: improving the current results on these fundamental problems; continuing the research of applying these fundamental results to efficient algorithm design and performance prediction; and extending the application of these fundamental results to other areas of computer science.

Several questions remain open along the first direction of research. For structured representation, we have only developed notation for two classes of data-exchange, i.e., regular data-exchange and conjunctive regular data-exchange. Communication phases do exist that do not belong to either of these two kinds of data-exchange. Can we extend our notation to cover more complicated communications? Structured representation provides information about how many processors participate in each data-exchange phase. If we assume that a processor computes if and only if it participates in data-exchange, then we know how many processors do useful work at each compute phase and, therefore, have some knowledge of the uneven allocation degradation. However, the workload of each processor is unknown. Processors may compute in the same compute phase but with different workloads.
Can we extend or modify our current notation to contain the workload information? Furthermore, we assume that communication is achieved in a synchronous fashion; can we extend our current notation to cover asynchronous communication? Could we use graph-theoretical results and terminology to make structured representation more general and more fruitful? For instance, results on bipartite graphs could be used to further divide a communication pattern into smaller pieces. Also, further study of computation models is needed. This includes studying more applications and identifying more models, studying each model on different architectures, and deciding how to identify a computation model. With structured representation, a computation model can be identified at different levels, finer or coarser. Which level is better is uncertain at this time. For example, divide-and-conquer is a commonly used technique, which we identified in the conventional way (see Section 3.2). However, it can also be identified as two tree structures, which may be easier to measure and understand. Three models of speedup, namely fixed-size speedup, fixed-time speedup and memory-bounded speedup, were studied for the performance of different classes of applications. Applications exist which do not fit any of these models of speedup, but satisfy some combination of the models.

The study of efficient algorithm design and performance prediction is still at a preliminary stage. This is especially true for the research in performance prediction. Only four applications have been studied, and only on a single architecture, the hypercube. We would like to study more applications on different architectures and to explore dynamic load balancing issues. We need more experimental data to support the theoretical results. Our representation assumes advanced communication technology, so it does not consider the distance between two processors as an important factor. The algorithms which have been implemented on the first-generation NCUBE multicomputer should be tested further on second-generation multicomputers. More algorithms should be implemented on second-generation multicomputers to determine the accuracy of our predictions and to improve our prediction schemes. Structured representation contains adequate information about an application, but finding the structured representation of an application is difficult; it requires a deep understanding of the application itself. How to detect the structured representation of a given application automatically and systematically is an interesting research issue. Some tools exist for detecting the degree of parallelism of a given application [36]. These tools provide starting points for developing tools that can automatically detect the structures and computation models contained within applications. The proposed performance prediction method is based on structured representation and computation models. Performance tools can be developed for performance estimation and prediction by applying the method. Further study of the performance prediction methodology is needed before such a tool can be developed, since there are numerous implementation issues involved. This is another interesting research issue which can be studied based on the results of this dissertation. During my dissertation research, I have studied several fundamental problems and applied the results to different areas of computer science. Further research can be continued along these directions.
This research can also be extended to new areas to attack untouched problems. For instance, structured representation could be modified for use with shared-memory multiprocessors; the computation model concept may be extended to non-numerical computations; and kernel programs for machine benchmarks can be developed based on the identified computation models. I plan to continue my current research and am always looking forward to meeting new challenges.

Bibliography

[1] G. Almasi and A. Gottlieb. Highly Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1989.
[2] G. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Conf., pages 483-485, 1967.
[3] W. Athas and C. Seitz. Multicomputers: Message-passing concurrent computers. Computer, pages 9-25, Aug. 1988.
[4] M. Barton and G. Withers. Computing performance as a function of the speed, quantity, and cost of the processors. In Proc. Supercomputing '89, pages 759-764, 1989.
[5] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[6] S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proc. of Supercomputing '88, pages 330-339, 1988.
[7] G. Brassard and P. Bratley. Algorithmics: Theory and Practice. Prentice-Hall, Inc., 1987.
[8] S. Chang, X. Sun, and L. Ni. A projection tool for visualizing multidimensional curves. Technical Report MSU-CPS-ACS-16, Michigan State University, 1989.
[9] S. Chow, J. Mallet-Paret, and J. Yorke. Finding zeroes of maps: Homotopy methods that are constructive with probability one. Math. Comp., 32:887-899, 1978.
[10] S. Chow, L. Ni, and Y. Shen. A parallel homotopy method for solving a system of polynomial equations. In Proc. of the 3rd SIAM Int'l Conf. on Parallel Processing for Scientific Computing, pages 121-125, Chicago, 1987.
[11] E. Coffman. Introduction to deterministic scheduling theory. In E. G. Coffman, editor, Computer and Job/Shop Scheduling Theory, pages 1-50. John Wiley and Sons, 1976.
[12] F. Darema. Parallel applications performance methodology. In M. Simmons, R. Koskela, and I. Bucher, editors, Instrumentation for Future Parallel Computing Systems, pages 49-58. The ACM Press, 1989.
[13] J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide. SIAM, Philadelphia, 1979.
[14] I. Duff, A. Erisman, and J. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.
[15] D. Eager, J. Zahorjan, and E. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, pages 403-423, March 1989.
[16] O. Egecioglu, D. Koc, and A. Laub. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Technical Report TRC88-1, Dept. of Computer Science, Univ. of California, Santa Barbara, 1988.
[17] S. Even. Graph Algorithms. Computer Science Press, 1979.
[18] R. Finkel. Large-grain parallelism: Three case studies. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms, pages 21-63. The MIT Press, 1987.
[19] W. Gropp and D. Keyes. Complexity of parallel implementation of domain decomposition techniques for elliptic partial differential equations. SIAM J. on SSTC, 9(2), March 1988.
[20] J. Gustafson. Reevaluating Amdahl's law. CACM, 31:532-533, May 1988.
[21] J. Gustafson, S. Hawkinson, and K. Scott. The architecture of a homogeneous vector supercomputer. In Proc. of 1986 Int'l Conf.
on Parallel Processing, pages 649-652, Chicago, 1986.
[22] J. Gustafson, G. Montry, and R. Benner. Development of parallel methods for a 1024-processor hypercube. SIAM J. on SSTC, 9(4), July 1988.
[23] J. Hayes, T. Mudge, Q. Stout, S. Colley, and J. Palmer. Architecture of a hypercube supercomputer. In Proc. of 1986 Int'l Conf. on Parallel Processing, pages 653-660, 1986.
[24] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Review, 20:740-777, Oct. 1978.
[25] W. Hillis. The Connection Machine. MIT Press, Cambridge, Mass., 1985.
[26] W. Hillis and G. Steele. Data parallel algorithms. Communications of the ACM, 29, Dec. 1986.
[27] R. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. J. ACM, 12:95-113, 1965.
[28] K. Hwang. Advanced parallel processing with supercomputer architectures. Proc. of the IEEE, pages 33-47, Oct. 1987.
[29] K. Hwang and F. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill Book Co., 1984.
[30] L. Jamieson. Characterizing parallel algorithms. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms, pages 65-100. The MIT Press, 1987.
[31] J. Saltz, V. Naik, and D. Nicol. Reduction of the effects of the communication delays in scientific algorithms on message passing MIMD architectures. SIAM J. Sci. Stat. Comput., 8(1), Jan. 1987.
[32] S. Johnsson. Communication efficient basic linear algebra computations on hypercube architectures. J. of Parallel and Distributed Computing, 4:133-172, 1987.
[33] S. Johnsson. Solving tridiagonal systems on ensemble architectures. SIAM J. on SSTC, 8(3):354-392, May 1987.
[34] C. King, W. Chow, and L. Ni. Pipelined data parallel algorithms: Concept and modeling. In ACM Int'l Conf. on Supercomputing, 1988.
[35] P. Kogge and H. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE-TC, C-22:786-793, 1973.
[36] M. Kumar. Measuring parallelism in computation-intensive scientific/engineering applications. IEEE-TC, 37(9):1088-1098, Sep. 1988.
[37] T. Li, T. Sauer, and J. Yorke. Numerical solution of a class of deficient polynomial systems. SIAM J. Numer. Anal., 24(2), Apr. 1987.
[38] T. Li, T. Sauer, and J. Yorke. The random product homotopy and deficient polynomial systems. Numer. Math., 51:481, 1987.
[39] T. Li and X. Wang. A more efficient homotopy for solving deficient polynomial systems. Submitted for publication.
[40] T. Li, H. Zhang, and X. Sun. Parallel homotopy algorithm for the symmetric tridiagonal eigenvalue problem. Accepted to appear in SIAM J. of Scientific and Statistical Computing.
[41] M. A. Marsan, G. Balbo, and G. Conte. A class of generalized stochastic Petri nets for the performance analysis of multiprocessor systems. ACM TOCS, pages 93-122, May 1984.
[42] P. Michielse and H. van der Vorst. Data transport in Wang's partition method. Parallel Computing, 7:87-95, 1988.
[43] A. Morgan and A. Sommese. A homotopy for solving general polynomial systems that respects m-homogeneous structures. Appl. Math. and Comp., 24:101-113, 1987.
[44] P. Nelson and L. Snyder. Programming paradigms for nonshared memory parallel computers. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms. The MIT Press, 1987.
[45] L. Ni, F. Salam, T. Tzen, X. Sun, and S. Guo. PowerCube: A software package for solving load flow problems. In Proc. of the 32nd Midwest Symposium on Circuits and Systems, Illinois, Aug. 1989.
[46] S. Nugent.
The iPSC/2 direct-connect communication technology. In Proc. of the Third Conf. on Hypercube Concurrent Computers and Applications, Jan. 1988.
[47] J. Ortega and R. Voigt. Solution of partial differential equations on vector and parallel computers. SIAM Review, June 1985.
[48] M. Quinn. Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, 1987.
[49] D. Robinson, X. Sun, and R. Enbody. A pipelined parallel approach to solving dense linear systems. In Proc. of the Fourth Conf. on Hypercube Concurrent Computers and Applications, March 1989.
[50] Y. Saad and M. Schultz. Data communication in hypercubes. Research Report YALEU/DCS/RR-42, Computer Science Department, Yale University, Oct. 1985.
[51] F. Salam, L. Ni, S. Guo, and X. Sun. Parallel processing for the load flow of large-scale power systems: The approach and application. In Proc. of the 28th IEEE Conf. on Decision and Control, Tampa, Florida, Dec. 1989.
[52] F. Salam, L. Ni, X. Sun, and S. Guo. Parallel processing for the steady state solutions of large-scale nonlinear models of power systems. In Proc. of the 1989 IEEE Int'l Symposium on Circuits and Systems (ISCAS), Portland, Oregon, May 1989.
[53] K. Sevcik. Characterizations of parallelism in applications and their use in scheduling. In Proc. of ACM SIGMETRICS and Performance '89, May 1989.
[54] J. Sherman and W. Morrison. Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix. Ann. Math. Stat., 20:621, 1949.
[55] Y. Shih and J. Fier. Hypercube systems and key applications. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence. McGraw-Hill, 1988.
[56] H. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. J. of ACM, 20(1):27-38, Jan. 1973.
[57] H. Stone. Parallel computers. In Introduction to Computer Architecture. SRA Inc., Chicago, Ill., 2nd edition, 1980.
[58] X. Sun and L. Ni. Another view on parallel speedup. In Proc. of Supercomputing '90, New York, NY, Nov. 1990.
[59] X. Sun, L. Ni, F. Salam, and S. Guo. Compute-exchange computation for solving power flow problems: The model and application. In Proc. of the Fourth SIAM Conf. on Parallel Processing for Scientific Computing, Dec. 1989.
[60] X. Sun, H. Zhang, and L. Ni. Parallel algorithms for solution of tridiagonal systems on multicomputers. In Proc. of the 1989 ACM Int'l Conf. on Supercomputing, Crete, Greece, June 1989.
[61] A. Veen. Dataflow machine architectures. ACM Computing Surveys, 18(4):365-396, Dec. 1986.
[62] H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Software, 7:170-183, June 1981.
[63] L. Watson. Numerical linear algebra aspects of globally convergent homotopy methods. SIAM Review, pages 529-545, Dec. 1986.
[64] F. Wu. Stability, security, and reliability of interconnected power systems. Large Scale Systems: Theory and Applications, 7:99-113, 1984.