This is to certify that the dissertation entitled Advanced Loop Parallelization: Dependence Uniformization and Trapezoid Self-Scheduling, presented by TEN-HWAN TZEN, has been accepted towards fulfillment of the requirements for the PhD degree in Computer Science.

Major professor

Date: July 30, 1992

ADVANCED LOOP PARALLELIZATION: DEPENDENCE UNIFORMIZATION AND TRAPEZOID SELF-SCHEDULING

By

Ten H. Tzen

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1992

ABSTRACT

ADVANCED LOOP PARALLELIZATION: DEPENDENCE UNIFORMIZATION AND TRAPEZOID SELF-SCHEDULING

By Ten H. Tzen

Due to the increasing demand for computational power in the areas of science and engineering, multiprocessor systems provide a promising approach to high speed computing. It is indisputable that loops are the major source of parallelism in programs. This thesis studies loop parallelization, which improves the execution of iterative loops on multiprocessors.

Any nested loop can be parallelized as long as all dependence constraints among iterations are preserved by applying appropriate synchronizations. However, the performance is significantly affected by the run-time overhead, which is mainly induced by synchronization and data transfer between processors. For nested loops with a uniform dependence pattern, loop netting provides a partitioning method which does not induce data transfer and synchronization between processors. This method is particularly suitable for multiprocessors with a local cache. In the case of irregular and complex dependence constraints, it is very difficult to efficiently and systematically arrange synchronization primitives. The proposed data dependence uniformization overcomes the difficulties in parallelizing a nested loop with irregular dependence constraints. The approach is based on the concept of vector decomposition. A simple set of basic dependences is developed of which all dependence constraints can be composed. The set of basic dependences is then added to every iteration to replace all original dependences, so that the dependence constraints become uniform.

To dynamically allocate iterations of a DOall loop to processors, load balancing among processors may be achieved at the expense of run-time scheduling overhead. By linearly decreasing the chunk size at run time, the best tradeoff between the scheduling overhead and a balanced workload can be obtained in the proposed trapezoid self-scheduling approach. Due to its simplicity and flexibility, this approach can be efficiently implemented in any parallel compiler. The experiments conducted on a 96-node Butterfly GP-1000 clearly show the advantage of trapezoid self-scheduling over other well-known self-scheduling approaches.

To my mother and to the memory of my father

ACKNOWLEDGEMENTS

I wish to thank the members of my Ph.D. guidance committee for their support during my Ph.D. study. I would particularly like to express my appreciation to Dr. Jesse Fang of Hewlett Packard Research Lab. for his helpful guidance in the area of parallelizing compilers, and to Dr.
Richard Enbody for his careful reading of this dissertation and his many useful comments. I want to thank my good friend Amaresh Joshi, who has painstakingly read through drafts of this thesis and helped me to make this presentation much more readable. I would like to acknowledge my friend Honda Shing for providing expertise when mine was lacking. I especially would like to thank Chia-Ying for helping me through some pressing moments. Finally, I would like to thank Dr. Lionel M. Ni for having been an excellent advisor. Without his guidance, enthusiasm, and patience, I would not have been able to complete this thesis work. I am very grateful for his support and encouragement.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 Compilation Support for Loop Parallelization
1.2 Parallel Constructs
1.2.1 DOall Loop
1.2.2 DOacross Loop
1.3 Motivation and Research Objectives
1.4 Thesis Organization
2 CROSS-ITERATION DEPENDENCE
2.1 Basic Concepts
2.2 Solving Dependence Equations
2.3 Iteration Space
2.4 Synchronization Among Iterations
3 TRANSFORMATION OF UNIFORM DEPENDENCE LOOPS
3.1 Sequentially Iterated DOall Loop
3.2 Loop Netting
3.3 Wave-Front Method
4 DATA DEPENDENCE UNIFORMIZATION
4.1 Program Model
4.2 Analyzing Cross-Iteration Dependence
4.2.1 Solution Space and Dependence Convex Hull
4.2.2 S-Test: A Dependence Slope Test
4.3 Dependence Decomposition
4.4 Index Synchronization for Uniform Dependence Loops
4.5 Performance Evaluation
4.5.1 Theoretical Speedup
4.5.2 Experimental Results
5 SCHEDULING OF DOALL LOOPS
5.1 Review of Loop Scheduling Schemes
5.2 A Model for Loop Scheduling
6 TRAPEZOID SELF-SCHEDULING (TSS)
6.1 Generic Approach: TSS(f, l)
6.2 A Conservative Approach: TSS(N/(2P), 1)
6.3 Mutual Exclusion Implementation
6.4 Memory Management
6.5 Experimental Results
6.5.1 Workload and Performance Metrics
6.5.2 Run-Time Behavior
6.5.3 Performance Measurement
7 CONCLUSION AND FUTURE RESEARCH
7.1 Summary
7.2 Future Plans
A FINDING THE DCH IN AN N-DIMENSIONAL SPACE
A.1 Basic Principle
A.2 Data Structure
A.3 Algorithm
BIBLIOGRAPHY

LIST OF TABLES

6.1 The number of chores generated by various self-scheduling schemes

LIST OF FIGURES

1.1 A Parallelizing Compiler Model
2.1 The dependence graph of Example 2.1
2.2 The dependence graph of Example 2.2
2.3 A multiple-level nested loop model
2.4 The iteration space dependence graph of Example 2.5
2.5 The dependence graph of Example 2.6
3.1 The iteration space of Example 3.1(a)
3.2 Loop netting transformation
3.3 The iteration space dependence graph of Example 3.5
4.1 The iteration space dependence graph of Example 4.1
4.2 A doubly nested loop program model
4.3 The DCH of Example 4.1
4.4 The Algorithm of Finding the DCH in a 2-Dimensional Solution Space
4.5 The characteristic of DSF
4.6 A Monotonic DSF
4.7 The solution space of Example 4.2
4.8 The solution space of Example 4.1
4.9 The algorithm of S-test
4.10 The uniformized dependence pattern of Example 4.1
4.11 The Index Synchronization Method
4.12 Experiment 4.1: granularity is 120 microseconds
4.13 Experiment 4.2: granularity is 560 microseconds
4.14 Experiment 4.3: granularity is 1600 microseconds
5.1 State diagram of a processor in loop scheduling
5.2 Chunk size function of a uniformly distributed loop for SS, CSS(k), and GSS schemes
6.1 Four typical uniformly and nonuniformly distributed loops
6.2 Chunk size function of a uniformly distributed loop for the TSS scheme
6.3 T(1) of the TSS(N/(2P), 1) under the loop in Fig. 3(d)
6.4 Static memory management
6.5 Event log profile of a uniformly distributed loop using the TSS
6.6 Event log profile of a uniformly distributed loop using the CSS
6.7 Event log profile of a uniformly distributed loop using the GSS
6.8 Event log profile of a decreasing workload using the TSS
6.9 Event log profile of a decreasing workload using the CSS
6.10 Event log profile of a decreasing workload using the GSS
6.11 Event log profile of a uniformly distributed loop using the TSS, where processors start at different times
6.12 Event log profile of a uniformly distributed loop using the CSS, where processors start at different times
6.13 Event log profile of a uniformly distributed loop using the GSS, where processors start at different times
6.14 Experiment 6.1: I=100000, L(i)=110 microseconds
6.15 Experiment 6.2: I=10000, L(i)=2 milliseconds
6.16 Experiment 6.3: I=3000, L(i)=[50 to 77000] microseconds
6.17 Experiment 6.4: I=3000, L(i)=[77000 to 50] microseconds
6.18 Experiment 6.5: I=10000, L(i)=[36(50%) or 1000(50%)] microseconds
6.19 Experiment 6.6: I=100000, L(i)=112 microseconds, processors start at different times
6.20 Experiment 6.7: P=80, I=100000, L(i)=112 microseconds
A.1 A CH3 in a 3-dimension space
A.2 A CH3 cut by an HP2 in a 3-dimension space
A.3 The data structure of the CH3 in Figure A.1

CHAPTER 1

INTRODUCTION

Due to the increasing demand for computational power in the areas of science and engineering, multiprocessor systems provide a promising approach to high speed computing.
A multiprocessor is a single computer system containing multiple processors which are capable of communicating and cooperating at different levels in order to solve a given problem [1]. More specifically, through some regular interconnection network, processors exchange information to share the workload of a given program so that the computation can be performed in a parallel manner.

However, programming and compiling for multiprocessor systems is a far more complex task than traditional sequential programming. To efficiently and correctly utilize a multiprocessor, one has to consider such issues as qualification and quantification of parallelism, scheduling of tasks, synchronization between processors, as well as memory management for spatial and temporal locality. Due to the complexity of parallel programming, parallelizing compilers, which automatically transform a serial program into a parallel form, have become more and more popular.

It is indisputable that loops are the major source of parallelism in programs [2]. Current parallelizing compilers pay much of their attention to loop parallelization. Thus, the iterations of a loop can be spread across processors by having different processors execute different iterations simultaneously. This is also referred to as microtasking [3]. A review of compilation support for loop parallelization will be given in Section 1.1, where the procedure of parallelizing a loop will be detailed. Loops are usually transformed to proper parallel constructs. As will be defined in Section 1.2, two major types of parallel constructs, DOall loops and DOacross loops, are widely applied to explicitly express parallelism. However, the performance is restricted by dependence constraints that are caused by the underlying data flow between iterations. This thesis addresses several new mechanisms to improve the execution of iterative loops on multiprocessors. Variable locality, run-time overhead, workload balancing, and even complicated and irregular dependence constraints are considered in this research. A more detailed description of the research objectives will be given in Section 1.3. Finally, an overview of this thesis is given in Section 1.4.

1.1 Compilation Support for Loop Parallelization

A parallelizing compiler explores parallelism and generates parallel code to take advantage of the underlying parallel architecture. Currently, most supercomputer vendors supply parallelizing compilers for their machines [4] [5] [6]. A number of commercial and experimental restructuring compilers are also in use, which include the University of Illinois's Parafrase [7] [8] [2], Rice University's PFC [9] [10] [11], KAI's KAP [12] [13], and Pacific Sierra's VAST [14]. They are mostly constructed by inserting a parallelizing phase into, or by adding a parallelizing preprocessor before, an existing sequential compiler.

Usually, parallelizing loops is implemented in the optimization phase of a sequential compiler. As shown in Fig. 1.1, there are three steps in the loop parallelization phase: dependence analysis, loop transformation, and task scheduling. The goal of dependence analysis is providing the necessary dependence information of a nested loop, which can be used to restructure the loop from a serial to a parallel form in the next step, loop transformation.

Figure 1.1. A Parallelizing Compiler Model
Analyzing data dependence among iterations involves solving linear Diophantine equations and the bounds of real functions. It is very time-consuming to exactly analyze dependence relations among iterations. Parallelizing compilers usually apply only in-exact tests that provide overly conservative dependence information. Thus, some parallelism is not exploited. A more detailed study of dependence analysis will be given in Chapter 2.

Generally speaking, loop transformation (also called loop restructuring) is applied to serve the following purposes:

1. manipulate nested loops in order to explore potential parallelism,
2. transfer nested loops to proper parallel code, and then
3. simplify the target parallel code to reduce run-time overhead.

Many well-known transformation techniques, such as loop interchange, loop distribution, loop fusion, loop fission, strip mining, loop partition, cycle shrinking, loop skewing, and so on, have been proposed and widely applied [15] [16] [17] [18] [19] [20]. In order to be valid, any transformation of a given nested loop must honor the dependence constraints in that nested loop. Since different transformations require different dependence information, a universal method of dependence analysis is an important issue in loop parallelization. Also, given a nested loop and a parallel architecture, there are many ways to transform it. Choosing an optimal order of transformations is still an open problem.

In practice, depending on the dependence relation among iterations, loops and nested loops are manipulated and then transferred to different types of parallel constructs for concurrent execution. A parallel construct is a segment of code in which parallelism is explicitly represented. Parallel languages, which provide various types of parallel constructs, have been proposed in recent years [21] [22] [9]. However, none has been widely accepted. In general, there are two major types of parallel constructs that are provided in all parallel languages: DOall loops and DOacross loops. A formal definition and detailed description of these constructs will be given in the next section.

After parallelism has been explored, the next problem for parallelizing compilers is how parallelism can be efficiently exploited. This is the so-called loop scheduling problem for multiprocessors [23]. The goal of loop scheduling is to distribute independent iterations to idle processors in the most equitable manner and with the least amount of overhead. This is not an easy task because the actual execution time of each iteration may differ due to network contention, mutually exclusive access, and complicated control flow. In addition, the amount of run-time overhead is machine and even application dependent [24] [25] [26] [27].

1.2 Parallel Constructs

Generally, an iteration, which denotes a series of statements in the loop body, is a unit of work assigned to a processor. Therefore, the dependence constraints inside an iteration can be ignored in parallelizing a nested loop. The dependence constraint among different iterations, called cross-iteration dependence, is our major concern. As indicated in the previous section, after dependence analysis and loop transformation, loops or nested loops are usually restructured to DOall loops or DOacross loops. A loop or a level of a nested loop can be transformed to a DOall loop (also called a parallel loop) [21] [22] if there is no cross-iteration dependence among iterations.
On the other hand, if some dependences occur among iterations, synchronization between the processors executing different iterations must be applied to preserve those cross-iteration dependences. This technique is referred to as a DOacross loop.

1.2.1 DOall Loop

Since all iterations are independent of each other, iterations of a DOall loop can be executed in any order, or even simultaneously, without any extra synchronization. Consider the loop in Example 1.1(a).

Example 1.1:
do I = 1, N
do J = 1, N
S1: A(I, J) = B(I, J) + C(I+3, J+3)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo
(a) Source code

do I = 1, N
doall J = 1, N
S1: A(I, J) = B(I, J) + C(I+3, J+3)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo
(b) DOall loop transformation

The DOall loop transformation fails at the I-loop because statement S2 in iteration i overwrites array element C(i, j), which statement S1 still requires in iteration i-3; a dependence of distance 3 is therefore carried by the I-loop. Fortunately, DOall transformation can be applied at the J-loop because, given a fixed I, the iterations of the J-loop are independent of each other. Thus, the second-level loop can be transformed to a DOall loop. The parallel code is listed in Example 1.1(b). Ideally, the peak speedup is equal to the degree of parallelism, N. The degree of parallelism here represents the maximal number of iterations that can be executed concurrently. Because parallelizing a DOall loop involves assigning iterations to available processors and barrier synchronization at the end of the loop, in practice the performance is significantly affected by scheduling overhead, barrier synchronization overhead, and workload imbalance [23] [28]. Also, if a cache or local memory is employed in a multiprocessor, variable locality is another important factor in the overall performance. More discussion on DOall loop transformation and scheduling will be presented in Chapter 3 and Chapter 5, respectively.

1.2.2 DOacross Loop

A loop is termed a DOserial loop if some cross-iteration dependences exist. To obey dependence constraints, a DOserial loop can be simply executed serially by a single processor. However, any nested loop can be executed in parallel as long as all cross-iteration dependences are satisfied. As mentioned, one such technique is the DOacross loop, in which some synchronization between the processors executing different iterations must be added to obey cross-iteration dependences. The DOacross technique is suggested by [29] and completely proposed in [30] [31]. It should be emphasized that all cross-iteration dependences need to be detected so that synchronization can be applied. This may be difficult for some loops that carry irregular and complex dependence relations among iterations. In addition, the performance is determined by the degree of parallelism and particularly by the synchronization overhead, which might be significant.

Consider Example 1.2(a). DOall transformation fails at both the I-loop and the J-loop. Alternately, DOacross transformation can be applied at the I-loop by using a post&wait synchronization primitive. The parallel code is listed in Example 1.2(b).

Example 1.2:
do I = 1, N
do J = 1, N
S1: A(I, J) = B(I, J) + C(I+2, J+3)
S2: C(I, J) = A(I, J+1) + C(I, J)
enddo
enddo
(a) Source code

array syn(N)
syn(*) = 0
doacross I = 1, N
do J = 1, N
if (I-2 >= 1) wait(syn(I-2), J-3)
S1: A(I, J) = B(I, J) + C(I+2, J+3)
S2: C(I, J) = A(I, J+1) + C(I, J)
post(syn(I), J)
enddo
enddo
(b) DOacross loop transformation

In the parallel code, the inner loop is executed serially. There is only one barrier synchronization required, at the end of the outer loop. The variable syn(i) is a synchronization variable which indicates the number of inner iterations finished at iteration I = i. Iteration i can proceed as long as it is not more than 3 steps ahead of iteration i-2. Since all iterations of the outer loop I can be started at the same time, the degree of parallelism in the above parallel code is N. Thus, if the synchronization overhead is ignored, the ideal speedup is equal to N.
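As an illustration of how the post&wait primitives of Example 1.2(b) might be realized on a shared-memory machine, the following C sketch busy-waits on a shared counter array. The names syn and N follow the example, but the use of C11 atomics, the spin loop, and the array sizes are assumptions made for this sketch; real systems typically provide these primitives in hardware or in the run-time library.

#include <stdatomic.h>

#define N 100
double A[N+1][N+2], B[N+1][N+1], C[N+3][N+4];
atomic_int syn[N+1];   /* syn[i] = number of J-iterations finished at I = i */

/* Body of one DOacross iteration I (1 <= I <= N); under the DOacross
 * construct each value of I would be handed to a different processor. */
void doacross_body(int I)
{
    for (int J = 1; J <= N; J++) {
        /* wait(syn(I-2), J-3): block until iteration I-2 has finished J-3 steps */
        if (I - 2 >= 1)
            while (atomic_load(&syn[I-2]) < J - 3)
                ;                                    /* spin */
        A[I][J] = B[I][J] + C[I+2][J+3];             /* S1 */
        C[I][J] = A[I][J+1] + C[I][J];               /* S2 */
        atomic_store(&syn[I], J);                    /* post(syn(I), J) */
    }
}

int main(void)
{
    /* Run serially here only for demonstration; the waits are then trivially met. */
    for (int I = 1; I <= N; I++)
        doacross_body(I);
    return 0;
}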
1.3 Motivation and Research Objectives

Summarizing the above statements, the major concerns of loop parallelization are listed as the following:

- compile-time cost
- degree of parallelism
- workload balancing
- data locality
- run-time overhead

With a reasonable cost at compile time, we hope that the maximal degree of parallelism can be exploited, the workload can be fairly distributed across processors, the data transfer between processors can be reduced, and the run-time overhead of scheduling and synchronization can be minimized. However, it is difficult, and not necessary, to consider all of the points listed above. For example, for some loops that have complicated dependence relations, we focus on systematically arranging synchronization primitives. For some loops that have much more parallelism than the number of processors, overhead and workload balancing should be the major concern. Based on the dependence relations of loops, we develop three new methods to improve the execution of iterative loops in multiprocessor systems.

First, consider nested loops that carry regular dependence relations. Many techniques have been proposed to deal with such loops. Most of them focus on exploring more parallelism. For a multiprocessor system, however, more parallelism does not always result in better performance, due to too much data transfer and synchronization induced between processors. In the first part of this research, we develop a transformation method which sacrifices some potential parallelism to avoid data transfer and synchronization between processors.

Second, consider nested loops that carry complex and irregular dependences. Due to the complicated dependence relations between iterations, it is difficult to efficiently arrange synchronization to preserve all cross-iteration dependences. Usually, such a loop is executed serially and no speedup is obtained. The second objective of this thesis is providing a new mechanism, which requires a reasonable compile-time cost and induces a small run-time overhead, to explore more parallelism from such nested loops.

Third, consider a DOall loop, which is the most important parallel construct in parallel programming. Since all iterations are independent of each other, there is no need for synchronization and data transfer between processors. However, due to the unpredictable execution time of iterations, our major concern in scheduling a DOall loop is how to evenly distribute the workload with the least amount of overhead. Some well-known scheduling methods are good at distributing the workload, but complicated. Some methods are simple, but not as good at balancing the workload. The last objective of this research is designing a practical scheduling scheme which can achieve better workload balancing while inducing a smaller run-time overhead than all other scheduling schemes currently used.

1.4 Thesis Organization

The rest of the thesis is organized as follows.
Chapter 2 gives a brief review of the dependence concept and some background on dependence analysis and synchronization. Some associated terms used in the presentation are also defined.

Chapter 3 discusses the parallel execution of nested loops with uniform dependence relations between iterations. We present the concept and disadvantages of sequentially iterated DOall loops, which are frequently generated by many restructuring techniques. A new method, namely loop netting, is proposed. Loop netting transforms nested loops into a DOall loop enclosing serial loops, which does not induce cache thrashing and barrier synchronization.

In Chapter 4, the attention is turned to parallelizing loops with a complicated dependence pattern. The new mechanism, data dependence uniformization, is proposed. We develop a new dependence analysis method to determine more accurate dependence information. Based on that information, irregular dependence relations are uniformized so that synchronization can be easily and efficiently arranged.

In Chapter 5, we study the scheduling problem of DOall loops in multiprocessors. Several well-known self-scheduling schemes are discussed, followed by a loop scheduling model. This model, which provides a basis for analyzing a loop scheduling scheme, is used to present our new scheduling scheme in the next chapter.

The practical Trapezoid Self-Scheduling (TSS) scheme is described in Chapter 6. By linearly decreasing the number of iterations assigned to processors, TSS achieves both the objective of small scheduling overhead and that of workload balancing. Due to the linearity, an efficient memory management is also proposed. The experiments conducted on a 96-node Butterfly GP-1000 give a comparison of TSS with other self-scheduling schemes.

Finally, the conclusion is presented in Chapter 7. Improvement of the work in this thesis is discussed and plans for future work are given.

CHAPTER 2

CROSS-ITERATION DEPENDENCE

Since an iteration is a unit of work assigned to a processor, cross-iteration dependences are the major constraints in loop parallelization. A cross-iteration dependence occurs if some data computed in one iteration is also used by another iteration. Data dependence analysis gives information about the underlying data flow of a loop. If no data flows across iterations, a loop can be directly parallelized. Otherwise, proper synchronization must be applied in the transformed parallel code to completely preserve cross-iteration dependences so that no semantic difference arises.

In this chapter, we first present basic concepts of data dependence. Then, some fundamentals of analyzing cross-iteration dependence will be given in Section 2.2. To clearly describe the dependence relations, a loop is usually represented graphically. In Section 2.3, we define the iteration space dependence graph and some associated terms that will be used in this presentation. Finally, some common synchronization primitives are reviewed in Section 2.4.

2.1 Basic Concepts

A program is a list of statements. Each statement has a set of input variables and a set of output variables. A data dependence occurs if an input or output variable of a statement is also an output variable of another statement. Let Si denote the i-th statement in a program (counting in lexicographic order). As presented in the literature, we use IN(Si) to denote the set of input variables of statement Si, and OUT(Si) to denote the set of output variables of Si. Let i < j. Corresponding to the three possible combinations
of output and input variables, there are three types of dependence between Si and Sj, listed as the following.

1. flow-dependence: OUT(Si) ∩ IN(Sj) is not empty.
2. anti-dependence: IN(Si) ∩ OUT(Sj) is not empty.
3. output-dependence: OUT(Si) ∩ OUT(Sj) is not empty.

Consider Example 2.1. There are three data dependences: the flow-dependence from S1 to S2, the anti-dependence from S1 to S3, and the output-dependence from S2 to S4. The dependence relations of a program are conveniently described by its dependence graph: a directed graph with nodes representing statements and arcs representing dependences between statements. Usually, we denote flow-dependence by a plain arc, anti-dependence by a cross arc, and output-dependence by an arc with a small circle on it. The corresponding dependence graph of Example 2.1 is shown in Fig. 2.1.

Example 2.1:
S1: A = B + C
S2: D = A * C
S3: B = B * 2
S4: D = C + 3

Figure 2.1. The dependence graph of Example 2.1

For statements inside loops, we are interested in dependence relations between statements and also between iterations. We distinguish dependence relations between different iterations by numbering arcs in a dependence graph. These numbers, called dependence distances in the literature, represent the distance of dependences. For instance, a plain arc with a value 3 from Si to Sj represents that statement Sj in iteration I+3 is flow-dependent on statement Si in iteration I. An arc with a value 0 represents a dependence inside an iteration, i.e., it is not a cross-iteration dependence. Consider Example 2.2, which is a loop including the same set of statements as in Example 2.1. The corresponding dependence graph is illustrated in Fig. 2.2. It can be seen that there is only one cross-iteration dependence, from S1 to S2, while the other two dependences can be ignored in loop parallelization.

Example 2.2:
do I = 3, 10
S1: A(I) = B(I) + C(I+1)
S2: D(I) = A(I-3) * C(I)
S3: B(I) = B(I) * 2
S4: D(I) = C(I+2) + 3
enddo

Figure 2.2. The dependence graph of Example 2.2

2.2 Solving Dependence Equations

As mentioned in the previous section, detecting dependences among statements involves taking the intersection of the corresponding IN and OUT sets. For scalar variables, this is straightforward. However, for array variables, testing for cross-iteration dependences involves checking whether two subscript expressions could take on identical values during the execution of a loop. Various dependence analysis algorithms have been extensively studied and proposed in the context of automatic loop parallelization of sequential programs [32] [9] [15] [33] [34] [35]. They are mostly based on the theory of linear Diophantine equations and the bounds on real functions.

Let Ij denote the loop index of the j-th level DO loop and ij denote a particular value of Ij. We assume that all loops are normalized, i.e., all index strides are equal to 1. Let the lower and upper bounds of I1 be the constants 1 and N, respectively. A d-level nested loop has the general form listed in Fig. 2.3, in which f1, ..., fk, g1, ..., gk, ℓ2, ..., ℓd, and u2, ..., ud are linear functions and A is a k-dimensional array. The question is whether there are two different sets of integer values, I = (i1, ..., id) and J = (j1, ..., jd), such that they satisfy the following k equations simultaneously.

do I1 = 1, N
do I2 = ℓ2(I1), u2(I1)
...
do Id = ℓd(I1), ud(I1)
A(f1(I1, I2, ..., Id), ..., fk(I1, I2, ..., Id)) = ...
... = A(g1(I1, I2, ..., Id), ..., gk(I1, I2, ..., Id))
enddo
...
enddo
enddo

Figure 2.3.
A multiple-level nested loop model.

f1(i1, ..., id) = g1(j1, ..., jd)
...                                          (2.1)
fk(i1, ..., id) = gk(j1, ..., jd)

If the answer is negative, clearly there is no cross-iteration dependence. If the answer is affirmative, there may be a cross-iteration dependence. However, since I and J must lie within the loop limits, the following inequalities must be satisfied as well.

1 ≤ i1, j1 ≤ N
ℓ2(i1) ≤ i2, j2 ≤ u2(i1)
...                                          (2.2)
ℓd(i1) ≤ id, jd ≤ ud(i1)

The algorithm for finding the general solution of a system of Diophantine equations can be found in [36]. Depending on the number of levels, d, the algorithm is time-consuming. Most parallelizing compilers apply only in-exact tests that may give a too conservative answer, i.e., a yes is answered while actually there is no dependence. Two of the most common in-exact tests are the GCD test and the Banerjee bound test [32]. The GCD test determines whether Eq.(2.1) has an integer solution, ignoring the loop bounds in Eq.(2.2). The Banerjee bound test checks whether Eq.(2.1) has a real solution within the loop limits. For instance, consider Example 2.3.

Example 2.3:
do I = 1, 20
S1: A(3*I+10) = ....
S2: .... = A(6*I-5) ....
enddo

The corresponding Diophantine equation and inequality are listed below.

3*i - 6*j = -15
1 ≤ i, j ≤ 20

The GCD test ignores the inequality and checks whether the gcd of 3 and 6 divides 15. In this example, the answer is positive, which indicates that there are integer solutions. However, the integer solutions may not be located within the valid loop limits in many cases. Consider Example 2.4. The corresponding Diophantine equation and inequality are listed in Eq.(2.3).

Example 2.4:
do I = 1, 10
S1: A(I+1) = ....
S2: .... = A(2*I+12) ....
enddo

i - 2*j = 11
1 ≤ i, j ≤ 10                                (2.3)

The GCD test gives a positive answer, i.e., this loop cannot be transformed to a DOall loop. In fact, there does not exist any dependence between S1 and S2, since OUT(S1) = {A(2), A(3), ..., A(11)} does not intersect with IN(S2) = {A(14), A(16), ..., A(32)}. The Banerjee bound test computes the lower and upper bounds of the expression i - 2*j and checks whether 11 is located within them. In Example 2.4, the bound of the function i - 2*j is [-19, 8]. Thus, the bound test returns a correct answer, while the GCD test fails in this example.
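To make the two in-exact tests concrete, the following C sketch applies them to the single-subscript case used in Examples 2.3 and 2.4: a write reference A(a1*I + a0) against a read reference A(b1*I + b0) inside do I = 1, N. The helper names are assumptions made for this illustration; it is not the implementation of any particular compiler.

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) { a = abs(a); b = abs(b); while (b) { int t = a % b; a = b; b = t; } return a; }
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Dependence test for A(a1*i + a0) written and A(b1*j + b0) read, 1 <= i, j <= N.
 * The dependence equation is a1*i - b1*j = b0 - a0.  Returns 1 if a dependence
 * cannot be ruled out, 0 if independence is proved. */
static int may_depend(int a1, int a0, int b1, int b0, int N)
{
    int c = b0 - a0;
    if (c % gcd(a1, b1) != 0)              /* GCD test: no integer solution at all */
        return 0;
    /* Banerjee bound test: range of a1*i - b1*j over the real region [1,N]x[1,N] */
    int lo = imin(a1, a1 * N) - imax(b1, b1 * N);
    int hi = imax(a1, a1 * N) - imin(b1, b1 * N);
    return (lo <= c && c <= hi);           /* c outside the bounds -> independent */
}

int main(void)
{
    /* Example 2.3: A(3I+10) vs A(6I-5), N = 20 -> dependence cannot be ruled out */
    printf("Example 2.3: %s\n", may_depend(3, 10, 6, -5, 20) ? "maybe dependent" : "independent");
    /* Example 2.4: A(I+1) vs A(2I+12), N = 10 -> the bound [-19, 8] excludes 11  */
    printf("Example 2.4: %s\n", may_depend(1, 1, 2, 12, 10) ? "maybe dependent" : "independent");
    return 0;
}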
2.3 Iteration Space

To clearly describe the dependence relations among iterations, informally, a loop can be said to describe an iteration space. Let each level of the nested loop correspond to an axis of the iteration space. A d-level nested loop then describes a d-dimensional discrete Cartesian coordinate system, and each iteration of this loop corresponds to an integer point in this d-dimensional space. Let Ij denote the loop index of the j-th level DO loop and ij denote a particular value of Ij, as before. The iteration space of a nested loop is formally defined as the following.

Definition 2.1 Let L be a normalized d-level nested loop of the form in Fig. 2.3. The corresponding iteration space of L, namely Γ(L), is equal to

{(i1, ..., id) | 1 ≤ i1 ≤ N, ℓj(i1) ≤ ij ≤ uj(i1) for 2 ≤ j ≤ d, ij ∈ Z+ for 1 ≤ j ≤ d}

where Z+ is the set of non-negative integers and N is the upper bound of the first-level DO loop.

A vector v = (i1, i2, ..., id) is an iteration vector of the loop L if v ∈ Γ(L). Each iteration vector (i1, i2, ..., id) corresponds to a series of statements in the loop body, S1, S2, ..., St, for (I1, I2, ..., Id) = (i1, i2, ..., id).

A cross-iteration dependence from iteration vector v1 to iteration vector v2 can be represented by a dependence vector d = v2 - v1. A dependence vector is a directed edge in the iteration space in which the node at the head must be executed after the node at the tail. The dependence vector set of a nested loop L, namely D(L), is a vector set which contains all distinct dependence vectors in the iteration space. Each iteration may have its own dependence vectors. If all iterations of a loop have the same set of dependence vectors, D, it is a uniform dependence loop. The definition is formally described as the following.

Definition 2.2 A nested loop, L, is called a Uniform Dependence Loop (UDL) if, for any v ∈ Γ(L), v depends on v - d for all d ∈ D(L) such that v - d ∈ Γ(L), where D(L) is the dependence vector set and Γ(L) is the iteration space of L.

Example 2.5:
do I = 1, 5
do J = 1, 5
S1: A(I, J) = B(I, J) + C(I+1, J+1)
S2: C(I, J) = A(I, J) + C(I, J)
enddo
enddo

We close this section with the doubly nested loop in Example 2.5. The 2-dimensional iteration space is shown in Fig. 2.4, in which the dotted arrows show the semantic execution order of iterations and the solid arrows represent dependence vectors. Since the flow-dependence from S1 to S2 does not traverse iterations, there is only one cross-iteration anti-dependence, from S1 to S2, which occurs in every iteration. Thus, the loop is a UDL with the dependence vector set {(1, 1)}.

2.4 Synchronization Among Iterations

It is clear that any loop can be executed concurrently as long as the dependences among iterations are preserved. To preserve those cross-iteration dependences, additional code must be added to ensure proper synchronization between the processors executing different iterations. In the worst case, synchronization from the bottom of one iteration to the top of the next iteration results in a serial execution.

Figure 2.4. The iteration space dependence graph of Example 2.5

There are several types of synchronization primitives used in DOacross loops [37] [38]; the choice depends on the target machine's architecture and the dependence pattern of the loop. In this section, we survey three major types of synchronization: post&wait, critical section, and barrier. We discuss their strengths and weaknesses with some simple examples.

Example 2.6:
do I = 3, N
S1: A(I) = B(I) + C(I)
S2: C(I) = E(I+1) + C(I)
S3: D(I) = A(I-2) * 2
S4: E(I) = A(I) * B(I)
enddo

Consider Example 2.6. The dependence graph is shown in Fig. 2.5, in which only cross-iteration dependences are considered. This is clearly a uniform dependence loop with dependence vector set D = {(1), (2)} in a single-dimensional iteration space.

Figure 2.5. The dependence graph of Example 2.6

By using the post&wait mechanism, the above loop can be transferred to a DOacross loop as below.

Example 2.7: (Parallel code of Example 2.6)
post(Asyn, 1)
post(Esyn, 2)
doacross I = 3, N
S1: A(I) = B(I) + C(I)
post(Asyn, I)
S2: C(I) = E(I+1) + C(I)
post(Esyn, I)
wait(Asyn, I-2)
S3: D(I) = A(I-2) * 2
wait(Esyn, I-1)
S4: E(I) = A(I) * B(I)
enddo

As can be seen, the post&wait mechanism is flexible but may require many synchronization points and thus induce too much run-time overhead. Besides, if a statement is dependent on itself, post&wait is not applicable. Consider Example 2.8. Post&wait cannot be applied due to the flow-dependence from S4 to itself. In this example, a critical section is the only choice. However, it was demonstrated in [39]
However, it was demonstrated in [39] 23 that critical sections are a major factor of performance degradation in a shared mem- ory multiprocessor. The goal is to insert as few critical sections as necessary and to make them as small as possible. Generally speaking, only those statements involved in a data dependence cycle need to be enclosed in a critical section. Example 2.8: doacross I I 3, N 51: A(I) I A(I) I A(I) 32: 8(1) I 8(1) I 2 S3: C(I) I B(I) + A(I)I2 begin critical section S4: sun I sun + C(I) end critical section enddo The most restricted synchronization method is barrier synchronization which di- vides a loop into segments such that cross-iteration dependences occur only between segments, but not inside a segment. All statements of all iterations in a segment must be completed before the execution of any iteration in the next. segment is started. The principle of barrier synchronization can be simply illustrated by Fig. 2.5 in which the two cross-iteration dependences can be hold by a barrier synchronization (the dotted line). The parallel code is listed below. Example 2.9: (Parallel code of Example 2.6) doacross I I 3, N 31: A(I) I 8(1) + C(I) 32: C(I) I E(I+1) + C(I) barrier 83: 0(1) I A(I-2) * 2 54: 8(1) I MI) I 8(1) enddo 24 This primitive handles only lexically forward dependences. If dependence cycles exist, some techniques, such as node splitting [8] [9], may be employed to break depen- dence cycles. Statement reordering then can be used to change lexically backward dependences into lexically forward dependences [31]. Barrier synchronization also has been implicitly applied in many loop restructuring techniques. For instance, a barrier synchronization is essentially equivalent to a loop distribution [40]. In a large scale multiprocessor system, however, the barrier synchronization is one of the major factors to the performance degradation due to the hot-spot condition [41] and the atomic operation [42]. More discussion about barriers in loop restructuring will be given in the next chapter. CHAPTER 3 TRANSFORMATION OF UNIFORM DEPENDENCE LOOPS As indicated in Chapter 1, the major goal of loop transformation is to explore more parallelism. Based on the dependence information provided by dependence tests, loop transformation rearranges nested loops such that the implicit parallelism can be explicitly expressed by some parallel code. It is true in general that the more precise and complete dependence information is provided, the more efficiently that parallel code can be generated. Since exactly analyzing dependences is time-consuming and difficult, many proposed approaches deal with only uniform dependence loops that have a regular dependence pattern [43] [44] [45] [46] [18] [47] [48] [49] [50]. In this chapter, we study the transformation of uniform dependence loops. The dependence pattern of a uniform dependence loop usually is so simple and regular that the complete dependence vector set can be easily identified, and synchronization primitives can be systematically arranged. However, in addition to parallelism, there are two important concerns. First, in many multiprocessor systems, a local cache or local memory is employed to bridge the gap between fast processor cycle and slow 25 26 main memory access times. To reduce data transfer among caches and local memory of processors, data locality should be considered in restructuring nested loops. 
Second, to reduce run-time overhead induced by synchronization, selecting and arranging synchronization primitives is the other important issue for loop transformation. The rest of this chapter is organized as the follows. In Section 3.1, we discuss a sequentially iterated DOall loop which is a common parallel code generated by many transformation methods. In Section 3.2, we present a new partitioning method which does not involve much data transfer and synchronization among processors. Finally, the well-known wavefront method is reviewed for dealing with special uniform dependence loops with no explicit parallelism. 3.1 Sequentially Iterated DOall Loop Loop transformation uses the data dependence information to determine whether the existing dependences allow the 100p, or parts of it, to execute in parallel without violating the semantics. If no cross-iteration dependences exist, the loop can be simply transformed to a DOall loop. In most cases, however, dependences occur, and appropriate rearrangements are required to extract partial parallelism. It is common that a nested loop is transformed into a serial outer loop enclosing a number of DOall loops. This situation is referred to as sequentially iterated DOall loops in parallel programming [51]. Consider Example 3.1(a) in which a dependence cycle exists. The dependence vector set is { (1, 3), (3, 1)}. Since all dependences involved in this cycle are flow-dependences, this dependence cycle is not breakable. Thus, loop distribution cannot be applied [40]. 27 Example 3.1: do I I 3, N do J I 3, N 31: A(I, J) I B(I-l, J-3) . S2: B(I, J) I A(I-S, J-1) . . . enddo enddo (a) do I I 3, N doall J I 3, u 81: A(I, J) I 8(1-1, J-3) . . . S2: B(I, J) I A(I-3, J-1) . . . enddo enddo (b) Sequentially iterated DOall 100p In fact, the iterations in the second level loop are independent of each other. For example, by using selective cycle shrinking method [17], the parallel code is listed in Example 3.1(b). Assuming self-scheduling is applied [26] [25] [52], processors dy- namically fetch the next iteration of the DOall loop at run time. Due to the nonde- terministic run-time behavior, the assignment of iterations to processors is arbitrary and unpredictable. For instance, in the first iteration of the outer serial loop (I = 3), processor 1 is assigned to iteration vector (3, 3) and has array elements A(3,3), A(0,2), B(3,3) and'B(2,0) in its cache. Processor 2 is assigned to iteration vector (3,4) and has array elements A(3,4), A(0,3), B(3,4) and B(2,1) in its cache. Then, in the second iteration of the outer serial loop (I = 4), if processor 1 is assigned to iteration vec- tor (4, 7) and processor 2 is assigned to iteration vector (4,6), array element B(3,4) must be moved from the the cache of processor 2 to the cache of processor 1, and B(3,3) should be transferred in the opposite direction. The same behavior may occur again in the iterations I = 5,6, - - - ,N if the iterations, which access the same set 28 of data, are not assigned to the same processor. This phenomena, called cache or local memory thrashing, is one of major factors causing performance degradation in multiprocessor systems [53]. A partitioning method is proposed in [51] which employs a static scheduling scheme to partition sequentially iterated parallel loops with a uniform dependence pattern. It has been proved that the amount of data transfer between processors re- quired by this partitioning method is less than any other static partitioning methods. 
The other important factor in loop transformation is the amount of parallelism. In the above example, the degree of parallelism is equal to the bound of the second-level loop, M-3, which is possibly large enough. However, in a sequentially iterated DOall loop, barrier synchronization is implicitly applied in every iteration of the outer serial loop. If the granularity of the loop body is small, the run-time overhead induced by barriers may outweigh the benefit of the resulting parallelism, in particular when the outer loop bound is large and the inner loop bound is small.

3.2 Loop Netting

Rather than a sequentially iterated DOall loop, in this section we propose a new loop transformation technique, namely loop netting, to restructure a uniform dependence loop into a DOall loop enclosing a number of serial loops. Our new method involves neither cache and local memory thrashing nor barrier synchronization. We start our presentation by considering the UDL in Example 3.1(a). The iteration space dependence graph of Example 3.1(a) is partially shown in Fig. 3.1, in which the iteration space is composed of many independent sets. All iteration vectors in each set are chained by the two dependence vectors, (3, 1) and (1, 3), and are independent of the iteration vectors in the other sets. To make the figure clear, only two sets are displayed: one chained by solid arrows and the other chained by dotted lines. The formal relation between iteration vectors in the same set is defined as the following.

Definition 3.1 Given two iteration vectors u and v, v is reachable from u via dependence vector set D = {d1, d2, ..., dk} if there exist c1, ..., ck ∈ Z such that v - u = c1*d1 + c2*d2 + ... + ck*dk, where Z is the set of integers.

Lemma 3.1 The relation reachable from is symmetric and transitive.

Proof: Let the dependence vector set be D = {d1, d2, ..., dk}.

Symmetric: If v is reachable from u via D, then v - u = c1*d1 + c2*d2 + ... + ck*dk, where c1, ..., ck are integers. So u - v = -c1*d1 - c2*d2 - ... - ck*dk. Thus, u is also reachable from v.

Transitive: If v is reachable from u and u is reachable from w via D, we have

v - u = b1*d1 + b2*d2 + ... + bk*dk, and
u - w = c1*d1 + c2*d2 + ... + ck*dk,

where b1, ..., bk, c1, ..., ck are integers. Combining the above two equations, we have

v - w = (b1 + c1)*d1 + (b2 + c2)*d2 + ... + (bk + ck)*dk.

Thus, v is reachable from w. □

It is clear that if two iteration vectors are not reachable from each other, then there is no cross-iteration dependence between them. Note that the other direction is not necessarily true. For example, iteration vectors (3, 3) and (3, 11) are reachable from each other in Figure 3.1, but they are independent.

Figure 3.1. The iteration space of Example 3.1(a)

Each independent set, called a
Definition 3.2 Given a UDL with dependence vector set D, the iteration vector set {v}, - - - ,5;} is a net in I‘(L), if if,- is reachable from 23',- via dependence vector set D foranylSi,j$t. Theorem 3.1 Given a UDL with dependence vector set D, all nets in F(L) are independent of each other. Proof: The proof is done by contradiction. . Let D = {.113}, . - - ,d';} and a e T, and s e 7",, where T, and T2 are two distinct nets in I‘(L). Assume ii is dependent on ii. There exist non-negative integers c1, - - - , ck such that ii - ii = Cl I d; + c; * d; + ~ - - + c], * d1. From Definition 3.1, if is reachable from ii via D. Since the relation reachable is symmetric and transitive (Lemma 3.1), all iteration vectors in T1 are reachable from any one in T2. Thus T1 and T2 are in the same net. D Since nets are independent of each other, if all iterations in a net are performed by a single processor, no data transfer and synchronization between processors are required. The principle of loop netting is that a processor, which is assigned to an independent net T, starts executing from an initial iteration vector, namely entry point, and scans the rest of the iterations in T by following the direction of the dependence vectors. The most important part of this approach is finding all entry points. Obviously, the number of entry points is equal to the number of independent nets in the iteration space. We formally define the set of entry points as the following. Definition 3.3 Given a UDL, L, with dependence vector set D, a set of vectors E is an entry set in I‘(L), if the following two conditions hold. 32 (1 ) For any two distinct e'}, 6;- 6 E, 6',- is not reachable from e“;- via D. (2) For any ii 6 I‘(L), there exists a E 6 E such that ii is reachable from 6 via D. Consider Figure 3.1. Intuitively, the integer points inside the parallelogram which is enclosed by (3,3), (4,6), (6,4), (7,7) form an entry set. There are a total of 8 points in that parallelogram,i.e., there are 8 independent nets in Figure 3.1. However, due to irregularity, it is not a straightforward task to completely and mutually exclusively fetch integer points inside a parallelogram at run—time. We develop a simple and efficient method which locates the other set of entry points that are enclosed in a rectangle. For instance, the 8 points in the dotted rectangular form the other entry set in Figure 3.1. Our approach is formally described by the following lemmas and theorems. Lemma 3.2 If ng($1,$2, - - - ,xk) = g, then there exist a1,a2, - ' - ,ak such that al a: $1 + "' +ak*$k =9 when mlix2au'ixki aliGZiu'iak 6 Z Proof: This is a basic property of the greatest common divisor [54]. D Lemma 3.3 If gcd(.r1,:c2, ~ - ~ ,xk) = g, and a1 * z; + - ~ - + a). ,* an, = G, then there existc such thatG=c*g, where zl,x2,---,xk, a1,ag,---,a;,, andce Z. Proof: This is a basic property of the greatest common divisor [54]. 13 Theorem 3.2 Given a doubly nested UDL, L, with dependence vector set D = {(u1,v1),(u2,v2)}. The set E ={(z,y)]15 1? < 9. 1 S y < h, 3,3! 6 Z} , "1 01 is an entry set, where g = gcd(u1, 11.2), h = w/ g and w = 112 v; 33 Proof: Part 1: we prove any point (3:, y) E E are not reachable from each other. First, we prove that any two entry points (2:1, yl), (22, y;) 6 E are not reachable from each other, if z; 75 2:2. The proof is done by contradiction. Assume there exist a, b 6 Z such that 2:1 — 2:2 = a I u] + b I 112. From Lemma 3.3, we have 21 — 1:2 = csg where c 6 Z. But, we know 2:1 - 2:2 < g. 
Second, we prove that any two entry points (2:,y1), (2:,y2) E E are not reachable from each other, if y] gé yg. Again, the proof is done by contradiction. Assume there exist a, b E Z such that 32-: =a*u1+b*u2==0 (3.1) yl—yz =a*v1+b*v2 Thus, a = -b* EI' To ensure a 6 Z, we have b= c* 1‘;- and a = —c* 291 where c6 Z. Substituting a, b into Eq.(3.1), we have c c yl—y2=5*(ul*v2—u2*vl)=3*w=6*h (3.2) Part 2: we prove that any point (i, j) can be reached from (:r,y) via D, where (1:,y) 6 E. A Let so = —‘-"? and b0 = 2,1. We have a0 I (u1,v1) + b0 I (U2,‘Ug) = (0, f) = (0, h). Let i’ = i mod 9 and j’ =j mod h. There exist c.-, c, E Z such that i =c,-*g+i’ (3.3) j=CjIh+f From Lemma 3.2, there exist al, In 6 Z such that al I u1 + b1 I ([2 = c, I g. Let (01*01+b1*02) mod h = v’, there exist c. e Z such that also] +b1atv2 = chh+v’. 34 Case 1: j’ 2 v’. Let (1:,y) = (i’,j’ — v’) e E. We prove there exist two integers (a, b) = ((Cj - c.) I a0 + a1, (Cj — c.) * b0 + b,» such that a a: (u1,v1) + b I (u2,v2) + (:r,y) = (i,j). Compute i-part by the following equations. (c,-—c,,)*ao*u1+a1*u1+(cj—c,)*bo*u2+b1*u2+i’ = (Cj—Cu)*(ao*u1+bo*u2)+al*u1+b1*u2+i’ (3.4) = 0+q*g+¥ = 2 Compute j-part by the following equations. (cj—cv)*ao*v1+a1*vl+(cj—cv)*bo*v2+b1*v2+j’—v’ = (cj—c.)*(ao*v1+bo*v2)+a1*vl+b,*v2+j’—v’ (35) = (cj-ct)*h+ct*h+v’+j’—v’ \ = I Case 2: j’ S v’. Let (x,y) = (i’,j’ — v’+ h) 6 E. We prove there exist two integers (a, b) = ((Cj - c, - 1) *00 +01, (01' — c, ‘ 1) * b0 + 51)) such that a I (u1,v1) + b * (u2,v2) + (z,y) = (i,j). The proof is similar to in Case 1. Cl Theorem 3.3 Given a doubly nested UDL, L, with dependence vector set D = {(u1,vl),(ug,v2)}. The set E ={(:c,y)|15 a: < h, 15 y < g, 3,31 6 Z} ul 01 is an entry set, where g = gcd(v1,v2), h = w/g and w = 112 v2 35 Proof: The proof is similar to the proof in Theorem 3.2. CI Corollary 3.1 Given a doubly nested UDL, L, with dependence vector set D = {(u1,v1), (u2,v2)}. The number of independent nets is w = "1 v1 u; v; Now the last problem is how to scan a net from an entry point defined in The- orem 3.2,i.e., how to restructure the loop to parallel code. The scanning algorithm actually is embedded in the proof of Theorem 3.2. From an entry point, a processor jumps h steps in the second level loop until the bound is reached. Then, it jumps g steps in the first level loop by following dependence vectors to reach the other point which is in the same net. This algorithm results in a semantic order of serial execu- tion in a net. Thus, dependence constraints are naturally preserved. A loop netting transformation for a doubly nested UDL with two dependence vectors is shown in Figure 3.2. Note that loop coalescing can be applied to reduce two DOall loops to a singly DOall loop, if necessary. For instance, Example 3.1(a) can be transformed to Example 3.2 by using the loop netting method. Since g = 1, the second DOall loop in Figure 3.2(b) is not necessary. For example, the execution order of the first net, which is started from the entry point (3, 3), is (3, 3),(3,11),(3, 19), - - - , (4, 6), (4, 14), (4,22), - - -. Example 3.2: Parallel code of Example 3.1(a) in loop netting doall J1 I 3, 10 do I I 3, N, 1 do J I J1, H, 8 81: A(I, J) I B(I-1, J-3) . . . 82: B(I, J) I A(I-3, J-1) . . . enddo J1 I (J1 + 3 - 3) mod 8 + 3 enddo enddo 36 do I I n, N do J I m, N \* Loop Body I\ 81: A(I, J) I B(I-ul, J-vl) . S2: B(I, J) I A(I-u2, J-v2) . . . 
(a) Source code /* a, b are any integers such that aIu1 + qu2 I g *l g I gcd(u1, u2) h I (ulIv2 - u2Iv1)/g doall I1 I n, n+g-1 doall J1 I m, m+h-1 do I I I1, N, g do J I J1, H, h \I Loop Body *\ $1: . A(I, J) I B(I-ul, J-v1) . 32: B(I, J) I A(I-u2, J-v2) . enddo J1 I (J1 + (a*v1 + b*v2) - m) mod h + m enddo enddo enddo . (b) Parallel code in loop netting Figure 3.2. Loop netting transformation 37 The above theorems deal with UDLs with two dependence vectors. If a doubly nested UDL has only one dependence vector, a virtual dependence vector can be added. The choice of the virtual dependence vector should result in as much paral- lelism as possible. Let (u,v) be the only dependence vector, and [n,N] and [m,M] be the ranges of the first level loop and second level loop, respectively. Two good candidate virtual dependence vectors are (0, M - m + 1) and (N - n + 1,0). The choice is determined by the number of resulting independent nets. For instance, con— sider Example 3.3(a) in which D = {(2,1)}. By applying Theorem 3.2, (0,100) which results in 200 independent nets is the better choice. Since M = h = 100, the second level serial loop in Figure 3.2(b) is not needed. The parallel code is listed in Example 3.3(b). Example 3.3: do I I 1, 100 do J I l, 100 \* Loop Body *\ Sl: A(I+2, J+l) ' B(I, J) . . (a) Source code doall II I 1, 2 doall J I 1, 100 ‘ do I I I1, 100, 2 \* Loop Body *\ 51: A(I+2, J+1) I B(I, J) . . . J I (J + 1 -1) mod 100 + 1 enddo enddo (b) Parallel code in loop netting 38 On the other hand, if the number of dependence vectors is larger than 2, Theo- rem 3.2 must be repeatedly applied to extract various entry sets. Again, by applying the principle of greatest common divisor, the final entry set can be obtained. An extended theorem dealing with doubly nested UDLs with many dependence vectors is described as the following. Theorem 3.4 Given a doubly nested UDL, L, with dependence vector set D = {(241, v1), - - - , (ud, vd)}. Let g = gcd(u1, - - - , ud), h = gcd(hu, h13, . - - , hm). The set E={(z.y)|152 0) or (c1 = 0 and c; > 0). The dependence vector set D =~ {(c1,c2)]. Let I‘ be the iteration space and (q1(x,y), q2(x,y), q3(x,y), q4(x,y)) be the general solution of the corresponding Diophantine equations of L. According to the definition above, 9303,31) - 4107,31) = c1 9403,11) — 92(1'3 y) = c2 Let (X, Y) = (q3(x,y),q4(x,y)). The general solution can be normalized to (X - c1,Y - Cg,X, Y). Thus, for any (i,j) 6 I‘, (i,j) depends on (i,j) — (c1,c2) if (i,j) - (01,62) 6 I‘. So, the loop L is a UDL. Case 2: (c1 < 0) or (c1 = 0 and 62 < 0)- fll a. 47 The dependence vector set D = {(—c1,—C2)}. The proof is similar to Case 1. 0 Consider Example 3.4 in Chapter 3. There are two systems of Diophantine equa- tions. i1 =i2+1 i1 =i2 I1 =le .71 =j2+l The general solution of the first system is (x,y,x + 1,y). Thus, the dependence vector function is a constant vector (1,0). For the second system, the dependence vector function is another constant vector (0,1). Therefore, the loop is a UDL with dependence vector set {(1, 0), (0, 1)}. 4.2 Analyzing Cross-Iteration Dependence As observed in the previous section, finding exact data dependences involves solving Diophantine equations. After the general solution of Eq.(4.1) is computed, the system of inequalities (Eq.(4.2)) must be satisfied. In this section, we propose a new method to apply the constraints (Eq.(4.2)) to the general solution (Eq.(4.3)) so that more precise dependence information, a domain of dependence vectors, can be explored. 
4.2.1 Solution Space and Dependence Convex Hull Let R denote the set of real numbers. The two free variables of a general solution (Eq.(4.3)), x and y, form a 2-dimensional space, namely solution space 9 where 9 = {(x,y)]x,y E R}. Every integer point (x,y) in 9 corresponds to two vectors, (91(3,y),qz(x,y)) and (q3(x,y),q4(x,y)), which possibly forms a dependence vector. However, due to the bounds of the nested loop (Eq.(4.2)), only a limited region in the solution space is valid. In a multiple-dimensional solution space, this region is a 48 convex hull, called Dependence Convex Hull (DCH). Thus, every integer point in the DCH certainly corresponds to a dependence vector in the iteration space. Substituting Eq.(4.3) into Eq.(4.2), the DCH is defined as the following. Definition 4.2 Let L .be a doubly nested loop with the form in Fig. 4.2. The Depen- dence Convex Hull of the loop L, namely DCH(L), is the region H, where H is equal to {(x.y)l1 S May) S New 6 R} n {($,y)ll(qt($.y)) S (May) S U(ql(z.y)). m: 6 R} f) {(I.y)|1 S (1303,31) S New 6 R} n {($,y)|€(t13(z.y)) S May) S u(qa(a=.y)). any 6 R} where (q1,q2,q3,q4) is the general solution of Diophantine equations (1.1). y .,."':.x+y-I=I(XX) I [I ."I’. I IX+Y' 1 =1 FIMIIO:'J ’03:} 25,; “:12; ‘ y=1 . x=l “500 xélmo Figure 4.3. The DCH of Example 4.1 49' Consider Example 4.1 again. The general solution is (x, y, —-x + y — 1, 2x). Ac- . cording to the definition above, the DCH is listed as the following. {(x,y)]1 S x S 1000,x,y E R} n {(aym s y 3 1000.3.) 6 R} n {(xwlll S -a=+ y -1S1000.x.y 6 R} n {(x,y)|1 S 2x S 1000,x,y E R} The DCH in the solution space is also shown in Fig. 4.3 in which the DCH is bounded by four nodes, (1, 3), (1, 1000), (500, 1000), and (500, 502). Each integer point in the DCH corresponds to a dependence vector in the iteration space in Fig. 4.1. For instance, the point (1, 5) in the DCH is corresponding to the dependence vector from (1, 5) to (3, 2) in the iteration space. Lemma 4.2 If the dependence convex hull (DCH) of a nested loop L is empty, then no cross-iteration dependence occurs in L. Proof: The proof is done by contradiction. Let (q1(x,y),q2(x,y),q3(x,y),q4(x,y) be the general solution of the corresponding Diophantine equations of L. Assume there exists a dependence between iteration (inji) and iteration (ig,j2). There must exist two integers a,b such that (i1,j1) :- (q.(a, b),.,,(., b)) and (5,, j.) = (q3(a, b), q.(a, 1)). Also, since both (11,13) and (1,, j.) are in the iteration space, the following inequalities hold. 1 S QI(ai b) _<- N l(q:(a. b)) S q2(a.b) S u(311(a.b)) 1 S 93(aib) S N I(q3(ai b)) _<_. Q4(ai b) _<— u(q3(a’ b» 50 Thus, (a, b) is in the DCH. D If an inequality, q(x,y) S 0 (or q(x,y) 2 0), represents a half space, a DCH actually is the intersection of eight half spaces in the solution space 9. To efficiently compute the intersection of half spaces, a DCH is internally represented by a ring structure composed of the corners (or called nodes) of the convex hull. Initially, the DCH is set to be an infinitely large region (in practice we select 4 corners). Then, each half space is applied to cut the DCH. In our algorithm, the valid half space is defined as zoom-0 While the rest of solution space is called zoom-1. The evaluation of zoom value‘for a point (x,y) e 9 is simply checking the value of the function, g(x,y). If all nodes of a DCH are located in zoom-0, the intersection is the whole DCH. On the other hand, if all nodes are located in zoom-1, the intersection is empty. 
In case that nodes are located in both zoom-0 and zoom-1, the intersection points are computed and inserted into the ring structure. Then, the ring is scanned once again to remove the nodes in zoom-1. A detailed C-like algorithm to identify a DCH in a 2-dimensional solution space is listed in Fig. 4.4. Since each cut can increase at most one node, the maximal number of DCH nodes of a doubly nested loop is 12 (original 4 nodes plus 8 cuts). The complexity, in worst case, is 0(n2) where n is the number of half spaces. According to our experiments on a SPARE-2 workstation, this algorithm can be finished in about 350 psec when the number of half spaces is 8. 4.2.2 S-Test: A Dependence Slope Test Based on the general solution of Diophantine equations and the constraints derived from loop index bounds, we have defined and explained the concept of the solution space and the dependence convex hull (DCH). In this section, we propose an S—test to identify the bounds of dependence slopes of dependence vectors in a 2-dimensional iteration space. The dependence slope of a dependence vector, ii = (i, j), is defined 51 Input: A list of 8 half spaces (Def.(2)) Output: A DCH struct node { float (x, y) ; int zoom ; struct node *next ; } ; max = 9999999 ; begin Build the initial DCH ring which is composed of four nodes: (x1,y1) = ( max, max); ($02,112) = ( max. -maX); (xg,y3) == (~max, -max); (34,114) = (’max, max); while (the input list is not empty) Pop a half space from the list, named HS ; Scan the ring Determine the zoom value for each node ; if ((x,y) 6 HS) then zoom = 0 ; else zoom == 1 ; if (the zoom is different from previous node) then Compute the intersection point and give it zoom = 0 ; Insert it into the ring between the current node and the previous node ; Scan the ring again Remove the nodes with zoom = 1 ; if (the ring is empty) STOP ; end while end Figure 4.4. The Algorithm of Finding the DCH in a 2-Dimensional Solution Space 52 as j /i. The general formula of dependence slopes, called Dependence Slope Function (DSF), is defined as follows. Definition 4.3 Given a doubly nested loop L with the form in Fig. 4.2, the De- pendence Slope Function of L, namely DSF(L), is defined as s = $5], where (h1(x,y), h2(x, y)) is the the dependence vector function of L. Identifying the bounds of dependence slopes of a loop L is, therefore, transferred to computing the maximal and minimal values of DSF (L) in the DCH(L). The char- acteristic of a DSF can be explored by the equation below. h2(x,y) — h1(x,y) I s = 0 (4.4) Intuitively, Eq.(4.4) is a line in the solution space 6, in which all points have the same DSF value 3. If line h1(x, y) = 0 is not in parallel with line h2(x,y) 2: 0, line Eq.(4.4) passes the intersection point of line h1(x,y) = 0 and line h2(x,y) = 0 for all s E R. The geometrical relation between Eq.(4.4) and h; = 0 and hg = 0 is illustrated in Fig. 4.5(a). When 3 is increased from —oo to +00, by following the arrow, line Eq.(4.4) rotates from h] = 0, passing ’12 = 0, and back to h = 0. If line h1(x,y) = 0 is in parallel with line h2(x,y) = 0, Eq.(4.4) is a line in parallel with both line h1(x,y) = 0 and line h2(x,y) = 0. Let h: = a I h; + b, where a,b 61R and a > 0. The geometrical relation between Eq.(4.4) and h; = 0 and h; = 0 is shown in Fig. 4.5(b). Line Eq.(4.4) shifts from h, = 0, passing h; = 0, and goes back h1 = 0, when s is increased from —oo to +00. 
It can be seen that, the s-value in Eq.(4.4) is continuously increasing except that it jumps from +00 to —00, when P3381113 h1(x,y) = 0 (both Fig. 4.5(a) and (b)). Definition 4.4 Given a doubly nested loop L with DCH (L) = H and DSF(L) = 2%)}:3, we say DSF(L) is monotonic in DCH(L) if the maximal and minimal values 53 h >0 / 2 . \ sag-Q hl >0 0 v40 1“1:0 112-0 112 >0 (3)111yh2 (b) 1.11/th, h2=a*h1+b Figure 4.5. The characteristic of DSF Of w in H are located on the boundary of H. Lemma 4.3 Given a doubly nested loop L with DCH (L) = H and DSF (L) = fig], if line h1(x,y) = 0 does not pass through H in the solution space, then DSF (L) is monotonic in DCH(L). Proof: Since h1 = 0 does not intersect with H, the whole convex region H must be in either the half space h1(x,y) > 0 or h1(x,y) < 0. Assume that H is in the side 0f I11 > 0. Let s] and s; be the maximal and minimal values of filfij in H. As shown in Fig. 4.6, both line b; — s; I h} = 0 and line h; — s; I h; = 0 are two tangents of the convex region H. Thus, 31 and s; can only be generated on the boundary of H. By Definition 4.4, DSF(L) is monotonic in DCH(L). 0 Lemma 4.4 Given a doubly nested loop L with DCH (L) = H and DSF(L) = fig”)! if all nodes of H are located in one of the following two half solution spaces 54 h2>0 H (b) h1// h2, h2=a*h1 + b Figure 4.6. A Monotonic DSF (1) {($.v)l(h1($.y) Z 0 and May) 2 0) or (h1(=r,y)> 0 and h2(=€,y) < 0),x,y e R} (2) {(x.y)l(h1(z,y) S 0 and h2($.y) S 0) or (May) < 0 and h2($,y) > 0),I.y 6 R}, then DSF(L) is monotonic in DCH(L). Proof: This lemma is almost the same as Lemma 4.3 except that the maximal value of m in H is possibly equal to +00. The proof is trivial and the same as h! (1.”) the proof above. . , ' D Clearly, if h is a nonzero constant (h, will never be zero in the solution space), DSF, fig], is always monotonic in any convex region where h is an arbitrary func- tion. As mentioned, the goal of our algorithm is to compute the maximal and minimal DSF(L) values in DCH(L). If DSF(L) is monotonic in DCH(L), the maximal and min- imal values can be easily located on the nodeslof DCH(L). 55 Theorem 4.1 Given a doubly nested loop L, if DSF (L) is monotonic in DCH(L), then the minimal (or maximal) DSF (L) value can be found on a node of DCH(L). Proof: Let s; be the minimal value of DSF(L) in DCH(L). Since DSF(L) is mono- tonic in DCH (L), 31 must be located on the boundary of DCH ( L). For a convex region, the boundary is composed of segments. Assume that 31 is generated on a boundary point, namely (a, b). Then, (a,b) must be on the line h; — 31 I h; = 0. If (a, b) is an end of a boundary segment, (a, b) is a node of DCH(L). - On the other hand, if (a, b) is a middle point of a boundary segment, the bound- ary segment which (a, b) located on must also lie on the line hz — s1 * h; = 0 due to the characteristic of tangents. Thus, 31 can also be generated from both end of that boundary segment, i.e.,sl can be found on a node of DCH(L). 0 Example 4.2: do II1, 10 do JI1, 1O A(2IJ+3, 1+1) I ... I A(2II+J+1, I+J+3) .... and do end do Consider the above example. The general solution of the corresponding Diophantine equations is (x, y, -x +2y+ 4, 2x - 2y — 6) and the DSF is %. By applying the algorithm in Fig. 4.4, the DCH is bounded by four nodes, (10,6.5),(10,3.5),(5,1), and (4.5,1). Let h; denote 2x -‘ 3y - 6 and h; denote —2x + 2y + 4. The DCH (shaded area) and both lines h, = 0 and h; = 0 (dotted lines) in the solution space are illustrated in Fig. 4.7. 
It can be seen that line h1(x, y) = 0 does not pass though the DCH, i.e.,the DSF is monotonic in the DCH. The maximal value of the DSF 56 in the DCH is 1.833 which is located on the node (10,3.5) while the minimal value, -0.3889, is generated by the node (10, 6.5). —>‘< 10 b2 lh1 =-o.3889 Figure 4.7. The solution space of Example 4.2. Apparently, for a loop L with DSF (L) = $23, if h] = 0 passes DCH(L), the range of DSF(L) in the DCH(L) is possibly from —00 to +00. However, a —00 dependence slope is invalid, because it implies that an iteration depends on its next iteration in a serial execution. Also, due to the characteristic of integer functions, the range of dependence slopes can be further reduced. Lemma 4.5 Given a doubly nested loop L with the form in Fig. 4.2, the valid range of DSF(L) is from (1 - M) to +00, where M is equal to maxlSi5N(u(i) - l(i)+1). The proof is straightforward. 57 Theorem 4.2 Given a doubly nested loop L with DSF(L) = $37} where h; is in parallel with ’12, ifDSF (L) is not monotonic in DCH(L), then the range of the DSF (L) is (a - b, +00) when b > 0 (a + b, +00) when b S 0 where h2=a*h1+b, anda,b€R. Proof: dependence slope s = .— Since a is a constant, we only have to consider the second factor 361-. Case 1: when 6 > 0. Clearly the maximal s is 00 when h1 = 0. Since h; is an integer function, the minimal 3 is a — b when h1 = —1. Case 2: when b S 0. From Lemma(4.5), s = -00 is invalid. The maximal s is 00 by reversing the direction of the dependence vector when h} = O. The minimal s value is a + b when h, = 1. Again, we illustrate the above theorem by Example 4.1 which has the general solution (m,y, —a:+y—1,2a:). The DSF is {3 where h = —2x+y—1 and ’12 = 2.1—y. TheDCH (shaded area) and both lines h1 = 0 and h; = 0 (dotted lines) are illustrated in Fig. 4.8 in which line h1(a:,y) = 0 clearly passes though the DCH. Thus, the DSF 58 is not monotonic in the DCH. However, since h: = —1 at h] - 1, by applying the above theorem, the range of the DSF is (-2, 00). Figure 4.8. The solution space of Example 4.1 Corollary 4.1 Given a doubly nested loop L with DSF (L) = fig;- where ’12 is a constant b, if DSF(L) is not monotonic in the DCH (L), then the range of the DSF(L) is (—b, +00) when b > 0 (b, +00) when b S 0 Proof: If h, is a constant b, then h, = o ... h, + b. By Theorem 4.2. the Corollary is proved. El 59 Theorem 4.3 Given a doubly nested loop L with DSF (L) = 23:1: where hl is not in parallel with hg. If DSF (L) is not monotonic in the DCH (L), then the range of the DSF(L) is (— min(M — 1,T),00), where T is the maximal absolute h; value of the corners of the DCH, and M is equal to max15;51v(u(i) -— [(2') + 1). Proof: Since the h1(:c, y) and h2(:c, y) are integer functions, the minimal nonzero absolute value of h; is 1. Therefore, if T is the maximal absolute h; value in the DCH, the dependence slope, ff, is always large than —T. However, by Lemma 4.5, the valid range of dependence slope is (1 — M, +00). So, the range of DSF is (- min(M — 1, T), 00). Cl Concluding the above theorems, the algorithm of S-test is listed in Fig. 4.9. The input of this algorithm are a DCH and a DSF, 7’3. We first determine whether the DSF is monotonic in the DCH by using Lemma 4.4. If the answer is yes, Theorem 4.1 is applied to compute the bounds of the DSF. Otherwise, depending on whether h; parallels with hz, Theorem 4.2 or Theorem 4.3 will be employed. The complexity, in the worst case, is 0(n) where n is the number of nodes in the DCH. 
Again, since a DCH is composed of at most 12 nodes for a doubly nested loop, the execution time is very small. 4.3 Dependence Decomposition In the last section, an exact dependence analysis method has been presented to iden- tify the range of dependence slopes of a doubly nested loop. The range of dependence slopes, namely (31, 82), implies that any vector (I, J), where I 6 2*, J E Z, is possibly a dependence vector as long as the following equation is satisfied. NH. IA Co N (4.5) 60 Input: A DCH and a DSF,(%:) Output: The bound of the DSF in the DCH begin if (both h; nd h; are constants) Return the constant slope fif- ; if (the DSF is monotonic in the DCH) (Lemma 4.4) Compute the maximal and minimal dependence slopes by Theorem 4.1; else if (hl in parallel with hg) Compute the maximal and minimal dependence slopes by Theorem 4.2 ; else Compute the maximal and minimal dependence slopes by Theorem 4.3; end if end if end Figure 4.9. The algorithm of S-test 61 In order to be valid, the group of vectors satisfying Eq.(4.5), called dependence cone, must be considered in the dependence uniformization. It should be emphasized that the dependence vector set is a subset of the dependence cone. In this section, we propose the approach dependence decomposition in which all vectors in the dependence cone can be decomposed into a set of basic vectors. The set of basic vectors is called Basic Dependence Vector set then can be added to every iteration to replace all original dependence vectors so that the loop become a uniform dependence loop. Definition 4.5 Given a doubly nested loop L, a set of vectors B = {b-;, b}, . . . , 5;} is defined as a Basic Dependence Vector (BDV) set of L, if for any 56 C, there exists c1,c2,...,cb 6 2", such that d: c; *b: +cz*b;+~-+cb*b;, where Z" is the set of non-negative integers and G is the dependence cone of L. Theorem 4.4 Any doubly nested loop L can be transferred to a uniform dependence loop with dependence vector set B, where B is a BDV set of L. Proof: Let D denote the dependence vector set of L, I‘ denote the iteration space of L, and B = {b-;,b;, . . . ,b7,}. We want to prove that for any two vectors 17,13 6 I‘, where 17 = 21; + d for any d6 D, 17 must be executed after :3, as long as all vectors in B are preserved in every iteration. Since D is a subset of the dependence cone, by definition, there must exist positive integers c1,c2,...,cb such that d: c1 *b: +Cz*b;+'°'+cb*l;bg T0 obey bi, tad-bi must be executed after 213. Accumulatively, a; + c, * b: must be executed after 215. Similarly, to preserve b: for 2 S i S b, 175+c1 *b: +c2 *b; must beexecuted after t3+c1 *b-i. 13+c1*b:+c2*b;+-~+cb*b;mustbeexecutedafter t3+¢1*b'i+02*b;+"-+Cb-i *bb:l 62 Thus, ii = iii + (T will be executed after 113. D The above theorem indicates that, in the existence of a BDV set, a nested loop can always be transferred to a uniform dependence loop. This transformation is called dependence uniformization. However, there are many BDV sets for a dependence cone. The choice is determined by the synchronization technique being applied to the nested loop. In general, less vectors in a BDV set, more efficient synchronization can be arranged. The goal of our algorithm is to find a BDV set which is composed of the least number of basic dependence vectors. Unless 31 = 32 in Eq.(4.5), a BDV set of a doubly nested loop is composed of at least two vectors. The following theorems derive formulas to compute most simple BDV sets for a given dependence cone in a 2-dimensional iteration space. 
To cooperate with the proposed theorems, a synchronization method which induces very small synchronization overhead will be presented in the next section. Theorem 4.5 Given a doubly nested loop L with a dependence slope range (31,32), the set of vectors {(0,1),(1, [slj)} is a BDV set ofL. Proof: Let (I, J) denote‘a dependence vector where I E Z+ and J 6 Z. We want to prove that if (I, J) is in the dependence cone (Eq.(4.5)), there must exist two positive integers, c1, e: such that (I, J) = 61 * (0:1) + 02 * (la [sill- Since I, J, and [51] are all integers, there must exist a positive integer r, such that J: [31J*I+r 63 Now, (I,J) can be decomposed as the following (LJ) =(I,l~91l*1+"l =(Iai-91J*Il+(0,rl =I*(1,[slj)+r*(0,1) I Now, the theorem is proved by selecting c1 = r and c2 = I . Figure 4.10. The uniformized dependence pattern of Example 4.1 Consider Example 4.1 in Section ?? again. The general solution is (z, y, —a: + y — l, 22:) and the DSF is fizz-1:151. By applying the S-test in the last section, the range of dependence slope is (-2, 00). A BDV set { (0, 1), ( 1, —2)} then can be easily computed by using Theorem 4.5. Thus, the irregular dependence pattern in Fig. 4.1 can be uniformized to a uniform dependence pattern with two fixed dependence vectors, 64 (0,1) and (1,-2). The uniformized dependence pattern in the iteration space is shown in Fig. 4.10. Theorem 4.6 Given a doubly nested loop L with a dependence slope range, (31,32), the set of vectors {(0, -—1), (1, [32]” is a BDV set ofL, if 32 74 00. Proof: The proof is similar to the proof of Theorem 4.5. Let (I, J) denote a dependence vector where I 6 Z‘l' and J E Z. We want to prove that if (I, J) is in the dependence cone (Eq.(4.5)), there must exist two positive integers, c1, c2 such that (I,J) = c1 *(0,-1) +c2 as (1, [32]). 315 % $82 31*15 J S32*I J S[32]*I Since I, J, and [32) all integers, there must exist a positive integer r, such that J = [32] a: I — r. Now, (I,J) can be decomposed as (I,J) = (I, [32] *I—r) =(I:[32l*1)_(0:r) = I a: (1, [32]) + r ... (0, —1). Thus, the theorem is proved by selecting c1 = r and c2 = I. Corollary 4.2 Any doubly nested loop can be uniformized to a uniform dependence 100;? with only two dependence vectors. 65 Proof: By using Theorems 4.1, 4.2 and 4.3, a dependence cone can be always computed for any doubly nested loop. Then, a BDV set which is composed of two vectors can be derived by using Theorem 4.5 or Theorem 4.6. D 4.4 Index Synchronization for Uniform Depen- dence Loops There are many ways to parallelize a uniform dependence loop. The performance is mostly dependent on the efficiency of the synchronization primitives which are used to preserve those uniform dependence constraints. For instance, in the wavefront method [45], a two-level uniform dependence loop is transferred to a sequential iterative DOall loop. As indicated in Chapter 3, that loop involves many barrier synchronizations. In this section, we present the index synchronization method to preserve the constraints due to‘ the basic dependence vectors derived in the last section. The new method which does not require any barrier synchronization and atomic operation is introduced first. Then, we will evaluate the performance in the next section. The principle of our method is very similar to the post69‘wait technique in which a processor has to wait for the signal posted by another processor to continue its execution. 
However, since there are only two basic dependence vectors in a uni- formized dependence pattern, the synchronization overhead can be further reduced. As shown in Theorem 4.5, the first basic dependence vector is (0,1), and the second basic dependence vector is (1, t) where the t is determined by the lower bound of the DSF. The basic idea of our approach is to group the iteration space along the direction of the first basic dependence vector (0,1). Then synchronizations between adjacent groups are made to obey the second basic dependence vector (1, t). More Specifically, the inner loop is grouped and executed serially, and the outer loop is 66 performed concurrently while synchronizations are applied. The synchronization be- tween adjacent groups may be implemented in a post8wait manner. Since every group only depends on the previous group by the direction (1, t), the post operation is not needed. The processor executing the j’th group can directly check the index variable of the (j —1)’th group, which represents the current working iteration in the (j -1)’th group, to decide how far it can proceed. In order to be valid, every index variable (except the first one) must be less than its previous index variable at least by 1 - t. The value 1 —t is defined as delay factor (b in our approach. Thus, if the delay factor 43 is positive, every processor (except the first processor) has to be behind the previous processor by at least at steps. On the other hand, if d) is negative it means that each processor (except the first processor) cannot be ahead of the previous processor more than —¢ steps. moo TATLLEATA. ’ l+t 1+: 1+1 1H 1000 4 999 3 998 2 997 l 996 i, - ' i ' i .L, 1 E l-t l-t 314 4~1-t (a) BDV={ (0,1). (1.1)} (b) BDV={ (0.-1).(1,t)} Figure 4.11. The Index Synchronization Method 67 The concept of index synchronization is also illustrated in Fig. 4.11(a). The dotted rectangle is a group being assigned to a processor. The first basic dependence vector is naturally preserved in the serial execution of a group while the second basic dependence vector is transferred to the index synchronization. The are from index variable ii to index variable i j+1 indicates that i,“ must be always less than i ,- by at least 1 ——t during the execution of this loop. Consider Example 4.1. The uniform dependence pattern, which has a BDV set {(0,1),(1, -2)}, is shown in Fig. 4.10. By applying index synchronization, the depen- dence relation can be simplified as in Fig. 4.11(a) with t = —2. Thus, every group, except the first group, must be 3 steps behind previous group. The parallel code of Example 2 can be implemented by a Doacross 100p including a Doserial loop as the following. Example 4.1: ( Parallel code ) doacross I - 1, 1000 shared integer J[1000] doserial J[I] 3 1, 1000 if( I > 1 ) then vhile( JII-l] < JIIJ+3 ) ; $1: A(I+J, 3*I+J+3) - ,... $2: - A(I+J+1, I+2*J+4) ..... enddo JII] = 1000 + 3 enddo If Theorem 4.6 is applied, a BDV set has the form {(0,-1), (1,t)} where t 6 Z. To preserve the basic dependence vector (0, —l), the execution direction of the inner loop is reversed. As shown in Fig. 4.11(b), processors execute the inner loop in a reverse order, from the top to the bottom. Similarly, the second basic dependence vector (1, t) is transferred to index synchronization. The delay factor 45 in this case is equal to 1+t. Thus, the index variable i j+1 has to be always larger than index variable 68 i,- by at least 1 + t. 
It should be emphasized that each index variable i,- is accessed by only two processors. Thus, the hot-spot condition will not occur. Also, since the second processor only performs read operation, no atomic operation is required in this approach. 4.5 Performance Evaluation 4.5.1 Theoretical Speedup To simplify the analysis, it is assumed that ((i) = 1 and u(i) = M in the pro- gram model in Fig. 4.2. Intuitively, the delay factor 4), is the major factor to the performance of loop uniformization. As mentioned, if a BDV set has the form {(0,1),(1,t)}, the delay factor (b is defined as 1 - t. Otherwise, if a BDV set has the form {(0, —1),(1,t)}, the delay factor d) = 1 + t. There are two possibilities for a given BDV set: 45 S 0 or (t > 0. In the first case, each group can continue its execution immediately as long as it is not ahead of its previous group --¢ steps. Thus, no delay is required. If the scheduling and synchronization overhead is not taken into account, the peak speedup is simply equal to the bound of the first level loop N when assuming the number of processors is infinite. On the other hand, if it > 0, by the index synchronization method, group j has to be behind group j — l at least 43 steps. The last group, the N ’th group, need to delay accumulatively at least (N - l) a: 43 steps to start its execution. The equation 0f peak speedup is derived as the following. Peak speedup= W 69 = NtM W S if when ()5 > 0 (4.6) Peak speedup = N when (t S 0 Note that the upper bound 1% may not be tight when (M -— 1) is comparable to N * (ii. The peak speedup is clearly determined by the ratio of M and at. A larger ¢ induces a smaller peak speedup. In the worst case 45 = M, which actually results in a serial execution, the peak speedup is equal to one. Therefore, when both Theorems 4.5 and 4.6 are applicable to a nested loop, the choice should be determined by the theoretical peak speedup. ' Example 4.3: do 181, 1000 do j81, 1000 A(2*J+3, 1+1) = .... . A(I+J+3, 2*I+1) .... and do and d0 Consider the example above. The general solution is (2x,y,a:, -:c + 2y) and the DSF is :Ef". By applying S-test, the range of dependence slopes is (-499.5,0.499). Since —499.5 < 00, both Theorem 4.5 and Theorem 4.6 can be applied to choose a BDV set. However, if Theorem 4.5 is used, the BDV set is {(0,1),(1, —500)} which results in a very low peak speedup 1%. On the other hand, if Theorem 4.6 is used, the BDV set is {(0, -1),(1,1)} and the delay factor at is only 2. A much higher peak speedup 333 then can be obtained. Thus, Theorem 4.6 is a better choice in this example. 70 4.5.2 Experimental Results To evaluate the performance of index synchronization under various delay factors, our experiments were conducted on a 76-node BBN GP—lOOO parallel processor. The CP- 1000 is a physically distributedvmemory architecture running the MACH operating system that offers a logically shared virtual space to the programmer [56]. Each processor node is composed of a processor, a local memory module, and a network interface controller. The 76 processor nodes are interconnected through a multistage interconnection network (MIN), a slight variant of the Omega network, that allows any processor to access any memory module. Any remote memory access is made by sending a request message via the network controller to the remote peer network controller which then accesses the target memory module and sends back a response message. 
Therefore, a remote memory reference is about four times slower than a local memory reference which directly accesses the local memory module bypassing the interconnection network. If the traffic of the MIN is heavy, a remote memory access might take much longer than usual. The operating system allows an application to have an exclusive control of a cluster of nodes without sharing with other applications. , Detailed performance study and the characteristics of the GP-1000 can be found in [42]. The experimental programs are coded in Butterfly Fortran [22] which is a parallel language, in some sense, similar to the PCF Fortran [21]. To avoid too much influence from network and memory contention, the remote reference ratio is set to 5%. As for scheduling, dynamic self-scheduling scheme [26] is applied to schedule the outer loop. Based on different granularity, three experiments are made in this research. In each experiment, different delay factors, from 46 = 2 to ¢ = 200, are applied to a doubly nested loop with bounds N = 1000 and M = 1000. The results of Experiment 4.1 is shown in Fig. 4.12 in which the iteration granularity is only 120 psecs. As mentioned 71 before, delay factor (15 is the major factor to the performance. In Fig. 4.12, a smaller gt clearly results in a higher speedup. According to the analysis in Section 4.5.1, none should achieve a speedup higher than the peak speedup, 5:39. Thus, when the number of processors is larger than 1%, the speedup should start going down due to more memory or network contention induced by unused processors. For instance, when 43 = 50 the peak speed is 20. It can be seen that the highest speedup, 14, is generated when the number of processors P is equal to 20. Beyond 20, the speedup start dropping. In fact, due to a high ratio of the synchronization overhead to granularity, the speedup drops even earlier. In this experiment, speedup drops when the number of processors is larger than 30 under all ¢ values. However, it should be emphasized that, without applying loop uniformization, non-uniform dependence loops are always executed in serial. 35 ..‘O. $52 r r T r T ..+.. o=10 30— ..D.. «15:20 a ..x.. 45:50 ..®.. ¢=100 . 25* ..A. ¢=200 2or- $5 “ Speedup 8- 3 15_ g gnno .. g'QWXXXHHX "3:, 10‘ W0 X "22-6“ 8280-00-0~0-o--~o----@...,g'mix 5”.‘g'A.A.A.A.A..A...A...A...A ..... A ..... A 24 0 . 1 1 L 4 _4 _1 l 0 10 20 30 40 50 60 70 No. of Processors : P Figure 4.12. Experiment 4.1: granularity is 120 microseconds 72 7o ..'<>.. ¢E2 ' r T ‘ r + ..+.. ¢=10 ..D.. ¢=20 a 60F ..X.. 43:50 ..Q. ¢=100 50* ..A. ¢=2oo " 2:13:$:2r$ Q 40-— 5 " ‘ Speedup .3330 ----- D Q" (:1 30‘- au,‘ D-i 20r- Qm d :-X ........................... '9 x x x x x x ----x ...... x 10¢ 289999909990- “‘2 .A.A.A.A...A...A...A..A.....A.....A ..... A 0 J 1 I J L 4 J No. of Processors : P Figure 4.13. Experiment 4.2: granularity is 560 microseconds In Experiment 4.2, the granularity is increased from 120 usec to 560 psec. The results are plotted in Fig. 4.13. Since the ratio of the synchronization overhead to granularity is decreased, the overall performance is much better than in Experiment 4.1. However, due to network and memory contention caused by scheduling, syn- chronizations, and remote references, none can achieve a peak speedup. For example, when (:5 = 20 the highest speedup is about 37 which is smaller than the peak speedup, 50. Again, because of too many processors involved, speedup drops when P > 50. The results of Experiment 4.3 is shown in Fig. 
4.14 in which the granularity is 1600 psec. The synchronization overhead is not significant in this experiment due to the high granularity. It can be seen that speedup almost linearly rises until the number of processors is larger than the peak speedup. For «1) = 2 or d) = 10, the 73 speedup never drops because the peak speedup is larger than 72. 7o .T<>.. ¢EZ ‘1 ..+.. 43:10 _ ..n.. ¢=20 . 4 60 ..X.. ¢=5o 43$ ..9.. ¢=100 0-: 50- ..A. ¢=2oo ..;:-+" ‘ 40L 3 Speedup Hag... .....D ...... D ...... CJE' 0L ,9 3 _av' ‘ 20— ,9“ «QXXXXX ...... X ...... x ...... X 10L 2806099006 1 ‘3: .A.A.A.A....A...A...A...A.....A.....A ..A 0 l J | J J l l o 10 2o 30 40 50 60 70 No. of Processors : P Figure 4.14. Experiment 4.3: granularity is 1600 microseconds CHAPTERE» SCHEDULDfl30FIKMflL LOOPS As mentioned, DOall loop is the most important parallel construct in parallel pro- gramming. A critical issue in loop parallelization is how workload of a DOall loop can be equally or fairly distributed among processors. This is the so called loop schedul- ing problem. Consider the case in which a DOall loop of k independent iterations is executed in a multiprocessor system with 71 processors. The goal of scheduling is to distribute these I: iterations to n processors in the most equitable manner and with the least amount of overhead. This is not an easy task because the amount of work of each iteration may be different, the value of k may not be known during the compile time, the number of processors n is machine dependent, and the amount of overhead is also machine dependent. In this chapter, we first review various scheduling schemes of DOall loops by discussing their strength and weakness. Then, a loop scheduling model which will be used in the next chapter is given to express and analyze scheduling schemes. 74 75 5.1 Review of Loop Scheduling Schemes A DOall loop is uniformly distributed if the execution time of all iterations are very close. In a shared memory multiprocessor system, iterations of a DOall loop can be scheduled among processors either statically at compiler time or dynamically at run time. Static scheduling is usually applied to uniformly distributed loops, in which iterations of a DOall loop are pre-assigned to processors at compiler-time. How- ever, the actual execution time of each iteration still may vary due to network con- tention, memory contention, mutually exclusive access, and complicated control flow. Thus, workload imbalancing is its major disadvantage, especially when loops are non- uniformly distributed. Also, static scheduling is not applicable if the bound of loops and the number of processors are unknown at compile time. Dynamic scheduling allocates iterations to processors dynamically at run time. Thus, whenever a processor becomes available, one or more iterations will be assigned to that processor. Clearly, some scheduling overhead will be involved at run time, while a better workload balancing is achieved. The overhead to start a new task (or process) is usually expensive. Thus, self-scheduling in which no system calls are involved is widely employed in multiprocessor systems, such as HEP [57], Alliant FX/ 8 [4], Concurrent Computer [27], Cray X-MP, and BBN GP-lOOO [22]. In these systems, each processor accesses a set of shared variables by itself to retrieve the indices of iterations instead of invoking system calls provided by the operating system. However, the number of iterations a processor should fetch becomes an important issue. 
Getting a large number of iterations may cause load imbalancing, while getting a small number of iterations will induce too much scheduling overhead. Various self-scheduling schemes which put the scheduling responsibility on the compiler rather than on the operating system were proposed in the past [26], [20], [24], [25], [27]. The pure self-scheduling (SS) approach is to dynamically allocate one 76 iteration of a DOall loop to any idle processor at run time. The main advantage of this approach is that it can easily achieve load balancing, which is especially useful for those nonuniformly distributed loops. However, there are two drawbacks. First, the amount of scheduling overhead is proportional to the bound of the DOall 100p which might be tremendous. Second, when the granularity of each iteration is small, the high frequency of mutually exclusive accesses to shared variables, such as the loop index variable, will cause serious memory contention which will significantly reduce the overall performance. Detailed study of tradeoff between load balancing and scheduling overhead can be found in [58]. To reduce the frequency of mutually exclusive accesses to those shared variables, an improved self-scheduling scheme, chunk self-scheduling (CSS), which allocates a group of iterations to an idle processor, was deveIOped in [22]. In the CSS(lc), the number of iterations, Is, in a group, called chunk size, is fixed and has to be specified by programmers. When the chunk size is one, this scheme is the pure self-scheduling as discussed above. If the chunk size is selected as the bound of the DOall loop equally divided by the number of processors, this scheme becomes static scheduling in which the system will easily cause load imbalancing in the case of nonuniformly distributed loops. The efficiency of the CSS(k) scheme is based on the proper chunk size which is determined by the number of processors, the bound of the DOall loop, and the granularity of iterations. A large chunk size will cause load imbalancing, while a small chunk is likely to produce too much scheduling overhead and memory and network contention. Due to the unpredictable run-time behavior of parallel programs, it is difficult for programmers or compilers to properly select an appropriate chunk size. Another better method, namely guided self-scheduling (GSS) [25], can dynamically change the number of iterations assigned to each processor. More specifically, the next chunk size is determined by dividing the number of remaining iterations of a DOall 77 loop by the number of processors. The property of decreasing chunk size implies its efforts to achieve load balancing and to reduce scheduling overhead. By allocating large chunks at the beginning of a DOall loop, one can reduce the frequency of mutually exclusive accesses to those shared loop indices. The small chunks at the end of a loop serve to balance the workload across all processors. This approach obtains an optimal workload balancing under any initial configuration for uniformly distributed parallel loops. To avoid having many small chunks at the end of a loop, GSS(k), in which the optimal load balancing is sacrificed, is proposed to bound the chunk size from below by k, where k has to be specified by either programmers or compilers. However, the GSS approach is not perfect in some conditions. 
First, if the granu- larity of iterations is not uniformly distributed, the GSS may not balance the workload due to the large initial chunk size which is equal to the bound of the loop divided by the number of processors. An alternate approach in the GSS is to make P, the number of processors, artificially large to decrease the chunk size. However, the in- creasing of the number of scheduled units and the scheduling overhead is a serious side effect of this alternate approach. Second, even though the GSS is a very simple method, the complexity of the GSS algorithm, especially the GSS(k), is still more complicated than the SS or CSS approach. Without hardware support, the imple- mentation of mutually exclusive accesses of the CSS algorithm will be a bottleneck in a large-scale multiprocessor system when the granularity of DOall loops is small. Finally, the best minimum bound on the value I: in the extended (3830:) approach is machine and application dependent. Just like selecting the fixed chunk size in the 083(k), it is not easy for programmers or compilers to select a proper bound value k. Our experimental results for the GSS(k) scheme in the next chapter. will show that a large lc value causes more workload imbalancing, while a small I: value induces too much scheduling overhead. The best choice of I: value is mostly determined by the 78 workload distribution. 5.2 A Model for Loop Scheduling In this section a simple loop scheduling model is presented which allows us to clearly express and analyze different self-scheduling schemes. The performance of a self-scheduling scheme is dependent on two factors: one. is synchronization primitives provided by the system, and the other is the execution time of loop iterations, which is program dependent and is also affected by system run-time status. Thus, to design an efficient self-scheduling scheme, we have to consider the effect of these two factors. The machine considered in this research is a shared memory multiprocessor system consisting of P processors. Each processor can directly access any location in the global shared memory. Either hardware or software atomic primitives are provided by the system to allow the implementation of mutual exclusion or critical sections which are essential to any self-scheduling schemes. As to application programs, arbitrary nested parallel and serial loops are allowed, and the bound of loops may not be known at compile time. Due to the unpredictable system behavior, the execution time of each iteration is non-deterministic even at run time. A chore, defined in [21], is a unit of work to be assigned to a processor. In scheduling a DOall loop, a chore is usually a group of iterations. For example, in the CSS(k), a chore is composed of k iterations. For each chore, the indices of the first and last iterations are called chore starting point (a) and chore ending point ([9), respectively. In this research, the last chore to complete its execution is called the critical chore. The processor executing the critical chore is called the critical processor. For each processor, we distinguish three states in handling a DOall loop: scheduling state, computing state, and waiting state. The state diagram is shown in 79 Fig. 5.1. A Scheduling 1 Y Computing /‘ Waiting Done Figure 5.1. State diagram of a processor in loop scheduling When encountering a DOall loop, the first thing a processor has to do is to acquire a chore. In this scheduling state, the processor tries to acquire a chore, (0,13), from the remaining iterations. 
The time a processor stays in the scheduling state is con- sidered as scheduling overhead. In general, the scheduling overhead is determined by the implementation of mutually exclusive access to those shared variables, the num- ber of processors, and the average execution time of each chore. A simple scheduling algorithm, such as the SS, might be easily implemented by a single atomic fotchtadd instruction; while a complicated one, such as the GSS, is usually implemented in a critical section. As indicated in [58], a significant amount of the scheduling overhead may be caused by waiting for the availability of a critical section. A recent measure- ment of some shared-memory multiprocessors also indicated that critical sections are 80 a major source of performance degradation, especially when the number of proces- sors is large [42]. The total number of times that processors enter into the scheduling state, which is actually the number of chores plus the number of processors, is another important factor contributing to the total scheduling overhead. Whenever a processor acquires a chore, it enters the computing state and starts execution. After finishing the computation, it goes back to the scheduling state. The total time all processors stay in the computing state is very likely greater than the total load of the DOall loop due to network and memory contention. Thus, a scheduling algorithm which generates much network and memory traffic will affect the time processors stay in the computing state. A processor repeats scheduling and computing until the loop is exhausted. Then, the processor goes to the waiting state and waits for the completion of the remaining executing chores. Every processor, except the critical processor, has to stay in the waiting state due to this barrier synchronization. In general, the larger and later chores have a higher probability of becoming the critical chore. The actual critical chore is determined not only by the scheduling scheme but also by the execution time distribution of iterations. For example, the first chore in the GSS scheme may possibly become the critical chore if the execution time of the first few iterations is much larger than the rest. To minimize the waiting time in this state, i.e.,to achieve workload balancing, it should not be assumed that loops are always constant or uniformly distributed. The consideration of nonuniformly distributed loops is important in designing a practical scheduling scheme. The symbols used in this model are defined as follows: 0 P: the number of available processors in the system. 0 I : the total number of iterations in a DOall loop, which is application program dependent. 81 o L(i): the execution time of the i-th iteration. Thus, the total load of a loop is L = 23’ L(i). i=1 0 N: the total number of chores spawned from a DOall loop, which is determined by the scheduling scheme and the value I. o C(t): the chunk function which represents the number of iterations assigned to the t-th chore (1 S t S N). This function which might be either linear or nonlinear expresses the partition and dispatching rule of a scheduling scheme. 0 T(t): the load of the t-th chore, which is equal to 2:30)“)110'). Obviously, i=0 Xi, T(t) should be equal to the total load, L, of this DOall loop. 0 X: the total time processors stay in the computing state. Recall that due to network and memory contention, X is usually greater than or equal to the total load L. o 0: the total scheduling overhead of all chores. 
O W: the total waiting time of all processors due to barrier synchronization and load imbalancing. In this model, P, I , and L(i) are given by the system and programs. C(t) rep- resents the scheduling method which determines N and T(t). X, 0, and W are the final metrics that we are concerned with. An example of scheduling a uniformly dis- tributed loop for SS, 058(k) and GSS schemes are depicted in Fig. 5.2(a), Fig. 5.2(b), and Fig. 5.2(c), respectively. The area under the curve C (t) in each scheme is equal to the total number of iterations, I. 82 C (i) (a) SS 1 N - I Chore index: t C(t) (b) 0580:) k N = rs Chore index: t 00) (C) (3850) Chore index: t Iteration index: i L(i) T(l) T(2) T(N Iteration index: i Iteration index: i Figure 5.2. Chunk size function of a uniformly distributed loop for SS, CSSUc), and GSS schemes CHAPTER 6 TRAPEZOID SELF-SCHEDULING (TSS) As mentioned in the previous chapter, in order to design an efficient scheduling scheme which induces small scheduling overhead and achieves workload balancing, our goal is to minimize both 0 and W. To minimize 0, we are concerned with the number of chores N (number of mutually exclusive accesses) and the simplicity of the scheduling scheme (complexity of each mutually exclusive access). Thus, we would like to design a scheme like CSS(k) which is simple and only produces a small N if k is large enough. To minimize W, we want to reduce the waiting time for the completion of the critical chore. Thus, the GSS scheme seems to be a good choice under any initial processor configuration. In this chapter we propose the Trapezoid Self-Scheduling (TSS) scheme which includes the advantage, but excludes the disadvantage, of both the 0330:) and GSS schemes. The TSS method, which starts the execution of a loop by assigning a large number of iterations and linearly decreases the number of iterations until all iterations are exhausted, can obtain the best tradeoff between small overhead and load balancing for both uniformly or nonuniformly distributed parallel loops. The TSS method also substantially reduces the number of mutually exclusive accesses to 83 84 loop indices, which significantly restricts the speedup on large-scale shared memory machines. The rest of chapter is organized 'as follows. The generic concept of this scheme is discussed first in Section 6.1. A conservative yet powerful approach is presented in Section 6.2. Then, we shall address mutual exclusion implementation and static memory management issues. Finally, experimental results measured from running different self-scheduling schemes in a 96-node GP‘IOOO Butterfly machine will be discussed in Section 6.5. 6.1 Generic Approach: TSS(f,£) In addition to the simplicity of scheduling algorithm, workload distribution among iterations of a DOall loop is the major factor to the success of a self-scheduling scheme. In this research, four typical uniformly and nonuniformly distributed loops shown in Fig. 6.1 will be discussed and experimented. The basic idea of the T88 method is to extract the advantages of both CSS and GSS schemes. As indicated in Fig. 5.2(b) and (c), the CSS(k) scheme applies a linear, but fixed, chunk function while the GSS uses a decreasing, but nonlinear, chunk function. The linearity can make the chunk function simple enough to induce a small scheduling overhead while the decreasing chunk size can balance the workload. To combine both properties, intuitively, we design a linearly decreasing chunk function for the T83 scheme. 
Thus, instead of using a nonlinear chunk function, the T33 scheme linearly decreases the chunk size at run time. The linearity, which makes the T85 simple enough to be implemented with a few atomic instructions, is a major difference between the T38 and GSS schemes. Figure 6.2 clearly shows the key principle of the TSS( f ,l) method in which the area below the linear curve C (t) is a trapezoid. The size of the first chore and the 85 Parallel Do i = I to I Parallel Do i = I to I If( Expl ) then (*“ Loop body **) (‘"‘ Block 1 **) se (** Block 2 **) . .1 End Parallel . 1} End If L(z) * L(z) End Parallel I 14" ii”. (a) Constant workload (b) Random workload Parallel Doi = I to] ParallelDoi = l to] SerialDoj=Itoi SerialDoj=ltoI—i (“ Loop body **) (“ Loop body ") End Serial End Serial L(i)“ End pmnel my End Parallel ] 'i l ‘ z (c) Increasing workload (d) Decreasing workload Figure 6.1. Four typical uniformly and nonuniformly distributed loops 86 C t L ' < i (a) Tm ”HM/4') MA («2) “l ‘1 W TSS(f,!) 1‘0) "H21 3 N I Chore index: i Iteration index: i Figure 6.2. Chunk size function of a uniformly distributed loop for TSS scheme last chore, f and e, can be flexibly selected by compiler designers, programmers, or even intelligent compilers. The number of chores, N, is determined by f, l, and I . According to the formula of trapezoid area, the value N of TSS( f, l) is 2] N=[f+£. (6.1) After N has been computed, the decreasing step (6) can be obtained by the following formula. 6 = f — e TV: (6.2) For example, if I = 1000 and (f, t) = (88,12), then we have N = 20 and 6 = 4. The distribution of the chunk function C(t) is: C(l) = 88, 0(2) = C (1) — 4 = 84, 0(3) = 0(2) — 4 = 80, - - o, C(18) = 20, C(19) = 16, and C(20) = 12. The complete algorithm describing each processor in the scheduling state is listed below. 1. If this processor is the first processor that encounters the parallel loop, then perform the following initialization steps. (a) select a (f, [) pair. according to the policy decided by the compiler or spec- ified by the programmer. 87 (b) compute N and 6 according to Eq. (6.1) and Eq. (6.2). (c) set count:-O and csizez-f. 2. BEGIN MUTUAL EXCLUSION (a) 0:8 count-*1 (b) fl:-|_count+csizej (c) csize :8 csize-6 3. END MUTUAL EXCLUSION 4. If a > I, then enter the waiting state. 5. Iffl> I, then 3 := I. 6. enter the computing state. The flexibility of selecting (f, [) without requiring any extra statement in step 1 is an advantage of the T58 over the GSS. In fact, we can treat the SS and CSSUc) schemes as special cases of the T88. For example, the SS is equivalent to the TSS( 1,1), and the CSS(k) is equivalent to the TSS(lc,k). Although the GSS is not a linear scheduling scheme, it can be roughly simulated by the TSS(-,’;,l). Now, a similar question asked for in the 085(k) and GSS(lc) may be asked immediately: how to select an optimal f and l? Apparently, an optimal (f, [) pair is determined by P, I, L(i), and system run-time status. Before actually running that loop, it may not be possible to compute the optimal (f, l) pair. However, in practice, an optimal solution is not really necessary. A conservative yet efficient (f, l) pair is proposed in the next section. Our experimental results in Section 6.5 will show that this consecutive approach will result in a near optimal solution for most of the uniformly and nonuniformly distributed loops. 88 6.2 A Conservative Approach: TSS(fi, 1) An optimal choice of (f, l) is dependent on the execution time distribution of the parallel loop. 
It has been shown in Eq. (6.1) that to have a small N in the T58 the (f, l) pair should be chosen as large as possible. However, the load imbalancing will be a serious side effect especially when the workload is not uniformly distributed. Nonuniformly distributed loops, such as in Fig.i6.l(c) and Fig. 6.1(d), are often generated in real application programs. Ignoring the other extreme cases, Fig. 6.1(d) is almost the worst case for the scheduling schemes with a decreasing chunk function, because too much workload will be assigned to the first few chores. For example, if the GSS(k) scheme is used, the T(l) for the loop in Fig. 6.1(d) is equal to (when I: = l) or larger than (when k > 1) 3%- which is very likely to become the critical chore of this lOop. Similarly, Fig. 6.1(c) is the worst case of GSS(k) scheme with a large value 1:, because the last chore will be assigned too much workload and become a terrible critical chore of this loop. Figure 6.3. T(l) of the TSS(§%,1) under the loop in Fig. 3(d) 89 In order to ensure every T(i) is less than or equal to -f; ,instead of ,1, in the G38, the first chunk size f is set to —- in the conservative TSS method. In this case, T(l), the shade area in Fig. 6. 3 18 derived ln Eq. (6. 3). It can be seen that T(l) ls always less thanL 3, thus the workload of the rest iterations 13 still adequate to keep other processors busy even under the decreasing workload as shown m Fig. 6.1(d). T(I) = (I+ (I— é» an: (#) *% = (4134-1912) * L (6.3) After the f is fixed by #, due _to the property of the linearly decreasing chunk size in the TSS, the range of the last chunk size, I, is from 1 to W. Substituting (fl) = 2—p,1) and (f,£’) = (27,27) into Eq. (6.1), the range of the number of chores N, is from 4P to 2P. Thus even the l is set to the most conservative value 1, each processor will have at most two more mutual exclusion accesses than the other values. If the mutual exclusive access can be implemented in an atomic operation (will be described in the next section), the extra scheduling overhead is insignificant. Also, considering the increasing workload in Fig. 6.1(c), the last chunk size I is set to 1 to minimize the load imbalancing. This implies that each processor in average Will execute 4 chores. The comparison of N among those well known self-scheduling schemes is shown in Table 6.1. Table 6.1. The number of chores generated by various self-scheduling schemes Scheme 33 GSS(k) GSS(l)‘ TSS(#,1) N . I rt] Plnrtl 4P (worst case) 90 6.3 Mutual Exclusion Implementation As mentioned before, scheduling overhead is dependent on both the number of chores N and the implementation of chore dispatching. As for the first factor, the number of chores dispatched by the TSS(;%, I) from a parallel loop is 4P which can be further reduced if an optimal (f, l) pair is selected. As for the implementation of chore dispatching, the algorithm in Section 6.2 can be easily implemented in a critical section. However, as indicated in [58] and [42], the system performance will degrade significantly if the critical section takes much time and the number of processors is large. In this section, we present an efficient chore dispatching mechanism which replaces a critical section by a single atomic instruction. Because the chunk function C (t) of the TSS is linear, the chunk size of each chore is predictable. A simple way to compute the start iteration of the i-th chore, a(i), is to sum the first i — l chunk sizes and then add one. 
Similarly, B(i) is equal to the summation of the first i chunk sizes. A generic formula of calculating (a(i), fi(i)) for the TSS( f, l) is derived below. C (1) = f (64). 0(2) = f — (i — 1)6 (6.5) as) = 1'): 00)] +1 (6.6) .=1 By using the formula of trapezoid area, we have a“) = [(0(1)+C(i-2- 1))*(i— 1)J + 1. (6.7) 91 Substituting Eq. (6.4) and Eq. (6.5) to Eq. (6.7), we obtain a“): l(l—1)...(2f2_(e.—2)....5)J +1. (6.8) Similarly, the formula for fl(i) can be derived as as) = mince” = Li*(2f-(2i_1)*6)l- (6.9) In these equations, f and 6 are selected or computed by the first processor encoun- tering the parallel loop. Hence, the index of chores, i, is the only shared variable that processors have to access atomically. If a multiprocessor system supports atomic in- structions, such as fetchtaddO, the implementation of the TSS can be very efficient as shown below. 1. If this processor is the first processor that encounters the parallel loop, then perform the following initialization steps. (a) select a (f, 6) pair according to the policy decided by the compiler or spec- ified by the programmer. (b) compute N and 6 according to Eq. (6.1) and Eq. (6.2). (c) set t = 1. 2. i = fetchladd(t, l). 3. If i > N, then enter the waiting state. . 4. Compute (a(i), fi(i)) based on Eq. (6.8) and Eq. (6.9). 5. Enter the computing state. It should be emphasized that the execution overhead of Eq. (6.8) and Eq. (6.9) is very small. Accessing the shared variable is reduced to a single fetchtadd operation 92 which is usually insignificant. For example, according to our experimental results, the average time required by a fetchaadd Operation on a 96-node BBN GP-IOOO machine is only 30 psecs. Since some operations can be overlapped, the atomic phase of this instruction is about 4.5 psecs. Thus the average scheduling overhead for each processor in the TSS(§%, 1) scheme is about 120 psecs which is a small constant. On the other hand, without hardware support, the GSS(k) scheme needs a lock()/unlock() mechanism. The scheduling overhead then is determined by not only the time processors stay in the critical section, but also the time processors wait for entering the critical section. According to our recent measurement [42], the fre- quence of the critical section accesses is the most important factor to the performance degradation. If a small I: is chosen in the GSS(k) method, the frequency of mutu- ally exclusive accesses will become higher and higher during the execution of loops especially when the granularity of iterations is small. The waiting time for fetching a chore is possibly even larger than the execution time of that chore. If a larger 1: is chosen, the frequency of mutually exclusive accesses can be reduced. As mentioned, the load imbalancing is a serious side effect. The impact of critical sections to the system performance was studied in [58]. However, detailed theoretical analysis is still an open issue as no closed-form solution is available for a general and practical model. Instead, an extensive number of experiments will be conducted and demonstrated in Section 6.5. 6.4 Memory Management In this section, another interesting and important implementation issue, memory management, in the run-time model of a multiprocessor system is discussed. Due to concurrency, the traditional stack (or static allocation) is no longer able to manage memory for parallel programs. 
A tree-like structure called a cactus stack, which handles both local and global variables by directly mirroring the chore tree, is applied to deal with both parallelism and procedure calls. A simple way to explain the cactus stack is that when a chore with a given frame splits into several chores, a child frame is created for each child chore. The operation of frame creation will certainly induce a large run-time overhead.

Another improved approach, which treats chore locality differently from procedure locality, is proposed in [27]. Instead of dynamically creating chore frames at chore splitting, all chore frames are allocated at the procedure call by using the well-known scalar promotion technique. Thus, a promoted array is created at the beginning of the procedure which contains a parallel loop. The length of the array is the same as the number of chores to be split from that loop, and each element of the array corresponds to a chore frame. A strip mining technique, which is the same idea used in the GSS(k), has to be employed to bound the size of the promoted array. However, this approach has a major drawback: workload imbalancing.

Note that the number of chores N in the TSS is small and predictable. The approach discussed above can therefore be improved by replacing the strip mining with the trapezoid mining of the TSS scheme. This approach is explained by the simple program, which exchanges array A and array B, listed in Fig. 6.4. The first part is the source program, while the second part is the program translated using the TSS(I/(2P), 1) approach. In addition to m, the index of the DO loop has to be a local variable. We have to translate both m and i into two scalar-promoted arrays, m(j) and i(j). SP(j) and EP(j) are two functions that compute α(j) and β(j) using Eq. (6.8) and Eq. (6.9), respectively. Now the parallel do block can be easily implemented by a single fetch&add() instruction. This approach further reduces the run-time overhead of memory allocation and still keeps a balanced workload.

/** Source program **/
Procedure example()
  int I, A(1000), B(1000)
  int i
  I = 1000
  doall i = 1, I
    local int m
    m = A(i)
    A(i) = B(i)
    B(i) = m
  end
end;

/** Translated program **/
Procedure example()
  int j, I, A(1000), B(1000)
  int i(4P), m(4P)          /** Scalar promotion arrays **/
  I = 1000
  doall j = 1, min(4P, I)
    do i(j) = SP(j), EP(j)
      m(j) = A(i(j))
      A(i(j)) = B(i(j))
      B(i(j)) = m(j)
    end
  end
end;
/** SP(j): Starting Point of the j'th chore (Eq. (6.8)) **/
/** EP(j): Ending Point of the j'th chore (Eq. (6.9)) **/

Figure 6.4. Static memory management

6.5 Experimental Results

To evaluate the performance of various self-scheduling schemes, our experiments were conducted on a 96-node BBN GP-1000 parallel processor. A brief review of the BBN GP-1000 was given in Section 4.5.2; detailed information can be found in [56] and [42].

6.5.1 Workload and Performance Metrics

The experimental benchmarking programs are coded in Butterfly Fortran [22], a parallel language similar, in some sense, to the PCF Fortran then being standardized [59]. To make an objective comparison, the loop language construct PARALLEL REGION is used to represent parallel loops, which provides the same kind of barrier for synchronizing processors in the waiting state. To take advantage of atomic instructions, chore dispatching in the SS, CSS, and TSS schemes is coded using the atomadd32() instruction in the scheduling state. Based on [25], the GSS is implemented in a critical section using the lock() and unlock() mechanisms.
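To make the contrast between the two dispatching styles concrete, the sketch below shows what fetch&add-based TSS dispatching looks like when written today. C11 atomics and POSIX threads stand in for the GP-1000's atomadd32() and Butterfly Fortran's parallel constructs, the loop body is an empty placeholder, and the N and δ formulas are assumed as before; it illustrates the five-step algorithm of Section 6.3 rather than the code actually used in these experiments.

    /* A sketch of TSS chore dispatching around a single atomic fetch-and-add.
     * C11 atomics/pthreads are stand-ins for atomadd32() and Butterfly Fortran;
     * N and delta are assumed to follow Eq. (6.1) and Eq. (6.2). */
    #include <math.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define I 100000L                /* loop bound           */
    #define P 8                      /* number of processors */

    static atomic_long t = 1;        /* shared chore index (step 1c)   */
    static double f, delta;          /* first chunk size and decrement */
    static long   N;                 /* number of chores               */

    static long alpha(long i)        /* Eq. (6.8) */
    { return (long)(((i - 1) * (2.0 * f - (i - 2) * delta)) / 2.0) + 1; }

    static long beta(long i)         /* Eq. (6.9), clamped to the loop bound */
    { long b = (long)((i * (2.0 * f - (i - 1) * delta)) / 2.0); return b > I ? I : b; }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            long i = atomic_fetch_add(&t, 1);    /* step 2: the only shared access */
            if (i > N) break;                    /* step 3: no chores left         */
            long lo = alpha(i), hi = beta(i);    /* step 4: chore bounds           */
            for (long k = lo; k <= hi; k++)      /* step 5: execute the chore      */
                ;                                /* placeholder for the loop body  */
        }
        return NULL;
    }

    int main(void)
    {
        /* steps 1a/1b: conservative TSS(I/(2P), 1) */
        double l = 1.0;
        f = (double)I / (2.0 * P);
        N = (long)ceil(2.0 * I / (f + l));
        delta = (f - l) / (N - 1);

        pthread_t th[P];
        for (int p = 0; p < P; p++) pthread_create(&th[p], NULL, worker, NULL);
        for (int p = 0; p < P; p++) pthread_join(th[p], NULL);
        printf("dispatched %ld chores with one fetch&add each\n", N);
        return 0;
    }

With l = 1 each processor fetches about four chores, so it pays for only about four atomic operations regardless of the loop bound I, whereas a lock-based dispatcher pays a critical-section entry per chore.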
In addition to the selection of different self-scheduling schemes, the benchmark programs are characterized by a set of parameters: (P, I, L(i)). Four kinds of workload, as shown in Fig. 6.1, are tested. The program executes the parallel loop and keeps track of where the time is spent. As described earlier, there are three possibilities: 1) computing, 2) scheduling, and 3) waiting for synchronization. Therefore, besides the traditional speedup measurement Γ, two other interesting performance metrics, the degree of scheduling overhead Θ and the degree of load imbalancing Λ, are defined below.

    Speedup:                         Γ = (L * P) / (X + O + W)      (6.10)
    Degree of Scheduling Overhead:   Θ = (O * P) / (X + O + W)      (6.11)
    Degree of Load Imbalancing:      Λ = (W * P) / (X + O + W)      (6.12)

Note that Θ and Λ represent the average number of processors wasted in the scheduling and waiting states, respectively. In the ideal case, X = L and O = W = 0, and the speedup Γ is the same as the number of processors P. In practice, L ≤ X. Thus, Γ + Θ + Λ is always less than or equal to P. The difference between P and Γ + Θ + Λ is due to network and memory contention in the computing state, which is mostly dependent on the remote memory reference ratio in the computation of a chore. To avoid too much influence from it, the remote reference ratio is set to 5% so that the sum of Γ, Θ, and Λ can be very close to P in our experiments. To measure the effect of remote memory accesses on the scheduling schemes, one experiment based on different remote reference ratios, from 0% to 50%, is made as well.

For the CSS(k) scheme, the optimal choice of the chunk size k is machine and application dependent. When the input parameters (P, I, L(i)) are (72, 100000, 110 μsec), a uniformly distributed loop, our experiment shows that with k = I/P = 1389 we can achieve a speedup of 69.2, which is very close to the ideal speedup of 72. Therefore, our following experiments take k = I/P for the CSS(k) scheme, which produces minimal scheduling overhead and keeps a balanced workload under uniformly distributed loops. However, it is believed that there must be a better k value for the CSS(k) scheme when the workload is not uniformly distributed or when processors start executing at different times.

For the GSS(k) scheme, the best choice of the k value is mostly determined by the workload distribution and the number of processors. It is very difficult to predict the best value of k. In our experiments, we test different k values from 1, 2, 5, 10, and 20 up to ⌈I/P⌉. Then, the best results of GSS(k) and GSS(1) are plotted and discussed.

6.5.2 Run-Time Behavior

The first part of our experiments is performed in a 16-node cluster in order to analyze and compare the run-time behavior of the CSS(I/P), GSS(1), and TSS(I/(2P), 1) schemes. The event log library provided by the GP-1000 system is applied to record the events during the execution of loops. Then gist, an event-log analysis tool, is employed to profile the whole scheduling operation.

The first experiment analyzes a uniformly distributed loop, shown in Fig. 6.1(a), with a small granularity of 110 μsec. The resultant event profiles are shown in Fig. 6.5, Fig. 6.6, and Fig. 6.7 for the TSS, CSS, and GSS schemes, respectively. In these figures, the Y-axis is the logical processor number and the X-axis represents the execution time. The small squares represent events.
Corresponding to the three states discussed in Chapter 5, there are three kinds of events: scheduling events, computing events, and waiting events. A fourth event indicates the completion of the critical chore. Each large rectangular box explains the event to which it is connected. In Fig. 6.5, since the scheduling overhead of the TSS is very small, all computing events (squares labeled 2) occur almost immediately after the scheduling events (squares labeled 1). Thus, the squares labeled 1 sit almost right behind the squares labeled 2 in this figure, and are therefore invisible. The time period from any square box 2 to the next square box 1 is the execution time of a chore. As indicated in the figure, in the first round all processors are scheduled at about the same time and fetch chores with a linearly decreasing chunk size. The chunk sizes can be read in the large rectangles. Since the earlier processors are assigned larger chores, all processors almost synchronize at the end of the second round. The same behavior repeats in the third and fourth rounds. Thus, all processors finish their last chore at about the same time, which implies a balanced workload; the total waiting time is very small.

Figure 6.5. Event log profile of a uniformly distributed loop using the TSS.

Figure 6.6. Event log profile of a uniformly distributed loop using the CSS.

Figure 6.7. Event log profile of a uniformly distributed loop using the GSS.

Figure 6.8. Event log profile of a decreasing workload using the TSS.

Figure 6.9. Event log profile of a decreasing workload using the CSS.

Figure 6.10. Event log profile of a decreasing workload using the GSS.
Figure 6.11. Event log profile of a uniformly distributed loop using the TSS, where processors start at different times.

Figure 6.12. Event log profile of a uniformly distributed loop using the CSS, where processors start at different times.

Figure 6.13. Event log profile of a uniformly distributed loop using the GSS, where processors start at different times.

Since the workload is constant and very small, the CSS scheme works almost perfectly in this case. It can be seen in Fig. 6.6 that the CSS scheme produces the fewest events, only one round for each processor. The total execution time using the CSS or TSS is about 706 msec. The GSS scheme takes a little longer, about 20 msec more, than the TSS and CSS schemes. It achieves a perfect workload balance, but the mutually exclusive access is possibly the bottleneck in the case of a small-granularity workload. Figure 6.7 shows the run-time behavior of the last 12 msec of the execution of the loop. On some processors, such as processors (b) and (6), the scheduling overhead (from an event 1 to the next event 2) is even much larger than the computing time (from an event 2 to the next event 1).

A loop with a decreasing workload, shown in Fig. 6.1(d), is analyzed in the next experiment. The event profiles of the TSS, CSS, and GSS schemes are depicted in Fig. 6.8, Fig. 6.9, and Fig. 6.10, respectively. Since the loop is not uniformly distributed, processors do not synchronize their fetches every two rounds in the TSS scheme. The first chore, which is assigned to processor 0, is not so large as to become a terrible critical chore.
As in the previous experiment, all processors complete their last chore at almost the same time in the TSS scheme.

In Fig. 6.9, every processor, except processor 0 which acquires the last 180 iterations, fetches a chore with the same chunk size of 188 according to the CSS method. However, processor 3, which fetches the first 188 iterations, becomes a terrible critical processor due to the heavy workload at the beginning of the loop. The shaded periods from event 3 to event 4 on the other processors represent the time those processors wasted in the waiting state. The same problem caused by the GSS is shown in Fig. 6.10. Processor 0 fetches the first large chunk of 188 iterations and becomes a critical processor. Both the CSS and GSS schemes take about twice as long as the TSS scheme in this case.

In the last experiment on run-time behavior, processors start executing a uniformly distributed loop at different times. The event profiles of the TSS, CSS, and GSS schemes are exhibited in Fig. 6.11, Fig. 6.12, and Fig. 6.13, respectively. The TSS works as well as the GSS, in which a good load balance is achieved in any initial configuration, while the CSS causes serious load imbalancing. Clearly, processor f, which is the last processor to start executing the loop, is the critical processor in Fig. 6.12 under the CSS scheme. In this case, the CSS takes about 230 msec longer than the TSS and GSS.

6.5.3 Performance Measurement

The second part of our experiment is conducted to measure and compare the performance (which includes Γ, Θ, and Λ) of the SS, CSS(I/P), GSS(k), and conservative TSS(I/(2P), 1) approaches under different kinds of DOall loops, where the maximum number of processors, P, is 80.

The plots in Fig. 6.14 are the results of Experiment 6.1, where L(i) is a small constant, about 110 μsec, for all i, and I is 100000. The TSS and CSS perform almost identically well in this experiment. From the plots of Θ and Λ, the CSS has a slightly higher Λ (load imbalance) than the TSS, while the TSS has a slightly higher Θ (scheduling overhead). Both the SS and GSS(1) produce too much scheduling overhead in this experiment. The scheduling overhead, Θ, of the SS is the worst, up to 65 when P = 80. This is due to the hot spot condition [41] that occurs on the shared index of the SS algorithm. As discussed above, the low and long tail in the chunk function C(t) of the GSS(1) is the major source of Θ. In the experiments with GSS(k), as the value of k is increased the scheduling overhead decreases while the degree of workload imbalancing increases. The best value of k in the GSS(k) scheme is 80 under this workload. Beyond 80, the workload imbalancing dominates, and the overall performance goes down again. The speedup of GSS(80), however, is still a little lower than that of the TSS and CSS due to both higher scheduling overhead and load imbalancing.
Figure 6.14. Experiment 6.1: I = 100000, L(i) = 110 microseconds

Figure 6.15. Experiment 6.2: I = 10000, L(i) = 2 milliseconds

Experiment 6.2 is very similar to Experiment 6.1 except that L(i) is increased from 110 μsec to 2 msec. The results are plotted in Fig. 6.15. The TSS and CSS work as well as in the previous experiment, while the GSS(1) and SS show a significant improvement. This is due to the significant improvement in the ratio between the execution time of each iteration and the scheduling overhead. Note that the scale of the Y-axis for Θ and Λ is between 0 and 8 rather than between 0 and 80. The influence of the critical-section implementation overhead still keeps the GSS(1) from achieving a better performance than the other schemes. The best choice of the k value in the GSS(k) scheme is 5 in this case, which performs almost as well as the TSS and CSS schemes. Again, the effect of load imbalancing degrades the performance if the k value is larger than 5.

The loop distributions of Experiments 6.3 and 6.4 are based on Fig. 6.1(c) and Fig. 6.1(d), respectively. In Experiment 6.3, since the execution time of the last few iterations is very high, the last chore of the CSS becomes a very large critical chore. Results of this nonuniformly distributed loop are shown in Fig. 6.16. Obviously, the CSS scheme has the worst performance, in which Λ of the CSS is up to 39 when P = 80. The SS, GSS(1), and TSS schemes all achieve a very good speedup. The GSS(1) has a slightly higher scheduling overhead when P is very large due to its critical-section implementation. The TSS has slightly worse load imbalancing due to the heavy workload in the last few chores. The GSS(k) also has the same load imbalancing problem as the TSS when k ≥ 2 is chosen. Overall, under this increasing workload, the GSS(1) provides the best speedup, although the difference is insignificant compared with the SS, TSS, and GSS(k) with k ≥ 2.

Figure 6.16. Experiment 6.3: I = 3000, L(i) = [50 to 77000] microseconds

Figure 6.17 shows the results of the fourth experiment, which has a very small execution time in the last few iterations.
Again, the TSS achieves both objectives, very small scheduling overhead (Θ) and very small load imbalancing (Λ), which results in a linearly increasing speedup Γ as P increases. On the contrary, the major disadvantage of the GSS(1) scheme mentioned above, a very large first chore T(1), obviously appears in this experiment. The first chore of the GSS(1) and CSS, even though it is spawned first, becomes the critical chore in both schemes. The performance degradation of the GSS(1) and CSS is mainly caused by load imbalancing. When the k value is increased in the GSS(k) scheme, the results are even worse than when k = 1 due to the larger first chunk size.

In the same way, the random workload shown in Fig. 6.1(b) is measured in Experiment 6.5. The probability of taking the branch in the if statement is 50%. The execution times of Block 1 and Block 2 are 36 μsec and 1 msec, respectively. As shown in Figure 6.18, both the TSS and CSS work almost as well as in Experiment 6.1, while a slight load imbalance is created. The SS shows a significant improvement over Experiment 6.1. The GSS(20) is the best one in the GSS(k) group under this random workload. However, the critical-region implementation still induces too much scheduling overhead and prevents the GSS(k) from getting a better speedup.

In Experiment 6.6, processors start executing a uniformly distributed loop, with the same workload as in Experiment 6.1, at different times. The result is plotted in Figure 6.19. As expected, the SS is the worst one due to its large scheduling overhead, while the CSS loses its load balance as described above. The GSS(1) works as well as the TSS when the number of processors, P, is less than 40. Its speedup is seriously affected by the critical-section implementation when P is larger than 40, even though a good load balance is achieved. However, if k is increased in the GSS(k) scheme, the scheduling overhead can be significantly improved. The best result of the GSS(k) is produced by GSS(80), which works almost as well as the TSS scheme.

Figure 6.17. Experiment 6.4: I = 3000, L(i) = [77000 to 50] microseconds

Figure 6.18. Experiment 6.5: I = 10000, L(i) = [36 (50%) or 1000 (50%)] microseconds
Figure 6.19. Experiment 6.6: I = 100000, L(i) = 112 microseconds, processors start at different times

Figure 6.20. Experiment 6.7: P = 80, I = 100000, L(i) = 112 microseconds

Instead of setting the remote access ratio to 5%, the last experiment measures the speedup under different remote reference ratios in an 80-node cluster. The result is shown in Figure 6.20, in which the X-axis represents the remote reference ratio, from 0% to 50%. The GSS(150) is the best one in the GSS(k) group in this experiment. Due to memory and network contention, the speedup drops quickly in every scheme when the remote ratio is larger than 10%. However, the TSS is still at the top of the figure.

CHAPTER 7

CONCLUSION AND FUTURE RESEARCH

We have presented three new techniques to improve the execution of iterative loops on multiprocessor systems. In this chapter, we summarize our major contributions in this thesis. Improvements to the current work will be suggested, and plans for future work will be given.

7.1 Summary

In Chapter 1, the procedure of loop parallelization in most parallelizing compilers was reviewed. Two major types of parallel constructs, DOall and DOacross loops, were described. By discussing the important concerns of loop parallelization, the objectives of this thesis were outlined at the end of Chapter 1.

In order to preserve correct semantics, dependence constraints among iterations are the major concern in loop parallelization. In Chapter 2, cross-iteration dependence was studied. After introducing the basic concepts of data dependence, we presented the ideas of linear Diophantine equations and real inequalities that are the fundamentals of analyzing cross-iteration dependences. Some associated terms, such as iteration space and uniform dependence loop, were also defined and described.

In a shared-memory multiprocessor system, caches and local memories are widely employed to bridge the gap between fast processor cycles and slow main-memory access times. To exploit more potential parallelism, many transformation techniques ignore the problem of cache (or local memory) thrashing and barrier synchronization, which may outweigh the benefit of the resulting parallelism. Loop netting was proposed in Chapter 3 to parallelize uniform dependence loops without inducing cache thrashing and barriers. This method is particularly suitable for multiprocessor systems with a local cache or memory.

In Chapter 4, dependence uniformization was proposed to overcome the difficulties of parallelizing a doubly nested loop in which irregular dependence constraints occur among iterations.
Based on solving a system of Diophantine equations and a system of inequalities, we presented an efficient dependence test to compute the maximal and minimal dependence slopes of any regular or irregular dependence pattern in a 2-dimensional iteration space. Then, by applying the idea of vector decomposition, a set of basic dependences is chosen to replace all original dependence constraints in every iteration so that the dependence pattern becomes uniform. Thus all approaches for parallelizing nested loops with a uniform dependence pattern, such as the wave-front method, loop skewing, and so on, become applicable. We have proved that any doubly nested loop can always be uniformized to a uniform dependence loop with only two dependence vectors. To reduce the synchronization overhead, an index synchronization method, in which appropriate synchronization can be systematically inserted, was also proposed to preserve the two basic dependence vectors in a uniformized dependence pattern. Unlike the wave-front method or the loop skewing technique, the proposed method does not require loop index transformation in the original program. The results of the performance evaluation showed that the speedup is significantly affected by the range of dependence slopes and the granularity of iterations.

Since DOall loops are the most important parallel construct in parallel programming, efficient scheduling of DOall loops is vital to the performance of shared-memory multiprocessors. In Chapter 5, we reviewed various loop scheduling schemes by discussing their strengths and drawbacks. Some methods are good at distributing the workload, but complicated. Some methods are simple, but not good at balancing the workload.

To achieve both objectives, a balanced workload and low scheduling overhead, we proposed the simple and efficient Trapezoid Self-Scheduling scheme in Chapter 6. By assigning a large group of iterations at the beginning, the frequency of mutually exclusive accesses to loop indices is substantially reduced. The property of linearly decreasing chunk size induces a good workload balance even under extremely nonuniformly distributed parallel loops. Furthermore, the linearity of the chunk function allows a very efficient implementation of the chore dispatching function, which can significantly reduce the scheduling overhead. Instead of strip mining, if the TSS scheme is coupled with scalar promotion, one can statically manage memory at the procedure level and still keep a balanced workload among processors. The adjustable starting and ending chunk sizes give the TSS scheme more flexibility than other well-known schemes, and a conservative choice of starting and ending chunk sizes was proposed. With the availability of a 96-node GP-1000, we were able to actually measure, rather than simulate, the performance of different self-scheduling schemes. Our experimental results clearly indicate the advantage of the TSS scheme over other self-scheduling schemes.

7.2 Future Plans

To improve the work reported in this thesis, we identify at least the following areas that deserve further study.

Partition an independent net

Loop netting divides an iteration space into many independent nets that are assigned to processors. If the number of independent nets is small, loop netting gains only a small speedup. One idea that deserves further study is dividing an independent net into many sub-nets so that an independent net can be shared by many processors.
This partitioning scheme will lose some data locality and cause data transfer and synchronization between processors executing different sub-nets in the same independent net. The major concern in this research is how to minimize the amount of data transfer and synchronization.

Dependence uniformization considering the magnitude of dependence vectors

Clearly, the major factor in the performance of dependence uniformization is the range of dependence slopes, which determines the delay factor between processors. If a loop has a very wide range of dependence slopes, in the worst case the proposed method could result in a serial execution. One possible improvement is to consider the magnitude of dependence vectors in order to choose a more efficient BDV set.

Dependence uniformization of multiple nested loops

The proposed dependence uniformization deals with only doubly nested loops. In the case of multiple nested loops, one can separately apply the proposed method to any two adjacent levels of loops. Loop interchange and loop coalescing may be helpful in this case. However, detailed study and experiments should be further explored.

TSS with a more aggressive choice of starting and ending chunk sizes

As indicated in Chapter 6, one of the major advantages of the TSS is the flexibility of choosing the starting and ending chunk sizes. The conservative approach in this thesis is certainly not the best one. If the workload distribution and scheduling overhead are known, finding an optimal pair (f, l) for the TSS scheme is an interesting issue.

APPENDICES

APPENDIX A

FINDING THE DCH IN A N-DIMENSIONAL SPACE

In this appendix, we study the cases in which the general solution of a system of Diophantine equations is composed of more than 2 free variables. Thus, instead of a 2-dimensional solution space, the algorithm for finding the DCH should be extended to n-dimensional space. We first present our ideas through some lemmas and theorems. Then the data structure used to represent a convex hull is described, followed by the complete algorithm.

A.1 Basic Principle

Recall that finding a DCH in an n-dimensional solution space involves computing the intersection of an n-dimensional convex hull and several half-spaces that are derived from the constraints in Eq. (2.2). A half-space in an n-dimensional space can be represented by an (n-1)-dimensional hyper-plane and a symbol, ≤ or ≥. Thus, finding a DCH involves computing the intersection of an n-dimensional convex hull and an (n-1)-dimensional hyper-plane. To simplify our presentation, an m-dimensional convex hull is represented by CH^m and an n-dimensional hyper-plane is denoted by HP^n. Without formal proofs, we first describe the principle of our algorithm by the following lemmas and theorem.

Lemma A.1 A CH^m is a convex hull bounded by many connected CH^(m-1)'s.

For example, the CH^3 shown in Fig. A.1 is bounded by 6 connected CH^2's. Each CH^2 is a 2-dimensional plane bounded by connected edges that are essentially 1-dimensional convex hulls. For instance, plane C1 is bounded by four connected edges.

Figure A.1. A CH^3 in a 3-dimensional space

Definition A.1 Let H be an HP^(n-1) and C be a CH^m in an n-dimensional space. We say H cuts C if some part of C is on one side of H and the rest of C is on the other side of H, where m ≤ n.

Lemma A.2 Let H be an HP^(n-1) and C be a CH^m in an n-dimensional space. If H cuts C, the intersection of H and C is a CH^(m-1).
For example, Fig. A.2 shows an HP^2 cutting a CH^3 in a 3-dimensional space, in which the intersection is a 2-dimensional convex hull bounded by the edges op, pq, qr, and ro.

Figure A.2. A CH^3 cut by an HP^2 in a 3-dimensional space

Theorem A.1 Let H be an HP^(n-1) and C be a CH^m which is bounded by k connected CH^(m-1)'s, C1, C2, ..., Ck, in an n-dimensional space. If H cuts C, the intersection of H and C is a CH^(m-1) which is bounded by the CH^(m-2)'s C'1, C'2, ..., C'k, where C'i, which may be empty, is the intersection of H and Ci.

Consider the example in Fig. A.2 again, in which the CH^3 is bounded by 6 CH^2's, C1, C2, ..., C6. The intersection of the HP^2 and the CH^3 is the CH^2 bounded by the 4 edges op, pq, qr, and ro. As can be seen, this hyper-plane does not cut C1 and C2. The intersections of this hyper-plane with C3, C4, C5, and C6 are the edges op, pq, qr, and ro, respectively.

A.2 Data Structure

To flexibly represent an m-dimensional convex hull, a tree-like structure is employed in our algorithm. A convex hull is represented by a structure which is composed of six fields, as follows.

/* ----- structure of a Convex Hull ----- */
struct CH {
    struct CH   *child;    /* pointer to its child ring           */
    struct CH   *next;     /* pointer to its brother CH           */
    int          dim;      /* number of dimensions of this CH     */
    int          chdno;    /* number of CHs in its child ring     */
    struct NODE *left;     /* pointer to a node; used in a CH^1   */
    struct NODE *right;    /* pointer to a node; used in a CH^1   */
};

/* ----- structure of a Node ----- */
#define Mdim 10            /* define the maximal dimension        */
struct NODE {
    struct NODE *next;     /* pointer to next node in node list   */
    int coord[Mdim];       /* the coordinates of this node        */
    int degree;            /* the degree of this node             */
};

/* ----- structure of a Half-space ----- */
struct HS {
    int  coeff[Mdim+1];    /* the coefficients of the hyper-plane */
    char zone;             /* > or <                              */
};

As indicated in Lemma A.1, a CH^m is composed of many connected CH^(m-1)'s, which are stored in a ring structure, called the child ring, in our algorithm. The field child is a pointer to its child ring, while next points to the next convex hull in the child ring. The dimension of this convex hull is stored in dim, and the number of convex hulls in the child ring is stored in chdno. If this is a CH^1 (an edge), left and right point to the two nodes that form the edge. All nodes of this convex hull are maintained in a list structure, which is the major concern in computing the range of dependence slopes in the proposed dependence uniformization. For example, the data structure of Figure A.1 is partially shown in Figure A.3. A half-space, represented by a hyper-plane and a symbol, is also listed above. Note that Mdim represents the maximal dimension in this algorithm.

Figure A.3. The data structure of the CH^3 in Figure A.1.

A.3 Algorithm

Similar to the algorithm shown in Fig. 4.4, initially the whole n-dimensional space is represented by an infinitely large CH^n which is composed of 2^n nodes. Then, the half-spaces derived from the general solution of Eq. (2.1) and Eq. (2.2) are applied to cut the initial convex hull. The algorithm, in a C-like language, is listed below.

Input: A list of half-spaces.
Output: A DCH

int dim;                  /* the dimension of the initial convex hull */
struct CH   *ch;          /* pointer to the convex hull               */
struct NODE *node;        /* pointer to the node list                 */
struct HS   *hs;          /* pointer to a half-space                  */
struct CH   *dummy;

begin
  Build the initial convex hull pointed to by *ch and the node list pointed to by *node;
  while( the input list is not empty ) begin
    Pop a half-space from the list into *hs;
    CUT(ch, node, hs, dummy);
    if( node == Null ) STOP;
  end while
end

CUT(ch, node, hs, ich)
struct CH   *ch;
struct CH   *ich;         /* returns the intersection of ch and hs    */
struct NODE *node;
struct HS   *hs;
begin
  int i, chdno;
  struct CH *new_ch;
  struct CH *p1, *p2, *ichx;

  ich = Null;
  if( ch->dim == 2 ) {                    /* 2-dimensional convex hull        */
    PLANE(ch, node, hs, ichx);
    ich = ichx;
  } else {
    new_ch = (struct CH *) malloc( sizeof( struct CH ) );
    new_ch->child = Null;                 /* the new (m-1)-dim CH starts empty */
    new_ch->chdno = 0;
    p1 = ch->child;                       /* the 1st CH in the child ring     */
    p2 = p1->next;                        /* the second CH                    */
    chdno = ch->chdno;
    /* --- now, scan the child ring --- */
    for( i = 1 ; i <= chdno ; i++ ) {
      CUT(p2, node, hs, ichx);            /* recursive call                   */
      if( p2->chdno == 0 ) {              /* CH not in the right zone         */
        p1->next = p2->next;              /* remove it from the ring          */
        p2 = p1->next;                    /* go to next child CH              */
        ch->chdno -- ;
      } else {
        p1 = p1->next;                    /* go to next child CH              */
        p2 = p1->next;
        if( ichx != Null ) {              /* CH intersects the hyper-plane    */
          /* insert it into new_ch's child ring */
          if( new_ch->child != Null ) {
            ichx->next = new_ch->child->next;
            new_ch->child->next = ichx;
          } else {
            ichx->next = ichx;            /* 1st CH in the ring               */
            new_ch->child = ichx;
          }
          new_ch->chdno ++ ;
        }
      }
    }                                     /* end of scanning the child ring   */
    if( new_ch->child != Null ) {         /* if the new CH is not empty,      */
      new_ch->dim = ch->dim - 1;          /* return it to the previous level  */
      ich = new_ch;
    }
  }
end;

/*** Cut a 2-dim CH (a plane) ***/
PLANE(ch, node, hs, ich)
struct CH   *ch;
struct CH   *ich;         /* returns the intersection of CH and HS    */
struct NODE *node;
struct HS   *hs;
begin
  int i, chdno;
  struct CH   *new_ch1;
  struct CH   *p1, *p2;
  struct NODE *inode;

  ich = Null;
  new_ch1 = (struct CH *) malloc( sizeof( struct CH ) );
  new_ch1->dim   = 1;
  new_ch1->left  = Null;                  /* endpoints of the new edge        */
  new_ch1->right = Null;
  p1 = ch->child;                         /* the first child CH (an edge)     */
  p2 = p1->next;                          /* the second child CH              */
  chdno = ch->chdno;
  /*** scan the child ring ***/
  for( i = 1 ; i <= chdno ; i++ ) {
    EDGE(p2, node, hs, inode);
    if( p2->left == Null && p2->right == Null ) {  /* edge not in the right zone */
      p1->next = p2->next;                /* remove it from the child ring    */
      p2 = p1->next;                      /* go to next child CH              */
      ch->chdno -- ;
    } else {
      p1 = p1->next;                      /* go to next child CH              */
      p2 = p1->next;
      if( inode != Null ) {               /* this edge intersects the hyper-plane */
        /* fill the intersection node into new_ch1 */
        if( new_ch1->left == Null )
          new_ch1->left = inode;
        else
          new_ch1->right = inode;
      }
    }
  }                                       /*** end of scanning the child ring ***/
  if( new_ch1->right != Null ) ich = new_ch1;
end;

EDGE(ch, node, hs, inode)
struct CH   *ch;
struct NODE *node;
struct NODE *inode;       /* returns the intersection of the edge and *hs */
struct HS   *hs;
begin
  struct NODE *nd1, *nd2;
  struct NODE *new_nd;

  inode  = Null;
  new_nd = Null;
  nd1 = ch->left;                         /* nd1 points to one node of the edge */
  nd2 = ch->right;                        /* nd2 points to the other node       */

  case 1: (Both *nd1 and *nd2 are not in the half-space *hs) {
    if( --nd1->degree == 0 ) Remove *nd1 from the list *node;
    if( --nd2->degree == 0 ) Remove *nd2 from the list *node;
    ch->left  = Null;
    ch->right = Null;
  }
  case 2: ( *nd1 is not in the half-space *hs) {
    if( --nd1->degree == 0 ) Remove *nd1 from the list *node;
    new_nd = (struct NODE *) malloc( sizeof( struct NODE ) );
    Compute the intersection point and store it into *new_nd;
    Insert *new_nd into the node list *node;
    ch->left = new_nd;
  }
  case 3: ( *nd2 is not in the half-space *hs) {
    if( --nd2->degree == 0 ) Remove *nd2 from the list *node;
    new_nd = (struct NODE *) malloc( sizeof( struct NODE ) );
    Compute the intersection point and store it into *new_nd;
    Insert *new_nd into the node list *node;
    ch->right = new_nd;
  }
  case 4: (Both *nd1 and *nd2 are in the half-space *hs);

  if( new_nd != Null ) inode = new_nd;
end;

BIBLIOGRAPHY

[1] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing. McGraw-Hill Book Co., 1984.

[2] D. Kuck, A. Sameh, R. Cytron, A. Veidenbaum, C. Polychronopoulos, G. Lee, T. McDaniel, B. Leasure, C. Beckman, J. Davies, and C. Kruskal, "The effects of program restructuring, algorithm change and architecture choice on program performance," in Proceedings of the 1984 International Conference on Parallel Processing, Aug. 1984.

[3] M. Booth and K. Misegades, "Microtasking: A new way to harness multiprocessors," in Cray Channels, Summer 1986.

[4] Alliant Computer System Corp., Acton, MA, FX/Series Architecture Manual, 1985.

[5] Cray Research, Inc., Mendota Heights, Minn., Multitasking User Guide, Jan. 1985.

[6] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante, "An overview of the PTRAN analysis system for multiprocessing," in Proceedings of the International Conference on Supercomputing, June 1987.

[7] D. Kuck, R. Kuhn, B. Leasure, and M. Wolfe, "The structure of an advanced vectorizer for pipelined processors," in Proc. IEEE Computer Society Fourth International Computer Software and Applications Conf., Oct. 1980.

[8] D. J. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe, "Dependence graphs and compiler optimizations," in Proceedings of the 8th ACM Symposium on Principles of Programming Languages, pp. 207-218, Jan. 1981.

[9] R. Allen and K. Kennedy, "Automatic translation of Fortran programs to vector form," ACM Transactions on Programming Languages and Systems, vol. 9, pp. 491-542, Oct. 1987.

[10] R. Allen and K. Kennedy, "Automatic loop interchange," in Proceedings of the ACM SIGPLAN 84 Symposium on Compiler Construction, pp. 233-246, June 1984.

[11] D. Callahan, K. Cooper, K. Kennedy, and L. Torczon, "Interprocedural constant propagation," in Proceedings of the ACM SIGPLAN 86 Symposium on Compiler Construction, pp. 59-67, July 1986.

[12] J. Davies, C. Huson, T. Macke, B. Leasure, and M. Wolfe, "The KAP/S-1: An advanced source-to-source vectorizer for the S-1 Mark IIa supercomputer," in Proceedings of the 1986 International Conference on Parallel Processing, (New York), pp. 833-835, IEEE Press, Aug. 1986.

[13] J. Davies, C. Huson, T. Macke, B. Leasure, and M. Wolfe, "The KAP/205: An advanced source-to-source vectorizer for the Cyber 205 supercomputer," in Proceedings of the 1986 International Conference on Parallel Processing, (New York), pp. 827-832, IEEE Press, Aug. 1986.

[14] B. Brode, "Precompilation of Fortran programs to facilitate array processing," Computer, vol. 14, no. 9, pp. 46-51, Sept. 1981.
[15] M. Wolfe, Optimizing Supercompilers for Supercomputers. Cambridge, Massachusetts: The MIT Press, 1989.

[16] D. A. Padua and M. J. Wolfe, "Advanced compiler optimizations for supercomputers," Communications of the ACM, vol. 29, pp. 1184-1201, Dec. 1986.

[17] C. D. Polychronopoulos, Parallel Programming and Compilers. 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061: Kluwer Academic, 1988.

[18] M. Wolfe, "Loop skewing: The wavefront method revisited," International Journal of Parallel Programming, vol. 15, pp. 279-293, Aug. 1986.

[19] M. Wolfe and D. J. Kuck, "Advanced loop interchanging," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 536-543, Aug. 1986.

[20] C. D. Polychronopoulos, D. J. Kuck, and D. A. Padua, "Execution of parallel loops on parallel processor systems," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 519-527, Aug. 1986.

[21] The Parallel Computing Forum, PCF Fortran: Language Definition, revision 1.5 ed., Aug. 1990.

[22] BBN Advanced Computers Inc., Cambridge, MA, Mach 1000 Fortran Compiler Reference, revision 1.0 ed., Nov. 1988.

[23] T. H. Tzen and L. M. Ni, "Dynamic loop scheduling for shared-memory multiprocessors," in Proceedings of the 1991 International Conference on Parallel Processing, vol. II, pp. 247-250, Aug. 1991.

[24] Z. Fang, P.-C. Yew, P. Tang, and C. Q. Zhu, "Dynamic processor self-scheduling for general parallel nested loops," in Proceedings of the 1987 International Conference on Parallel Processing, pp. 1-10, Aug. 1987.

[25] C. Polychronopoulos and D. Kuck, "Guided self-scheduling: A practical self-scheduling scheme for parallel supercomputers," IEEE Transactions on Computers, vol. C-36, pp. 1425-1439, Dec. 1987.

[26] P. Tang and P.-C. Yew, "Processor self-scheduling for multiple-nested parallel loops," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 528-535, Aug. 1986.

[27] M. Weiss, Z. Fang, C. R. Morgan, and P. Belmont, "Effective dynamic scheduling and memory management on parallel processing systems," in Proceedings of the 1989 COMPSAC, pp. 122-129, Sept. 1989.

[28] T. H. Tzen and L. M. Ni, "Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers," Tech. Rep. MSU-CPS-ACS-27, Michigan State University, 1990.

[29] D. Padua, Multiprocessors: Discussion of Some Theoretical and Practical Problems. PhD thesis, University of Illinois at Urbana-Champaign, Nov. 1979.

[30] R. Cytron, Compile-Time Scheduling and Optimization for Asynchronous Machines. PhD thesis, University of Illinois at Urbana-Champaign, Oct. 1984.

[31] R. Cytron, "Doacross: Beyond vectorization for multiprocessors," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 836-844, Aug. 1986.

[32] U. Banerjee, Dependence Analysis for Supercomputing. 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061: Kluwer Academic, 1988.

[33] X. Kong, D. Klappholz, and K. Psarris, "The I test: A new test for subscript data dependence," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 204-211, Aug. 1990.

[34] Z. Li, P.-C. Yew, and C.-Q. Zhu, "An efficient data dependence analysis for parallelizing compilers," IEEE Transactions on Parallel and Distributed Systems, vol. 1, pp. 26-34, Jan. 1990.
[35] D. Grunwald, "Data dependence analysis: the lambda test revisited," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 220-223, Aug. 1990.

[36] D. Malm, A Computer Laboratory Manual for Number Theory. COMPress, Inc., Wentworth, New Hampshire, 1980.

[37] S. Midkiff and D. Padua, "Compiler generated synchronization for do loops," in Proceedings of the 1986 International Conference on Parallel Processing, (Washington, DC), IEEE Computer Society Press, Aug. 1986.

[38] S. Midkiff and D. Padua, "Compiler algorithms for synchronization," IEEE Transactions on Computers, vol. C-36, no. 12, pp. 1485-1495, 1987.

[39] A. K. Nanda, H. Shing, T. H. Tzen, and L. M. Ni, "Resource contention in shared-memory multiprocessors: A parameterized performance degradation model," Journal of Parallel and Distributed Computing, vol. 12, pp. 313-328, 1991.

[40] U. Banerjee, S.-C. Chen, D. J. Kuck, and R. A. Towle, "Time and parallel processor bounds for Fortran-like loops," IEEE Transactions on Computers, vol. C-28, pp. 660-670, Sept. 1979.

[41] G. F. Pfister and V. A. Norton, "Hot-spot contention and combining in multistage interconnection networks," IEEE Transactions on Computers, vol. C-34, pp. 934-948, Oct. 1985.

[42] A. K. Nanda, H. Shing, T. H. Tzen, and L. M. Ni, "A replicate workload framework to study performance degradation in shared-memory multiprocessors," in Proceedings of the 1990 International Conference on Parallel Processing, vol. I, pp. 161-168, Aug. 1990.

[43] Y. Muraoka, Parallelism Exposure and Exploitation in Programs. PhD thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Feb. 1971.

[44] L. Lamport, "The parallel execution of do loops," Communications of the ACM, pp. 83-93, Feb. 1974.

[45] D. J. Kuck, Y. Muraoka, and S. C. Chen, "On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup," IEEE Transactions on Computers, vol. C-21, pp. 1293-1310, Dec. 1972.

[46] R. Kuhn, Optimization and Interconnection Complexity for: Parallel Processors, Single-Stage Networks, and Decision Trees. PhD thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Feb. 1980.

[47] W. Shang and J. Fortes, "Independent partitioning of algorithms with uniform dependencies," in Proceedings of the 1988 International Conference on Parallel Processing, vol. 2, pp. 26-33, Aug. 1988.

[48] E. D'Hollander, "Partitioning and labeling of index sets in do loops with constant dependence vectors," in Proceedings of the 1989 International Conference on Parallel Processing, vol. 2, pp. 139-144, Aug. 1989.

[49] W. Shang and J. Fortes, "Tiling of iteration spaces for multicomputers," in Proceedings of the 1990 International Conference on Parallel Processing, vol. 2, pp. 179-186, Aug. 1990.

[50] U. Banerjee, "Unimodular transformations of double loops," in the Third Workshop on Programming Languages and Compilers for Parallel Computing, Aug. 1990.

[51] D. E. Hudak and S. G. Abraham, "Compiler techniques for data partitioning of sequentially iterated parallel loops," in Proceedings of the 1990 International Conference on Supercomputing, Amsterdam, pp. 187-220, 1990.

[52] Z. Fang, P.-C. Yew, P. Tang, and C. Q. Zhu, "Dynamic processor self-scheduling for general parallel nested loops," IEEE Transactions on Computers, vol. 39, pp. 919-929, July 1990.
[53] Z. Fang, "Cache or local memory thrashing and compiler strategy in parallel processing systems," in Proceedings of the 1990 International Conference on Parallel Processing, vol. II, pp. 271-275, Aug. 1990.

[54] D. E. Knuth, The Art of Computer Programming, Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, Massachusetts, 1981.

[55] Z. Shen, Z. Li, and P. C. Yew, "An empirical study on array subscripts and data dependences," in Proceedings of the 1989 International Conference on Parallel Processing, vol. II, pp. 145-150, Aug. 1989.

[56] BBN Advanced Computers Inc., Cambridge, MA, Overview of the Butterfly GP1000, Nov. 1988.

[57] E. Lusk and R. A. Overbeek, "Implementation of monitors with macros: A programming aid for HEP and other parallel processors," Tech. Rep. ANL-MCS-83-97, Argonne National Lab., 1983.

[58] L. M. Ni and C.-F. E. Wu, "Design tradeoffs for processor scheduling in shared-memory multiprocessor systems," IEEE Transactions on Software Engineering, vol. 15, pp. 327-334, Mar. 1989; also in Proc. of the 1985 Int. Conf. on Parallel Processing, pp. 63-70.

[59] The Parallel Computing Forum, PCF Fortran: Language Definition, revision 1.5 ed., Aug. 1990.