ﬂaw" mm rw~v§~
' . _ _ ' .' . ; ' - '

.1 .12.: :‘.l

.,.-V.|S.‘

 

Ill/ll ., Ill/I llllwlsllt’ll/‘IA/l/l‘l’l

This is to certify that the

dissertation entitled

MATRIX-BASED REPRESENTATIONS OF LOOP
TRANSFORMATIONS

presented by

David Raymond Chesney

has been accepted towards fulﬁllment
of the requirements for

Ph . D . degree in Computer Sc ience

 

, Z

Major professor

Date 91’?!ng

MSU is an Affirmative Action /Equal Opportunity Institution 0- 12771

 

LIBRARY
Michigan State
Unlverslty

 

 

 

PLACE IN RETURN BOX to remove this Moat from your mood.
TO AVOID FINES Mum on or More data duo.

ﬁTE DATE DUE DATE DUE

        

 

 
  

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU I: An Nﬁnnltlvo Adlai/Emu Opportunity Instltwon
Wm:

_ VEK __. _..__._ _ __...\_ _V _

 

_

 

 

MATRIX-BASED REPRESENTATIONS OF LOOP
TRANSFORMATIONS

By

David Raymond Chesney

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1995

ABSTRACT

MATRIX-BASED REPRESENTATIONS OF LOOP
TRANSFORMATIONS

By

David Raymond Chesney

The execution speed of source code on a parallel architecture machine is bound by
many factors, including the ability of a parallel compiler to determine parallelism in the
target source code. Most of the available parallelism in a source code is contained in the
loops and is exploited by applying a sequence of loop transformations. Different methods
of representing and ordering sequences of transformations have been developed, including
the use of unimodular transformations. A signiﬁcant feature of the unimodular approach is
that it uniﬁes a kernel set of seemingly independent loop transformations into the familiar
linear algebra domain. The unimodular approach has been applied to loop permutation,
loop reversal, and loop skewing in perfectly nested loops.

This research extends matrix-based representations of loop transformations to include
a wider family of loop transformations applied to a broad range of source code structures.
First, a matrix-based representation of additional loop transformations has been developed,
namely loop normalization, loop blocking, strip mining, cycle shrinking, loop collapsing,
loop coalescing, loop ﬁssion, and loop fusion. Second, the application of the extended and
kernel transformations is generalized to handle both perfectly and imperfectly nested loops.
Finally, properties of the original unimodular transformations, including composition, one-

time mapping, and the evaluation of parallelism are preserved in the extended model.

Copyright © by
David Raymond Chesney
1995

To S.T., ILYTTMABA

iv

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Betty H.C. Cheng, for her assistance during my
research. Early in my academic career she told me that earning a Ph.D. is like a roller-
coaster ride. Thank you, Dr. Cheng, for believing in me in both the highs and the lows of
my Ph.D. program.

Thank you also to my committee. Dr. Lionel Ni once said that the key to success in a
Ph.D. is ﬁnding the question. After that, the answer comes easy. His assistance in helping
me find both the question and the answer were invaluable. Dr. Diane Rover helped me
both technically, and by showing me that it is possible to balance family and academics.
She is a wonderful role model. Dr. David Yen ably assisted on my committee by adding a
mathematical perspective.

The pursuit of my Ph.D. would not have been possible without the General Motors
Corporation Fellowship. Mr. Dennis Meyers served in the non-always-easy position of
corporate mentor during the fellowship. Our numerous early morning meetings led to
sound career advice and opportunities. Thank you, Dennis.

Several other faculty and staff stand out. Dr. Carl Page sowed the seeds of my initial
interest in computer science, and saw to it that I had a solid foundation. Dr. Richard
Enbody is a true friend and has given me direction in both academic and non-academic
matters. I have a history with Mrs. Lora Mae Higbee from before my time in the Computer
Science Department. She has been a wonderful “aunt” and listener throughout. Mrs. Linda
Moore has always exuded genuine warmth and able assistance on my frequent visits to her
ofﬁce.

Thank you to my friends and fellow students. I feel fortunate to be a part of such a
remarkable group. I will truly miss our lunch and office conversations dealing with such
topics as SPMD machine operation, IATEX errors, and platypus breasts. By name, they
are Andy Fung, Edgar Kalns, Barb Birchler, Gerry Gannod, Steve Turner, Marie-Pierre
Dubuisson, Joe Sharnowski, Christian Trelftz, Tim Newman, Jon Engelsman, Jay Kusler,
Steve Schafer, Enoch Wang and Oliver Christ. Special thanks to Maureen Galsterer and
Linda Cherry for sharing the commute and providing rich conversation. Ron Sass offered
critical review of my work that strengthened my overall approach and the potential value
of my research. Thanks, Ron.

My parents, Dale A. and Beverly E. Chesney taught me how to dream big dreams.
Their love and support has always been an inspiration to me. It’s funny! Even as an adult,
I still need to make them proud.

My daughter, Mairin, came along midway through my academic career. Her giggles and
hugs added more to any success at my research than any amount of time at a workstation.
Mairin, I love you dearly, and being your (and your siblings, wink! wink!) Daddy is my
true vocation.

Finally, and most importantly, thank you to my wife, Jean. I don’t know if you realized
what the journey would be like when you married a dreamer. You are the absolute center
of my life and there is no possible way that I could have completed my Ph.D. without you.
I love you and thank you, from the bottom of my heart.

vi

TABLE OF CONTENTS

LIST OF FIGURES ix
1 Introduction 1
1.1 Motivation .................................... 2
1.2 Contributions ................................... 3
1.3 Organization ................................... 5
2 Background and Deﬁnitions 6
2.1 Source Code Structure .............................. 6
2.2 Perfect and Imperfect Loop Nests ........................ 9
2.3 Dependences ................................... 1 1
2.4 Properties of Dependences ............................ 13
2.5 Unimodular Transformations .......................... 17
2.6 Global Pre- and Postconditions ......................... 21
2.7 Transforming Source Code ............................ 21
3 Kernel Transformations and Loop Normalization 26
3.1 Kernel Transformations ............................. 26
3.2 Loop Normalization ............................... 30
4 Loop Blocking and Loop Collapsing 41
4.1 Loop Blocking .................................. 41
4.2 Loop Collapsing ................................. 51
5 Loop Fission and Loop Fusion 59
5.1 Loop Fission ................................... 59
5.2 L00p Fusion .................................... 62
6 Integration of the Extended Loop Transformations 71
6.1 Previously Developed Algorithms ........................ 73
6.2 New Algorithms ................................. 77
6.3 Fine-grain parallelism for imperfect-only-child loop nests using the kernel
transformation set ................................ 79

vii

6.4 Coarse-grain parallelism for imperfect-only-child loop nests using the ex-
tended transformation set ............................
6.5 Data locality for imperfect-only-child loop nests using the extended transfor-
mation set .....................................
6.6 Imperfect-only-child loop nests on vector architecture machines using the
extended transformation set ...........................
6.7 Imperfect-only-child loop nests on a shared-memory machine using the ex-
tended transformation set ............................

7 Implementation Model

7.1 Overview .....................................
7.2 Requirements ...................................
7.3 OMT Analysis ..................................
8 Related Work
8.1 Unimodular Matrices ...............................
8.2 Non-Singular Matrices ..............................
8.3 Schedules .....................................
8.4 Generalized Loop Transformations .......................
8.5 Precondition Approach ..............................
8.6 Morphism Notation ................................

9 Conclusions and Future Investigations

BIBLIOGRAPHY

viii

86

91

103
103
105
107

120
120
122
123
125
128
130

134

136

2.1
2.2
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.14
2.15
2.16

3.1
3.3
3.4
3.6
3.7

5.3

6.1

6.2

6.3
6.4

6.5
6.6
6.7

LIST OF FIGURES

Sample code: general form of a nested loop ................... 7
Sample code and array index function for statement SI. ........... 8
Relationship of perfect and imperfect loop nests ................. 11
Sample code containing scalar ﬂow, anti, and output dependence ....... 12
Sample code containing array ﬂow, anti, and output dependences ....... 13
Dependence graph (depth = 1) .......................... 14
Dependence graph (depth 2 2) .......................... 14
Dependence graph (depth = 3) .......................... 15
Sample code: before unimodular transformation. ............... 18
Transformation of assignment statements. ................... 22
Algorithm to map the loop bounds ........................ 23
Transformation of loop bounds .......................... 24
Sample code: general form of an imperfectly nested loop. .......... 27
Sample code: illegal permutation of loops L1 and L2 .............. 28
Sample code: skew of loop L; by m x L1. ................... 29
Sample code: different cases of unnormal loops ................. 33
Sample code: false dependence. ......................... 36
Algorithm to determine if two loop bound sets can be adjusted ........ 64

Algorithm to obtain coarse-grain parallelism using the kernel transformations

of perfect loop nests ................................ 75
Algorithm to obtain ﬁne-grain parallelism using the kernel transformations

on perfect loop nests ................................ 78
Sample code: n-deep imperfect-only-child loop nest. ............. 79
Algorithm to obtain ﬁne-grain parallelism using the kernel transformations

on imperfect-only-child loop nests. ....................... 81
Sample code: initial code before application of modJrerneLfine ....... 82

Sample code: transformed code after application of modlerneliine. . . . 83
Algorithm to obtain coarse-grain parallelism using the extended transforma-
tions on imperfect-only-child loop nests. .................... 85

ix

6.8 Sample code: parallelization not allowed by kernel unimodular transformations. 86

6.9 Sample code: parallelization after loop ﬁssion .................. 86
6.10 Sample code: showing reuse and locality ..................... 87
6.11 Sample code: statement groupings for data locality {(31, S4), (32, 53)}. . . . 88
6.12 Algorithm to improve data locality using the extended transformations on
imperfect-only-child loop nests. ......................... 89
6.13 Sample code: before application of data locality algorithm. ......... 90
6.14 Sample code: after application of data locality algorithm. .......... 91
6.17 Algorithm to optimize source code for a vector machine using the extended
transformations on imperfect-only-child loop nests ............... 95
6.18 Sample code: before application of Algorithm ext.vector ........... 96
6.21 Algorithm to optimize source code for a shared memory machine using the
extended transformations on imperfect-only-child loop nests. ........ 100
6.22 Sample code: before application of Algorithm ext.sms ............. 101
6.23 Sample code: after application of Algorithm extents. ............ 102
7.1 Graphical user interface .............................. 106
7.2 Object diagram for overall system. ....................... 109
7.3 Object diagram for source code dependences. ................. 110
7.4 Object diagram for transformation matrices ................... 110
7.5 Object diagram for transformation sequence. ................. 111
7.6 Object diagram for source code .......................... 112
7.7 Backus-Naur Form of target imperative language (modiﬁed Tiny syntax). . 114
7.8 State diagram for system. ............................ 115
7.9 Level 0 data ﬂow diagram ............................. 116
7.10 Level 1 data ﬂow diagram ............................. 117
7.11 Level 2 data flow diagram for ﬁnding the transformation matrix T ...... 118
7.12 Level 2 data ﬂow diagram for mapping source code. ............. 119

CHAPTER 1

Introduction

The speed of executing source code on a parallel computer is theoretically limited by the
number of processors [1, 2]. However, a practical limitation is the ability of a compiler
for a parallel machine to recognize and exploit parallelism and data locality. In order to
parallelize and localize data, a compiler is often more complex for parallel machines than
for sequential machines. Methods need to be developed that easily and consistently exploit
parallelism and locality, which leads directly to faster execution on parallel architecture
machines.

Most of the available parallelism in target source code is in the loops and certain com-
binations of transformations to the loops maximize parallelism or data locality goals. Ex-
amples of the loop transformations include loop permutation, loop reversal, loop skewing,
loop ﬁssion, loop fusion, loop blocking (tiling), strip mining, cycle shrinking, loop collaps-
ing, loop coalescing, and loop normalizing. Individually, the transformations are fairly well
understood, however, they are often treated independently.

Thesis Statement.

The purpose of this research is to extend matrix-based representations of loop transforma-
tions to include loop normalization, loop blocking, tiling, cycle shrinking, strip mining, loop
collapsing, loop coalescing, loop ﬁssion, and loop fusion. The transformations are applied to

both imperfectly and perfectly nested loops, and preserve attractive properties of the tradi-

tional matrix-based models. New algorithms are developed that use the kernel and extended

transformation set to obtain both machine-independent and machine-dependent goals.

1.1 Motivation

A uniﬁed framework that represents a broad spectrum of transformation techniques is es-
sential to the usefulness of a parallel compiler. An integrated framework facilitates the
user’s ability to manipulate and transform source code within a common context. Recently,
several generalized transformation schemes have been proposed to unify the seemingly inde-
pendent transformations, including unimodular matrices [3, 4], non-singular matrices [5, 6],
schedules [7, 8], and other general approaches [9, 10, 11].

Each generalized model has its own set of strengths and weaknesses. One weakness of
the generalized models is that only a limited number of loop transformations are repre-
sented. As an example, unimodular transformations represent loop permutation, reversal,
and skewing. Second, there is typically an explicit limitation on the source code struc-
ture before the generalized model is applied. That is, the models typically require perfect
loop nesting. Third, mapping rules for the dependences are often imprecise. Accurate
dependence information is paramount to the evaluation of any transformation sequence.
The accuracy needs to be maintained throughout the transformation sequence in order for
parallelism goals to be evaluated.

A major strength of each transformation model is the uniﬁcation of a set of loop transfor-
mations, which allows a user to manipulate different loop transformations withina common
context. Unimodular transformations offer three additional strengths: one—time mapping of
initial source code into transformed source code; composability; and the ability to evaluate
parallelism goals using an abstract representation of the source code and transformation
sequence.

Unimodular transformations are a set of loop transformations that are represented by
matrices with speciﬁc properties. Speciﬁcally, loop permutation, reversal, and skewing are

all represented by unimodular matrices. A sequence of transformations is represented by the

product of the individual transformations that comprise it, and is called the transformation
matrix T.

The user can determine if the transformed source code is legal based upon properties of
the product of the transformation matrix T and a vector representation of the dependences
D is the source code. Source code parallelism is also determined from the product of the
transformation matrix T and the dependences D. The source code is mapped from initial to
ﬁnal form, given the transformation matrix T. That is, there exists an algorithm for source

code mapping based upon the matrix representing the sequence of transformations.

1.2 Contributions

The major contributions of this research are: 1) to extend the matrix-based representations
of loop transformations; 2) to generalize the kernel and extended set of transformations
to apply to both perfect and imperfect loop nests; and 3) to develop new algorithms that
are based upon combinations of the transformations to achieve goals relevant to a parallel
compiler. The research presented in this dissertation preserves the strengths of unimodular
transformations, yet overcomes the weaknesses of other generalized models [12].

First, this dissertation describes an extension to the unimodular approach to include
a wider class of loop transformations. Speciﬁcally, ﬁve new families of transformation
techniques are described in terms of a matrix-based framework: loop normalization [13,
14, 15], loop blocking [4, 14, 15, 16, 17, 18, 19], loop collapsing [14, 15, 18], loop ﬁssion
[17, 18, 19, 20, 21], and loop fusion [14, 15, 17, 18, 19].

Loop normalization is a transformation that transforms source code so that the lower
bounds and step sizes of a loop are both one. Normalized loops are often a precondition for
kernel transformations and for many extended transformations that are discussed in this
dissertation. Loop ﬁssion is a loop transformation technique that divides a set of program
statements within a nested loop into two separate nested loops. The inverse of loop ﬁssion is
loop fusion, in which two adjacent nested loops are combined into one [22]. Loop blocking

includes tiling, strip mining, and cycle shrinking; and this transformation increases the

depth of a nested loop. Loop collapsing includes loop coalescing and is the inverse of loop
blocking because it decreases the depth of a nested loop [23].

Mapping for the source code and dependences is explicitly deﬁned for each extended
transformation. The rules for source code mapping show how the initial source code is
mapped to its ﬁnal form. Accurate dependence mapping rules ensure that the transformed
source code is semantically equivalent to the initial source code and allows for evaluation
of potential parallelism.

Another important advantage of the extended approach is the ability to transform im-
perfectly nested loops. Imperfect loop nests do not have all assignment statements contained
in the deepest loop. The generalized unimodular framework presented in this dissertation
enables both the kernel set (loop permutation, reversal, and skewing) and the extended set
(loop normalization, ﬁssion, fusion, blocking, and collapsing) of loop transformations to be
applied to both perfect and imperfect loop nests.

Loop transformations are used for numerous purposes in a parallel compiler. For ex-
ample, loop permutation and skewing may be used to obtain parallelism. Loop blocking
may result in improved data locality or reduced barrier synchronization. Combining loop
blocking, permutation, and skewing enables parallelism, data locality, and barrier synchro-
nization goals to be attained within a matrix-based framework. That is, extending the
unimodular approach to include additional transformations broadens the potential advan-
tages to include those of the extended transformation set.

Several new algorithms are introduced that combine the kernel and extended transfor-
mation sets on machine-independent and machine-dependent goals. Machine-independent
goals include data locality and parallelism, and machine-dependent goals include transfor-
mation of source code for a vector or shared-memory machine. The algorithms combine the
loop transformations to obtain goals that are beyond the scope of the kernel transformation
set alone.

The current unimodular approach is a theoretically elegant method for applying a limited

number of transformations to a restricted set of source code structures. The generalization

presented here broadens the types of transformations that are applied to a wider class of

source code structures.

1 .3 Organization

The remainder of this dissertation is organized as follows. Chapter 2 gives deﬁnitions
and background information used throughout the dissertation. Chapter 3 discusses the
generalization of the kernel transformations (loop permutation, reversal, and skewing) to
include imperfectly nested loops and also loop normalization. The next chapter discusses
two additional transformations that modify the nest depth of a nested loop, namely loop
blocking and collapsing. Chapter 5 discusses loop ﬁssion and fusion, which are the last of
the extended transformations. Algorithms that attain machine-independent and machine-
dependent goals using the kernel and extended transformation sets are discussed in Chapter
6. Chapter 7 uses the object modeling technique to describe the implementation framework
of the system. Related work is discussed in Chapter 8. Finally, conclusions and future
investigations are discussed in Chapter 9.

Chapters 3 through 5 describe matrix representations of individual loop transformations,
and the format of all of the chapters is similar. First, the transformation is brieﬂy described
and an example is given. Second, the legality of the transformation is discussed in terms of
invariant, explicit, and nesting preconditions. Next, the effect of the loop transformation
on the distance vector set D is discussed. Accurate mapping of D is essential toyevaluation
of overall legality, parallelism, and other transformation goals. Finally, the source code

mapping is discussed.

CHAPTER 2

Background and Deﬁnitions

This section gives background and deﬁnitions for terms used throughout the dissertation.
First, the terminology related to source code structure is discussed. Next, the taxonomy of
perfect and imperfect loop nests is described. Third, dependences and their properties are
discussed. Next, fundamentals of unimodular transformations are described, followed by

the global pre- and postcondition for applying a sequence of matrix-based transformations.

2.1 Source Code Structure

This section discusses the explicit notation used throughout the dissertation to describe

loop bounds and subscript expressions.

Loop Bound Notation

In general, a loop nest is in the form shown in Figure 2.1, where n is the loop nest depth. The
entire iterative statement “for I,- = 1b,, ab, 3;” is referred to as L. I1 through In are index
variables, and lb,-, ab;, and s,- are the lower bound, upper bound, and step size, respectively,
for the loop with index variable I,-. In order to prevent ambiguity, the subscripts for the
upper bound, lower bound, and step size may contain the name of the index variable with
which they are associated. As an example, lb;1 is equivalent to lbl.

If there is an n-nested loop, then the loop bound set B has n loop bound vectors 5.

Each 5 has three loop bound expressions: the lower bound, upper bound, and step size. As

 

L1 for I) = lb1,ub1,s1
L2 for 12 = lb2,ub2, .92

Ly; for In = lb“, “6”, an
51 ARR[sub1, suba, ..., subm] = . . .

Figure 2.1. Sample code: general form of a nested loop.

 

an example, the source code given in Figure 2.1 has the loop bound set:
B = {(lbl, ubl, 31), (lbg, sz, .92), . . . , (lbn, ubn, sn)}.

The ﬁrst vector in B is bl = (lbl, ubl, s1) and refers to the outermost loop. The last vector
in B is 5,, = (lbn, ab", 3,.) and refers to the innermost loop.

A restriction on the form of the lower bounds, upper bounds, and step sizes in the loop
nest is that each lb.-, and ub;, 1 5 i _<_ n, is a linear function of the surrounding index
variables 15, such that j < i; and that s,- is an integer constant. Then each lb,- is in the
form:

0:3111 + ai,212 + - - . + ai,jIj + Ci,

where j < i, at]: is the coefﬁcient of index variable I,- with respect to I j, and c,- is an integer

constant. Similarly, each ub; is in the form:
bi,lIl + b.3212 + - . - + bi,jIj + dz"

where j < i, b.31- is the coefficient of index variable I.- with respect to Ij, and d,- is an integer
constant. As an example, suppose that we have a loop in the form “for I3 = 211 +6, 812, m”,

then:

03,1 = 2; b3,1 = 0; 82 = m;
03,2 = 0; b3,2 = 8;

62:6; (12:0.

Subscript Expression Notation

In a loop nest, 3,- refers to an entire assignment statement. If ARR is the name of an m-
dimensional array, then each sub;, 1 S i S m, is referred to as either an array subscript,
subscript expression, or reference to index variables. Note that m does not necessarily equal
17.. That is, the loop nest depth and dimension of the array are not necessarily the same.
An intuitive representation of an array ARR is using array index function form [24]. An
m-dimensional array ARR in a n-nested loop is in the form ARR[sub1, subg, . . ., subm], where
each sub.- is a linear function of the loop index variables of the loops plus a constant term.

The subscript expressions in array ARR are represented in the form:

Hl,l . . . H1," 11 Cl

HgJ . . . H2," 62
. . X 5 + .

Hm,l . . . Hm," In Cm

where the m x n matrix H is called the array index function. In general, the representation
of an array in terms of an array index function is in the form [H ][I,-] + [03'], and is called
array index function form. As an example, the assignment statement 51 shown in Figure

2.2 has the array index function form shown to the right of the source code.

 

L; for 11 = Lab; 3 0 I 0
L2 for12=1,ub2 1 2 x[I1 ]+ 3
s, ARR[3I1,I1+212+3,4]=... o o 3 4’

Figure 2.2. Sample code and array index function for statement S].

 

Two array references have uniformly generated references if they refer to the same array
and have the same array index function H. A uniformly generated set is an equivalence
class, such that any two members of the set have uniformly generated references. There is

little potential for reuse in an array that has non-uniformly generated references [24]. That

is, arrays with the same name, that also have the same array index function H, will likely
have data locality that can be exploited.

If all Hi“,- and c,- for a subscript expression sub,- are integer constants, then sub,- is linear.
Otherwise, sub,- is nonlinear. If all subscript expressions in an m-dimensional array ARR are
linear, then ARR is linear. If some subscript expressions are nonlinear, then ARR is partially
linear. If no subscript expressions are linear, then ARR is nonlinear. Finally, if index variable
1;, 1 S i S n, appears in more than one sub,, 1 S j s m, then ARR has coupled subscripts,

where n is the loop nest depth and m is the dimension of the array [25].

2.2 Perfect and Imperfect Loop Nests

A loop nest hierarchy is a representation of the structure of a nested loop using a rooted
tree, where the parent-child relation is deﬁned in terms of loop nesting and the root is the
outermost loop. Speciﬁcally, loop L,- is the parent of loop L ,- if loop L,- is the next outer level
of loops that completely encloses loop L j. For example, in Figure 2.3(a), L1 is the parent of
L2 and L2 is the parent of L3. In Figure 2.3(b), L4 is the parent of L5 and L5 is the parent
of L6. Finally, in Figure 2.3(c), L7 is the parent of both L8 and L9. Note that in the source
code in Figures 2.3(b) and 2.3(c), there are assignment statements between the respective
for’s and the endfor’s. (The endfor’s are included in Figures 2.3(a) through 2.3(c) for
clarity. In other source code samples in this dissertation, the location of the endior’s is
implied.) Two loops may have the same parent, as shown in Figure 2.3(c); however, no loop
may have more than one parent.

Two loops are perfectly nested if there are no statements between their opening for
statements, nor between their respective closing endfor statements. A loop nest is perfectly
nested if each parent-child pair in the loop nest hierarchy is perfectly nested. Note that
there is a difference between perfect nesting of two loops and perfect nesting of an entire
loop nest. A loop nest that is not perfectly nested is imperfectly nested.

For example, L1 and L2 are perfectly nested, as are L2 and L3 . Therefore, the entire

loop nest in Figure 2.3(a) is perfectly nested. L4 and L5 are imperfectly nested in Figure

10

 

L1 101'11... L4 for];... L1 fOI'K1...
L2 forlg... 81 var1=... $1 var1=...
L3 for13... L5 forlz... L3 foer...
S; var1=... Le for-.13... $2 var2=...
S; var2 = . .. 32 var2 = ... endtor
$3 var3=... $3 var-3:... L9 forKa...
endfor endtor 33 var3 = . ..
endfor endtor endtor
and! or end! or endtor

Figure 2.3(a). Sample code and
loop nest hierarchy: perfect nest-

25.

Figure 2.3(b). Sample code and
loop nest hierarchy: imperfect-
only- child nesting.

Figure 2.3(c). Sample code and
loop nest hierarchy: imperfect-
sibling nesting.

2.3(b), however, L5 and L3 are perfectly nested. Therefore, the loop nest in Figure 2.3(b) is
imperfectly nested. Similarly, L8 and L9 are imperfectly nested in L7, and the entire loop
nest is imperfectly nested in Figure 2.3(c).

The taxonomy for perfect and imperfect loop nests is derived from the form of the
loop nest hierarchy. Perfect loop nests (Figure 2.3(a)) are rooted trees of degree one with
no statements between opening for ’s and closing endfor ’s of any parent-child loops.
Imperfect-only-child loop nests (Figure 2.3(b)) are rooted trees with degree one with at least
one assignment statement between a parent-child pair of for ’s or between a parent-child
pair of endfor ’3. Thus, statement 51 in Figure 2.3(b) makes the loop nest imperfect-
only-child. Finally, imperfect-sibling loop nests (Figure 2.3(c)) are rooted trees with degree
greater than one. The relationship between these three different forms of loop nests is shown

in Figure 2.4.

11

 

 

rimperfect-sibling loop nests \

imperfect-only-child loop nests

C perfect loop nests >
k j

 

 

 

 

 

 

Figure 2.4. Relationship of perfect and imperfect loop nests.

 

The applicability of unimodular transformations is dependent upon the distinction be-
tween perfect, imperfect-only-child, and imperfect-sibling loop nests. That is, currently
unimodular transformations are applied only to perfect loop nests. This research shows
how to apply transformation matrices to imperfect-only-child loop nests, and how to trans-

form many imperfect-sibling loop nests into imperfect-only-child loop nests.

2.3 Dependences

In any assignment statement, the variable on the left-hand side (LHS) is being deﬁned (def),
and any variables on the right-hand side (RHS) are being used (use). The three types of
dependences: ﬂow, anti, and output, are deﬁned in terms of the chain of defs and use’s that
result in a dependence. Flow dependence is deﬁned as a def-use chain; anti dependence is
deﬁned as a use-def chain; and output dependence is deﬁned as a def-def chain. Symbols
representing flow, anti, and output dependence are 6Jf , 6°, and 6", respectively, where 6
represents one or more occurrences of these types. That is, if the relation 6’, 6“, or 6°
holds, then 6 holds.

Source code that illustrates each type of dependence is given in Figure 2.5: 5'5 is ﬂow
dependent upon 5'1 due to A (51 6f 55); 53 is output dependent upon 5'2 due to B (5'2 6" S3);
S4 is anti dependent upon 53 due to G (53 6“ S4); and $2 and 33 are anti dependent upon

51 due to B (51 6“ $2, $1 6“ 53). Note that the order of the statements in the notational

12

description implies the order in the execution of the source code. That is, $1 6! 55 implies

that statement SI must ﬁnish execution before statement 35 begins execution.

 

51 A=A+B
S: B=C+D
$3 B=G+I
$4 G=G+1
35 H=A+K

Figure 2.5. Sample code containing scalar ﬂow, anti, and output dependence.

 

Dependences can be represented in two forms: distance vectors and direction vectors. In
distance vectors, the components are integral, whereas in direction vectors, the components
are symbolic (<, =, and >).

A distance vector set D is a set representation of the dependences within a body of
source code. Each member of the set is called a distance vector d: Finally, each component
of a distance vector d,- is a distance. For an n-nested loop, d1 is the outermost loop, dn is
the innermost loop, and d,-, 1 < i < n, refers to the respective loops between L1 and Ln.

For example, the distance vector set for the loop nest in Figure 2.6 is:

D = {(0,1)r(190)v(1’ _1)’(0’2)’(0’3)}’

where the ﬁrst element of D, (0,1) is the distance vector for the output dependence between
32 and $1; the second element (1,0) is the distance vector for the anti dependence between 53
and 51; the third element of D, (1,-1) is the distance vector for the anti dependence between
$4 and S]; the fourth element of D, (0,2) is the distance vector for the anti dependence
between $1 and itself; and the ﬁfth element of D, (0,3) is the distance vector for the anti
dependence between $2 and 51.

A symbolic direction in a dependence vector describes the temporal nature of depen-

dences between statements. The integer distances in a distance vector explicitly deﬁne the

13

 

L1 for [1 =1,”

L: for I; = 1,n

51 l[11,12]=1[11,12+2]+B[ll,12]+0[11.12]
S: A[11,12 - I] = var;

S3 BU1 - 1,12] = var;

$4 C[I1 - 1,12 + I] = vars

Figure 2.6. Sample code containing array ﬂow, anti, and output dependences.

 

distance in terms of iteration space between uses of a variable, and distance vectors are
used in this dissertation to represent dependence information. It should be noted that di-
rection vectors can be derived from distance vectors, but not vice-versa. As an example,

the distance vector (2,-3,0) is equivalent to the direction vector (<, >, =).

2.4 Properties of Dependences

This section formalizes deﬁnitions related to dependence vectors and dependence vector
sets. A distance vector d in an n-nested loop is in the form (d1, . . .,d,-, . . .,d,,). A loop-

independent dependence is one in which all of the distances in the distance vector are zero:
(Vi:1$i§n:d;=0).

A loop-carried dependence is one in which at least one distance is non-zero:
(3j:lgjgn:dj¢0).

In a loop-carried dependence, the dependence is carried at the loop depth of the ﬁrst non-
zero distance. That is, if d,- is the ﬁrst non-zero distance in d: then the dependence is carried
at depth j.

Each dependence relation exists from a statement to a statement of source code. An
intuitive representation of the dependence relations uses a directed graph G'(V, E) called a

dependence graph. The vertices V are the statements in the source code and the edges E

14

are the dependences between the statements. Figure 2.7 is the dependence graph for the

source code in Figure 2.6, where the are points in the direction of the dependence.

 

(2,0,0) (1,0,0) (0,0,2) (0,3,0)
0 2 1

9 9

 

 

 

 

0—K v.“

6:) (.-.,, 4x89

Figure 2.7. Dependence graph (depth = 1).

 

 

Dependence graphs are also deﬁned in terms of depths of the carried dependences. A
dependence graph at depth i contains all of the dependences that are carried at depth i
or greater. The depth 1 dependence graph represents all of the dependences in the source
code. As an example, the depth 1 dependence graph for the source code shown in Figure
2.6 is given in Figure 2.7. The depth 2 dependence graph contains all dependences that
are carried at a depth of 2 or greater. Thus, the distance vectors {(2,0,0),(1,0,0),(1,-1,0)}
are eliminated from the dependence graph in Figure 2.8. Similarly, the depth 3 dependence
graph is shown in Figure 2.9.

 

v
l)

(0,2,1) (0,0,2) (0,3,0)

 

 

 

it 1%

Figure 2.8. Dependence graph (depth = 2).

 

15

GD @

(0,0,2)

 

v

® @

Figure 2.9. Dependence graph (depth 2 3).

 

 

A lexicographically positive distance vector, denoted d > 0, has some positive distance

within the distance vector and all distances preceding the positive distance are nonnegative:
d>0::((3izlgign:dg>0) /\ (Vj:1$j<i:dj20)),

where 0 represents the zero vector (0, 0, 0, . . ., 0).

As an example, the distance vector (0, 0, 2, —1) is lexicographically positive because the
third distance (d3) is greater than 0; and the ﬁrst (d1) and second (d2) distances are both
greater than or equal to 0. If all of the distance vectors (f in a distance vector set D are
lexicographically positive, then D is lexicographically positive.

Loops L,- through L, are fully permutable if the distance vector set is in the following

form:

(vizd’e D:((d1,...,d,-_1)> 0‘)v(v1m'g kgjzdk 2 0)). (2.1)

That is, either the distance vectors are lexicographically positive preceding location i, or are
non-negative in the range from d,- to dj. Fully permutable distance vectors are signiﬁcant
because, as the name implies, any permutation of loops in the range L,- through LJ- results in
a legal transformation. Suppose that we have a predicate fully_perm(d., i, j) that evaluates
whether distance vector cf is fully permutable in the range of d,- through dj. Evaluation of
fully_perm(d: i, j) on d=(0,1,-3,2), 1 S i < j S 11 results in the following table:

l6

 

 

 

 

i j fully_perm
1 2 yes
1 3 no
1 4 no
2 3 no
2 4 no
3 4 yes

 

 

 

 

 

 

Typically, the objective of a loop transformation sequence is to maximize parallelism,
and parallelism is determined based upon properties of the distance vectors d in the distance
vector set D [4]. Consider a loop nest of depth n, with lexicographically positive distance
vectors (VJ: d6 D : d.» (I). A loop L,- can be parallelized if either the distance vectors are
lexicographically positive preceding location i, or the distance at location i is 0. That is,

the i“ loop is parallelizable if and only if:
(vi: J'e D : ((d1,...,d,-_1) > 0') v (d,- = 0)).

This result is signiﬁcant because it describes a property for determining available par-
allelism based only upon the form of the distance vectors in D. That is, parallelism is
determined from the abstract representation of the source code in terms of the dependence
vector set.

The criteria for vectorization relies more heavily on the dependence graph G'(V, E) and
is less restrictive than the criteria for parallelization. That is, source code that cannot
be parallelized may be vectorizable, but not vice-versa. The criteria for vectorization is
dependent upon cycles in the dependence graph. Speciﬁcally, loop L,- can be vectorized in
an assignment statement as long as there are no cycles in the dependence graph carried at
depth i.

As an example, loop L1 of statements 31, 5'2, 53, and 34 cannot be vectorized because

all of the statements are in cycles in the depth 1 dependence graph shown in Figure 2.7.

17

Loop L2 can be vectorized for statements 51 and S3 because they are not in a cycle in the
depth 2 dependence graph shown in Figure 2.8. However, statements 32 and 54 cannot be
vectorized. All statements can be vectorized at depth 3, because none are in a dependence

cycle as shown in Figure 2.9.

2.5 Unimodular Transformations

The objectives of the unimodular approach are: 1) to represent loop permutation, reversal,
and skewing as matrices and 2) to transform initial source code into some ﬁnal source
code that is semantically equivalent to the original source code, but exhibits a desirable
property (parallelism, data locality, etc.) [3, 4, 9, 26]. All transformations are represented
as operations between a transformation matrix T and the dependence vectors (f in the
dependence vector set D.

A unimodular matrix T has the following properties, where I is the set of all integers,

and T[i, j] refers to the element of T in the ith row and j‘h column:

1. an
2. (Vi:l$i$n:(Vj:lngn:T[i,j]6I))
3. IdetT|=1

This deﬁnition states that T is square; all components of T are integers; and the absolute
value of the determinant of T is 1.

Three kernel loop transformations represented by unimodular matrices are:

Permutation: interchanging two loops;

Reversal: modifying a loop to increment from the negation of the upper bound to
the negation of the lower bound;
Skewing: adding an integer function of an outer loop to the indices of an inner loop.

The matrices that represent permutation, reversal, and skewing are all simple modiﬁcations
to the identity matrix I. The following are examples of each kernel transformation matrix

applied to the distance vector d: (1, 3, -2, 0) as shown in the source code in Figure 2.10.

18

 

L1 for 11 = M1, 1101

L2 for I: = “2,002

L3 for 13 = lbs, uba

L4 for I4 = “4,1504

S1 A[I1,Iz,13,14]=...

32 L[I1-I,Ig-3,Is+2,l4]=...

/* *D = {(1,3, -2,0)} t */

Figure 2.10. Sample code: before unimodular transformation.

 

The 4 X 4 matrix tJD shown in Figure 2.11(a) is used to represent loop permutation of
loops L1 and L2. Note that the original lower and upper bounds of loops L1 and L2 (Figure
2.10) are swapped by loop interchange, as shown in Figure 2.11(b). Also, the distance vector

set after loop permutation of loops L1 and L2 is tp x J: (3,1, —2,0).

 

L1 for J1 = “2,1152
L2 for J2 = lbl,ubl

0 l 0 0 L3 for J3 = lbs, uba
t I 0 0 0 L4 for J4 = "34, “04
p_ 0 0 l 0 $1 A[J1,12,J3,J4]=...
0 0 0 I 32 A[J1-I,J2-3,J3+2,J4]=...

/s *D ={(3,1, -2, 0)} s s/

Figure 2.11(a). Unimodular matrix to permute Figure 2.11(b). Sample code: after loop permu-
loops L1 and L2. tation of loops L1 and L2.

 

In order to reverse loop L3, the transformation matrix t, is used as shown in Figure
2.12(a), where t, x d: (1, 3, 2, 0). Note the differences between the loop bound expressions
and array subscripts for index variable J3 in Figures 2.10 and 2.12(b).

Finally, the matrix shown in Figure 2.13(a) represents skewing loop L2 with respect
to loop L1 by a factor of 2. The dependence vector is multiplied by t,, where t, x d =
(1 , 5, —2, 0).

19

 

L1 for J1 = lb1, 001
L2 for J2 = "’2, “b2

1 0 0 0 L3 for J3 = —-ub3, -l03
_ 0 1 0 0 L4 for J4 = (54,1404
1,- — 0 0 -1 0 $1 A[J1,J2,—13,J4]=...
0 0 0 1 $2 A[J1-1,J2—3,—J3+2,J4]=,,.

/¢ tD = {(1,3,2,0)} s s/

Figure 2.12(a). Unimodular matrix to reverse Figure 2.12(b). Sample code: after loop reversal
loop L3. of loop L3.

 

 

L1 for J1 =lb1,ubl
L2 for J2 = lb2 + 211,1102 + 2.11

l. 0 0 0 L3 for 13 = lbs, 003
t _ 2 l 0 0 L4 for J4 = “4,1104
._ 0 0 I 0 S1 A[J1,J2-2J1,J3,J4]=...
0 0 0 1 $2 A[J1-l,Jz-3—2J1,Ja+2,.]4]=...

/:u I'D = [(1, 5, --2,0)} t s/

Figure 2.13(a). Unimodular matrix to skew loop Figure 2.13(b). Sample code: after loop skewing
L2 by 2L1. of loop L2 by 2L1.

 

In this dissertation, a superscript is sometimes used on the distance vector set D and
the transformation matrix T for clarity purposes. A superscript on the distance vector set
D has the following interpretation:

0 Dan; the initial distance vector set for the source code before the transformation;

0 Dlx'll: the transformed distance vector set before the x“ transformation;

0 D”): the transformed distance vector set after the x“ transformation;

0 Dm): the ﬁnal, transformed distance vector set.

Similarly, superscripts on the transformation matrix T have the following interpretation:
0 Tlo): the identity matrix;

0 T("'1): the transformation matrix before the x“ transformation;

0 Tm: the transformation matrix after the x“ transformation;

0 Tm): the ﬁnal transformation matrix;

0 to”): the elementary transformation matrix representing the 2:“ elementary operation.

20

The relation between D“) and TU) is as follows:

0 Tl”) = t(”) x T(""'1): the matrix representing the ﬁrst x transformations is the product
of the x"‘ elementary transformation matrix and the matrix representing the ﬁrst x -1
transformations;

0 DH") 2 Tl”) X Dlo): the transformed distance vector set after x transformations is the
product of the transformation matrix representing the ﬁrst x transformations and the
initial distance vector set;

0 D(”) = t”) x D(’"1): the transformed distance vector set after x transformations is
the product of the x“ elementary transformation matrix and the distance vector set
after x — 1 transformations.

Unimodular matrices have a composition property; that is, sequences of unimodular
transformations are also unimodular. As an example, a sequence of loop reversal, permu-
tation, and skewing using the matrices shown in Figures 2.11(a), 2.12(a), and 2.13(a) may

be composed into one transformation matrix T(3) = t(3) x tm X t“):

2 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0
TB) = 1 0 0 0 = 2 1 0 0 x 1 0 0 0 X 0 1 0 0
0 0 —1 0 0 0 l 0 0 0 1 0 0 0 —1 0
0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1

The distance vector set can be transformed in one of two ways: after each kernel trans-
formation (D(’) = tlf) x db“) 6 D("1)); or after the ﬁnal Tm is found thatrepresents
the entire transformation sequence (D(‘”) = T“) X (1(0) 6 D(°)). Continuing with the earlier

example, D(3) = T(3) x dlo) 6 BM:

2100 1 5
10 00>< 3_1
00—10 —2 2
0001 0 0

The composition property is signiﬁcant because a sequence of transformations is represented
by a single matrix. Therefore, mapping the distance vector set D and source code need

occur only once.

21

2.6 Global Pre- and Postconditions

Initially, all dependence vectors JIO) in the dependence vector set 0(0) are lexicographically
positive. Formally, this property is described as (leo) : 4'10) 6 0(0) : dlo) >- 0). This
condition is referred to as the global precondition, because it is true for all distance vectors
dlo) before any transformations are applied.

The global postcondition must be true after all unimodular transformations have been
applied, and asserts that all transformed dependence vectors (din) 6 0(9)) must be lexico-
graphically positive. This condition is formally described as (Vdm) : J19) 6 0(9) : dln) > 0),
where 0(9) = Tm) x (1'10) 6 Dm). Requiring lexicographically positive distance vectors af-
ter loop transformation ensures that the source code after transformation is semantically
equivalent to the code before transformation.

The global pre— and postconditions are signiﬁcant because they govern the legality of a
sequence of unimodular transformations. That is, all (110) E D(°) are initially lexicographi-
cally positive. Any sequence of transformations, represented by Tm), can be legally applied
to the source code, as long as all transformed distance vectors (Tm) x J10) E D(°)) are

lexicographically positive.

2.7 Transforming Source Code

Wolf and Lam [4] describe an approach for mapping source code from initial to ﬁnal form,
given a transformation matrix T. The approach is divided into two portions: mapping as-
signment statements and mapping loop bounds. Mapping loop bounds is further subdivided
into a four-step algorithm. Both assignment statement mapping and loop bound mapping
are described below.

Mapping assignment statements requires the inverse of the transformation matrix (T‘1 ),
which is used to determine the appropriate linear combination of transformed loop indices
to replace the initial loop indices. In matrix form, the following multiplication is used to

determine new assignment statements, where I,- are initial loop indices and J J- are new loop

22

indices:

=T-1X

 

 

 

 

It is also possible to determine new loop indices in terms of the old loop indices in matrix

form: F q _ q
[1 J1

 

 

 

 

In Jn

L d h d

and then solve for each I,- using algebraic substitution. As described later, T representing the
extended transformations may be non-square, and T"1 is undeﬁned for non-square matrices.
Therefore, algebraic substitution, rather than direct matrix-vector multiplication, is used
to solve for I,- in terms of J,. As an example, the assignment statements from the source
code shown in Figure 2.10 are mapped as shown in Figure 2.14 using the transformation
matrix T above. The boxed expressions represent the algebraic solution of the initial index
variables I,- in terms of the transformed index variables J,. The ﬁnal step is mapping the

transformed assignment statements from the initial assignment statements.

 

 

 

 

2 1 0 0 11 J:

1 0 0 0 x 12 _ J:

0 0 -l 0 13 _ J3

0 0 0 1 14 J4
11:211-1-12 [11:12] Js=-ls [14=J4[
I2 = J1 - 211 13 = -J3
[I2 = J1 — 212|

A[11.lz,13,14]=> III-72.11 - 212, --73. J4]

A[I1 — 1,12 - 3,13 + 2,14] 2 A[J2 — 1, J1 - 2J2 - 3,-J3 + 2, J4]

Figure 2.14. Transformation of assignment statements.

 

23

Mapping the loop bounds requires a four-step algorithm that is shown in Figure 2.15.
The ﬁrst and second steps are to extract inequalities, minima, and maxima from the initial
source code. Step 3 is transforming the indices, given the transformation matrix T. Also,
the inequalities, minima, and maxima are transformed based upon the new loop indices.
Finally, the new loop bounds are calculated based upon the inequalities, minima, and

maxima determined in Step 3.

 

Algorithm map_bounds
Maps the loop bounds from the initial index variables (11, . . . ,In) to the transformed index variables

(J],...,Jn)o

Input:

I1, . . . , In /**the initial index variables**/

B /*"'the loop bound set for the initial index variables**/

T /**the transformation matrix*"'/

Output:

J1, . . . , J" /**the transformed index variables**/

8' /*"'the loop bound set for the transformed index variables**/
Procedure:

1. Extract inequalities.
/**Inequalities are derived from the upper and lower bounds of each loop index variable.**/

2. Find the absolute minimum and maximum for each initial index variable.
/**Minima and maxima for each index variable are based upon the inequalities from Step 1."/

3. Thansform the indices using T.
I“ The transformation matrix T is used to map the initial index variables into transformed
index variables. Also, the transformed index variables are substituted into the
inequalities, maxima, and minima calculated in Steps 1 and 2.*“/

4. Calculate the new loop bounds.

/"“'I The new loop bounds for the ﬁnal index variables are determined based upon the
inequalities and minima and maxima from Step 3.**/

Figure 2.15. Algorithm to map the loop bounds.

 

The loop bounds for the source code shown in Figure 2.10 are mapped as shown in

Figure 2.16, assuming that the lower bounds for all loops is one (Vi : 1 5 i 5 4 : lb,- = 1).

24

The boxed expressions represent the algebraic solutions to the inequalities in terms of the

transformed index variables. The transformed inequalities are used to determine the loop

bounds of the transformed index variables (J1, . . . , J4).

 

Step 1. Extract inequalities
1121 1221 1321 1421
11 S “111 12 S 1162 13 S ab; 14 S 054

 

Step 2. Find maxima and minima
min(Il) = l min(I2) = 1 min(Ia) = l
max(I1) = “01 max(I2) = 002 max(13) = 1103

 

Step 3. Transform the indices
Using results from assignment statement mapping.

 

11 = 12 12 = 11 - 212 13 = --13
.1221 J1—2.1221 -J321
.12 S 1351 J1 -2.12 S ub2 -.13 S 053

min(J2) =1 min(J1 — 2j2) =1 min(—J3) = 1
max(J2) = 1161 max(J1 — 212) = M2 max(—-Ja) = abs

Step 4. Calculate new loop bounds

 

 

 

.11 2 1 + 2min(J2)

J1 - 2.12 S ub2
11 S “be + 212
J; 3 ub2 + 2max(J2)

[11 Subs-+2851]

 

 

Final code
L; for J; = 3, ub2 + 2ub1
L2 for J2 = 1,1161
L3 for .13 = —ub3, -1
L4 for .14 = 1,1104
51 A[.12,J1 -2J2,-.13,.14] =
S2 ‘[.12-1,.11—2J2—3,—J3+2,.14]=...

Figure 2.16. Transformation of loop bounds.

min(r.) = 1
max(I4) = 1104

LI = J‘

J. 2 1

.14 S 2104
min(J.) = 1
mIX(.14) = ub4

.1221 .11-2J221 -.1321 .1421
Jagub. J121+212 1.3.10.

 

25

The complexity of the source code transformation is dependent upon the loop nest depth
n, the number of assignment statements m, and the number of inequalities derived from
the source code q.

Mapping assignment statements has the following complexity:

0 matrix multiply 0(n2).

0 search and replace occurrences 0(n(n + m)) = 0(n2 + nm). Assumes that there is an
index variable for each loop.

0 Total 0(n2 + nm) = 0(n2).

Mapping the loop bounds has the following complexity:
Step 1. Extract inequalities 0(a).
Step 2. Find minima and maxima 0(n).

Step 3. Transform loop bound indices 0(n2 + q). Matrix multiply and scan through
inequalities.

Step 4. Calculate new loop bounds 0(nq). For each index variable, search through set of
inequalities.

Total 0(nq). There are always at least two inequalities from each loop.

Note that the code is always transformed using this method. Thus, the average and
worst case complexity are equivalent. Also, the code is transformed only once, after the

ﬁnal transformation code T is found.

CHAPTER 3

Kernel Transformations and Loop

Normalization

The chapter brieﬂy describes how the kernel set of loop transformations are applied to
imperfect loop nests. This chapter also describes loop normalization, a transformation that

modiﬁes a loop L,-, such that the lower bound lb,-, and step size 3;, are both one.

3.1 Kernel Transformations

The kernel set of transformations includes loop reversal, permutation, and skewing. The
legality and applicability of the kernel set is well understood in the context of perfectly
nested loops. This section discusses the legality of the kernel set in terms of [imperfect-
only-child loop nests. Transformation matrices cannot be applied to imperfect-sibling loop
nests; however, many imperfect-sibling loop nests can be transformed into imperfect-only-
child loop nests by loop ﬁssion (Section 5.1).

For simplicity, a doubly-nested loop is used in the following discussion. However, the
concepts can be easily generalized to an n-nested loop, where n > 2. The generalform of an
imperfect-only-child loop nest is shown in Figure 3.1, where f 0, g(), and h() are functions
of the enclosing index variables.

In the following discussion, the predicate normal(B(‘”), i) is true if loop L,- is normal in

loop bound set BM. A loop is normal if both the lower bound and step size are one.

26

27

 

L; for I; =lb1 , ub; , s;
S. Amr[f(I.)] = ...
L2 for 12 = 162,052,.”
52 ARMY[g(11, 12)] = . . .
endfor
$3 ARRAY[h(11)] =
endfor

Figure 3.1. Sample code: general form of an imperfectly nested loop.

 

3.1.1 Loop Reversal

Loop reversal of loop L,- results in a modiﬁed loop and all references to the index variable I,-
being negated. Speciﬁcally, if the initial loop is incremented from 1 to ub;, then the reversed
loop increments from —ub,- to —1. Loop reversal is applied to a single loop. Reversal of the
outer loop (L1) of the source code shown in Figure 3.1 results in the source code shown in

Figure 3.2(a), and reversal of the inner loop (L2) results in the source code shown in Figure

 

3.2(b).
L1 for 11 = —ub1 , -lb1 , 81 L1 for 11 = “1, ub1 , 81
s. mun—1.)] = ... S. ARRAY[f(Ii)] = ...
L2 for 12 = lb2, ub2, s2 L2 for I2 = -ub2, -lb2, s2
52 ARMYLﬂ-h, 12)] = . . . S2 ARRAYE(11, -12)] =-" . . .
endtor endfor
$3 ARRAY[h(-Il)] = . . . $3 ARRAY[h(I;)] = . . .
endfor endfor
Figure 3.2(a). Sample code: reversal of loop L1. Figure 3.2(b). Sample code: reversal of loop L2.

 

As shown in Figures 3.2(a) and 3.2(b), the assignment statements 51 , S2, and S3 have the
same number of iterations before and after loop reversal. The loop bounds and correspond-
ing references to the index variables are reversed; however, the source code is semantically
equivalent (assuming that the global postcondition is true).

Loop reversal is always legal for imperfect-only-child loops. That is, loop reversal can

be applied to source code regardless of whether or not it is perfectly nested. All kernel

28

unimodular transformations require that the step size be one prior to transformation. The

invariant below corresponds to this requirement.

Preconditions

Invariant

 

 

 

normal(B(’-1), i)

 

Explicit

Nesting

3.1.2 Loop Permutation

Loop permutation of loops L.- and L,- results in the interchange of the lower bounds, upper
bounds, and step sizes for the two loops. Figure 3.3 shows the permuted source code from
Figure 3.1. The arrays in SI and S3 are functions of the index variable of loop L1. After
loop interchange, loop L1 and index variable 11, are inside of statements 31 and 5'3, and
the functions f(Il) and h(Il) have no context. That is, an assignment statement that uses
an index variable to calculate a subscript expression must be contained by the loop that

increments the index variable.

 

L2 for 12 = lb2,ub2,s2
SI ARRAY[f(11)] = . . .
L1 for 11 = 1131,1151,”
S2 ARMYLg(11, 12)] = . . .
endfor
S3 AmY[h(11)] = . . .
endfor

Figure 3.3. Sample code: illegal permutation of loops L1 and L2.

 

Loop permutation may result in loop L,- following the assignment statement that ref-

erences it. Therefore, if two loops L,- and L; are permuted, then all parent-child pairs of

29

loops in between loops L; and L,- must be perfectly nested. Again, the invariant for loop
permutation is that both loops L; and L,- are normal. Finally, we assume that loop L; is

initially outermost when loops L; and L,- are interchanged.

Preconditions

Invariant

 

 

normal(B("1),i) A normal(B("'1),j)

 

 

Explicit
i < j
Nesting
[(Vk : i S k < j : perfect(Lk,Lk+1))l

 

 

where perfect(L;, L;+1) is true if loops L; and L;+1 are perfectly nested in each other.

3.1.3 Loop Skewing

Loop skewing results in the modiﬁcation of the loop bounds and references to an index
variable I ,- with respect to an outer index variable I;. The upper bound and lower bound of
loop L j are increased by a linear function m of index variable 1;, and all references to index
variable I; inside of the skewed loop are decreased by m x I;. As an example, skewing I2

by 111 times I1 results in the source code shown in Figure 3.4.

 

L1 for 11 = 101,051,.”
31 ARRAY[f(Il)] =
L2 forI2 =lb2+mI1,ub2+m11,s2
S2 ARMYLQ(11,12 -m11)] =...
endfor
S3 ARRAY[h(I1)] = . . .
endtor

Figure 3.4. Sample code: skew of loop L2 by m x L1.

 

Perhaps surprisingly, loop skewing is like loop reversal in that it can be applied to a

target source code regardless of the nesting structure of the code. Loop permutation results

30

in two loops being modiﬁed. Loop skewing is like loop reversal because only one loop is
modiﬁed by a function of an enclosing loop. Therefore, the nesting precondition is true. We

assume that loop L; is outermost and the invariant is that both loops are normal.

Preconditions

Invariant

 

 

 

normal(B(“'1), i) A normal(B(”_l),j)

 

Explicit
i < j
Nesting

3.2 Loop Normalization

We have developed a matrix-based representation of the loop normalization transformation
[13, 14, 15]. Loop normalization is a transformation that changes the bounds of a loop, such
that the lower bound and step size are both one. The transformation is important because
normalized loops are a precondition for many kernel and extended loop transformations.
For example, the step size of a loop must equal one before any kernel transformations are
applied, and fully normalized loops make loop blocking and collapsing easier to implement.
Also, loop normalization may decrease the cardinality of the distance vector set D, resulting
in greater opportunities for parallelism.

An example of code that is not normalized is shown in Figure 3.5(a), where neither
the lower bounds nor the step sizes are equal to one for loops L1 and L2. In general, loop
normalization of loop L; results in ub; being replaced with [(ub; — lb; + s;)/s;] and all
occurrences of index variable I; being replaced by lb; + (J; — 1)s;. In Figure 3.5(b), loop
normalization is applied to the outermost loop (L1), and in Figure 3.5(c), loop normalization
is applied to the innermost loop (L2).

In order to normalize the outer loop in Figure 3.5(a) the upper bound of loop L1 is

replaced with [(1th1 — lb;1 + s], )/s 1,]. As shown in Figure 3.5(b), [1 is renamed J1 and the

31

upper bound of .11 is calculated as:

11.11.]1 = [(1th1 -11111 +81,)/811_| = [(13 — 2+ 3)/3] = [_14/3] = 4.

All occurrences of I1 are replaced with lb;1 + (J1 — 1)s;1:

11 =>lb11 + (.11 --1)811 = 2+ (.11 — 1)3 = 3.11 - 1.

Note that occurrences of 11 in the loop bounds of loop L2, as well as in assignment statement
31, are replaced with 3.11 — 1.
Similarly, the innermost loop (L2) is normalized as shown in Figure 3.5(c), where the

upper bound is calculated as:

...,, = [(ubz. — 1b.. + era/sh = [(10 + 61. — (3.1. - 1) + 6)/61 = w. + 10/61,

and occurrences of [2 in the source code are replaced with:

12 =>lb12 + (.12 — 1)812 = 3.11 — 1+ (.12 - 1)6 = 3.11 + 6.12 — 7.

 

L1 for 11 = 2,13,3 L1 for .11 = 1,4 L1 for .11 = 1,4
L2 for 12 = 11,12+ 211,6 L2 for 12 = 3.11 - 1,10 + 6.11,6 L2 for .12 = 1, [(3.11 + 17)/6J
S1 A[11,12] = .. . 31 “3.11 - 1,12] = .. . S1 A[3.11 - 1,

3.11 +6.12 -7]=

Figure 3.5(a). Sample code: be- Figure 3.5(b). Sample code: Figure 3.5(c). Sample code:
fore loop normalization. loop normalization of loop L1. loop normalization of loop L2.

 

3.2.1 Legality Criteria

Loop normalization is always legal, although it may not be necessary. That is, loop L;

may already be normal (lb; = 1 and s; = 1) and, therefore, loop normalization does not

32

modify the source code. However, if the application of a kernel or extended transformation
results in unnormalized code, then loop normalization typically follows in the transformation
sequence.

The predicate normal(B("'),i) is true if loop L; is normal in loop bound set B0”). The

explicit precondition is that the loop is not normal before loop normalization is applied.

Preconditions

Invariant

Explicit

 

 

 

2(normal(B("1), i))

 

Nesting

3.2.2 Effect on the Dependence Vector Set D

If loop L; is not normal, then it is in one of three forms. Either the step size of loop L; is
not one, the lower bound of loop L; is not one, or neither the step size nor the lower bound
of loop L; is one. Figure 3.6 shows each of these three cases: loop L1 ﬁts case 1 because
the step size does not equal one (3; 7t 1 A lb; = 1); loop L2 ﬁts case 2 because the lower
bound does not equal one (s; = 1 A lb; 75 1); and loop L3 ﬁts case 3 because neither the step
size nor the lower bound is one (s; 75 1 A lb; 75 1). Each case is more thoroughly described

below.

Case 1. s; 75 1 A lb; = 1. Loop normalization results in modiﬁcation to all
distance vectors in the distance vector set. Speciﬁcally, the distances at loca-
tion i (d;) are divided by the current step size 3;. Intuitively, the modiﬁcation
to the distances is logical since the distances between executions of assignment

statements are reduced by :-.. Formally, if loop normalization is the x“ trans-

33

 

L1 for 11 =1,m,2

L2 for I2 = 311m

L3 for 13 = 2I2,p,5

S1 A[11,12,13] =

S2 A[11-4,12+5,13-10]=...

[at *D = {(4, —5,10)} s s/

Figure 3.6. Sample code: different cases of unnormal loops.

 

formation, then the modiﬁcation to the distance vector set D is described as:

d(x—1)

3i

(lid-(Pl) : db.” 6 D(”1) :dgx) = ).
If any distances in d1”) are non-integral, then the distance vector is removed

from the distance vector set.

The effect on D is represented by the transformation matrix T. Suppose that

T(“"1) is the transformation matrix before loop normalization and t7, is the

elementary matrix representing normalization. Then Tl”) = in X T‘s-l) becomes

the new transformation matrix, where tn is the identity matrix except that
1

tn[i,i] = 7."

As an example, the outermost loop L1 in Figure 3.6 is not normal. In order to
normalize L1, t,,[1, 1] must equal 711" or %. Therefore, the tn that normalizes L1

is:

ch
Hoc
II
OONI"
OHO
HOG

l
tn: 0
0

and the modiﬁed distance vector set is tn X (16 D:

34

300 4 2
010x—5=—5
001 10 10

Case 2. s; = 1Alb; 76 1. Again, loop normalization of L; results in modiﬁcation
to all if. The assumption is made that lb; is constant, or it is a linear function
of an outer loop. Let a;,,- be the integer coefﬁcient of the index variable I ,- with
respect to the lower bound of loop L;. As an example, a2] = 3 for loop L2 of
the source code shown in Figure 3.6. In order to normalize loop L2, the lower
bound of loop L2 (lb2) must be modiﬁed to equal one. Therefore, a skew is
required, and the skew factor is equal to the negation of the coefﬁcient a;,,-. In
the case of loop L2 in Figure 3.6, the skew factor is —a2,1 or —3. Also, since lb2
(i = 2) is skewed with respect to index variable II ( j = 1), the location of the

skew is tn[i,j] = tn[2, 1]:

1 0 0 1 O 0
in = —a2,1 1 0 = —3 1 0
0 0 1 0 0 1

Ifloop L; ﬁts Case 2, then the form of the lower bound of loop L; is lb; = a;,,- XI ,.
That is, the lower bound is a. linear function of the index variable of an outer loop.
The distances at location i (Ax-1)) are modiﬁed by subtracting the coefﬁcient
a;,,- times the distance at j MES-1)), as in loop skewing. Formally, this condition

is described as:
(vile-1) :d-l"1) e 0““) : .1?) = .15“) — (...,,- x age-1)».

Using the distance vector from the source code in Figure 3.6, d") = tn X 1113‘”:

1 0 0 4 4
-3 1 0 X -5 = —17
0 0 l 10 10

35

Case 3. 3; ;£ 1 A lb; 9E 1. In Case 3, neither the step size nor the lower bound
of loop L; equals one, and, not surprisingly, the modiﬁcations to D reﬂect both
normalization of the step size to equal one (Case 1) and the lower bound to
equal one (Case 2). First, we consider the lower bound lb;. As was shown in
Case 2, lb; = a;,,- X I j, where j < i. The matrix that normalizes the lower bound
is tn such that tn[i, j] = —a; '1‘. In Case 1, the step size of loop L; was not one
(s; 76 1), and the matrix to normalize the loop was tn such that tn[i,i] = :7. If

h

both Cases 1 and 2 are true, then t,,[i, j] is —a;,,-, and then the entire it row is

multiplied by 3:.

As an example, loop L3 in Figure 3.6 has a lower bound that is a linear function

of I2 and the step size does not equal one. That is, lb3 = a3,j X I,- = 2 X 12 (i =

31.1 = 21 03,2 = 2) and 3; = 5. Therefore, tn[i,j] = “[3,2] = —2 and the 5th(3rd)
1 1

row is multiplied by :—. = —- = g.

83

The elementary transformation matrix tn is:

1 0 0 1 0 0

in = 0 1 0 = 0 1 O
a_.. _1-

0 - :32 s3 0 —% 5

Without loss of generality, the form of tn for normalization of all loops in a 3-nested

loop is:

 

 

 

.1.
,1 0 0
tn : _ 221i .1. 0
32 32
03,1 _ 63,2 __1__
.. 83 83 83 .

 

 

 

where a;,,- is the appropriate coefﬁcient and s; is the step size for loop L;. The matrix tn

can be easily generalized to n > 3, where n is the loop nest depth.

36

Another advantage of loop normalization is that it discriminates between many actual
dependences and “apparent” dependences. Dependence tests make conservative assump-
tions when calculating dependences for source code. That is, if calculating the distances is
too complex, then a dependence test assumes that a dependence exists. Also, more depen-
dence vectors in D lead directly to more restrictions on the amount of attainable parallelism.
Therefore, it is desirable that the distance vector set D has minimal cardinality and accu-
rate distance vectors. Loop normalization allows the elimination of distance vectors that
are erroneously in D as a result of a conservative assumption by a dependence test.

As an example, the source code in Figure 3.7 has an apparent distance vector set of
D = {(8,2,5)}. The transformation matrix that represents loop normalization of loops

L1,L2, and L3 is:

 

 

 

 

 

 

200 8 1
%§0><2=—7
_00-3, 5 3

After normalization, one of the distances in the distance vector set is non-integral (g). Since
all distances in a distance vector must be integral, the distance vector (1: {(8, 2, 5)} is elim-
inated from the distance vector set, removing potential restrictions on available parallelism

in the source code.

 

L1 fori=3,m,4

L2 forj=2+2i,n+2i,2
L3 fork = -p,-1,3

S1 A[i,j,k] =

S2 A[i-8,j-2,k—5]=...

/¢ 11D = {(8,2,5)}* 41/

Figure 3.7. Sample code: false dependence.

37

In summary, an elementary matrix tn is found that represents the effect of loop normal-
ization of the distance vector set. The matrix tn is used to represent normalization of the
step size, lower bound, or both, of any loop. The matrix tn composes with other kernel and
extended transformations. Finally, loop normalization helps to eliminate “apparent” dis-
tance vectors in the distance vector set, thus increasing the amount of potential parallelism

in the transformed source code.

3.2.3 Effect on Source Code Mapping

The general form of a doubly-nested loop before loop normalization is shown in Figure

3.8(a). Normalization of the outer loop L1 results in a new upper bound:
11le = [(ub1 — lbl + sl)/31] 2 [(d1 - c1 + sl)/31],
and any occurrences of 11 are replaced with:
lb; + (J1 -1)sl : c1 + J131 - 31. (3.1)
Similarly, normalization of the inner loop results in the upper bound of loop L2 becoming:
11ng = [(ub2-lb2+s2)/s2] = [((bz,1(61+~1181-81)+d2)-(02,1(61+1131-81)+02)+82)/32J
and any occurrences of I2 are replaced with:
lb2 + (J2 — 1)s2 = (a2,1(c1 + .1131 — 31) + c2) + .1232 — .32. I (3.2)

Source code with both L1 and L2 normalized is shown in Figure 3.8(b).
The original assignment statement S 1 was A[I1 , I2] in Figure 3.8(a), but after substitution

using Equations (3.1) and (3.2) becomes:

Alcl + 1131 — 31,02,1(61 + «1181 - 31) + 62 + 1282 - 82]- (3-3)

38

 

L1 for 11 = c1,d1,s1
L2 for 12 = 02,111 + 62,52,111 + 42.82
51 A[11,12]=

L1 for J; = 1, [(d1 - c1 + s1)/s1] .

L2 for .12 = 1, [((b2,1(C1 + J181 - 31) + dz)
—(a2,1(C1 + J1 81 - a1) + c2) + 821/821

51 “61 + 1181 - 81.

62,1(61 +J181 - 81)+62 +1282 - 82] =

Figure 3.8(a). General form of unnormalized Figure 3.8(b). General form of normalized

double-nested loop. double-nested loop.

Therefore, after loop normalization of loops L1 and L2, we know that:

11 = Cl + 1131 — 31,

and

12 = 02,1(01 + J131 - 31) + 62 + «1282 - s2.

Manipulating Equation (3.4), and placing it in terms of J1, we obtain:

11 Cl 81
J1 = — — — _.
31 81 81

Similarly, putting Equation (3.5) in terms of .12, we obtain:

(3.4)

(3.5)

(3.6)

(3.7)

Recall from Section 2.7, that transforming source code involves the matrix representa-

tion of the new index variables in terms of the initial index variables. In matrix form, the

representation is [J;] = [T] X [I;], where J; are the new index variables, T is the transfor-

mation matrix, and I; are the initial index variables.

The transformation matrix T after loop normalization of loops L1 and L2 in Figure

3.8(a) is in the form:
1- 0

1
__ 02,1 1

32 32

(3.8)

39

Putting Equations (3.6) and (3.7) in matrix form results in:

1 _.
.11 = g; 0 X 11 _ 'L—H ”8 (39)
J2 _SEA 1 [2 _2._2.c -.

82 82 ‘2

W
additional terms

The form of the matrix representation of Equations (3.6) and (3.7) is consistent with
the kernel unimodular transformations ([J;] = [T][I;]). However, an additional term is
subtracted from each new index variable which represents the effect of normalization. The

additional terms from Equation (3.9) are rewritten in terms of the normalization matrix tn:

-1—-’-° " -1— 0 c —

1 31
’1 = ’1 X (3.10)
_2_zC-0 -22.; .1.

,2 ,2 ,2 02,1(01 - 31) + 62 - 32

Substituting Equation (3.10) into Equation (3.9) results in the matrix representation:

1 1
.11 = :1- 0 X 11 _ E 0 X C1 — 31 (3.11)
12 -93‘l l 12 -93‘-1' L 02,1(61 - 31) + 62 - 82

52 92 52 32

The remainder of the source code mapping process from initial to ﬁnal form, given a
transformation matrix T, is the same after loop normalization as it was after only kernel
transformations. That is, the four-step approach described in Section 2.7 is directly applied
for source code mapping.

In summary, the general form of the index variable mapping of a normalized loop is:

 

.11 11 V1

J I V
,2 =TX 2 -TX 2

 

 

 

 

 

 

 

 

 

40

where each J; is a normalized index variable, each I; is an initial index variable, T is the
transformation matrix, and each V; is a normalization expression that is a function of the

coefﬁcients and step sizes of the initial loops.

CHAPTER 4

Loop Blocking and Loop

Collapsing

Loop blocking is the generic name for a family of transformations that increase the depth
of a loop nest, and includes tiling, strip mining, and cycle shrinking. Loop collapsing is the
inverse of loop blocking and decreases the depth of a loop nest. Repeated application of loop
collapsing is called loop coalescing. This chapter describes the matrix-based representations

of both loop blocking and loop collapsing [12, 23].

4.1 Loop Blocking

Loop blocking is the name given to a family of loop transformations that increase the nest
depth of a nested loop. The objective of increasing the nest depth is to divide the initial
iteration space into several iteration subspaces that are distributed and executed in parallel.
Strip mining [4, 16, 17, 18, 19], cycle shrinking, and tiling [4, 19] are all generalized as
one transformation in the extended matrix-based model presented here. Loop blocking
represents each of these individual transformations, dependent upon the quantity of loops
that are blocked and which loops are parallelized.

Strip mining is a technique that transforms an n-nested loop into an n + l-nested loop.
The n‘h loop is divided into equal size partitions (sn) and is executed in parallel. Strip

mining is referred to as a memory optimizing transformation [17] because 3,, is dependent

41

42

upon some feature of the machine architecture. An advantage of strip mining is that ﬁne-
grained parallelism is obtained by parallelizing the nth loop.

Cycle shrinking is similar to strip mining, since an n-nested loop is transformed into
an n + 1-nested loop. However, in cycle shrinking, 3,, is dependent upon the source code,
rather than the machine. Also, the n+1“ loop, rather than the nth loop, is parallelized. An
advantage of cycle shrinking is that each iteration within the n + 1“ loop may be executed
in parallel, as long as the initial dependence distances of the a“ loop are greater than one.
Therefore, a ﬁne-grain intra-block parallelism is obtained.

Finally, tiling is a transformation that divides the initial iteration space into some ﬁxed
number of subspaces, or tiles. Each original loop may be partitioned based upon an ar-
chitectural characteristic (as in strip mining) or a source code characteristic (as in cycle
shrinking). Using tiling, an n-nested loop is transformed into a loop with a nest depth
between n + 1 and 2n. As an example, the 3-nested loop in Figure 4.1(a) becomes the
6-nested loop given in Figure 4.1(b) if each loop is tiled. An advantage of tiling is that the
transformed source code can be evaluated for parallelism within tiles (intra—tile parallelism)
and between tiles (inter-tile parallelism). Thus, tiling can be used to obtain both ﬁne-grain

and coarse-grain parallelism.

 

L; for I; = l,n L; for I; = 1, ms;
L2 for 12 = l,rn L; for 15: l, m,s2
L3 for 13 = 1,p L3 for I; = 1,p, s3 '
S1 A[11,12,13]= L1 for 11 =1[,min(1{ +81 -1,n)
S2 B[11 —2,12 — 1,13 -1] = L2 101712 =I§,min(12+s2 —1,m)
L3 for 13 = I§,min(I§ + s; — 1,p)
S1 A[11,12,13]=
S2 B[11-2,12-1,13—1]=...
Figure 4.1(a). Sample code: before transforma- Figure 4.1(b). Sample code: after loop blockingall

tion. loops.

 

43

Traditionally, loop blocking and tiling are considered synonymous, and strip mining and
cycle shrinking are considered subclasses. That is, loop blocking typically refers to tiling
a range of loops (L; through Lj) within a loop nest and strip mining and cycle shrinking
refer to transforming a single loop (L;). In this dissertation, blocking is applied to a single
loop as in strip mining. Tiling is equivalent to a combination of blocking and permutation
using this deﬁnition of blocking.

As an example, to tile the source code shown in Figure 4.1(a) we would invoke the
function tile(L;, L,,b[i, . . ., j]). Loops L; through L,- are tiled with blocking factors of 6;
through b;, respectively. The resulting source code is shown in Figure 4.1(b).

In contrast, obtaining the source code shown in Figure 4.1(b) is a six-step process using

the primitive block and permute:

H

. block(L1,3l);
2. block(L2,82);
3. block(L3.83);
4. permute(L1,I/2);
5. permute(L1,L3);
6. permute(L1,L2)-

First, the outer loop L1 is blocked with blocking factor 31 (Figure 4.2(a)). Next, loops L2
and L3 are blocked with blocking factors 32 and s3, respectively (Figure 4.2(b)). As is shown
in Figure 4.2(b), blocking of all three loops results in the blocking loop L:- directly enclosing
the blocked loop L;. Finally, permutation is used to move all of the blocking loops outermost
and all of the blocked loops innermost. Figure 4.2(c) shows the transformed source code
just prior to the ﬁnal permutation.

Although redeﬁning blocking seems like a complication, it is actually a simpliﬁcation.
The composed transformation tiling is subdivided into its constituent elementary transfor-
mations. Rather than discussing the legality of a composite transformation, the legality of

the constituent transformations are further explored and more easily understood.

44

 

L; for I; = l,n, s; L; for I; = l,n, s1

L1 for 11 = I{,I:ln(n,1; + s; - 1) L; for 11 = 1;,Iin(n,1{ + s1 - 1)

L2 for 12 = 1,711 L; for L1,: l,m,s2

L3 for 13 = 1,p L2 for 12 = 15,-1n(m,15 + s2 - 1)

S1 A[11,12,13] = . .. L; for 1:; = 1,p,83

S2 B[11 — 2,12 - 1,13 - l] = L3 for 13 = 15,-1n(p,15 + s3 - 1)
S1 A[11,12,13] =
S2 B[11-2,12—1,13-l]=...

Figure 4.2(a). Sample code: after blocking loop Figure 4.2(b). Sample code: after blocking

 

 

L1. loops L1 through L3.
L; for I; = l,n, s;
L for 1;: l,m,s2

L: for I; = 1,p, s3
L2 for 12 = 15,-111(m,1£ + s2 — 1)

L; for 11 = 1;,Iin(n,1[ + s; - 1)
L3 for 13 = 15,-111(1),]; + s3 - 1)
$1 A[I1,I2,13] =

S2 B[11-2,12-1,13-l]=...

Figure 4.2(c). Sample code: before permuting loops L2 and L1.

 

4.1.1 Legality Criteria

Historically, tiling loops L; through L ,- was legal if they were fully permutable [24]. How-
ever, tiling involves blocking single loops and then permuting the blocking loops to the
outermost position and the blocked loops to the innermost position. The precondition of
full permutability is necessary to allow the second stage (permutation) of the loops.
However, in this dissertation the transformations of loop blocking and loop permutation
are separated. The precondition of full permutability is only relevant in the context of
the kernel transformation of loop permutation. As an example, the source code in Figure
4.2(b) shows the blocking of each loop L1 through L3, without any permutation. Loop

blocking, as an elemental transformation, is always legal as long as the bounds are normal

45

prior to the transformation. It is assumed that loop normalization follows loop blocking in

the transformation sequence and returns the source code to normal form.

Preconditions

Invariant

 

 

 

normal(B(”‘1), i)

 

Explicit

Nesting

4.1.2 Effect on the Dependence Vector Set D

Given that the loop nest depth is n, and loop blocking is the x“ transformation in the
sequence, then all distance vectors prior to loop blocking have a length n (lex’11 : db-” 6
Def-1) : lift-1)] = n). Blocking a loop increases the nest depth by one and also increases
the length of each distance vector by one (Vd-l") : dl”) 6 0(3) : Idlfll = n + 1). As can be
seen, blocking a range of loops L; through L,- results in all distance vectors becoming 11’
long, where n’ = n + (j — i + 1).

Loop blocking may increase the cardinality of the distance vector set D, based upon the
values calculated from the distances before loop blocking and the block size. Speciﬁcally, if

(it’d) E Dlx‘ll are the distance vectors before loop blocking, and loop L; is blocked, then:

0 distances in loops outside of the blocked code (1 S k < 1') remain unchanged;

0 distances at loop L; (k = i) are assigned values calculated from the block size and the
distance before the loop blocking transformation;

0 and distances in loops inside of the blocked code (i < k g n + 1) are assigned the
original distances with a location offset by one.

46

Formally, the mapping rule from le‘ll to D“), where loop blocking is
the x‘“ transformation and loop L; is blocked with the blocking factor 3;:
(VJ: J'e D(==-1) : ([d;/s;] = [d;/s;] A
(de:(lgk<i:d;,=dk)v
(k=i:d[‘=d;)V
(i<k$n+1:d’ =dk_1))A
[)(x) = D(t) U J!)
V
([di/sil 7‘ [41/811 A
(de:(1_§k<i:dj‘=dk AdZ=d1)V
(k = i : d}, = s;[d;/s;] A d}: = s;[d;/s;]) V
(i<k$n+1:d]‘=dk_1 A dZ=dk_.1))A

mv=DWuduﬁn

As an example, Dlx‘ll = {(2,-1,0),(0,1,1),(3,3,3)}, where loop L2 is blocked and
the blocking factor is three (32 = 3) results in the following transformed distance vector set
D“):

{(2,1—1/313,-1,0),(0,(1/3)3,1,1),(3,(3/3)3,3,3),
(2, l-1/3l3,-1.0),(0. [1/313.1,1).(3.l3/3l3,3.3)}

{(2, -3, —1, 0), (0,0, 1,1),(3, 3,3, 3),
(2, 0, -1,0), (0, 3, 1,1),(3,3, 3, 3)}

{(2, —3, —1, 0), (0,0, 1, 1), (3,3,3, 3), (2, 0, —1,0), (0,3, 1, 1)}

As mentioned, all distance vectors before 100p blocking have length n, and all distance
vectors after blocking a loop have length n + 1. We need a matrix that, when multiplied
by 111"” 6 Def-‘1 gives us dl’) e D”). Therefore, if t1, is the elementary blocking trans-
formation matrix, dlf’ll are all n X 1 vectors, and d1”) are all (n + 1) X 1 vectors, then t1,
must be (11 + 1) X n:

1;, X db-” = (11$)
v V V
(n+1)... nxl (n+1)x1

47

Blocking may introduce two distance vectors into D”) where only one existed in 0”.”
because of the ceiling and ﬂoor functions. That is, if loops L; through L j are each blocked,
then each distance vector prior to loop blocking may map into as many as 21“.“ distance
vectors in the transformed distance vector set. Therefore, t; for mapping distance vectors

for a doubly-nested loop is in the form:

 

 

 

 

f(d) 0 g(d) 0

outer loop 1 0 , 1 0
_ 0 1 J _ 0 1 .
” 1 0 ‘ ' 1 0 '

inner loop 0 f(d) ; 0 g(d)
_ 0 1 ‘ _ 0 1 _

where f(d) = [d;/s;] s; and g(d) = [d;/s;_| s;.
As an example, say the initial distance vector set 0“” for a double nested loop is

{(2, -1), (0, 1), (3,3)} and the transformation matrix that represents reversal of the outer

loop and skewing the inner loop prior to loop blocking Tlx‘ll is:

—1 0
2 1
If the outer loop is blocked and 31 = 3, then T”) = t1, X le‘l):

P d

—1 0 = 1 0 X [ 2 1]
_ 2 1. _ 0 1 J

-l 0 = 1 0 X [ 2 1]
_ 2 1 . L 0 I]

 

 

 

 

and the distance vector set after blocking D(’) is Tm X D(°):

 

 

 

- -f(d) 01
—1 0

b 2 1 I

_ -g(d) 0 -
—1 0

b 2 1 d

’—f(d) 0‘
—1 o

b 2 1 d

P -y(d) 0 -
—1 0

b 2 1 d

- -f(d) 0 .
—1 0

n 2 1 d

' -g(d) 0 '
—1 0

L 2 1 .

 

XVI]:
..[31]:
..[g]:

48

 

r -g(0) l

 

0
1

-f(3) '
—3

6+3

.- J

-g(3) -
—3

 

 

L6+3¢

4.1.3 Effect on Source Code Mapping

In the following discussion, the source code shown in Figure 4.3(a) is the initial code. The
loop nest is 2-deep; therefore, any matrices representing kernel transformations that occur
before loop blocking are n X n = 2 X 2. The matrix representing blocking of one loop (either

L1 or L2) is (n + 1) X n = 3 X 2. Any kernel transformations occurring after loop blocking

are(n+1)X(n+1)=3X3.

 

 

 

 

 

' 42/313 ' ' —3 ‘
—2 = —2
I. 3 d u 3 .-
’ -L2/3J3 ' ' 0 '
—2 = —2
b 3 d u 3 .-
'—r0/313‘ '01
o = o
.. 1 .. . 1 .J
' 40/313 ' ' o '
o = o
u 1 u b 1 d
' 43/313 ' ' —3 '
-3 = —3
n 9 u- n 9 d
' —[3/3J3 ' ' —3 '
-3 = —3
I- 9 d - 9 .-

 

 

 

 

 

L1 for I; = 1, m
L2 for I: = l,n
$1 A[I1,12]=

Figure 4.3(a). Sample code: before transforma-

M

L1
L3
L3

for J1 = -n, —1
for J2 = l, m,sg
for 13 = nax(1, J2),nin(m, J: + 32 — l)
A[J3, —Jl] = . . .

Figure 4.3(b). Sample code: after transformation.

 

49

The transformation matrix representing loop blocking of the inner loop of a doubly-

nested loop is:

tb = n+1

OOH
8‘39

where ()0 is the blocking loop and b, is the blocked loop. During composition, b,- = 1 and be

is used to associate the original index variable to the transformed index variable. That is:

O

tb= o

OOH
O"

As an example, say the following sequence of transformations is applied to the source

code shown in Figure 4.3(a):

1. Permute both loops (permute(L1, L2));
2. Block the inner loop by a factor of .92 (block(L2, 32));

3. Reverse the outermost loop (reverse(L1)).

then the matrix representing the sequence of transformations is:

Mapping the assignment statements follows the source code mapping algorithm. The
initial loop nest is 2 deep (n = 2). After blocking, the source code is 3 deep and the
new index variables are J1, J2, and J3. The transformed index variables are calculated in
terms of the initial index variables using the transformation matrix T representing loop

permutation, blocking, and reversal.

J1 0 —1 I J1 = -Iz; [2 = —J1;
J2 = ()0 0 [ 11 ] J2 = (20(11);
h 1 o 2 h=m h=h

50

That is, in the assignment statements, I; is replaced with —J1 and II is replaced with

J3. The assignment statement 51 in Figure 4.3(a) is transformed as follows:

A[Il,12] => A[J3, —J1]

Next the loop bounds are transformed. The inequalities, minima, and maxima are

determined as in the original source code mapping algorithm.

 

Extract Inequalities Find min’s and max’s
11 Z 1; II 3 m; min(Il) = 1; max(Il) = m;
I; Z 1; I: g n min(Ig) = 1; max(lg) = 11

When the indices are transformed, four new inequalities are introduced into the inequal-
ity set to reﬂect the new blocking and blocked index variables. Speciﬁcally, if loop L,- with
index variable I,- is blocked by a blocking factor of s,- and becomes a blocking loop (be)
with index variable J, and a blocked loop (b,-) with index variable Jt, then the following

inequalities are added to the inequality set:

 

J. 2 (bi; J. S "be;
JtZJs; JtSJs'i’si‘l-

 

 

 

Using the source code shown in Figure 4.3(a), the inequality set becomes:

J3 2 1; J3 S m; min(Jg) = 1; max(J3) = m;
—J1 Z 1; —J1 S n; min(—J1)= 1; max(—J1)= n;
12 Z 1; J2 S m;

J32J2; J3$J2+82—1-

The ﬁnal step, calculating the new loop bounds, is equivalent for blocked code and
code containing only kernel transformations. That is, the entire inequality set is used to
determine the new loop bounds for J1, J2, and J3, where the boxed inequalities are used to

determine the lower bounds and upper bounds of J1, J2, and J3.

51

 

 

J1 J2

Js
Plsﬂ Illa-"J [132M [JsSsz—I]

125-73; Jzzja-Jz-l-l;

 

 

 

 

.72 S Iax(J3); J2 Z nin(J3) - 82 +1;

1221+1- ...;
Given the bounds calculated above, the transformed source code is shown in Figure

4.3(b).

 

 

 

 

 

 

4.2 Loop Collapsing

Loop collapsing is the inverse of loop blocking and decreases an n-nested loop to an n — 1
nested loop [14, 15, 18]. Repeated application of loop collapsing is called loop coalescing
and results in an n’-nested loop, where n’ < n. The source code shown in Figure 4.4(a)
is transformed into the source code shown in Figure 4.4(b) by loop collapsing. Additional
statements are introduced to recalculate the value of the index variables 11 and 12 in the
transformed source code (3:).

A loop nest may be collapsed to improve scheduling, reduce barrier synchronization,
or improve utilization of the memory hierarchy. Scheduling is improved because the work
is divided among fewer loops, and self-scheduling techniques are more apt to ﬁnd work
available when the size of the task varies. Barrier synchronization is reduced because fewer
synchronization points exist due to the decrease in the number of loop nests. Finally, loop
collapsing increases the effective vector length for vector machines, improving utilization
of vector registers. Vector operations are typically applied to the loop with the highest
iteration count. Collapsing two adjacent loops L,- and L ,- results in a collapsed loop L], with

a greater iteration count than loops L,- or LJ- had individually.

52

 

integer W[2, 3], V[2, 3] integer W[2, 3], V[2, 3]
L1 for I1 =1,ub.' L1,: for 11.2 = l,nb; X ab,-
L2 for 12 = 1, ab 5] it ((11,: nod ﬂbj) == 0)
S] W[I1, 12] = V[I1, 12] t 5 S; then I1 = 11,2/05,‘

SQ else I1 = cei1(I1,2/ub,') + l
S; 12 = ((113 —l) nod ubj)+l
S1 W[I1,12] = V[Il,12] It 5

Figure 4.4(a). Sample code: before loop collapsing Figure 4.4(b). Sample code: after loop collapsing

 

 

4.2.1 Legality Criteria

The legality criteria for loop collapsing was initially described by Wolfe [14], and later for-
mally speciﬁed by Polychronopoulos [27]. However, this legality condition is too restrictive
because it results in the illegality of loop collapsing in many cases where its application
results in semantically equivalent source code. Wolfe and Polychronopoulos state that two
loops L,- and L,- can be collapsed as long as d,- and d, are 0 for all of the direction vectors
in the direction vector set. According to Wolfe and Polychronopoulos, loops L,- and Li can
be collapsed if:
(WV-1) :61“) 6 UPI) :d, = o A d, = 0).

As a counter-example to this legality criteria, the source code in Figure 4.5(a) can be
collapsed into the source code in Figure 4.5(b), resulting in semantically equivalent source
code, where D = {(1,2)} (d1 76 0 A d2 75 O).

In this dissertation, it is legal to collapse two loops, L,- and L,-, as long as loop L,-
is the parent loop L5 in the loop nest hierarchy and loops L,- and L,- are both normal.
Loop collapsing cannot be applied to two loops L,- and L, if they are imperfectly nested.
Collapsing code that is imperfectly nested results in an increased number of iterations for
the statement 3k between the loops, which results in S]. executing more times than in the

initial source code. As an example, statement 31 executes m times in the source code shown

53

 

L1 for 11 = 1,2

L2 for]: = 1,3

S1 W[I1,I2]=...

$2 ...=W[I1-1,12—2]

Figure 4.5(a). Sample code: before loop collapsing
transformation, illegal by Wolfe’s [14] and Poly-
chronopoulos’s L27] deﬁnition.

L1,2 for [1,2 =1,6

S] if ((113 nod 3) == 0)

S; then 11 =11,2/3

55 else I; = ceil(I1,2/3) + l
51 I2 = ((11.2 - 1) nod 3) + 1
S1 W[I1,I2] =

S2 ...=W[I1-l,I2-2]

Figure 4.5(b). Sample code: after legal application
of loop collapsing transformation.

in Figure 4.6(a). However, statement SI executes m X n times in the collapsed source code

shown in Figure 4.6(b).

 

L1 for I1 = 1,17:
s, A[11] =
L2 for 12 = l,n

Figure 4.6(a). Sample code: imperfectly nested
code before loggcollapsing transformation.

£12 for [1,2 = 1,m X n
11((113 nod n) == 0)
then I1 = 11,2/73 .
0180 I1 = ceil(11,2/n) +
I: = ((11,: — 1)I0d n) + 1

S1 A[I1] =

Figure 4.6(b). Sample code: illegal loop collapsing
transformation of imperfectly nested code.

In summary, the legality for loop collapsing is:

Preconditions

Invariant

 

 

nonnal(B(”"1),i) A normal(B(”‘1),j)

 

 

Explicit
j = .-+ 1
Nesting

perfect(L,~, L j)

 

 

 

 

54

4.2.2 Effect on the Dependence Vector Set D

The initial loop nest depth is n, and all dependence vectors prior to loop collapsing d 3‘1) in
D("”1) have a length n, (Vd-lx'll : (113—1) E D(”’1) : Ida-1) | = n). The dependence vector
set after 100p collapsing is D”), if loop collapsing is the 2‘“ transformation. As mentioned
earlier, the loop nest depth decreases by one in loop collapsing and correspondingly, the
length of the dependence vectors d1”) decreases by one, (Vd-(I) : d1”) 6 0(3) : Illu‘lll = n—l).

Each dependence vector ill”) in the dependence vector set D”) is modiﬁed as follows:

0 distances in loops outside of the collapsed code (1 S k < 2') remain unchanged;

0 distances within the collapsed loops (k = i) are assigned a value that is a function of
the initial distances and the upper bound of the inner loop;

0 distances in loops inside of the collapsed code (j S k S n — 1) are assigned the original
distances with locations offset by one.

Formally, the modiﬁcation to the distances in each distance vector when loops L,- and

Li are collapsed is speciﬁed as follows, where ubj is the upper bound for loop Lj:

wit-1): (ﬁx-1) e MP1) :

(
((Vk:1$k<i:d§,=dk)v
(k = i : d, = (d); X abj) +dk+1) V
(szjgkgn—1:d]€=dk+1))A
0(3) = D(r) UJI)

As an example, consider a distance vector set before loop collapsing D(x'1) =
{(0,2,—1,3),(0,1,1,2)}. Loop collapsing is applied to loops L2 and L3 and the upper
bound of loop L3 is 2 (log = 2). The resulting distance vector set D”) = {(0, (2 X ub3) —
1,3),(0,(1 X abs) + 1,2)}, which is {(0,3,3),(0, 3, 2)}

As with loop blocking, the transformation matrix representing the effect of loop collaps-
ing is represented by an elementary matrix tc. If loop collapsing is the 2:“ transformation,

then all (lb-1) 6 0(3'1) have length n and all (lb) 6 D9”) have length n — 1. Therefore, the

55

matrix tc is (n — 1) X n:

tc x 1113-1) = cl")
V w—I v
(n-1)Xn "XI (“-ll’a

Suppose that a triply-nested loop is collapsed, then the transformation matrix tc that

represents collapsing of the outermost loops, L1 and L2, is in the form:

t __ ub2 1 0
c _ 0 0 1
and the transformation matrix tc that represents loop collapsing of the innermost loops, L2

and L3, is in the form:

t _ 1 0 o
c _ 0 "L3 1
As an example, if D("1) = {(2,—1,3),(0,1,1),(5,6,-7)}, loops L2 and L3 are col-

lapsed, and the upper bound of the inner loop is two (ub3 = 2), then D”) is calculated as

follows:
[100]X_21=[100]X_21=[ 2 l=l2l
0 ub3 1 3 o 2 1 3 2(—1)+1(3) 1
[1 o OJX-(l) : [1 001x11) = [ 0 J: [0]
o ab3 1 1 o 2 1 1 2(1)+1(1) 3
’5 '5
1 o o 1 0 o 5 5
[0 ub3 1]"Lf7 = [0 2 1]x f17=]2(6)+1(—7)]=]5]

 

 

4.2.3 Effect on Source Code Mapping

In the following discussion, the triply-nested loop shown in Figure 4.7(a) is the initial code.
Loop collapsing maps the triply-nested loop (n = 3) into a doubly-nested loop (n = 2).
Any kernel transformations occurring before collapsing are represented by an n X n matrix.
The matrix representing loop collapsing tc is (n — 1) X n, and any transformations occurring

after loop collapsing are represented by (n — 1) X (n - 1) matrices.

56

 

L; for 11 = Lab L; for J; = ubg

L: for I: = Lab L; for J: = 1, ub; X abs
L3 for 13 = l, 1163 11(J3 nod uba ==
S1 A[11,Iz,13] = . . . than 11 = Jz/uba

else I; = Liz/abs] + 1
I: = ((12 - 1)Iodub3)+ 1
S1 A[I1,Jl,13] =

Figure 4.7(a). Sample code: before transforma— Figure 4.7(b). Sample code: after transformation.
ﬁen-

 

The transformation matrix that represents the collapsing of the inner two loops of a
triply-nested loop is:

n

A
r x

t_ 1 100
c—n- 0606;

where co is the outer loop and c.- is the inner loop of the two loops that are collapsed.
During composition, the initial index variables associated with co and c,- are signiﬁcant to

the source code mapping.

As an example, the following sequence of transformations is applied to the source code

shown in Figure 4.7(a):

1. Permute the outer two loops (permute(L1, L2));

2. Collapse the inner two loops (collapse(L2, L3));

The transformation matrix that represents the sequence is:

0 1 0
T=[.° :3]=[:: 2.]x ...
O 1 O I 0 0 1
As with loop blocking, source code mapping follows the mapping algorithm with some

minor modiﬁcations. The assignment statements are mapped from the initial index variables

A, 12, and lg. to the transformed index variables J1 and J2.

57

o 1 o X? J1 = 12
0 Ci X12 J2 = 00(11),Ci(13)
3

The mapping shows that Ig is replaced with J1 in the transformed source code. Also,

the outer collapsed loop (co) is [1 and the inner collapsed loop (c,-) is I3.
AlIla I29 I3] => A[Ila J19 13]

Note that the index variables [1 and 13 are not transformed in the ﬁnal source code. As
will be shown later, the indices are recalculated by some additional assignment statements.
The loops are mapped after the index variables are calculated. The inequalities, minima,

and maxima are determined from the initial source code.

 

Extract Inequalities Find min’s and max’s

11 2 1; 11 3 ub1; min(Il) = 1; max(Il) 2 ub1;
I; 2 1; I; 5 ub2; min(Ig) = 1; max(Ig) = ubg;
I3 2 1; I3 5 ub3 min(Ig) = 1; max(13) = 1111;;

The indices are transformed using inequalities from the initial source code and the
index variable mapping used for the assignment statements. Two inequalities for the index
variables that represent the collapsed loop replace the initial inequalities. That 'is, if loops
L,- and L,- are collapsed and J, is the index variable in the transformed source code for the

collapsed loops, then the following inequalities are introduced:

J, 2 1b.- x 1b,; J, g ab.- x ubj

If we assume that the loops are normalized prior to loop collapsing (lb,- 2 1 and 1b, = 1),
then the ﬁrst inequality is replaced with J, 2 1 x 1 E J, 2 1.

Continuing with the example, the transformed inequality set is:

58

 

 

J1 J2

 

 

 

912g [JISubgl balm-x115; Jggungubj

 

 

 

 

J221X1;

Also, after loop collapsing, the array subscripts for I,- and I ,- are recalculated. The re-

 

 

 

 

 

calculated array subscripts for I,- and I ,- with J, representing the collapsed loops are:

 

if ((J, mod ubj) == )
than I,- = J,/ubj
else I,- = ceil(J,/ubJ-) + 1

 

 

Ij = ((J, - 1)mod ubj) +1

 

The complete transformed source code is shown in Figure 4.7(b).

CHAPTER 5

Loop Fission and Loop Fusion

Loop ﬁssion is a transformation that changes the structure of a loop nest, such that it
becomes two adjacent loop nests. Loop fusion is the inverse of loop ﬁssion and combines
two adjacent loop nests into one. The following discussion describes the legality criteria,
effect on the distance vector set D, and effect on the transformation matrix T for both loop

ﬁssion and loop fusion [12, 22].

5.1 Loop Fission

Loop Fission [17, 18, 19, 20, 21] is a loop transformation technique that divides a set of
program statements that are within a nested loop into two adjacent nested loops. As an
example, the source code in Figure 5.1(a) becomes the source code in Figure 5.1(b) after
loop ﬁssion. Applying loop ﬁssion improves the localities of reference such that the resultant
code has the largest possible groupings of references to the same variables, and also may
enable unimodular transformations, either by modifying the transformed Source code so
that all assignment statements are perfectly nested [28], or by reducing the cardinality of

the distance vector set D.

5.1.1 Legality Criteria

The initial source code has a dependence vector set Dw) that contains all dependence vectors

between statements in the initial code. As mentioned earlier, D(°) can be represented by

59

60

 

L; for 11 = l,n L; for 11 = l,n
L2 fox-12:1,": L2 forlg =1,m
L3 for 13 = 1,p L3 for 13 = 1,p
$1 Alli. 12.13] = 3U: - 1.12. 13] 52 CIA. 12.13] = DlliJz - 3.13]
32 0111, 12.13] = Dlli. 12 - 3. 13] $4 Dlli. 12.13] = BU: + 1.12 - 1.13]+
S3 BU}, 12,13] = “11, 12 - 2, 13 - l] C[11, 12,13 - 2]
54 D[11,12,13] = B[11 + 1,12 - 1,13]+ L1 for 11 = l,n
C[11,12,13 — 2] L2 for 12 = 1,m
55 var; = vat-2 L3 for 13 = 1,p

31 411112.13] = Elli - 1.12.13]
53 B[11,12,13] = I[11,12 - 2,13 - 1]
55 var; = var;

Figure 5.1(a). Sample code: before loop ﬁssion Figure 5.1(b). Sample code: after loop ﬁssion
transformation transformation

 

a dependence graph, where each assignment statement is a node and each dependence is a
directed arc. A II—block is a strongly connected component of the dependence graph [13].
II-blocks are signiﬁcant because they cannot be divided by most loop transformations. A
111-block (deﬁned in this paper) is a set of statements whose members are strongly connected
components. In other words, dividing a set of statements into two \II-blocks (\Ill and I'll)
results in two partitions such that each element of \III and W2 is a lI-block.

The source code in Figure 5.1(a) has three II-blocks: 111 = (81,83), II; = (82,84), and
H3 = (85). After transformation, the source code shown in Figure 5.1(b) has two \II-blocks:
\Ill = (82,84), and ‘11; = (81,83,85). Note that none of the II-blocks are divided by the loop
ﬁssion transformation.

In the following discussion, loop ﬁssion is the a“ transformation in the sequence. The

dependence vector set for the statements in \111 is 05:52 , and the dependence vector set for

the statements in W; is D5,: '31 . Finally, the dependence vector set from \III to W2 is labeled
D511}; and the dependence vector set from ‘112 to ‘111 is D9; 311,] . The union of all four of

these sets is D(""1), the dependence vector set before the loop ﬁssion transformation:

3. —1 -1 -1 —1
D( I) = 03:»! U 037.111 U 05111.11; U D5172»:-

61

As an example, the dependence vector set DOV-1) for the source code in Figure 5.1(a) is:
13‘“) = {(0,2,1),(0,o,2),(1,0,0),(1,—1,0),(0,3,0)}, (5.1)

where 05,231 = {(0,2, 1),(1,o,0)}, 135,13] = {(0,0,2),(0,3, 0)} 05,331 = (a, and man].

{(1, -1,0)}. As can be seen from Expression (5.1), D(z-l) = D("1) U 05:2,;1 U DSIZ-dlu

-1
Divisi-

The dependence vector sets Dita-1111; and D131: are critical to the legality and order of

the loop ﬁssion transformation. If there are dependences from both VIII to In and from

1)

'1’: to {11, then loop ﬁssion cannot be applied. A non-empty 05:1 7% implies that In must

execute before In in the transformed source code. Similarly, a non-empty DEE-é)! implies
that ‘1'; must execute before \Ill. Clearly, both execution orders cannot be satisﬁed.
Formally, loop ﬁssion cannot be applied if ((3d: d E D(1=1-1)) A (3d: d E Dita—11112)) A

simpler, formal notation to describe the legality criteria is:

 

 

 

 

Preconditions
Invariant
Explicit
(055.71! = 0 v 0137:! = 0)
Nesting

This notation states that loop ﬁssion is legal if either the distance vector set from In to

\I12, or from II; to \111, is empty, regardless of the nesting structure of the two \II-blocks.

5.1.2 Effect on the Dependence Vector Set D

The dependence vector set for the source code before the loop ﬁssion transformation is

0(3'1) = D(==-1) UDSr: 1;: UDsff1 7,2) UDSIZ-éi. The sets D“ I) and 03,323: were initially loop—

carried dependences; however, after loop ﬁssion the partitions D5331??? and Ufa-$2 become

loop-independent. Loop ﬁssion is referred to as a dependence-breaking transformation [17]

62

because the dependences between \II-blocks are carried at a higher level after transformation.
After loop ﬁssion, the dependence vector set for the ﬁrst code block (D513:1 ),\Iv1) consists of the
dependences within the statement partition \Ill. Similarly, the dependence vector set for
the second code block (D5121? 2) consists of the dependences within the statement partition

‘1'22

 

Partition \III: D”) = (Dali-1111))
Partition \112: 0(3) = (05:23:)

 

 

 

5.1.3 Effect on Source Code Mapping

Any unimodular transformation can be applied at any time during the loop transformation
process (composition property). In compliance with this property, loop ﬁssion can be applied
at any point in the parallelizing process, given that the legality criteria are met.

Loop ﬁssion results in two adjacent nested loops where only one previously existed.
Therefore, both new blocks of code (\111 and IQ) have a transformation matrix associated
with them (TS? and Téjl). Because of the composition property, both T1,?) and T11? are

initialized to the transformation matrix that existed for the nested loop prior to loop ﬁssion

(TV-1)). Notationally, this condition is expressed as:

 

Partition \III: Ta) = T(1=-1)
Partition ‘15: T]: = Tlx-l)

 

 

 

5.2 Loop Fusion

This section describes the extension of the unimodular approach to include loop fusion
[17, 18, 19], which is the complement of loop ﬁssion. Loop fusion combines two adjacent
nested loops into one nested loop resulting in a decrease in communication and synchro-
nization overhead. As an example, the initial source code that is shown in Figure 5.2(a) is

transformed into the source code shown in Figure 5.2(b) by loop fusion.

63

 

L; for 11 = l,n L1 for 11 = l,n

L2 for 12 = l,m L2 for 12 = 1,m

L3 for 13 = 1,p L3 for 13 = 1,1)

31 A[11112,13] = BUiJzJa] S1 A[11.12,13]= 8111212113]

52 l[11 - 1,12,13 - 2] = D[11, 12 - 2,13] 32 A[11 - 1,12,13 — 2] = D[11,12 - 2,13]

L; for 11 = l,n 53 E[11 - 2,12,13] = F[11,12,13] + G[11,12,13]
L; for 12 = 1,711 $4 “[11 - 3,12,13] = E[11,12,13]

L; for 13 = 1,p
53 EM - 2112.13] = 5111.12.13] + G[11.12. 13]
S4 “[11 - 3,12,13] = E[11,12,13]

Figure 5.2(a). Sample code: before loop fusion Figure 5.2(b). Sample code: after loop fusion
transformation transformation

 

5.2.1 Legality Criteria

Previous work related to loop fusion implies that the complexity of determining the legality
of the transformation exceeds the potential gains from its implementation [14]. This sec-
tion of the paper clariﬁes and simpliﬁes the legality condition for loop fusion based upon
properties of the loop bound set and the distance vector set.

A (~)-block is a set of statements that is associated by spatial locality in the source
code. That is, all of the assignment statements in a loop nest make up one 9-block. As
an example, the source code in Figure 5.2(a) has two 9-blocks: 91 = (81,82) and 92 =
(83,84). A O-block is another name for a loop nest and is introduced in this paper to ensure
consistent notation with loop ﬁssion. Unlike II- and \II-blocks, nothing is assumed about
the dependence relations within G-blocks .

Zima [15] and Wolfe [14] both state that a precondition for loop fusion is that the loop
bound sets for both loop nests must match exactly, or be “adjustable” using a transfor-
mation such as loop unrolling. The following discussion shows that this precondition can
be weakened into a predicate named can-adj ust:(Be1 , 89,), which returns true if the loop
bound sets for ('91 (B9,) and 62 (392) can be adjusted to match exactly. To clarify, Zima

and Wolfe state that the loop bound sets must already match, whereas can_adjust() de-

64

termines if the loop bound sets can be adjusted. An algorithm that determines the value of

can_adjust(B9, , B9,) is given in Figure 5.3.

 

Algorithm can_adjust
Given two loop bound sets, 89, and 89,, determines if the loop bound sets can be adjusted. Evaluation of
this function is a precondition for loop fusion.

Input:

B91 /*“'loop bound set for the ﬁrst code block"/
B9, /**loop bound set for the second code block**/
Output:

FLAG /”TRUE or FALSE‘V

Procedure:

1. /"‘*Determine iteration count for all 3-tuples in Be, and Be,**/
For all 5: 6 Be,
61.count.- = [(lb; — ub; + s,)/s,]

For all 5} 6 Be,
eaxount,’ = [My - “51' + SJl/SJ‘J

2. [“Sort 91.nount and egxountﬂl
SORT(91 .count)
SORT(62 .count)

3. [*‘Determine value of F LAG**/
FLAG = TRUE
For all i 6 61.count
/"‘"‘if the iteration counts diﬂer for a loop in 91 and 92"/
If 61 _count, ;£ 62.count.-
/**then the two loop bound sets cannot be adjusted'*/
Then FLAG = FALSE

Figure 5.3. Algorithm to determine if two loop bound sets can be adjusted.

 

Determining the set of transformations needed to adjust B91 and B92 may involve a
complex solution. Note that can_adjust(B9,, B92) determines if the loop bound sets can
be adjusted, but does not determine the actual adjustments. The algorithm to determine
the adjustment matrix is dependent upon the cardinality of the set of transformations and

the loop nest depth of the target source code.

65

Typically, another legality criteria for loop fusion is based upon the dependences between
the two code blocks 91 and 92. The dependence vector set after loop fusion has four

partitions:

1. D323]: The loop-carried dependences between statements in 91.

2. D8231: The loop-carried dependences between statements in 92.

3. Dg1 ).9 2: Before the fusion transformation, this partition was loop-independent. How-
ever, after loop fusion, this partition is loop-carried and is the distance vector set from
statements in 01 to statements in 92.

4. DS; {91: Before the fusion transformation, this partition was loop-independent. How-
ever, after loop fusion, this partition is loop-carried and is the distance vector set from
statements in 92 to statements in ('91.

As an example, the dependence vector sets for the initial source code in Figure 5.4(a)
are: D823: = {(0)} and Data = {(0)}, where 91 =(81) and 92 =(82). Note that there
is also a loop-independent dependence from S2 to 81 caused by variable AEi] . After loop
fusion is applied, as shown in Figure 5.4(b), the loop-independent dependences from 82 to
81 are now loop-carried (D3292 = ﬂ, and D8191 = {(1)}).

Wolfe [14] describes the legality of loop fusion based upon dependences between state-
ments in the ﬁrst loop nest 91 and the second loop nest 02. Zima [15] uses a similar,
although more speciﬁc deﬁnition, called serial-fusion preventing, to describe an illegal condi-
tion for applying loop fusion. Both discussions state that loop fusion is illegal if D5291 aé 0.
In other words, if after the loop fusion transformation there exists any loop-carried depen-
dences from the second to the ﬁrst code block, then loop fusion is illegal. According to this
legality criteria, the source code shown in Figure 5.4(a) cannot be fused and still maintain
semantic equivalence to the original source code. For illustrative purposes, the source code
in Figure 5.4(b) shows the fused code, and, as can be seen, it is not semantically equivalent
to the initial code because of the dependence from 52 to 51 caused by array A. However,
if the loop is reversed after loop fusion is applied, as shown in Figure 5.4(c), then the
transformed source code is semantically equivalent to the initial code.

This example contradicts the legality criteria proposed by Wolf and Zima. That is, we

are able to apply loop fusion to code that initially appeared non-fusible. From studying and

66

exploiting properties of the dependence vector set we developed a new strategy for applying

loop fusion. The approach for determining the legal application of loop fusion is as follows.

 

L1 for 11 = l,n L1 for 11 = l,n L1 for 11 = —n, -1
S1 l[11] = l[11] + C[11] S1 A[11] = A[11] + C[11] S1 A[-11] = A[-11]+ C[-11]
L; for 11 = l,n S2 D[11] = A[11 + l] + D[11] S2 D[-11] = A[-11 + l] + Ill—11]

S2 D[11] = l[11 + 1] + 3111]

Figure 5.4(a). Sample code: Figure 5.4(b). Sample code: Figure 5.4(c). Sample code: loop
before loop fusion transforma- loop fusion results in non- fusion followed by loop reversal re-
tion. equivalent source code. sults in equivalent source code.

It is understood that distance vectors are initially lexicographically positive and deﬁne
the lexicographic execution order of the assignment statements. In the initial source code
(Figure 5.4(a)), all iterations of statement 51 complete execution before any iterations of
statement 52 begin execution. Clearly, the source code after loop fusion must preserve any
dependences between $1 and 32.

The distance vectors in D3291 must reﬂect the execution order of ('91 before ('92. There-
fore, the set 1382,91 is negated when DU’) is determined after the loop fusion transformation.
That is, the distance vector set D”) is:

:1: z-l 3-1 :1: :1:
D( ) = Diane: U demo: U Dye, U ‘(Db2),91)-

11

As an example, the distance vector set for the source code in Figure 5.4(b) is:

per) = {(0)} u {(0)} u 0 u —{(1)}= {(0), (-1)}

Determining DU”) after loop fusion may result in lexicographically non-positive depen-
dence vectors in D1929; , as in the above case when D3291 = —{(1)}. Even though some
distance vectors may not be lexicographically positive during intermediate transformations,

the global postcondition must be true when all transformations are complete. That is, all

67

of the ﬁnal, transformed distance vectors must be lexicographically positive. The reversal
transformation is applied after loop fusion in the source code in Figure 5.4(c), which makes
all of the distance vectors in DU”) lexicographically positive.

A less trivial example is illustrated by the source code in Figure 5.5(a), which has
only one distance vector (Dbl-2,61 = {(1, —1)}) after loop fusion is applied (D8513: , 08:31,
and D8392 are all empty). Reversal of the signs of the distances in D8291 results in a

distance vector set that reﬂects the execution order of 01 before 92 (Dm = —(D(ea;)’el) =

-{(1,-1)} = {(-1,1)})-

The distance vector in DU”) is not lexicographically positive, and therefore the global
postcondition is false. Two possible transformations that make the distance vector set
lexicographically positive are to: 1) reverse the outermost loop using the transformation

matrix:

—10
“l .1],

or 2) interchange the inner and outer loop using the transformation matrix:

T=[(1)(1)].

Loop reversal applied after loop fusion is shown in Figure 5.5(b) and loop interchange
applied after loop fusion is shown in Figure 5.5(c). Both cases produce semantically equiv-
alent code with respect to the initial source code. Loop reversal results in D”) = {(1, 1)},
and the transformed source code is semantically equivalent to the initial source code because
the transformed distance vector set is lexicographically positive. Loop permutation results
in D0”) = {(1, —1)}, and, again, the distance vector set is lexicographically positive and the
transformation sequence is legal.

The above discussion shows that the legality criteria for loop fusion need not be based
on the dependence vector set between the two fused blocks of code. Instead, the legality
criteria for loop fusion is based strictly upon the predicate can.adjus1:.(B91 , B9, ). As with

all unimodular transformation sequences, all (1 in D(”) must be lexicographically positive

68

 

L1 for 11 = 1,3 L1 for 11 = -3, -1 L2 for 12 = 1,3

L2 for 12 = 1,3 L2 for 12 = 1,3 L1 for 11 = 1,3

S1 A[11,12] = . . . S1 A[-11,12] = . . . S1 A[11,12] =

Li for11=1,3 S2 ...=A[-11+1,12-1] S2 ...=A[11+1,12-1]
L; for 12 = 1,3

S2 ...=A[11+1,12-1]

Figure 5.5(a). Sample code: be- Figure 5.5(b). Sample code: Figure 5.5(c). Sample code: loop
fore loop fusion transformation. loop fusion and loop reversal. fusion and loop interchange.

upon completion of the transformation sequence. In summary, the legality criteria for loop

 

 

 

 

fusion is:
Preconditions
Invariant
Explicit
can.adj ust:(Be1 , B92)
Nesting

5.2.2 Effect on the Dependence Vector Set D

Modiﬁcations to the distance vector set DU”) were discussed above in the context of loop

fusion legality. There are four partitions to the distance vector set after loop fusion: Dbl—$1 ,

Dg’él, Dye, , and D8 9. The distance vector set after loop fusion 18 the union of these

four partitions with Dgzl’el negated. That is,

 

3 2—1 x-l :1: a:
D( )=D(919)1U 0(92 9; U D(91).92 U (D52).9 1)

 

 

 

5.2.3 Effect on Source Code Mapping

The composability property states that any unimodular transformation can be applied at
any time. Therefore, the loop nest 91 may have a sequence of transformations applied to it

prior to the loop fusion transformation (T9,). The loop nest 92 may also have a sequence of

69

transformations applied to it prior to loop fusion (T9,). Also, the loop bound sets 89, and
B9, may need “adjustment”. In any case, assume that after loop fusion, the statements
originally from 91 will have a different transformation matrix associated with them (T9,)
than the statements originally from 92 (T9,). The implementation of loop fusion must
account for both T9, and T9, in the transformed source code. In other words, we have two
transformation matrices for one loop nest after the loop fusion transformation.

The ﬁnal source code mapping proceeds in two steps. The assignment statements that
were initially in 91 are mapped using T9,. Similarly, the assignment statements that were
initially in 92 are mapped using T9,.

As an example, the source code in Figure 5.6(a) cannot be fused in its current form
because the loop bounds for the inner loops do not exactly match (loop L’, is the reversal
of loop L2). However, can-adjust(B9,,B9,) is true, because the outer loop and inner
loop of 91 have the same number of iterations as the outer loop and inner loop of 92,
respectively. If reversal is applied to the inner loop of 92, then the source code in Figure
5.6(b) is obtained. Then the loops can be fused as shown in Figure 5.6(c), resulting in
semantically equivalent source code.

The transformation matrix for 91 is:

10
“[01],

where 91 =(81), and the transformation for 92 is:

1 0
”[0 ~1l

where 92 =(S2) and T9, reﬂects the reversal of the inner loop. After loop fusion, T9,
remains associated with 91 and is used to determine the form of the assignment statements
in 91. Similarly, T9, remains associated with 92 and is used to determine the form of the
assignment statements in 92. Finally, either T9, or T9, is used to map the loop bounds.
That is, mapping the loop bounds for 91 using T9, results in the same loop bounds as

mapping the loop bounds for 92 using T9,.

70

 

L; for 11 = 1,m L; for 11 = 1,m L; for I; = 1,m
L2 for I2 = l,n L2 for 12 = 1, n L2 for I2 = l,n
S1 ”11]: S1 A[I1]= 31 M11]:

L; for 13 = 1,m Li for 13 = 1,m S2 . . . = A[I2 — 1]
L; for 14 = -n, —1 L; for I; = l, n

S2 ...=A[—I4-l] $2 ...=L[I4-1]

Figure 5.6(a). Sample code: be- Figure 5.6(b). Sample code: af- Figure 5.6(c). Sample code: af-
fore adjustment ter Mustment ter adjustment and loop fusion

In summary, the transformation matrix for the assignment statements originally in 91 is

TS? and the transformation matrix for the assignment statements originally in 92 is T82.

 

91 Statements: T8? = (gt-U
92 statements: T3? = Tgi-l)

 

 

 

CHAPTER 6

Integration of the Extended Loop

Transformations

Algorithms that use the kernel and extended set of transformations are broadly classiﬁed
as either machineindependent or machine-dependent. Machine-independent algorithms
determine a transformation matrix T that meets some general goal, such as parallelism or
data locality. Machine-dependent algorithms determine a transformation matrix T that
optimizes source code for a particular architecture, such as a vector or shared-memory
machine.

The kernel unimodular transformations are currently used to obtain parallelism, and in
some cases data locality, on source code in perfect loop nests. In order to provide context
and motivation for developing algorithms to integrate the extended loop transformations,
this chapter begins by reviewing the existing algorithms that use the kernel unimodular
transformations to obtain coarse- and ﬁne-grain parallelism.

Next, new algorithms that we developed are described that use the kernel and extended
sets of transformations to obtain new goals or that enable the application of transformations
to a broader range of source code structures. First, the wavefront algorithm that achieves
ﬁne-grain parallelism is modiﬁed so that it is applicable to imperfect-only-child loop nests,
in addition to perfect loop nests. Second, the extended transformations are used to expand

the machine-independent goals: ﬁrst, by ﬁnding parallelism where the kernel unimodu-

71

72

lar transformations fail; and second, by obtaining data locality within the matrix-based
framework. Finally, the extended transformations are used to obtain machine—dependent
optimization. Speciﬁcally, two algorithms are developed for source code executing on vec-
tor and shared-memory machines. All of the algorithms that use the extended set of loop
transformations can be applied to source code with both perfect and imperfect-only-child
loop nests.

Both Banerjee [3], and Wolf and Lam [4] have developed algorithms that yield coarse-
grain parallelism in some cases, and ﬁne-grain parallelism in all cases of perfect loop nests.
Using their algorithms, coarse-grain parallelism may not be obtainable with the kernel uni-
modular transformations. However, we have developed an algorithm that uses the extended
set of transformations and may allow parallelization of the outermost loops in cases where
the kernel set of transformations fail.

Similarly, ﬁne-grain parallelism refers to the parallelism of the innermost loops of a per-
fect, n-deep loop nest. Wolf and Lam show that n — 1 degrees of parallelism can always be
obtained from a n-deep perfect loop nest. A new algorithm is discussed that is a modiﬁca-
tion of the original algorithm for ﬁne-grain parallelism and works on imperfect loop nests.
Speciﬁcally, if assignment statements are nested at q unique depths in a loop nest, then
the algorithm ﬁnds a minimum of n - q degrees of parallelism. As examples, q is one for a
perfect loop nest because all statements are nested at one level, and q is greater than one
for any imperfect loop nest.

Data reuse is deﬁned as using the same datum in different iterations of a loop nest.
Data locality occurs if reused data remains in the same memory hierarchy level between
uses. An algorithm is presented that improves data locality using the transformations from
the extended set, namely loop ﬁssion and fusion.

Next, two architecture-speciﬁc algorithms are introduced that optimize source code us-
ing the extended set of loop transformations. Speciﬁcally, an algorithm is given that opti-
mizes source code for a vector machine using loop normalization, ﬁssion, permutation, and
blocking. Finally, an algorithm for coarse-grain parallelism on a shared-memory machine is

discussed, and uses loop normalization, fusion, permutation, and collapsing. The objective

73

of the vector machine algorithm is to minimize work performed in each loop and move the
parallel loops innermost. In contrast, the objective of the shared memory machine algorithm
is to maximize the work performed in each loop and move the parallel loops outermost. The
development of these algorithms shows that the extended set of loop transformations can
be composed to achieve both machine-independent and machine-dependent compiler goals

and can be applied to a wide variety of source code structures.

6.1 Previously Developed Algorithms

This section discusses two existing algorithms that are applied to perfect loop nests. The
ﬁrst algorithm obtains coarse-grain parallelism on a doubly-nested loop, but can be gener-
alized to a n-nested loop. The second algorithm obtains ﬁne-grain parallelism on a n-nested
loop.

Loop L,- can be parallelized if all of the distance vectors J in the distance vector set D
have speciﬁc properties. If each distance vector of is in the form ((11, d2, . . . , d“), where n is
the loop nest depth and J > (T means that distance vector (I is lexicographically positive,

then loop L; can be parallelized if:
(VJ: J’e D : ((d1,...,d,-_1) > 6) v (d.- = 0)). (6.1)

That is, loop L,- can be parallelized if each partial distance vector preceding location i is
lexicographically positive or each distance at location i is 0. In the following discussion,
dist refers to the ﬁrst disjunct of Expression (6.1), ((dl, . . ., (1,--1) >- (i) and disj2 refers to
the second disjunct of Expression (6.1) (d,- = 0).

As an example, suppose that D contains only one distance vector: D = {(0,2, —2)}.

The following table shows the parallelism in the source code based upon disjl and disj2:

74

 

 

d.-=0 (d1,...,d.-_1)>0° Parallel

 

 

 

 

 

L1 yes not applicable yes
L2 no no no
L3 no yes yes

 

 

 

 

6.1.1 Coarse-grain parallelism for perfect loop nests using the kernel

transformation set

A distance vector J for a doubly-nested loop is in the form (d1 , d2), where (11 corresponds to
the outer loop (L1) and d2 corresponds to the inner loop (L2). Parallelization of the outer
loop is equivalent to ﬁnding a transformation matrix T, such that (11 = 0 for all of the dis-
tance vectors in the transformed distance vector set T x J E D. Satisfying this requirement
is equivalent to satisfying disj2 from Expression (6.1). Note that disj] is irrelevant in this
case since i— 1 = 0.

For a doubly-nested loop, the transformation matrix T is in the form:

T[1,1] 111,2]
T[2, 1] T[2, 2]

Therefore, ﬁnding T that parallelizes the outermost loop is equivalent to ﬁnding T[l, 1] and
T[l, 2], such that T[l, 1] x d1 +T[l, 2] x d2 = 0 for all J6 D. The values of T[2, 1] and T[2, 2]
are later calculated to make T unimodular. That is, both T[2, 1] and T[2,2] are integers
and are calculated such that Idet(T)| = 1.

Wolf and Lam [4] describe coarse-grain parallelism as choosing the transtrmation ma-
trix T with ﬁrst row t: such that t: x J = 0 for all dependence vectors d: Banerjee [3]
describes an algorithm for parallelizing the outermost loop of a doubly-nested loop such
that the transformed source code is semantically equivalent to the original source code; the
outer loop can be executed in parallel; and the iteration count of the outermost loop is

maximized. Transformation matrix T results in (11 = 0 for all transformed distance vectors

75

in D. Algorithm kerneLcoarse obtains coarse-grain parallelism on a perfect, 2-deep 100p

nest using the kernel transformations and is given in Figure 6.1.

 

Algorithm kernel_coarse
Finds a 2 x 2 transformation matrix T that parallelizes the outermost loop of a doubly-nested loop, if one
exists.

Input:

D /**the distance vector set”/
Output:

T /"‘*the 2 x 2 transformation matrir"/
Procedure:

1. Move the loop with the highest iteration count to the outermost position using permutation.

2. Determine if all J E D are in basis form a(a1, a2), where a is a positive integer
and (a1 , a2) is lexicographically positive. If all of d E D cannot be placed in
this form, then terminate. Else, continue.

3. Find T[2, l],T[2, 2] and g = GCD(a1,a2) such that g = a; x T[2, 1] + a2 x T[2, 2],
where T[l, l] = a2/g and T[l, 2] = -a1/g.

Figure 6.1. Algorithm to obtain coarse-grain parallelism using the kernel transformations
of perfect loop nests.

 

Note that a transformation matrix T may not exist that results in coarse-grain paral-
lelism. That is, Banerjee’s algorithm terminates (Step 2) if the distance vectors cannot be
placed in basis form. The extended algorithm discussed in Section 6.4 shows how to increase

the chances of ﬁnding a basis form for all distance vectors in the distance vector set.

76

6.1.2 Fine-grain parallelism for perfect loop nests using the kernel trans-

formation set

This section describes an algorithm that parallelizes the inner n — 1 loops of a n-deep,
perfect loop nest using the kernel set of loop transformations. The approach is known as
the wavefront transformation [4, 24], and is named kerneLfine in this dissertation.

Two important proofs developed by Wolf and Lam [4] are critical to the applicability of
the wavefront transformation. The ﬁrst shows that any lexicographically positive distance
vector can be made fully permutable. The second shows that any perfect, n-deep loop nest
with a fully permutable distance vector set D can be transformed to have n — 1 degrees of
parallelism. The proofs are paraphrased below, and the reader is referred to Wolf and Lam
[4] for more details.

All distance vectors in the initial distance vector set are lexicographically positive, where
a lexicographically positive J has at least one positive distance (1),. The second disjunct of
Expression (2.1) (deﬁnition of fully permutable) requires that distances d,- through (I, must
be non-negative, if loops L; through L,- are fully permutable. Any negative distance in the
range d,- through d,- can be made non-negative via skewing by the positive distance d2. That
is, any negative distances between loops L,- and L,- can be made non-negative satisfying the
second disjunct of Expression (2.1).

Step 1 of Algorithm kernel-fine transforms a lexicographically positive distance vector
set D into a fully permutable distance vector set D“). Transforming a lexicographically
positive distance vector set into a fully permutable distance vector set involves skewing
the inner loops by the positive distance d], in each distance vector d: As an example, the

transformation matrix Tm for a 4-deep loop nest is:

1 0 0
#2,1 1 0
#3,1 [13,2 1
#44 #4,: #4,3

p-acoc

where 11.3,,- = [max(-d,/d.-)], such that d,- > 0 and JG D.

77

After transforming the distance vectors to fully permutable form, loops L2 through L,
are parallelized by satisfying disjl from Expression (6.1) (deﬁnition of parallelism). That
is, loops L2 through L. are parallelized by ensuring that the distances associated with loop
L1 are strictly positive (VJ: J6 D : d1 > 0). In Expression (6.1), dist states that loop L,-
can be parallelized if (VJ: J6 D : (d1, . . . ,d.-_1) >- 6). In an n-nested loop, if we show that
the ﬁrst distance is greater than zero (d1 > 0), then disjl is satisﬁed.

In fully permutable form, all distances in a distance vector set are non-negative. The
wavefront transformation skews the innermost loop L, of a n-nested loop by each of the
outer loops L1 through Ln-1. Therefore, all distances associated with the innermost loop
d" are strictly positive. As a ﬁnal step in the wavefront transformation, the innermost loop
is moved outermost. Thus, rather than (1,, being positive for all J6 D, (11 is positive for all
J6 D, which satisﬁes (11°st from Expression (6.1). The wavefront transformation matrix T
skews the innermost loop by each of the outer loops and then moves the innermost loop to

the outermost position. The wavefront matrix for a 4-deep loop is in the form:

COD-“H
CHOH
H061—
OOOr-I

Algorithm kernstine obtains n — 1 degrees of parallelism from a n-deep, perfect loop
nest. Step 1 transforms all lexicographically positive distance vectors in the distance vector
set so that they are fully permutable. Step 2 skews the innermost loop by each of the
outer loops. Finally, Step 3 moves the innermost loop to the outermost position using loop

permutation.

6.2 New Algorithms

This section introduces several new algorithms that make use of the extended and kernel
set of matrix-based loop transformations to address different objectives of a parallel com-
piler. The ﬁrst algorithm is a modiﬁed version of the wavefront transformation and can be

applied to imperfect loop nests. The second algorithm achieves coarse-grain parallelism in

78

 

Algorithm kernsl_fine
Finds a n x n transformation matrix T that parallelizes the innermost n - 1 loops of a n-deep perfect loop
nest.

Input:

D /*"‘the initial distance vector set**/
Output:

T /"the n x n transformation matrix”/
D' /”the transformed distance vector set"/
Procedure:

1. /*"Make all d6 D fully permutable by skewing“*/
For all i, 2 5 i S n
For all j,1 S j < i
BkOI(L.’, L,,nax|'—d,~/d.'l), such that d.- > 0
Endfor
Endfor

2. [*‘Shew the innermost loop L" by each of the outer loops“/
For all h, 1 S h < n
skev(Ln, L;., 1)
Endfor

3. /**Permute the innermost loop L. to the outermost position"/
per-uts(L1, Ln)

Figure 6.2. Algorithm to obtain ﬁne-grain parallelism using the kernel transformations on
perfect loop nests.

 

cases where the kernel unimodular transformations cannot and uses loop ﬁssion to reduce
the cardinality of the dependence vector set D. The dependence vectors in D form the
constraints on possible coarse—grain parallelism. Thus, reducing the number of constraints
(i.e., reducing the cardinality of D) results in more potential parallelism.

Next, an algorithm is presented that obtains data locality. Data locality has not previ-
ously been addressed with respect to matrix transformations. However, loop blocking was
used by Wolf [24] to obtain data locality external to the matrix-based framework. We use

loop ﬁssion and fusion to obtain locality within the matrix-based model.

79

Finally, two architecture-speciﬁc algorithms are given that transform source code for a
given machine. Speciﬁcally, an algorithm for vector machines is discussed, as is an algorithm

for a shared-memory system.

6.3 Fine-grain parallelism for imperfect-only-child loop

nests using the kernel transformation set

This section describes a modiﬁcation to Algorithm kernel_fine that allows it to be applied
to imperfect loop nests. The algorithm, named mod-kernel_fine, applies the wavefront
transformation to partitions or regions of the nested loop. We show that if m statements
(51, . . .,.S'm) are nested at q unique levels in a n—deep loop, then the source code can be
transformed such that it has a minimum of n — q degrees of parallelism.

As an example, the source code shown in Figure 6.3 has an assignment statement SI
nested i-deep, and several assignment statements (S2, . . ., Sm), nested n-deep. Therefore,
there are two unique levels at which assignment statements are nested (q = 2), and the

source code can be transformed such that n — q = n — 2 degrees of parallelism result.

 

L1 for11=...

L2 forI2=...

Li forI.-=...

S1 A[I1,...,I.']=...

L,- forI,=...

[1,4,1 for Ij+1 =

Ln forIn=...

32 B[Il,...,]n]=...
Sm C[Il,...,1n]=o..

Figure 6.3. Sample code: n-deep imperfect-only-child loop nest.

 

80

Algorithm mod-kernel_fine, given in Figure 6.4, is similar to Algorithm kerneLﬁne
except that it parallelizes within regions that are deﬁned by imperfectly nested assignment
statements. Using the source code shown in Figure 6.3 as an example, the Algorithm
mod_kernel_f ins ﬁrst parallelizes loops L1 through L,- using the wavefront transformation.
Next, loops L, through L, are parallelized using the wavefront transformation. The source
code is parallelized in regions because the legality condition of loop permutation does not
allow interchange of loops that are not perfectly nested within each other. Any two loops in
the range L1 through L,- can be interchanged as needed, as can any two loops in the range
L, through Ln. However, loops cannot be permuted between the two ranges.

The ﬁrst step of Algorithm mod_kernel_fine is to transform all distance vectors in the
dependence vector set into fully permutable form using loop skewing. Loop skewing is legal
regardless of the nesting structure of the source code. That is, loop skewing can be legally
applied across imperfectly nested loops.

A signiﬁcant result of the skewing transformation is that any partial distance vector
is also fully permutable. As an example, if the distance vector d = (d1, . . .,dn) is fully
permutable, then so is any (dg, . . .,d,-), where 1 S i S j S 11. All of the distances are non-
negative in a fully permutable distance vector, and therefore, any partial distance vectors
(dg, . . . , d,) are either the zero vector or are strictly positive.

If the distances in a region are all zero ((dg, . . .,d,~) > 0) then all loops L.- through L,-
can be parallelized by disj2 of Expression (6.1). If the distance vectors are strictly positive,
then the inner loop in the region can be skewed by each of the outer loops in the region
and moved outermost. This strategy is similar to the wavefront transformation, however, it
is applied across a region of a n-deep loop nest bounded by nested assignment statements,
rather than all n loops.

The motive for the application of region wavefronting is to maximize parallelism despite
the constraints imposed by the legality of loop permutation. That is, loops are permuted
only within a region. Maximal parallelism within a n—deep imperfect-only-child loop nest
is guaranteed because maximum parallelism is obtained within each region bound by an

imperfectly nested assignment statement.

81

 

Algorithm mod-kernel-fine
Finds a n x n transformation matrix T that parallelizes a minimum of n - q loops of a n-deep imperfect-
only-child loop nest, where q is the number of unique depths at which assignment statements are nested.

Input:

D /"'*the initial distance vector set**/
depth[1, ' ’ q] ”tonic“ 53‘ of unique depths at which assignment statements are nestea'"/

Output:

T /*"‘the n x n transformation matrif‘]
D' /"‘*the transformed distance vector set**/
Procedure:

1. /”Malce all d6 D fully permutable by skewing”/
For all i, 2 5 i S n
For all j, l S j < i
skev(L.-, L,,nax|'—d,/d:]), such that d.- > 0
Endfor
Endfor

(A) /"Initialize the counters**/
deptlu'ndez = 1
start.depth = 1

(B) /**For each partition of the loop,” /
Repeat
Wide?“ = depth[depth.indez]

(C) /**Skew within regions"/
For all k, starLdepth S h < end.depth

I’k°'(Lend.depth1 1"“ 1)
Endfor

(D) /"Permute within regions"/
”1111“”: start .depth’ L en d. dep th)

(E) /**Increment the counters*"‘/
deptthdez ++
starLdepth = end.depth + 1
Until deptthdez > q

Figure 6.4. Algorithm to obtain ﬁne-grain parallelism using the kernel transformations on
imperfect-only-child loop nests.

 

82

As an example, Algorithm mod_kernel-f ins is applied to the source code given in Fig-
ure 6.5. The initial distance vector set for the source code is D = {(0,0, 1, —1), (0,0,0, 1)}.
Assignment statement 5'1 is nested 2-deep and assignment statements S2 through 55 are
nested 4-deep. An array, named depth stores the ordered set of unique depths at which

assignment statements are nested. Thus, q = 2 and depth[1 . . .q] = [2,4].

 

L1 for 11 = 1,051
L2 for 12 = 1,1152
51 A[I1,I2]=...
L3 for 13 = l, 1153
L4 for 14 = l, 1164

s2 B[I;,I2, 13,14] =
53 B[I1,I2,13—1,14+l]=...
s. C[1,, 1213,14 =

$5 C[I1,I2,13,14 - 1] = . . .
[as #D = {(0, 0, l, —1),(0, 0,0, 1)} :1- t/

Figure 6.5. Sample code: initial code before application of modJrernelJine.

 

Step 1 transforms the distance vector set such that all d E D are fully permutable.
Speciﬁcally, 100p L4 is skewed by loop L3 and the resulting distance vector set is D“) =
{(0, 0, 1, 0), (0, 0, 0, 1)}. Note that all of the distances in the transformed distance vector set
are non-negative.

Step 2(A) initializes the value for indexing the array depth, and Step 2(B) iterates
through depth. Each region of the loop nest, deﬁned by the depths of the assignment
statements is traversed. The ﬁrst region to which the transformation is applied is loops L1
through L2. That is, end.depth = depth[1] = 2. The next iteration of Step 2(B) processes
loops L3 through L4. That is, start_depth = end_depth+1 = 3 and end.depth = depth[2] =
4.

Steps 2(C) and 2(D) are equivalent to Steps 2 and 3 from Algorithm kerneLfine,
respectively, except that the wavefront transformation is applied to regions of the n-deep

loop nest, rather than all n loops. The skewing transformation (Step 2(C)) skews loop

83

L2 by loop L1. However, since d1 and d2 are zero for both distance vectors, the resulting
distance vector set is 00(0)) = {(0,0,1,0),(0,0,0,1)}. Next, loop L4 is skewed by loop
L3 resulting in D(2(C)) = {(0,0,1,1),(0,0,0,1)}. Finally, loops L3 and L4 are permuted
resulting in DMD” = {(0,0, l,1),(0,0,1,0)}. Loops L1,L2, and L, are parallelized using

the deﬁnition for parallelism given in Expression (6.1) as shown in Figure 6.6.

 

L1 forall J1 = 2, ub2 + 001

L2 torall J2 = nax(l, J1 — ub2),nin(ub1, J1 - 1)
S1 A[J2, J1 - J2] = . . .

L3 for J3 = 3, 1154 + 2ub3

L4 forall J4 = nax(l, [Jig—“*j),nin(ub3, [=1ng
52 B[J2,J1 -J2,J4,J3-2J4]=...
$3 B[J2,J1-J2,J4-—1,J3-2]4+1]=...

s. c11211 — 1,,1.,13 — 21.] = . ..
s. c112, 11 — 12,1213 — 21, — 1] =

/: *D’ = {(o,o,1,1),(o,o,1,0)} t t/

Figure 6.6. Sample code: transformed code after application of modlerneliine.

 

6.4 Coarse-grain parallelism for imperfect-only-child loop

nests using the extended transformation set

Figure 6.7 gives an algorithm, named ext-coarse, that uses loop ﬁssion within a
matrix-based framework to obtain parallelism where kernel unimodular transformations
alone would not. In addition to using loop ﬁssion, the permutation step of Algorithm
kerneLcoarse is modiﬁed in Algorithm ext_coarse. The modiﬁcation is due to the ille-
gality of loop permutation for imperfectly nested loops.

Speciﬁcally, if there are any statements between loops L1 and L2, then loops L1 and L2
cannot be permuted. However, if there are no statements between loops L1 and L2, then

the loop with the highest iteration count is moved outermost.

84

Algorithm ksrnel-coarse (Figure 6.1) terminated if any d6 D did not have the same
basis a(a1,a2). The distance vectors that cannot be placed in basis form are further ex-
amined in Algorithm ext-coarse. If possible, the statements causing the non-basis dis—
tances are separated into adjacent \It-blocks by loop ﬁssion. Chapter 5 showed that strongly
connected components (SCC) of the dependence graph cannot be divided by loop ﬁssion.
Consistent with that constraint, Algorithm ext_coarse does not divide SCC’s of the depen-
dence graph. However, if the non-basis distances are not in a strongly connected component,
then the statements causing the distance are placed in separate 111—blocks. Thus, the dis-
tance vectors preventing parallelism that were originally in the distance vector set D become
loop-independent and are consequently ignored.

Loop ﬁssion can be applied to both perfect and imperfect loop nests, and thus, an
algorithm that uses loop ﬁssion can be applied to both perfect and imperfect loop nests.
The source code given in Figure 6.8 cannot be parallelized using Algorithm kernel_coarse,
but can be parallelized using Algorithm ext_coarse. We want to parallelize L1 because
the iteration count is signiﬁcantly greater than the iteration count of L2. The strongly
connected components of the dependency graph are {(51, S2), (53), (34)}.

The distance vector (0,3) can be placed in basis form 3(0,l), where a = 3 and the
basis is (0,1). The distance vector (0,4) can be also be placed in basis form 4(0,1), where
a = 4. However, the distance vector (1,2) cannot be placed in basis form. That is, there is no
positive integer a that, when multiplied by (0,1), equals (1,2). Therefore, the distance vector
(1,2) causes termination of Algorithm kernel-coarse. Algorithm sxt_coarse determines
that $3 and 5.2 are not in the same strongly connected component and the statements are
placed in adjacent loop nests by loop ﬁssion, as shown in Figure 6.9. The outer loops of both
\It-blocks are parallelized in the source code in Figure 6.9 because no dependency prevents

it.

85

 

Algorithm ext_coarse
Finds a 2 x 2 transformation matrix T that parallelizes the outermost loop of a doubly-nested loop, if one
exists.

Input:

D /**the distance vector set"/
SCC(D) [*‘strongly connected components of D”/

Output:
T(\It,) /*"‘the 2 x 2 transformation matrir"/

Procedure:

1. If there are no statements between loops L1 and L2, then move the loop with the highest
iteration count to the outermost position using permutation.

2. Determine if all d6 D can be put in basis form a(a1, a2), where a is a positive
integer and (a1, a2) is lexicographically positive.

(A) If any J6 D in non-basis form is caused by statements in the same
strongly connected component, then terminate;

(B) If any J E D in non-basis form is caused by statements in different
strongly connected components, then place the statements in
separate \Il-blocks;

(C) If all d6 D are in basis form, then continue.

3. For each 111-block, find T[2, 1],T[2, 2] and g = GCD(a;,a2) such that
g = a; x T[2, 1] + a2 x T[2,2], where T[1,1] = a2/g and T[l, 2] = —a1/g.

Figure 6.7. Algorithm to obtain coarse-grain parallelism using the extended transformations
on imperfect-only-child loop nests.

 

86

 

L1 for 11 = 1,1000

L2 for I2 = 1,6

51 A[I1,I2] = B[I1,I2 - 3]
S2 B[I1,I2] =A[I1,I2 —4]
53 C[I1,I2] =

S4 C[I1-1,I2-2]=...

/* *D = {(0.3), (0,4), (1, 2)} * */

Figure 6.8. Sample code: parallelization not allowed by kernel unimodular transformations.

 

 

L1 forall I1 = 1, 1000

L2 for I2 = 1,6

S1 A[I1,I2] = B[I1,I2 — 3]
S2 B[I1,I2] =B[I1,I2 -4]
$3 C[I1,I2]=...

L3 tor-all I1 = 1,1000

L4 for I2 =1,6

S4 C[I1-1,I2—2]=...

Figure 6.9. Sample code: parallelization after loop ﬁssion.

 

6.5 Data locality for imperfect-only-child loop nests using

the extended transformation set

This section discusses an algorithm that uses transformations from the extended transfor-
mation set to obtain data locality. Wolf [24] discusses the use of loop blocking (tiling) to
obtain data locality, where loop blocking is considered as a separate transformation outside
of the unimodular model.

We present an algorithm that uses loop ﬁssion and fusion to obtain data locality. The
legality conditions for loop ﬁssion and fusion are not dependent upon source code structure.
Correspondingly, this algorithm is applicable to both perfect and imperfect-only-child loop

nests. Chapter 5 discussed the legality criteria for loop ﬁssion and fusion, but did not discuss

87

the context in which the two transformations should be applied. Algorithm ext_1ocality
shows when to apply loop ﬁssion and fusion in order to obtain data locality.

Reuse was earlier deﬁned as the use of the same datum on different iterations of a
loop nest, and locality refers to the reused data remaining in the same memory hierarchy
level (e.g. cache) between uses. The source code in Figure 6.10 shows that reuse does not
guarantee locality, especially if the cache is small. That is, A[1] is reused in each iteration of
L1, however, A[1] is local only if ub2 is smaller than the cache size. Data reuse is a property
of the source code and is not dependent upon the loop structure or machine. In contrast,
locality is dependent upon machine characteristics, such as cache size. Thus, increasing
data locality results in a decreased number of memory accesses and improved use of the

cache.

 

L1 for 11 = 1,1151
L2 for 12 = 1, ub2
$1 A[I2] = . . .

Figure 6.10. Sample code: showing reuse and locality.

 

Section 2.1 discussed the representation of array subscript expressions using array index
functions and uniformly generated sets. A uniformly generated set is an equivalence class,
such that any two members of the set have the same array index function H and refer to
the same array. Therefore, our objective is to determine the uniformly generated sets with
maximum cardinality or simply to form the largest possible groupings of similar references
to the same array. Loop ﬁssion and fusion are used to separate and/ or combine groupings
of statements and provide the basis for the algorithm that improves locality.

As an example, all assignment statements shown in Figure 6.11 reference the same array
A. However, the array index function of 31 is the same as 54. Similarly, the array index
function of S2 is the same as 53. Thus, the greatest locality results from grouping S2 and

5'3 into one equivalence class, and 31 and 34 into another.

88

 

O W
H O
1

S1 A[3I1 + 2,12,3]:

a:

L1 for I1 = 1,1101;

F
L2 for 12 = 1,ub1, .
51 “311 + 2,12,3] = . . . S? “311 + 2.12.11 + 3].

 

 

 

 

 

2
$2 l[311+2.12,11+3]=... 1 0‘ ‘ .3.
L3 for11=1,ub1, ,3 0 J .
1" ‘°’J’=1"‘b’= S 131 1 16 J - o 1 1
$3 1[311.12+16,11]=... 3 [ 1’ 2+ "1‘ [, 0 12

S, A[3J1,J2+16,1]=

w
0
I

34 A[3J1, J2 + 16,1] :

C
1—1
+
H
03

+
I I
Chic
0:

”“5

 

coo

HO
l l l J 1

NH

H

+

ON

 

 

 

Figure 6.11. Sample code: statement groupings for data locality {(51, S4), (S2, 33)}.

 

Algorithm ext_locality, shown in Figure 6.12, obtains data locality using loop ﬁssion
and fusion. The objective of the algorithm is to partition a set of statements (51, . . . , Sn)
into the largest possible sets, such that the statements in the partitions have similar accesses
to the same variable. The input to the algorithm is two adjacent 9-blocks (loop nests). The
algorithm uses loop fusion to combine similar accesses to the same variable and loop ﬁssion
to separate accesses to different variables. Note that the statements in different 9-blocks
are only combined if the loop bound sets can be adjusted and the statements are separated
only if they are not in the same H-block, regardless of the perfect or imperfect nesting
structure of the source code.

The source code shown in Figure 6.13 becomes the source code shown in Figure
6.14 after application of Algorithm ext_1ocality. The partitions are initialized to
the strongly connected components of the source code {(51), (52,36),(S3),(S4,S7), (35)},
then through pairwise comparison the assignment statements are combined into larger
sets. The transformation results in the following partition of the assignment statements:
{(51, 35), (S2, 53, Se), (S4, 57)} because each set has similar references to the same variable

as indicated by the array index functions.

89

 

Algorithm ext_loca1 ity

Partitions the assignment statements in two adjacent G—blocks into groups with increased data locality.

Input:

39, /"”"loop bound set for the ﬁrst loop nest"‘*/

Be, /”loop bound set for the second loop nest"/

Doha, /”the distance vector set for the first loop nest**/

092,62 /**the distance vector set for the second loop nest**/

59, /"set of assignment statements in the ﬁrst loop nest"/
59, /"set of assignment statements in the second loop nest”*/
Output:

\I'1, . . . , \Itm /"‘*partitions of Se, and Se,**/
Procedure:

1. If can.adiust(Be,,Be,)

(A) Initialize the partitions \Ih, . . . , \Ilm to the strongly connected
components of the assignment statements.

(B) For each pair of assignment statements S: and 5,:

(i) If S.- and S, are in separate SCC’s
(a) If S.- and SJ have the same array index function for the
same variable then combine the partitions containing 3.- and 5,.

Figure 6.12. Algorithm to improve data locality using the extended transformations on
imperfect-only-child loop nests.

 

90

 

L1 for 11 = l,ub1

S1 G111] = . . .

L2 for I2 = 1,ub2

S2 A[I1, I2] = B[I1 -- 1,12]
53 var; = B[I1 — 1,12 -1]
S4 €111,212] = D[I1,212]

L3 for J1 = 1, ub1

$5 ...=G[J1-2]

L, for J2 = l,ub2

So B[J1, J2] = A[J1 -- 1, J2]

S7 D[J1 - 1,2J2] = C[J1 - 1, 2J2]

ft #800 = {(51), (S2, 55), (53), (34,57), (55)} t t/

Figure 6.13. Sample code: before application of data locality algorithm.

 

Comments on coarse-grain parallelism and data locality algorithms. Intuitively,
Algorithm sxt_coarse and Algorithm ext-1ocality could be merged into one algorithm
that combines both parallelism and data locality goals. However, these two goals often
conﬂict. That is, source code that is transformed to obtain maximum parallelism may have
poor data locality, and vice-versa.

As an example, the source code shown in Figure 6.15(a) has two singly-nested loops. In
the form shown in Figure 6.15(a), both L1 and L2 can be parallelized since there is no data
dependence that prevents parallelization. However, statements 5'1 and 32 have the same
array index functions and could be combined if data locality were a priority. After the loops
are combined, as shown in Figure 6.15(b), the data locality is improved but the enclosing
loop can no longer be parallelized due to the data dependence between statements 31 and
$2.

This simple example shows that parallelism goals and data locality goals may conﬂict,
and blindly combining Algorithms ext_coarse and ext_1ocality may result in an algo-
rithm that obtains neither parallelism nor data locality. A possible approach for a combined
algorithm is to assign a weight to both parallelism and data locality goals. When a potential

conﬂict arises, then the goal with the greater weight is met at the expense of the goal with

91

 

L1 for K1 = 1,ttb1
S1 G[K1]=...
S5 ...=G[K1-2]

L1 for K1 = 1,1101

L2 for K2 = 1,052

$2 A[K1,K2]=B[K1 -1,K2]
53 var; = B[K1 — 1,K2 - 1]
Se B[K1,K2] = A[K1 - 1,K2]

L3 for K1 = 1,051
L, for K2 = l,ub2

Sq, C[K1,2K2] = D[K1,2K2]
37 our, — 1, 2K2] = cm — 1, 2K2]

/s at Final Partitioning = {(31, 55), (52,53,56),(54.S7)} * */

Figure 6.14. Sample code: after application of data locality algorithm.

 

 

L1 forI1=... L1 for11=...

S1 ARRAYUﬂ = . . . S1 ARRAYU1] = . . .

L2 torJ1=... S2 ...=ARRAY[I1-l]
S2 ...=ARRAY[J1—1]

Figure 6.15(a). Sample code: parallel loops, but Figure 6.15(b). Sample code: data locality, but no
no data locality. parallel loops.

the lesser weight. That is, if parallelism has a higher weight, then the example source code
is transformed as shown in Figure 6.15(a). If data locality has a higher weight, then the

source code is transformed as shown in Figure 6.15(b).

6.6 Imperfect-only-child loop nests on vector architecture

machines using the extended transformation set

Vectorization is the process of modifying source code for execution on a vector machine.
The legality criteria for vectorization was given in Section 2.4. However, the legality criteria

deﬁnes what must be true for vectorization to occur, but does not describe what vectoriza-

92

tion is. Vectorization is SIMD execution of assignment statements, that is, all iterations of
an assignment statement S,- execute before any iterations of assignment statement S,- begin
execution, where i < j.

As an example, the source code in Figure 6.16(a) is shown before vectorization. The
source code after vectorization is shown in Figure 6.16(b) and uses typical vector notation.
The notation states that all 17: iterations of 5'1 execute simultaneously. After completion of
S], then all m iterations of S2 execute simultaneously. The loop L1 is implied in the array

notation and is eliminated from the source code shown in Figure 6.16(b).

 

L; for I; = 1,m S; A[1 : m] = B[l : m]
S; A[I1] = B[I1] S2 D[1 : m] = A[l : m]
32 D[I1] = A[I1]

Figure 6.16(a). Sample code: before vectorization Figure 6.16(b). Sample code: after vectorization.

 

Some objectives of a compiler for a vector machine are to:

o minimize the work done in each loop nest;

0 move the legal vectorizable loop with the greatest iteration count to the innermost
position;

0 match vector register length to iteration count for the innermost loop;
0 evaluate if vectorization is worthwhile; and

0 ensure that subscript expressions remain simple.

Each of these objectives is described in further detail below.

A compiler for a vector machine uses loop ﬁssion to minimize the work done in each
loop nest. Vectorization of a loop L,- in a loop nest is applied universally to all statements
(51, . . .,Sm) in the loop nest. If any statement(s) in the nest are non-vectorizable, then
none of the statements are vectorized. Loop ﬁssion is used to partition the statements into

the smallest, integral sets, namely II-blocks. Separation into II-blocks isolates the non-

93

vectorizable partitions of assignment statements so that the remaining partitions can be
vectorized.

One objective of a vector machine is to execute as many iterations as possible as vector
operations. We assume that vector operations are applied to only the innermost loop of a
loop nest. Correspondingly, we want to permute the loop with the greatest iteration count
Ln,“ to the innermost position. As with previous algorithms, the legality condition of
loop permutation does not allow the transformation to be applied across imperfectly nested
loops. Loop Lma, is found in the innermost region of the loop nest. As an example, the
source code shown in Figure 6.18 has assignment statement SI nested 1-deep and assignment
statements S2 through S3 nested 3-deep. Loop Ln,” is either of loops L2 or L3, but cannot
be loop L1 since loop L1 cannot be permuted innermost.

The cost of loading a vector register and executing vector operations may exceed the
value of the resulting speedup in vector mode. As an example, if the vector register length
is 256 elements, but the iteration count of loop Lmu. is 32, then vector operations may not
be justiﬁed. That is, it may be more efﬁcient to execute loop Lma, in sequential mode.
Therefore, a machine-dependent threshold is speciﬁed. If the iteration count of loop Lma, is
below the threshold value, then the assignment statements are executed sequentially. If the
iteration count of loop Lma, is above the threshold value, then the assignment statements
are executed as vector operations.

Assuming that the iteration count of loop Ln,” exceeds the threshold value, it is strip
mined according to some characteristic length of the target machine, such as vector register
length. As examples, the Convex C100 and C200 series vector machines have 128 element
vector registers [29]. Strip mining ensures that the innermost loop is optimized so that the
iteration count of the innermost loop and the vector register length exactly match.

Finally, subscript expressions need to remain simple for arrays in a loop nest that is vec-
torized. Typically, a gather-scatter directive consolidates a sparse array for vectorization,
then redistributes it after vector operations. Optimization manuals for vector machines
imply that gather-scatter only works for simple subscript expressions, so we assume that

all subscript expressions are linear functions of one index variable plus a constant. This as-

94

sumption eliminates the possibility of using a wavefront transformation because it typically
results in complex, coupled subscript expressions.

A newly developed algorithm for optimizing source code for a vector machine, named
ext_vector, uses the extended and kernel set of loop transformations is given in Figure
6.17. Speciﬁcally, the algorithm uses loop normalization, ﬁssion, permutation, and blocking
to obtain vectorization.

As an example, Algorithm ext-vector is applied to the source code shown in Figure
6.18. The source code has an imperfect-only-child loop nest because assignment statement
51 is nested between loops L1 and L2. Also, loops L1 through L3 are initially normal,
which greatly simpliﬁes the mathematics of the following discussion without compromising
the generality of Algorithm ext-vsctor.

Step 1. Normalize all loops. As mentioned, the source code shown in Figure 6.18
already has normal loops.

Step 2. Find the strongly connected components (SCC’s) of the loop
nest. SCC’s of the dependence graph are sets of assignment statements that are in cy-
cles, and individual assignment statements. Thus, the SCC’s for the source code are
{(51),(S2,S3),(S4),(Ss),(Se)}. The dependence from assignment statement 55 to 33
was initially loop-carried, but after loop ﬁssion is loop-independent. The distance vec-
tor (0,0,1) is eliminated from the distance vector set and the new distance vector set D“)
is {(0,1,2),(1,0,0)}.

Step 3(A). For each II-block, ﬁnd the permutable loop with the highest
iteration count. There are two unique depths at which assignment statements are nested
(q = 2) and so depth[1 . . .2] = [1, 3]. That is, assignment statement 5'1 is nested 1-deep and
assignment statements S2 through Se are nested 3-deep. Loop Ln,“ is chosen between loop
Ldepth[q—1] = Ldepthp] = L; and Ldepth{q] = Ldepth[2] = L3. Loop L2 has an iteration
count of 10000, and loop L3 has an iteration count of 32. Thus, loop Lmar is assigned loop

L2.

95

 

Algorithm ext_vector

Optimizes source code for execution on a vector architecture machine using the kernel and extended sets of
loop transformations. Speciﬁcally, the objectives of the algorithm are to: minimize the work done in each
loop (loop ﬁssion); move the vectorizable loop with the greatest iteration count to the innermost position
(loop permutation); match the vector length to the iteration count of the innermost loop (loop blocking);
evaluate if vectorization is worthwhile; and ensure that the subscript expressions remains simple.

Input:

{L1, . . . , Ln} /”an n-deep loop nest"/
C(V, E) /”a dependency graph for the source code where V are the
assignment statements and E are the dependences in the source code**/
{51, . . . , Sm} /**set of assignment statements in the loop nest”/
b.- /"'*bloclcing factor, such as vector register length**/
depth[1 . . . q] /**ordered set of unique depths at which assignment statements are nested”/

Output:
{\Ill, . . . , ‘11,} /**sets of assignment statements optimized for vector machines"/

Procedure:

1. /** Normalize all loops in the loop ne‘hhtl/
For all I"! 1 S i S n, nonalize(L.-)

2. / ”F ind the strongly connected components (II-blocks) for the loop nest."/
{1111 , . . . , \Itq} = SCC(G)

3. /”For each lIobloclcf'V
FOIHJl‘I’j,lSqu
(A) / "F ind the permutable loop with the highest iteration count Lm¢,.“/
Lma: = HIGHESTJT.COUNT(\II,-) between Ldepth 2-1] and L depth[g]

(B) /"If the iteration count of Lma, is less than threshold, where threshold = b.-**/
threshold = b.-
If L...” < threshold

Then terminate.

(C) Else
(i)/"Move Lma, innermost"/
pornute(L,M,, Ln)
(ii)/"Block innermost loop by vector register length.*"‘/
block(Ln, bi)
(iii)/”Put blocked loop in vector notation"/
vectorize(Ln+1)

Figure 6.17. Algorithm to optimize source code for a vector machine using the extended
transformations on imperfect-only-child loop nests.

 

96

 

L1 for I1 = 1,3

31 Alb] =

L2 for 12 = 1, 10000

La for 13 = 1,32

52 B[Ii.12,13]=0[11 -1112.13]

53 C[I1,I2,13] = B[I1,I2 - 2,13 — 1]
$4 E[I1+1,I2,13-1]=...

55 F[11,12,13] = . . .

So F[I1,I2,13-1]=...

/‘ *D = {(0,2,1).(1,0,0),(0.0,1)}
b.’ = 256
SOC = {(51), (52, 53), (50,095), (56)} * */

Figure 6.18. Sample code: before application of Algorithm ext.vector.

 

Step 3(B). If the loop Lmax has a low iteration count, then terminate. Else,
we continue. The iteration count of Lma, is 10000. The threshold is equal to the block
size b,-, and has the value of 256 in this example. Therefore, we continue.

Step 3(C(i)). Interchange loop Ln,” with the innermost loop. The second loop
Lm” and the third loop L3 are interchanged. The transformation matrix representing the

interchange is:

OOH
HOG
91-56

and the transformed distance vector set is:

11(2): {(0, 1,2),(1,0, 0)}.

Step 3(C(ii)). Block the innermost loop by b,- = 256. The innermost loop is
blocked so that the iteration count matches the vector register length. The transformation

matrix is:

OOOH
38cc:
oak-c:

97

The transformed distance vector set is:

0‘3) ={(0.1.15:31.2),(0A,[5231.2),(1,0,L-2°al,0),(1.0, [591.00
= {(0,1,0,2),(0,1,1,2),(1,0,0,0)}.

Step 3(C(iii)). Place the source code in vector notation. In the transformed
distance vector set 0(3), all of the loop-carried dependences are carried at a depth less than
four. Thus, no dependence cycles exist for the innermost blocked loop and the source code
can be vectorized. The source code shown in Figure 6.19(a) shows transformed source code

before being placed in vector notation, and Figure 6.19(b) shows the same source code in

vector notation.

 

L1 torI1=l,3 S1 A[1:3]=...

S1 A[I1]=... L1 for I1 =1,3

L1 for I1 = 1,3 L3 for I3 = 1,32

L3 for I3 = 1,32 L; for I2 = 1,10000,b.‘

L; for I; = I, 10000, bi S2 BUM]; :I; + b.‘ - 1,I3] =

L2 for I2 = I2,nin(12 + be —1,10000) C[I1 - 1,12 :12 + 04 - 1,13]
32 B[I1,I2,Ia] = C[I1 — 1,12,I3] $3 C[I1,I; :1; + be - 1,13] =

53 C[I1,I2,I3]=B[I1,I2-2,I3-1] B[I1,I$-2:I;+bi-3,I3-1]
L2 for I1 = 1,3 L1 for I; = 1,3

L3 for I3 = 1,32 L3 for I3 = 1,32

L; for I; = 1, 10000, b.‘ L; for 12 = 1, 10000, 05

L2 for I2 = I2,lin(12 +b.‘ — 1,10000) $4 E[I1 + LI; :1; + b.‘ - 1,13 — 1] = . . .
S4 E[I1+1,I2,I3-1]=... L1 forl; =l,3

L; for I1 = 1,3 L3 for I3 =1,32

L3 for I3 = 1,32 L; for 12 = 1,10000,b,‘

L; for 12 = 1, 10000, b.- 55 1111,12 :12 +b.- — 1,13] =

L2 for 12 = I§,uin(I; + b.- — 1,10000) L, for 11 = 1, 3

55 F[I1,I2,I3] = . . . L3 for Ia =1,32

L1 for I1 = 1, 3 L; for I; = 1,10000, 0:

L3 for13=1,32 Sa F[I1,I;:I;+bi—1,I3-1]=...

L; for I; = 1, 10000, b.-
L2 for 12 = 15,-mu; + b.- - 1,10000)
Se F[I1,I2,I3-1]=...

Figure 6.19(a). Sample code: ready for vectoriza-
@-

Figure 6.19(b). Sample code: in vector notation.

98

6.7 Imperfect-only-child loop nests on a shared-memory

machine using the extended transformation set

Our model of a shared-memory machine is MIMD. If p is the number of processors, then
each processor P5, 1 S i _<_ p, may operate on different instructions and a different data
stream. As an example, the source code in Figure 6.20(a) is shown before execution on
a MIMD shared-memory machine. Figure 6.20(b) shows the execution of the assignment
statements 51 and S2 on processors P1 through Pm. (For simplicity, we assume that the
number of processors and number of iterations are the same, that is, p = m.) Note that
there is no implicit or explicit barrier synchronization between execution of 51 and S2 on
processors P,- and P,, where i 9‘. j. That is, assignment statement 52 may ﬁnish execution

on processor P,- before processor P,- begins execution of statement SI.

 

 

 

P, P2... Pm
Ll for-11:1,": s, 1(1)=3(1) 1(2)=s(2)... 1(m)=B(m)

g: 0]I:j:l]i:j 52 l3(1)=l'l(1) 0(2)=A(2)... 0(m)=1(m)

 

 

 

 

 

 

 

 

 

Figure 6.20(3). Sample code: before Figure 6.20(b). Statement distribution on a SMS.
Erallelization

The objectives of a compiler for a shared-memory machine differ signiﬁcantly from those
of a vector architecture machine. Speciﬁcally, the objectives for a shared memory machine

are to:

o maximize the work done in each loop nest;
0 move parallel loops outermost;

s maximize iteration count of the parallelized loops;

In vector machines, the objective is to minimize the work done in each loop, because the
majority of the overhead on a vector machine is loading the vector registers. In contrast,

signiﬁcant overhead for a shared-memory machine is physically distributing the work to

99

the available processors. Therefore, a primary objective of a compiler for a shared-memory
system is to maximize the work done in each loop, which minimizes the work distribution
overhead. Loop fusion is used to combine assignment statements from adjacent loop nests,
which maximizes the work done in each loop nest.

Algorithm mod-kernel_ﬁne (Figure 6.4) is a modiﬁed wavefront transformation that
applies to imperfect-only-child loop nests. The algorithm for shared-memory machines,
named ext_sms and shown in Figure 6.21, uses Algorithm mod_kernel_ﬁne to obtain the
maximal amount of parallelism.

One of the objectives for a shared-memory machine is to move the parallelizable loops
outermost. As in previous algorithms, we cannot violate the legality condition for loop
permutation. That is, if the outermost assignment statement 51 is nested j-deep, then the
two loops with the highest iteration count are found between loops L1 and L,- and moved
outermost. The two loops with the highest iteration count are named Lmaﬂ and Lmu2,
respectively. In previous algorithms, only the single loop with the highest iteration count
was found. In this algorithm, we ﬁnd two loops Lmaﬂ and Lmaz2 because the ﬁnal stage
of the Algorithm sxt-sms is scheduling.

As mentioned earlier, work distribution on a shared-memory machine incurs signiﬁcant
overhead. The actual work distribution protocol is referred to as scheduling. There are two
types of scheduling: self-scheduling, in which the work distribution is determined at run-
time; and pre-scheduling, in which the work distribution is determined at compile time. Loop
collapsing aids scheduling by combining parallel loops that are distributed to processors.
That is, loop collapsing decreases the amount of overhead associated with work distribution
by simplifying the loop structure to which scheduling protocols are applied. Therefore,
we collapse the two loops with the highest iteration counts (L2 and L3) in order to aid
scheduling.

An algorithm that optimizes source code for a shared-memory machine, named ext_sms,
uses the extended and kernel set of loop transformations and is given in Figure 6.21. The

algorithm uses loop normalization, fusion, skewing, permutation, and collapsing.

100

 

Algorithm ext-sns

Optimizes source code for execution on a shared-memory machine using the kernel and extended sets of loop
transformations. Speciﬁcally, the objectives of the algorithm are to: maximize the work done in each loop
(loop fusion); move the parallelizable loops with the greatest iteration count to the outermost position (loop
permutation); and maximize the iteration count of the innermost loops (loop collapsing).

Input:

91 = {M1, . . . ,Mn} /"""an n-deep loop nest**/
92 = {N1 , . . . , Nu} /"‘"'another n-deep loop nest“*/

depth1[l . . . ql] /"'*ordered set of unique depths at which assignment statements in 91 are nested”/
depth2[l . . . q2] /**ordered set of unique depths at which assignment statements in 92 are nested“*/
Output:

L /**the transformed loop nest optimized for 0 SMS“‘*/

Procedure:

1. /**Normalize all loops in both loop nests.**/
For all i, 1 5 i S n
nonalize(M:)
nonalize(N,-)

2. /"‘*Fuse loop nests to maximize work.*"‘/
L = fuse(91,92)
depth = depthl U depth2

3. /**Apply wavefront to obtain 1: — q degrees of parallelism:**/
nod_kernel_1ine(L)

4. /*"F ind parallel loops in nest with highest iteration count and move to positions L2 and L3.**/
LmasI = m(L21 - - - 1 Ldepthn)
Lmu2 = nax(L2, . . . , Ldepthr] - Lman)
per-ute(L2, Laws-1)
penute(L3, Lmaz2)

5. /**Collapse loops L2 and L3 to aid scheduling.**/
collapse(L2, L3)

Figure 6.21. Algorithm to optimize source code for a shared memory machine using the
extended transformations on imperfect-only-child loop nests.

 

101

Algorithm ext_sms is applied to the source code shown in Figure 6.22. The ﬁrst loop
nest 91 has four loops M1 through M4, and assignment statements are nested at two unique
depths. That is, q] = 2 and depth1[1. . .2] = [3,4]. The second loop nest 92 also has four
loops N1 through N4. All assignment statements have perfect nesting. Therefore, q2 = 1
and depth2[l] = [4]. For simplicity, all loops in both the ﬁrst code block 91 and the second

code block 92 are normal.

 

M1 for I1 = 1, 1000

M2 for I2 = 1,32

M3 for 13 = 1,500

31 A[I1,I2,I3] =

M4 for I1 = 1,50

S2 B[I1,I2,I3,I4]=

N1 for J1 = 1,1000

N2 for J2 =1,32

N3 tor J3 = 1,500

N4 for J2 = 1,50

53 ...=B[J1,J2,J3,J4—2]

Figure 6.22. Sample code: before application of Algorithm ext.s1ns.

 

Step 1. Normalize all loops. As mentioned, the source code shown in Figure 6.22
already has normal loops.

Step 2. Fuse 91 and 92. After loop fusion, the assignment statements in loop nest
91 and 92 are combined into one loop nest, named L. The loop-independent dependence
between statement S2 and S3 becomes loop-carried, and the dependence vector set D”) is
{0,0,0,2)}. Assignment statement 51 remains nested 3-deep in the fused source code.

Step 3. Apply modiﬁed wavefront transformation. Algorithm mod-kernel_ﬁne
is applied to the loop nest to obtain maximal parallelism. The imperfectly nested assignment
statement 51 causes wavefront transformation to be applied to loops L1 through L3 and
to loop L4. Loop L3 is skewed by loop L1 and L2, then loop L3 is moved outermost. The
resulting dependence vector is D(3) = {(0,0,0,2)}. According to the parallelism criteria

(Expression (6.1)), loops L1 through L3 can be parallelized.

102

Step 4. Find two permutable loops with the highest iteration counts and
move them outermost. The iteration count of loops L1, L2, and L3 are 1000, 32, and
500, respectively. Therefore, loop L1 has the highest iteration count (Lmul = L1) and loop
L3 has the second highest iteration count (Lmax2 = L3). Loops Lmaﬂ and Lmaﬂ are moved
outermost using loop permutation. The distance vector set remains unchanged.

Step 5. Collapse the outermost loops. Finally, the outermost two loops are
collapsed to aid scheduling. The resulting source code is shown in Figure 6.23. Note that

the two outermost loops are parallel and the transformed distance vector set is D(5) =

{((O X 500) + 0.0.2)} = {(0,0,2)}-

 

L1,2 forall K13 = 1,500000
L3 forall K2 = 1,32
if (K13 nod 500) ==
than K1 = K1,3/500
0188 K1 = [K1,3/500] + 1
K3 = ((K1,3 — 1) mod 500) + 1
S1 L[K1,K2,K3] =
L4 for K4 = 1,50
32 B[K1,K2,K3,K4] =
$3 ...=B[K1,K2,K3,K4—2]

Figure 6.23. Sample code: after application of Algorithm ext_sns.

 

CHAPTER 7

Implementation Model

This chapter gives the requirements analysis for the development of an environment that
supports the theoretic model presented in previous chapters. The approach used for system
analysis is the Object Modeling Technique (OMT) [30], which uses three diagramming
notations to model orthogonal vieWpoints of the system. The Object Model describes system
structure and the relationships of objects to each other. The Dynamic Model describes the
dynamic behavior of the system as it changes from state to state. Finally, the Functional
Model captures data flow through the system. The three models describe complementary

aspects of the system and can be used to guide the entire development process.

7 .1 Overview

The implementation prototype is targeted for a graphics-based framework for matrix-based
loop transformations to target source code. The loop transformations: loop permutation,
loop reversal, loop skewing, loop normalization, loop blocking, strip mining, cycle shrink-
ing, loop collapsing, loop coalescing, loop ﬁssion, and loop fusion, will be represented by
matrices.

Transformation matrices are determined based upon machine-independent goals, such
as parallelism or data locality; or machine-dependent goals, such as optimization for vector
or shared memory systems. Finally, the transformations can be user-directed,where the

user deﬁnes the transformation sequence and the matrix representing the sequence.

103

104

7.1.1 User Scenario

The following is a typical scenario of a session with the proposed framework.

99°!"

The user selects a source code ﬁle to process.
The system checks the syntax of the source code.
The system loads the source code ﬁle into a window.

The user selects a portion of source code (e.g. loop nest, block, all) upon which to
run dependence analysis.

The user selects which dependence test to use, such as:

o Banerjee Test [15, 31];
GCD Test [15];

o Lambda Test;

Power Test [32, 33];
Omega Test [34].

The system runs the dependence test on the target source code and determines the
distance vector set D, which is displayed to the user in a window.

The system allows the user to specify either machine-directed, goal-directed, or user-
directed transformation matrix.

(a) If machine-directed is chosen, then the system prompts the user for machine
type.
0 If vector machine is chosen, then the system generates a transformation
sequence based upon Algorithm ext-vector for vector machines.

0 If shared-memory machine is chosen, then the system generates a transforma-
tion sequence based upon Algorithm ext_sms for shared-memory machines.
(b) If goal-directed is chosen, then the system prompts the user for a goal.
0 If parallelism is chosen, then the system generates a transformation matrix
T based upon Algorithm ext-coarse.
o If data locality is chosen, then the system generates a transformation matrix
T based upon Algorithm ext-locality.

(c) If user-directed is chosen, then the user chooses a transformation from a menu
and a matrix-based representation of the transformation sequence is displayed
in a window. If the transformation that the user chooses is illegal, then a corre-
sponding error message is returned to the user.

0 If permutation is chosen, then the user is queried for two loops.
0 If reversal is chosen, then the user is asked which loop.

0 If skewing is chosen, then the user is asked to supply the outer loop, inner
loop, and the value of the skewing factor.

0 If normalization is chosen, then the user is asked which loop.

105

o If ﬁssion is chosen, then the user is asked to partition assignment statements
into two VII-blocks by choosing individual or blocks of assignment statements.

0 If fusion is chosen, then the user chooses two adjacent 9-blocks.

e If blocking is chosen, then the user is queried for the loop and the blocking
factor.
0 If collapsing is chosen, then the user is asked which two adjacent loops.

8. The transformation matrix T is displayed in a window, and the distance vector set
D’ is displayed in another window.

9. The user can request at any time to check the global postcondition by evaluating D’.

10. The user can request at any time to have the source code transformed based upon the
current T.

11. If the user requests to transform the source code, then the system checks the global
postcondition. If the global postcondition is true, then the code is updated. If it is
not, then an error message is given, and the user is prompted to apply more transfor-
mations.

12. Updated code is displayed in a separate window.

13. The user may save the transformed source code and distance vector set to a ﬁle for
future use.

Figure 7.1 shows a sample graphical user interface for the system. The source code
transformation process proceeds from left to right using this GUI. Popup menus give the

user choices of dependence tests, transformation objectives, and source code mapping.

7 .2 Requirements

Unimodular matrices are a method of representing a sequence of kernel transformations
(loop permutation, reversal, and skewing) in matrix form. The dependences in the source
code are represented by the dependence vector set D, and the transformation sequence is
represented by a matrix T.

The distances in the initial distance vector set D are all lexicographically positive (global
precondition). After transformation, the transformed distance vectors in the distance vector
set must be lexicographically positive (global postcondition). Additional source code trans-
formation goals, such as parallelism, are evaluated based upon the form of the transformed

distance vector set.

106

i a Unimodular Transformations _ i l

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(Find D 9 (Precondition) (Transformations v) ( Update D 8: T) (Postcondition) (Map T v) r. 1
s :9 ‘ ,9 19 i '
(GCD Test ) (Machine Directed > (Fine Gralned ) l
Omega Test Goal Directed Coarse Grains d Bounds H , ‘
Power Test User Directed 9 Both
Banerjee Wolfe Test 9
( Permutation j
Reversal
Skewing
Fission
Fusion
Blocking
Coelesclng

 

 

 

 

 

 

 

Figure 7.1. Graphical user interface.

 

107

The transformations that are represented include the kernel unimodular transformation
set (loop permutation, reversal, and skewing), as well as the extended transformation set
(loop normalization, blocking, collapsing, ﬁssion, and fusion). A sequence of kernel and
extended transformations is represented by the product of the matrices representing the
individual transformations (composability). The source code is mapped only once after the
ﬁnal transformation matrix T is determined.

The transformation matrix T is determined in one of three ways: 1) machine-dependent:
the system generates T based upon machine characteristics; 2) machine-independent: the
system generates T based upon machine-independent goals, such as parallelism and data
locality; or 3) user-directed: T is generated based upon user directives.

The legality of the individual transformations is determined based upon the precondi-
tions discussed in Chapters 3 through 5. If an individual transformation is illegal, then a
popup window displays the legality criteria violation. The system applies transformations
to both perfect and imperfect loop nests. The transformation process continues, regardless

of the source code structure, as long as the individual transformations are legal.

7 .3 OMT Analysis

This section contains the analysis of the system using the Object Modeling Technique
(OMT). The three submodels that comprise the model are the object model, dynamic
model, and functional model. The diagrams that represent the respective models are the

objectdiagram, state diagram, and data ﬂow diagram.

7.3.1 Object Model

The object model describes the overall static structure of the system, where objects are
discrete, distinguishable entities. The relationships between objects are represented in dia-
grams, where a triangle (A) means inheritance and a diamond (0) means aggregation. Five
object models are described below: 1) system; 2) dependences; 3) transformation matrices;

4) transformation sequences; and 5) source code.

108

The object model for the system is given in Figure 7.2. The parallel compiler consists of a
dependence test utility, such as Tiny [19, 35]. Also, the compiler has a GUI, a transformation
matrix T builder, and a source code mapper (for example, the algorithms developed my
Wolf and Lam [4]). The transformation matrix builder consists of a global precondition
checker, a global postcondition checker, and algorithms. The algorithms reﬂect operation
of the system in either machine-independent, machine-dependent, or user-directed mode.
Finally, the mapper has two components: one maps the distance vector set D, and another
maps the source code.

The relationships between pages of the object model are represented by associations in
Figure 7.2. Dependence tests generate the distance vector set D, and also, the distance
vector set D is mapped by the distance vector set D mapper. The source code is mapped
by the source code mapper. Finally, the global transformation matrix is generated by the
algorithms in the system.

The object model representing dependences is shown in Figure 7.3. A distance vector set
D consists of distance vectors, where each distance vector consists of distances. The distance
vector set D can be represented by a dependence graph, where each node is an assignment
statement and there is a one-to-one correspondence between arcs in the dependence graph
and distance vectors.

Figure 7.4 shows the object model for the transformation matrices, and also, the asso-
ciation between the transformation matrix and the transformation sequence (Figure 7.5).
Each global matrix is a product, or sequence, of transformations. Each transformation is
comprised of unimodular and elementary matrices. A unimodular matrix is a type of ele-
mentary matrix with special properties, which represents loop permutation, reversal, and
skewing.

The object model for the transformation sequence is shown in Figure 7.5. A transforma-
tion sequence is comprised of an ordered sequence of transformations. Each transformation
is either a kernel unimodular transformation (loop permutation, reversal, and skewing) or

an extended transformation (loop normalization, blocking, collapsing, ﬁssion, and fusion).

109

 

    

is made up of

Figure 7.2. Object diagram for overall system.

 

110

 

    

is comprised of

Figure 7.3. Object diagram for source code dependences.

 

 

     

is comprised

of

Figure 7.4. Object diagram for transformation matrices.

 

111

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ence
is made up of
{ordered}
ormat1o
: \ is either
Skewing No zation Tallapsing 'on
ermu ation Reversal 31 I Fission

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 7.5. Object diagram for transformation sequence.

 

Figure 7.6 shows the object model for source code, which is the target to which the
transformations are applied. Because of the recursive structure of the source code (i.e., a
loop may be comprised of other loops), the object model may not be the best representation
of the source code. However, the object model is used here for consistency with the OMT
modeling approach.

In Figure 7.6, source code is comprised of loop nests, where each loop nest consists of
loops. Each loop has a header (i.e., for I,- = lb,-, ub2, 3,) which has an index variable, lower
bound, upper bound, and step; assignment statements; and a tail (i.e., endfor). Assignment
statements are comprised of use variables and def variables, which are typically arrays with
subscript expressions. 9-blocks are deﬁned in terms of loop nests and contain groupings
of assignment statements. A ill-block is a speciﬁc type of 9-block with special dependence
characteristics. Finally, a lI-block is a strongly connected component of the dependence

graph, and ill-blocks are comprised of II-blocks.

112

 

  
  

is comprised of

  

is comprised of

    
   

is bound by matched with

 

is a linear
function of

is a linear
function of

Figure 7.6. Object diagram for source code.

 

113

A more rigorous representation of source code is to use Backus-Near Form (BNF). The
BNF of a target language fully describes its structure in a succinct manner. The BNF for

a modiﬁed imperative language is shown in Figure 7 .7.

7.3.2 Dynamic Model

The dynamic model describes the behavior of the system over time. State diagrams rep-
resent the dynamic model, where nodes in the diagram are states and arcs are transitions
(events) between states. In the state diagram, “ur” is shorthand for “user request”. Also,
the parentheses following some transitions are conditions that must be true in order for the
transition to trigger.

The state diagram for the system is shown in Figure 7.8. The system is idle until the
user requests that a ﬁle be loaded. If the syntax of the ﬁle is incorrect, then the system
enters an error state until a ﬁle with good syntax is entered. Once a ﬁle with good syntax
is loaded, the user can request a dependence test to calculate the initial distance vector set
11“”.

There are three possible paths that the user chooses after the distance vector set is
calculated. The machine-independent and machine-dependent paths generate the transfor-
mation matrix T using the algorithms discussed in Chapter 6. The third alternative is for
the user to generate her/ his own transformation sequence.

There is extensive error checking while the user is choosing individual transformations.
Each time a loop transformation choice is made, the legality of the transformation is eval-
uated. If the transformation is illegal, then the user is allowed to undo a transformation
and choose another transformation. This process continues until a legal transformation
is chosen. Once the user is satisﬁed with the transformation sequence, the system checks
if the global postcondition is true. If it is not true, then the user continues to choose
transformations. If it is true, then the system proceeds.

All three paths converge after a transformation matrix T is determined. The user may
request that the source code be mapped from initial to ﬁnal form given the transformation

matrix.

 

 

114

 

program
stlist

statement

constdecl
constassignment
integerdecl
realdecl
vardecllist

vardecl

assignment

loop

expression

Figure 7.7. Backus-Naur Form of target imperative language (modiﬁed Tiny syntax).

stlist

statement[’;’staternent]...

’const’ constassignment [’ , ’constassignment] . . .

ID ’=’ expression

’integer’ vardecllist

’real’ vardecllist
vardecl[’,’vardecl]...

ID

ID ’(’[expression ’:'] expression
[’,'[expression ’:'] expression]... ’)’

lhs ’=' expression
ID

ID ’(' expression [’,’ expression]... ’)’

= [’for’ | ’doall’]

D. II I. I. O. O. I. O. I. I. O. O. I. O. .0 I. O. O. I. O. O. O. O. C. .0

ID ’=’ expression [’to' | ',’ I ’:’] expression
[[’by’ I ’,' | ’:’] expression]
’do’ stlist ’endfor'

ID

ID ’(’ expression [’,' expression]... ’)’
INTCONST

FLOATCONST

expression ’+’ expression

expression ’-’ expression

expression "" expression

expression ’/ ’ expression

expression """ expression

’-’ expression

’+’ expression

’(’ expression ’)’

expression ’<’ expression

expression ’<=’ expression

expression ’<>' expression

expression ’)’ expression

expression '>=’ expression

expression 'mod’ expression
expression ’max’ expression
expression ’rnin’ expression

rmr 1(1 expression r)r

’ﬂoor' ’(’ expression ’/’ expression ’)’
’ceiling’ ’(’ expression ’/’ expression ’)’
’max’ ’(’ expression [’,’ expression]... ')'
min’ ’(' expression [’,’ expression]... ’)’

 

E

115

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

System Idle
ur = ﬁlename ur = ﬁlename
{good syntax} {bad syntax}
Error State ur = filename
(Source Loaded (bad syntax) {bad syntax}
ur = ﬁlename
_ {good syntax}
ur — dependence test ur = t“)
. {illesd(t(‘))}
fDis. Vec. Set ur = user-directed Tr Error State
k D(°) Found \(illesal trans.)
ur = mach- ur = mach- “r = t(3)
indep goals dep goals {legal(t(’))}
( ) Interm. T “r = t(") A “r = t(")
oals Ch
Goals Chosen) CG osen C Generated {hang-10)} {mesalwﬂn
system system ur = ti‘)
generates T generates T {legal ( t( 3)»
J'Iians. Mat. .( Error State
\ T0) Found ur = done K(violate post)
{globstost} “1' = done
ur = map code {H(global.posi)}
Source Code
Mapped

 

Figure 7.8. State diagram for system.

 

116

7.3.3 Functional Model

The functional model captures the ﬂow of data through the system, and is represented by
a data ﬂow diagram (DFD). DFD’s are described in terms of levels, where each successive
level shows greater detail of the data ﬂow in the system. In a DFD, the ovals represent
processes, that is, producers and consumers of data. The boxes represent actors that are
external entities that move data through the system. Two horizontal lines represent a data
store, and the arcs represent data ﬂow.

Figure 7.9 is the level 0 data ﬂow diagram for the implementation, also known as a
context diagram. The process Transformer does the actual source code transformations.
As input, Transformer needs the initial distance vector set D(°) and the original source code.
As output, Transformer produces a transformed distance vector set D”) and transformed
source code. We assume that an external process, such as Tiny [19, 35], is used to determine

the distance vector set from the initial source code.

 

 

 

 

 

 

 

 

 

 

 

 

 

(I)
\ D :
C Transformer 2 D Set
I
0(0)
original transformed 0(0)
source source
_ Dependence Anal.
Source Code 0 . . al Utility Tiny
source

 

 

 

 

Figure 7.9. Level 0 data ﬂow diagram.

 

The Transformer process is examined more closely in the level 1 data ﬂow diagram
shown in Figure 7.10. The initial distance vector set DM is used when determining the
transformation matrix T. The source code is mapped from its initial to ﬁnal form as long as

both the global postcondition and goals are met, as determined from D”) = T”) x JG D(°).

117

The transformation matrix and initial source code are used to map the ﬁnal source code.
Both the mapped source code and transformed distance vector set can be saved to external

ﬁles.

 

 

 

 

 

 

 

 

 

 

 

Tia‘asz’r'iiiéi """""""""""""""""""
mapped
source r
a Map Source >
initial :
source :
1(1) {globaLpost(D(’))
and goals(D(‘))}
Source Code 4v D Set

0(1)

 

 

 

 

 

 

 

(0)
C Find T \ D
j {global_pre(D(°))}

 

Figure 7.10. Level 1 data ﬂow diagram.

 

Figure 7.11 shows the level 2 data ﬂow diagram for ﬁnding the transformation matrix
T, and assumes that the user is choosing the transformation sequence. The initial distance
vector set D(°) is input into the process. The user chooses an elementary transformation
matrix t“), and if t(‘”) is legal, then a new global transformation matrix Tm is calculated.
After T0”) is calculated, the distance vector set D”) is updated. The user can choose more
transformations, or map the source code.

The source code is mapped according to the approach described in Section 2.7. First,
inequalities, minima, and maxima are determined from the initial source code. Then the
index variables are transformed based upon the transformation matrix Tm. Finally, new

loop bounds are calculated based upon the mapped index variables and inequality set.

 

 

118

 

 

 

 

 

 

 

 

 

 

 

 

 

1*"
IFiﬁdT""""' 4
I I a
I —C Calculate 0(3) \ I >
i j : 0(1)
I D“) 7‘”. 0"") I
( Calculate it“) D '
I ,(x), Du-I) I
E {Iegal(t(’))} E
: Determine 1”) x i
: C J : D<°>

———————————————————————————————

Figure 7.11. Level 2 data ﬂow diagram for ﬁnding the transformation matrix T.

 

'FI

 

119

 

Miii'CBdéTWéll'iﬁd’Lé’rii)""""m'""'""§

f Step 1:
initial \ Extract Inequalities

 

 

 

 

 

 

code
inequality
set
Step 2:
Find min’s and max’s
min’s and
max’s

 

 

 

 

 

f Step 3:
T”) \ Transform Indices

trans. ineq set,
min’s and max’s

Step 4:
Calculate New Bounds

oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

transformed
source

 

 

Figure 7.12. Level 2 data ﬂow diagram for mapping source code.

 

CHAPTER 8

Related Work

This chapter describes current research in ordering source code transformations. Several of
the approaches are matrix-based, while others are based upon formal speciﬁcation of the
transformations. Some of the approaches suggest static ordering of the transformations to
obtain optimal transformed source code, while other approaches determine transformation
ordering based upon source code structure. Finally, some representations include both

scalar and loop transformations, while others contain only loop transformations.

8.1 Unimodular Matrices

Work by Banerjee [3] shows that unimodular matrices represent a sequence of transfor-
mations on a perfect doubly-nested loop nest, where all dependences are represented by
distance vectors. Banerjee also describes an algorithm that may obtain coarse-grain paral-
lelism and an algorithm that always obtains ﬁne-grain parallelism. Wolf and‘Lam [4] expand
upon Banerjee’s work and show the use of unimodular matrices to represent a sequence of
transformations on an n-nested loop, where dependences are represented as either distance
or direction vectors. Wolf and Lam also give a general approach to obtain coarse-grain
parallelism and prove that n - 1 degrees of parallelism can always be obtained from an

n-nested loop.

120

121

The basis of unimodular transformations is thoroughly described in Section 2.5, and is
brieﬂy described below. Sequences of kernel unimodular transformations (loop permutation,

reversal, and skewing) are represented by a transformation matrix T, where T is:

1. n X n (square);
2. all T[i, j] are integers;
3. Idet TI = 1.

The legality of a transformation sequence and the available parallelism are determined
from the product of T and all of the distance vectors in D. Finally, an algorithm exists to
map the source code from its initial to ﬁnal form given the transformation matrix T.

Some restrictions to the traditional unimodular approach are that unimodular matrices
only represent three kernel transformations and are only applied to perfect loop nests. These
two restrictions imply that, although unimodular transformations are theoretically elegant,
they have limited applicability.

Sass and Mutka [26, 28] have described two different methods for applying unimodu-
lar matrices to imperfect loop nests. The ﬁrst method, called enabling, transforms some
cases of imperfect loop nests into perfect loop nests. The second method, called the phase
method, applies unimodular transformations directly to imperfect loop nests at the expense
of increased barrier synchronization.

Sass and Mutka [28] describe a static ordering of transformations that enables unimod-
ular transformations. Speciﬁcally, scalar forward substitution and loop distribution are used
to transform source code into perfect loop nests. Next, unimodular transformations are
used to obtain parallelism, as in the traditional approach. Finally, loop fusion and common
subezpression elimination are used, as necessary, to restore the structure of the original
source code. The only source code structure that cannot be transformed into perfect loop
nests is source code with a transfer of control from within a loop nest, or source code with
an iterative command within an alternative command.

Alternatively, Sass and Mutka [26] describe a transformation, called the phase method,

that transforms a 2-deep, imperfect loop nest into three adjacent, 2-deep, imperfect-only-

 

122

child loop nests. The phase method has three steps: 1) calculate and lengthen the depen-
dence vectors in the dependence vector set D; 2) invoke Banerjee’s algorithm [3] to obtain
ﬁne-grain parallelism on a doubly-nested loop; and 3) map the source code into three phases

with the innermost loop executing in parallel in all three phases.

8.2 Non-Singular Matrices

Ramanujam [5] and Li and Pingali [6] independently developed a similar extension of the
unimodular approach using non-singular matrices. A non-singular matrix relaxes the third
constraint of unimodular matrices (ldet Tl = 1), and allows the determinant of the trans-
formation matrix T to be greater than or equal to one. Relaxation of this constraint allows
more freedom in the choice of T, and, in fact, unimodular matrices are a subclass of non-
singular matrices.

The legality of the transformation for non-singular matrices is the same asthe global
postcondition for unimodular matrices. That is, all transformed dependence vectors must be
lexicographically positive. This condition is formally described as: (VJ: JG D : T x d.» 5).

Also, optimal transformations for the non-singular approach are described in terms sim-
ilar to optimal transformations for the unimodular approach. Speciﬁcally, fully permutable
loop nests are related to dependence cones by Ramanujam [5]. Recall that fully permutable
loop nests allow any ordering of the loops.

Using non-singular matrices, there is a method for mapping from the initial code to
transformed code for both the assignment statements and the loop bounds. Although de-
termining the assignment statements is relatively easy, determining the new loop bounds is
more complicated for non-singular transformations than it was for unimodular transforma-
tions.

Suppose that (11,12,...,In) is the original iteration space, T is the transformation

matrix, and (J1,J2,.. .,J,,) is the transformed iteration space. The substitution in the

 

123

assignment statements is determined as follows:
(11,12,...,I,.) = T‘1(J1,J2,...,J,.)

Then, the new values for [1, . . . , In are substituted into the original assignment statements.
Note that this is the case for both unimodular and non-singular matrices.

Transforming the loop bounds for a non-singular matrix is more complex than for a
unimodular matrix. The reason for the complication is based upon the value of the deter-
minant of the transformation matrix T (det T Z 1). The transformed iteration space may
be much more sparse than the original iteration space, and the step sizes of the transformed
loops are typically greater than one.

The work presented by Li and Pingali, and Ramanujam relaxes the selection of the
transformation matrix T. The relaxation enables T to be more easily selected, and enables
the representation of one additional loop transformation technique, loop scaling. As dis-
cussed previously, the criteria for legality, optimality, and assignment statement mapping
are equivalent for unimodular and non-singular matrices. However, the mapping of the loop
bounds for the transformed code is more complex because of possible changes in the density
of the iteration space. The use of non-singular matrices complicates the loop bound mapping

strategy, while adding only limited expressive power (one additional transformation).

8.3 Schedules

Kelly and Pugh introduce a representation of source code transformations called schedules

that is an alternative to matrix-based representations [7]. Schedules are in the form:

T:[il,...,im] —* [f1,...,fn]]C

where T is the schedule; it is the original iteration space; f], is a function of original iteration
space; and C represents the domain restrictions. Note that the original iteration space may

have a different loop nest depth (m) than the ﬁnal loop nest (n). In unimodular matrices,

124

a transformation matrix T is assigned to an entire block of source code. In contrast, each
individual assignment statement has its own schedule.

The legality of a schedule is based upon its variable part and constant part. The variable
part is the portion of the schedule that is a function of the original index variables, and the
constant part is what remains after the variable part is removed. A level refers to the nest

depth of the transformed code. As an example, suppose that we are given the schedule:

[i,j,k] —+ [i+2k+n + 1,3j,2,2k+3n]

Then the variable part of the schedule at level 1 is i+ 2k, and the constant part is n + l.

Legality is formally deﬁned as:

Vi,j,p,q,Sym: i —> j 6 (1m :> Tr“) -< T9(i)'

Informally, this deﬁnition states that if there is a dependency from between statement p at
iteration i and statement q at iteration j in the initial source code, then the transformed
code must maintain the same execution order. A legal schedule T that satisﬁes the execution
order constraint is calculated based upon its variable part and constant part. The variable
part is algorithmically determined and the constant part is heuristically resolved.

Optimality of a schedule is determined based upon three criteria: simplicity, granularity,
and locality [8]. Simplicity is a metric that is based upon the complexity of the schedule
and the amount of parallel execution between statements. Granularity is a measure of the
total number of iterations that are executed sequentially outside of the outermost parallel
loop. Each loop size is assumed to be 100, since the value may not be known at compile
time. Finally, locality is an estimate of cache misses that may occur. Again, each loop is
assumed to have 100 iterations, each cache line to contain 8 elements, and array layout is
row major.

Schedules are used to map the initial to transformed source code based upon intervals.

The schedules for each assignment statement at each level of the schedule are compared to

 

125

determine whether or not the statement is executing in the interval. Based upon the com-
parisons, transformed source code is generated and simpliﬁed. Schedules are implemented
in Omega/ Tiny [7], which is a data dependence/source code transformation tool that is
built on top of Tiny [19].

Schedules allow a large family of transformations to be represented. The ability to
represent transformations in addition to loop skewing, reversal, and permutation leads to a
more usable compiler. Also, the genericity of schedules implies that they are extensible to
other loop transformations.

However, large data structures are needed since there exists a unique schedule for each
source code statement. Also, the solution set is inﬁnitely large for the variable and constant
parts of the schedules. Apparently, some clever heuristic is used in the implementation
to prune the search space. As mentioned, an assumption that is made when determining
optimality is that the loop size is always 100. In many cases this assumption is incorrect,
and extraneous optimality results may be obtained. Finally, and most importantly, the
mapping algorithm is exponential (0(3")) [36], because of the method used to determine
intervals.

8.4 Generalized Loop Transformations

Sarkar and Thekkath [9] present work that combines some of the beneﬁts of a. formal speci-
ﬁcation approach with that of matrix representation of loop transformations. They present
templates that formally describe the pre- and postconditions for the dependence vector
set, the loop bounds, and initialization statements for a kernel set of transformations ap-
plied to perfect loop nests. The kernel set of transformations is small, but extensible, and
includes Unimodular, ReversePermute, Parallelize, Block, Coalesce, and Interleave. Each
technique is described in terms of a template that describes what must be true before the
transformation can be applied, the input parameters that are required, and the effect of the
transformation on the source code.

The overall transformation of serial to parallel source code occurs in four steps:

 

126

1. Template instantiation
2. Sequence representation for loop transformations
3. Legality tests

0 Dependence vector test

0 Loop bounds test
4. Code generation for the transformed loop nest

0 Generate new loop bound expressions

0 Generate new initialization statements

Each of the four steps will be brieﬂy described below.

Template instantiation is obtained by supplying the input parameters, such as the num-
ber of nested loops, for one of the transformation techniques. A Sequence of transformations
is represented by a global sequence T = (t1,t2, . . .,tk), where each t,- represents an indi-
vidual transformation in the sequence. Note that the concept of a global sequence T has
no real meaning in this context. In Sarkar and Thekkath’s framework, the effect of T is
determined by application of transformations for each of the individual 15,8 in the sequence.
In contrast, the transformed code can be determined directly from T with no concern as to
the individual transformations in the unimodular approach.

The Legality tests are divided into two subsections: Dependence vector test and Loop
bound test. The dependence vector test ensures that all dependence vectors are lexicograph-
ically positive after transformation. The loop bound test checks the bounds of the loop to
determine if the preconditions are met for the individual technique. Each transformation
in the kernel set has a unique set of preconditions on the loop bounds.

The ﬁnal step in the generalized transformation model is Code generation. This step
describes the effects of the transformations in terms of loop bounds: how the upper and
lower loop bound for each loop in the loop nest is affected; and initialization statements:
how the index variables are affected.

The transformation template for the dependence vector set for loop blocking will be

shown for illustrative purposes. The template for the dependence vector set is:

127

p'={ (dl,...,d.-_1,d:,....d;,d:-',...,d;-'.d,-+i,...,dn)l
(d1,...,d,,) 6 DA
(d2, II) 6 bIOCkmaP(dk) W s k s j}

where blockmap(dk) is deﬁned as:

{(0,0)} if d, = 0
{(6*I, ‘*I)} if dk = 6*I
{(0,dk),(dk,‘*’)} if (I); = 101‘ - l

{(0, dk), (dir(dk), ‘*’)} otherwise
and div-(dk) is deﬁned as:

d], if d;: is a direction value or if dk = 0

“+” if dk is a positive distance value

“—” if d]. is a negative distance value

Suppose that we are given the following dependence vector set:

D = {(2,0, +, —1),(3, —2, 1, —1),(+, —1, *,2)}.

Using the mapping just described for loop blocking and given that the loop nest depth is
four (12. = 4), loops i = 2 through j = 3 are blocked, and bsize[l . . . 2] is a vector representing

the block size for loops i through j, then the new dependence vector set is:

D, = {(2, 09 0a 09 +’ -1), (2,0, +3 0, *7 -1), (3,090, -2: 1, _1)9 (33 0, 19 “'2, *, —1)v

(3’ —’0, *3 1’ -1), (3, _9 19*i*9 -1), (+109*9—12*v2)9 (+, —1i*: *3 *92)}

This approach combines some elements of formal speciﬁcations and matrix-based repre-
sentation of transformations. The transformations are not described in terms of a general
framework. The individual transformations (tg’s) must each be applied to determine the
global effect of the transformations (T). However, overall legality of the sequence cannot

be determined until all of the transformations are applied. Therefore, if one of the early

128

transformations in a sequence causes an illegal condition, then the compiler must “back up”
through the transformation sequence, eliminate the illegal transformation, then reapply the
remainder of the sequence. In contrast, in the unimodular approach only the ﬁnal trans-
formation matrix T is used to translate the initial source code into ﬁnal form. Thus, if T
causes an illegal condition, then it is much easier to correct.

The applicability of a transformation is dependent upon the accuracy of the dependence
information. Potential parallelism is determined based upon properties of the transformed
dependence vector set. Therefore, if dependence vector set mapping rules result in too
restrictive, or spurious dependence vectors, then potential parallelism may be lost. Finally,

Sarkar and Thekkath’s method requires perfect loop nests.

8.5 Precondition Approach

Whitﬁeld and Soffa [37, 38, 39] propose a formalized method for applying parallel compiler
techniques to translate source code. Their work describes a method to determine interde-
pendences between compiler techniques and also discusses methods for exploiting them in
order to establish an optimal ordering for compilation. The ordering is based solely upon
interactions between compiler methods and is completely independent of the properties of
the underlying source code.

First, seven common parallel compilation methods are formally described in terms of
pre- and postconditions. The seven code transformations are Dead Code Elimination, Con-
stant Propagation, Invariant Code Motion, Loop Unrolling, Strip Mining, Loop Fusion, and
Loop Interchange. Next, enabling and disabling conditions for each individual method are
determined. The enabling and disabling conditions describe what conditions are necessary
in order for the compiler method to be applied, and are used to determine any interactions
between optimizations. Since each pre- and postcondition is formally described, reasoning
about dependences between the individual transformations is also formalized. Finally, en-

abling and disabling graphs are developed between the seven transformations. The results

129

from the enabling and disabling graphs are combined into a global directed graph that
implies overall transformation ordering.

The next step in this method is to determine the enabling and disabling conditions for
each of the seven code transformations. The enabling conditions are the preconditions (i.e.,
the conditions necessary for the optimization to be applied), and the disabling conditions
are negations of the enabling conditions. Then the enabling and disabling conditions are
compared in a pairwise fashion to determine enabling and disabling conditions between the
transformations. Lack of interactions between code transformations is also proven using
formal reasoning techniques.

The ﬁnal step is constructing an overall dependency graph based upon the pairwise

comparison mentioned above. This is a three step process:

1. Develop the enabling graph;
2. Develop the disabling graph;

3. Develop the overall dependency graph (optimization ordering).

Step 3 is a consolidation of steps 1 and 2, and any conﬂicts arising from step 1 or 2 are
reﬂected in the overall dependency graph.
Finally, the ordering of the techniques is obtained by reviewing the results from the

overall dependency graph. Whitﬁeld and Soffa suggest the following order:

1. Constant Propagation

Dead Code Elimination (iterate)

Loop Interchange (iterate)

Invariant Code Motion (iterate)

Loop Interchange and Invariant Code Motion (iterate)
Loop Unrolling (iterate)

Loop Fusion (iterate)

90.49999050

Strip Mining

130

Most of the available parallelism in source code is in the loops. Therefore, extending
the transformation set to include more loop transformations may increase the applicability
of this approach. There is no metric mentioned to measure parallelism or data locality in
the transformed source code with this approach. In most cases, parallelism or data locality
is the goal of loop transformations and having a metric shows whether the goals have been
met. Another disadvantage of this approach is that static ordering of the transformations is
independent of the source code to which the transformations are applied. Different source
code and different machines have different optimization goals. Attempting to obtain one

ordering of the transformations that meets multiple goals seems optimistic.

8.6 Morphism Notation

Lu and Chen [40] present a mapping from original source code to parallelized code, rather
than the transformations themselves. The mappings, or morphisms, are broken down into
three subclasses: spatial, temporal, and reordering.

Temporal morphism is the mapping of the bounding loops of a nested loop. Temporal
mapping determines which loops can be parallelized and which loops may be interchanged.
It is important to note that the temporal morphism does not change the ordering of the
statements within the loop, only the order and parallelism of the iterative structures (do’s,
for’s) surrounding the statements.

Reordering maps the statements within the loops into some new ordering. As an
example, a statement reordering morphism may indicate that the original loop ordering
(51, $2, . . . , $1., . . . , Sn) may be transformed into parallelized code with the following state-
ment order (31:, 5'2, . . ., Sn, . . . , 51).

The ﬁnal morphism is spatial morphism. Spatial morphisms are hardware dependent
translations. In other words, they answer the question “How is the code domain mapped to
the available architecture?” An example of a spatial morphism is strip mining. Strip mining

determines the distribution of data into “chunks” that are dependent upon hardware vector

131

size. Spatial morphisms were not the focus of Lu and Chen’s work and are brieﬂy discussed
below.
Assume that D is the domain of the initial source code, m is the transformation mapping,

and P is the domain of the transformed (parallelized) source code, then:
m(D) —> P

Further, assume that the mapping function m has spatial s, temporal t, and reordering 1'

components, then:

m(D) = f(8(D),t(D)ar(D)) -* P

That is, the overall mapping is some function of the spatial, temporal, and statement
reordering functions over the domain of the initial source code.

The overall transformer that combines the spatial, temporal, and reordering as described
above, is called the loop transformer. A mapping that combines only temporal and reorder-

ing morphisms:
7r(D) = 9(t(D),r(D)) -* P’

is called a loop scheduler.

The loop transformer m is hardware dependent and the loop scheduler 1r is hardware
independent. Lu and Chen’s interest was hardware independent translations and so the
focus was loop scheduling rather then overall loop transformations.

Suppose that the initial source code is of the form shown in Figure 8.1(a). A temporal
mapping t indicates the order of the bounding loops. Any index name that is not included in
t is parallelized (for —+ forall), and must be innermost after transformation. As examples,
t(D) = (k, i,j) results in the transformed loop shown in Figure 8.1(b), and t(D) = (j,k) is
interpreted as shown in Figure 8.2(a).

The statement reordering function r indicates the order of the statements, by position,

after transformation. As an example, r(D) = (2,3, 1) indicates that the statement that

132

 

L1 for11=... L1 10143:...
L2 for12=... L2 forI1=...
L3 torI3=... L3 for12=...
51 S;
Sn Sn
Figure 8.1(a). Sample code: before morphism . Figure 8.1(b). Sample code: after temporal mor-

phism with ((1)) = (13, 11,12).

was originally ﬁrst is now second; originally second is now third; and originally third is now
ﬁrst.

A complete loop schedule for the original example shown in Figure 8.1(a) is:

"(13): (t(D),r(D)): ((J'),(1.3.2))

and results in the loop transformation shown in Figure 8.2(b).

 

L1 10r12=... L1 tor-12:...

L2 forI3=... L2 fora11I1=...

L3 forall 11 = . .. L3 torall I3 = .. .

51 . . . S; . . .

$2 .. . S3

S3 .. . 52
Figure 8.2(a). Sample code: after temporal mor- Figure 8.2(b). Sample code: after temporal and
phism with t(D) = (12, I3). reordering morphism with t(D) = (t(D),r(D)) =

((12),(1:3»2))-

An important constraint on transformations is that the resultant dependence vectors
are lexicographically positive. This constraint means that any dependences that existed

before transformation must still exist after transformation.

133

A difﬁculty with this approach is that it only works on perfect loop nests. Any deviation
from perfect nesting results in a loop structure that cannot be represented by the notation
as described in this paper. Some extension to the mappings that allows imperfect loop nests
is needed.

The method of determining how the morphisms are determined was only brieﬂy discussed
in the paper and is a non-trivial task. That is, the representation of the mapping was
described, but not the derivation of the mapping. Finally, the division of spatial, temporal,
and statement reordering morphisms allows the representation of the hardware-independent
portion of mapping (loop scheduling) and the hardware-dependent portion of mapping

(overall loop transformation).

CHAPTER 9

Conclusions and Future

Investigations

This dissertation presented an extension to the kernel set of unimodular transformations,
which includes loop permutation, loop skewing, and loop reversal applied to perfect loop
nests. Several advantages to the traditional unimodular approach are that it uniﬁes seem—
ingly independent transformations into a common framework; a sequence of transformations
is composed into the product of the matrices representing the individual kernel transfor-
mations; the source code is mapped from its initial to ﬁnal form only once, after a ﬁnal
transformation matrix T is found; and parallelism goals are evaluated based upon the prod-
uct of the transformation matrix T and an abstraction of the source code in the form of a
distance vector set D.

The unimodular transformations have been extended to include ﬁve other matrix-based
transformations [12, 22, 23]. The extended transformations are: loop normalization, which
modiﬁes the step size and upper bound of a loop so that they are both one; loop blocking,
which includes tiling, strip mining, and cycle shrinking and increases the depth of a nested
loop; loop collapsing, which includes loop coalescing and decreases the depth of a nested
loop; loop ﬁssion, which divides a set of assignment statements within a loop nest into two
adjacent loop nests; and loop fusion, which combines two sets of assignment statements

from adjacent loop nests into a common loop nest. The legality of the extended set of

134

 

 

135

transformations was discussed in terms of both perfect and imperfect loop nests. Explicit
mapping rules for the distance vector set were described, as well as any modiﬁcations to
the source code mapping.

Several algorithms were described that show the application of various combinations of
transformations from both the kernel and extended transformation set. Finally, the prelim-
inary design for the implementation of a tool that uses the extended transformation set was
described. The detailed requirements analysis can be used to guide the development of a
parallel compiler that supports the extended, matrix-based model of loop transformations.

Potential areas of future investigations are to develop new algorithms using the extended,
matrix-based framework, and remove the constraints on the structure and syntax of the
target source code language to which the extended framework is applied.

New algorithms, that include both the extended and kernel set of loop transformations,
can be designed to meet other machine—independent and machine-dependent goals. As an
example, we did not discuss the application of the extended family of transformations to
data distribution for a distributed-memory machine. Now that the uniﬁed model has been
developed, there are numerous software optimization goals to which it can be applied.

Second, the theoretical model should be extended to include direction vectors, as well
as distance vectors. It is often the case that explicit distance information is unavailable
because of complexities in the source code dependences. The ability to handle distances,
directions, and combinations of the two, could make the theoretic model applicable to a
broader range of source code.

We assumed that the loop bounds and subscript expressions were linear functions of
the index variables. This assumption is reasonable based upon statistical data derived from
actual source code [25]. However, including more complex expressions would broaden the

scope of the source code to which the uniﬁed model could be applied.

BIBLIOGRAPHY

BIBLIOGRAPHY

[1] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities. In AFIPS Conference Proceedings, 1967.

[2] Jau-Hsiung Huang. On Parallel Processing Systems: Amdahl’s Law Generalized
and some Results on Optimal Design. IEEE Transactions on Software Engineering,
18(5):434—448, May 1992.

[3] Uptal Banerjee. Unimodular Transformations of Double Loops. In Proceeding of the 3rd
Workshop on Languages, Compilers, and Parallel Computing, pages 192—219, August
1989.

[4] Michael Wolf and Monica Lam. A Loop Transformation and an Algorithm to Maximize
Parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452—471,
October 1991.

[5] J. Ramanujam. Non-unimodular Transformations of Nested Loops. In Proceedings of
Supercomputing ’92, pages 214-223, November 1992.

[6] Wei Li and Keshav Pingali. A Singular Loop Transformation Framework Based on
Non-singular Matrices. In 5th Workshop on Languages and Compilers for Parallel
Computing, pages 249—260, August 1992.

[7] Wayne Kelly and William Pugh. A Framework for Unifying Reordering Tmnsforma-
tions. Technical Report CS-TR-2995-1, University of Maryland, August 1993.

[8] Wayne Kelly and William Pugh. Determining schedules based on performance esti-
mation. Technical Report UMIACS-TR-93-67,CS-TR-3108, University of Maryland,
August 1993.

[9] Vivek Sarkar and Radhika Thekkath. A General Framework for Iteration-Reordering
Loop Transformations. In Proceeding of the ACM SIGPLAN ’92 Conference on Pro-
gramming Language Design and Implementation, pages 175-187, June 1992.

[10] David Chesney. A Formal Approach to Automatic Source Code Translation for Parallel
Architectures. Technical Report CPS-91-15, Michigan State University, October 1991.

136

 

137

[11] David Chesney and Betty Cheng. Formal Speciﬁcation of an Automatic Source Code
Translator for Parallel Computer Architectures. In Minnowbrook Workshop on Soft-
ware Engineering for Parallel Computing, August 1992.

[12] David R. Chesney and Betty H.C. Cheng. Generalizing the Unimodular Approach. In
International Conference on Parallel and Distributed Systems, pages 398-403, Decem-
ber 1994.

[13] Randy Allen and Ken Kennedy. Automatic Translation of FORTRAN Programs to
Vector Form. ACM Transactions on Programming Languages and Systems, 9(4):491-
542, October 1987.

[14] Michael Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, 1989.

[15] Hans Zima and Barb Chapman. Supercompilers for Parallel and Vector Computers.
The ACM Press, 1990.

[16] James Davies, Christopher Huson, Thomas Macke, Bruce Leasure, and Michael Wolfe.
The KAP/S-l: An Advanced Source-to—Source Vectorizer for the S-l Mark IIa Super-
computer. In International Conference on Parallel Processing, pages 833—835, 1986.

[17] Ken Kennedy, Kathryn McKinley, and Chan-Wen Tseng. Analysis and Transformation
in the Parascope Editor. Technical Report CRPC-TR90106, Rice University, December
1990.

[18] David Padua and Michael Wolfe. Advanced Compiler Optimizations for Supercomput-
ers. Communications of the ACM, 29(12):]184—1201, December 1986.

[19] Michael Wolfe. The Tiny Loop Restructuring Research Tool. In International Confer-
ence on Parallel Processing, pages 1146—1153, 1991.

[20] Zhiyuan Li and Pen-Chung Yew. Interprocedural Analysis for Parallel Computing. In
International Conference on Parallel Processing, pages 221—228, 1988.

[21] Constantine Polychronopoulos, Milind Girkar, Mohammed Haghighat, Chia Lee, Bruce
Leung, and Dale Schouten. Languages and Compilers for Parallel Computing, chapter
21 - The Structure of Parafrase—2: an Advanced Parallelizing Compiler for C and
Fortran, pages 423—435. MIT Press, 1990.

[22] David Chesney and Betty Cheng. Application of the Unimodular Approach to Loop
Fission and Loop Fusion. Presented at the Scalable High Performance Computing
Conference, May 1994.

[23] David Chesney and Betty Cheng. Extending the Unimodular Approach to Other
Transformation Techniques. Technical Report CPS-93—24, Michigan State University,
April 1993.

 

138

[24] Michael Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Stan-
ford University, August 1992.

[25] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An Empirical Study of Fortran Programs
for Parallelizing Compilers. IEEE Transactions on Parallel and Distributed Systems,
1(3):356—364, July 1990.

[26] Ron Sass and Matt Mutka. Transformations on Doubly Nested Loops. In International
Conference on Parallel Architectures and Compilation Techniques, 1994.

[27] Constantine Polychronopoulos. Loop Coalescing: A Compiler Transformation for Par—
allel Machines. In International Conference on Parallel Processing, pages 235-242,
1987.

[28] Ron Sass and Matt Mutka. Enabling Unimodular Transformations. In Supercomputing,
pages 753-762, 1994.

[29] Convex Computer Corporation. Convex FORTRAN Optimization Guide, second edi-
tion, 1994.

[30] James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, and William
Lorensen. Object-Oriented Modeling and Design. Prentice-Hall, Inc., 1991.

[31] Uptal Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic, 1988.

[32] Michael Wolfe and Chan—Wen Tseng. The Power Test for Data Dependence. Technical
Report Rice COMP TR90-145, Rice University, December 1990.

[33] Michael Wolfe and Chan-Wen Tseng. The Power Test for Data Dependence. IEEE‘
Transactions on Parallel and Distributed Systems, 3(5):591—601, September 1992.

[34] W. Pugh. The Omega Test: A Fast and Practical Algorithm for Dependence Analysis.
In Proceedings of Supercomputing ’91, pages 4—13, November 1991.

[35] Michael Wolfe. TINY: A Loop Restructuring Research Tool. Oregon Graduate Institute
of Science and Technology, December 1990. i

[36] Wayne Kelly. A Framework for Unifying Reordering Transformations. personal com-
munication, September 1993.

[37] Debbie Whitﬁeld and Mary Lou Soffa. An Approach to Ordering Optimizing Trans-
formations. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and
Practice of Parallel Pragramming, pages 137-146, March 1990.

[38] Deborah Whitﬁeld and Mary Lou Soffa. Investigating Properties of Code Transfor-
mations. In 1993 International Conference on Parallel Processing, pages II—156-160,
August 1992.

139
[39] Deborah Lynn Whitﬁeld. A Unifying Approach for Optimizing Transformations. PhD
thesis, University of Pittsburgh, 1991.

[40] Lee-Chung Lu and Marina Chen. New Loop Transformation Techniques for Massive
Parallelism. Technical Report YALEU/DCS/TR-833, Yale University, October 1990.