31.1.:
93.... .y:
.3!

MW.‘:

iii:
31:25.2... 2...»: ..

13.. 1
z

“‘53:” I‘i
gﬁivéxr‘;
42%:
. a 4'5

“ﬁn-nun an...
.w.‘xlt.ﬁt.1’br
.5 2. 9t .‘

.. hag:- .
plug
aw?

. . . .
(Anna-Pu \ .
2. mar-f. .9»?!
7 ~ . f“ ‘ b»: I» 2!
. , 2.1. fat
53.25:... :1
liq-3!...
3:335
1.
{$9 )5...
1...V.lt..:
an... .

£3311.
L1:

vi... .
393..., h»:

 

avr!

THESiS
k,

Illillliill‘l’illflllllllliilllllllﬁlll

3 1293 01417 2674

This is to certify that the

dissertation entitled

Processor Management in 2-D Mesh
Wormhole-routed Multicomputers

presented by

Dugki Min

has been accepted towards fulﬁllment
of the requirements for

Ph D. Mamem_cgmpnten_5cience

Maw/M

 

 

 

ﬁ Major professor

 

Date M45, ,g #295

MSU is an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

LIBRARY
Mlchlgan State
Unlverslty

 

 

 

PLACE II RETURN BOX to romovothlo chookom from your record. ,
TO AVOID FINES rotum on or before date duo.

DATE DUE DATE DUE DATE DUE

W

.‘ ' a . i r‘} k‘. ..
i Q ’ \ .
.71; o .

 

   
   

 

 

 

l
l
l
i

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

i ll |

MSU IoAnNﬁrmotlvo ActlorVEqud Opportunity 1m
W1

 

 

PROCESSOR MANAGEMENT IN 2-D MESH
WORMHOLE-ROUTED MULTICOMPUTERS

By

Dugki M in

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

DEGREE OF PHILOSOHPY

Department of Computer Science

1995

ABSTRACT

PROCESSOR MANAGEMENT IN 2-D MESH
WORMHOLE—ROUTED MULTICOMPUTERS

By

Dugki M in

Processor management is one of the important services provided by the operat-
ing systems of multicomputers that serve multiple jobs simultaneously. This thesis
investigates several fundamental issues that are important when designing processor
management schemes for 2-D mesh wormhole—routed multicomputers. This research
investigates the effects of job interactions and addresses the performance degrada-
tion due to network contentions when jobs interact. Network contentions may be
crucial to the design of processor allocation strategies that allocates processors to a
job from geometrically dispersed regions. A general contention model is proposed
to study the effects of competing paths on network contention due to the nature of
wormhole routing networks when several paths are overlapped that have different
communication rates. Based on the proposed contention model, we derive analytic
expressions that predict the performance detrimental effects of job interactions in
terms of contention delay. The analytic study of network performance leads us to
investigate the principles that may be applied when developing a scattered alloca-

tion strategy for a 2-D mesh multicomputers. Efficient methods of partitioning and

placing jobs are provided with regard to several communication parameters. We also
investigate the performance effects of irregularity of job shape and size by examining
a dynamic scheduling system that schedules jobs of various shapes from regular shape
to irregular shape. This research outlines an approach for restricting incoming job
request to use partitions from a multicomputer so that the performance advantage of
regular-shaped partition can be utilized. In addition, we propose a new job scheduling
discipline that can achieve low job turnaround time and high system utilization while
not inappropriately favoring small jobs to the detriment of large jobs. The discipline
adapts its scheduling policy to the changes of workload so that it behaves in a FCFS
manner under low loaded conditions, but exploits performance enhancing features of

multiple queue schemes under highly loaded conditions.

Copyright © by
Dugki Min
1995

To my parents and my wife

ACKNOWLEDGMENTS

I would like to express my sincere thanks to my advisor, Professor Matt W. Mutka,
who has guided me throughout the academic and research years of my Ph.D. pro-
gram. His consistent guidance and patient encouragement has helped me mature as
a researcher. I am grateful for many discussions and invaluable comments he has
provided.

I am also very grateful to the other members of my dissertation committee: Pro-
fessor Lionel M. Ni, for his wise and invaluable comments and criticisms; Professor
Moon Jung Chung, for his friendly advice and careful reviewing of the manuscript;
Professor James H. Stapleton, for his willingness to answer my questions.

I would like to take this opportunity to thank those who have helped me during
my stay at Michigan State University. Many thanks to Professor Rawle Hollingsworth
for his willingness to help and support me financially. I thank all the brothers and
sisters in the New Hope Baptist Church. In particular, I am grateful to Dr. Chang,
Dr. Han and Pastor Cho for their sincere care, love and prayer.

I am indebted to my family members. I express my hearty thanks to my parents for
their constant love and encouragement that have helped me do my best. Many thanks
to my father-in-law and mother-in-law for their understanding and encouragement. I
also thank my sisters, brothers-in-law and sisters—in-law for their encouragement and
support to make decisions. Finally, my very special thanks to my wife, Eunmi Choi,
for her continuous patience, love and prayer throughout my graduate studies. She

spent many sleepless nights with me in discussing on research issues.

vi

TABLE OF CONTENTS

LIST OF FIGURES

1 INTRODUCTION
1.1 Network Contention Issues ........................
1.2 System Fragmentation Issues .......................
1.3 Thesis Outline ...............................

2 RESEARCH ISSUES
2.1 Processor Sharing vs. Processor Partitioning ..............
2.2 Basic Principles For Job Scheduling ...................
2.3 Graph Embedding ............................
2.4 Partition Recognition Ability ......................
2.5 Other Techniques For Cube Allocation .................
2.6 External Fragmentation .........................
2.7 Rectangular Packing Problem ......................
2.8 Performance Studies of Wormhole Networks ..............

3 A PERFORMANCE STUDY OF WORMHOLE NETWORK
3.1 System Model ...............................
3.2 Analytic Model: The Multipath Contention Model ...........
3.3 Goal and Approaches ...........................

4 ANALYSIS OF INTERNAL CONTENTION DELAY
4.1 The Homogeneous Multipath Contention Model ............
4.2 Analysis ..................................
4.2.1 Intermediate Contention Delay .................
4.2.2 Starting Contention Delay ....................
4.3 Results ...................................
4.3.1 Contention Delay of a Path ...................
4.3.2 Internal Contention Delay of a Matrix Transpose Pattern . . .

4.4 Summary .................................

vii

NO‘AH >4

on

10
11
12
14
14
16
17

20
21
23
27

28
29
32
33
37
41
41
42

5 ANALYSIS OF GENERAL CONTENTION DELAY

5.1 The Heterogeneous Multipath Contention Model ............
5.2 The Divide-And-Conquer Strategy ...................
5.3 Analysis of the Heterogeneous 2-Path Contention Model .......
5.3.1 Analysis ..............................
5.3.2 Results ...............................
5.4 Analysis of the Heterogeneous Multipath Contention Model ......
5.4.1 Starting Contention Delay ....................
5.4.2 Intermediate Contention Delay .................
5.4.3 Results ...............................
5.5 Analyzing Two Transpose Jobs .....................

5.6 Summary .................................

JOB PARTITIONING AND PLACEMENT

6.1 Contention Among Competing Paths ..................
6.1.1 Path Interaction .........................
6.1.2 Discussion .............................

6.2 Partitioning and Placing Jobs ......................
6.2.1 Methodology ...........................
6.2.2 Effect of Communication Rates .................
6.2.3 Effect of Internal Competing Level ...............
6.2.4 How to Partition: Effect of Congestion Factor .........
6.2.5 Discussion .............................

6.3 Summary .................................

EFFECTS OF JOB SIZE IRREGULARITY ON DYNAMIC RE-

SOURCE SCHEDULING

7.1 Motivation: Irregularity of a 2-D Mesh System .............

7.2 Simulation Model .............................

7.3 The BWQ Searching Algorithm .....................
7.3.1 Between Queue (BQ) Policy ...................
7.3.2 Within Queue (WQ) policy ...................
7.3.3 Design Parameters ........................

7.4 Performance Effects of Variability in Size and Shape ..........

7.5 Summary .................................

47
48
50

59
63
65
67
69
7O

74
75
78
80
81

98

99
100
101
103
108
110

EFFICIENT JOB SCHEDULING WITHOUT DISCRIMINATION

AGAINST LARGE JOBS 112

viii

8.1 The Proposed Strategy .......................... 113

8.2 Design Parameters ............................ 117
8.2.1 Number of LQs .......................... 118

8.2.2 Coefficient of UBWT ....................... 120

8.2.3 Coefﬁcient for the Size of Lookahead Window ......... 121

8.3 Simulation Results and Comparison ................... 121
8.3.1 Job Turnaround Time and System Utilization ......... 124

8.3.2 External Fragmentation ..................... 124

8.3.3 Fairness .............................. 126

8.4 Summary ................................. 127

9 CONCLUSIONS AND FUTURE RESEARCH 129
9.1 Summary and Major Contributions ................... 130
9.2 Future Research .............................. 133
APPENDIX 136
BIBLIOGRAPHY 138

ix

3.1

3.2

4.1
4.2

4.3
4.4

4.5
4.6
4.7

5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8

5.9

5.10

LIST OF FIGURES

A snapshot of a 6x10 2-D mesh system with three interacting jobs; A
close-up view of a 3-path contention is shown. .............

Stair-layered m-path contention model ..................

A homogeneous ﬁve-path contention model. ..............

Illustration of Computation of an Intermediate and Starting Con-

tention Probability .............................
The Maximum Intermediate Contention Delay: Cmaxf“’d. ......
The Maximum Starting Contention Delay, C maz§"’d: top - simulation
results, middle - prediction, bottom — combination. ..........
Average Contention Delay .........................
Transpose Pattern. ............................
Prediction for Matrix Transpose Pattern .................
The heterogeneous m-path contention model ...............
The reduced contention model for P,- ...................

Heterogeneous 2-path contention model. ................
An illustration of Case 1 ..........................
An illustration of Case 2 ..........................
An illustration of Case 4 ..........................
The m-n simulation model for 2-path contention. ...........
Performance effect of various MTBSs of P,- on communication of P“:
Mean communication delay of P.- vs.MTBS of P.- for 1-1 interaction
model while MTBS of P,- is 0, 25 or 100. ................
Performance effect of various MTBSs of P,- on communication of P,
: Mean communication delay of P,- vs.MTBS of Pj for 1-1 interaction
model while MTBS of P,- is 0, 25, 100, 250 or 1000. . . . . . . . . . .
Performance effect of number of layers of Pj on communication of P,- :
Mean communication delay of P.- vs.MTBS of P,- for 1-1, 1—3 and 1-5
interaction model : M T852 2 25 .....................

25
26

30

34
37

40
42
43
45

49
51
53
54
55
58
60

60

62

5.11

5.12

5.13
5.14
5.15

5.16
5.17

5.18

6.1
6.2

6.3

6.4

6.5

6.6

6.7
6.8

Performance effect of number of layers of P,- on communication of P.- :
Mean communication delay of P,- vs.number of layers of Pj. The pair
represents the MTBSs of P,- and P,- ....................
Performance effect of combination of factors on communication of P5:
dotted line-simulation results, solid line-predictions. The ﬁrst element
of the pair represents k-k interaction model and the second number
represents the MTBS of Pj. .......................
Computation of starting contention delay. ...............
Computation of intermediate contention delay ..............
Mean communication delay of each path vs. MTBS of Path 3 for
heterogeneous 5-path contention model. The MTBS of Paths 1, 2, 4,
5 are 0, 50, 250, and 500 psecs, respectively. .......... V . . . .
An allocation of two transpose jobs ....................
Mean communication delay of the inside transpose job while the MTBS
of the outside job varies from 0 to 500psecs. M TBS.-n is MTBS of the
inside job. .................................
Mean communication delay of the outside transpose job while the
MTBS of the outside job varies from 0 to 500psecs. M TBS,-n is MTBS
of the inside job. .............................

Two layouts of 5 competing paths. ...................
Communication delay of layout (a) vs. MTBS of P3 that varies from 0
to 500: MTBSs of P1,P2,P4, and P5 are 0, 50, 250, and 500, respectively.
The mean communication delay is greater for layout (a) than the delay
for layout (b) shown in the next ﬁgure. .................
Communication delay of layout (b) vs. MTBS of P3 that varies from
0 to 500: MTBSs of P1, P2, P4, and P5 are 500,250, 50, and 0, respec-
tively. The delay shown in this figure is less than the delay of layout
(a) shown in the previous ﬁgure ......................
The starting contention delay when competing with 3 or 5 competing
paths (CPS) at a starting channel .....................
The intermediate contention delay when competing with 3 or 5 inter-
mediate competing paths (CPS) ......................

Two logical communication patterns within a 4x4 job ..........

63

64
66
68

70
71

72

73

76

77

77

79

80
82

Job interaction model for investigating the effect of communication rate. 84

Communication delay of inside job: The x-axis represents the MTBS
of inside job and the y-axis represents the MTBS of outside job. . . .

xi

85

6.9

6.10
6.11

6.12
6.13

7.1
7.2
7.3
7.4
7.5

7.6

7.7

7.8

7.9

8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8

Communication delay of outside job: The x-axis represents the MTBS

of inside job and the y-axis represents the MTBS of outside job. . . . 85
Job interaction model for internal competing level ............ 87
Effect of exchanging internal competing levels. (a) represents the com-

munication delay of the inside job and (c) represents the communi-
cation delay of the outside job while the inside matrix transpose job
interacts with the outside diagonal job. (b) represents the the commu-
nication delay of the inside job and (d) represents the communication

delay of the outside job while the inside diagonal job interacts with the

outside matrix transpose job. ...................... 88
Job interaction model for partitioning. ................. 89
Effect of partitioning ............................ 91
Processor allocation in a general 2-D mesh multicomputer system. . . 97
Four different types of inputs. ...................... 98
Multi-queue based job scheduling ..................... 101
An illustration of job scheduling of three queues ............. 102

Response Time (RT) effect of increasing number of queues on the
performance of job scheduling policy: BQSLs = 30, Input type =

Rectangular-unrestricted .......................... 104
Effects of changing BQSL of Q1: 3 Queues with both window sizes =
1, Input type:Rectangular-unrestricted. ................ 106
Effects of changing BQSL of Q2: 3 Queues with both window sizes =
1, Input type:Rectangular-unrestricted. ................ 106

Response Time (RT) effects of lookahead windows: 3 queues, BQSLs
of Q1 and Q2 = 30, Input type=Rectangular-unrestricted, F S-allocator.107

Response Time (RT) effects of variability in size and shape on job
turnaround times of BWQ algorithm: Q1 with BQSL = 10 and Q2

with BQSL = l ............................... 108
The HELM Algorithm .......................... 116
Tuning the Number of Queues in the HELM Algorithm. ....... 119
The Coefﬁcient for the Upper Bound on Waiting Time ......... 122
The Size of the Lookahead Window .................... 123
Comparison of the Turnaround Time for the Three Disciplines. . . . . 125
Comparison of the System Utilization. ................. 125
External Fragmentation. ......................... 126
The L/ S Ratio for the Three Schemes. ................. 127

xii

CHAPTER 1

INTRODUCTION

Parallel computing systems in the form of massively parallel computers (MPC) have
become popular for researchers in many scientiﬁc ﬁelds. Researchers map scientiﬁc
computations to parallel programs in an attempt to utilize all the processors of an
MPC effectively. Some computing problems may be able to efﬁciently utilize all the
processors and memory of an MPG and nearly achieve linear speedups in execu-
tion times. Many other computing problems, however, can only efﬁciently execute
on a smaller subset of the processors. Due to synchronization and communication
overhead, the computing problems will not improve in performance as the number
of processors allocated to the problems increases. Therefore, in order to utilize the
processors in an MPC efﬁciently, multiple jobs must be allocated to the MPC simulta-
neously. The multi-user environment greatly complicates the processor management
task of the operating system.

The portion of the operating system for processor management, called the proces-
sor manager, plays the role of allocating the incoming jobs generated by concurrent
users to a limited number of processors. The processor manager is composed of three
components. When incoming jobs arrive at the system, the ﬁrst component, called
the task assigner, characterizes the request of the jobs by determining the sizes and

shapes of the subpartitions that can accommodate the incoming jobs. The jobs are

passed to the second component. The second component, called job scheduler, has
one or more queues that are dedicated for job scheduling. When the job scheduler
receives a job request from the task assigner, it places the job in one of the schedul-
ing queues. The job scheduler schedules the jobs in the queues according to its job
scheduling discipline and the characteristics of the jobs, until the scheduling is blocked
by a job that cannot be allocated immediately. When a job is scheduled by the job
scheduler, the third component, called the processor allocator, checks the status of the
current system and the possibility that the job request can be allocated. If possible,
the processor allocator allocates the job to the processors that are determined by its
processor allocation strategy.

We concentrate on the aspects of the processor management problem in an MPC in
which the processors are interconnected in a 2-D mesh topology and communicate by
message passing that is based upon wormhole routing. The regular and simple struc-
ture of the 2-D mesh multicomputer can be implemented at a low cost, while many
algorithms implemented for the structure exhibit good performance. In addition, the
performance of the system in terms of the average network latency, throughput, sat-
uration throughput, and hot-spot throughput is better than the other k-ary n-cube
networks that have higher dimensions when the bisection width of each network is
held as a constant [1].

Two types of processor management scheme have been studied for 2-D mesh
wormhole-routed MPC in commercial and research ﬁelds, depending on the type
of the employed processor allocation strategy. One management scheme, which has
been employed in the Intel Paragon [2], allows processors from geometrically dispersed
regions of the MP0 to be allocated to a user’s job. Although a user’s job may be
represented logically as a particular geometrical shape, such as a rectangular mesh,
each processor of the logical mesh can be scattered across the 2-D mesh parallel pro-

cessor when the job is allocated to the MPC. One benefit of a scattered processor

passed to the second component. The second component, called job scheduler, has
one or more queues that are dedicated for job scheduling. When the job scheduler
receives a job request from the task assigner, it places the job in one of the schedul-
ing queues. The job scheduler schedules the jobs in the queues according to its job
scheduling discipline and the characteristics of the jobs, until the scheduling is blocked
by a job that cannot be allocated immediately. When a job is scheduled by the job
scheduler, the third component, called the processor allocator, checks the status of the
current system and the possibility that the job request can be allocated. If possible,
the processor allocator allocates the job to the processors that are determined by its
processor allocation strategy.

We concentrate on the aspects of the processor management problem in an MPC in
which the processors are interconnected in a 2—D mesh topology and communicate by
message passing that is based upon wormhole routing. The regular and simple struc-
ture of the 2-D mesh multicomputer can be implemented at a low cost, while many
algorithms implemented for the structure exhibit good performance. In addition, the
performance of the system in terms of the average network latency, throughput, sat-
uration throughput, and hot-spot throughput is better than the other k-ary n—cube
networks that have higher dimensions when the bisection width of each network is
held as a constant [1].

Two types of processor management scheme have been studied for 2-D mesh
wormhole-routed MPC in commercial and research ﬁelds, depending on the type
of the employed processor allocation strategy. One management scheme, which has
been employed in the Intel Paragon [2], allows processors from geometrically dispersed
regions of the MPG to be allocated to a user’s job. Although a user’s job may be
represented logically as a particular geometrical shape, such as a rectangular mesh,
each processor of the logical mesh can be scattered across the 2-D mesh parallel pro-

cessor when the job is allocated to the MPC. One beneﬁt of a scattered processor

allocation strategy is that it enables a system to maximize the utilization of the MPC.
Since a job can be allocated processors from any region of the MPC, processors will
not remain unused simply because there is not a large enough contiguous region of
processors to allocate to a job. Nevertheless, when jobs from independent users are
scattered across an MPC, the communication generated by one job may negatively
affect the communication delay suffered by a second job. This is because wormhole
routing [3] is the communication technique used by many MPCs. The communication
delay overhead is due to the sharing of network resources such as routers and wires
between interleaved and independent jobs.

An alternative approach for allocating processors to independent jobs would be to
require contiguous allocations. This means that a user’s job is mapped to an MPC in
a manner such that the processors allocated to one job will not have communication
paths overlapped with the communication paths of other jobs. This may be feasible
in an MPC with respect to the processors allocated to the jobs, but two problems
occur. First, contiguous allocation can lead to a signiﬁcant amount of fragmentation
of processing capacity of the MPC [4, 5, 6]. Fragmentation is the unused processors
of the MPC that do not form contiguous regions that are large enough to serve
users’ job requests. Second, communication contention between independent jobs are
likely to occur even if processors are allocated in contiguous regions. This is because
jobs generate I/O requests to shared disks and other devices that are attached to
servers located at speciﬁc locations in the MPC. For example, the servers in the
Intel Paragon are located at the edges of the 2-D mesh. Therefore, communication
contention between jobs will occur when the jobs generate message trafﬁc to the I/O
servers. The contention for I / O servers means that one job can affect the performance

of a second job.

1.1 Network Contention Issues

We study network contention issues that may be critical to the performance of scat-
tered management schemes. Because the network contention causes the degradation
effect on the network performance, the locations of the processors for jobs should be
carefully determined in order to minimize the negative effect.

Under the current release of the Paragon operating system, the performance degra-
dation due to interference between the competing paths has measurable effects only
for messages of large size according to the results in [7]. This is because the speed of
message delivery supported by the operating system is much slower than the network
capacity. However, the performance impact due to network contention is expected
to become more signiﬁcant under future operating systems. If an operating system
could deliver messages as fast as the network capacity, the negative effects of network
contention increase signiﬁcantly as the number of competing paths increases. We
believe that future MPCs will have operating systems that can deliver messages at
the bandwidth of the network.

We analyze the performance degradation due to network contention. We propose
a general contention model that is suitable for analyzing the performance degrada-
tion in wormhole—routed multicomputer systems [8]. This model is a representation
of arbitrarily overlapped communication paths of jobs. It has been developed by con—
sidering the network contention problem occurring while multiple jobs are allocated
to the system. Based on the general contention model, we predict the contention
delay due to job interactions in a 2-D mesh wormhole-routed MPC [8, 9]. If a job is
allocated several processors, the internal communication of a job may use paths that
overlap with the communication paths occurring within other jobs. The detrimental
effect of contention caused by interference between jobs has led us to study whether

it is necessary to eliminate inter-job contentions by requiring processor allocations in

contiguous regions of the multicomputer.

Based on this analysis of network performance, we investigate the principles that
may be applied when developing a scattered allocation strategy for a 2-D mesh multi-
computers. By isolating each communication parameter, such as the communication
rate, we study whether the method of partitioning and placing jobs can change the
negative effects of job interactions [10]. For the study, we draw conclusions of how to

place and partition jobs in a 2-D mesh system.

1.2 System Fragmentation Issues

As jobs are allocated, the system may become fragmented. External fragmentation
occurs if a contiguous partition of processors is not available to serve a job even if
the needed number of processors are available. Internal fragmentation occurs if more
processors are allocated to a job than is required for the job’s execution.

Most research for hypercube multicomputers has focused on developing innova-
tive processor allocation strategies that have better ability to recognize subparti-
tions [11, 12, 13, 14]. Due to the special high-dimensional structure of the hypercube,
a hypercube system has a structural advantage that can embed many other structures
into it, but this characteristic of high dimensionality makes it difﬁcult to detect an
available subcube. Compared with the hypercube, the 2-D mesh topology is simple
and straightforward to detect an available submesh. A processor allocation strat-
egy that has complete recognition ability can be developed easily. In general 2-D
mesh systems, however, system fragmentation can be signiﬁcantly large even though
the employed processor allocation strategy has the ability of complete recognition.
Incoming jobs on a general mesh system could request computing nodes that form
irregular-shaped submeshes, making the unallocated parts of the system to be irreg-

ular. Therefore, in developing a processor management strategies for a general 2—D

mesh multicomputers, the inherent property of irregularity in the size of job request
and the irregularity in the shape of processor allocation should be dealt with in order
to reduce system fragmentation.

We study the performance effects of irregularity of job shape and size on the
performance of processor management strategies. We examine the performance effect
of irregularity by examining a dynamic scheduling system that schedules jobs with
requests that range from regular-shaped partitions to irregular-shaped partition. The
research outlines an approach for restricting incoming job request to use partitions
from a multicomputer so that the performance advantage of regular-shaped partition
is utilized.

In addition, we develop a job scheduling scheme that can achieve signiﬁcant per-
formance gains by reducing system fragmentation. Since most processor management
schemes have concentrated on approaches for processor allocation, the schemes have
used First-Come-First-Serve (FCF S) as the job scheduling discipline. However, it
has been previously established that job scheduling algorithms for parallel computing
systems can have a large impact on the system utilization and job response time [15].
Schemes that use multiple queues, which reorder the sequence of jobs allocated to the
parallel system, can be very effective in improving the system performance. However,
such non—FCFS schemes have been criticized because they provide improved average
performance by favoring small jobs at the expense of large jobs. In order to achieve
improved performance by means of multiple queue job scheduling schemes without
sacriﬁcing the fairness of FCFS, we propose a new job scheduling discipline that
behaves in a FCFS manner under low loaded conditions, but exploits performance
enhancing features of multiple queue schemes under highly loaded conditions. The

scheme does not inappropriately discriminate against large jobs.

1 .3 Thesis Outline

The thesis is organized as follows. The next chapter presents a brief overview of
research issues and related work on the processor management problem for different
architectural platforms.

Our research related to the network contention issue is given in Chapter 3—
Chapter 5. Chapter 3 describes our model of a 2-D mesh wormhole-routed multi-
computer system and proposes the general contention model, called the heterogeneous
multipath contention model, for analysis. In Chapter 4 we develop expressions for pre-
dicting the contention delay of a job. The detrimental effect of contention caused by
interference within a job has led us to analyze two different kinds of communication
contention. Chapter 5 we analyze the degradation of communication performance
due to multiple interacting jobs in the heterogeneous multipath contention model. A
divide-and-conquer strategy divides the problem into several manageable problems of
computing the contention delay for the heterogeneous 2—path contention model.

The system fragmentation issue is studied in Chapter 7 and Chapter 8. The ef-
fect of job size irregularity is studied in Chapter 7 with regard to the jobs whose
requests vary from regular-shaped partitions to irregular-shaped partitions. In order
to evaluate the effect of irregularity, we examine a group-based job scheduling algo-
rithm, called BWQ-search algorithm, which uses multiple queues for ordering jobs
to be placed on a 2—D mesh multicomputer. In Chapter 8 we propose a new job
scheduling scheme, called the HELM discipline, that adapts its scheduling policy to
the changes of workload. The HELM discipline achieves improved performance by
means of multiple queue job scheduling schemes without sacriﬁcing the fairness of
FCF S.

Future work and concluding remarks are given in Chapter 9.

CHAPTER 2

RESEARCH ISSUES

The rapid progress in the evolution of multicomputer systems focuses the attention
of many researchers on deﬁning and solving new problems. Many innovative strate-
gies for job scheduling and processor allocation have been proposed and compared
for several different architectures, applications, and performance requirements. An
interesting fact is that the foci of the studies have changed depending on the type of
system architecture. ‘

MPC systems have been classiﬁed as shared memory systems and distributed
memory systems depending on how processors and memory modules are connected.
Shared memory systems have a global memory shared by all processors. The global
memory reduces the difﬁculty of programming. Small or medium scale products
are implemented commercially and in research environment [16, 17, 18, 19, 20, 21].
which include Sequent’s Balance 8000, Encore’s Multimax, CRAY-X/MP, BBN’s
Butterﬂy, Stanford Dash and KSRl. In distributed memory systems, called mul-
ticomputers, each processor has its own memory and can only access its own
private memory. Communications between processors are done by passing mes-
sages through an direct interconnection network. Medium and large scale dis-

tributed memory systems have been developed commercially and in research envi-

ronment [22, 23, 24, 25, 26, 27, 28, 29, 30, 31], which include NCUBE family, Intel

01

an

“'5

Job.

Paragon, Connection Machine CM-5, IBM’s SP2, MIT J-Machine, Tera Computer
System, Cray T3D, and NEC Cenju-3.

This chapter presents a overview of research issues and related work on the pro-
cessor management problem on the shared memory and distributed memory system

architectures.

2.1 Processor Sharing vs. Processor Partitioning

Processors in the shared memory systems have been treated in the literature, as if
they are elements in a processor pool [32, 33, 34, 35]. In a processor pool processors
can be shared by several ready users (called processor sharing) or can be exclusively
dedicated to an assigned user (called processor partitioning). A research issue is the
design tradeoffs between processor sharing, which may increase processor utilization
at the cost of context switching, and processor partitioning, in which job turnaround
time may be lower than in processor sharing.

Processor sharing scheme can be employed in the situation that the number of
processors is insufﬁcient. In that situation the processor manager should utilize pro-
cessors efﬁciently by sharing a processor among several processes. Two approaches
of processor sharing have been studied. Dynamic allocation is one approach in which
processors are dynamically allocated to jobs on demand [32, 35]. It assumes that the
number of processors available to each job may vary during the execution in a way
that reﬂects the dynamic parallelism of the job. If some of the allocated processors
are not needed temporarily, then the processors are allowed to be reallocated to other
waiting jobs at the cost of context switching. Another possible processor’sharing
approach is a static allocation strategy that allocates the processors by ‘time slicing’.
Round—robin is an example. Even though the allocated processors are shared by other

jobs, the number of processors allocated to each job is ﬁxed during its entire execu-

10

tion. A dynamic allocation strategy has been compared with several static allocation
strategies in [32].

Processor partitioning scheme is appropriate for the systems that have a sufﬁcient
number of processors. In the systems, processors can be partitioned according to the
requests of each job and dedicated to the job during the entire execution time. ‘Run-
to—completion’ static allocation is a good processor partitioning strategy. This kind
of processor partitioning is called processor clustering, since a number of processors,
called clusters, are partitioned from the processor pool and dedicated to the job during
the entire execution time. The BBN’s Butterﬂy multiprocessor is an example using

processor clustering [16].

2.2 Basic Principles For Job Scheduling

In a processor pool model, a job can be allocated in any subpartition of the system
without any difference in communication performance. Therefore, the processor allo-
cation strategy for this model is simple; if there are a sufﬁcient number of processors
for the currently scheduled job, then an available subpartition is allocated to the job.
Otherwise, the job should wait until the required number of processors are available.
In contrast, the system performance is mainly restricted by the performance of job
scheduling. Thus, it is worthwhile to know what are the basic principles that affect
the performance of a job scheduling strategy.

One of factors that affect the performance of job scheduling strategies is the sys-
tem workload. Based on whether a priori knowledge of the workload is given while
scheduling, job scheduling strategies have been classiﬁed into static scheduling and
dynamic scheduling. A static scheduling assumes that there is a given job list and that
the characteristics of all jobs are known in advance. In contrast, a dynamic schedul-

ing assumes jobs arrive according to a stochastic process and there is no a priori

11

knowledge of the characteristics of all jobs to be scheduled. Many researchers have
proposed solutions to static and dynamic scheduling problems for general purpose
multicomputer systems [36, 5, 37, 38, 39, 40] and for hard real-time systems [41, 42].

An interesting study by Majumdar, et al. [33] investigated the fundamental issues
that are important for static and dynamic scheduling on shared memory multipro-
cessors. They addressed several fundamental issues that have important roles in
uniprocessor systems, such as how signiﬁcant the effect of the characteristics of the
workload is on the performance of job scheduling strategies on multiprocessor systems
and what kind of knowledge about the characteristics of workload is important for
the design of scheduling strategies for a given system. Based on their abstract model
of multiprocessor systems and job scheduling strategies, the research provided basic
principles underlying the performance of job scheduling strategies.

Leutenegger and Vernon [34] examined various job scheduling strategies in deter-
mining which properties of a scheduling strategy are the most signiﬁcant. According
to their results, strategies that allocate an equal fraction of the processing power
to each job perform better than strategies that allocate processing power unequally.
They also claimed that for lock access synchronization, dividing processing power
equally among all jobs is a more effective property of a scheduling strategy than
the property of minimizing synchronization spin—waiting, unless the demand for syn-
chronization is extremely high. Some heuristic job scheduling algorithms and their

performance analysis can be found in [43, 44, 45, 46, 47].

2.3 Graph Embedding

In distributed memory systems, the graph embedding problem should be considered
in determining the sizes and shapes of the subpartitions that can accommodate the

incoming jobs. The graph embedding problem arises when the dependency structure

9X

Wat
Pap

‘zi
My

12

of a parallel algorithm differs from the processor interconnection of the system or
when the number of processes generated by the algorithm exceeds the number of
processors available. The graph embedding problem is to ﬁnd a one-to-one mapping
of a graph onto another graph which minimizes a cost function. This problem has
been applied to not only the task assignment problem in a distributed and parallel
processing system [48, 49, 50, 51, 52], but also many ﬁelds in computer science, such
as VLSI circuit layout [53, 54, 55], simulating one data structure by another [56, 57],
and simulating one parallel processing architecture by another [58, 59];

The graph embedding problem is signiﬁcant because the performance of a parallel
system for a job can be affected by the efﬁciency of the embedding. Several qual-
ity factors have been considered which measure the communication distance between
two communicating processors (the dilation factor), the communication congestion
in the interconnection network (the congestion factor), and the number of extra pro-
cessors used in support of communication between processes (the expansion factor).
A careless embedding may increase the dilation factor, the congestion factor, or the

expansion factor.

2.4 Partition Recognition Ability

The hypercube network topology for distributed memory systems raises a new issue.
A simple regular partitioning scheme is likely to partition the system in a particular
way, such that it is difﬁcult to recognize available partitions. Most of the research
papers on the processor management problem of hypercube systems have proposed
allocation strategies concerning this issue [11, 12, 13, 14].

The system partitioning scheme of hypercube systems typically is regular in order
to utilize the processors efﬁciently since the hypercube topology itself is regular. The

number of processing nodes is assumed to be 2", where k 2 0 and the shape of

13

partitions are assumed to be hypercube. Thus, one research issue is how to partition
the system such that a available subcube of the required size can be recognized quickly
and completely. Complete recognition means that if there are unused subcubes large
enough to cover the job request, the scheme should be able to recognize them.

Many researchers proposed processor allocation strategies for dynamic scheduling
in hypercube multicomputers [15, 11]. The buddy strategy for memory allocation [60]
is a simple approach that is implemented in the N CUBE/ six multiprocessor [24]. The
Buddy strategy is optimal for static scheduling, but shows poor recognition ability
for dynamic scheduling [11, 12, 13]. A strategy using a simple gray code, called
SGC strategy, is another simple and statically optimal strategy whose recognition
ability is twice that of the buddy strategy for the dynamic scheduling problem [11].
This strategy does not have complete recognition ability. However, the recognition
ability can be complete by using multiple gray codes. A strategy that has complete
recognition ability by means of multiple gray codes is presented in [11].

Several other processor allocation strategies having complete recognition ability
have been proposed in the literature. The MSS strategy [14] is one. The strategy
employs the concept of a maximal set of subcubes (M88) in order to minimize frag-
mentation. The Tree-Collapsing strategy [13] is also a processor allocation strategy
that has complete recognition ability. The strategy collapses the typical binary tree
representation of a hypercube successively so that the nodes, which form a subcube
that are distant, are logically nearby each other for recognition. Non-cube allocation
strategies have been proposed [12, 61, 62] that can allocate to a job a number of

processors that is not a power of 2.

l‘f‘

14

2.5 Other Techniques For Cube Allocation

Even if the subcube recognition ability is complete, the performance of processor al-
location strategies can be restricted in hypercube multicomputers due to system frag-
mentation. Several techniques have been investigated in order to alleviate the system
fragmentation in the hypercube multicomputer systems. The scan algorithm [15] and
the lazy scheduling algorithm [63] tried to improve the performance by changing the
order of execution. In [15], Krueger, et al. have compared the roles of processor
allocation and job scheduling for achieving good performance on hypercube comput-
ers. They showed that the choice of the job scheduling algorithm is more important
for the overall performance of the system than the choice of the processor allocation
strategy in hypercube systems.

Chen and Shin [64] have examined the performance improvement to be achieved
by relocating the allocated jobs to compact the system for a large free space. They
proposed a task migration strategy for the gray code allocation strategy [11]. Yu and
Das [65] have proposed another approach called limit allocation that scale down the

request size of an incoming job so that it ﬁts into a fragmented hypercube.

2.6 External Fragmentation

Processor allocation strategies for mesh multicomputers that avoid inter-job con-
tentions have followed a generalization of the traditional binary buddy strategy for
memory management [5, 66]. Using these strategies, partitions allocated to jobs have
a square submesh geometry, with the lengths of the sides of the submesh equal to
2", k 2 0. This restriction of the geometry of the shapes of jobs is not appropriate
when we consider a general mesh system. For a general system, jobs may request
irregular-shaped partitions as well as square-shaped partitions, such that the lengths

of the sides of a partition might not equal 2". Another drawback of a traditional

(“O
nan .

15

binary buddy strategy is that large internal fragmentation occurs if it is used for jobs
with irregular sizes and shapes. Internal fragmentation is the ratio of the number of
the processors that are allocated, but not used to the number of allocated processors.

A general recognition strategy is needed to identify available submeshes of ar-
bitrary sizes at any location in a mesh. One strategy proposed is based on frame
sliding [6]. In this strategy, a submesh is allocated that matches the shape and
size requested by an incoming job. Therefore, internal fragmentation is completely
avoided. As a result of its matching capability, this scheme can be used for general
sizes and shapes. Ding and Bhuyan [67] improved the performance of the FS strategy
by allowing the change of the orientation of incoming jobs. However, due to the irreg-
ular sizes of jobs, it is still difﬁcult to avoid large external fragmentation, which is the
ratio of the number of available processors to the number of processors in the system
when an allocation miss occurs. Addressing the problem of the ﬁrst-ﬁt allocation
nature of the FS strategy, Zhu [68] has compared a best-ﬁt (BF) strategy based on
certain heuristic with the ﬁrst-ﬁt (FF) strategy. Their results showed that neither of
FF nor BF can achieve superior performance than the other at all times in a dynamic
workload, and both strategies suffer external fragmentation.

By controlling the location where a submesh is allocated by the processor alloca-
tion strategy, Sharma and Pradhan [69, 70] tried to reduce the external fragmentation.
Their strategy searches free submeshes on the corners of allocated submeshes along
with the corners of the mesh system so as to aggregate allocated processors. This
aggregative allocation clusters the allocated processors, increasing the probability of
having big free partitions and of ﬁnding a sufﬁcient submesh to accommodate an
incoming job. Bhattacharya and Tsai [71] attempted to enhance the system per-
formance by an heuristic approach that looks into the queue of waiting jobs. They
argued that the heuristic performance is heavily dependent on the nature of jobs in

the waiting queue, and for a dynamic workload none of the heuristics that do not use

16

lookahead knowledge performs superior to the others.

Recently, Liu et al. [7] proposed a non-contiguous processor allocation strategy,
called the multiple buddy strategy. This strategy may allocate processors that are
scattered across the parallel computing system to a job. The authors showed that
the approach worked well, especially on an Intel Paragon because the communication
overhead to send a message on the system was relatively large in comparison to the
bandwidth of the network. Therefore, communication network contention between

messages was not a problem.

2.7 Rectangular Packing Problem

The processor allocation problem in a 2-D mesh can be interpreted as a theoretical
problem, called the rectangular packing problem. The rectangular packing problem
is known to be N P-complete [72]. Researchers have provided several approximation
algorithms for the problem and analyzed them theoretically.

The static processor allocation problem on mesh systems has been considered a
variant of the bin packing problem [73, 74, 75]. Coffman, et al. [73, 74] interpreted the
processor allocation problem on a l-D mesh, i.e.,array, system as a two-dimensional
optimal rectangle packing problem. Li and Cheng [76] applied the same analogy to the
2-D mesh allocation problem, interpreting it as a three-dimensional optimal rectangle
packing problem. The packing problem has been shown to be N P-hard even for 1-D
mesh, i.e.,array, systems that have only two processors [72]. Therefore, researchers
have concentrated on providing near-optimal approximation algorithms that have
polynomial execution time and a reasonably absolute and asymptotic performance
bound [73, 74, 77].

A decision version of the two dimensional optimal rectangle packing problem has

been studied extensively [78, 79, 76]. The decision problem is considered a formal

17

model of the system partitioning problem of mesh systems. NP-completeness of this
decision problem has been proved with different constraints; e.g., rectangle packing
concerning orientation [78] and square packing problem in which all rectangles are
squares [79]. Some heuristic polynomial algorithms for the decision version of the

rectangle packing problem have been proposed and analyzed [76].

2.8 Performance Studies of Wormhole Networks

Our analysis of job interactions in the following chapters is associated with the per-
formance studies of a wormhole-routed interconnection network under contention.
Many researchers have investigated the performance of wormhole-routed networks
from different perspectives. Among the many important studies we limit our discus-
sion mainly to the work that we have found to be the most related to the development
of our expressions for evaluating contention.

Dally [1] analyzed latency due to contention and hot-spot throughput of k-ary
n-cube communication networks for various dimensions under the assumption of con-
stant wire bisection. He developed an estimate for contention delay that is similar to
our work. By assuming e-cube routing in a k-ary n-cube network, the latency due to
contention is calculated by multiplying the probability of a collision with the expected
latency due to a collision along the dimensions that a message travels. At each dimen-
sion of the network, the probability that a message skips a dimension is considered
since there are n dimensions the message can travel. To compute the expected waiting
time of a message by collisions he developed a quadratic equation for the expected
waiting time for a collision. He compared measurements from a network simulator to
the latency predicted by his expression. He claimed that his simulation agrees with
the prediction within a few percent until the network approaches saturation.

Another important investigation related to our study was performed by Chittor

(0
[)0
ma

flit

bee

SpOI
loca:
5€Vez

COns‘.

sir .

18

and Enbody [80, 81, 82]. They showed that 2-D/3—D mesh networks provide much
higher performance than popular hypercube networks when the effect of contention is
negligible. However, in the case of large multicomputers that use mesh networks, con-
tention for network channels can be signiﬁcant. They studied the effect of contention
for a given mapping of parallel tasks on a set of multicomputer nodes. A metric called
the path contention level was introduced as a measure of contention and the quality of
the mapping. They showed that the effect of contention requires an upper bOund on
the rate at which messages can be injected into the network by a node. For a given
communication pattern and mapping they showed how to compute the saturation
point. They analyzed the case of random mapping and showed that random mapping
may not be advisable for large systems having hundreds or thousands of nodes that
use mesh networks.

A performance study of wormhole-routed mesh networks under no contention has
been studied by Adve and Vernon [83]. The authors proposed a closed queuing
network that includes message pipelining and blocking and the asymmetric virtual
channel. By using that model, they examined the performance and the scalability
of 2-D networks in which nodes can make multiple requests before blocking for re-
sponses, as well as for trafﬁc patterns that exhibit nearest-neighbor communication
locality. Seth [84] stochastically evaluated the performance of multicomputers with
several mesh network topologies and switching techniques when the channel width is
constrained.

In order to improve the performance of the wormhole networks under contention
situation, several techniques has been proposed. First, the packetization technique
has been investigated [85]. Packetization breaks long messages into a set of smaller
messages, each of which is transmitted separately. This techniques has advantages of
increasing throughput and a better distribution of message latencies, but the over-

heads of message fragmentation and reassembly are large. Next, to increase the

19

network throughput and to avoid the deadlock a technique has been employed that
divides the buffer storage associated with a network channel into several virtual chan-
nels [86, 87]. Virtual channels decouple allocation of buffers from allocation of chan-
nels by providing multiple buffers for each channel in the network. The performance
of wormhole networks using virtual channels has analyzed in [88]. Finally, adaptive
routing can adapt to dynamic changes of network conditions, such as congestion or
the fault of a node [89]. Therefore, a system using an adaptive routing has advantages
of taking another path the following conditions. While message trafﬁc is heavy on
one path, message latency can be reduced by sending messages along the alternate
paths. While a faulty node exists on one path, communication is possible by sending
messages along the alternate paths. Several adaptive routing strategies have been
proposed in [90, 91, 92]. Kim and Chien [93] investigated the effect of all the above
techniques on the performance of wormhole-routed networks under a workload of a
mix of short and long messages. They used an M / G / 1 queuing model to explore the

performance effect of message size in wormhole routing networks.

CHAPTER 3

A PERFORMANCE STUDY OF
WORMHOLE NETWORK

The communication performance of an interconnection network depends on the
switching technology of the network. In the switching technique called wormhole
routing, which is employed in most currently implemented MPCs, a message’s trans-
mission time is relatively independent of the distance that the message travels under
contention-free conditions [1]. This means that assigning computing nodes to a job
that are scattered across an MPC does not increase the transmission time. However,
contention for the communication bandwidth may signiﬁcantly increase the delay of
sending messages.

An important research issue of allocating processors in wormhole-routed 2-D mesh
multicomputer systems is to characterize the contentions caused by interactions be-
tween jobs. A performance study of the interactions between jobs can be reduced to
a performance study of the interactions between the communication paths of com-
municating processes that share channels. This is an analysis of the effect of the
communication of one path upon the performance of another path due to wormhole
routing. If communication paths pass through the same channel, then the communi-

cation time experienced by a message on a path can increase due to contention. The

20

21

amount of contention depends on the communication trafﬁc rates and the amount of
overlap in the communication paths.

This chapter describes the system and analytic models for a performance study
to analyze the contention delays that occur in contiguous and scattered processor
allocations. The goal of the performance study and our approaches to achieve the
goal are presented as well. The actual analysis of contention delays is presented in

the following two chapters.

3.1 System Model

A job considered in this chapter is composed of several parallel processes which com-
municate with other processes in a logical communication pattern. Each process is
allocated to a processor of the 2-D mesh multicomputer. The 2-D mesh system has a
large number of processing nodes, each node containing a processor, a memory, and
a router. The nodes are interconnected by bidirectional channels in the form of a
2-D mesh or a 2-D torus. The processors communicate by passing messages over the
interconnection network. We assume that a message is a packet, and the terms are
used interchangeably.

In the interconnection network, wormhole routing transmits messages between
processors assigned to each job. A survey of routing techniques for wormhole networks
has been presented by Ni and McKinley [89]. Each message in a wormhole network
is composed of a number of ﬂow control digits called ﬁits. The header ﬂit controls
the route of the message and the remaining ﬂits follow the header ﬂit in a pipelined
fashion. Once a channel has been acquired by a message, it is reserved for the message.
The channel is released when the last ﬂit of the message is transmitted on the channel.
If the header ﬂit encounters a channel in use by some other message, the header ﬂit is

blocked until the channel is released. When a message is blocked, the remaining ﬂits

in

22

in the pipeline stay in ﬂit buffers along the route. Messages between processors in our
model use a deterministic routing scheme known as X Y routing [25, 22]. The X Y
routing algorithm sends packets on the X -dimension ﬁrst and then the Y-dimension.
There are no virtual channels implemented in the model.

The time required to send a message from one node to another through the inter-
connection network is composed of transmission delay and routing delay. Transmis-
sion delay determines the lower bound of the communication delay and is dependent
on the switching technology used. For wormhole routing, the transmission time is
k0 + Ir] * Dist + k2 * (L — 1), where Dist is path length, L is message length and
k0, la; and k2 are system dependent constants. The ﬁrst term, 161, represents the ﬁxed
network overhead. The second term, k1 * Dist, represents the time for the header ﬂit
to set up a path and the last term, k2 * (L — 1), is the time to pass a message after
a dedicated path is established. Examples of the constant values used in the current
2-D mesh wormhole-routed multicomputer systems can be found in [94, 95].

Routing delay is divided into two parts: queuing delay and contention delay. Queu-
ing delay is caused by the stochastic arrival pattern of messages to a communication
system that can only serve the messages at a ﬁxed rate. This delay dominates the
communication time as the average rate of message arrivals to the communication
system becomes large. To remove this delay in our model, the rate of message gen-
eration is restricted by a source node. That is, the source node of a message does its
own local computation after the message has been sent to the destination, and the
computation takes at least as long as the message delay to the destination. The idea
of local computation in injecting messages has been employed in other work [96, 81].
Recent work in [97] suggests that this idea of limiting injection rate can actually
improve the performance of network. The average time of local computations by a
source node is called Mean Time Between Sending messages or MTBS. The inverse

of MTBS is called the mean communication rate. At the destination node, packet

In
on

of;

23

arrivals are taken from the network without any waiting time.

The second source of routing delay is contention. Contention delay occurs when
messages of two or more communication paths attempt to use a single channel at
the same time. Contentions are classiﬁed into two classes: internal contention and
external contention. Internal contention occurs when two or more routing paths
within a job try to use a physical channel at the same time. This type of contention
is caused by the rate and pattern of the communication of a job and therefore it is
an inherent property of each job. It can occur in both the contiguous and scattered
allocation models. External contention occurs when two or more routing paths of
different jobs try to use the same physical channel at the same time. When the
wormhole routing mechanism is used with the scattered allocation model, external
contentions cause additional delays in communication time.

For convenience, we call the contention delay seen by a message due to only inter-
nal contentions the internal contention delay of the message. The contention delay
due to a combination of internal contentions and external contentions is called the
(general) contention delay of the message. The internal contention delay added to
the queuing delay and the transmission delay is called the internal communication
delay of the message. The communication delay that is computed by using the gen-
eral contention delay instead of the internal contention delay is called the (general)

communication delay of the message.

3.2 Analytic Model: The Multipath Contention

Model

In this section, our analytic model is proposed to analyze the effect of job interactions
on communication performance. Consider the example illustrated in the part (a)

of Figure 3.1. It is a snapshot of a 6x10 2-D mesh system that allows scattered

24

allocation. Currently, three jobs are allocated to the mesh system. Job 2 is assigned
to a contiguous submesh, and Job 1 and Job 3 are assigned to scattered submeshes.
We are interested in how Job 2 interacts with Job 1 and Job 3 when it shares routers
and wires of its allocated processing nodes. The long arrows illustrate an example of
three paths sharing physical channels between the nodes. P1, P2, and P3 belong to
Job 1, Job 2, and Job 3, and the MTBSs of those paths are mtbshmtbsz, and mtbsg,
respectively. The part (b) of the ﬁgure redraws the paths in the form of our analytic
model that has a stair-layered pattern.

A contention may occur when messages of P1, P2, or P3 attempt to use a single
channel at the same time. The location at which the contention may occur is called a
contention point, and the paths that compete for a physical channel at the contention
point are called the competing paths. The ﬁrst contention point at the ﬁrst com-
munication channel that a message faces when it is injected from a processor to the
network is called starting contention point. The remaining contention points along a ~
communication path are called the intermediate contention points. For example, in
the part (b), the ‘0’ symbols on P2 illustrates the starting contention point due to
P3, and the ‘X’ symbol on P2 illustrates the ﬁrst intermediate contention point due
to P1.

Let us examine DP“, which is the communication delay of P2. The part (c) of
Figure 3.1 determines the time related to the communication delay. If there are no
contentions, DP2 is merely the transmission delay, which is approximately (d2 + d3 +
L) * T,, where T, is the ﬂit transmission time, (d2 + d3) is the distance of message
passing, and L is the number of ﬂits in a message. If L >> (d2 + d3), then the
transmission delay is approximately L * Tt. However, P2 has possibilities of external
contention in two places. The contention delay at the ﬁrst contention point of P2 is
called the starting contention delay of P2 and is labeled CI. The next contention may

occur with P1 where messages on P1 are injected into the network. This contention

25

(a)

 

       
 

P1 ( mtbsl )
P2 ( mtbs2 )

( mtbs3 )

Figure 3.1. A snapshot of a 6x10 2—D mesh system with three interacting jobs; A
close-up view of a 3—path contention is shown.

26

delay labeled 02 is called an intermediate contention delay of P2. The communication
delay of P2 will be the sum of the transmission delay and the contention delays, which
is approximately Cl + 02 + L * T1.

Our analysis uses a contention model called the multipath contention model to
study performance degradation due to interactions between jobs. The contention
model is a representation of arbitrarily overlapped communication paths of several
jobs. Suppose that the model has m competing paths and let P1,...,Pm be the m
competing paths. The m competing paths can be rearranged by sorting in the order
of the occurrence of the starting contention points such that the m—path contention
model forms a stair-layered pattern, as illustrated in Figure 3.2. Since the transmis-
sion delay is ﬁxed and relatively independent of the distance traveled by a message
and the effect of queuing delay is removed in our model, the only cause of variance
in the delay of a message is due to contention delays. Therefore, without loss of

generality, the paths in the contention model are assumed to be sorted in the order

of the occurrence of the starting contention points.

 

’— .............

 

 

 

 

 

 

P, r ""‘i"">
P. rs .2 : >
’5 Pl 'l : i :
p 0:. C3. 6.: ..... E , E '
l I I " : Terminaan' g
1:”.le r 3: l > “MIMI"
P L 3: .2 .2 a: % >

ooooooooooooooo

Figure 3.2. Stair-layered m-path contention model.

 

 

 

OI

te:

“fl
the

C0;

27
3.3 Goal and Approaches

Our goal is to predict D1D 1' for all j, which is the communication delay seen by a
message sent through Pj. In our model, DPi is composed of contention delay and
transmission time. The contention delay of DPJ' , which is CID 1', can be computed by
accumulating C5131 3 for all i, which are the contention delays at the ith contention point
on Pj. Note that with the exception of Pm, P,- has j contention points. Therefore,

DP 1' can be written as follows,

i=j
DP: = L*T.+Zcf”

i=1

where Tt is the ﬂit transmission time, L is the number of ﬂits in a message. Since the
message transmission time is a known constant when the size of the system is known,
what we have to determine are Cf’ , for all i.

To compute 0,?” , two different approaches are presented in the following two chap-
ters. The ﬁrst one in Chapter 4 employs queuing theory, considering the communi-
cation in a wormhole-routed network a stochastic model. This approach is simple for
deriving formulas, but it is appropriate only for the multipath contention model in
which all paths have the same MTBS. Thus, this approach may be used to compute
the internal contention delay of a job that is allocated in a contiguous region. In
contrast, the second approach in Chapter 5 is more complex to derive formulas, but
it can be applied to general contention model whose competing paths have different
MTBSs. This approach uses a divide-and-conquer strategy that divide the mulitpath
contention model into manageable several 2-path contention models and combines

them.

CHAPTER 4

ANALYSIS OF INTERNAL
CONTENTION DELAY

As a ﬁrst attempt to examine the interference and detrimental effect that can occur
when multiple jobs share communication bandwidth on a MPC, we apply queuing
theory to develop a set of formulas for evaluating internal contention delay of commu-
nication paths in a wormhole network. The detrimental effect of contention caused
by interference between competing paths has led us to analyze two different kinds of
communication contention. As described in the previous chapter, starting contention
occurs when a processor attempts to access the network at the ﬁrst hop on its route
from the source to destination. Intermediate contention is the contention facing a
communication path as the message arrives at intermediate nodes on a path. The
starting contention of a path will increase if the processes on the path are allocated
to processors of the multicomputer that is internal to the other communication paths
that occupy surrounding processors. Conversely, the intermediate contention of a job
increases if the processes of the job are allocated to nodes that are dispersed across
the multicomputer and around processors allocated to other paths. The evaluation of
starting and intermediate contention may provide valuable insight on how a processor

allocation strategy should assign jobs to submeshes.

28

of

f6

29

We develop formulas for evaluating starting and intermediate contention that a
job faces without regard to the interactions with the other jobs. They are based on
a stochastic model of communication in a wormhole-routed network. This formula
works for a wide range of communication trafﬁc rates, which includes rates that go
beyond a saturation point. This analytic study enables us to analyze internal con-
tention delays of a job that have a complicated communication pattern. We compare
the evaluation of our formulas to a simulation of the communication network and show
that our analysis yields very good results for predicting contention within wormhole
networks.

The rest of this chapter is organized as follows. Section 4.1 describes the ana-
lytic model on which we analyze the contention delay within a job in Section 4.2.

Concluding remarks are given in Section 4.4.

4.1 The Homogeneous Multipath Contention
Model

There are two parameters in the multipath contention model that affect the utilization
of a shared channel: the number of competing paths and the message communication
rate. By restricting each path to have the same message communication rate, we
can make a speciﬁc multipath contention model called the homogeneous multipath
contention model. The analysis of the homogeneous mulitpath contention model
provides a way to predict the internal contention delay seen by a message in a job.
Figure 4.1 illustrates an instance of the homogeneous multipath contention model
that has ﬁve layers. The ‘0’ symbols represent starting contention points and the
‘X’ symbols represent intermediate contention points. To analyze the homogeneous
multipath contention model, we deﬁne the following notations for convenience. The

notations are illustrated by using the middle path.

t

 

I
--------.f.----.

 

 

  

00000

 

4%

 

I

I

I
-L

 

 

 

 

 

 

It

Starting contention point of P 2',

: CPI = CPS

Two intermediate contention

pointsof P2,: : CP2 = CPI”, <..
C'P3 = CPm)

Two competing paths of P 2’,

at starting channel : P3, 1! P 4,0

' ° Starting channel of P 2.2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

rm; 4: PM

A No contention at starting channel

 

0 Starting contention point
X Intermediate contention point

Figure 4.1. A homogeneous ﬁve-path contention model.

31

0 PW; represents a path that encounters d competing paths at its starting channel
and has u intermediate contention points. For a given path P, u is the number
of paths stair-layered above the path P and d is the number of paths stair-
layered below the path P. For example in Figure 4.1, PM is the top path, P43

is the bottom path and P2,; is the third path from the top.

0 CP,-P is the ith contention point on path P. Figure 4.1 shows all contention

points on the path P”.

o Pr,P is the probability that a contention occurs at C P,” . This is the contention

probability, which is equal to the channel utilization of paths that compete with
P.

P

o Ct,P is the mean contention delay at CPt-P. C tmazr, represents a maximum

expectation of Ctr.

o C P "'4 is the mean contention delay of a message sent on path PM]. It is computed

by adding all contention delays at each contention point on PM (i.e. 22:: Pr:D =1:

Pu“; P

Ctr). CmamPu'd is a maximum expectation of C . Cmax M is composed of

Cmazspu" (the maximum expectation of the starting contention delay) and

Cmaa: [P “v“ (the maximum expectation of the intermediate contention delay).

0 DP“ is the mean communication delay of a message sent on path PM; and

Punt.

is the sum of the transmission delay and the contention delay, C In our

communication mode], the transmission delay is a constant, L at T, where L

P

is message length and T. is ﬂit transmission time. Dmax "'4 is a maximum

expectation of DP“.

0 M TBS is the mean time between the sending of a message.

32

0 AP“ is the mean message arrival rate of PM. Note that due to the assumption

Pu,d

of ﬂow control, AP“ is not constant, but a function of D . For convenience,

we use And.
0 A?“ is the mean message arrival rate at the starting channel of PM].

0 A22; represents the mean message arrival rate at the kth intermediate channel

Of Pug.

4.2 Analysis

Consider a path, P, that has I: contention points including the starting contention
point. The expected contention delay of a message sent through path P can be
computed by adding the expected contention delays at the contention points on the
path. The expected contention delay at the ith contention point on P, Cf, is given
by multiplying the probability of a contention at the point, Pr? by th, which

is the expected contention delay when the contention has occurred. The expected

communication time of the path P, DP, can be written as follows:

i=1:
DP:L*Tt+ZPT,P*th (4.1)

i=1

The probability of contention at the ith contention point of P can be interpreted
as the probability that the channel is used by the competing paths. To compute
this probability, we must know the number of P’s competing paths at the contention
point. By observing the Figure 4.1, we recognize that the number of competing paths
at the starting contention point, called the starting contention level, can be larger than
one. The number of competing paths at any intermediate contention point, called
intermediate contention level, is always one. Therefore, we decompose the summation

that includes th into the starting contention delay and the intermediate contention

33

delay.

4.2.1 Intermediate Contention Delay

Figure 4.2 illustrates how to compute the intermediate contention probability and
delay for the path, P23, using the ﬁve layer communication model of Figure 4.1. First
we concentrate on the second contention point of path P2,; which we call CPI”.
Since CPzpz'2 is the ﬁrst intermediate contention point of path P23, C P; 2'2 is named
C P521”)2 . It has only one competing path, P13. To compute the contention probability
at C PET)” due to P13, which is called Prfff) , we evaluate the utilization of C P1P "3 due
to P13. We can compute Prﬁl”) by approximating the utilization by referring to a
simple M / M / 1 queueing model that uses the channel capacity at C P1P1 ’3 as the server.‘
Message arrivals are generated by P13 as shown in the small box (a) of Figure 4.2.
The mean arrival rate, A?” , is the inverse of the mean inter-arrival time that depends
on the mean communication time of the messages generated through P13. In fact we
do not know the mean communication time, DP 1'3. Instead, we approximate DP 1-3 to

P1

the expected maximum communication time, Dmax 3, as the maximum bound of

the communication time. Suppose we know DmazP 13. Then Afl'a is given as follows:
Af” = 1/(MTBS + Dmame) = A13 (4.2)

The mean service rate, pf” , depends on the mean service rates of the other con-
tention points on P13 due to wormhole routing. If a message header is blocked at one
of the later contention points, then the entire ﬂit stream is blocked. Therefore, the

service time equals the mean communication time of P13, which is DP 1’3. If we use

 

‘The queueing model is only an approximation of the system. We evaluate the quality of the
approximation when we later compare it to a simulation model.

34

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

p s: P 2»
1,3 _ 1,3_ 4,0
7‘1 ‘7‘” 3”] ‘llm
(message (channel ‘J L
arrival - service 132 P
’ — 2,2

rate ) rate ) A s — 13,1 + 7‘40 s = Ll. 3,0

(message (channel
(a) arrival service

rate ) rate )

 

 

 

(b)

Figure 4.2. Illustration of Computation of an Intermediate and Starting Contention
Probability.

Dr.

Th'

oil

We

Will

I
Pr,

COD

it it

that

lhe

Cam

951th

35

3

DmaxP 1’3 as a estimate of DP 1'3, then pf" is computed as

pf” = 1/DmarP1-3 2 111,3 (4.3)

Therefore, by means of queueing theory, we can approximate the channel utilization

of GP}??? due to P13 by dividing A?" by pf”. That is
P P P
Prlfii = ’\11'3//‘11'3 = /\1,3//11,3 (4.4)

We ﬁnd the probability of contention at the other intermediate contention points of
path P2,; in a similar manner.

Next, we need to compute the expected delay at an intermediate contention point,
which is the term Ctr of Equation 4.1. When a contention occurs at CPI??? with
Prffi’), the expected maximum contention delay is the expected maximum channel
reservation time of the conﬂicting message on P13. In this case, however, the channel
has been reserved by the message and therefore the other competing paths fail to
acquire the channel. We do not need to consider P23, P3,1 and P”. Therefore, the
contention delay at the ﬁrst intermediate contention point on path P23 is C tfa”), which
equals Dma$P1v°, the maximum delay on a path that has one competing channel above
it in the stair-layered pattern and no competing paths below it. DmaxP1v° is a value
that we can compute. It is computed by adding the transmission delay of Pm, which
is L * Tt, with the expected maximum contention delay of Pop, which is L * Tt. Note
that when a contention occurs with the path PM at Cf” , the channel is reserved for
the maximum time, which is L at T,.

We can generalize the computation of the expected maximum contention delay

caused by u intermediate contention points on PM; in a model with basic communi-

cation pattern of lv layers, where lv = u + d + 1. Let Cmarf‘“ be the sum of the

0U
lit“

I‘m

Ea.

{10

an

me

[DC

36

d

expected maximum intermediate contention delays. Then Cmamf'“ is computed as

follows:

k=u
Pu, Pu,
= 2(Pr1(k‘)’* Ctmcil

k=1

P
C max I“

k=u

= ZUk—uv—k/ﬂk—uu—k) * Dmawpk"’°
k=1

where Ak-lJv-k = 1/(MTBS + Dmarpk-1"”-")

and uk_1,1v_k = 1/Dmaa'P"-"‘”"' (4.5)

To verify the formula, we simulated the basic communication model of Figure 4.1
while varying the number of layers. We used the Multisim [98] simulation package for
modeling wormhole-routed multicomputer networks, which is based on CSIM [99]. For
our simulation study, we assumed that the communication bandwidth of the wormhole
network requires 0.05 psec to transmit each ﬂit of a message. Since the contention—
free transmission time in the wormhole network is assumed to be a constant, we set
the length of a message transmitted in the network to a large value, 500 ﬂits. This
means that the transmission delay of a message is approximately 25 psec regardless
of the distance traveled. The simulation results are represented by the solid plots and
the analytic results are done by the dotted plots.

Figure 4.3 compares the prediction using Equation 4.5 with the simulation results.
Each plot in the ﬁgure speciﬁes the number of levels in the stair-layered communica-
tion pattern. The vertical axis represents the communication delay and the horizontal
axis represents the MTBS of messages. Our formula does a good job of predicting the
maximum bound of intermediate contention delay for the entire range of MTBS, which
includes the rates when saturation occurs. Note that the estimation of prediction ob-
tained from Equation 4.5 (the plots labeled “prediction” in Figure 4.3) compares very

well with the simulation results when MTBS is very large and very small. This is

37

 

 

 

500 “f l l I l 1
450 - o simulation level=1 <> '—
400] 1 prediction level=1 e— —-
Mean Delay 350 <2 “ simulation level=3 ~o- - _
'- prediction level=3 —o—
of 300 ' ; simulation level=5 ~+~ - I
250 - o, . prediction level=5 —+— -
the Last Path200 _ , '. simulation level=7 ~o —
150 _ '. 1° \ prediction level=7 -o— _
1001. +5.. ~ _ -
50 _ ‘. ¢.¢M'v"="‘vvemirtgvgvgvﬁve’vrv"v_v“v—.7‘ V v
0 100 200 30 400 500 600

Figure 4.3. The Maximum Intermediate Contention Delay: C maxf”.

because at these values DmarPhlJv-k is a very good estimate of DPk-lvlv-k. The es-
timate is not as accurate at intermediate values of MTBS, but nevertheless provides

a good bound for the intermediate contention delay.

4.2.2 Starting Contention Delay

The number of competing paths at a starting channel can be larger than one. To
compute the channel utilization of the starting competing paths, we cannot use the
M / M/ 1 queueing model directly. We must use a queueing model that has multiple
input sources. Assume there are n competing paths. Let S.- be the random variable
of the message injection rate for ith path and let As, be the mean value of 5;. Then,
the message injection rate at the starting channel of the path is 2le 5;. If the ith
path generates messages according to a Poisson arrival process, we can prove that

2;, 5'.- also follows an Poisson process whose mean value is 2;, A3,. by using the
moment generating function technique [100]. This is an approximation to reduce the

multiple input queueing model to a M/M/l queueing model. The validity of this

38

approximation is veriﬁed by the simulation results in Section 4.3.

Let us again consider the example illustrated in Figure 4.1. P23 has two competing
paths at its starting channel, P3,1 and P44), as shown in Figure 4.2. The mean arrival
rate at the channel by P31 is same as the mean arrival rate of P3,1 which is A3,].
Note that all mean arrival rates at each contention point on P3,1 are equivalent to
A3,] because the routing technology is assumed to be wormhole routed. Similarly the
mean arrival rate at the channel by P43 is Aw. Therefore the combined mean arrival

rate at the channel by P3,1 and P43 is
AP” — A A
s — 3.1 + 4,0 (4.6)

where dual = 1 / (M TBS + DmaxP"»d). The mean service time for a message at the

channel server is

P
#52'2 = #3.0 (4-7)

2

where 113,0 2: 1/DmaxP3»°. Note that ”1532’ is neither #3,“ #4,.) nor a combination of

two, but 113,0. These rates are shown in the small box (b) of Figure 4.2. Once A?”

2

and #15’2, are provided, we can easily compute the contention probability at CF?”2

by dividing Afl'a by pf”, that is
Pr?” = Ate/u?“ = 0.1+ WW3.) (4.8)

If a contention occurs at CP‘?’2 by a message sent through one of the competing
paths, the channel is reserved by the message. The expected maximum reservation
time is the expected maximum communication time of a message on P2,; when no

132,0

contention occurs at the starting channel, which is Dmar . By multiplying this

delay by the contention probability, we can predict the maximum contention delay at

P O . Pu
CPS”2 on P23, Wthl’l lS Ctmams 'd.

39

In general, the expected maximum contention delay caused by d competing paths
at the starting channel of PW; in the basic communication model with lv layers is

Cmaxgm", which is computed as follows. Note that lv = u + d + 1.

P P P
Cmaxs"'d = Prs“"’*Ctma:cS‘"

=u+d
= ( Z Ak,lv—k—l/I1u+1,0)*Dmaxup

=u+l

where Ak’lv-k_1 = I/(MTBS + Dmaxk,1v_k_1)

d

k=u+l,...,u+d

and #u+1,o = l/Dmaxp“+"° (4.9)

It is worthwhile to explain how to solve Equations 4.5 and 4.9. Notice that Equa-
tion 4.9 uses Equation 4.5 to compute the Dmax’s and Equation 4.5 uses Equation 4.9
to compute the A’s and u’s. We can solve the equations by a numerical method of
convergence. If we use AF”, and ﬂk_1,o instead of Ak_1,;.,_k and uk_1,1v_k respectively,
then the equations can be computed in backward fashion by starting the summation
when k=1. By using this solution as a starting point, we can converge to the correct
solution by iteratively solving the two equations based on the previous solution.

Figure 4.4 show the starting contention delays as a function of MTBS of messages
for our basic communication model as the number of layers varies. The ﬁgures com-
pare the prediction of starting contention delay due to our developed formula and the

. . P
Slmulatlon results of Cmascs""’.

The top and bottom ﬁgures respectively show the
simulation and prediction results by overlapping the lines for the model with lv 2
l, 3 or 13. The trend of the plots of prediction is very similar to that of the plots
of simulation for all lv and for the entire range of MTBS. In the model of thirteen
levels, contention delay is very high for small values of MTBS until MTBS is less than
300. Beyond MTBS=300, the contention delay decreases slowly. This tendency gives

us insight of how internal contention within a jobs effects its performance and how

40

 

     
     

 

 

 

 

 

 

 

 

60 I I I I I
55 * eve =1 A— —
504 eve =2 6— _
Mean Delay eve —3 +-
45 ‘ v", 5 1— 3
0f 40 _ eve =7 -°— g
the First Path ~ level=13 *—
35 I- o \\
30 ° ‘ '
25.1fm. ‘1 HM_§————-.
0 100 200 300 400 500 600

MTBS of messages

60
55

Mean Delay 504
45
of

40 -
the First Path 35 _

30
252
0

 

 

 

 

 

 

 

 

 

._. ”5'30“
MTBS of messages

Figure 4.4. The Maximum Starting Contention Delay, Cmaa:§"'d

: top — simulation
results, middle - prediction, bottom - combination.

41

independent jobs executing on a multicomputer can affect the performance of each
other. If one job uses processors that are external to the processors of another job,
then the messages between processors of the external job will pass through the paths
used by the internal job. The performance degradation experienced by the internal
job due to the contention caused by the external job’s messages can be signiﬁcantly

increased as the MTBS experienced by the internal job decreases.

4.3 Results

In order to verify the analysis of our homogeneous multipath contention model, we
compare our analytic results with a simulation model for several conﬁgurations. For

simulations, We used the same simulator as before.

4.3.1 Contention Delay of a Path

We can compute the expected maximum contention delay of a path by adding the
two formulas for the starting contention delay and for the intermediate contention

delay from Equations 4.5 and 4.9. Therefore,

Omani)“ = Cmaxs‘“ +Cmam1""’
_ km” Ala—Iv k- 1 “A1,- 11%,, P_
— ( Z —)* *,DuO + :— *Dmaz " "0 (4.10)
k=u+1 ﬂu+1.o k-1 ,uk— 1,lv-lc

The average contention behavior of the basic communication pattern can be ob-

tained if we average the expected maximum contention delays of each paths as follows:

k=lv-1 Pkl-k—
P_ k=0 Cmarr v" 1

lv

 

Cmaa: (4.11)

Figure 4.5 shows the average behavior of our stair-layered communication pattern

42

 

 
 
 
  
   
  

200 l I n I
180 Simulation(level=3) ~o- - 2
160 Prediction(level=3) +— _,
Mean Delay Simulation(level=5) -+- -
140 Prediction(level:5) —l— 7
Of 120 Simulation(level:9) . _
the Last Path 100 Pred1ct10n(level=9) —o— ..

 

 

40
MTBS of messages

Figure 4.5. Average Contention Delay.

by varying the number of paths. Our predictions follow the simulated average con—
tention delays over the entire range of MTBS, representing their maximum bounds.
Note how the predicted values of contention delay compares with the values obtained
through simulation. Equation 4.10 provides an excellent basis for evaluating con-
tention within a job on a multicomputer using wormhole routing. It can be used as a
basis for evaluating contention between independent jobs executing on a wormhole-

routed multicomputer.

4.3.2 Internal Contention Delay of a Matrix Transpose Pat-

tern

We apply our formula to a more complicated example, called the matrix transpose
communication pattern. This communication pattern illustrated in Figure 4.6 with
4-by-4 matrix has been used by Chittor and Enbody [101]. In this pattern, a subset
of nodes are actively sending messages. Only the nodes on the diagonal do not send

messages. Each active node has a related node with which it exchanges messages

43

where

(Iii) = (ICL. SCL)

 

Figure 4.6. Transpose Pattern.

repeatedly. The communicating pairs of nodes using the matrix-transpose pattern
are speciﬁed as

Nodem- +—> Node,”-

where N ode“ indicates the node in the submesh at the kth row and lth column. The
speciﬁcation assumes there are n rows and 77. columns assigned to the job, with rows
and columns numbered from 0. ..n — 1. The symbol 4——> indicates bi-directional
communication. This pattern has a high potential for communication contention.
Every message generated by the job will use a path that is shared by other pairs of
processors in the job. For convenience, we will refer to the matrix-transpose pattern
simply as the transpose pattern.

To compute the expected maximum contention delay of a pattern, ﬁrst we have
to know the average intermediate contention level (I C L) and the average starting

contention level (SC L) of the pattern. The average I C L can be calculated as follows:

i=N
ICLpattern = (Z ICLP‘)/N (4.12)

i=1

where N is the number of paths in the pattern and P,- is the ith path. Similarly, the

44

average 5' C L of a pattern is

i=N
SCL”““"" = (Z SCLP‘)/N (4.13)

i=1

Once we evaluate the I CLpatte’" and SCLpattem, then the expected maximum
contention delay can be evaluated by using Equation 4.11. For example, the 4-by-
4 transpose pattern displayed in Figure 4.6 has 12 paths. The pair of numbers at
the beginning of each path indicates the ICL and SCL of the path. Therefore the
average ICL and SCL of 4—by-4 transpose pattern are both 0.67 by the Equations 4.12
and 4.13.

Since this result is not an integer, the maximum expected valued of contention

delay, Cmax4’4t'an‘po”, can be calculated by an interpolation as follows:

k=0
Cmax4t4transpose = Z CTRGIBP'm—k—l
k=0

k=2 k:0
+ 0.67 * (Z Cmaer"'2"‘ — Z Cmaxpk'1'k“) (4.14)
k=0

k=0

That is, we interpolate between the case when I C Lpatte'" = SCLpatter” = 0
and I CLpattc'” = SCL"°“°"" = 1. Note that the basic stair-layered pattern has
5 C LP““°"" = I C Lp‘m‘m = 0 when there is one layer and S C Lpat‘e'" = I C Lpatte'” = 1
when there are three layers. The term, 2::0 C marP'm-k-1 , represents C maxb°3’c”°“”"
when I CL = SC L = 0, which implies that the number of layers, lv, is one. The term,
[:3 CmaxP*-2-", represents Cmarba‘kpattem when I C L = SC L = 1, which implies
lv is three.
Figure 4.7 shows an evaluation of Equation 4.14 for the transpose pattern in
comparison to values of contention delay obtained by simulation. In addition, we

evaluate the contention delays from our formula and simulation for transpose patterns

are I
Xstf

I
an;

CD If]

45

 

 

 

 

100 l q I I I l

90 — Simulation for 4*4 transpose -0- - —

Prediction for 4*4 transpose +—
Mean Dela 80 ‘ . . ‘ Simulation for 6*6 transpose -+- - "
y 70 F -_ Prediction for 6*6 transpose +— _

of , - 0 ~- Simulation for 12*12 transpose -o- -
60 “if '. . Prediction for 12*12 transpose -o— —

Transpose Pattern . -. 1 .
50 '. '. o. ’l . ‘
. . . ‘ _
40_.. + —I
g. . . O ....... ‘ ‘ ~ . _ . , ‘ ~ ,
3° " "' ' '- 51.3 - 3". """""'.x~__9§.‘
0 200 400 600 800 1000

MTBS of messages

Figure 4.7. Prediction for Matrix Transpose Pattern.

of larger submeshes: 6-by-6 and 12-by-12. The average I CL and SCL of a 6—by-6
transpose pattern are 1.3. For a 12—by—12 transpose pattern the average I C L and SC L
are 3.3. Interpolation was used to calculate the contention delays for both patterns.
Notice that how well our formulas bound the contention delays found by simulation,
and that our formulas provide a very good means for predicting contention for this

complex communication pattern.

4.4 Summary

We examined the interactions that occur within a job. We used queueing theory to
develop a set of formulas for evaluating internal contention delay of communication
paths in a wormhole network. The expected internal contention delay at the jth con-
tention point on ith path was computed by multiplying the probability of a contention
at the contention point by the expected contention delay when the contention occurs.
The probability of contention with a path can be interpreted as the probability that

the channel is used by other competing paths. This probability was estimated by

46

means of queueing theory. An interesting technique applied in the computation of
the contention delay is the separation of the calculation of the expected contention
delay at the ﬁrst starting contention point on a path from the calculation of delay at
the subsequent intermediate contention points.

We proposed two metrics that can be used when one wants to measure internal and
external contentions between jobs in a multicomputer. These metrics are called the
starting contention level and intermediate contention level. Based on the metrics, the
internal contention delays of a stair-layered pattern and a complex transpose pattern
were predicted and veriﬁed by means of simulation. According to the results, as the
starting contention level increases, the communication increases for a wide range of
communication rates. The amount of increase in communication delay depends on
the rate of communication as well as the contention delays facing the external paths
that contend at the starting point. Nevertheless, the detrimental delays due to the

intermediate contention level primarily occurs only at high rates of communication.

Ila
r0:

lﬂf

int

li‘.‘

iDI

4

CO

CHAPTER 5

ANALYSIS OF GENERAL
CONTENTION DELAY

This chapter provides an analysis of job interactions to predict the contention delay
that can occur for a generalized model of job interactions in a 2-D mesh wormhole-
routed multicomputer system. The general job interaction model, which is called
the heterogeneous multipath contention model, is a representation of arbitrarily over-
lapped communication paths of jobs that have different message injection rates and
individual communication patterns. Based on this model, we analyze the degrada-
tion of communication performance due to multiple interacting jobs in a 2-D mesh
wormhole-routed multicomputer system. We compute the contention delay seen by a
message on a path in the heterogeneous multipath contention model. A divide-and-
conquer strategy divides the problem into several manageable problems of computing
the contention delay for the heterogeneous 2-path contention model.

The rest of this chapter is organized as follows. The heterogeneous multipath
contention model and our strategy to analyze the model are presented in Section 5.1
and Section 5.2. Section 5.3 analyzes the heterogeneous two-path contention model,
which is used for the analysis of the heterogeneous multipath contention model. Our

approach and expressions for predicting the general contention delay due to job inter-

47

at

01

I“.

70

[92‘

C0

48

action are presented in Section 5.4. Section 5.5 compares the results of our analysis

with simulation. Concluding remarks are given in Section 5.6.

5.1 The Heterogeneous Multipath Contention
Model

Our analysis uses a contention model called the heterogeneous multipath contention
model to study performance degradation due to interactions of several independent
jobs. The contention model is a representation of arbitrarily overlapped communi-
cation paths of different jobs that have different MTBSs. Suppose that the model
has m competing paths that have independent MTBSs, where m 2 2. Let P1,...,Pm
be the m independent competing paths and mtbsl,...,mtbsm be their MTBSS, respec-
tively. Without loss of generality, the paths in the contention model are assumed to
be sorted in the order of the occurrence of the starting contention points, as done in
Section 3.3. Figure 5.1 is an illustration of a heterogeneous m-path competing model
for our analysis.

Our goal is to predict DP 1' for all j, which is the communication delay seen by
a message sent through Pj. In our model, DP 1' is composed of contention delay and
transmission time. The contention delay of DP 1', which is CPI, can be computed
by accumulating Cf’s for all i, which are the contention delays occurred at the ith
contention point on Pj. Note that with the exception of Pm, P,- has j contention

points. Therefore, DP 1' can be written as follows,

i=j
DP: =L*T.+ZC,-P’

i=1

where T: is the ﬂit transmission time, L is the number of ﬂits in a message. Since the

message transmission time is a known constant when the size of the system is known,

49

 

 

 

 

 

 

 

 

 

 

 

 

 

Node

Figure 5.1. The heterogeneous m-path contention model.

what we have to determine are Cip’ , for all i. The rest of this chapter develops the
formula to compute Cf,” and DPJ.
Before proceeding further with the analysis of the heterogeneous m-path con-

tention model, we deﬁne the following notation for convenience.

0 P1,. . . ,Pm are the m independent competing paths in the system, whose MTBSs
are mtbsl, mtbsg, ..., mtbsm, respectively. As illustrated in Figure 5.1, P,- is

assumed to be located at a higher layer than Pk, where k 2 j + 1.

o S,- is the entrance channel of P.- and is called stage i. The stages are numbered
from 1 to m in the reverse direction that the message is sent. 52-1 is the stage
of the next possible contention after 3;. Between 5.- and S-_1, there may be
channels that have no contention points. Note that a model with m-paths has

m stages.

0 D" is the transmission delay for a message in the model. It is a constant

(= L * T,) in our model.

50

o Cg." is the contention delay at 5,- on P1,. C951. is the summation of the contention

delays from 5.- to 5,- on P1,, including the contention delays at 5; and Sj.

0 D3}, is the communication delay between 5,- and the destination node on Pk.
It includes Cg“. Note that Dﬁg'd is Dt, for all k. Thus, Dgf’d = C351 + D",

Pk _ PI: PI:
and C535,.“ — D “d - Ds,,d-

o T5,,p, represents the mean waiting time between two successive messages passing

through 5,- on Pj. Note that rshp, is mtbs;.

o Ashpj is the mean message arrival rates on P,- at 5;, where 1 S i S j S m.

5.2 The Divide-And-Conquer Strategy

This section presents a divide-and-conquer strategy for predicting the communication
delay seen by a message on P,, a path in the heterogeneous m-path contention model.
Consider P, of Figure 5.1. The number of competing paths of P,- at its starting
contention point can be larger than one, but the number of competing paths of P,-
at each subsequent intermediate contention point is just one. This is the reason that
we compute the starting contention delay separately from the subsequent contention
delays. If the several starting competing paths of PJ- (i.e., Pj+1,...,Pm) can be
reduced to a single competing path, called Q, that produces a similar amount of
contention delay as the original competing paths, then the contention delay seen by
a message on P,- can be computed, as in Figure 5.2, by adding the contention delays
at the X -marked contention points of the reduced model.Consequently, the problem
of computing the contention delay seen by a message on a path in the heterogeneous
multipath contention model can be solved by dividing the problem into several smaller
problems, each is the computation of the contention delay in a heterogeneous 2-path

contention model. To obtain the contention delay on P,, the contention delays of the

51

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

KN KN__ __KN KN
. '1' > I:
.X ‘ air—+9 P.

U v“ “v \_»_/

Figure 5.2. The reduced contention model for P,-.

smaller problems are simply added.

The computation of starting or intermediate contention delay in a heterogeneous
2-path contention model is not trivial if the 2-path model is the part of a heteroge~
neous m-path contention model; the computation for the 2-path contention model
requires several parameter values that could be obtained when computing the m-
path contention model. This problem is resolved iteratively as follows. Based on the
parameter values obtained at the (n — l)th iteration, for all P,- the nth iteration com-
putes DP1(n)s,,d, and C (n)? stage by stage from 51 to 5m in the backward direction.
These values are again used for the (n + l)th iteration. The iterative computation
proceeds until steady state values are reached for C (n)? When the iteration reaches
steady state, the communication delay seen by a message on P,- is obtained by adding
the contention delays at all stages along the path. Computation in the forward di-
rection is impossible because the contention delay at 5,- is effected by contentions
that occur at 5;, for all k greater than i. To construct the basis of the ﬁrst iteration,
the initial iteration computes D1D J(0)3,- ,d, and C (0)? stage by stage, by ignoring all
starting contentions but including intermediate contentions.

Section 5.3 presents the detailed analysis and formula to predict the contention

delay in a heterogeneous 2-path contention model. Based on Section 5.3, Section 5.4

analyzes the starting and intermediate contention delay on Pj in the heterogeneous

52

m-path contention model at stage i of the nth iteration, assuming that D(n — 1):],d

and C(n — 1):: are known at the (n —— l)th iteration, for all 1 _<_ i Sj S m.

5.3 Analysis of the Heterogeneous 2-Path Con-

tention Model

This section describes the analysis of the contention delay in a heterogeneous 2-path
model as a stand alone model, and not as a part of a heterogeneous m-path contention
model. The method in which the derived formula is applied to a heterogeneous m-path
contention model is discussed. As explained previously, the analysis in this section is a
building block for analyzing contention delays in a heterogeneous m-path contention
model. To verify our analysis of the heterogeneous 2-path contention model, we
compared our analytical results with a simulation model for several conﬁgurations of

contention.

5.3.1 Analysis

Suppose that a channel has two competing paths as illustrated in Figure 5.3. Let us
name the upper path P.- and the lower path Pj. Both paths have different MTBSs,
mtbs; and mtbsj. We assume that there is only one contention point between the
two paths, and it is the ﬁrst contention point for both paths. The channel where the
contention occurs is called H. After the contention point, both paths have their own
remaining communication delays, rd,- and rdj, which are known constants.

Let T; and T,- be the random variables that represent the local computation times

Pk

between successwe messages at the source nodes of both paths, and let C exmtbswmb,’ ,

or simply Cexk, be the random variable of the external contention delay on P). (k

is either i or j) while MTBSs of P,- and P, are mtbs; and mtbsj. Then, the real

53

   

 

 

 

ﬁdb.&dp———---—-—

 

 

 

 

 

 

 

 

 

     

Figure 5.3. Heterogeneous 2-path contention model.

communication delay on Pk is C exk+ rdk. If mtbs; is equal to mtbsj, then the 2-path
contention model is reduced to a two-layer stair-layered communication model, which
was proposed in [102]. Thus, we can employ a queuing model as an approximate
analytic model to predict the external contention delays in the worst case. However,
the queuing approach does not consider the difference between MTBSs of the inde-
pendent paths, and thus provides an inaccurate estimate of the external contention
delay. This section presents a stochastic approach for developing a prediction formula
of the external contention delay on Pk.

The expectation of Cea';c can be computed by summing the conditional expec-
tations of Cox;c for all possible conditions that could be made by the relationship
between (T,- and Ti) and (rd,- and rdj). All the possible cases are the following:

Case 1: Tg<rdj and Tj<rd; Case 2: Tg<rdj andeZrd;
Case 3: T.- 2 rd,- and Tj < rd,- Case 4: T.- 2 rd, and T,- 2 rd,-

and the expectation of the external contention delay on Pk is

4
E[Cexk] = z Pr{casen} at E[C'ea:;c | casen] (5.1)

n=l

where k is either i or j.
The following subsections discuss how to compute the probability and the condi-

tional expectation for each case.

The

54

ex 1 rd

P, nu ' —-——

 

 

P- ____ ' ex'____

Figure 5.4. An illustration of Case 1.

Case 1

The probability of case 1 is computed as follows:

PT{CGS€1} : P'I'{T,' < de and Tj < ng}

= Pr{T.- < rdj} Pr{T_,- < rdg}, since T,- and TJ- are independent (5.2)

To compute the conditional expectation of the external contention delay on P.- due
to P,- under case 1, it is worthwhile to examine an instance of the case as illustrated
in Figure 5.4. Since T.- and T, are less than rd,- and rd,- respectively, this instance
will repeat for the entire overlapping period between P.- and Pj even if T.- and T,- are

random variables. Therefore,
E[Cea:,-|case1] 2 rd, - E[T; | T,- < rdj] (5.3)

The reason that E [T, | T; < T'dj] is used instead of E [T.] is that T,- < rd, is the given
condition in case 1. The computation of E [T. I T,- < rdj] is in the appendix when T,-
has an exponential distribution with mean mtbs; as an example. E [C ex,- Icasel] can

be computed in the same way.

arr] I."-

55

sf?!
-I

 

 

 

 

(a) (b)

Figure 5.5. An illustration of Case 2.

Case 2

The probability of case 2 is similarly computed as in case 1, i.e., PT{CGS€2} = Pr{T, <
rdj} Pr{T,- 2 rd;}. To compute the conditional expectation of the external contention
delay under case 2 is complicated. We should consider two possible timings of Pj’S
message arrival which depends on the status of P.- when the message arrives at H
on Pj. One possible situation, as illustrated in the part (a) of Figure 5.5, is that
the message of Pj arrives at H while P,- is free. In this case, the message can be
transmitted without any extra waiting time, but it causes an external contention
delay to a message that arrives at H on P,- during the transmission. The external
contention delay, which is named Cg“), is de minus the remaining time of T.- after the
time that the message arrives at H on Pj. Our task in computing Cg“) is to calculate
the remaining time. By means of renewal theory of stochastic processes, we develop
an approximation.

Suppose a process whose mean of inter-arrival time is a and whose variance is 02.
The expected remaining time from a given time t until the next event generated by
the process is E [Rg] = %(1 + 0%). A proof of this equation is in the appendix. Note

that the remaining time is approximately greater than one-half of the mean inter-

arrival time. Therefore, the approximate remaining time of T,- can be computed as

56

—[—|——11E T‘ 72'3'<Td' , and thus Cg“) is de — will. Since T,- is greater than or equal to rd;,
Ci

(a) is a contention delay after several transmissions. Thus, the expected external

C.‘ .
contention delay per a message, E [C earn-ha), is #155, where E [Nt] Is the mean number
of transmissions. Approximately,

de + E[Tj [Tj Z 711;] .I
rd,-+E[T;|T.° < T'dj]

 

Ethl = f (5-4)

where N: is the random variable of the number of transmissions, and [] is the ceiling
function that gives the nearest larger integer. In summary, the expected external
contention delays per a message on P.- and P,- in the case of part (a) are given in the
following equations:

C.- ..
E[C€$,’](a) = ﬂ, E[C€C€j](a) = 0 (5.5)

The other possible timings of Pj’s message arrival is that the message of P,- arrives

at H while the channel is used for a message of E. This situation is illustrated in

and C-

part (b) of Figure 5.5. In this case, both paths have contention delays. C.- 1“,)

(5)

represent the contention delays of P.- and Pi, respectively. C,- can be computed in the

(5)

similar way as in Section 5.3.1, i.e., Cg“), = rdj — E'[T.- | T.- < rdj]. As in part (a), Cab)

is a contention delay after several transmissions. The expected external contention

delay per message on P.- is C; over the mean number of transmissions, E[Nt]. That

(6)

is,

rdj - E[T,' I T; < de]
Ethl

 

E[C€£Eg](b) = (5.6)

The external contention delay of Pj, CJ-(b), is the remaining communication time of
P,- after the arrival of the next message. As an approximation, it is one-half of rd,-.
Since Ci”) may occur in every transmission, E [C 623,-](5) = %.

Let the probability of part (a) be Pr(a), and let the probability of part (b) be

Pr“). The expected external contention delay of P,- and P,- in the case that T,- < rd,-

57
and T, 2 rd,- are computed as follows:

EICezglcaseg] = Pr(a)E[Ce:r,-](a) + Pr(b)E[Cex;](b)

EICexJ-Icaseg] = Pr(a)E[Ce:r,-](a) + Pr(b)E[Cea:,-](b) (5.7)

PrM is the percentage that H is reserved by P,- without any contention delay
for several transmissions of Pg. This probability can be approximated by means of

renewal theory as follows:

E[T,' I T; < deI

 

 

Pr”) : Td,’ + EIT,’ I T; < 7‘de (5'8)
and since Pr“) is 1 — Pr(a), then
rd,-
PT(b) = I — PT(a) = (5.9)

rd,- + E[T,- I T; < rdj]

The proof of the equation for PrIa) is an application of the key renewal theorem. If
the channel is considered to be a system that alternates between two states, then a

proof is given in [103].

Case 3

Case 3 is exactly the same as case 2 if all notations related to P,- are changed for
PJ- and all notations related to P, are changed for P,-. We omit ﬁgures and formulas

corresponding to this case.

Case 4

The probability of case 4 is Pr{T,- _>_ rdj} Pr{T,- Z ng}. Because T.- and T ,- are greater

than or equal to rd, and rdg, respectively, both EICex; I case.;] and EICexj I case4] is

58

T. T,- TI
rd rd- rd
P.- "'-'Q"'-'QP"' P "'—'Q
...... rd ------
P --- l --- P --- \C .
j ”...‘F'

 

 

 

 

(a) (b)

Figure 5.6. An illustration of Case 4.

computed by using renewal theory as we do in Case 2. Figure 5.6 illustrates how to
compute E [C em.- I case4]. As before, we consider two possible situations depending on
the relative timing of the messages of both paths. If a message of Pj arrives during
T,- as in the part (a) of Figure 5.6, then the waiting time of the message is zero. If a
message of Pj arrives during rd,- as in part (b), then the waiting time is approximately

one-half of 711;. According to the renewal theory, the probability of the part (a), PrIa),

E T' IT‘Zrdg]

is r d) + ElleTer dal’ and the probability of the part (b), Pr“), is 1 — Pr(a). Therefore,

the expected external contention delay of P; is

E[Ce:1:,- I case4] = Pr(b) * er, (5.10)
Similarly, E [C em,- I case4] can be computed
d,- d -
E[Ce:cj I case4] = r r J (5.11)

 

rd,- + EITiITi 2 1‘de * —2—

In summary, the external contention delay due to the interaction between P,- and
P, can be computed by Equation 5.1 with the input values, mtbsg, mtbsj, rd;, and
rd,- when the two paths are not a part of a larger m-path contention model. However,

if there are other competing paths in the multipath contention model where the two

59

paths are involved, input values should be the values that include the delays due to
the contentions with the other paths. Thus, T5hpi, 75,,121, DEL“, and Dng’d, which
are deﬁned in Section 5.1, are used instead of mtbsg, mtbsj, N15, and rd,- respectively,
where 5k indicates the stage at which the 2-path contention situation is considered.
Based on that equation, in the next section we analyze the heterogeneous multipath

contention model. Equation 5.1 is used as a known function called f () This function

returns EICemP‘ ] + Dgzwd as f,-() and EICerfﬁvaerpj] + 051-13 as fj().

TS*,P.’ {75* ,PJ'

5.3.2 Results

We present results that demonstrate how well our analysis of the heterogeneous 2-
path contention model compares with a simulation model for several conﬁgurations
of contention. We used the same Multisim [98] simulator as before.

The simulation model is composed of two competing paths, P.- and Pj, which are
parts of two independent homogeneous multipath contention models. The homoge-
neous multipath contention models for P.- and P,- have different MTBSs, M TBS,- and
M TB 5,, and m and n competing paths respectively. Figure 5.7 illustrates the simula-
tion model. We call the simulation model the m — n interaction model. For simulation
purposes, the remaining communication times of P.- and Pj after the external con-
tention are assumed to be their own internal contention delays that are independent
of the other path. Based on the analysis in Chapter 4, the internal delays are assumed
to be a known value for the given MTBS and the given communication pattern. The
remainder of this subsection shows the effects of interactions of P,- on the performance
of the P,- as the MTBSs and the numbers of the competing paths of the homogeneous
models are various. Since there is only one external contention point between P.- and
P,- in this simulation model, the effects of P.- on the performance of P,- are identical.

Figure 5.8 shows the mean communication delay of P; as a function of the M TB 5

of P.- as different values of the M TBS of P, are used. The 1 — 1 contention model

60

 

 

 

 

 

 

 

 

Figure 5.7. The m-n simulation model for 2-path contention.

 

100 I I 1 l l
90 . Simulation (MTBS = 0) 9-
Prediction (MTBS = 0) O- -
80 Simulation (MTBS = 25) —I— _
Prediction (MTBS = 25) -+- -
70 Simulation (MTBS = 100) d— _
Communicaiton

Prediction (MTBS = 100) -*- -

Inside Job Only -o- . _
Delay of P,-

 

 

 

 

 

0 200 400 600 800 1000
MTBS of P.-

Figure 5.8. Performance effect of various MTBSs of PJ- on communication of Pg: Mean
communication delay of P,- vs.MTBS of P.- for 1-1 interaction model while MTBS of
P,- is 0, 25 or 100.

hale

61

 

40 l l l l
38 ﬂ MTBS = 0 G- _
MTBS = 50 A—
36 MTBS = 100 +— g
MTBS = 250 ..—
34 MTBS = 1000 e— _

Communication

Delay of Pg

 

 

 

 

0 200 400 600 800 1000
MTBS of Pj

Figure 5.9. Performance effect of various MTBSs of P,- on communication of Pg : Mean
communication delay of Pg vs.MTBS of P,- for 1-1 interaction model while MTBS of
Pg is 0, 25, 100, 250 or 1000.

is employed for the simulation. Since Pg does not have any internal contention, the
communication delay of Pg without interactions with P,- is merely the transmission
delay, i.e., 25 psec, as shown on the lowest plot of Figure 5.8. The only cause for
an increase in P,’s communication delay is its contentions with Pg. The other plots
in Figure 5.8 show the effect of Pj on the performance of Pg. As the communication
rate of P,- increases, the communication delay of Pg increases for all ranges of commu-
nication rates of Pg. It increases more signiﬁcantly for high rates. This relationship
between M TBS of P,- and the mean communication delay of Pg can be seen more
clearly in Figure 5.9. As the M TBS of P,- becomes smaller (i.e., the communication
rate becomes larger), the analytical results show that the mean communication delay
of Pg becomes larger.

Another factor that has an effect on the communication time of Pg is the number
of competing paths of the homogeneous contention model in which P,- resides. As we

have examined in Chapter 4, the internal contention delay of the path increases as the

62

 

   
     

500 ' I I I I
450 Simulation (1-5) (>- - .3
Prediction (1—5) 9-
400 Simulation (1-3) «An - E
350 Prediction (1-3) +— —
Communication 300 Simulation (1-1) .. . . _
Prediction (1-1) -9—

Delay of Pg 250

 

 

 

 

200 _ .
150 .........................
100
50 : ' ‘ ‘ “““
0
0 200 400 600 800 1000
MTBS of Pg

Figure 5.10. Performance effect of number of layers of Pg on communication of Pg
: Mean communication delay of Pg vs.MTBS of Pg for 1-1, 1—3 and 1-5 interaction
model : M TBS; = 25.

number of layers becomes large. This increase in the internal contention delay can
effect the performance of Pg. Figures 5.10 and 5.11 present the effect of the increase in
the number of layers in which P,- resides on the performance of Pg. Figure 5.10 shows
the mean communication delay of Pg which is measured as a function of the M TB 5 of
Pg, while Pg has a path that shares a competing channel with P,- that resides with l,
3 or 5 layers. To provide insight to their relationship, the mean communication delay
of Pg is given in Figure 5.11. The delay is shown as a function of the number of layers
associated with P,- for several combinations of M T1351 and M T852. The number of
layers associated with P,- is varied from 0 to 7, while that of Pg is ﬁxed to 1. As we
see in the ﬁgure, the communication of Pg is not affected by the communication of Pj,
when P,- sends message slowly at around 500 psec. Nevertheless, at high rates such
as 25 usec, the communication delay of Pg increases greatly for all ranges of MTBS of
Pg. Figure 5.12 is an another view that shows the relationship among MTBSs of Pg
and P,- and the number of layers. At extremely small values of MTBS of Pg, a small

decrease in the MTBS of P,- or a small increase in the number of layers causes very

63

 

 

 

 

250 I I I I I I
(25,25) 9— 0
200 _ (500,25) -l-— _
(25,100) +—
. . (500,100) +—
Communication 150 _ (25,500) 6— _
(500,500) -0—
Delay of Pg
100 P r
9 l
50 - . d
0 l l l l l

 

l
0 1 2 3 4 5 6 7
Number of Layers of P]-

Figure 5.11. Performance effect of number of layers of P,- on communication of Pg :
Mean communication delay of Pg vs.number of layers of Pg. The pair represents the

MTBSs of Pg and Pg.

signiﬁcant interactions. At other rates of MTBS of Pg, the decrease of MTBS of P,-
or an increase in the number of layers of P,- increases the communication time of Pg

less signiﬁcantly.

5.4 Analysis of the Heterogeneous Multipath

Contention Model

This section completes the divide—and-conquer strategy in Section 5.2 for predicting
the real communication delay seen by a message on a path in the heterogeneous m-
path contention model. Based on the analysis of the 2-path contention model, this
section derives formulas which compute the contention delay of PJ-(i S j S m) at

stage i in the nth iteration.

64

 

180

I
511
NM
0‘0!

160

O
O

Iititggwgégww

140

Cami—lid
co
DO

120
Communication

 

1

. 1
Delay of Pg 1 — ,500
80 '- - ,500

av

60 '.._ _

i
40 500
500

 

0 200 400 600 800 1000
MTBS of Pg

Figure 5.12. Performance effect of combination of factors on communication of Pg:
dotted line—simulation results, solid line—predictions. The ﬁrst element of the pair
represents k-k interaction model and the second number represents the MTBS of Pg.

65

5.4.1 Starting Contention Delay

Figure 5.13 illustrates the situation involving starting contention for a message on Pg
at stage i. As mentioned in Section 5.2, 702 — 1)sg,pj and D(n — 1)::d are assumed to
be already computed at the (n — l)th iteration, for all 2', j such that 1 S i g j S m.
The number of the starting competing paths of Pg (i.e., Pg+1, . . .,Pm) is m — 2', as
shown in part (a) of Figure 5.13. This number may be larger than one. Recall that in
order to apply f () there should be only two competing paths at the contention point,
whose 7' and the remaining communication delays after the contention are known at
the considered stage. Therefore, our strategy for computing the starting contention

delay of Pg at S'g in the nth iteration takes the following steps:

1. Identify T of Pg at Sg in the nth iteration, i.e., T(n)3g,pj, for all j such that

iSjSm-

2. Reduce the multiple starting competing paths of Pg to a single competing path

(say Q) that generates the same amount of starting contention delay.

3. Identify the remaining communication delays after the contention for both Pg

and Q in the nth iteration, i.e., D(n)§:_hd. and D(n)g‘_1,d.
4. Apply fg() to compute D(n)ISJ:,d .

In the ﬁrst step, T(n)s.-,p.- is simply mtng. Let P,- be one of the starting competing
paths of Pg, where i+ 1 S j S m. Due to the contention points that are on P,- before
stage i,

T(n)3..,pj = mtbsj + C(n —1)§j,3i

wherei+1$j$m.
For the second step, we use an approximation as follows. Let Ashpj, 2' + 1 S j S

m, be the random variable of the message arrivals of P,- at stage 2', whose mean is

66

Proceuor g Processor g

    

Router ,

Router g

 

 

 

 

 

 

(a)

Figure 5.13. Computation of starting contention delay.

Asi’Pj. If Ashpj follows a Poisson distribution, then the mean message arrival rate
generated at the starting channel of Pg by the m — 2' starting competing paths of Pg,
is XXL-+1 Asi'p, [104]. In fact, we do not know the actual distribution of Asi'pj, but
the simulation results in the Section 5.4.3 verify that this approximation serves us as
a good estimate. Therefore, we substitute a. competing path, say Q, for the m — 1'

starting competing paths of Pg, as in part (b) of Figure 5.13, and

was... = 1/ i m,- — D(n —1)3..,.
j=i+1
where D(n — 1):. _d is the value deﬁned in the next step of the (n — l)th iteration. In
our model, /\(n)si,pj is computed as 1/(T(n —1)si'pj + D(n — Did).
The last step identifies the remaining communication delays on Pg and Q after
the contention at stage i. For the remaining communication delays on Pg in the nth
iteration, D(n — DEL”, can be used. The remaining communication delay on Q,

which is D(n)§l'._1 ,d, is approximated as a weighted average of the remaining commu-

67

nication delays on Pg+1, . . . , Pm after the contention at stage 2'. A reasonable weight
for Pg, i + 1 S j S m, is Pg’s contribution to the arrival of Q, i.e., ratio A(n)sg,pj to

271m Mnlsm- Thus,

m r\(n)s- P P
D n 9 = m " ’ * D n — 1 ?
( )Su—mi ij-H 21:14] /\(n)Sg,Pg ( )S.-1.d

 

Consequently, the expected real communication delay seen by a message on Pg at
stage i is,

D(n)f5‘):,d = fi(T(n)3.',PivT(n)SnQ’D(n)§:_g,di D(n)gg-1,d)

and the expected starting contention delay of Pg is

5.4.2 Intermediate Contention Delay

Consider a situation of an intermediate contention on Pg at stage i, where i+1 S j S
m. Figure 5.14 illustrates the case when j is i + 1. Pg is the only competing path of
Pg+1 at stage 2'. Therefore, we can apply f () without the second step of reducing the
number of competing paths. Again, T(n — 1)5,.,pj and D(n — 1));f’d are known values
at the (n — l)th iteration for all i, j such that 1 S i S j S m.

The ﬁrst step is the same as for the starting contention delay. That is, 'r(n)s,,p..
is simply mtbsg, and T(n)s..,pj = mtb3g + C(n — 1);)15“ where 2' +1 S j S m.

The last step for identifying the remaining communication delays on Pg and Pg
after the contention at stage 2' is more difﬁcult than for the starting contention delay.
Without loss of generality, suppose that j is i + 1 as in the part (b) in Figure 5.14.
For the remaining communication delays on Pg+1 after Sg, D(n — ”139:“! can be used.

However, the remaining communication delays for Pg may be larger than D(n—1%:l ’d,

68

Processor g Processor g

    

Router g

P!

(+1

 

(a) (b)

Figure 5.14. Computation of intermediate contention delay.

since the other competing paths at stage i, i.e., {Pg+2, . . . , Pm}, may increase the delay.
Thus, the contention delay between Pg and Pg+1 with {Pg+2, . . . , Pm} must be larger
than the delay without {Pg-+2, ..., Pm}. The difference can be interpreted as the
possibility of contention between Pg and {.Pg+2, ..., Pm}. Therefore, the remaining
communication delays on Pg increases as much as the contention delay between Pg
and {Pg-+2, .. . , Pm}. The contention delay can be computed in the same way as the
computation of the starting contention delay of Pg, which is described in Section 5.4.1,
except that there are m—i—l starting competing paths for Pg, i.e., Pg+2, . . . Pm. Thus,
the modiﬁed remaining communication delays on Pg after stage i for computing the

intermediate contention delay on Pg at stage i is

‘ R Pi Q I' ,m , '
D(n)sg_, ,d = fi(T(n)Si.Pn T(n)sin(i+l.m),#j a D(n)Sg_1,d’ D(n)S.‘(_-:td ) #1 )

lb

fY‘

(0::

:95 L‘

69

Therefore, the expected real communication delay on Pg from stage i to stage 1 is
PJ‘ “ P.” Pj
1701)ng = fj(T(n)sg.Pg,T(n)sg.P,-, D(n)Sg-_1,d’ D(n)Sg_1,d)

and the intermediate contention delay on Pg at stage i is

an): = 0002,. — D(n>’s’:_.,.

5.4.3 Results

In this subsection we provide results that show agreement between the analysis and a
simulation of the heterogeneous multipath contention model. In order to illustrate the
comparisons for the multipath contention model, we show analytical and simulation
results for a heterogeneous 5-path contention model.

Figure 5.15 shows the average communication delays of the ﬁve paths of a 5-path
contention model. We analyzed many different cases, and present in this ﬁgure the
results for ﬁxed values of MTBS for all paths in the model with the exception of path
3, the middle path in the model. The respective values of the MTBS for each path
represented in this ﬁgure, from path 1 to path 5, are 0, 50, k, 250, 500 psecs, where It
varies from 0 to 500 psecs.

We observe how the communication delay relates to the MTBS. As the MTBS
decreases, the mean communication delay increases because the probability of con-
tention increases the average path delay. Path 1 is a path that has no intermediate
contention points and has four other competing paths at its starting contention point
in the heterogeneous 5-path contention model. In contrast, path 5 has four interme-
diate contention points and no starting contention point. Notice that the paths that
have the greater number of intermediate contention points are more sensitive to the

network load than the paths that have fewer intermediate contention points. Path 1

70

 

Comm.

Delay

 

 

 

 

l l l l l l l l I

 

0 50 100 150 200 250 300 350 400 450 500
MTBS of Path 3

simulation (Path 1) 6—
analytic (Path 1) <> -
simulation (Path 2) —l—
analytic (Path 2) -+-
simulation (Path 3) +—
analytic (Path 3) -*- -
simulation (Path 4) -o—
analytic (Path 4) --- -
simulation (Path 5) -o—
analytic (Path 5) -o- -

Figure 5.15. Mean communication delay of each path vs. MTBS of Path 3 for
heterogeneous 5-path contention model. The MTBS of Paths 1, 2, 4, 5 are 0, 50, 250,

and 500 psecs, respectively.

is relatively unaffected by the network load.

5.5 Analyzing Two Transpose Jobs

We apply our analytical model to a more complex job-interaction model, which is

illustrated in Figure 5.16. Two jobs are interleaved and have overlapping communi-

cation paths. The ﬁrst job is allocated to a contiguous partition of processors and is

placed inside of the second job. The second job has its processors scattered to two

contiguous regions outside of the ﬁrst job. The overlapping communication paths

are illustrated. The logical communication pattern of each job is a 4-by-4 transpose

pattern, which has been used by other researchers to study performance effects of

processor mappings [101]. In this pattern, a subset of nodes actively send messages.

Only the nodes on the diagonal do not send messages. Each active node has a re-

lated node with which it exchanges messages repeatedly. The communicating pairs

71

 

Figure 5.16. An allocation of two transpose jobs.

of nodes using the matrix-transpose pattern are speciﬁed as N odeg,g +—> N odegg,
where N ode“ indicates the node in the submesh at the kth row and lth column. The
symbol s—g indicates bi-directional communication. Note that the physical commu-
nication pattern of the outside job is different from its logical communication pattern.
In fact, the physical communication pattern of the scattered job is determined by the
location of the processors allocated to the job.

The job interaction model of this example can be divided into several heteroge-
neous multipath contention models, row by row. Each row has a pair of heterogeneous
multipath contention models; one model is directed to the left and the other model
is directed to the right. In other words, there are 6 left-directed heterogeneous mul-
tipath contention models and 6 right-directed heterogeneous multipath contention
models. To illustrate the effects of communication delays produced by the job inter-
action model, we compute the average communication delays for each path in each
direction separately, and then average the delays. In this example, the communica-
tion pattern of each direction is symmetric and therefore will be the same in each
direction. We thus compute the average communication delay of each job only for
the right direction. Let (i, j) be a heterogeneous (2' + j )-path contention model con-

structed by 2' paths that belong to the inside job and j paths that belong to the outside

72

 

 

 

200 I j I I I r l l l
180 t simulation MTBSggg = 0 9- .-
prediction MTBSggg = 0 O -
160 * simulation MTBSggg = 50 -A-— ‘
140 g_ prediction MTBSggg = 50 -*-'- _
simulation MTBSggg = 500 -0—
Mean 120 r prediction MTBSggg = 500 -o- - ‘
0mm.
Delay 100 ' —
80 > . . . . . . . ............ "
:x A [(3 2 . . <2 .......................... <>
60 L v v v v 14>
401%;- .3": ................ :5 ...........................
, v ‘ ': ' . ~44 ................. . ..... j ..... 1 ..... J ..... L A - . -

 

 

 

 

0 50 100 150 200 250 300 350 400 450 500
MTBS of Outside Transpose Job

Figure 5.17. Mean communication delay of the inside transpose job while the MTBS
of the outside job varies from 0 to 500psecs. M TBSggg is MTBS of the inside job.

job. Using this notation, the 6 right—directed contention models can be represented
as (0,0),(0,0),(l,0),(2,0),(3,2), and (0,2) from the top row to the bottom row.

The simulation and analytical results are given in Figures 5.17 and 5.18. F ig-
ure 5.17 shows the average communication delay of the inside job for various values
of MTBS of the inside job as a function of the load of the outside job. Figure 5.18
shows the delay of the outside job for the same set of parameters. The simulation

results are veriﬁed with the computation of the analytical model for all cases.

5.6 Summary

We have analyzed the performance degradation due to the sharing of network re-
sources by multiple independent interacting jobs for a general contention model called
the heterogeneous multipath contention model. Our analysis is based on a divide-and-

conquer strategy, which derives the communication delay at each contention point on

73

 

200 [ g
1802 ,
160

  
  
 

I I I I I
simulation(MTBSggg = 0) e—
prediction(MTBSgn = 0) O -

simulation(MTBSggg = 50) d:—

prediction(MTBSg,g = 50) -*- -

_J

u—(

   
    

140 . simulation(MTBSgn = 500) '._ ~
Mean 120 _ ° - - . _ . . prediction(MTBSg,g = 500) '0' ' ‘
Comm. ' i ' ' -
Delay

 

 

 

 

I

150 20 250 300 350 400 450 500
MTBS of Outside Transpose Job

 

0 50 100

Figure 5.18. Mean communication delay of the outside transpose job while the MTBS
of the outside job varies from 0 to 500psecs. M TBSggg is MTBS of the inside job.

a path. It reduces a heterogeneous multipath contention model at each contention
point to a heterogeneous 2-path contention model. The computation of the reduced
model distinguishes the starting contention point from the intermediate contention
points. Simulations indicate that our analysis of the analytic model closely agrees
with the results of the simulations. These results help us understand the effect of
job interactions on network performance such that we are better prepared for ﬁnd-
ing solutions to the problem of allocating processors in 2-D mesh wormhole-routed

multicomputers.

CI

.10
P1

in Ihi
QIOCE‘:
impor
enjoy
effect:

3013 in

the it

 

 

CHAPTER 6

JOB PARTITIONING AND
PLACEMENT

In this chapter we study principles that may be applied when developing a scattered
processor allocation strategy for a 2-D mesh multicomputer. We believe that it is
important to develop scattered processor allocation strategies for MPCs in order to
enjoy the increased processor utilizations that they allow as long as the negative
effects of job interactions can be kept under control. We investigate the effects of
job interactions due to communication parameters such as the communication rate,
the internal competing level, and the congestion factor. The competing level is a
measure of the contention on a path within an individual job. The congestion factor
is a measure of the contention at a channel. By isolating each parameter, we study
whether the method of partitioning and placing jobs can change the negative effects
of job interactions. Our investigation of factors that affect a scattered processor
allocation strategy uses a combination of simulations and an analytic model that is
analyzed in Chapter 5.

The rest of the chapter is organized as follows. Effects of starting competing
paths and intermediate competing paths are examined in Section 6.1 with the con-

sideration of contention among competing paths. In Section 6.2, efﬁcient methods for

74

part

(of;

6.]

In 01
com
com;
how
a pat

WOT.

75

partitioning and placing jobs and effects of communication parameters are studied.

Concluding remarks are given in Section 6.3.

6.1 Contention Among Competing Paths

In order to investigate the effects of contention between jobs that have independent
communication characteristics, we must ﬁrst study the effects of contention among
competing paths that have independent communication rates. This section examines
how contention between a set of competing paths can affect the delays experienced by
a path. The insight obtained from this examination is intuitive from the nature of the
wormhole routing switching technique, but it is basic and substantial to understand
the results of job placement and partitioning in the next section. This section presents
simulation results and the explanations of the results. For our simulation, we used
the same Multisim [105] simulation package as mentioned in the previous chapters.
Results provided by an analytical technique described in Chapter 5 can help explain
the characteristic of the results.

Suppose that we have ﬁve competing paths, P1,P2,P3,P4, and P5, which have
different MTBSs, as illustrated in Figure 6.1. In the ﬁgure, the gray ovals represent
processing nodes and the white rectangles between two consecutive ovals represent the
channels between two consecutive processing nodes. P1, P2, P4 and P5 have constant
MTBSs, but the MTBS of P3 varies from 0 to 500. The overall communication delay
of the communicating paths may depend on how the paths overlap. For the purpose of
our initial investigation, we consider two possible layouts. The layouts are illustrated
in Figure 6.1 by the Mean Time Between Sends (MTBSs) listed in the columns (a)
and (b), which we will label as layout (a) and layout (b). Layout (a) is the case in
which a path with a higher communication rate (i.e., smaller MTBS) is located at the

upper level among the paths, and the paths with the lower communication rates (i.e.,

 

MTBS MTBS
o 500
so 250
k k
250 50
500 0
(a) (b)

Figure 6.1. Two layouts of 5 competing paths.

larger MTBSs) are located at the lower paths. In contrast, layout (b) places a path
with a. lower communication rate (i.e., larger MTBS) at a upper level, while the paths
with higher communication rates are located at lower levels. As shown in the ﬁgure,
the set of communication rates used in the study are identical for both layout (a) and
layout (b). The only difference is the location of the paths with different MTBSs.
Figure 6.2 shows the average communication delays of the ﬁve paths in layout (a)
whose MTBSs (in psec) are 0, 50, k, 250, and 500, Where It varies from 0 to 500. The
communication delays are displayed as functions of MTBS of P3. Figure 6.3 has
the results for layout (b), which has the MTBSs for the paths to be 500,250, 16,50,
and 0. P3 has an MTBS that varies from 0 to 500. All dotted lines in Figures 6.2
and 6.3 present our results from the analytical model, and the solid lines present
our simulation results. As displayed by the ﬁgures, the communication delays for all
paths of layout (b) are smaller than that of layout (a). This result implies that the
overall communication delay can be reduced by positioning the paths in a particular
order, such as layout (b). This phenomenon can be explained by understanding the

detailed interactions between the competing paths.

77

 

simulation (Path 1)
analytic (Path 1)
simulation (Path 2)
analytic (Path 2)
simulation (Path 3)
analytic (Path 3)
)-'—

)

)*—

)‘0

Comm.

Delay simulation (Path 4

analytic (Path 4
simulation (Path 5
analytic (Path 5

 

 

 

l l l l l l l l I

0 50 100 150 200 250 300 350 400 450 500
MTBS of Path 3

 

Figure 6.2. Communication delay of layout (a) vs. MTBS of P3 that varies from 0
to 500: MTBSs of P1,P2,P4, and P5 are 0,50,250, and 500, respectively. The mean
communication delay is greater for layout (a) than the delay for layout (b) shown in
the next ﬁgure.

 

simulation (Path 1)

analytic (Path 1)
simulation (Path 2)

analytic (Path 2)
simulation (Path 3)
analytic (Path 3)-
simulation (Path 4
analytic (Path 4;-.E—E
simulation (Path 5)
analytic (Path 5)

 

 

 

 

0 50 100 150 200 250 300 350 400 450 500
MTBS of Path 3

Figure 6.3. Communication delay of layout (b) vs. MTBS of P3 that varies from 0
to 500: MTBSs of P1,P2, P4, and P5 are 500, 250,50, and 0, respectively. The delay
shown in this ﬁgure is less than the delay of layout (a) shown in the previous ﬁgure.

.rw-
'JLA

Sta

78

6.1.1 Path Interaction

In view of analyzing path interactions, this subsection re-interprets the effects of
starting contention and intermediate contention by considering the number of starting
and intermediate competing paths and the communication rate of the competing

paths.

Effect of Starting Competing Paths

Suppose path P starts at the processor that is connected to the router for the speciﬁc
channel that we are examining, and other paths route messages through the channel
(but start at other processors in the network) and will be competing paths to P. A
starting contention of P occurs if P sends a message while the channel is utilized by
a message from one of the starting competing paths. Even if there are potentially
several starting competing paths, only one path can contend at a time since a channel
can be used by only one path at a time. The probability that starting contention
occurs is greater as the number of the starting competing paths increases and the
communication rates of the starting competing paths increase.

Figure 6.4 shows the starting contention delay of a path as a function of its MTBS.
The ﬁgure shows results in which a path that has contention at its starting path com-
petes with 3 or 5 paths. Results are displayed when the MTBSs of starting competing
paths are 0, 50, 100, 250, and 500. The results show that starting contention delay
is relatively insensitive to the change in the number of competing paths and their
MTBSs. This can be explained by considering a message that is initially blocked at
its starting channel because the starting channel is utilized by one of the competing
paths. The message at its starting channel will be able to acquire the channel as soon
as the tail ﬂit of the message utilizing the competing path passes the channel. No
other competing path will acquire the channel as quickly as a path with a message

waiting at its starting channel. Therefore, an increase in the number of competing

hgur

Dads

pahs

EHEC

Thes
that a
{Emio
Shows
use

In con

l't’l'I' 5

 

 

79

50
3 Starting CPS, MTBS = 0 9—
3 Starting CPs, MTBS = 50 -l—
3 Starting CPS, MTBS = 100 -o—
Mean 40 3 Starting CPS, MTBS = 250 -o—
C 3 Starting CPS, MTBS = 500 -0—
0mm. 5 Starting CPs, MTBS = 0 O -
Delay 35 5 Starting cps, MTBS = 50 -+--
5 Starting CPS, MTBS = 100 -O- -
30 5 Starting CPS, MTBS = 250 -O- -
5 Starting CPs, MTBS = 500 -o- -

25

 

0 50 100 150 200 250 300 350 400 450 500
MTBS

Figure 6.4. The starting contention delay when competing with 3 or 5 competing
paths (CPs) at a starting channel.

paths does not have a great effect on a path’s access to its starting channel.

Effect of Intermediate Competing Paths

The starting channel of a path is also an intermediate contention point for channels
that are routed through the starting channel. A path has as many intermediate con-
tention points as the number of the intermediate competing paths it faces. Figure 6.5
shows the intermediate contention delay of a path that faces either 3 or 5 interme—
diate contention points. We assume that the path has no starting contention delay.
In contrast to the effect of starting contention delay, intermediate contention delay is
very sensitive to the increase of the communication rate and the number of the inter-
mediate competing paths. This is because an additional intermediate competing path
or an increased communication rate of an intermediate competing path has a chain
effect by increasing the delay of all other intermediate competing paths. Therefore,

the characteristics of intermediate competing paths affect the delay of a path more

80

 

 

 

300 _ I I I I I I I I I

700 ~ 0" 0 r
' - - . . _ g . 3 Intermediate CPs, MTBS = 0 e—
600 r- ~ 0 ..................... O 3 Intermediate CPS, MTBS = 50 +-
3 Intermediate CPs, MTBS = 100 -0—
Mean 500 ” " 3 Intermediate CPs, MTBS = 250 -o—
3 Intermediate CPs, MTBS = 500 -o—
Comm. 400 ' _ 5 Intermediate CPs, MTBS = 0 0 -
H _ 5 Intermediate CPs, MTBS = 50 -+' -
Delay 3°01 ' 5 Intermediate CPs, MTBS = 100 ....
............ +jg 5IntermediateCPs,MTBS=250 -o--
\> 5 Intermediate CPs, MTBS = 500 -o- -

.9 ..................... 7‘,

 

 

 

 

vvx

0 50 100 150 200 250 300 350 400 450 500
MTBS

Figure 6.5. The intermediate contention delay when competing with 3 or 5 interme-
diate competing paths (CPS).

signiﬁcantly than the characteristics of the starting competing paths.

6.1.2 Discussion

Let us reconsider the paths illustrated in Figure 6.1. The only difference between
layouts (a) and (b) is that the communication rates of upper paths and lower paths.
The higher communication rates of the lower paths contribute to the increase of the
starting contention delays of the upper paths. The starting contention delays, how-
ever, are relatively insensitive to the increased communication rate. In contrast, the
lower communication rates of the upper paths contribute to the decrease in the in-
termediate contention delay of the lower paths. The intermediate contention delay is
very sensitive to a change in communication rate. Therefore, the overall performance
of layout (b) is better than that of layout (a). In general, if you have control of the
layout of communicating paths, it is much better to cause the paths with greater com-

munication demands to encounter a greater number intermediate contention points

81

relative to the number of intermediate contention points encountered by paths that
infrequently transmit messages. It is undesirable to have the paths with a greater
amount of communication demand to be intermediate contention points for other

paths.

6.2 Partitioning and Placing Jobs

Our investigation of the effect of contention among multiple paths in a wormhole
network provides a basis for our investigation of the interferences that occur among
interleaved jobs. This investigation is important in order to understand how to allo—
cate jobs to a 2-D mesh system when the processors allocated to a job can be scattered

across the system.

6.2.1 Methodology

We constrain our study to examine how two interleaved jobs interact. For simplicity
we assume that all paths that belong to a job have the same communication rate and
the job has a speciﬁc pattern in which messages are transmitted. We consider two
patterns of communication. A diagonal pattern is used when we want a given job to
have no internal contentions, such that the only contentions from which a job suffer are
the contentions due to other jobs. The communicating pairs of nodes in the diagonal
pattern are speciﬁed as N odeg,g +——> N ode(,g_g_1),(,g_g_1), where N ode“ indicates the
node in the submesh at the kth row and lth column. The speciﬁcation assumes there
are n rows and 12. columns assigned to a job, with rows and columns numbered from
0. . .n — 1. The symbol <——) indicates bi-directional communication. The second
pattern of communication that we use is the matrix-transpose pattern, which has
been introduced in Chapter 4. This pattern is used when we want to consider cases

that the internal contention within a given job is signiﬁcant. The two patterns of

82

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a) Diagonal (b) Transpose

Figure 6.6. Two logical communication patterns within a 4x4 job.

communication are illustrated in Figure 6.6. These two communication patterns are
very simple, but by using these patterns we form job interaction scenarios that can
isolate the effect of several communication parameters due to the job interactions. In
Section 6.2.5, we discuss how to apply the results that we obtain by the simulation
of two interleaved jobs for developing scattered processor allocation strategies.

Suppose that two jobs should be allocated to an MPC. Consider a situation in
which we cannot allocate both jobs to contiguous regions of the MPC due to the lack
of available processors in partitions that are large enough for both jobs. What will
be the effect of one job on the other if one of the jobs is allocated within a contiguous
region, while the other jobs must be partitioned into two pieces and be allocated
to regions that surround the ﬁrst job? How to choose which job to allocate to a
contiguous region, and which job to allocate to dispersed partitions? For the job that
is partitioned, how to partition it into separate pieces? In other words, how should
we cut a job and allocate it to disperse regions?

In addition to the communication rate that is used in Section 6.1, two more
communication parameters are considered: internal competing level and congestion
factor. The internal competing level of a path within a job is deﬁned as the number
of the path’s starting and intermediate competing paths that belong to the same job.

The average internal competing level of a job is the average of the internal competing

levels

com
Side

insic

 

83

levels for all paths within the job. For the detailed explanation and examples of this
parameter see Chapter 4. The congestion factor of a channel is deﬁned as the number
of competing paths that share a channel. In contrast to the internal competing level,
the competing paths that compose the congestion factor may come from any job. This
parameter is a measure of contention at a channel. To isolate the effect of parameters

due to job interactions, we consider the following scenarios:

0 Scenario A: One job has a higher rate of communication than the other job.
Both jobs are composed of a 4-by-4 matrix-transpose pattern that has signiﬁcant
internal contention. Since the pattern of each job is the same, with this scenario

we examine the effects of communication rate on job placement.

0 Scenario B: One job has a 4-by-4 matrix-transpose pattern and the other job
has a 4-by-4 diagonal pattern. Both jobs have the same rate of communication.
In this case we isolate the effect of internal competing level on job placement.
A difference between the jobs is that one has a signiﬁcant amount of internal

contention relative to the other.

6.2.2 Effect of Communication Rates

We use Scenario A to investigate the effect of communication rate on the placement
of jobs on the MPC. Scenario A can be represented as the two jobs illustrated in
Figure 6.7. Each job has the same communication pattern, but different rates of
communication. We compare the effects on performance when the job in the con-
tiguous portion of the system communicates at a lower rate in comparison to the
communication rate of the job that is partitioned into two pieces and is to the out-
side of the ﬁrst job. Likewise, we will examine the performance when the job to the
inside communicates at a rate that is higher than the job to the outside.

As illustrated in Figure 6.7, the communicating paths of the job to the inside serve

figsre 6.

as intern
the job I
the effecI
by the el
Inside joi
imermed
figur
OUlSlde j
taxis re
0f the 0
Gray sh
631(th)

inn;
1% figu-
lilf COII
ihe out

lieu Ii

rate of

84

 

Figure 6.7. Job interaction model for investigating the effect of communication rate.

as intermediate contention points to the job that is placed to the outside. Likewise,
the job to the outside will provide starting contention to the inside job. Therefore,
the effect of the outside job on the performance of the inside job can be explained
by the effect of starting competing paths on a channel. Similarly, the effect of the
inside job on the performance of the outside job can be explained by the effect of the
intermediate competing paths.

Figures 6.8 and 6.9 Show the communication delays of the inside job and the
outside job as a function of the MTBSS of two jobs, which range from 0 to 500. The
x-axis represents the MTBS of the inside job, and the y-axis represents the MTBS
of the outside job. The communication delays are represented as 2-D contour lines.
Gray shadings in each ﬁgure represent different communication delay thresholds for
each job at various communication rates of the pair of jobs.

Imagine a diagonal line is drawn from the point (0,0) to the point (500,500) for the
two ﬁgures. The upper triangle due to the diagonal line would represent the case when
the communication rate of the inside job is greater than the communication rate of
the outside job. The lower triangle due to the diagonal line would represent the case
when the communication rate of the outside job is greater than the communication

rate of the inside job. For both ﬁgures, the contention delay represented in the upper

Fm;

‘%

and

figure

85

 

 

O 25 w75|ml25l60 mm:.m.um

Figure 6.8. Communication delay of inside job: The x-axis represents the MTBS of
inside job and the y-axis represents the MTBS of outside job.

   

Figure 6.9. Communication delay of outside job: The x-axis represents the MTBS of
inside job and the y-axis represents the MTBS of outside job.

86

triangle is greater than the contention level in the bottom triangle. This means that
it is more desirable for both jobs if the job that communicates less frequently is to

the inside of the job that communicates frequently.

6.2.3 Effect of Internal Competing Level

We use Scenario B to examine the effect of internal competing level on job placement
when the two jobs have the same communication rates but have different communica—
tion patterns. The transpose job has a higher internal competing level in comparison
to the diagonal job. We compare two ways to allocate the jobs. First, we allocate the
transpose job to a contiguous partition and allocate the diagonal job to the outside
with two pieces. Likewise, we consider the case when we reverse the method of allo—
cation. Figure 6.10 illustrates the cases. We use the same MTBS for both jobs and
partition the outside job in the middle and remove one of the communication path
of the transpose job. This modiﬁcation of removing one of the communication paths
is made because we do not want to change the congestion factors of all channels in
the inside job after exchanging the location of two jobs. Therefore, only the internal
competing levels are exchanged.

Figure 6.11 compares the communication delays for the two cases. The commu-
nication delays are shown as functions of the communication rate. The solid lines
represent the results when the matrix-transpose job is to the inside of the diagonal
job, while the dotted lines represent the opposite case. The starred lines indicate the
communication delays of the modiﬁed transpose job and the circle lines indicate the
communication delays of the diagonal job.

The job located to the outside has higher communication delays than the job
located to the inside, regardless of the communication patterns. This is because the
intermediate contention delay caused by the inside job is much more severe than

the starting contention delay caused by the outside job. The overall communication

87

 

Diagonal inside—matrix-transpose outside.

Figure 6.10. Job interaction model for internal competing level.

88

60

 

55
Mean 50 i.
Comm. 45 .
Delay 40
35
30

 

 

 

 

I I i ' f
0 50 100 150 200 250 300 350 400 450 500
MTBS of Both Jobs

25

Figure 6.11. Effect of exchanging internal competing levels. (a) represents the com-
munication delay of the inside job and (c) represents the communication delay of the
outside job while the inside matrix transpose job interacts with the outside diagonal
job. (b) represents the the communication delay of the inside job and ((1) represents
the communication delay of the outside job while the inside diagonal job interacts
with the outside matrix transpose job.

performance when the matrix-transpose pattern is to the outside of the diagonal job
is better than when it is to the inside. This is because when the matrix-transpose is
to the outside, the outside job creates a large number of starting competing paths for
the inside job. The performance effect caused by the number of starting competing
paths is not large. When the matrix-transpose job is to the inside, it causes a large
number of intermediate contention paths for the outside job. The performance of a
job is very sensitive to the number of intermediate contention points, and therefore

the diagonal job to the outside will suffer signiﬁcantly.

6.2.4 How to Partition: Effect of Congestion Factor

Suppose we have determined which job will be place to the inside and which job

will be placed to the outside. We must also decide how to partition the outside job.

89

 

Second partitioning method.

Figure 6.12. Job interaction model for partitioning.

For simplicity, we focus our attention on partitioning along a vertical line. Different
manners of partitioning will change the congestion factor even if the communication
rates and the internal competing levels do not change.

We compare two different methods of partitioning using the model in Scenario A,
which has two transpose jobs communicating with different rates. Figure 6.12 illus-
trates the two methods of partitioning. The ﬁrst method partitions the outside job
between the ﬁrst and second columns. The second method partitions between the
second and third columns. The ﬁrst partitioning method causes one row to have a

high congestion factor and three rows to have low congestion factors. The second

90

partitioning method causes each row to have a medium congestion factor.

Figure 6.13 shows the effect of the communication rates of the outside job on the
communication delay of the inside job. The dotted line is for the ﬁrst partitioning
method, and the solid line is for the second partitioning method. The communication
delay of the outside job for the ﬁrst partitioning method is lower than the second
partitioning method for all ranges of communication rates of the outside job. However,
the communication delay of the inside job for the ﬁrst partitioning method is lower
than the second partitioning method only if the communication rate of the outside
job is not too high. The reason is because the high congestion caused by the outside
job saturates the channel when the communication rate is high. Therefore, if the
communication rate of the outside job is not too high, then partitioning at the vertical

line where the overall congestion factor is lower is a good decision.

6.2.5 Discussion

Let us consider the scattered processor allocator of a MPC system that serves multiple
jobs simultaneously. When a job is scheduled for allocation by the job scheduler of the
system, the processor allocator allocates the job in a contiguous partition, if possible.
Otherwise, the allocator may partition the job into several pieces and allocate the sub-
pieces into scattered regions. In this case, our conclusions of Section 6.2.2—6.2.4 could
be used as “rules of thumb” by the scattered allocator. As an example, the result of
Section 6.2.4 can be used for the Multiple Buddy Strategy in [7]. The strategy divides
the request of the job into smaller square submeshes whose sides have sizes of 2" when
there are no contiguous partitions that are big enough for the required submesh. At
each time of division, the scattered allocator should decide which among the four
equal buddies is divided further. Obviously, different decisions cause different ways of
partitioning the job. The result of Section 6.2.4 suggests that if the communication

rate the job is not too high, then partitioning at the vertical line where the overall

Comm.

Delay
of

Inside

Job

Comm.
Delay

of
Outside
Job

91

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

120 I I I I I I I I r
MTBS", = 0 <>~ -
MTBS” = 0 9—
100 MTBSgn — 25 '0' - -
«a. MTBSgn — 25 '6—
<>. MTBSggg =100 . -
30 <>. MTBS... = 100 «a— _
{e¢: . A A ‘ MTBSgn : 500 .*. .
.1 - 3: e . M TBS... _ 500 a—
C a... v 9 ‘
60 Q <9 <9 6 .......... {5 .......... ,,
2:“. I 0.0 ..... ..... --HH : ; g
40 If“ .1 .0 ..... G .....................
""""" ':::T:‘T';';:°I:::::::::::‘""""'
20 l I l l l l l l l V?
0 50 100 150 200 250 300 350 400 450 500
MTBS of Outside Job
200 I I I I I I I I I
MTBSg =0 0-
4 m
180 , x. MTBSggg— _ 0 e~
160i \\ MTBSg'n— — 25 "G '._.
140:0 '._ \ . :100 0°C. _
"<>-.<> “ ‘ MTBSEZ = 100 +—
120 9'9 ‘ g M TBSggg = 500 -»t- - _.
.0 ~‘ ~ <2 ..... <9 ‘ ¢ MTBSg9\= 500 '1"—
100 b . e ‘ v _
G-o. ‘ ; (>5. V\<
80 G'o ‘ ”...Q . _
,g» ’0' ' : ------- .9 .......... .<
60 ,' 0 '-.0 ..... o ..... G”. v —£’
0.. g """"" O ........... c
40 3'0. 00' ooooo o ........ 3 ‘ 3 _l
20 l l l i l ----;i::::.*.....; .....
0 50 100 150 200 250 300 350 400 450 500

MTBS of Outside Job

Figure 6.13. Effect of partitioning.

92

congestion factor is lower is a good idea.

As an second example, we can apply the results of Section 6.2.2 and 6.2.3 in order
to improve the performance of the Naive allocation strategy that is used in [7]. Under
the strategy, a job request for k processors is allocated the ﬁrst It free processors
in a row major scan of the mesh. According to the result of Section 6.2.3, the
communication rate affects the communication interference between the interleaved
jobs more signiﬁcantly than the internal competing level. So, in the situation that
the processors that surround the submesh allocated to a job should be allocated
to the other job, the processor should check the communication rate of the already
allocated job. If the communication rate of the allocated job is high, then the result
of Section 6.2.2 suggests that it would be better not to allocate the processor to the
job. The allocation surrounding a job with high communication rate may decease
the communication performance of both jobs signiﬁcantly. In this case, the allocator
would improve the performance of the system by allocating the job to next possible
places in a row major, or do not allocate the job in such a scattered way if there are
no other possibilities.

An important argument against the practicability of our approach is that the
characteristics of communication patterns and rates are not known before the time
of processor allocation. If no characteristics of any job can be known in advanced,
then no technique to exploit the characteristics can be used. Nevertheless, many
jobs that use parallel processing systems execute for long periods and are re-executed
many times. Many researchers have been building tools to analyze the performance
bottlenecks of parallel computations. For jobs that execute for long periods, and are
re-executed many times, the analysis of the patterns and rates of communication of the
long and frequently executing jobs may be very beneﬁcial to the overall performance of
every user of the parallel processing system. Therefore, when these job characteristics

are acquired, they might be used by the processor allocator when it makes its decisions

93

of scattered allocations.

6.3 Summary

Jobs interact with each other due to overlapping communication paths. The over-
lapped competing paths cause contention delays to be suffered by each job. Our re-
sults from an analytical model and simulation indicate that the effects of intermediate
competing paths are more signiﬁcant than the effects of starting competing paths as
the number of competing paths and the communication rate increase. Our analysis
provides guidelines for placing and partitioning jobs. For example, it is better to have
the highly interactive jobs partitioned and placed at locations that cross the paths
of other less interactive jobs, rather than partitioning the less frequently interacting
jobs and placing them at locations such that they suffer from many intermediate con-
tention points. Likewise, the point at which a job is partitioned is important, such

that it is beneﬁcial if the point of partitioning will lower the congestion factor.

CHAPTER 7

EFFECTS OF JOB SIZE
IRREGULARITY ON
DYNAMIC RESOURCE
SCHEDULING

In order to provide a highly utilized parallel computing system, the problem of frag-
mentation has been addressed by a number of research studies. Most research has
focused on developing innovative processor allocation strategies that can minimize
system fragmentation. Many innovative strategies for allocating jobs to parallel com-
puting systems have been proposed [66, 6, 67, 69, 7]. The proposed processor alloca-
tion strategies have been used in association with ﬁrst-come—ﬁrst-serve job scheduling,
in which the jobs come to the system in a single queue. We will use SQ-FCFS to notate
this job scheduling approach. A cited reason for using SQ-FCF S for job scheduling
is because researchers wish to focus on the relative merits of the processor allocation
strategies being explored. More importantly, SQ—FCFS is favored because it has an
inherent notion of fairness, which is to say that jobs are served as they arrive in the

single queue, and jobs are not favored on the basis of the size of the partitions that

94

95

they require.

Nevertheless, Krueger et al. [15] observed for a hypercube multicomputer sys-
tem that the system performance is affected more signiﬁcantly by the job scheduling
algorithm rather than the processor allocation strategy. Sophisticated processor allo-
cation strategies do little to improve the response time in relation to job scheduling
strategies. By allowing jobs to be scheduled in an order that does not follow F CF S,
the system utilization can be improved while the system fragmentation is reduced.
Krueger et al. proposed the scan [15] discipline for hypercube multicomputers. Ac-
cording to the scan discipline, job arrivals are placed in multiple queues corresponding
to the sizes of the subcubes requested. Jobs are scheduled by scanning through the
non-empty queues, similar to the c-scan algorithm for disk scheduling [106]. The
authors showed that with the scan algorithm a simple processor allocation strategy
such as buddy allocation performs as well as more sophisticated strategies under most
workload conditions for a hypercube system [15].

In contrast to the hypercube, jobs requesting resources on a general 2-D mesh
system could request computing nodes that form irregular-shaped submeshes. In this
chapter, we investigate effects of irregularity of job shapes and sizes and effects of
a job scheduling strategy that uses multiple queues on the performance of dynamic
scheduling. For the study, we examine a dynamic scheduling system that schedules
jobs with requests that range from regular-shaped partitions of a multicomputer to
irregular-shaped partitions. The employed job scheduling strategy, called the BWQ-
search algorithm, provides additional opportunities for allocation of resources to jobs
that require smaller submeshes when the job scheduler is blocked by a large job that

cannot be assigned immediately. By means of this algorithm, we identify important
corflponents of a multiqueue job scheduling strategy and evaluate the relative beneﬁt
of the components. We address that irregularities of the shapes and sizes of jobs are

imPortant factors affecting the performance of a resource scheduling algorithm in a

96

2—D mesh multicomputer.

Section 7.1 provides motivation for our performance study of the job scheduling
problem in a 2-D mesh system. A simulation model of a system on which our approach
is analyzed is presented in Section 7.2. Section 7.3 includes a description of the job
scheduling strategy. The results of a study of the performance is given in Section 7.4.
Based on the simulation results, we discuss effects of the variability in the size and
shape of an incoming job on the performance of the job scheduling algorithm. A

summary and conclusions are given in Section 7.5.

7 .1 Motivation: Irregularity of a 2-D Mesh Sys-
tem

The development of a dynamic resource scheduling algorithm for a general 2-D mesh
multicomputer must deal with the inherent property of irregularity in the size of job
requests and the irregularity in the shape of processor allocations. The irregularity
of a 2-D mesh system is clearly illustrated, if it is compared with the inherent regular
characteristics of jobs scheduled and allocated to a hypercube multicomputer system.
Due to the special structure of a hypercube system, an incoming job request to the
hypercube will require 2" computing nodes that are conﬁgured as a subcube of the
hypercube. Note that the unallocated parts of the system are also 2" subcubes. It is
straightforward to develop a job scheduler and a processor allocator that utilizes these
regularities in order to reduce system fragmentation. In contrast to the hypercube,
jobs requesting resources on a general 2-D mesh system could request computing
nodes that form irregular-shaped submeshes. Figure 7.1 shows an example of the
E“llocation of jobs to submeshes in a 2-D mesh multicomputer system. It assumes
that the allocated submeshes match the shape and size requested by incoming jobs.

The gray rectangular regions represent the allocated processors and the white

97

 

 

 

 

 

Figure 7.1. Processor allocation in a general 2-D mesh multicomputer system.

regions represent the unallocated regions of free processors. Most of the white regions
have the irregular shapes and sizes and therefore it may be difﬁcult to ﬁnd a place
to accommodate the next scheduled job even if there remains a partition for the job.
The average amount of system fragmentation may vary and depends on the freedom
that the shape of jobs may assume and sizes that a job can request. Greater freedom
in the variations of shapes may cause larger system fragmentation. An important
design issue for a resource scheduling strategy is to study the amount of performance
degradation with respect to the amount of irregularity of the size of a job or the
irregularity of the shape of the partition of processors the job will occupy. For this
purpose, we study the performance effects of four representative cases of jobs, which
can be described in terms of the shapes of the perimeter of the required partitions.
The shapes are chosen with the restriction that processors are allocated in contiguous
regions in which the routing technology of the multicomputer will not overlap one
job’s message trafﬁc with the message trafﬁc of other jobs. Then, the type of jobs
illustrated in Figures 7.2(a)-(d) would not overlap message trafﬁc of different jobs.
For our study, the length of the sides of a partition will determine the shape of the

jobs. The following job shapes are considered:
0 Square-2": each side length is equal, and the length is 2", where k 2 0.

o Square-var: each side length is equal, but can be any integer value in the range

98

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a) Square-2 "k (b) Square- Variable

 

(c) Rectangular-restricted (d) Retangular—unrestricted

Figure 7.2. Four different types of inputs.

from 1 to the maximum side length of the mesh system.

0 Rectangular-restricted: side lengths of a job partition can be different, but the

difference should not be larger than a given constant.

0 Rectangular-unrestricted: each side length can be any integer value from 1 to

the maximum size of a side of the mesh system.

7 .2 Simulation Model

We model a dynamic scheduling system in which job arrivals follow a Poisson process.
An incoming job consists of a number of interacting tasks that communicate via
messages and speciﬁes the geometry of a submesh it will need to occupy. The job will

be allocated to a submesh of the requested geometry to avoid internal fragmentation.

99

The four cases of jobs described in the Section 7.1 are considered. The hold time for
each job is assumed to be exponentially distributed and independent.

Since little information is available about the computing time demands required
for jobs on mesh multicomputers, we consider the case that jobs are executed on
a multicomputer in order to increase the throughput produced by a job, which is
described as uncorrelated workloads by [15, 34]. Therefore, the amount of work done
by each processor is independent of the submesh size. A job that consumes a large
submesh will hold the submesh relatively the same length of time as a job requiring
a small submesh. The larger submesh will have a greater computational throughput.

Our model of a multicomputer system consists of a processor manager and a
number of general-purpose processors that are interconnected by a 16*16 2-D mesh
network. In our study we consider only the allocation of processors to a job in a
contiguous region. In order to allocate processors to a job in a contiguous region, we
used the frame sliding method [6], which was developed for a general 2—D mesh system
and allocates jobs in contiguous regions in an efﬁcient way of minimizing system
fragmentation. For evaluating components of job scheduling algorithms, a group-
based strategy, called BWQ-algorithm, is used and described in the next section. The
algorithm is studied for its ability to increase system utilization by reducing system

fragmentation and improve the mean job turnaround time.

7 .3 The BWQ Searching Algorithm

One method for restricting fragmentation due to processor allocation is to group jobs
of similar sizes to locations in the multicomputer in a close vicinity. It is straight-
forward to implement the concept of grouping within a job scheduler by using multiple
queues. Depending on how the queues are manipulated, an implementation may be

classiﬁed as either blocking-multiqueue scheduling or nonblocking-multiqueue schedul-

100

ing. If a job is so large that it cannot be assigned to adjacent free processors, the
ﬁrst strategy blocks the job scheduler until processors become available. In contrast,
the second strategy provides an additional opportunity to allocate jobs that require
smaller submeshes. The job scheduler searches for jobs within the current queue or
other queues. The underlying idea behind blocking-multiqueue scheduling is to take
advantage of the grouping effect by blocking the job scheduler and collecting jobs of
the similar sizes during the blocking period. This strategy can help reduce system
fragmentation. However, as the blocked job waits, not only does its turnaround time
increase but the turnaround times of other waiting jobs increase, which may result in
a large average job turnaround time for the system.

We propose a nonblocking multiqueue—based job scheduling algorithm, called the
BWQ algorithm. This job scheduling algorithm is divided into two parts: the Between
Queue (BQ) policy and the Within Queue (WQ) policy. The BQ policy controls the
order that a scheduler selects queues. The WQ policy controls the order that jobs
are selected from a queue. Each queue has its own Within Queue(WQ) policy. In
general, the WQ policy for each queue reorders the jobs within the queue so that in
some situations a smaller job can be allocated before a larger job for which a partition
is not available. However, the WQ policy is simply FCFS if all jobs in a queue have

the same size.

7.3.1 Between Queue (BQ) Policy

The BQ Policy controls the order that queues are considered for selecting jobs to
allocate. Queues are ordered according to the size of the jobs they contain, ranging
from the smallest to largest jobs as illustrated in Fig. 7.3.

Initially, the scheduler searches the queue that holds the smallest jobs and sched—
ules its jobs according to the WQ policy of the queue. If the queue is empty, the

scheduler moves to the next non-empty queue in circular-right pattern and tries to

101

Job arrives

Q0 Q2 n

o allocator

 

 

 

 

 

 

 

 

 

 

 

 

Figure 7.3. Multi-queue based job scheduling.

schedule the jobs of the queue with the WQ policy. When all jobs are scheduled and
assigned without blocking, the BQ policy moves to the next non-empty queue. If no
jobs within the queue can be scheduled for allocation, the job scheduler moves back to
a. non-empty queue that holds smaller jobs. The scheduler is not necessarily blocked
from assigning jobs for which partitions exist. Therefore, an additional opportunity
for resource allocation is given to smaller jobs that require smaller submeshes. This
non-blocking property may decrease the turnaround times of the smaller jobs and
reduce the external fragmentation in the system. Depending on the scheduling pol-
icy, the scheduler may move to the queue with the smallest jobs or to the previous
non-empty queue. The policy needs a limit on the number of times the scheduler
can return to queues holding smaller jobs to avoid starvation of larger jobs. Every
time the scheduler passes a large job, the priority of the job is increased by 1. If
the priority reaches the Between Queue Search Limit (BQSL), then the scheduler is
blocked to schedule the “starving” job. Initially the priority of each job is set to zero.
The BQSL is a design parameter with performance implications that we will discuss

later in this section.

7 .3.2 Within Queue (WQ) policy

The WQ policy is controlled by a lookahead window that provides an additional
allocation opportunity for smaller jobs within a queue. When a job or multiple jobs

are ready to be scheduled within a single queue Q g, the jobs that are chosen from the

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

J30 o 123 8

129 o ,2

127 0 QB1 "'3'2'6 ----- 0""

125 0

124 0 iiiﬁi 'E 126 o

122 0 5 19 0 = 118 0

:21 0 WI‘ J7 1 E ii4 8

119 0 : : , _1 _________ _:
w0{:::--Tf7"""0 ..... :3 I - --Ji--.---2.--_____ WZE- - J3 O - E

 

 

 

Q0 Q1 Q2

Figure 7.4. An illustration of job scheduling of three queues.

queue for scheduling come from the set of jobs within the lookahead window of the
queue. Initially the set of jobs in the lookahead window are the n jobs at the head
of the queue, where n is the lookahead window size. If n is greater than one and the
ﬁrst job of the queue cannot be allocated, then the job scheduler considers jobs in the
queue within the lookahead window to ﬁnd a job that is allocatable. If it ﬁnds such
a job, it schedules the job and looks to the next n jobs. Otherwise, the job scheduler
moves to another queue according to the BQ policy. A large lookahead window size
increases the probability of ﬁnding a job that can be scheduled, but also increases
the cost to search for a job. The window size needs to be bounded depending on the
type of jobs assigned to a particular queue. The window size of each queue is a design
parameter that is discussed later in this section.

Figure 7.4 shows the state of a queuing system of an job scheduler that has 3
queues. The ﬁrst queue (Q0) is a FCFS queue, i.e., the size of the lookahead window
is 1 and the BQSL is ﬁxed to 0. Jobs in the queues have priority P set to 0. Notice

that J5 and J7 in queue Q1 have P > 0, which means that these jobs were not able

103

to be scheduled in previous attempts.

The second and third queues, Q1 and Q2, are general lookahead window queues.
Each has a BQSL given as a design parameter. After scheduling the jobs in Q0, the
scheduler is currently working in Q1. Some jobs have been assigned and the scheduler
searches within the lookahead window of Q1. Notice that Q1 has a dotted—line that

is called the Qboundary. This parameter is discussed in the next subsection.

7 .3.3 Design Parameters

In this subsection, three design parameters are studied to maximize system perfor-
mance. To isolate the effect of each parameter, we ﬁx the other parameters at speciﬁc
values. Experiments were conducted for all four job types. Since some of our stud-
ies produce similar results, only the results of the rectangular-unrestricted input are

displayed when they are similar.

Number of queues

The number of queues used is an important design parameter that can have a signif-
icant effect on the system performance. In a hypercube multicomputer system, the
number of queues used is the number of possible subcubes whose sizes are powers of
2 [15]. For the 2-D mesh system we cannot apply the same method since jobs have
irregularity. This parameter depends on factors such as the allocation strategy, job
arrival rates, distribution of job sizes and system size.

Figure 7.5 presents the effect of increasing the number of queues on the perfor—
mance of BWQ job scheduling algorithm in our simulation model of mesh system.
We could use a large number of queues due to the variability of the possible sizes for
all types of jobs. However, the number of queues was limited to six or less in order to
study the effect of increasing the number of queues. The Simple case of using only one

queue implies a FCFS queue. The BQSL of each queue is set to 30. Thresholds that

104

 

 

 

 

 

60 I I
SQ-FCFS @—
50 2Q-PL -+— 1
3Q—PL +—
Mean 40 4Q” '°— 1
'Ihrnaround ﬁQ:PL ::
Time 30 r
20 —
10 '-
0 l l l l l l

0.25 0.3 0.35 0.4 0.45 0.5 0.55
System Utilization

Figure 7.5. Response Time (RT) effect of increasing number of queues on the perfor-
mance of job scheduling policy: BQSLs = 30, Input type = Rectangular-unrestricted.

establish the sizes of jobs placed in each queue are deﬁned in order to balance the
number of jobs that arrive at each queue. When the number of queues is increased
from one to three, the mean turnaround time decreases signiﬁcantly. However, for
more than 3 queues the mean turnaround time does not improve. Consequently, for a
certain number of queues we obtain a system performance that is near the maximum.
The appropriate number of queues may change, depending on the allocation strategy,
distribution of job sizes, and the size of the mesh system. Larger mesh systems with
incoming jobs that have greater variability in sizes may need more queues. For our

simulation model, 3 queues are enough.

Between Queue Search Limit(BQSL)

Another important design parameter is the BQSL of each queue. Any queue with
a BQSL equal to zero becomes a scheduling bottleneck. Large jobs in a queue that

cannot be scheduled immediately block all other jobs in the system, which may result

105

in an increased mean turnaround time. Therefore, the BQSL of each queue should be
tuned in relation to the BQSLS of the other queues. Results providing information
about the tuning of the BQSL are displayed in Fig. 7.6 and 7.7. Three queues were
used and incoming jobs of the unrestricted rectangular type were generated with the
arrival rate set to result in 50% system utilization. Because Q0 is designed as a
FCFS queue, the BQSL for Q0 was set to 0. The simulation results Show that mean
turnaround times are much higher when at least one of the BQSLS is zero. Therefore, a
signiﬁcant improvement in performance occurs when the BQSL of each queue is larger
than zero. These results imply that providing an additional scheduling opportunity
for smaller jobs when blocking would occur (i.e., non-blocking) has a profound effect
on the mean turnaround time. However, it is inappropriate to set the BQSL to a
value that is too large. If the BQSL is greater than or equal to 5, then the mean
turnaround time increased signiﬁcantly since large jobs can wait too long. Another
interesting aspect is the relative effect of the BQSL of one queue on the performance
of the jobs in another queue. When we compare Fig. 7.6 and 7.7, the effect of the

value assigned to the BQSL of Q2 is more signiﬁcant than the value assigned to Q1.

Lookahead Window

The lookahead window of each queue is used to reorder jobs within a queue. The
size of the lookahead window is another factor that should be tuned appropriately for
system performance. At a high level of system utilization, as illustrated in Fig. 7.8,
large lookahead windows improve the performance. The diamond plot of Fig. 7.8
shows the simulation results when the lookahead window size of queues 1 and 2 is
one (WS=1). The cross plot of Fig. 7.8 shows the improved performance for larger
window sizes. We have found that the performance improvement due to the lookahead
window is sensitive to the type and characteristics of the incoming job stream. An

incoming job stream with a wide variation in the sizes of jobs beneﬁts much more

106

 

I I I I I I I

 

 

 

 

 

 

130 BQSL for Q2 = 0 B— I

120 BQSL for Q2 = 1 —)(—— ‘

110 _ BQSL for Q2 .GT. 5 A— _

Mean 100 r 4
D E} E]

Turnaround 90 - —
Time 80 - —
50 I I I I I I I ~

0 2 4 6 8 10 12 14
BQSL for Q1

Figure 7.6. Effects of changing BQSL of Q1: 3 Queues with both window sizes = 1,
Input type:Rectangular-unrestricted.

 

 

 

 

 

 

 

130 ' ' ' T ' ' ' ..

BQSL for Q1 = 0 B—
120 BQSL for Q1 = 1 -x— ~
110 BQSL for Q1 .GT. 5 A— _
Mean 100 r -
'Ihirnaround 90'5 -4
Time 80 —
70 _
A 1
60 a T

50 a I I I I
0 2 4 6 8 10 12 14
BQSL for Q2

Figure 7.7. Effects of changing BQSL of Q2: 3 Queues with both window sizes = 1,
Input type:Rectangular-unrestricted.

107

 

60 I I F I I

55 r WS of Q1 = 1 and WS2 of Q2
50 WS of Q1 = 30 and WS of Q2

    

19—
5—I—

   

Mean

Turnaround

35
30
25
20

15 I l l l l l l l l

0.46 0.465 0.47 0.475 0.48 0.485 0.49 0.495 0.5 0.505 0.51
System Utilization

Time

 

 

 

Figure 7.8. Response Time (RT) effects of lookahead windows: 3 queues, BQSLS of
Q1 and Q2 = 30, Input type:Rectangular—unrestricted, FS-allocator.

from a larger lookahead window size than an incoming job stream with uniform sizes

of jobs.

Qboundary

The purpose of the boundary is to avoid starvation of jobs. Only the current queue
for which the scheduler is selecting jobs has a Qboundary. It is established when
the scheduler ﬁrst considers a queue and remains until all jobs below the Qboundary
have been scheduled. .I obs arriving after the Qboundary has been established will not
be considered candidates during the current scheduling phase of the queue. Suppose
there is no Qboundary established for a queue, e.g., Q1. Further suppose all new
jobs are placed in Q1. Therefore, the scheduler will only select jobs from Q1, which

results in the other jobs starving.

108

 

 

 

 

60 I I I I l I

Square-2k '9—
50 - Square—Var -I— —

Rectangular—Restricted
Mean 40 _. Rectangular-Unrestricted _
Turnaround

Time 30 - —
20 - -
10 r _

0 I I I I I

0.3 0.4 0.5 0.6 0.7 0.8

System Utilization

Figure 7.9. Response Time (RT) effects of variability in size and shape on job
turnaround times of BWQ algorithm: Q1 with BQSL = 10 and Q2 with BQSL
= 1.

7 .4 Performance Effects of Variability in Size and
Shape

Once we obtain parameters for the BWQ algorithm, we are ready to examine the
effects of the variability of sizes and irregularity of shapes of jobs on the dynamic
scheduling system. Figure 7.9 shows results that demonstrate these effects. The four
curves show the mean job turnaround times of BWQ scheduling algorithm for the
four different types of inputs described in Section 7.2. An important feature is the
difference between the curves for Square-2" input and the other inputs. The mean job
turnaround time of Square-2" is stable up to 60% system utilization. In contrast, the
other inputs become unstable before 45% system utilization. This fact implies that a
variability of input sizes has a dramatic effect on the performance of a job scheduling
policy.

One of the key factors that contributes to the difference in the results is the system

109

fragmentation. As the system utilization increases to 80%, the external fragmentation
due to Square-2" input decreases to 12%. However, the other types of inputs cause the
system to become unstable at lower system utilizations, resulting in a high external
fragmentation (38% external fragmentation at 45% utilization). Although our job
scheduling algorithm tries to reduce system fragmentation by allowing non-blocking
for the other types of inputs, the processor allocator has difﬁculty in assigning jobs
efﬁciently because the sizes of inputs vary greatly. While the variability of input size
has a profound effect on the performance of job scheduling algorithms, irregular input
shapes do not make a signiﬁcant difference in the system performance, as illustrated
in Fig. 7.9. The performance for the input types Square-var, Rectangular-Restricted,
and Rectangular-Unrestricted are similar as the system utilization increases. As a
result, we have the freedom to change input shapes when assigning jobs. Nevertheless,
the “less-regular” jobs (i.e., Rectangular-Unrestricted) display a little poorer result
than the “more regular” jobs (i.e., Square-var). From the above analysis, we can

conclude:

o The type of input has signiﬁcant effects on the performance of a job scheduling
algorithm. A regular—shaped job — regular-shaped (2") cluster — can be sched-
uled with signiﬁcantly better performance in comparison to other input shapes

that have less regularity.

o If the geometry of job partitions are allowed to be arbitrary shapes, then differ-
ences between separate classes of irregular jobs (e. g. , Square-Var, Rectangular—

Restricted, Rectangular-Unrestricted) are insigniﬁcant.

Restructuring Job Input

Suppose we allow scattering a submesh requested by a job into several smaller sub-

meshes. This assumption is possible if wormhole routing is used as a switching tech-

110

nology for the mesh multicomputer and the frequency of communication between
processors allocated to a job is not large. We can improve the performance of the job
allocation strategy by dividing the required submesh into square-shaped submeshes
with side lengths equal to 2", k 2 0. A simple buddy processor allocation strategy
with a multiqueue-based job scheduling algorithm can be used to achieve good job
scheduling performance. However, further complications are required in order to co-
schedule the separate submeshes that are part of a single job. Also, if a large amount
of communication is required between the processors allocated for a job, then the
contentions for the communication system can be signiﬁcant. As a result, the com-
munication contention generated by one job will likely affect the performance results
obtained by other unrelated jobs. Other jobs can be inhibited from accessing the com-
munication network if jobs must communicate between processors that are dispersed
in the multicomputer. For jobs that require a large amount of communication, it is
advantageous to allocate partitions of processors to the jobs in contiguous regions of
the multicomputer so that they will not inhibit the performance effect of other jobs.
Therefore, we need a strategy that can transform the shape of an incoming job to
allow for efﬁcient utilization of the system, while keeping the submesh allocated to

the job in a contiguous shape.

7 .5 Summary

We discussed the performance effects of arbitrary sizes of jobs in 2—D mesh system
and proposed a job scheduling strategy based on a concept of grouping. The job
scheduling algorithm used is a nonblocking multiqueue-based job scheduling algo-
rithm that is efﬁcient and suitable for general 2—D mesh multicomputer systems. In
order to decrease the job turnaround time and increase the system utilization, the

strategy minimizes the external fragmentation by reordering the scheduling of jobs.

111

The simulation results showed that the amount of inherent system fragmentation of
a 2-D mesh system depends on the amount of irregularity of the sizes and shapes of
job requests. There is a large performance improvement between very regular-shaped
partitions and the other types of partitions. The results were analyzed in the context

of developing a processor allocation strategy for wormhole-routed 2-D mesh systems.

CHAPTER 8

EFFICIENT JOB SCHEDULING
WITHOUT DISCRIMINATION
AGAINST LARGE JOBS

Research activity has been divided between those seeking better system response
times by means of job scheduling and those who insist that fair scheduling requires
F CFS. Since SQ-FCFS does not favor small jobs to the detriment of large jobs, re-
searchers continue to use SQ-FCF S and concentrate on processor allocation tech-
niques. Nevertheless, other researchers have made efforts to develop job scheduling
algorithms that can achieve signiﬁcant performance gains by reducing system frag-
mentation. The BWQ-search algorithm in Chapter 6 is the effort in this direction.
Generally, these algorithms favor small jobs by changing the execution order and have
been criticized because they provide improved average performance by favoring small
jobs at the expense of large jobs [65].

In this chapter we propose a new job scheduling discipline, called the HELM
discipline, which takes advantage of the performance gains that are possible via job
scheduling, while it ensures that small jobs are not inappropriately favored in compar-

ison to large jobs. The HELM discipline adapts its scheduling policy to the changes

112

113

of workload by using parameters that dynamically update their values depending on
the history of the workload (A detailed description of these parameters is given in
Section 8.2). Under low loaded conditions HELM follows the SQ-FCFS discipline.
If the load is low, little performance gain would be possible even if a non-FCFS
scheme was used. Under a highly loaded condition, the HELM discipline reorders
the jobs to increase the utilization of the system. However, to avoid the long waiting
time charged to large jobs that have been skipped several times in order to allocate
smaller jobs, the discipline checks the waiting time experienced by a job when it con-
siders scheduling it. If a job waits for such a long time in comparison to the average
waiting time experienced by jobs that have been served, then the discipline raises the
priority of the job so that it will not be skipped again. To evaluate the fairness of the
HELM scheme, we compare the performance ratio that large jobs achieve relative to
the performance of small jobs. This is evaluated by classifying jobs into two groups,
large and small, and then computing the ratio of average waiting time experienced
by the large jobs to the average waiting experience by the small jobs. We compare
our algorithm’s ratio with that of SQ-FCFS.

The rest of the chapter is organized as follows. The following section describes
the HELM discipline. The design parameters of the discipline are examined in Sec-
tion 8.2. Section 8.3 presents simulation results and a comparison to other disciplines.

Concluding remarks are given in Section 8.4.

8.1 The Proposed Strategy

The HELM scheduling discipline controls the order of scheduling by using three types

of queues: High-priority queue, Entrance queue, and Lookahead Multiple queues

(HELM).

114

o The Entrance Queue (EQ): The entrance queue is a FCF S queue with a
lookahead window (LW). This queue is used for keeping jobs in the order of
arrival. When an incoming job arrives at the multicomputer system, the job is
placed in the entrance queue in the F CFS fashion. The job scheduler schedules
jobs from EQ one by one. When an attempt to allocate the current job has
failed, the HELM discipline allows the scheduler to look ahead to the next
job, passing the current job to the Lookahead Queues (LQ), which will be
described below. The amount of look-ahead is controlled by the lookahead
window (LW), whose size is adaptive to the change of workload. Tuning the
size of LW for the entrance queue is discussed in Section 8.2. Due to the
adaptive lookahead window, the HELM discipline has a tendency to follow the
F CF S discipline, especially under low loaded conditions. Under highly loaded
conditions, however, the FCFS discipline is inefﬁcient because of the randomness
in the shapes and sizes of consecutive jobs. As the system workload becomes
high, the HELM discipline places large jobs in the LQs where jobs are scheduled

in an efﬁcient order to achieve better system performance.

0 The Lookahead Queues (LQ): The jobs passed to the lookahead queues are
classiﬁed according to the size of request submesh and are queued to one of the
LQs in the order of arrival. The HELM discipline attempts to schedule all the
jobs in the LQs one by one, skipping the jobs that cannot be allocated. The
advantage of using lookahead queues is that the HELM scheduler can allocate
as many jobs as possible under a highly loaded condition. However, jobs are
scheduled in an different order from FCF S. To avoid unfair treatmentof large
jobs, the HELM discipline uses a parameter called the Upper Bound of Waiting
Time (UB WT). UBWT is a linear function of the average waiting time of the

jobs that have been served so far. The coefﬁcient of the linear function has to

115

be tuned. The UBWT is a dynamic parameter that is updated every time a job
leaves the system after its execution. When a job in a lookahead queue cannot
be allocated by the scheduler at its turn, the HELM discipline compares the
waiting time of the job to the current value of UBWT. If the value of UBWT
is larger, then the scheduler passes the job. Otherwise, the job is moved to
the high-priority queue in which the job would not be passed again until it is

allocated. The high priority queue is described below.

0 The High-priority Queue (HQ): The jobs that have waited in the system
more than the UBWT are moved to the high priority queue. This queue is
a typical FCFS queue. This queue is highest in priority in comparison to the
other queues; the HELM scheduler checks this queue ﬁrst when a job releases
processors allocated to it. If there are jobs in this queue, the HELM scheduler

schedules the jobs one by one in the FCFS fashion until the queue is empty.

Figure 8.1 describes the algorithm of the HELM discipline. The job scheduler
based on the HELM discipline ﬁrst checks whether the HQ is empty. If not, then the
job scheduler schedules the jobs in HQ one by one in F CFS fashion until the HQ is
empty. When the HQ becomes empty, the control of the job scheduler moves to LQS.
If there are some jobs in the LQs, then the HELM scheduler considers all the jobs in
LQs, checking whether each considered job can be allocated. After considering all the
jobs in a LQ, the HELM scheduler moves to the next non-empty LQ. If the allocation
of the current job is possible, then the job scheduler schedules the job. Otherwise,
the scheduler checks whether the waiting time of the job is larger than UBWT, the
upper bound of waiting time, where U BWT = K * AWT, K is the positive constant
and AWT is the average waiting time of the jobs that have left after their executions.
If the waiting time of the job is larger, then the job is move to HQ. Otherwise, the

scheduler passes the job and considers the next job in the next non-empty LQ. After

116

do{
wait( not_empty );
while( there is something in queues ) {

/* SCHEDULE THE HIGH-PRIORITY QUEUE (HQ) */
while ( H0 is not empty ) {

schedule and allocate the jobs in HQ in the FCFS fashion
} /* end of while *I

It SCHEDULE THE LOOKAHEAD QUEUES (L0) */
while( All LQs are not empty as All jobs in LQs are not considered ){
J = the first non-considered job from the next non-empty LQ
if (J is allocatable) then allocate J
else {
It Update the Upper Bound of Waiting Time (UBWT) */
/* The coefficient of UBWT, K1, is determined in Section 8.2.2 */
UBWT 8 K1 * the average waiting time of the served jobs
if( HTCJ) > UBWT) move J to H0
}
} /* end of while */

It SCHEDULE THE ENTRANCE QUEUE (E0) */

/* Update the Size of Lookahead Window (LookaheadSize) */
/* The coefficient of LookaheadSize, K2, is determined

/* in Section 8.2.3 */

Cnt-Lookaheaded_Jobs I O;

LookaheadSize = K2 * the average queue length of E0

while( E0 is not empty as Cnt-Lookaheaded-Jobs < LookaheadSize ){

J = the first job in E0

if (J is allocatable) then allocate J

else {
Cnt-Lookaheaded_Jobs++
move J to one of L0 depending on the given classification policy
/* In our simulation we classify the jobs

depening on the size of job request *I

}
} /* end of while */

It To AVOID BUSY WAITING */
if ( Nothing could be allocated ) then wait until a job leaves

} I* end of while */

}while(TRUE);

Figure 8.1. The HELM Algorithm

117

considering all the jobs in LQs, the control of the HELM scheduler moves to the EQ.
The scheduler considers jobs within the size of lookahead window. If the job can be

allocated, then it schedules the job. Otherwise the job is moved to one of the LQs.

8.2 Design Parameters

We conducted a set of discrete event-driven simulations in order to ﬁnd suitable
values for the parameters of the HELM algorithm. The HELM discipline has three
parameters to be tuned: the number of lookahead queues, the coefﬁcient of UBWT for
LQs, and the size of the lookahead window. To isolate the effect of each parameter,
we varied the value of one parameter at a time while the other parameters were ﬁxed
to constants.

The simulator, which is written in the CSIM simulation language, progresses its
time according to the occurrences of events, which are job arrivals and departures.
The interarrival time and the service time of jobs are assumed to be exponential.
In the simulation, the mean service time was ﬁxed to 20 time units, but the mean
interarrival time, called IATM, varied in order to generate various loads. When a
job arrives to the system, the incoming job requests a number of processors. The
number of processors requested by a job follows a uniform distribution. Given the
request by a job, the processor management part of the operating system assigns the
smallest submesh that can accommodate the job request. This method of submesh
assignment generates submeshes having more uniform sizes than the method that
was used in [68, 67]. For the simulation, a 32x32 mesh system (the total number
of processors are 1024) was used. The processor allocation strategy used in the
simulations is F irst-Fit strategy [68]. The simulations run a series of batches of 1000
jobs until a 95% conﬁdence interval is achieved.

Four performance metrics have been employed for examining the performance of

118

the scheduling disciplines. First, the job turnaround time and the system utilization
are measured, which are the most important performance indicators for users and
system administrators. The job turnaround time, denoted by TAT, is deﬁned as the
time interval from the point when a job arrives in the system to the point when the
job leaves the system after its service is done. The system utilization, denoted by
UTIL, is deﬁned the ratio of the number of allocated processors to the total number
of processors in the system. Both are measured when a job leaves the system and
the measured values are averaged after the simulation is done. As another indicator
of system performance, the external fragmentation is measured. This metric will be
denoted by FRAG. When the job scheduler is not able to allocate any jobs, the ratio
of the number of idle processors to the total number of processors in the system is
measured and averaged to compute the external fragmentation of the system. This
metric is also used in order to explain the effects of reordering jobs on the system
performance. Finally, to examine the fairness of the disciplines the L/S Ratio is
measured. This metric classiﬁes the jobs into two groups, a large group and a small
group, depending on the submesh sizes assigned to the jobs. The median job size
is used to distinguish to which group each job is classiﬁed. The ratio of the waiting
time of the jobs in the large group to the waiting time of the jobs in the small group
is deﬁned as the metric, L/ S Ratio. If this ratio is near to 1, the discipline is said to

be fair.

8.2.1 Number of LQs

The two lines in the ﬁrst ﬁgure of Figure 8.2 show the effects of the number of
queues on the turnaround time (TAT) of the MQ-SCAN discipline and of the HELM
discipline, respectively. MQ-SCAN is the job scheduling algorithm using Multiple
Queues in which jobs are scanned similarly to the c-scan algorithm for disk schedul-

ing [106]. This algorithm is a variation of the “scan” algorithm in [15], that is orig-

119

 

 

 

 

 

 

 

 

 

 

 

 

1200 I I I I I I
MQ-SCAN -o—
1000 - HELM -*— q
800 -
TAT
600 r
400 r
i
200 l l l l l l
1 2 3 4 5 6 7 8
Number of Queues
T I I I I I
0'44 ” MQ-SCAN 4._ ‘
0.42 - HELM +— ‘
t
0.4 - _
FRAG 0'38 " ‘
0.36 - ., a _
\D
0.34 - ‘ + k
0.32 - _.
0.3 l l l L l l
1 2 3 4 5 6 7 8
Number of Queues
6 I F F F I I
— MQ—SCAN +-
HELM +—
L/S
Ratio

 

 

 

 

0 l l l l l

1 2 3 4 5 6 7 8
Number of Queues

Figure 8.2. Tuning the Number of Queues in the HELM Algorithm.

120

inally designed for hypercube multicomputers. Like the HELM discipline, the MQ-
SCAN discipline also employs multiple queues. In order to make the TAT sensitive
to the changes of the number of queues, a high workload is applied.

According to the ﬁgure, as the number of queues increases the performance of the
MQ-SCAN discipline improves signiﬁcantly, while the TAT of the HELM algorithm
is relatively low and steady. The improvement of the performance of MQ-SCAN can
be explained in terms of the system fragmentation illustrated in the second ﬁgure
in Figure 8.2. As the number of queues increases in the MQ-SCAN discipline, a
greater number of small jobs can be considered for allocation. However, it also means
that if more queues are used, then large jobs may be treated unfairly. Therefore,
the L/ S Ratio of MQ-SCAN increases signiﬁcantly, as presented in the third ﬁgure of
Figure 8.2. In contrast, as the number of the LQs increases, the system fragmentation
of the HELM algorithm remains steady. Because the HELM algorithm considers all
jobs in LQs for scheduling, a small number of queues enables the HELM scheduler to
allocate a job if an allocation exists. For the remainder of the simulation study, the
number of queues for MQ-SCAN is ﬁxed to 5 and the number of LQs for HELM is
ﬁxed to 2.

8.2.2 Coefﬁcient of UBWT

The ﬁrst ﬁgure in Figure 8.3 shows TAT as a function of the coefﬁcient of UBWT.
UBWT is computed by multiplying the coefﬁcient with the average waiting time of the
completed jobs. The y-axis represents TAT, and the x-axis represents the coefﬁcient.
In the simulation, the number of lookahead queues is set to 2 and the size of lookahead
window is set to one-half of the average length of EQ. The coefﬁcient varies from 1 to
8. As shown in the ﬁgure, TAT changes signiﬁcantly when the coefﬁcient varies from
1 to 4. After 4, however, the coefﬁcient does not have much effect on the performance

of the system. If the coefﬁcient is 1, most jobs in the LQs are moved to HQ and the

121

HELM discipline behaves as the F CFS discipline. Therefore it shows high system
fragmentation and low L/ S ratio. As the coefﬁcient increases from 1 to 2, the role of
the LQs increases. The coefﬁcient 2 ensures that on average the job waiting time of
the jobs in HQ is not much larger than two times of the average job waiting time. If
the coefﬁcient is larger than 2, the UBWT becomes too large and most jobs remain
in LQs without going to HQ. For the remainder of the simulation study, the value of

this coefﬁcient is ﬁxed to 2.

8.2.3 Coefﬁcient for the Size of Lookahead Window

Figure 8.4 shows the effect of the coefﬁcient of the size of lookahead window (LW)
on the job turnaround time. When the current job in EQ cannot be allocated, the
HELM scheduler moves the job to one of the LQs. To adapt to the change of the
system workload, the size of LW is not ﬁxed to a constant. Rather, the value could
change depending on the number of jobs waiting in the EQ. For the simulation, the
size of the LW is computed by multiplying the average queue length of EQ with the
coefﬁcient. In the ﬁgure, the coefﬁcient varies from 0.1 to 1. As the rate increases
to 0.5, the system performance is improved. The HELM scheduler with a very small
lookahead window behaves as a SQ-FCF S. As the window size increases, the role of
LQs become more active, but the L / S Ratio increases. If the window size is too large,
the searching time in LQs increases, which decreases the system performance. For

the remainder of the simulation, this coefﬁcient is ﬁxed to 0.5.

8.3 Simulation Results and Comparison

We compare the HELM discipline to two other job scheduling disciplines: SQ-FCFS
and MQ-SCAN. The L/ S Ratio of the SQ-FCFS algorithm is used as a standard of
fairness. MQ-SCAN is the job scheduling algorithm using Multiple Queues in which

122

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

400 I I I I I I
HELM +—
350 — *
TAT 300 r a
250 ~ ,.— * —
200 l m 1 1 l
l 2 3 4 5 6 7 8
Coefﬁcient of UBWT
0.4 I f I I I I
HELM +—
0.38 r _
i
0.36 _
FRAG
0.34 - _
0.32 ~ _
03 l l L L l l
l 2 3 4 5 6 7 8
Coefﬁcient of UBWT
5 I I I I I I
HELM +—
4 _ a
L/S
Ratio
2 A ‘Mr
1
1 - ._
0 l l J I l
l 2 3 4 5 6 7 8

Coefﬁcient of UBWT

Figure 8.3. The Coefﬁcient for the Upper Bound on Waiting Time.

123

 

450

I

400

350 r
TAT

I

300

250 '-

 

 

4 I

 

 

 

200

0.4 0.6
Coefﬁcient of LW Size

0.8

 

0.4

I

0.38

I

0.36
FRAG

0.34 i\

0.32 -

 

I I I I

HELM +—

 

 

0.3
0.1

0.2

0.3 0.4 0.5 0.6
Coefﬁcient of LW Size

0.8

 

L/S

Ratio ‘
2 r—
{\K/W

 

I I I I

 

l l I l

1

 

 

0.3 0.4 0.5 0.6
Coefﬁcient of LW Size

0.7

Figure 8.4. The Size of the Lookahead Window.

0.8

124

jobs are scanned similarly to the c-scan algorithm for disk scheduling [106]. This
algorithm is a variation of the “scan” algorithm in [15], that is originally designed for
hypercube multicomputers. Like the HELM discipline, the MQ-SCAN discipline also

employs multiple queues.

8.3.1 Job Turnaround Time and System Utilization

Figure 8.5 presents the TAT of the scheduling disciplines for varying system work—
loads, which is controlled by the IATM (mean interarrival time). The TAT of the
SQ-FCF S algorithm increases very quickly at higher loads. Compared with the SQ-
FCFS algorithm, the multiple queue algorithms have much better TAT at higher
loads. The HELM algorithm shows a little higher turnaround time than the MQ-
SCAN algorithm under low loaded conditions, however its performance under highly
loaded conditions is much better than that of MQ-SCAN. The reason why HELM
shows a little higher turnaround time than MQ-SCAN is that it needs to maintain a
steady L/ S Ratio.

Figure 8.6 shows the system utilization of each algorithm. It can be seen that
under various loads the HELM algorithm shows higher system utilization than the
other algorithms. It means that the HELM algorithm schedules jobs more efﬁciently

so that more jobs can be allocated on the system.

8.3.2 External Fragmentation

Figure 8.7 shows how the external fragmentation changes as a function of IATM. In
the SQ-FCFS discipline, 41% of processors are idle no matter what is the system
workload. In contrast, the external fragmentation of MQ-SCAN and HELM decrease
as the workload increases. At very high load, the external fragmentation of MQ-SCAN

and HELM are 36% and 33%, respectively. This tendency of decreasing external

125

 

800 ‘ I I I

I I

SQ-FCFS *—
700 " MQ-SCAN s_ -
HELM +—

600 l-
500 *-
TAT 400 -
300 -

200 -

 

 

 

 

 

IATM

Figure 8.5. Comparison of the Turnaround Time for the Three Disciplines.

 

  
    

0.7 I I I I T
, SQ-FCFS +—
0.68 MQ-SCAN —o— -

HELM *—

UTIL

 

 

 

IATM

Figure 8.6. Comparison of the System Utilization.

126

 

 

 

 

 

 

0.44 ~ _,
0.42 _ _
0.4 - D
FRAG 0.38 - —
0.36c SQ-FCFS +— —
MQ—SCAN -.—
0.34 — HELM +— _
J .
0.32 - _
03 l J I l l
16 18 20 22 24
IATM

Figure 8.7. External Fragmentation.

fragmentation under high system workload means that the non-blocking aspect of
the schemes makes the job scheduling algorithms more adaptive to the change of
system workload. The HELM algorithm shows some external fragmentation under
low loaded conditions. This is because the size of the lookahead window in EQ
dynamically changes depending on the workload. Under low loaded conditions, the
lookahead window has a small size such that the HELM scheduler only “looks-ahead”
to a few jobs. Under highly loaded conditions, the size of lookahead window increases
as the length of EQ increases. Therefore, more jobs move to LQs and the HELM
scheduler allocates as many as possible. The HELM scheme can achieve high system

utilization and low external fragmentation under highly loaded conditions.

8.3.3 Fairness

Figure 8.8 shows L/ S Ratio of the three algorithms. The ratio of SQ-F CF S is near
to 1 because SQ-FCF S does not discriminate on the basis of job size. In contrast,

MQ-SCAN shows high L/S Ratio for many workloads. This is because the MQ-

127

 

 

 

 

 

 

 

 

4 I I I l I
3 5 SQ-FCFS *_ _
' ‘ MQ-SCAN +—
3 - HELM d— -
2.5 —
L/S 2*
Ratio
1.5 —
1
0.5 — _
0 J l 1 I l
16 18 2o 22 24
IATM

Figure 8.8. The L/ S Ratio for the Three Schemes.

SCAN discipline changes the order of jobs to decrease external fragmentation. As
MQ-SCAN, the HELM algorithm rearranges the order of job scheduling due to LQs
under highly loaded conditions to achieve high system utilization. However, the
HELM scheduler always checks its waiting time when a job is passed. If the waiting
time of the job is larger than UBWT, then the scheduler moves the job to HQ so
that the job would not be passed again. Because the value of UBWT dynamically
changes depending on the system workload, HELM is more adaptive to the change of
workload. Therefore, under highly loaded conditions the HELM scheme shows a lower
L/ S Ratio in comparison to the MQ—SCAN. Unlike MQ-SCAN, the HELM discipline
shows the low L/ S Ratio as SQ-SCAN under low loaded conditions. This is a very

important performance characteristic of the HELM discipline.

8.4 Summary

We proposed in this chapter an innovative job scheduling discipline, called HELM,

which adapts its scheduling policy to the changes of workload. HELM achieves low

128

job turnaround time and high system utilization while not inappropriately favoring
small jobs to the detriment of large jobs. HELM manipulates three types of queues:
the entrance queue, the lookahead queue and the high-priority queue. For jobs that
arrive at the entrance queue that cannot be allocated, HELM moves them to the
lookahead queues. The lookahead queues are non-blocking for allocation efﬁciency.
If a job cannot be allocated, HELM searches for other jobs until it ﬁnds a job that
can be allocated. For fairness and adaptiveness to workload, HELM changes the size
of lookahead window of the entrance queue depending on the length of the entrance
queue. Under low loaded conditions, this lookahead window will be small so that
most jobs are scheduled in a FCFS fashion by the entrance queue. As the work-
load increases, the window size increases. Therefore, under highly loaded conditions,
many jobs move to the lookahead queues and HELM can schedule them efﬁciently by
reordering the jobs. The reordering of jobs in the lookahead queues decreases the sys-
tem external fragmentation and increases the system utilization. However, to avoid
the situation where a job waits in the lookahead queues too long, HELM uses a upper
bound of waiting time in the lookahead queues. If a jobs waits longer than the bound,
HELM moves the job to the high-priority queue. The high-priority queue keeps and
schedules the jobs in the order of arrival time one by one. The upper bound of wait-
ing time is a dynamic parameter whose value changes depending on the characteristic
of the system workload. According to our simulation results, the HELM discipline
shows much better performance under highly loaded conditions than the MQ-SCAN
and SQ-FCFS disciplines. HELM schedules fairly as the SQ-FCFS discipline under

low and medium loaded conditions.

CHAPTER 9

CONCLUSIONS AND FUTURE
RESEARCH

Processor management is one of the important services provided by an operating
system for massively parallel computers that serve multiple jobs simultaneously. Re-
search activity on the processor management problem has been divided between those
seeking better system utilization by allocating a job to any set of processors regard-
less of their geometric location (scattered scheme) and those who insist contiguous
allocations in which each subpartition is a contiguous region of the MPC (contiguous
scheme). In this thesis, we studied the processor management problem on a wormhole-
routed 2—D mesh multicomputer for the issues that may be critical in both processor
management schemes: the network contention issue and the system fragmentation
issue. The major objective was to gain insight into the system behavior and to un-
derstand the basic principles underlying the performance of processor management
strategies in the system. In this chapter, we summarize the major contributions made

by this research and present the directions for future research.

129

130
9.1 Summary and Major Contributions

The effects of contention in wormhole-routed networks can cause the characteristics
of one job to affect the performance of another job when a job is allocated to scat-
tered partitions on an MPC. In this thesis, the performance degradation due to job
interactions has been intensively studied. We studied the effects of competing paths
on network contention due to the nature of wormhole routing networks when several
paths are overlapped that have different communication rates. We proposed a general
contention model, called the heterogeneous multipath contention model. The model
is a representation of arbitrarily overlapped communication paths of jobs that have
different message communication rates. Based on the proposed contention model,
we analyzed the performance characteristics of 2D mesh multicomputers under con-
tention due to job interactions. The analysis may help the system designer to gain
insight into system behavior, to understand the basic principles underlying the per-
formance of system strategies, and to compare scattered allocation strategies with
contiguous allocation strategies.

As a starting point for analyzing the interactions and interference that occur be-
tween jobs, we examined interactions and interference between communication paths
that have the same communication rate. We proposed two metrics that can be used
when one wants to measure internal and external contentions between jobs in a mul-
ticomputer. These metrics are called the starting contention level and intermediate
contention level. Based on the metrics, the internal contention delays of a stair-layered
pattern and a complex transpose pattern were predicted and veriﬁed by means of
simulation. According to the results, as the starting contention level increases, the
communication increases for a wide range of communication rates. The amount of
increase in communication delay depends on the rate of communication as well as the

contention delays facing the external paths that contend at the starting point. Nev-

131

ertheless, the detrimental delays due to the intermediate contention level primarily
occurs only at high rates of communication.

We also analyzed the performance degradation due to the sharing of network
resources by multiple independent interacting jobs in a 2-D mesh wormhole-routed
multicomputer system. The analysis is based on a divided-and-conquer strategy,
which derives the communication delay at each contention point on a path. This
strategy reduces a heterogeneous multipath contention model at each contention point
to a heterogeneous 2-path contention model. The computation of the reduced model
distinguishes the starting contention point from the intermediate contention points.
In addition, we analyzed a contention model in which two jobs have complex internal
communication patterns that overlap. These results help us understand the effect
of job interactions on network performance such that we are better prepared for
ﬁnding solutions to the problem of allocating processors in 2-D mesh wormhole-routed
multicomputers.

Based on the performance analysis of network contention, we developed efﬁcient
processor allocations strategies for 2-D mesh wormhole-routed multicomputer sys-
tems. Our examination on the effects of the overlapped competing paths on the
network contention indicated that the effects of intermediate competing paths are
more signiﬁcant than the effects of starting competing paths as the number of com-
peting paths and the communication rate increase. This phenomenon is substantial
in wormhole routing networks due to the nature of the wormhole routing switching
technique. Based on this examination, we investigated the effects of several com-
munication parameters on communication interference between interleaved jobs due
to job partitioning and placement. Our conclusions drawn from simulation results
provide guidelines for placing and partitioning jobs. For example, it is better to have
the highly interactive jobs partitioned and placed at locations that cross the paths

of other less interactive jobs, rather than partitioning the less frequently interacting

132

jobs and placing them at locations such that they suffer from many intermediate con-
tention points. Our results also show that the overall communication delay is very
sensitive to the relative partitioning and placement of jobs. This research offers just
a starting point for more in-depth research to address placing and partitioning jobs.
It emphasizes that careful partitioning and placing jobs in scattered allocation can
reduce the negative performance effect of job interaction.

Another investigation of this thesis is to characterize the effect of job size irreg-
ularity. We examined a dynamic scheduling system that schedules jobs of various
shapes from regular shape to irregular shape. According to the results, job’s regular-
ity in both shape and size can contribute to improved system performance. We found
that the performance is similar when the system schedules jobs that request various
types of irregular-shaped partitions. A large improvement in performance occurs if
all jobs scheduled on the multicomputer request very regular-shaped partitions. This
study gives us some insight into what are the important characteristics to be consid-
ered to design a system partitioning scheme for a processor allocation strategy. We
outlined an approach for restructuring incoming job requests to use partitions from
a multicomputer so that the performance advantage of regular-shaped partitions is
utilized.

Finally, we proposed a new job scheduling strategy that can achieve low job
turnaround time and high system utilization while not inappropriately favoring small
jobs to the detriment of large jobs. Many innovative schemes for allocating jobs to
parallel computing systems have been proposed in order to achieve highly utilized par-
allel computing systems. Since most schemes that have been proposed for allocating
jobs to parallel computing systems have concentrated on approaches for processor al-
location, the schemes have used First-Come—First-Serve (F CFS) as the job scheduling
discipline. However, it has been previously established that job scheduling algorithms

for parallel computing systems can have a large impact on the system utilization and

133

job response time. Schemes that use multiple queues, which reorder the sequence of
jobs allocated to the parallel system, can be very effective in improving the system
performance. However, such non-FCF S schemes have been criticized because they
provide improved average performance by favoring small jobs at the expense of large
jobs. The proposed job scheduling strategy, called the HELM discipline, adapts its
scheduling policy to the changes of workload so that it behaves in a FCF S manner
under low loaded conditions, but exploits performance enhancing features of multiple
queue schemes under highly loaded conditions. According to our simulation results,
the HELM discipline shows low job turnaround time and high system utilization while

not inappropriately favoring small jobs to the detriment of large jobs.

9.2 Future Research

The research presented in this thesis has focused on the network contention and
system fragmentation issues that are fundamental in developing efﬁcient processor

management schemes. Future work relating to this research will include the following.

0 Although the analysis presented in this thesis can be successfully applied to
characterize the communication performance of wormhole routing under job
interactions, it has a limitation that the routing scheme should be determinis-
tic, such as X Y routing. Multicomputer systems can employ adaptive routing
schemes that adapt to dynamic changes of network condition. A system using
an adaptive routing takes advantages of taking another path in the conditions
such as while message trafﬁc is heavy on one path or a faulty node exists on
one path. Moreover, for scattered allocation schemes in which jobs are allo-
cated to processors in any dispersed regions, adaptive routing may be an better
alternative than deterministic routing. Future research will include the study

of examining the performance of scattered processor allocation strategies under

134

adaptive routing scheme. Adaptive routing will have effects on partitioning and

placing a submesh request to small subpartitions in the system.

Virtual channel can be used to reduce network contentions. In the systems
supporting virtual channels, scattered allocation will become more competitive
than in the systems without virtual channels. Therefore, it would be interesting
to extend our analytic formula for the systems that support virtual channels.
This extension causes modiﬁcations in our formula. Consider a 2D mesh mul-
ticomputer that supports virtual channels. Assume that a physical channel is
split into at least two virtual channels. In the system, intermediate contention
delays no longer exist. Each intermediate contention delay in the original for-
mula should be substituted with the delay that takes in sharing a physical
channel by two paths through the supported virtual channels. This delay is a
variable that depends on the system characteristics to support virtual channels
and the message communication rates on the paths. The modiﬁcation for com-
puting the delay at the ﬁrst contention point may be more complicated. Let n
be the number of competing paths at the ﬁrst contention point and m be the
number of virtual channels per each physical channel. If n is smaller than m,
the starting contention delay in the original formula should be substituted with
the delay in sharing a physical channel by the n competing paths through the
supported m virtual channels. Again, the delay is a system-dependent variable.
If n is larger than m, a contention might occur at the ﬁrst contention point.
This delay includes two kinds of delay: the delay in sharing a physical channel
by the n competing paths through m virtual channels and the contention delay
by (n -— m) competing paths. Research issues are how to compute the system-
dependent delay and how to combine it with the contention delay generated by

the (n — m) competing paths.

135

0 External fragmentation in 2—D mesh multicomputers occurs when an scheduled
job fails to be allocated to every available subpartitions in the system. The
failures can be caused by either types of mismatching: size mismatching Or shape
mismatching. Size mismatching occurs when the number of processors in an
available subpartition is smaller than the number of processors required by the
incoming job. External fragmentation due to size mismatching is unavoidable
in general parallel processing environment where the sizes of incoming jobs vary
and processor migration [64] and limit allocation [65] are not supported. Shape
mismatching occurs when an subpartition has sufﬁcient processors, but has a
shape into which the submesh required by the job cannot ﬁt. This type of
mismatching occurs because any shape of submesh can be assigned to the job.
It has been shown in this thesis that the regularity of the shapes of jobs is an
signiﬁcant factor affecting the performance of a resource scheduling algorithm
in a 2-D mesh multicomputers. Another goal for future research is to design a
set of rules for submesh shaping that can be applied in the process of submesh
assignment and to analyze its effects on the performance of processor allocation

strategies.

0 Better processor allocation strategies that can achieve lower job turnaround
time and higher processor utilization may be hybrids between contiguous and
scattered schemes. As the results of the research indicate, the negative effects
of network contention due to job interactions will be increasingly visible in
the future MPCs that can deliver messages as fast as the network capacity.
Therefore, a goal for future research is to develop a hybrid processor allocation
strategy that enjoys the increased processor utilizations as long as the negative

effects of job interactions can be kept under control.

APPENDIX

APPENDIX

The following is given to complete Section 5.3.

Conditional Expectations

Let X be a continuous random variable having an exponential distribution with mean
i and k be a constant. In this subsection we compute the expectation of X upon the
condition that X Z k, i.e. E[X IX 2 k]. The computation of E[X IX < k] can be

solved in a similar way as the computation of E [X IX 2 Is]. To conserve space, we

omit the computation of E [X IX < [C].

To compute the conditional expectation, E [X IX 2 k], we have to solve the con-

ditional probability density function (p.d. f) conditioned by X 2 k.

fX|X2k($) = 1— Pr{X Z :ch 2 k} = 1- Pr{X Z a, _ k}
z\e”\(”"‘), ifa: Z k

0, otherwise

And therefore the conditional expectation, E [X IX 2 k] is

E[XIX 2 k] = ﬂoofo|X2k($) dz

= Amer-A(x_kl = AeAk/ooxe_’\xda: = k+
k

136

137

Expectation of the Remaining Time, Rt

Consider a renewal process whose inter-arrival times, T0, T1, ..., are random variables
having mean u and variance 0'. Although we do not need to specify T,"S distribution,
for convenience we denote F (:c) and f (.73) as the c.d. f and p.d. f. of Ti, respectively.
Let R, denote the time between the last renewal and the next renewal, and 0(3) and

g(x) be the c.d.f. and p.d.f. of Rt, respectively.

From renewal theory [103] we know that

3320 PrIR. < z}=;,1— / 71 - F(y))dy

and therefore g(a:) = (l — F (13)) Using this fact, we can compute the expectation

l
p

of the R, as follows:

E [Rt]

fowzg(z)dz=%/oooz(1—F(z))dz

_ .1. °°2 __1_ 2 2 _ e. 93.
_ 2“/0 zf(:c)dx—2#(U +p) _ 2(1+u2)

BIBLIOGRAPHY

BIBLIOGRAPHY

[1] W. J. Dally, “Performance Analysis of k-ary n-cube Interconnection Networks,”

[2]
[3]

[4]

[5]

[6]

[7]

[8]

[9]

IEEE Transactions on Computers, vol. 39, pp. 775—785, June 1990.
Intel Corporation, Paragon sip/s Product Overview, 1991.

W. J. Dally and C. L. Seitz, “The Torus Routing Chip,” Journal of Distributed
Computing, vol. 1, no. 3, pp. 187—196, 1986.

D. Min and M. W. Mutka, “Effects of Job Size Irregularity on the Dynamic
Resource Scheduling of a 2-D Mesh Multicomputer,” in Proceedings of PARLE
’93, Parallel Architectures and Languages Europe, pp. 476—487, Jun. 1992.

K. Li and K. H. Cheng, “A Two Dimensional Buddy System for Dynamic
Resource Allocation in a Partitionable Mesh Connected System,” in Proceedings
18th ACM Computer Science Conference, pp. 22—28, Feb. 1990.

P.-J. Chuang and N.-F. Tzeng, “An Efﬁcient Submesh Allocation Strategy for
Mesh Computer Systems,” in Proceedings of the 1991 International Conference
on Distributed Computing Systems, IEEE, May 1991.

W. Liu, V. Lo, K. Windisch, and B. Nitzberg, “Non-contiguous Processor Al-
location Algorithms for Distributed Memory Multicomputers,” in Proc. 1994
International Conference on Supercomputing, pp. 227—236, Jun 1994.

D. Min and M. W. Mutka, “A Multipath Contention Model for Analyzing Job
Interactions in 2-D Mesh Multicomputers,” in Proceedings of the 1994 Interna-
tional Parallel Processing Symposium, IEEE, Apr. 1994.

D. Min and M. W. Mutka, “Determining External Contention Delay Due to Job
Interactions in a 2-D Mesh Wormhole Routed Multicomputers,” in Proceedings
of the 1993 Symposium on Parallel and Distributed Processing, IEEE, Dec. 1993.

138

[10]

[11]

[12]

[13]

[14]

[15]

[15]

[17]

[18]

[19]

[20]

[21]

139

D. Min and M. W. Mutka, “Effects of Job Interactions Due to Job Partitioning
and Placement,” Tech. Rep. CPS-95-1, Computer Science at Michigan State
University, 1995.

M. Chen and K. Shin, “Processor Allocation in an N -Cube Multiprocessor Using
Gray Code,” IEEE Transactions on Computers, vol. C-36, pp. 1396—1407, Dec.
1987.

J. Kim, C. R. Das, and W. Lin, “A Top-Down Processor Allocation Scheme
for Hypercube Computers,” IEEE Transactions on Parallel and Distributed
Systems, vol. 2, pp. 20-30, Jan. 1991.

P.-J. Chuang and N.-F. Tzeng, “A Fast Recognition-Complete Processor Allo-
cation Strategy for Hypercube Computers,” IEEE Transactions on Computers,
vol. 44, pp. 467—479, Apr. 1992.

S. Dutt and J. P. Hayes, “Subcube Allocation in Hypercube Computers,” IEEE
Transactions on Computers, vol. 40, pp. 341—352, Mar. 1991.

P. Krueger, T.-H. Lai, and V. A. Radiya, “Processor Allocation vs. Job Schedul-
ing on Hypercube Computers,” in Proceedings of the 1991 International Con-
ference on Distributed Computing Systems, IEEE, May 1991.

W. Crowther, J. Goodhue, E. Starr, R. Thomas, W. Millakem, and T. Black-
adar, “Performance Measurements on a 128-Node Butterﬂy Parallel Processor,”

in International Conference on Parallel Processing, pp. 531-540, IEEE, Aug.
1985.

T. Lovett and S. Thakkar, “The Symmetry Multiprocessor System,” in Inter-
national Conference on Parallel Processing, IEEE, 1988.

Thacker and et al., “Fireﬂy: A Multiprocessor Workstation,” IEEE Transac-
tions on Computers, pp. 909—920, Aug. 1988.

R. Olson, “Parallel Processing in a Message-based Operating System,” IEEE
Software, vol. 2, pp. 39—49, July 1985.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hen-
nessy, “The DASH Prototype: Logic Overhead and Performance,” IEEE Trans-
actions on Parallel and Distributed Systems, vol. 4, pp. 41-61, Jan. 1993.

Kendall Square Research, KSRI Principles of Operation, Revision 5.5, 1991.

140

[22] Intel Corp., Touchstone DELTA System Description, 1991.

[23] W. D. Hillis and L. W. Tucker, “The CM-5 Connection Machine: A Scalable
Supercomputer,” Communication of ACM, vol. 36, pp. 31-40, Nov. 1993.

[24] N. Corp, NCUBE/ten: An Overview. Beaverton, OR, Nov. 1985.

[25] C. L. Seitz and et al., “The Architecture and Programming of the Amtek Se-
ries 2010 Multicomputer,” in The Third Conference on Hypercube Concurrent
Computers and Applications, Volume 1, pp. 33—36, ACM, Jan. 1988.

[26] R. Alverson and et al., “The Tera Computer System,” in Proc. 1990 Interna-
tional Conference on Supercomputing, pp. 1-6, June 1990.

[27] C. B. Stunkel, D. G. Shea, B. Abali, M. M. Dennean, P. H. Hochschild, D. J.
Joseph, B. J. Nathanson, M. Tsao, and P. R. Varker, “Architecture and Imple-

mentation of Vulcan,” in Proceedings of the 8th International Parallel Processing
Symposium, pp. 268—274, IEEE, Apr. 1994.

[28] W. J. Dally, “The J-Machine: System Support for Actors,” in Actors:
Knowledge-Based Concurrent Computing (Hewitt and Agha, Eds.), MIT Press,
1989.

[29] B. Duzett and R. Buck, “An Overview of the nCUBE 3 Supercomputer,”
pp. 458—464, IEEE, Jul. 1992.

[30] The Cray Research, Massively Parallel Processor System: CRAY T3D, 1993.

[31] N. Koike, “N EC Cenju-3: A Multiprocessor-Based Parallel Computer,” in Pro-
ceedings of the 1994 International Parallel Processing Symposium, pp. 396—401,
IEEE, Apr. 1994.

[32] J. Zahorjan and C. McCann, “Processor Scheduling in Shared Memory Multi-
processors,” in Proceedings of the 1990 ACM Conference on Measurement and
Modeling of Computer Systems, pp. 214—225, ACM, May 1990.

[33] S. Majumdar, D. L. Eager, and R. B. Bunt, “Scheduling in Multiprogrammed
Parallel Systems,” in Proceedings of the 1988 Conference on Measurement and
Modeling of Computer Systems, pp. 104—113, ACM, May. 1988.

[34] S. T. Leutenegger and M. K. Vernon, “The Performance of Multiprogrammed
Multiprocessor Scheduling Algorithms,” in Proceedings of the 1990 ACM Con-
ference on Measurement and Modeling of Computer Systems, pp. 226—236,
ACM, May 1990.

[35]

[36]

[37]

[38]

[39]

[40]

[41]

[42]

[43]

[44]

[45]

141

A. Tucker and A. Gupta, “Process Control and Scheduling Issues for Multipro—
grammed Shared-Memory Multiprocessors,” in Proceedings 12th ACM Sympo-
sium on Operating System Principles, ACM, Nov. 1989.

K. Li and K. H. Cheng, “Static Job Scheduling in Partitonable Mesh Connected
Systems,” Journal of Parallel and Distributed Computing, pp. 152—159, Oct.
1990.

H. El-Rewini and T. G. Lewis, “Scheduling Parallel Program Tasks onto
Arbitrary Target Machines,” Journal of Parallel and Distributed Computing,
pp. 138—153, Sep. 1990.

F. Ercal, J. Ramanujam, and P. Sadayappan, “Task Allocation onto a Hyper-
cube by Recursive Mincut Bipartitioning,” Journal of Parallel and Distributed
Computing, pp. 35—44, Oct. 1990.

B. Shirazi and M. Wang, “Analysis and Evaluation of Heuristic Methods
for Static Task Scheduling,” Journal of Parallel and Distributed Computing,
pp. 222—232, Oct. 1990.

Y. Zhu and M. Ahuja, “Job Scheduling on a Hypercube,” in Proceedings of the
1990 Conference on Distributed Computing Systems, pp. 510—517, IEEE, Jun.
1990.

D.-T. Peng and K. G. Shin, “Static Allocation of Periodic Tasks with Prece-
dence Constraints in Distributed Real-Time Systems,” in Proc. of Intl. Conf.
on Distributed Computing Systems, pp. 190—198, IEEE, 1990.

K. Ramamritham, “Allocation and Scheduling of Complex Periodic Tasks,” in
Proc. of 1990 Int. Conference on Distributed Computing Systems, pp. 108—115,
IEEE, 1990.

Q. Wang and K. H. Cheng, “A Heuristic of Scheduling Parallel Tasks and its
Analysis,” SIAM Journal on Computing, vol. 21, pp. 281—294, Apr. 1992.

J. Blazewicz, M. Drabowski, and J. Weglarz, “Scheduling Multiprocessor Tasks
to Minimize Schedule Length,” IEEE Transactions on Computers, vol. 35,
pp. 389-393, May 1986.

C. Coroyer and Z. Liu, “Effectiveness of Heuristics and Simulated Annealing for
the Scheduling of Concurrent Tasks — An Empirical Comparison,” in Proceed-
ings of PARLE ’93, Parallel Architectures and Languages Europe, Jun. 1993.

142

[46] M. R. Garey and R. L. Graham, “Bounds For Multiprocessor Scheduling with
Resource Constraints,” SIAM Journal on Computing, vol. 4, pp. 187—200, Jun.
1975.

[47] J. Du and J. Y.-T. Leung, “Complexity of Scheduling Parallel Task Systems,”
SIAM Journal on Computing, vol. 2, pp. 473—487, Nov. 1989.

[48] S. Y. Lee and J. K. Aggarwal, “A Mapping Strategy for Parallel Processing,”
IEEE Transactions on Computers, pp. 433—442, Apr. 1987.

[49] S. H. Bokhari, “On the Mapping Problem,” IEEE Transactions on Computers,
pp. 207—214, Mar. 1981.

[50] P. Sadayappan and F. Ercal, “Nearest-Neighbor Mapping of Finite Element
Graphs onto Processor Meshes,” IEEE Transactions on Computers, pp. 1408-—
1424, Dec. 1987.

[51] C. T. Ho and S. L. Johnsson, “On the Embedding of Arbitrary Meshes in
Boolean Cubes with Expansion Two Dilation Two,” in International Conference
on Parallel Processing, pp. 188—191, IEEE, 1987.

[52] M. Y. Chan, “Embedding of Grids into Optiaml Hypercubes,” SIAM Journal
on Computing, vol. 20, pp. 834—864, Oct. 1991.

[53] R. Aleliunas and A. L. Rosenberg, “On Embedding Rectangular Grids in Square
Grids,” IEEE Transactions on Computers, pp. 907—913, Sep. 1982.

[54] J. A. Ellis, “Embedding Rectangular Grids into Square Grids,” IEEE Transac-
tions on Computers, pp. 46-52, Jan. 1991.

[55] R. G. Melhem, “Embedding Rectangular Grids into Square Grids with Dilation
Two,” IEEE Transactions on Computers, pp. 1446-1455, Dec. 1990.

[56] Y. E. Ma and L. Tao, “Embeddings Among Toruses and Meshes,” in Interna-
tional Conference on Parallel Processing, pp. 178-187, IEEE, 1987.

[57] K. Efe, “Embedding Mesh of Trees in the Hypercube,” Journal of Parallel and
Distributed Computing, pp. 222—230, 1991.

[58] A. Y. Wu, “Embedding of Tree Networks into Hypercubes,” Journal of Parallel
and Distributed Computing, pp. 238—249, 1985.

[59] R. Varadarajan, “Embedding Shufﬂe Networks in Hypercubes,” Journal of Par-
allel and Distributed Computing, pp. 252—256, 1991.

[60]

[61]

[62]

[63]

[64]

[65]

[66]

[67]

[68]

[69]

[70]

143

K. Knowlton, “A Fast Storage Allocator,” Communication of ACM, vol. 8,
pp. 623—625, Oct. 1965.

D. D. Sharma and D. K. Pradhan, “Fast and Efﬁcient Strategies for Cubic and
Non-cubic Allocation in Hypercube Multiprocessors,” in International Confer-
ence on Parallel Processing, vol. I, pp. 118—127, IEEE, 1993.

H. Wang and Q. Yang, “Prime Cube Graph Approach for Processor Alloca-
tion in Hypercube Multiprocessors,” in International Conference on Parallel

Processing, vol. I, pp. 25—32, IEEE, 1991.

P. Mohapatra, C.Yu, C.R.Das, and J.Kim, “A Lazy Scheduling Scheme for
Improving Hypercube Performance,” in International Conference on Parallel
Processing, vol. I, pp. 110—117, IEEE, 1993.

M.-S. Chen and K. G. Shin, “Subcube Allocation and Task Migration in Hy-
percube Multiprocessors,” IEEE Transactions on Computers, vol. 39, pp. 1146—
1155, Sep. 1990.

C. Yu and C. R. Das, “Limit Allocation: An Efﬁcient Processor Management
Scheme for Hypercubes,” in International Conference on Parallel Processing,
vol. II, pp. 143—150, IEEE, 1994.

K. Li and K.-H. Cheng, “Job Scheduling in a Parallel Mesh Using a Two-
Dimensional Buddy System Partitioning Scheme,” IEEE Transactions on Par-
allel and Distributed Systems, vol. 2, pp. 413—422, Oct. 1991.

J. Ding and L. N. Bhuyan, “An Adaptive Submesh Allocation Strategies for
Two-Dimensional Mesh Connected Systems,” in International Conference on
Parallel Processing, vol. II, pp. 193—200, IEEE, 1993.

Y. Zhu, “Efﬁcient Processor Allocation Strategies for Mesh-Connected Parallel
Computers,” Journal of Parallel and Distributed Computing, pp. 328—337, 1992.

D. D. Sharma and D. K. Pradhan, “A Fast and Efﬁcient Strategy for Submesh
Allocation in Mesh-Connected Parallel Computer,” in Proceedings of the 1993
Symposium on Parallel and Distributed Processing, pp. 682—689, IEEE, Dec.
1993.

D. D. Sharma and D. K. Pradhan, “Job Scheduling in Mesh Multicomputers,”
in International Conference on Parallel Processing, vol. II, pp. 251—258, IEEE,
1994.

[71]

[72]

[73]

[74]

[75]

[76]

[77]

[78]

[79]

[80]

[81]

144

S. Bhattacharya and W.-T. Tsai, “Lookahead Processor Allocation in Mesh-
Connected Massively Parallel Multicomputer,” in Proceedings of the 1994 In-
ternational Parallel Processing Symposium, pp. 868—875, IEEE, Apr. 1994.

M. R. Garey and D. S. Johnson, Computers and Intractability : A Guide to the
Theory of NP-Completeness. Freeman, 1988.

B. Baker, E. Coffman, and R. Rivest, “Orthogonal Packings in Two Dimen-
sions,” SIAM Journal on Computing, vol. 4, pp. 846—855, Nov. 1980.

E. Coffman, J. F. M.R., J. D.S., and R. Tarjan, “Performance Bounds for Level-
oriented Two—dimensional Packing Algorithms,” SIAM Journal on Computing,
vol. 4, pp. 808—826, Nov. 1980.

E. Coffman, J. Leung, J .Y., and D. Slutz, “On the Optinality of First-Fit and
Level Algorithms for Parallel Machine Assignments and Sequencing,” in 1977
International Conference on Parallel Processing, pp. 95—99, IEEE, 1977.

K. Li and K. H. Cheng, “Static Job Scheduling in Partitionable Mesh Connected
Systems,” Journal of Parallel and Distributed Computing, pp. 152—159, Oct.
1990.

J. Turek, J. L. Wolf, and P. S. Yu, “Approximate Algorithms for Parallelizable
Tasks,” in Symposium on Parallel Architecture and Language, pp. 323—332,
IEEE, 1992.

K. Li and K. Cheng, “Complexitiy of Resource Allocation and Job Scheduling
Problems in Partitionable Mesh Connected Systems,” in Proceedings of the Ist
Annual IEEE Symposium on Parallel and Distributed Processing, IEEE, May
1989.

J. Y.~T. Leung, T. Tam, C. Wong, G. Young, and F. Chin, “Packing Squares
into a Squre,” Tech. Rep. UTDCS-12-89, Computer Science Department, Uni-
versity of Texas at Dallas, July 1989.

S. Chittor and R. Enbody, “Hypercube vs. 2d meshes,” in Proceedings SIAM
Fourth Annual Conference on Parallel Processing for Scientiﬁc Computing,

pp. 313—318, SIAM, 1990.

S. Chittor and R. Enbody, “Predicting The Effect of Mapping on The Commu-
nication Performance of Large Multicomputers,” in 1991 International Confer-
ence on Parallel Processing, IEEE, 1991.

145

[82] S. Chittor and R. Enbody, “High-Performance Simulator for Large Wormhole-
Routed Networks,” International Journal in Computer Simulator, pp. 151—174,
1992.

[83] V. S. Adve and M. K. Vernon, “Performance Analysis of Multiprocessor Mesh
Interconnection Networks with Wormhole Routing,” tech. rep., Department of

Computer Science at University of Wisconsin-Madison, June 1992.

[84] S. Abraham, “Performance of Multicomputer Networks under Pin-out Con-
straints,” Journal of Parallel and Distributed Computing, pp. 237—248, 1991.

[85] J. H. Kim and A. A. Chien, “The Impact of Packetization in Wormhole—Routed
Networks,” in Proceedings of PARLE ’93, Parallel Architectures and Languages
Europe, pp. 242—253, Jun. 1993.

[86] W. J. Dally and C. L. Seitz, “Deadlock-Free Message Routing in Multiprocessor
Interconnection Networks,” IEEE Transactions on Computers, pp. 547—553,
May 1987.

[87] W. J. Dally, “Express Cubes: Improving the Performance of k-ary n-cube In-
terconnection Networks,” IEEE Transactions on Computers, vol. 40, pp. 1016-
1023, Sep. 1991.

[88] W. J. Dally, “Virtual-Channel Flow Control,” Journal of Parallel and Dis-
tributed Computing, pp. 194—205, 1992.

[89] L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in
Direct Networks,” IEEE Computer, vol. 26, pp. 62 — 76, Feb. 1993.

[90] C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing,” in Proceed-
ings of the 19th Annual International Symposium on Computer Architecture,

May 1992.

[91] Z. Liu, J. Duato, and L.-E. Thorelli, “Grouping Virtual Channels for Deadlock-
Free Adaptive Wormhole Routing,” in Proceedings of PARLE ’93, Parallel Ar-
chitectures and Languages Europe, pp. 254—265, Jun. 1993.

[92] D. H. Linder and J. C. Harden, “An Adaptive and Fault Tolerant WOrmhole
Routing Strategy for k-ary n-cubes,” IEEE Transactions on Computers, pp. 2—
12, Jan. 1991.

[93]

I94]

[95]

[96]

[97]

[98]

[99]

[100]

[101]

[102]

[103]

[104]

146

K. J. H. and A. A. Chien, “Evaluation of Wormhole Routed Networks Under
Hybrid Trafﬁc Loads,” in Proceedings of the Hawaii International Conference

on System Sciences, Jan 1993.

S. Borkar and et al., “An Integrated Solution to High-Speed Parallel Comput-
ing,” in Proc. of the Supercomputing Conference, pp. 330—339, 1988.

P. Y. Song, “Design of a Network for Concurrent Message Passing Systems,”
Tech. Rep. Master’s Thesis, Department of Electrical Engineering at MIT, May
1988.

S. F. P. Raghavan and E. Upfal, “A Theory of Wormhole Routing in Parallel
Computers,” in Symposium on the Foundations of Computer Science, pp. 563-
572, 1992.

E. A. Brewer and B. C. Kuszmaul, “How to Get Good Performance from the
CM-5 Data Network,” in Proceedings of the 1994 International Parallel Pro-
cessing Sympos ium, pp. 858 — 867, IEEE, Apr. 1994.

P. K. McKinley, “Simulation of Wormhole Routing in Multisim,” in Proc. 23rd
Pittsburg Conference on Modeling and Simulation, Apr. 1992.

H. Schwetman, “CSIM: C-Based, Process-Oriented Simulation Language,”
Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corpora-
tion, 1985.

L. Kleinrock, Queueing Systems, Volume I: Theory. John Wiley and Sons, 1975.

S. Chittor and R. Enbody, “Performance Evaluation of Mesh-connected
Wormhole-routed Networks for Interprocessor Communication in Multicom-

puters,” in Proceedings of Supercomputing Conference, pp. 647—656, Nov. 1990.

D. Min and M. W. Mutka, “A Framework for Predicting Delay Due to Job Inter-
actions in a 2-D Mesh Multicomputer,” in Proceedings of the 1993 International

Parallel Processing Symposium, pp. 350 - 357, IEEE, Apr. 1993.
S. M. Ross, Stochastic Processes. John Wiley and Sons, 1983.

F. A. G. Alexander M. Mood and D. C. Boes, Introduction to the Theory of
Statistics. McGraw Hill, 1974.

APPENDIX

The following is given to complete Section 5.3.

Conditional Expectations

Let X be a continuous random variable having an exponential distribution with mean
i— and k be a constant. In this subsection we compute the expectation of X upon the
condition that X Z k, i.e. E[XlX Z k]. The computation of E[X IX < k] can be
solved in a similar way as the computation of E [X [X Z Is]. To conserve space, we
omit the computation of E [X IX < h].

To compute the conditional expectation, E [X [X Z k], we have to solve the con—

ditional probability density function (p.d. f) conditioned by X Z k.

fX|X2k($) = 1— Pr{X 2 :rlX 2 k} = 1.. p,.{X 2 x _ k}
Ate-”(34), if a: Z k

0, otherwise

And therefore the conditional expectation, E [X [X Z k] is

E[XIX _>_ k] = Aw$fX]x2k($)d$

oo oo 1
= / rAe-Mf'kl = AeAk/ re_”d:c = k+—
k k A

136

137

Expectation of the Remaining Time, Rt

Consider a renewal process whose inter-arrival times, T 0, T1, ..., are random variables
having mean u and variance 0. Although we do not need to specify T,’s distribution,
for convenience we denote F (m) and f (x) as the c.d. f and p.d. f of T,-, respectively.
Let R, denote the time between the last renewal and the next renewal, and C(z) and

g(a:) be the c.d.f. and p.d.f. of Rt, respectively.

From renewal theory [103] we know that
tlim Pr{Rt < 9:} = % /$(1 — F(y)) dy

and therefore g(a:) = -‘1;(1 — F (x)) Using this fact, we can compute the expectation

of the R, as follows:

E [Rt]

Ang(z)dz=%/0°oz(1—F(x))dx

_ _1_ °° 2 __1_ 2 2 _. 1‘. if
— 2,]0 zf(2)dx—,,,(a +11) — 20+,»

BIBLIOGRAPHY

BIBLIOGRAPHY

[1] W. J. Dally, “Performance Analysis of k-ary n—cube Interconnection Networks,”

[2]
[3]

[4]

[5]

[6]

[7]

[8]

[9]

IEEE Transactions on Computers, vol. 39, pp. 775—785, June 1990.
Intel Corporation, Paragon asp/s Product Overview, 1991.

W. J. Dally and C. L. Seitz, “The Torus Routing Chip,” Journal of Distributed
Computing, vol. 1, no. 3, pp. 187—196, 1986.

D. Min and M. W. Mutka, “Effects of Job Size Irregularity on the Dynamic
Resource Scheduling of a 2-D Mesh Multicomputer,” in Proceedings of PARLE
’93, Parallel Architectures and Languages Europe, pp. 476—487, Jun. 1992.

K. Li and K. H. Cheng, “A Two Dimensional Buddy System for Dynamic
Resource Allocation in a Partitionable Mesh Connected System,” in Proceedings

18th ACM Computer Science Conference, pp. 22—28, Feb. 1990.

P.-J. Chuang and N .-F. Tzeng, “An Efﬁcient Submesh Allocation Strategy for
Mesh Computer Systems,” in Proceedings of the 1991 International Conference
on Distributed Computing Systems, IEEE, May 1991.

W. Liu, V. Lo, K. Windisch, and B. Nitzberg, “Non-contiguous Processor Al-
location Algorithms for Distributed Memory Multicomputers,” in Proc. 1994
International Conference on Supercomputing, pp. 227-236, Jun 1994.

D. Min and M. W. Mutka, “A Multipath Contention Model for Analyzing Job
Interactions in 2-D Mesh Multicomputers,” in Proceedings of the 1994 Interna-
tional Parallel Processing Symposium, IEEE, Apr. 1994.

D. Min and M. W. Mutka, “Determining External Contention Delay Due to Job
Interactions in a 2-D Mesh Wormhole Routed Multicomputers,” in Proceedings
of the 1993 Symposium on Parallel and Distributed Processing, IEEE, Dec. 1993.

138

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

139

D. Min and M. W. Mutka, “Effects of Job Interactions Due to Job Partitioning
and Placement,” Tech. Rep. CPS-95-1, Computer Science at Michigan State
University, 1995.

M. Chen and K. Shin, “Processor Allocation in an N-Cube Multiprocessor Using
Gray Code,” IEEE Transactions on Computers, vol. C—36, pp. 1396—1407, Dec.
1987.

J. Kim, C. R. Das, and W. Lin, “A Top-Down Processor Allocation Scheme
for Hypercube Computers,” IEEE Transactions on Parallel and Distributed
Systems, vol. 2, pp. 20—30, Jan. 1991.

P.-J. Chuang and N.-F. Tzeng, “A Fast Recognition-Complete Processor Allo-
cation Strategy for Hypercube Computers,” IEEE Transactions on Computers,
vol. 44, pp. 467—479, Apr. 1992.

S. Dutt and J. P. Hayes, “Subcube Allocation in Hypercube Computers,” IEEE
Transactions on Computers, vol. 40, pp. 341—352, Mar. 1991.

P. Krueger, T.-H. Lai, and V. A. Radiya, “Processor Allocation vs. Job Schedul-
ing on Hypercube Computers,” in Proceedings of the 1991 International Con-
ference on Distributed Computing Systems, IEEE, May 1991.

W. Crowther, J. Goodhue, E. Starr, R. Thomas, W. Millakem, and T. Black-
adar, “Performance Measurements on a 128-Node Butterﬂy Parallel Processor,”
in International Conference on Parallel Processing, pp. 531—540, IEEE, Aug.
1985.

T. Lovett and S. Thakkar, “The Symmetry Multiprocessor System,” in Inter-
national Conference on Parallel Processing, IEEE, 1988.

Thacker and et al., “Fireﬂy: A Multiprocessor Workstation,” IEEE Transac-
tions on Computers, pp. 909—920, Aug. 1988.

R. Olson, “Parallel Processing in a Message-based Operating System,” IEEE
Software, vol. 2, pp. 39—49, July 1985.

D. Lenoski, J. Landon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hen-
nessy, “The DASH Prototype: Logic Overhead and Performance,” IEEE Trans-
actions on Parallel and Distributed Systems, vol. 4, pp. 41—61, Jan. 1993.

Kendall Square Research, KSRI Principles of Operation, Revision 5.5, 1991.

[22]
[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]
I31]

[32]

[33]

[34]

140

Intel Corp., Touchstone DELTA System Description, 1991.

W. D. Hillis and L. W. Tucker, “The CM-5 Connection Machine: A Scalable
Supercomputer,” Communication of ACM, vol. 36, pp. 31—40, Nov. 1993.

N. Corp, NCUBE/ten: An Overview. Beaverton, OR, Nov. 1985.

C. L. Seitz and et al., “The Architecture and Programming of the Amtek Se-
ries 2010 Multicomputer,” in The Third Conference on Hypercube Concurrent
Computers and Applications, Volume 1, pp. 33-36, ACM, Jan. 1988.

R. Alverson and et al., “The Tera Computer System,” in Proc. 1990 Interna-
tional Conference on Supercomputing, pp. 1—6, June 1990.

C. B. Stunkel, D. G. Shea, B. Abali, M. M. Dennean, P. H. Hochschild, D. J.
Joseph, B. J. N athanson, M. Tsao, and P. R. Varker, “Architecture and Imple-

mentation of Vulcan,” in Proceedings of the 8th International Parallel Processing
Symposium, pp. 268-274, IEEE, Apr. 1994.

W. J. Dally, “The J-Machine: System Support for Actors,” in Actors:
Knowledge-Based Concurrent Computing (Hewitt and Agha, Eds.), MIT Press,
1989.

B. Duzett and R. Buck, “An Overview of the nCUBE 3 Supercomputer,”
pp. 458—464, IEEE, Jul. 1992.

The Cray Research, Massively Parallel Processor System: CRAY T3D, 1993.

N. Koike, “NEC Cenju—3: A Multiprocessor-Based Parallel Computer,” in Pro-
ceedings of the 1994 International Parallel Processing Symposium, pp. 396-401,
IEEE, Apr. 1994.

J. Zahorjan and C. McCann, “Processor Scheduling in Shared Memory Multi-
processors,” in Proceedings of the 1990 ACM Conference on Measurement and
Modeling of Computer Systems, pp. 214—225, ACM, May 1990.

S. Majumdar, D. L. Eager, and R. B. Bunt, “Scheduling in Multiprogrammed
Parallel Systems,” in Proceedings of the 1988 Conference on Measurement and
Modeling of Computer Systems, pp. 104—113, ACM, May. 1988.

S. T. Leutenegger and M. K. Vernon, “The Performance of Multiprogrammed
Multiprocessor Scheduling Algorithms,” in Proceedings of the 1990 ACM Con-

ference on Measurement and Modeling of Computer Systems, pp. 226—236,
ACM, May 1990. I

141

[35] A. Tucker and A. Gupta, “Process Control and Scheduling Issues for Multipro—
grammed Shared-Memory Multiprocessors,” in Proceedings 12th ACM Sympo-
sium on Operating System Principles, ACM, Nov. 1989.

[36] K. Li and K. H. Cheng, “Static Job Scheduling in Partitonable Mesh Connected
Systems,” Journal of Parallel and Distributed Computing, pp. 152—159, Oct.
1990.

[37] H. El-Rewini and T. G. Lewis, “Scheduling Parallel Program Tasks onto
Arbitrary Target Machines,” Journal of Parallel and Distributed Computing,
pp. 138—153, Sep. 1990.

[38] F. Ercal, J. Ramanujam, and P. Sadayappan, “Task Allocation onto a Hyper-
cube by Recursive Mincut Bipartitioning,” Journal of Parallel and Distributed
Computing, pp. 35—44, Oct. 1990.

[39] B. Shirazi and M. Wang, “Analysis and Evaluation of Heuristic Methods
for Static Task Scheduling,” Journal of Parallel and Distributed Computing,
pp. 222—232, Oct. 1990.

[40] Y. Zhu and M. Ahuja, “Job Scheduling on a Hypercube,” in Proceedings of the
1990 Conference on Distributed Computing Systems, pp. 510—517, IEEE, Jun.
1990.

[41] D.-T. Peng and K. G. Shin, “Static Allocation of Periodic Tasks with Prece-
dence Constraints in Distributed Real-Time Systems,” in Proc. of Intl. Conf.
on Distributed Computing Systems, pp. 190—198, IEEE, 1990.

[42] K. Ramamritham, “Allocation and Scheduling of Complex Periodic Tasks,” in
Proc. of 1990 Int. Conference on Distributed Computing Systems, pp. 108—115,
IEEE, 1990.

[43] Q. Wang and K. H. Cheng, “A Heuristic of Scheduling Parallel Tasks and its
Analysis,” SIAM Journal on Computing, vol. 21, pp. 281-294, Apr. 1992.

[44] J. Blazewicz, M. Drabowski, and J. Weglarz, “Scheduling Multiprocessor Tasks
to Minimize Schedule Length,” IEEE Transactions on Computers, vol. 35,
pp. 389—393, May 1986.

[45] C. Coroyer and Z. Liu, “Effectiveness of Heuristics and Simulated Annealing for
the Scheduling of Concurrent Tasks - An Empirical Comparison,” in Proceed-

ings of PARLE ’93, Parallel Architectures and Languages Europe, Jun. 1993.

142

[46] M. R. Garey and R. L. Graham, “Bounds For Multiprocessor Scheduling with
Resource Constraints,” SIAM Journal on Computing, vol. 4, pp. 187—200, Jun.
1975.

[47] J. Du and J. Y.-T. Leung, “Complexity of Scheduling Parallel Task Systems,”
SIAM Journal on Computing, vol. 2, pp. 473-487, Nov. 1989.

[48] S. Y. Lee and J. K. Aggarwal, “A Mapping Strategy for Parallel Processing,”
IEEE Transactions on Computers, pp. 433—442, Apr. 1987.

[49] S. H. Bokhari, “On the Mapping Problem,” IEEE Transactions on Computers,
pp. 207—214, Mar. 1981.

[50] P. Sadayappan and F. Ercal, “Nearest-Neighbor Mapping of Finite Element
Graphs onto Processor Meshes,” IEEE Transactions on Computers, pp. 1408—
1424, Dec. 1987.

[51] C. T. Ho and S. L. Johnsson, “On the Embedding of Arbitrary Meshes in
Boolean Cubes with Expansion Two Dilation Two,” in International Conference
on Parallel Processing, pp. 188—191, IEEE, 1987.

[52] M. Y. Chan, “Embedding of Grids into Optiaml Hypercubes,” SIAM Journal
on Computing, vol. 20, pp. 834—864, Oct. 1991.

[53] R. Aleliunas and A. L. Rosenberg, “On Embedding Rectangular Grids in Square
Grids,” IEEE Transactions on Computers, pp. 907—913, Sep. 1982.

[54] J. A. Ellis, “Embedding Rectangular Grids into Square Grids,” IEEE Transac-
tions on Computers, pp. 46—52, Jan. 1991.

[55] R. G. Melhem, “Embedding Rectangular Grids into Square Grids with Dilation
Two,” IEEE Transactions on Computers, pp. 1446—1455, Dec. 1990.

[56] Y. E. Ma and L. Tao, “Embeddings Among Toruses and Meshes,” in Interna-
tional Conference on Parallel Processing, pp. 178—187, IEEE, 1987.

[57] K. Efe, “Embedding Mesh of Trees in the Hypercube,” Journal of Parallel and
Distributed Computing, pp. 222—230, 1991.

[58] A. Y. Wu, “Embedding of Tree Networks into Hypercubes,” Journal of Parallel
and Distributed Computing, pp. 238—249, 1985.

[59] R. Varadarajan, “Embedding Shuffle Networks in Hypercubes,” Journal of Par-
allel and Distributed Computing, pp. 252—256, 1991.

[60]

[61]

[62]

[63]

[64]

[65]

[66]

[67]

[68]

[69]

[70]

143

K. Knowlton, “A Fast Storage Allocator,” Communication of ACM, vol. 8,
pp. 623—625, Oct. 1965.

D. D. Sharma and D. K. Pradhan, “Fast and Eﬂicient Strategies for Cubic and
Non-cubic Allocation in Hypercube Multiprocessors,” in International Confer-
ence on Parallel Processing, vol. I, pp. 118—127, IEEE, 1993.

H. Wang and Q. Yang, “Prime Cube Graph Approach for Processor Alloca-
tion in Hypercube Multiprocessors,” in International Conference on Parallel
Processing, vol. I, pp. 25—32, IEEE, 1991.

P. Mohapatra, C.Yu, C.R.Das, and J.Kim, “A Lazy Scheduling Scheme for
Improving Hypercube Performance,” in International Conference on Parallel
Processing, vol. I, pp. 110—117, IEEE, 1993.

M.-S. Chen and K. G. Shin, “Subcube Allocation and Task Migration in Hy-
percube Multiprocessors,” IEEE Transactions on Computers, vol. 39, pp. 1146—
1155, Sep. 1990.

C. Yu and C. R. Das, “Limit Allocation: An Efﬁcient Processor Management
Scheme for Hypercubes,” in International Conference on Parallel Processing,
vol. II, pp. 143—150, IEEE, 1994.

K. Li and K.-H. Cheng, “Job Scheduling in a Parallel Mesh Using a Two-
Dimensional Buddy System Partitioning Scheme,” IEEE Transactions on Par-
allel and Distributed Systems, vol. 2, pp. 413—422, Oct. 1991.

J. Ding and L. N. Bhuyan, “An Adaptive Submesh Allocation Strategies for

Two-Dimensional Mesh Connected Systems,” in International Conference on
Parallel Processing, vol. II, pp. 193—200, IEEE, 1993.

Y. Zhu, “Efﬁcient Processor Allocation Strategies for Mesh-Connected Parallel
Computers,” Journal of Parallel and Distributed Computing, pp. 328—337, 1992.

D. D. Sharma and D. K. Pradhan, “A Fast and Efﬁcient Strategy for Submesh
Allocation in Mesh-Connected Parallel Computer,” in Proceedings of the 1993
Symposium on Parallel and Distributed Processing, pp. 682—689, IEEE, Dec.
1993.

D. D. Sharma and D. K. Pradhan, “Job Scheduling in Mesh Multicomputers,”
in International Conference on Parallel Processing, vol. II, pp. 251—258, IEEE,
1994.

[71]

[72]

[73]

[74]

[75]

[75]

[77]

[78]

[79]

[80]

[81]

144

S. Bhattacharya and W.-T. Tsai, “Lookahead Processor Allocation in Mesh-
Connected Massively Parallel Multicomputer,” in Proceedings of the 1994 In-
ternational Parallel Processing Symposium, pp. 868—875, IEEE, Apr. 1994.

M. R. Garey and D. S. Johnson, Computers and Intractability : A Guide to the
Theory of NP-Completeness. Freeman, 1988.

B. Baker, E. Coffman, and R. Rivest, “Orthogonal Packings in Two Dimen-
sions,” SIAM Journal on Computing, vol. 4, pp. 846—855, Nov. 1980.

E. Coffman, J. F. M.R., J. D.S., and R. Tarjan, “Performance Bounds for Level-
oriented Two-dimensional Packing Algorithms,” SIAM Journal on Computing,

vol. 4, pp. 808—826, Nov. 1980.

E. Coffman, J. Leung, J .Y., and D. Slutz, “On the Optinality of First-Fit and
Level Algorithms for Parallel Machine Assignments and Sequencing,” in 1977
International Conference on Parallel Processing, pp. 95—99, IEEE, 1977.

K. Li and K. H. Cheng, “Static Job Scheduling in Partitionable Mesh Connected
Systems,” Journal of Parallel and Distributed Computing, pp. 152—159, Oct.
1990.

J. Turek, J. L. Wolf, and P. S. Yu, “Approximate Algorithms for Parallelizable
Tasks,” in Symposium on Parallel Architecture and Language, pp. 323-332,
IEEE, 1992.

K. Li and K. Cheng, “Complexitiy of Resource Allocation and Job Scheduling
Problems in Partitionable Mesh Connected Systems,” in Proceedings of the Ist
Annual IEEE Symposium on Parallel and Distributed Processing, IEEE, May
1989.

J. Y.-T. Leung, T. Tam, C. Wong, G. Young, and F. Chin, “Packing Squares
into a Squre,” Tech. Rep. UTDCS-12-89, Computer Science Department, Uni-
versity of Texas at Dallas, July 1989.

S. Chittor and R. Enbody, “Hypercube vs. 2d meshes,” in Proceedings SIAM
Fourth Annual Conference on Parallel Processing for Scientiﬁc Computing,
pp. 313—318, SIAM, 1990.

S. Chittor and R. Enbody, “Predicting The Effect of Mapping on The Commu—
nication Performance of Large Multicomputers,” in 1991 International Confer-
ence on Parallel Processing, IEEE, 1991.

145

[82] S. Chittor and R. Enbody, “High-Performance Simulator for Large Wormhole—

[83]

[84]

[85]

[86]

[87]

[88]

[89]

[90]

[91]

[92]

Routed Networks,” International Journal in Computer Simulator, pp. 151—174,
1992.

V. S. Adve and M. K. Vernon, “Performance Analysis of Multiprocessor Mesh
Interconnection Networks with Wormhole Routing,” tech. rep., Department of

Computer Science at University of Wisconsin-Madison, June 1992.

S. Abraham, “Performance of Multicomputer Networks under Pin-out Con-
straints,” Journal of Parallel and Distributed Computing, pp. 237-248, 1991.

J. H. Kim and A. A. Chien, “The Impact of Packetization in Wormhole-Routed
Networks,” in Proceedings of PARLE ’93, Parallel Architectures and Languages
Europe, pp. 242—253, Jun. 1993.

W. J. Dally and C. L. Seitz, “Deadlock-Free Message Routing in Multiprocessor
Interconnection Networks,” IEEE Transactions on Computers, pp. 547—553,
May 1987.

W. J. Dally, “Express Cubes: Improving the Performance of k-ary n-cube In-
terconnection Networks,” IEEE Transactions on Computers, vol. 40, pp. 1016-
1023, Sep. 1991.

W. J. Dally, “Virtual-Channel Flow Control,” Journal of Parallel and Dis-
tributed Computing, pp. 194-205, 1992.

L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in
Direct Networks,” IEEE Computer, vol. 26, pp. 62 — 76, Feb. 1993.

C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing,” in Proceed-
ings of the 19th Annual International Symposium on Computer Architecture,
May 1992.

Z. Liu, J. Duato, and L.-E. Thorelli, “Grouping Virtual Channels for Deadlock-
Free Adaptive Wormhole Routing,” in Proceedings of PARLE ’93, Parallel Ar-
chitectures and Languages Europe, pp. 254—265, Jun. 1993.

D. H. Linder and J. C. Harden, “An Adaptive and Fault Tolerant Wormhole
Routing Strategy for k-ary n-cubes,” IEEE Transactions on Computers, pp. 2—
12, Jan. 1991.

146

[93] K. J. H. and A. A. Chien, “Evaluation of Wormhole Routed Networks Under
Hybrid Trafﬁc Loads,” in Proceedings of the Hawaii International Conference
on System Sciences, Jan 1993.

[94] S. Borkar and et al., “An Integrated Solution to High-Speed Parallel Comput-
ing,” in Proc. of the Supercomputing Conference, pp. 330—339, 1988.

[95] P. Y. Song, “Design of a Network for Concurrent Message Passing Systems,”
Tech. Rep. Master’s Thesis, Department of Electrical Engineering at MIT, May
1988.

[96] S. F. P. Raghavan and E. Upfal, “A Theory of Wormhole Routing in Parallel
Computers,” in Symposium on the Foundations of Computer Science, pp. 563—

572, 1992.

[97] E. A. Brewer and B. C. Kuszmaul, “How to Get Good Performance from the
CM-5 Data Network,” in Proceedings of the 1994 International Parallel Pro-
cessing Sympos ium, pp. 858 — 867, IEEE, Apr. 1994.

[98] P. K. McKinley, “Simulation of Wormhole Routing in Multisim,” in Proc. 23rd
Pittsburg Conference on Modeling and Simulation, Apr. 1992.

[99] H. Schwetman, “CSIM: C-Based, Process-Oriented Simulation Language,”
Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corpora-
tion, 1985.

[100] L. Kleinrock, Queueing Systems, Volume I: Theory. John Wiley and Sons, 1975.

[101] S. Chittor and R. Enbody, “Performance Evaluation of Mesh-connected
Wormhole-routed Networks for Interprocessor Communication in Multicom-

puters,” in Proceedings of Supercomputing Conference, pp. 647—656, Nov. 1990.

[102] D. Min and M. W. Mutka, “A Framework for Predicting Delay Due to Job Inter-
actions in a 2-D Mesh Multicomputer,” in Proceedings of the 1993 International
Parallel Processing Symposium, pp. 350 - 357, IEEE, Apr. 1993.

[103] S. M. Ross, Stochastic Processes. John Wiley and Sons, 1983.

[104] F. A. G. Alexander M. Mood and D. C. Boes, Introduction to the Theory of
Statistics. McGraw Hill, 1974.

147

[105] P. K. McKinley and C. Trefftz, “MultiSim: A Tool for the Study of Large-scale
Multiprocessors,” in Proc. 1993 International Workshop on Modeling, Analy-
sis and Simulation of Computer and Telecommunication Networks (MASCOTS
93), pp. 57—62, Jan 1993.

[106] A. Silberschatz, J. Peterson, and P. Galvin, Operating System Concepts. Addi-
son Wesley, 3rd ed., 1991.