This is to certify that the dissertation entitled PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS presented by Yung-Kang Chu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS

By Yung-Kang Chu

A DISSERTATION submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Electrical Engineering, 1995

ABSTRACT

PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS

By Yung-Kang Chu

Mesh-connected parallel architectures have become increasingly popular in the design of multiprocessor systems in recent years. Many partitionable two-dimensional mesh systems have been built or are currently being developed. To allow the best usage of these systems, effective submesh allocation and job scheduling schemes are desirable. Continuous allocations and deallocations of various sized submeshes result in a fragmented mesh system. The goal of this research is to reduce the system fragmentation and the mean job response time. We have developed efficient and effective submesh allocation and scheduling schemes for two-dimensional mesh systems. A simulation tool that we developed is used to evaluate the performance of existing schemes as well as our proposed schemes. The tool provides a flexible graphical user interface for performing and analyzing simulations. The new schemes deliver better performance than the best existing schemes.

© Copyright 1995 by Yung-Kang Chu. All Rights Reserved.

To my parents

ACKNOWLEDGMENTS

I would like to express my appreciation and acknowledge the contributions of the following persons. This dissertation could not have been completed without their participation and help. First, I would like to thank Dr. Diane T. Rover, my major advisor. I have benefited from her involvement throughout my doctoral work. Her very positive influence on my growth as a professional and as a person is invaluable. I could not thank her enough for her guidance. I would also like to thank the other members of my doctoral program guidance committee: Dr.
David Fisher, Dr. Lionel Ni, and Dr. David Yen, for their gener- ous comments and participation. They have helped to improve the quality of this dissertation both in its technical content and organization. I am indebted to Dr. I—Ling Yen for her efforts during the early phases of this research. Her insightful discussion with me helped me to motivate this work. Finally, thanks go to my family, friends, and colleagues. Their love and support accompanied me through the years of this work. TABLE OF CONTENTS LIST OF TABLES viii LIST OF FIGURES ix 1 Introduction 1 1.1 Problem Statement .............................. 2 1.2 Motivation and Research Goal ........................ 3 1.3 Organization ................................. 5 2 The Problem and Its Background 6 2.1 Job Scheduling and Processor Allocation Problems in Multiprocessor Sys- tems .................................... 8 2.2 The 2D Mesh Architecture .......................... 10 2.3 Problem Description ............................. 13 2.4 Submesh Allocation Strategies ........................ 13 2.4.1 Two-Dimensional Buddy System ..................... 14 2.4.2 Frame Sliding Strategy ........................... 15 2.4.3 First Fit Strategy .............................. 16 2.4.4 Best Fit Strategy .............................. 18 2.4.5 Adaptive Scan Strategy .......................... 19 2.4.6 Busy List Strategy ............................. 20 2.4.7 Lookahead Strategies ............................ 21 2.4.8 Non-contiguous Processor Allocation Algorithms ............ 24 2.5 Dynamic Job Scheduling Schemes ...................... 25 2.5.1 Scan Scheduling Schemes for Hypercubes ................. 25 2.5.2 Lazy Scheduling Scheme for Hypercubes ................. 26 2.5.3 Scan Scheduling Scheme for 2D Meshes .................. 26 2.5.4 Priority Scheduling Scheme for 2D Meshes ................ 27 2.6 Performance Evaluation Tools for Processor Allocation and Scheduling . 29 3 Performance Evaluation Methodology and Tool 32 3.1 Simulation Description ............................ 33 3.2 Workload Characterization .......................... 34 3.3 Performance Metrics ............................. 36 3.4 Simulation and Experiment Analysis Tool ................. 39 vi vii 4 Cost-Performance Assessment of Schemes 48 4.1 Processor Allocation Strategies ....................... 49 4.1.1 Performance Analysis ........................... 49 4.1.2 Complexity Analysis ............................ 52 4.2 Scheduling Schemes .............................. 57 5 Job Scheduling Schemes 58 5.1 Immediate Fit Strategy ........................... 59 5.2 Scan All Strategy ............................... 61 5.3 Performance Results ............................. 61 5.4 Discussion ................................... 72 6 Job Scheduling Using Multiple Queues 74 6.1 Job Scheduling Schemes ........................... 75 6.2 Multiple Queues Scheme ........................... 77 6.3 Performance Results ............................. 81 6.3.1 Performance of Variations of Multiple Queues Scheme ......... 81 6.3.2 Comparing Multiple Queues Scheme with Other Schemes ........ 87 6.4 Discussion ................................... 92 7 Estimated Execution Time Allocation Strategies 95 7.1 Basic Idea and Heuristics .......................... 96 7.2 Detailed Algorithms ............................. 98 7.3 Performance Results ............................. 101 7.4 Discussion ................................... 
107 8 Busy Distance Inverse Allocation Strategy 110 8.1 Basic Ideas .................................. 111 8.2 Best-Fit Heuristic ............................... 114 8.3 Candidate Submesh Generation ....................... 116 8.4 Detailed Algorithm .............................. 117 8.5 Performance Results ............................. 118 8.6 Discussion ................................... 123 9 Conclusions 124 9.1 Contributions ................................. 125 9.2 Future Work ................................. 127 BIBLIOGRAPHY 128 3.1 3.2 4.1 4.2 LIST OF TABLES Submesh allocation strategies supported by the simulator. ........ 40 Job scheduling schemes supported by the simulator ............. 41 Limitations of existing submesh allocation strategies. ........... 53 New submesh allocation strategies. ..................... 56 viii 2.1 3.1 3.2 3.3 3.4 3.5 3.6 3.7 4.1 4.2 4.3 4.4 4.5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 6.1 6.2 6.3 LIST OF FIGURES A two dimensional mesh M (5, 4). ...................... 12 The Sched-All simulator ............................ 41 The Sched—All experiment planner. ..................... 43 The Sched-All experiment analyzer ...................... 44 An example data file .............................. 45 The generated GNUPLOT program. .................... 46 The generated plot (screen dump). ..................... 46 The generated plot ............................... 47 Mean response time vs. load (w and h: uniform distribution). ...... 50 Mean response time vs. load (w and h: increasing distribution). ..... 51 Mean response time vs. load (w and h: decreasing distribution) ...... 51 Performance vs. complexity of submesh allocation strategies ........ 54 Performance vs. complexity of submesh allocation strategies ........ 56 The IF scheduling scheme ........................... 60 The SA scheduling scheme. ......................... 62 System utilization vs. load (w and h: uniform distribution). ....... 63 Mean response time vs. load (w and h: uniform distribution). ...... 65 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 65 Mean response time vs. load (w and h: decreasing distribution) ...... 67 Standard deviation of response time vs. load (w and h: decreasing distri- bution). .................................. 67 Mean response time vs. load (w and h: increasing distribution). ..... 68 Standard deviation of response time vs. load (w and h: increasing distri- bution). .................................. 68 Mean response time vs. job size (w and h: uniform distribution). . . . . 70 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 70 Mean and standard deviation of response time vs. waiting time limit (w and h: uniform distribution). ...................... 71 The Multiple Queues scheduling scheme. .................. 80 Mean response time vs. # queues, load=0.5, w and h: uniform distribution. 82 Mean response time vs. # queues, load=0.5, (w and h: uniform distribu- tion) ..................................... 83 ix 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 6.15 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 7.10 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 X Mean response time vs. # queues, load=0.5, (w and h: decreasing distri- bution). .................................. 84 Mean response time vs. # queues, load=0.5, (w and h: increasing distri- bution). .................................. 84 Mean response time vs. 
# queues, load=0.7, (w and h: uniform distribu- tion) ..................................... 85 Mean response time vs. # queues, load=0.7, (w and h: decreasing distri- bution). .................................. 86 Mean response time vs. # queues, load=0.7, (w and h: increasing distri— bution). .................................. 86 Mean response time vs. load (w and h: uniform distribution). ...... 88 Mean response time vs. load (w and h: decreasing distribution) ...... 88 Mean response time vs. load (w and h: increasing distribution). ..... 89 Mean response time vs. load (w and h: uniform distribution). ...... 90 Mean response time vs. load (w and h: decreasing distribution) ...... 90 Mean response time vs. load (w and h: increasing distribution). ..... 91 Mean response time vs. waiting time limit (w and h: uniform distribution). 92 A two-dimensional mesh M (8, 8). ...................... 97 The EET-SB allocation strategy. ...................... 99 The EET-SC allocation strategy. ...................... 100 Mean response time vs. load (w and h: uniform distribution). ...... 102 Mean response time vs. load (w and h: increasing distribution). ..... 103 Mean response time vs. load (w and h: decreasing distribution) ...... 104 Mean response time vs. load (w and h: uniform distribution). ...... 105 Mean response time vs. load (w and h: decreasing distribution) ...... 106 Mean response time vs. load (w and h: uniform distribution). ...... 106 Mean response time vs. load (w and h: decreasing distribution) ...... 107 A two dimensional mesh M(10, 10) ...................... 112 Busy distance vectors. ............................ 115 The Busy Distance Inverse strategy ...................... 118 Mean response time vs. load (w and h: uniform distribution). ...... 119 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 120 Mean response time vs. load (w and h: increasing distribution). ..... 121 Standard deviation of response time vs. load (w and h: increasing distri- bution). .................................. 121 Mean response time vs. load (w and h: decreasing distribution) ...... 122 Standard deviation of response time vs. load (w and h: decreasing distri- bution). .................................. 123 Chapter 1 Introduction Massively parallel computers (MPCS) with many processors are considered by many as the most promising technology to achieve high computational power. MPCs speed up computations and simulations, and moreover, they can solve a much larger problem size than other platforms can [31, 32]. A technologically feasible mean of building MPCs is using distributed memory and message-passing networks. A distributed memory parallel computer system consists of many nodes where each node has its own processor(s), local memory, and routing devices. In systems we considered in this work, each node has only one processor, so we use the terms node and processor in a topology or network interchangeably in this dissertation. Processors are connected to one another based on a certain network topology. For example, in a two-dimensional (2D) mesh system, each processor is connected to its four nearest orthogonal neighbors. The mesh topology has become increasingly popular in the design of multiprocessor systems as wormhole routing techniques have advanced [50]. It has advantages such as simple and regular structure, 1 2 easy layout, and good scalability. Many 2D mesh systems have been built or are under development recently. 
The Intel Paragon [60], Intel Touchstone Delta [61], and Tera Computer System [2] are three examples. A partitionable system may be partitioned into smaller subsets of processors, where each subset can be allocated to independent jobs concurrently. After comple- tion of a job, the allocated processors are released for future jobs. For example, a partitionable 2D mesh system can be divided into submeshes of different sizes and run different jobs. To fully utilize such a system and maximize overall system throughput, we need to consider the problem of how to effectively schedule and allocate jobs to processors. 1 .1 Problem Statement Job scheduling problems may be divided into two categories, static and dynamic. The static job scheduling problem is to minimize the finish time of the last completed job for a given job list. For each job in the list, the required number of processors and its processing time are given (we use the terms processing time, execution time, and service time interchangeably in this dissertation). On the other hand, for the dynamic job scheduling problem, jobs arrive in a stream with a certain arrival rate and are placed in a global job queue. The job scheduler chooses the next job to be allocated based on some scheduling scheme, while the processor allocator tries to find a set of free processors for the job using a processor allocation strategy. In this dissertation, we are concerned with dynamic 3 job scheduling and processor allocation on two—dimensional mesh systems. Each job requests a fix sized rectangular submesh, and the mesh system is partitionable to run multiple jobs at the same time. 1.2 Motivation and Research Goal Similar to conventional memory allocation resulting in a fragmented memory, contin— uous allocation and deallocation of various sized submeshes in the 2D mesh system result in a fragmented mesh. The more fragmented a mesh is, the harder it is to find a free submesh for an incoming job. From the user’s point of view, this results in longer waiting delay and response time, Conversely, from the system’s point of view, this results in lower system throughput and utilization. Our goal in this research is to improve the overall system utilization and reduce the mean job response time for the dynamic job scheduling/ processor allocation problem in 2D mesh systems. To achieve this goal, we have several approaches, all of which are based upon identify- ing potential improvements in existing scheduling and allocation schemes. First, we propose to develop a more effective and efficient processor allocation strategy that can further reduce the system fragmentation and mean job response time by finding a best-fit location for each job. Then, we plan to develop advanced job scheduling schemes other than the First-Come-First-Served (FCFS) policy to avoid the potential performance loss due to its blocking effect. Under FCFS scheduling policy, the job dispatcher allocates the incoming job only if the waiting queue is empty and a free submesh is found; otherwise it puts the job at the end of the waiting queue. Upon 4 the completion of a job, the processors in the submesh that were allocated to the job are released and the dispatcher checks if jobs in the front of the waiting queue can be allocated. When a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system and resulting a longer waiting delay and wasted resources. 
Further, we have observed that the different completion times of jobs in the system can result in busy processors being arbitrarily scattered throughout the mesh, thus causing large external fragmentation. As indicated by Feitelson and Nitzberg [29], about two thirds of the jobs running on parallel systems are programs submitted re- peatedly. The approximate execution time of these programs can be quite accurately estimated by the users. By providing an estimated execution time to the proces- sor allocator, the system can allocate jobs with similar finishing times in the same areas. We will investigate the effect of using estimated execution times to guide sub- mesh allocation and develop processor allocation strategies that utilize the estimated execution time information. In order to achieve all of the above research goals, a simulation tool that supports experimentation and comparison of performance is needed to help in evaluating ex- isting schemes and developing new ones. We will develop a tool for simulation and performance analysis and display. Using this tool, we can compare various schemes under different load conditions and job characteristics. 1.3 Organization The rest of this dissertation is organized as follows. Chapter 2 gives the background and discusses the problems of job scheduling and processor allocation in multiproces- sor systems in more details. Related research is also summarized in this chapter. In Chapter 3, we present the performance evaluation methodology, describe the simula- tion model and the workload, and introduce the software tool that we have developed for simulation and performance evaluation. Then, we present simulation results of the performance of existing submesh allocation strategies, and assess cost-performance of the strategies in Chapter 4. In Chapter 5, we present two new job scheduling schemes called Immediate Fit and Scan All [11]. They dramatically improve the performance of the FCFS scheme. Then, we present an advanced job scheduling scheme using multiple queues in Chapter 6. It gives further performance improvement. In Chapter 7, we present a family of new submesh allocation strategies called Estimated Exe- cution Time strategies [12], which utilize estimated execution time information for better processor allocation. In Chapter 8, we present a new submesh allocation strat- egy called Busy Distance Inverse strategy [9], which gives the lowest time complexity and yields the best performance among all submesh allocation strategies that do not utilize estimated execution time information. Finally, we give the conclusions and future research in Chapter 9. Chapter 2 The Problem and Its Background Scheduling is a classical field with several interesting problems. In general, the scheduling problem assumes a set of resources and a set of consumers serviced by those resources according to a certain policy. A job in a factory and a customer in a bank are examples of consumers. A machine in a factory and a teller in a bank are examples of resources. The goal of this problem is to find an efficient policy to determine a sequence and an assignment of consumers to resources to optimize some performance measures. Generally speaking, determining a sequence or order is equivalent to scheduling, and determining an assignment, allocation. First-Come— First-Served (FCF S) is an example of a scheduling policy. Scheduling policy per- formance varies with different circumstances. 
While FCFS may be appropriate in a bank environment, it may not be the best policy to be applied to jobs in a factory. In the arena of parallel and distributed computing, scheduling problems have gained the attention of many researchers. One interesting problem area is concerned with scheduling parallel program tasks on parallel and distributed systems. In this 6 7 problem, a program can be viewed as a collection of tasks which may run serially or in parallel. The tasks are the consumers and are represented using directed graphs called task graphs. The processing elements (or processors) are the resources and their interconnection networks are represented using undirected graphs. An optimal schedule determines both the allocation and the execution order of each task to op- timize some performance criterion, for example, the maximum finishing time of a set of programs. Interested readers can refer to [28] for more details of this problem. Another interesting problem area is concerned with job scheduling and processor allocation on multiprocessor systems. In this problem, each job requires a fixed number of processors and has a fixed execution time. Because of the various topologies of target machines, we must develop and apply different processor allocation and job scheduling schemes on different topologies. In this dissertation, we are concerned with the job scheduling and processor allocation problem on a popular topology called the two-dimensional (2D) mesh. In this chapter, we first introduce the problems of job scheduling and processor allocation on multiprocessor systems in Section 2.1. Then we discuss the 2D mesh architecture in Section 2.2. In Section 2.3, we describe the dynamic job schedul- ing/ processor allocation problems on 2D mesh systems. Sections 2.4 and 2.5 summa- rize previous submesh allocation strategies and job scheduling schemes, respectively. In Section 2.6, recent software tool development in this area is introduced. 8 2.1 Job Scheduling and Processor Allocation Problems in Multiprocessor Systems Generally speaking, job scheduling and allocation problems can be divided into two categories: static and dynamic. The objective of static approaches is to minimize the finishing time of the last completed job for a given job list. For each job in the list, the required number of processors and its processing time are given. On the other hand, the dynamic approaches assume a stochastic stream of arriving jobs, with no a priori information about their characteristics. Furthermore, the approaches have targeted hypercube and low-dimensional meshes, which are popular topologies in the design of multiprocessor systems. In a hypercube system, each job requires a subcube of a fixed dimension. In a 2D mesh system, each job requires a rectangular submesh. Effective allocation strategies are needed to find free subcubes and submeshes, respectively, to reduce fragmentation. Static job scheduling problems in multiprocessor systems have been studied ex- tensively. The multiprocessor scheduling problem [30] is to find an optimal non- preemptive schedule of n independent jobs on m identical processors to minimize the finishing time of the last completed job. In this problem, each job has a fixed execution time t and requires only one processor. This problem has been proven to be NP-hard for m 2 2. This problem is equivalent to a variant of the bin packing problem [17], which is to minimize the maximum height of m bins required to pack a given list of n objects of various heights. 
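Because the problem is NP-hard for m ≥ 2, practical schedulers rely on heuristics. As a concrete illustration (this sketch is ours and is not taken from any of the cited works; job_times and the machine count m are its only inputs), the classical longest-processing-time-first rule sorts the jobs by decreasing execution time and greedily places each one on the processor that currently finishes earliest:

import heapq

def lpt_schedule(job_times, m):
    """Greedy LPT list scheduling: assign each job (longest first) to the
    machine with the smallest current finishing time.  Returns the makespan
    and the per-machine assignment of job indices."""
    machines = [(0.0, i) for i in range(m)]        # (finish_time, machine_id)
    heapq.heapify(machines)
    assignment = [[] for _ in range(m)]
    for job, t in sorted(enumerate(job_times), key=lambda x: -x[1]):
        finish, mid = heapq.heappop(machines)      # least-loaded machine so far
        assignment[mid].append(job)
        heapq.heappush(machines, (finish + t, mid))
    makespan = max(f for f, _ in machines)
    return makespan, assignment

# Example: seven jobs scheduled on three identical processors.
print(lpt_schedule([5, 7, 3, 9, 2, 4, 6], 3))

A heuristic of this kind does not guarantee an optimal finishing time, but it illustrates the greedy assignment style that many of the schemes surveyed below follow.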
The bin packing problem is generalized to the two-dimensional packing problem 9 [16, 4, 15], which is to pack a list of n rectangles of various widths and heights into an open-ended rectangle of a certain width W to minimize the overall height. Here, the packing is oriented, i.e., a rotation of 90° (interchange width w and height h) is not allowed. This problem is analogous to statically scheduling a list of n jobs non—preemptively, where each job requires w units of contiguous block of resource for h units of time, in a system having a partitionable resource of W units arranged contiguously in a linear fashion. Another generalization of the packing problem is the three-dimensional packing problem [42], which is to pack a list of n rectangular boxes into a big rectangular box having infinite height and a bottom of size W x H and minimize the overall height of the packing. Each box has a fixed size of w X h x t, where w x h is the bottom size and t is the height of the box. Here the packing is oriented in the time dimension, but may not be oriented in the other two dimensions. In other words, a rotation of 90° of small boxes in the bottom is allowed. This problem is analogous to statically scheduling/ allocating jobs in a 2D mesh system of size W x H, where each job requires a rectangular submesh of fixed width w, fixed height h, and processing time t. The 2D mesh architecture is described in more detail in the next section. Several static job scheduling and subcube allocation schemes for hypercube sys— tems have been proposed [6, 7, 58, 59]. Several static job scheduling and submesh allocation schemes have been proposed. Li and Cheng proposed a mesh partition- ing strategy, layer by layer (LL), and a static scheduling policy, longest processing time first, in [41, 43]. Li and Cheng also proposed another scheme, Two-Dimensional Buddy System (2DBS), for mesh partitioning in [44, 47]. 10 For the static job scheduling problem in a hypercube system, each job requires a subcube of a fixed dimension. A subcube allocation strategy is needed for the processor allocator to find a free subcube. Several static job scheduling and subcube allocation schemes for hypercube systems were proposed in [59, 58, 6, 7]. For the static job scheduling problem in a 2D mesh system of size W x H, each job requires a rectan- gle submesh of fixed width w, height h, and a processing time of t. This problem is equivalent to the three-dimensional packing problem [42], In real applications, jobs ar- rive randomly in a stochastic stream. A dynamic job scheduling/ processor allocation scheme, which does not require information about future jobs, is more realistic than a static scheme, and it is more applicable to computer systems. A number of dynamic processor allocation strategies using the First-Come-First-Served (FCFS) scheduling scheme have been proposed for hypercube systems [8, 26, 27, 1, 36, 37, 14, 19, 22]. Before we discuss the dynamic job scheduling/processor allocation schemes for 2D mesh systems, we introduce the 2D mesh architecture in more detail in the next sec- tion. We will discuss more dynamic scheduling schemes for hypercubes and meshes in Section 2.5. 2.2 The 2D Mesh Architecture Mesh architectures have become increasing popular in the design of multiprocessor systems because of their advantages such as simple structure and good scalability. In particular, low-dimensional meshes are favored over hypercubes in certain systems. Formally, an n-dimensional mesh has k0 x k1 x . . . 
x k_{n-1} nodes, k_i nodes along each dimension i, arranged in an n-dimensional grid. A hypercube, or a binary n-cube, is a special case of the n-dimensional mesh in which k_i = 2 for all i, 0 ≤ i ≤ n - 1. A two-dimensional mesh consists of W x H nodes arranged in a two-dimensional grid. The bisection density, that is, the product of the bisection width and the channel bandwidth, can be used as a measure of the network cost, where the bisection width is the minimum number of channels that must be removed to partition the network into two equal subnetworks. Consider a 2^n x 2^n 2D mesh and a 2n-cube having the same number of nodes N = 2^(2n). The bisection widths of these two topologies are 2^n and 2^(2n-1), respectively. If both networks have the same bisection density, then the channel bandwidth of the 2D mesh is 2^(n-1) times larger than that of the 2n-cube. In other words, for the same number of processors and the same network cost, the 2D mesh can offer higher channel bandwidth than a hypercube does. However, a 2D mesh has a larger diameter, which is the maximum distance between two nodes. For systems in which the network latency depends on the path length, the hypercube is a better choice. For systems that support wormhole routing, the network latency is almost independent of the path length when there is no contention and the message length is relatively large. Low-dimensional meshes have become increasingly popular for such systems since the disadvantage of large diameter diminishes as wormhole routing techniques have advanced [50]. In the remainder of this section, we focus on the 2D mesh architecture and some necessary definitions.

A 2D mesh, denoted by M(W, H), consists of W x H nodes arranged in a W x H two-dimensional grid. Each node represents a processor and each edge is a communication link. The node in column i and row j can be represented by the address < i, j >. A non-boundary node < i, j >, where 0 < i < W - 1 and 0 < j < H - 1, is connected to the four neighboring nodes < i ± 1, j > and < i, j ± 1 > (a boundary node has fewer neighbors). A two-dimensional submesh in M(W, H), denoted by S(w, h), is a subgrid of M(W, H) such that 1 ≤ w ≤ W and 1 ≤ h ≤ H. The address of a submesh is denoted by a quadruple < x1, y1, x2, y2 >, where < x1, y1 > indicates the lower left corner and < x2, y2 > indicates the upper right corner of the submesh. The base of a submesh is the node at the lower left corner. Figure 2.1 shows a rectangular mesh M(5, 4). Nodes < 2,1 >, < 3,1 >, < 4,1 >, < 2,2 >, < 3,2 >, and < 4,2 > form a submesh S(3, 2) with address < 2, 1, 4, 2 >.

Figure 2.1: A two-dimensional mesh M(5, 4).

A node or submesh is busy (allocated) if it is assigned to a job. A node or submesh is free if it is not busy. A free submesh is a submesh whose nodes are all free. A busy (allocated) submesh is a submesh whose nodes are all assigned to a job.
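To make these definitions concrete, the short sketch below (our illustration, not part of any cited strategy; the busy-array representation anticipates the First Fit strategy of Section 2.4.3) stores the allocation state of M(W, H) as a W x H busy array and tests whether the submesh with a given address is free:

def make_mesh(W, H):
    """Busy array for M(W, H): busy[y][x] is True if node <x, y> is allocated."""
    return [[False] * W for _ in range(H)]

def is_free(busy, x1, y1, x2, y2):
    """True if every node of the submesh with address <x1, y1, x2, y2> is free."""
    return all(not busy[y][x]
               for y in range(y1, y2 + 1)
               for x in range(x1, x2 + 1))

def allocate(busy, x1, y1, x2, y2):
    """Mark the submesh busy (the caller has already verified that it is free)."""
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            busy[y][x] = True

# The S(3, 2) submesh of Figure 2.1, with base <2, 1> and address <2, 1, 4, 2>.
mesh = make_mesh(5, 4)
print(is_free(mesh, 2, 1, 4, 2))   # True: all nodes are free initially
allocate(mesh, 2, 1, 4, 2)
print(is_free(mesh, 2, 1, 4, 2))   # False: the submesh is now busy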
2.3 Problem Description

In this work, we deal with the dynamic job scheduling/processor allocation problem in a 2D mesh system. A two-dimensional mesh-connected system M(W, H) has W x H processors. Jobs arrive randomly and request submeshes of various widths and heights. The system tries to find a free submesh for incoming jobs following a specific scheduling scheme and a submesh allocation strategy. If a free submesh is found for a job, the system allocates the job to the free submesh. When a job is done, the system deallocates the job and releases the submesh. The goal of the job scheduling/processor allocation schemes is to minimize the internal and external fragmentation so that the system utilization is high and the mean job response time is low. The existing submesh allocation strategies and job scheduling schemes are discussed briefly in the following sections.

2.4 Submesh Allocation Strategies

The submesh allocation strategy plays an important role in improving the performance of a 2D mesh multiprocessor system servicing multiple jobs. Many attempts have been made to develop high-performance, low-complexity submesh allocation strategies. The existing dynamic submesh allocation strategies are discussed briefly in the following subsections.

2.4.1 Two-Dimensional Buddy System

The Two-Dimensional Buddy System (2DBS) proposed by Li and Cheng [45, 46] is derived from the conventional one-dimensional buddy allocation scheme [38]. The 2DBS is applicable only when the system is a square mesh with the length of each side being exactly a power of two. Also, it can allocate only square submeshes with side lengths that are exact powers of two. A square block (submesh) is specified as B(x, y, k), where < x, y > is the base of the submesh and 2^k is its size (side length). If k > 0, then B(x, y, k-1), B(x + 2^(k-1), y, k-1), B(x, y + 2^(k-1), k-1), and B(x + 2^(k-1), y + 2^(k-1), k-1) are also submeshes, which are buddies of each other. These four square submeshes of size 2^(k-1) form a larger submesh of size 2^k. For a mesh system of size 2^R (2^R x 2^R), R + 1 free block lists (FBLs) are maintained, and FBL_i contains all available blocks of size 2^i, where 0 ≤ i ≤ R.

For an incoming job requesting a w x h submesh, the system tries to allocate a square submesh of size 2^i for it, where i = ⌈log2(max(w, h))⌉. The search starts from FBL_i for the smallest j such that j ≥ i and FBL_j is not empty. If no non-empty FBL is found, the job returns to the queue front; otherwise, the first block B(x, y, j) is removed from FBL_j. If j > i, the block B(x, y, j) is split into four buddies of size j - 1: the job is allocated to the first buddy and the remaining three buddies are appended to FBL_(j-1), the splitting being repeated until a block of size i is obtained. The worst-case complexity of the allocation procedure is O(R) or O(log N), where N = W x H = 2^(2R) is the number of processors in the system.

When a job finishes, it releases the block B(x, y, k), which is appended to FBL_k. If all three other buddies of B are also in FBL_k, the four buddies are removed from FBL_k, combined into a block of size k + 1, and appended to FBL_(k+1). Further possible combinations of buddies are checked in the same way. The worst-case complexity of the deallocation procedure is also O(log N).

This strategy tends to give large internal fragmentation when the requested submesh size is not a power of two. Hence, a significant amount of processor resources could be wasted.
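The core of the 2DBS allocation step can be sketched as follows (our simplified illustration; recombination of buddies on deallocation is omitted, and tdbs_allocate is a name we introduce here):

import math

def tdbs_allocate(fbl, w, h):
    """2DBS allocation sketch.  fbl[k] lists the free square blocks (x, y) of
    side 2**k.  Returns (x, y, k) for the allocated block, or None."""
    i = math.ceil(math.log2(max(w, h, 1)))           # requested block size 2**i
    j = next((k for k in range(i, len(fbl)) if fbl[k]), None)
    if j is None:
        return None                                   # no free block is large enough
    x, y = fbl[j].pop()
    while j > i:                                      # split into four buddies
        j -= 1
        half = 2 ** j
        # keep the lower-left buddy; the other three buddies become free blocks
        fbl[j].extend([(x + half, y), (x, y + half), (x + half, y + half)])
    return (x, y, i)

# Example: an 8 x 8 mesh (R = 3) that initially holds a single free 2**3 block.
R = 3
fbl = [[] for _ in range(R + 1)]
fbl[R].append((0, 0))
print(tdbs_allocate(fbl, 3, 3))    # splits the 8x8 block and returns a 4x4 block
print(tdbs_allocate(fbl, 2, 2))    # splits one freed 4x4 buddy into 2x2 blocks

The first request illustrates the internal fragmentation noted above: a 3 x 3 job occupies a full 4 x 4 block.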
2.4.2 Frame Sliding Strategy

The Frame Sliding (FS) strategy proposed by Chuang and Tzeng [13] introduced the ideas of the busy set, the coverage set, and the reject set. The busy set is the collection of all busy submeshes. The coverage set is the union of the coverage submeshes of all busy submeshes, where the coverage submesh of a busy submesh with respect to an incoming requested submesh is a submesh whose nodes cannot serve as the base for the requested submesh. For an allocated submesh < x1, y1, x2, y2 >, the coverage submesh for an incoming job J(w, h) is < max(0, x1 - w + 1), max(0, y1 - h + 1), x2, y2 >, where max(i, j) returns the larger value of i and j. The reject set with respect to an incoming job J(w, h) for a mesh system M(W, H) consists of two reject submeshes, < W - w + 1, 0, W - 1, H - 1 > and < 0, H - h + 1, W - 1, H - 1 >, in the rightmost and the topmost parts of the mesh, respectively. Nodes in the reject set can never serve as the base of any free submesh for the incoming job.

For a job requesting a w x h submesh, the FS strategy first computes the coverage set and reject set according to w, h, and the busy set. Starting from the lowest leftmost free processor, it compares the base node of every candidate frame (submesh) in sequence against the coverage set and reject set. If this node belongs to any element of the two sets, the frame slides w nodes to the right. If it exceeds the right boundary, the frame moves up h nodes and then slides from right to left. The sliding continues until a free submesh is found or the entire mesh is searched. The FS strategy has a complexity of O(N_A x N / (w x h)), where N_A is the number of currently allocated submeshes.

The FS strategy can be applied to rectangular meshes with arbitrary width and height. It allocates a job to a free submesh with the same size as that of the job, thus eliminating internal fragmentation completely. However, due to the use of fixed vertical and horizontal strides, this strategy may not be able to find a free submesh even if there is one available.

2.4.3 First Fit Strategy

The First Fit (FF) strategy was proposed by Zhu [57]. Instead of using a busy set, a busy array B[W, H] is constructed according to each processor's status, where B[i, j] is 1 (0) if processor < i, j > is busy (free). Instead of computing coverage submeshes individually, a coverage array C[W - w + 1, H - h + 1] can be computed from the busy array, where C[i, j] is 0 if node < i, j > can be a base for the incoming job. The reject set is no longer needed because the dimensions of the coverage array are reduced by w - 1 and h - 1 in the x and y directions, respectively.

For an incoming request of a w x h submesh, the FF strategy computes the coverage array in two steps: (1) fill the coverage to the left of busy submeshes, and (2) fill the coverage below the busy submeshes. To fill the left coverage, the FF strategy scans each row in the busy array B[W, H] from right to left and initializes left_cov = W. At element B[i, j], if B[i, j] = 1, then it sets left_cov = max(i - w + 1, 0). Now, if i ≥ left_cov, then C[i, j] = 1; else C[i, j] = 0. To fill the bottom coverage, the FF strategy scans each column of the coverage array from top to bottom, starting at the leftmost column, and initializes btm_cov = H. At element C[i, j], if C[i, j] = 1, then it sets btm_cov = max(j - h + 1, 0). Now, if j ≥ btm_cov, then C[i, j] = 1; else C[i, j] = 0. The first zero found from the upper left corner of the coverage array is used as the base for the requested submesh. The FF strategy requires O(W x H) or O(N) time steps, where N is the number of processors in the system.

This approach eliminates the allocation miss encountered in the FS strategy. Moreover, if rotation of the requested submesh by 90° is allowed, the performance can be improved further. However, the FF strategy allocates a job to the first available submesh.
Thus, a large contiguous area may be divided and allocated for a small submesh request, resulting in potential external fragmentation. On the other hand, a best-fit strategy tries to allocate a job to a “best-fit” submesh such that large contiguous space can be retained for incoming jobs with large requested sizes. Consequently, a best fit method may improve the system utilization and, hence, the overall system performance. 18 2.4.4 Best Fit Strategy In [57], Zhu also proposed a Best Fit (BF) allocation strategy. Instead of always allocating the job to the submesh at the upper left corner of the mesh system, the BF strategy chooses among the available submeshes for allocation based on a certain heuristic that tries to allocate a job at the best corner to reduce the external frag- mentation. A corner is an element in the coverage array that serves as the end of a consecutive sequence of 0’s in both the row and column where it is located. The heuristic is to choose the corner in the coverage array with the largest number of busy neighbors (1’s) surrounding it. For a submesh request, the coverage array is computed following the procedure described for the FF strategy. The coverage array is then scanned as follows to identify corners: (1) Scan each row in the coverage array in any order and mark two end elements of each sequence of consecutive 0’s as left and right, respectively; and (2) Scan each column in the coverage array in any order, and for each consecutive sequence of 0’s in a column, mark the two end elements as bottom and top, respectively. If any top or bottom is also marked by left or right, it is a corner. The number of busy neighbors (1’s) of each corner is then computed. A corner located in one of the four corners of the coverage array is considered as having two additional 1’s surrounding it, while a corner located in the boundaries of the coverage array, e.g., C [0, 1], is considered as having one additional 1. The corner with largest number of busy neighbors is chosen for allocation. If there are ties, the corner with the smallest area is selected, where the area of a corner is the product of the numbers of consecutive 19 0’s in its row and column starting from the corner. The BF strategy also requires 0(W x H) or 0(N) steps. Although the BF strategy is designed to reduce the fragmentation and subse- quently to improve the system performance, it does not show a better performance compared with the FF strategy in most circumstances. Zhu indicated in [57] that the BF strategy may cause jobs being scattered in the mesh, while the FF strategy tends to allocate jobs toward the upper left corner. Consequently, the mesh becomes more fragmented when using the BF strategy. Thus, a best—fit submesh allocation strategy may perform worse than or similar to a first-fit strategy and scatter allocated submeshes throughout the mesh if the best-fit selection heuristic is inferior. 2.4.5 Adaptive Scan Strategy The Adaptive Scan (AS) strategy proposed by Ding and Bhuyan [25] is an improve- ment over the Frame Sliding strategy discussed in Section 2.4.2. It is the first strategy that allows the rotating of the submesh request by 90° if a free submesh is not found in the first scan. After constructing the coverage set according to the submesh request, it scans the entire mesh starting from the lowest leftmost node < 0,0 >, moving left to right, bottom to top, to node < W — w + 1, H — h + 1 > with adaptive horizontal strides to skip impossible nodes. The stride is determined as follows. 
If the current node belongs to a coverage submesh in the coverage set, let rm” be the maximal :1: index 20 of this coverage submesh. The scan jumps from the current node to the :1:me + 1 position in the same row until the x index exceeds W — w + l or a free submesh is found. If a free base is not found, it scans the next row from left to right until the y index exceeds H — h + 1 or a free submesh is found. If it can not find a free base after scanning the whole mesh, it rotates the requested submesh from S (w, h) to S (h, w) and scans the mesh one more time. The AS strategy has the worst case complexity 0f 0(N X NA). The AS strategy always finds a free submesh if there is one available, so it has complete submesh recognition capability, and it gives the best performance among all first-fit allocation strategies assuming an FCFS scheduling policy is used. 2.4.6 Busy List Strategy The Busy List (BL) strategy was proposed by Das Sharma and Pradhan in [21]. It is a best fit strategy with complete submesh recognition capability. In this algorithm, a list of all allocated submeshes, or busy list (busy set), is maintained. For an incoming job J (w, h), the algorithm generates initial candidate submeshes for both J (w, h) and J (h, w) along the four corners of each allocated submesh within the mesh boundaries. The initial candidate submeshes generated from each allocated submesh are then considered one by one to check for overlap with any of the other allocated submeshes. An initial candidate submesh slides along the boundary of the allocated submesh that generated it until a free submesh is found only if the candidate submesh overlaps with another allocated submesh. At the same time, the candidate 21 submesh increases its boundary value by the length of the common boundary if it is adjacent to some allocated submesh. The boundary value of a free node is the number of its busy neighbors. The boundary value of a free submesh is the sum of the boundary values of all the nodes in the submesh boundary. In other words, the boundary value of a free submesh is the number of busy neighbors it has. The algorithm also generates candidate submeshes which are at the four corners of the mesh system. However, they are discarded in case of an overlap with some allocated submesh. The best fit submesh is the one with the highest boundary value. The worst case complexity of the BL strategy is 0(Nfi). The consequence of the BL strategy is that jobs tend to gather either at the four corners of the mesh system or adjacent to allocated submeshes. However, if an initial candidate submesh does not overlap with any allocated submesh (i.e., it is free), it does not slide. So, the BL allocation strategy does not consider all free submeshes adjacent to the allocated submeshes. As indicated by Das Sharma and Pradhan [21], the BL strategy may not always be able to find the free submesh with the highest boundary value. It can be modified to consider all the boundary nodes of all allocated submeshes with the penalty of longer processing overheads. 2.4.7 Lookahead Strategies Bhattacharya and Tsai proposed the Lookahead strategies [5], which allow examining the queue of waiting jobs to retain the largest possible contiguous free space for large sized waiting jobs. When a submesh request J (w, h) arrives in the system M (W, H), 22 the prohibited region (coverage submesh) for each allocated submesh is computed. 
Then, starting with a single free region < 0, 0, W — w + 1, H — h + 1 >, the prohibited regions are subtracted from the free regions one by one. At each subtraction step, free regions may be divided into smaller free regions or be totally eliminated. The resulting free regions after completing the subtraction steps form the free space. If the free space is empty, no submesh is available. If one or more free regions are obtained, each processor in the free regions can be selected as the base for the submesh requested. The authors first proposed three allocation policies without looking into the wait- ing queue, namely, First-Fit (FF), Best-Fit (BF), and Worst-Fit (WF). First-Fit policy selects the free region with the least x-valued base. If there are ties, it chooses the one with the least y-value. The base of the selected free region serves as the base of the submesh allocating the the job. Best-Fit policy selects the free region with the least area, while Worst—Fit policy selects the free region with the largest area. The performance of these three policies are very similar. Each policy has its advantage under certain job characteristics and load. In the process of computing free regions, the resulting free regions are not unique. The subtraction order of the prohibited regions and the division orientation (hori- zontally or vertically) might lead to different results. Further, allocating a job to the smallest free region does not necessarily result in leaving largest possible free space for later jobs. Actually, the smallest free region might be in the middle of a large free space, and allocating a job to it causes large external fragmentation. The lookahead heuristic selects a free region that leaves a larger space for later 23 jobs. The k-lookahead heuristic allocates the current submesh request to a free region such that a maximum number of k largest waiting jobs in the queue can also be allocated. The l-lookahead Best-Fit scheme, BFL(1), starts the allocation process by checking whether allocating the current job to the smallest free region would leave enough space for the largest waiting job in the queue. If not, it tries to allocate the current job to the next larger free region and checks again, repeating until a free submesh is found for the largest waiting job or all the free regions are tested. If this allocation process fails to find a free submesh for the largest waiting job, it allocates the current job to the smallest free region. On the other hand, in the 1-lookahead Worst-Fit scheme, WFL(1), the allocation process proceeds from the largest free region to the smallest free region. The proposed strategies require 0(Nj) steps. Based on the results of the simulation we performed, the BFL(1) and WFL(1) have very similar performance. Furthermore, both lookahead strategies have negligible performance differences compared with the FF, BF, and WF policies they proposed. Comparing the FF strategy proposed here with the Adaptive Scan strategy and Zhu’s FF strategy, we found that the only difference among their first-fit search processes is the searching directions. The FF strategy selects the available base with the smallest x-value (if there are ties, selects the one with the least y-value). On the other hand, the Adaptive Scan strategy first finds the available base which has the smallest y- value with the smallest :r-value. However, Zhu’s FF strategy finds the first available base which has the smallest x-value with the largest y-value. 
If we allow the rotation of the requested submesh by 90° when a free submesh is not found in the first pass for these three strategies, they have similar performance. So, the BFL(1) and WFL(1) 24 have very similar performance to the first-fit strategies, such as AS or FF. The lookahead strategies try to select a best free region for allocation among the free regions generated as described above by looking into the waiting queue. However, the number of free regions is O(n) [5], where n is the number of currently allocated submeshes. The maximum number of submeshes allocated in the system depends on the size of the mesh and the sizes of the allocated submeshes, and it is usually a small number. So, the resulting number of free regions is also very small. No matter which best-fit heuristic they use, from a small number of free regions, it is difficult to find a good submesh comparable to the submesh selected among a large number of candidate submeshes using other best—fit strategies. 2.4.8 Non-contiguous Processor Allocation Algorithms All the submesh allocation strategies mentioned above allocate a contiguous block of processors that physically form a submesh. Liu, Lo, and Windisch proposed three non-contiguous processor allocation strategies [48] that lift the restriction on contigu- ity of processors in order to eliminate the system fragmentation completely. However, non-contiguous processor allocation introduces potential problems due to message— passing contention, yielding communication interference among jobs. Non-contiguous strategies are beyond the scope of this work. 25 2.5 Dynamic Job Scheduling Schemes Job scheduling schemes for partitionable multiprocessor systems play an important role for improving system performance. Several dynamic job scheduling schemes for hypercubes and 2D meshes have been proposed. We will discuss them briefly in the following subsections. 2.5.1 Scan Scheduling Schemes for Hypercubes Krueger, Lai, and Radiya proposed a family of scheduling schemes, Scan, for hyper- cubes [39, 40]. For a hypercube of dimension r, r + 1 separate queues, Qk, 0 S k S r, are maintained for requested subcubes having dimensions 0 to r. We can scan the queues from small dimension to large dimension (Scan Up) or vice versa (ScanDown) in a circular way. All jobs of a certain dimension are allocated in the F CFS order until the queue is empty, and then the dispatcher allocates jobs in the next queue. Starvation is possible for both ScanUp and ScanDown. To solve this problem, two starvation free (SF) algorithms, Scan UpSF and ScanDownSF are proposed. For both algorithms, before the scheduler begins allocating jobs from queue i, it moves the entries found in the queue to CurrentQueue. The scheduler allocates jobs from CurrentQueue until it’s empty, then moves on to the next non-empty queue. Jobs that arrive at queue i after its entries have been moved to CurrentQueue must wait until the scheduler returns to queue i after allocating jobs from all other non-empty queues. 26 2.5.2 Lazy Scheduling Scheme for Hypercubes Mohapatra, Yu, Das, and Kim proposed a Static Partitioning scheme and a Lazy Scheduling scheme [49]. Static Partitioning scheme divides the r-cube system into one (r — 1)—cube, one (r — 2)-cube, ..., one l-cube, and two 0-cubes. There are r queues maintained for jobs requesting k-subcubes, where 0 S k S r — 1. Each queue is scheduled on a FCFS basis. Lazy Scheduling scheme maintains r +1 queues for an r-cube system. 
Each queue has a threshold length denoted as Nk, where 0 S k S r. Initially, Nk is set to zero. M. is increased by one upon a k-cube allocation and decreased by one upon a k-cube deallocation. In other words, M, denotes the number of allocated k—cubes currently in the system. When a job requesting a k-cube arrives, it goes to the end of queue k. If the number of waiting jobs in queue k is less than Nk, the job in the queue front gets allocated when a job completes and releases a subcube of the same dimension. If the number of waiting jobs in queue k is greater than Nk, the scheduler allocates the job in the queue front using a subcube allocation strategy if a free subcube is found. 2.5.3 Scan Scheduling Scheme for 2D Meshes Babbar and Krueger proposed a Scan scheduling discipline for 2D mesh systems in [3]. It differs from the Scan scheduling discipline for hypercubes only in the number of queues it maintains. It is only applicable when the mesh system and the requested submeshes are square meshes. For a square mesh of size L x L, if all jobs request square submeshes, Scan main- 27 tains L separate queues, one for each possible size of submesh request. On arrival, an incoming job joins the end of the queue corresponding to the size of the submesh it requires. All jobs in a certain queue are allocated before the scheduler moves to the next queue. Based on the direction of scanning, there are two possible alternatives, Scan Up and ScanDown. However, for jobs requesting arbitrary sizes and shapes, Scan maintains only a single ordered queue. For ScanU p, jobs are ordered from the smallest to the largest number of processors requested, while for ScanDown, jobs are ordered from largest to smallest. Actually, ScanUp and ScanDown are exactly the same as Smallest Job First and Largest Job First scheduling policies [39], respectively, when only a single queue is maintained. 2.5.4 Priority Scheduling Scheme for 2D Meshes Das Sharma and Pradhan proposed a job scheduling scheme that combines a priority technique with a reservation technique [23] for 2D meshes. The scheduling scheme can be implemented with any submesh allocation strategy. It tries to find a free submesh for an incoming job using the submesh allocator. If a free submesh is not available, the scheme searches to find a submesh that can be reserved for the incoming job. A processor may not be reserved twice and a reserved processor may not be allocated to another job, even though it may be free. All possible submeshes are considered in order to select a submesh for reservation. Among all the submeshes that are available for reservation, the one with the minimum number of 28 free processors is selected for reservation. If there is a tie, the candidate that overlaps with the least number of allocated submeshes is selected. A job that successfully reserves a submesh is appended to a reservation list. If a submesh can not be reserved for the incoming job, the job is appended to the waiting list. The jobs in the waiting list do not block the allocation/ reservation process unless a certain number of subsequent jobs arriving after the job in the head of the waiting list have been allocated or reserved. For each job in the waiting list, the scheme keeps a count of the number of jobs arriving later than it that have been allocated. To avoid the starvation problem, the job in the front of the waiting list has the highest priority to be allocated if more than MAXJ’RI jobs arriving after it have been allocated. 
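The bookkeeping behind this aging rule is straightforward. The sketch below is our illustration only and is deliberately simplified: it tracks the bypass count for the head job alone, whereas the scheme in [23] keeps such a count for every waiting job, and max_pri stands for the MAX_PRI threshold on the number of later arrivals allowed to bypass the head of the waiting list (its value here is arbitrary):

class WaitingList:
    """Aging sketch for the priority scheduling scheme: the head job gains
    absolute priority once more than max_pri later arrivals have bypassed it."""
    def __init__(self, max_pri=8):            # threshold value is illustrative
        self.jobs = []                         # FCFS waiting list
        self.bypassed = 0                      # later arrivals allocated so far
        self.max_pri = max_pri

    def add(self, job):
        self.jobs.append(job)

    def note_bypass(self):
        """Call whenever a job that arrived after the head job is allocated."""
        if self.jobs:
            self.bypassed += 1

    def head_has_priority(self):
        """True if the head job must now be allocated before any newer job."""
        return bool(self.jobs) and self.bypassed > self.max_pri

    def pop_head(self):
        self.bypassed = 0
        return self.jobs.pop(0)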
If we do not allow any reservations, the priority scheduler behaves like a non- preemptive priority scheduler. If a job can not be allocated, it joins the waiting list. Jobs from the reservation and waiting lists are considered for allocation in an FCFS manner, without blocking. When a job finishes and releases its processors, an attempt is made to allocate as many jobs waiting in the reservation list followed by the waiting list in the FCFS order. This proceeds in three phases. In the first phase, the entire reservation list is searched to find out the reserved submeshes whose processors are all free and allocate them to corresponding jobs. In the second phase, the entire reservation list is searched again to determine if any of the reserved submeshes can be allocated due to the processors released by the finished job or the free processors released by the reserved submeshes (after the jobs that reserved them got allocated 29 during the second phase itself). In the third phase, an attempt is made to allocate the jobs in the waiting list, and then try to reserve submeshes for the jobs in the waiting list. The priority technique alone, i.e., reservation is not allowed, improves the perfor- mance. However, the reservation technique alone does not improve the performance, due to the waste of large number of reserved free processors. A combination of the priority and reservation techniques results in a lower waiting delay than the prior- ity technique alone for low to moderate loads. However, for high loads, the priority technique alone yields best results. 2.6 Performance Evaluation Tools for Processor Allocation and Scheduling A great deal of effort has been directed towards developing software tools for the problems of scheduling parallel program tasks on parallel and distributed systems [28]. However, software tools for the study of processor allocation and job scheduling problems are in limited supply. Windisch, Miller, and Lo have developed a software tool called ProcSimity [55] for research in the area of processor allocation and scheduling for distributed memory multicomputers. It includes two main components: (1) a multicomputer simulator and (2) a performance analysis / visualization tool. The simulator currently supports a two-dimensional mesh topology ranging in sizes 30 up to 1024 nodes (32 x 32) and k-ary n-cubes ranging in sizes up to 1024 nodes for selected values of k and n. It supports the following scheduling schemes: First-Come- First-Served, Last-Come-First-Served, Shortest (Longest) Service Demand, Shortest (Longest) Hold Time, and Smallest (Largest) Job First. Processor allocation strate- gies for 2D meshes supported include selected submesh allocation strategies: Frame Sliding, First Fit, Best Fit, and non-contiguous allocation strategies (job requests for number of processor needed) proposed in [48]. Jobs running in the system can be simulated at two different levels. First, jobs can be delayed for a randomly gener- ated service time, and then depart, i.e., only service time is modeled and simulated. Second, jobs are simulated by executing specific parallel application communication patterns, i.e., only communication is modeled without detailed simulation of compu— tation. When the ProcSimity simulator is executed, it begins by reading a detailed set of parameters from the terminal’s stdin. However, it is easier to invoke the simulator by redirecting a file to the simulator. The file contains a list of all the parameters for the run. 
Each execution of the simulator produces output to stdout containing three types of information: (1) the parameters used, (2) the results of each individual run, and (3) the final results, averaged over all runs. Performance information is provided to the visualization tool through trace files generated by the simulator. The visualization tool provides a dynamic display of allocation and deallocation events and performance metrics. The Topology window displays the network topology and the dynamic activity of jobs entering and leaving the system. The SimInfo window provides information such as topology dimensions, job size distribution, and the processor allocation and scheduling algorithms. In a typical running session, a single trace file is loaded, and allocation and deallocation activities of jobs are displayed.

Chapter 3

Performance Evaluation Methodology and Tool

Simulation is a useful technique for computer systems performance analysis. During the design stage of an algorithm, a simulation model provides an easy way to predict the performance or compare several alternatives. Further, a simulation model allows various algorithms to be compared under a wide variety of workloads. As we discussed in the previous chapter, processor allocation and scheduling problems on a 2D mesh system are NP-hard. Due to the complexity of these problems, a discrete event simulator was built to evaluate the performance of various processor allocation and scheduling schemes. In this chapter, we first present the simulation model in Section 3.1. Then, we give the workload characterization and performance metrics in Section 3.2 and Section 3.3, respectively. In Section 3.4, we describe the performance evaluation tool that we developed.

3.1 Simulation Description

In discrete event simulation, there are three basic operations that a simulation program needs to perform: event generation, event handling, and statistics collection. Event generation can be trace-driven or distribution-driven. In trace-driven simulation, the time of occurrence of the event and the magnitude of parameters associated with the event are obtained from measurements on a real system and then used directly in the simulation. Trace-driven simulations may be necessary if detailed characteristics of the input must be preserved. In most other cases, distribution-driven simulation provides a compact, reproducible, and easily modifiable model of the workload [35]. Event handling is usually performed by advancing the simulation clock to the occurrence of the next event, updating the system state correspondingly, and then enabling some other events.
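As a minimal, self-contained illustration of this clock-advance/state-update cycle, the fragment below shows a bare-bones distribution-driven event loop. It is not the code of our simulator (which is built on CSIM, as described below); all names in it are illustrative.

#include <queue>
#include <vector>
#include <functional>

// Illustrative only: a bare-bones discrete-event loop.
struct Event {
    double time;                      // simulated time of occurrence
    std::function<void()> handle;     // updates system state, may schedule new events
};

struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

std::priority_queue<Event, std::vector<Event>, Later> future_events;
double clock_now = 0.0;

void schedule(double delay, std::function<void()> handler) {
    future_events.push({clock_now + delay, std::move(handler)});
}

void run(double end_time) {
    while (!future_events.empty() && future_events.top().time <= end_time) {
        Event e = future_events.top();
        future_events.pop();
        clock_now = e.time;   // advance the simulation clock to the next event
        e.handle();           // update state; handlers may call schedule() again
    }
}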
Selecting a proper language is probably the most important step in the process of developing a simulation model. Special-purpose simulation languages have built-in facilities for time-advancing, random number generation, statistical data collection, and report generation. While these languages may save development time for experienced programmers, they usually are available only through special purchase and consume more system resources, e.g., CPU time. Further, beginners must spend some time learning the languages. On the other hand, a general-purpose language offers faster run-time speed, but programmers have to spend extra time developing routines for event handling, statistical data collection, and so forth. Another alternative is an extension of a general-purpose language. The extension consists of a collection of routines to handle tasks that are commonly required in simulations. It provides a compromise in terms of efficiency, flexibility, and portability. CSIM [52] is an example of such an approach.

Simulation of processor allocation and scheduling schemes requires a description of the system. The underlying target machine is a two-dimensional mesh-connected multiprocessor system. This simulation considers various mesh sizes ranging from 16 × 16 to 128 × 128. However, the simulation results for mesh systems of various sizes follow a similar trend; thus, we only report the performance results for a 32 × 32 mesh system in this dissertation. Each job J(w, h) arrives randomly in a stochastic stream and requests a rectangular submesh S(w, h). The system tries to find a free submesh for incoming jobs following a specific scheduling scheme and a submesh allocation strategy. If a free submesh is found for a job, it is allocated to the job until the job is finished. Then, the system deallocates the finished job and releases its processors. In this simulator, we have implemented various existing submesh allocation schemes as well as new allocation and scheduling schemes that we developed. In the next section, we describe the workload characterization in detail.

3.2 Workload Characterization

Simulation involves constructing a model for the behavior of the system and driving it with an appropriate abstraction of the workload. In our simulation model, basic input workload parameters include the scheduling discipline, the submesh allocation strategy, job inter-arrival times and service times, job sizes (widths and heights), and mesh sizes. Job inter-arrival times are assumed to have an exponential distribution with mean IATM. That is, the jobs arrive in a Poisson process. The Poisson process is commonly used to model independent jobs arriving randomly in the computer system. Job service (execution) times are also distributed exponentially, with a mean of EXTM. The width w and height h of the requested submesh are generated independently. We considered three distributions for w and h, namely, uniform, decreasing, and increasing distributions [13], to cover different characteristics of jobs. In the uniform distribution, w and h are equally likely to take any value from 1 through 32. In the decreasing distribution, the range of 1 through 32 is divided into four intervals, [1,4], [5,8], [9,16], and [17,32]. The distribution of w and h within each interval is still uniform. However, the probability that w or h falls into one of the intervals decreases as the values in the interval increase. The probabilities that the height or width falls in these intervals are P[1,4] = 0.4, P[5,8] = 0.2, P[9,16] = 0.2, and P[17,32] = 0.2. Essentially, a decreasing distribution on w and h represents a system with more jobs requesting small submeshes. Similarly, in an increasing distribution, the probability that w or h falls into one of the intervals increases as the values in the interval increase. So, for the increasing distribution, we have intervals [1,16], [17,24], [25,28], and [29,32], and the probabilities that the height or width falls in these intervals are P[1,16] = 0.2, P[17,24] = 0.2, P[25,28] = 0.2, and P[29,32] = 0.4. If w and h follow the increasing distribution, the system will have a greater chance of receiving jobs requesting large submeshes.
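The following sketch shows one way such a job stream could be drawn, using the decreasing distribution defined above together with exponential inter-arrival and service times. It is an illustration only; the constants mirror the parameters just described, and the generator layout is not taken from our simulator's source.

#include <random>

// Illustrative workload generator for the parameters described above.
struct Job { double arrival, service; int w, h; };

std::mt19937 rng(12345);   // a fixed seed gives every scheme the same job stream

int decreasing_side() {
    static const int    lo[] = {1, 5, 9, 17};
    static const int    hi[] = {4, 8, 16, 32};
    static const double p[]  = {0.4, 0.2, 0.2, 0.2};   // decreasing distribution
    std::discrete_distribution<int> pick(p, p + 4);
    int i = pick(rng);
    return std::uniform_int_distribution<int>(lo[i], hi[i])(rng);
}

Job next_job(double prev_arrival, double IATM, double EXTM) {
    std::exponential_distribution<double> interarrival(1.0 / IATM);
    std::exponential_distribution<double> service(1.0 / EXTM);
    return { prev_arrival + interarrival(rng),   // Poisson arrival stream
             service(rng),                       // exponential execution time
             decreasing_side(),                  // requested width
             decreasing_side() };                // requested height
}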
In order to compare the performance of various strategies under different loads, we define the load as follows:

    load = (n × EXTM) / (N × IATM),

where n is the mean requested submesh size (number of processors) of jobs, and N is the total number of processors in the 2D mesh system. In our simulation, we fix EXTM to be 10.0 time units and adjust IATM according to the desired load. The system will become unstable with too small an IATM value. For each separate simulation run, all random number streams are initialized with the same seed value to guarantee that each allocation and scheduling method receives the same stream of jobs.

3.3 Performance Metrics

We observe the behavior of the simulated system by calculating and analyzing various statistics, or performance metrics, which are outputs from the simulation experiment. The metrics of interest include: mean response time, standard deviation of response time, and system utilization. The response time of a job is defined as the time that elapses from the moment the job arrives in the system until it completes execution. In other words, it is the sum of the waiting time and the service time. Mean job response time provides a good, user-oriented perspective on performance. On the other hand, the standard deviation of response time complements the mean response time as a measure of performance. A scheme that reduces the mean response time may result in an increased standard deviation of response time, that is, the response time is more unpredictable. A scheme is considered better if it not only reduces the mean response time, but also reduces the standard deviation of response time under various job characteristics and loads.

The utilization of a resource is measured as the fraction of time the resource is busy servicing requests. Thus, for a single processor, the utilization with respect to time is defined as follows:

Definition 3.1 The utilization (with respect to time) of a single processor system over a given time period is

    utilization_time = time_busy / time_elapsed.

That is, it is the ratio of busy time to total elapsed time over a given period. The system we focus on is a multiprocessor system, and only a fraction of the processors may be used at a given time. For a multiprocessor system, the utilization is therefore usually defined with respect to space, i.e., processors, of the system at a time instant.

Definition 3.2 The utilization (with respect to space) of a multiprocessor system at a given time is

    utilization_space = (# busy processors) / (total # processors).

However, this utilization can further be averaged over time to give the overall system utilization over a given period of time. The overall utilization is a good measure from the system's point of view.

We record the activities (arrivals, departures, etc.) of all jobs and processors and the response times of all jobs. The results reported in this thesis are reasonably accurate. The error is less than 5% with a 90% confidence level. The method of independent replications [34] is used to achieve this level of precision. Each independent run performs 50,000 job allocations/deallocations, and data collected from the first 1000 jobs are discarded to eliminate the initial transient effect. As in studies by Chuang, Tzeng, Ding, Bhuyan, Li, Cheng, and Zhu [13, 25, 45, 57], the overheads of the allocation and scheduling algorithms are assumed to be negligible with respect to job execution time in our simulation study.
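The overall utilization reported in our results is the time average of the space utilization in Definition 3.2. One way such bookkeeping could be accumulated between allocation and deallocation events is sketched below; the structure and names are illustrative and are not taken from our simulator.

// Illustrative bookkeeping for the overall (time-averaged) utilization defined above.
struct UtilizationMeter {
    int    total_processors;           // N
    int    busy_processors = 0;        // currently allocated processors
    double last_event_time = 0.0;
    double busy_processor_time = 0.0;  // integral of busy processors over time

    void advance_to(double now) {      // call before each allocation/deallocation
        busy_processor_time += busy_processors * (now - last_event_time);
        last_event_time = now;
    }
    void allocate(double now, int submesh_size)   { advance_to(now); busy_processors += submesh_size; }
    void deallocate(double now, int submesh_size) { advance_to(now); busy_processors -= submesh_size; }

    // Overall utilization over [0, now] = time average of (busy processors / N).
    double overall(double now) {
        advance_to(now);
        return busy_processor_time / (total_processors * now);
    }
};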
To assess the validity of the negligible-overhead assumption, we measured the overhead of various strategies based on prototype implementations on a SUN Sparcstation 10. The actual average CPU time needed per allocation for the various strategies ranges from 0.2 ms for the Busy Distance Inverse strategy (to be introduced in Chapter 8) to 10 ms for the Best Fit strategy on a 32 × 32 mesh system. In contrast, the mean job execution time used for the simulation is 10 seconds. Thus, the assumption seems reasonable.

We have made extensive efforts to verify and validate the correctness of our simulation tool. We have self-consistency check points throughout the program; for example, we check that the arrival time, waiting time, execution time, and finishing time of each job match, and we compare the resulting system utilization with the input workload. We have cross-checked the results against published results where possible, both qualitatively and quantitatively. The simulation tool provides a valid comparison of various strategies. The next step, left as future work, is to validate the results against an actual implementation with real workloads.

3.4 Simulation and Experiment Analysis Tool

We have developed a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers. Sched-All is scalable in the size of mesh that can be simulated. It can evaluate the performance of various submesh allocation and job scheduling schemes on a mesh system of size up to 128 × 128. It includes three main components: (1) a simulator, (2) an experiment planner, and (3) an experiment analyzer. The core of the simulator is implemented in C++ using the CSIM simulation library [52]. CSIM is a process-oriented discrete-event simulation package for use with C or C++ programs. It provides a runtime library to support the necessary simulation procedures. The graphical user interface is designed using Tcl/Tk [51]. Tcl stands for Tool Command Language. It is a scripting language, and users can extend it by creating new commands implemented in C. Tk is the associated X windows toolkit of Tcl. Users can also implement custom widgets in C.

Sched-All supports a wide variety of processor allocation and scheduling schemes. Table 3.1 lists the current library of submesh allocation strategies, which includes some strategies discussed in Chapter 2 and two advanced strategies we developed (to be presented in Chapters 7 and 8). We also include three modified strategies in the simulation tool. We modify the First Fit and Best Fit strategies to consider 90° rotation of an incoming job, and we modify the Busy List strategy to consider all possible candidate submeshes along the boundaries.

Allocation strategy           Category    Chapter
Adaptive Scan                 first-fit   2
First Fit                     first-fit   2
First Fit (modified)          first-fit   2
Best Fit                      best-fit    2
Best Fit (modified)           best-fit    2
Busy List                     best-fit    2
Busy List (modified)          best-fit    2
Lookahead BFL(0)              best-fit    2
Lookahead BFL(1)              best-fit    2
Busy Distance Inverse         best-fit    8
Estimated Execution Time      best-fit    7

Table 3.1: Submesh allocation strategies supported by the simulator.

Table 3.2 lists the current library of job scheduling schemes, which includes some schemes discussed in Chapter 2 and two advanced strategies we developed (to be presented in Chapters 5 and 6). Figure 3.1 shows a screen dump of the main window of the simulator. The top portion of the window lists a set of buttons for specific functions. The bottom portion of the window is a status box.
A user can select the desired mesh size, allocation strategy, scheduling scheme, job characteristics, and load by pressing the corresponding button. After pressing the "Run" button, a message is displayed in the status box to indicate whether the simulation started successfully. By pressing the "Status" button, the current status of all simulation runs (either "finished", "still running", or "terminated") is displayed in the status box. For each simulation run, the simulator generates an output file that contains information such as the mean and variance of response time and the system utilization.

Table 3.2: Job scheduling schemes supported by the simulator.

Figure 3.1: The Sched-All simulator.

Figure 3.2 shows the screen dump of the main window of the experiment planner. An experiment is defined for our purpose as a set of simulation runs in which the runs are to be compared with one another via plotting on the same graph axes. The experiment planner is designed to set up the collection of data for an experiment comparing the performance of various processor allocation and scheduling schemes under different system loads. Users can use the "Load" button to specify the load range of interest. Then, similar to the simulator, users can use the "Mesh Size", "Allocation", "Scheduling", and "Job Char" buttons to select a desired environment and scheme, and then press the "Add" button to add this scheme to the current experiment. After selecting all the schemes to be compared, the desired data file for the experiment analyzer is generated by pressing the "Generate" button under "File". A pop-up window lets the user input a filename for the data file.

The experiment analyzer uses GNUPLOT [54] for plotting experimental data. It reads in the data file generated by the experiment planner, creates a GNUPLOT program which generates a postscript file of the plot, and then displays the plot on the screen using the postscript viewer utility Ghostview. Figure 3.3 shows a screen dump of the experiment analyzer and some pop-up windows for user inputs. Users can input the data filename, the GNUPLOT program name, and the postscript filename. Users can also specify the plot title, the x-axis and y-axis labels, and, for each scheme, the line type (solid, dashed, or dotted), marker type (diamond, plus, square, x, triangle, or asterisk), and text label (tag). Under the "Options" button, users can set the range of each axis, the tic positions, the key position, and log scaling. When the "Plot" button is pressed, a GNUPLOT program is created to generate a postscript plot file, and the plot is displayed on the screen. The experiment analyzer can be invoked from within the simulator or the experiment planner. It can also be used stand-alone, as a graphical user interface for GNUPLOT.

Figure 3.2: The Sched-All experiment planner.

Figure 3.3: The Sched-All experiment analyzer.

Figure 3.4 shows the content of an example data file, plot.dat. The first column corresponds to the x-axis values, and each additional column represents one curve to be plotted. Figure 3.5 shows the GNUPLOT program generated by the Sched-All experiment analyzer illustrated in Figure 3.3. Note that the x and y tics, the key position, and the x and y ranges are not specified in this example, so the default values are used. Figure 3.6 shows the screen dump of the plot.
Because of the poor resolution of the screen dump, the corresponding postscript version of the plot is also shown in Figure 3.7.

0.1 50.0 52.0 54.0 56.0 58.0 60.0 62.0
0.2 40.0 42.0 44.0 46.0 48.0 50.0 52.0
0.3 30.0 32.0 34.0 36.0 38.0 40.0 42.0
0.4 20.0 22.0 24.0 26.0 28.0 30.0 32.0
0.5 10.0 12.0 14.0 16.0 18.0 20.0 22.0

Figure 3.4: An example data file.

set term postscript eps 24
set output 'plot.ps'
set title 'My Title'
set xlabel 'My X-label'
set ylabel 'My Y-label'
set xtics
set ytics
set key
set nologscale x
set nologscale y
plot [:] [:] \
'plot.dat' using 1:2 title 'line 1' with linespoints 0 1, \
'plot.dat' using 1:3 title 'line 2' with linespoints 1 2, \
'plot.dat' using 1:4 title 'line 3' with linespoints 8 3, \
'plot.dat' using 1:5 title 'line 4' with linespoints 1 4, \
'plot.dat' using 1:6 title 'line 5' with linespoints 1 5, \
'plot.dat' using 1:7 title 'line 6' with linespoints 1 6, \
'plot.dat' using 1:8 title 'line 7' with linespoints 1 1

Figure 3.5: The generated GNUPLOT program.

Figure 3.6: The generated plot (screen dump).

Figure 3.7: The generated plot.

Chapter 4

Cost-Performance Assessment of Schemes

We have discussed some previous work in the area of submesh allocation and job scheduling in Chapter 2. While these submesh allocation strategies improve the overall system performance and reduce the system fragmentation, the actual processor utilization and external fragmentation resulting from these strategies still leave plenty of room for further improvement. Furthermore, some strategies reduce the mean job response time, but they have to pay the price of higher time complexity. It is very desirable to achieve high performance and low complexity at the same time. In this chapter, we first compare the performance of current processor allocation strategies and discuss the complexity versus performance issues in Section 4.1. Then, we discuss the limitations of the FCFS scheduling scheme and the possible performance gain due to scheduling schemes in Section 4.2.

4.1 Processor Allocation Strategies

Various submesh allocation strategies have been proposed. Performance comparison and analysis of these strategies are very desirable. In Section 4.1.1, we perform simulation experiments, present the results, and give a comprehensive performance comparison and analysis of existing submesh allocation strategies. Further, the limitations of these strategies and the complexity versus performance issues are discussed in Section 4.1.2.

4.1.1 Performance Analysis

The Two-Dimensional Buddy System (2DBS) is the first dynamic submesh allocation strategy ever developed. It causes large internal fragmentation if the requested submesh is not a square mesh with all sides being exactly powers of 2. The Frame Sliding (FS) strategy eliminates the internal fragmentation, but it sometimes misses an available free submesh and causes unnecessary external fragmentation. While these two strategies provided the foundation of this research area, their performance is surpassed by the succeeding strategies. We perform the following experiments and introduce selected results to study the performance of existing strategies.
Performance comparison and analysis of the existing submesh allocation strategies form the basis of our further research in this area. In the following experiments, two first-fit strategies, FF and AS, and three best-fit strategies, BF, BL, and BFL(1), are considered. We performed simulations to evaluate the performance of these five processor allocation strategies under different workloads and job characteristics. Figures 4.1, 4.2, and 4.3 illustrate the mean response time versus load for the uniform, increasing, and decreasing job distributions, respectively, in a 32 × 32 mesh system. Note that some of the strategies we compare here do not consider rotating the incoming jobs by 90°. So, in order to have a fair comparison, all requested submeshes are square submeshes.

Figure 4.1: Mean response time vs. load (w and h: uniform distribution).

Under the uniform job distribution (Figure 4.1), we can observe that the three best-fit allocation strategies (BF, BL, and BFL(1)) yield better performance compared with the two first-fit strategies (FF and AS). Under the increasing distribution (Figure 4.2), we can observe that the performance difference among the various strategies is not as significant as in the previous case. When jobs are distributed with the increasing distribution, the system tends to have more jobs requesting large submeshes. Eighty percent of the jobs request submeshes of size greater than or equal to 16 × 16 in a 32 × 32 system. Consequently, the effect on performance improvement due to the processor allocation strategy is diminished.

Figure 4.2: Mean response time vs. load (w and h: increasing distribution).

Figure 4.3: Mean response time vs. load (w and h: decreasing distribution).

Under the decreasing distribution, only 20% of the jobs request submeshes of size greater than or equal to 16 × 16 in a 32 × 32 system. So, the processor allocation strategy plays a critical role in improving the performance. As we can observe from Figure 4.3, the performance difference among the various strategies becomes much larger than in the previous two cases. The BL strategy has the best performance among all the strategies. The BF strategy has very similar performance compared with BL. The other best-fit strategy, Lookahead BFL(1), however, has similar performance compared with the first-fit AS strategy.

First-fit strategies allocate jobs to the first available location obtained, while best-fit strategies try to allocate jobs to the best location available. As we can observe from Figures 4.1 to 4.3, best-fit strategies yield better performance than first-fit strategies. This observation motivates our research in new and more advanced best-fit submesh allocation strategies. However, performance is not the only concern of strategies. We will discuss the complexity versus performance issues in the next section.
4.1.2 Complexity Analysis

As we have discussed in Chapter 2, various submesh allocation strategies have been proposed. Some schemes reduce the mean job response time, but they pay the price of higher time complexity. It is very desirable to achieve high performance and low complexity at the same time. Intuitively, best-fit strategies are more complicated than first-fit strategies. However, a well designed best-fit strategy may outperform first-fit strategies not only in performance, but also in complexity.

Table 4.1 summarizes the limitations and time complexity of various submesh allocation strategies. Figure 4.4 shows relative performance versus complexity trends of recent allocation strategies. In Figure 4.4 and Table 4.1, N denotes the number of processors in the mesh system, N_A denotes the number of jobs currently running in the system, and w and h denote the width and height, respectively, of the job being allocated.

Strategy     Category    Considers job rotation    Considers execution time    Complexity
2DBS         first-fit   no                        no                          O(log N)
FS           first-fit   no                        no                          O(N·N_A/(w·h))
FF           first-fit   no                        no                          O(N)
BF           best-fit    no                        no                          O(N)
AS           first-fit   yes                       no                          O(N × N_A)
BL           best-fit    yes                       no                          O(N_A³)
Lookahead    best-fit    no                        no                          O(N_A²)

Table 4.1: Limitations of existing submesh allocation strategies.

Figure 4.4: Performance vs. complexity of submesh allocation strategies.

The Two-Dimensional Buddy System (2DBS) is applicable only when the system is a square mesh with all sides being exactly powers of 2. Additionally, it causes large internal fragmentation if the requested submesh does not also satisfy this constraint. It has a time complexity of O(log N). The Frame Sliding (FS) strategy is applicable to any rectangular mesh system with arbitrary widths and heights, and it eliminates the internal fragmentation. The cost is that FS has a much higher time complexity, O(N·N_A/(w·h)). Moreover, the FS strategy sometimes misses an available free submesh and causes unnecessary external fragmentation. The AS strategy allows rotating the submesh request by 90° and searching the mesh system one more time if a free submesh is not found in the first pass. The AS strategy eliminates the allocation miss of the FS strategy, but its time complexity, O(N × N_A), is also higher than that of the FS strategy. The FF and BF strategies also eliminate the allocation miss encountered by the FS strategy, and each has a time complexity of O(N). However, they do not consider rotating the submesh request by 90° for possible allocation. The Busy List (BL) strategy has a time complexity of O(N_A³), and it is the first strategy whose time complexity is not related to N, the mesh size. However, its best-fit heuristic considers the number of busy neighbors only, neglecting busy nodes two or more hops away. Usually, N is much larger than N_A, so the goal of recent research is to develop strategies having complexity independent of the mesh size as well as having higher performance. The Lookahead strategies take a different approach: they examine the queue of waiting jobs and try to retain a free space large enough for the largest waiting job. The time complexity is further reduced to O(N_A²), but their performance is not as good as that of the BF and BL strategies.
While the submesh allocation problem is certainly well studied, the actual processor utilization and external fragmentation resulting from these strategies still leave plenty of room for further improvement. We argue that performance gains are still possible at little or no cost in additional overhead. Looking ahead to the new Estimated Execution Time (EET) and Busy Distance Inverse (BDI) strategies we propose in Chapters 7 and 8, respectively, we present their relative complexity and performance in Table 4.2 and Figure 4.5.

Strategy    Category    Considers job rotation    Considers execution time    Complexity
EET-SC      best-fit    yes                       yes                         O(N)
EET-SB      best-fit    yes                       yes                         O(N_A³)
BDI         best-fit    yes                       no                          O(N_A²)

Table 4.2: New submesh allocation strategies.

Figure 4.5: Performance vs. complexity of submesh allocation strategies.

4.2 Scheduling Schemes

Traditionally, submesh allocation strategies allocate jobs using an FCFS scheme. The allocation process always starts from the front of the waiting queue, so the scheduling complexity of the FCFS scheme is O(1). However, when a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system. Krueger, Lai, and Radiya have claimed that job scheduling is more important than processor allocation for hypercube computers [40]. They have developed a family of job scheduling schemes called Scan, which puts jobs into separate waiting queues according to their dimensions. All jobs of a certain dimension are allocated in FCFS order until the queue is empty, and then the dispatcher allocates jobs in the next queue. It takes constant time to put a job in the waiting queue corresponding to its dimension, and it also takes constant time to allocate jobs in a queue in FCFS order. So, the scheduling complexity of the Scan scheduling schemes is also O(1). However, the Scan scheduling schemes improve performance dramatically over the FCFS scheme, both by reducing the mean job response time and by increasing the overall system utilization. This work motivates our research on developing effective and efficient job scheduling schemes for 2D meshes. In the following two chapters, we concentrate on job scheduling issues and introduce our development of advanced job scheduling schemes for 2D mesh systems. In Chapters 7 and 8, we present our development of submesh allocation and mesh partitioning strategies.

Chapter 5

Job Scheduling Schemes

Krueger, Lai, and Radiya have claimed that job scheduling is more important than processor allocation for hypercube computers [40]. They developed a family of job scheduling schemes called Scan, which improve mean job response time dramatically compared with the FCFS and some other job scheduling schemes. However, job scheduling schemes for hypercube computers are not applicable to two-dimensional mesh computers. Traditionally, submesh allocation strategies allocate jobs by an FCFS policy. When a job arrives, the dispatcher allocates the job only if the waiting queue is empty and a free submesh is found; otherwise it puts the job at the end of the waiting queue.
Upon the completion of a job, the processors in the system that were allocated to the job are released, and the dispatcher checks whether jobs at the front of the waiting queue can be allocated. When a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system. To reduce the external fragmentation and avoid the potential performance loss due to blocking, different job scheduling policies need to be considered.

In this chapter, we propose two starvation-free job scheduling schemes, Immediate Fit (IF) and Scan All (SA), which improve the mean response time and system utilization dramatically for 2D mesh systems. The detailed strategies and performance results under different load conditions and job characteristics are given in the following sections.

5.1 Immediate Fit Strategy

When a job arrives in the system, there may exist a large enough submesh to be allocated to it. However, this job will not be allocated under the FCFS policy if the waiting queue is not empty. Incorporating job scheduling strategies into processor allocation can allow a job to be allocated when possible. In the Immediate Fit (IF) strategy, we allocate the incoming job immediately if it fits in a free submesh in the system. However, allowing a newly arrived job to enter the system will reduce the chance of freeing processors for a waiting job with a large submesh request and, hence, result in a starvation problem. The possibility of encountering starvation will be higher with a heavier system load. One solution to this starvation problem is to enforce a waiting time limit. When a job arrives, the dispatcher tries to allocate the job only if there are no jobs in the waiting queue exceeding the waiting time limit. Otherwise, it simply puts the job at the end of the waiting queue. When a job completes execution and processors are released, the dispatcher checks whether jobs at the front of the waiting queue can be allocated, and allocates them when possible. The algorithm for the Immediate Fit strategy is given in Figure 5.1.

When a job J arrives:
1. check if the waiting queue is empty;
   if the waiting queue is not empty, then
      retrieve the job, J_f, at the front of the waiting queue;
      if J_f has exceeded its waiting time limit, then
         put J in the waiting queue and stop;
      endif;
   endif;
2. use the desired allocation strategy to find a free submesh for J;
   if a free submesh is available, then
      allocate processors for J;
   else
      put J in the waiting queue;
   endif;

Upon completion of a job J:
1. release processors allocated to J;
2. retrieve the job, J_f, at the front of the waiting queue;
3. use the desired allocation strategy to find a free submesh for J_f;
   if a free submesh is available, then
      remove J_f from the waiting queue and allocate processors for J_f;
      goto step 2;
   endif;

Figure 5.1: The IF scheduling scheme.

5.2 Scan All Strategy

When a job completes execution, a submesh is freed. Jobs which previously could not be allocated can now be considered for possible allocation. In the Immediate Fit strategy, the allocation process will be blocked if the job at the front of the queue can not be allocated. Here, we discuss a Scan All (SA) scheme, which scans the entire queue for possible allocation of any job in the queue upon the completion of a job. When the system identifies a job that can be allocated, it allocates the appropriate submesh to it.
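To make the control flow of Figure 5.1 concrete, the fragment below sketches the IF arrival path and the waiting-time-limit guard in C++. The data structures and helper names (find_free_submesh, allocate, WAIT_LIMIT) are illustrative stand-ins, not the interfaces used in our simulator.

#include <deque>
#include <optional>

// Illustrative sketch of the IF scheme of Figure 5.1.
struct Submesh { int x, y, w, h; };
struct Job { int w, h; double arrival; };

constexpr double WAIT_LIMIT = 500.0;   // waiting time limit (Section 5.3 uses 500)

std::deque<Job> waiting_queue;

// Stand-ins for the chosen allocation strategy (e.g., FF, BF, or BL).
std::optional<Submesh> find_free_submesh(const Job&) { return std::nullopt; }
void allocate(const Job&, const Submesh&) {}

void on_job_arrival(const Job& j, double now) {
    // A waiting job that has exceeded its limit blocks newcomers (starvation guard).
    if (!waiting_queue.empty() &&
        now - waiting_queue.front().arrival > WAIT_LIMIT) {
        waiting_queue.push_back(j);
        return;
    }
    if (auto s = find_free_submesh(j)) {   // Immediate Fit: try the newcomer at once
        allocate(j, *s);
    } else {
        waiting_queue.push_back(j);
    }
}

void on_job_completion() {
    // IF stops at the first job at the head that cannot be placed.
    while (!waiting_queue.empty()) {
        auto s = find_free_submesh(waiting_queue.front());
        if (!s) break;
        allocate(waiting_queue.front(), *s);
        waiting_queue.pop_front();
    }
}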
The only difference between the SA scheme and the IF scheme is the procedure followed upon completion of a job J. This is shown in Figure 5.2.

Upon completion of a job J:
1. release processors allocated to J;
2. retrieve the job, J_f, at the front of the waiting queue;
3. use the desired allocation strategy to find a free submesh for J_f;
   if a free submesh is not available, then
      if J_f has exceeded its waiting time limit, then
         stop;
      endif;
   else
      remove J_f from the waiting queue and allocate processors for J_f;
   endif;
4. if J_f is the last job in the waiting queue, then
      stop;
   else
      retrieve the next job in the waiting queue and let it be J_f;
   endif;
5. go to step 3;

Figure 5.2: The SA scheduling scheme.

5.3 Performance Results

We performed simulations to evaluate the performance of the IF and SA scheduling schemes compared with the FCFS policy. The scheduling policies are combined with various processor allocation strategies. The Adaptive Scan (AS) and Busy List (BL) schemes were chosen to represent the first-fit and best-fit allocation strategies.

Figure 5.3 illustrates the system utilization against load using the AS and BL allocation strategies with the FCFS, IF, and SA scheduling schemes. Here, the waiting time limit is set to 500. A uniform distribution is considered for the height and width of the requested submeshes. The maximum system utilization achieved using the AS or BL allocation strategy with the FCFS scheduling scheme is 51.3% and 53.7%, respectively. When the AS and BL allocation strategies are combined with the IF scheduling scheme, the system utilization is improved to 53.9% and 55.8%, respectively. The SA scheduling scheme improves the system utilization much more significantly. The system remains stable with a load up to 0.7.

Figure 5.3: System utilization vs. load (w and h: uniform distribution).

It is interesting to compare the curves as follows. Observe the FCFS curves (dashed and solid lines with square markers). Note the differential, which is due to the allocation strategy. Next, compare the dashed lines only. Similarly, compare the solid lines only. These differentials are due to the scheduling strategy. In most cases, the latter differentials (due to scheduling) are greater.

We measure the system performance based on the mean response time. To have an accurate measurement, only stable systems are considered. From Figure 5.3, a reasonable range of load is chosen to guarantee system stability. Figure 5.4 illustrates the mean response time under different loads (from 0.1 to 0.5) using the AS and BL allocation strategies combined with the various scheduling schemes. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.2), all allocation strategies have similar performance. Most jobs will be served promptly after they enter the system. For loads greater than 0.3, the IF and SA scheduling schemes show significant performance gains. When the load reaches 0.5, the IF scheduling scheme reduces the mean response time by more than 48% for both the AS and BL allocation strategies, and the SA scheduling scheme reduces it by more than 73% for both. The standard deviation of the response time is illustrated in Figure 5.5.
We can observe that the standard deviations are about the same magnitude as their corresponding means. The high variability of the response times is in the nature of this problem: some jobs may get serviced promptly, while others may encounter long waiting delays in the queue. When evaluating a system of considerable variability, we should consider not only the mean response time, but also its standard deviation. One scheduling scheme may result in a lower mean response time but a higher standard deviation. As we can conclude from Figure 5.4 and Figure 5.5, the performance gained by using the IF and SA scheduling schemes versus the FCFS policy is far more significant than the improvement obtained due to the use of a better allocation strategy.

Figure 5.4: Mean response time vs. load (w and h: uniform distribution).

Figure 5.5: Standard deviation of response time vs. load (w and h: uniform distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, the decreasing and increasing distributions for the width w and height h of the submeshes requested. These distributions are described in Section 3.2. The system becomes unstable if the load reaches 0.5 when w and h are under the decreasing distribution and FCFS scheduling is used. Thus, we report the results up to a load of 0.4 for the decreasing distribution case. Figures 5.6 and 5.7 show the mean and standard deviation of response time versus load, respectively, for the various scheduling schemes under the decreasing distribution of job requests. The relative performance among the various scheduling strategies remains the same, except that the IF scheme yields a larger standard deviation than the FCFS policy. With a decreasing distribution, the system has more small jobs than large ones. As more small jobs come into the system and get allocated by the IF scheduling policy, large jobs at the front of the waiting queue face potentially longer delays in getting allocated.

The simulation results for the various strategies under the increasing distribution of job requests are presented in Figures 5.8 and 5.9. Performance results similar to those of the uniform distribution are observed here: the IF scheduling scheme performs better than the FCFS scheme, and the SA scheduling scheme performs better than the IF scheme. However, for the increasing distribution, the performance difference among the various strategies is not as significant as in the previous two cases (uniform and decreasing). This moderate performance improvement for the increasing distribution of job requests is expected. The system tends to have more jobs requesting large submeshes in an increasing distribution and, hence, there is a reduced chance of successfully allocating a job that is not at the front of the queue. Consequently, the effect on performance improvement due to the scheduling policy is diminished.

The main reason why the IF and SA scheduling schemes drastically improve the performance is that these schemes give priority to jobs with smaller submesh requests.
Figure 5.6: Mean response time vs. load (w and h: decreasing distribution).

Figure 5.7: Standard deviation of response time vs. load (w and h: decreasing distribution).

Figure 5.8: Mean response time vs. load (w and h: increasing distribution).

Figure 5.9: Standard deviation of response time vs. load (w and h: increasing distribution).

Thus, it would be interesting to analyze the effects of these scheduling policies on jobs of different requested sizes. Consider a 32 × 32 mesh. The maximum number of nodes a job can request is 1024. We divided the jobs into four groups according to their requested submesh sizes, namely, jobs requesting 1 to 256 nodes, 257 to 512 nodes, 513 to 768 nodes, and 769 to 1024 nodes. The mean and standard deviation of response time are measured for each job group under the different scheduling schemes. In this simulation, we assume the width and height of the requested submesh are distributed under the uniform distribution. The Busy List strategy is adopted for submesh allocation. The results are shown in Figure 5.10 and Figure 5.11 for a load of 0.5. We can see that the IF and SA scheduling schemes reduce the mean and standard deviation of response time for all job groups, though with more improvement for jobs with smaller submesh requests.

Figure 5.10: Mean response time vs. job size (w and h: uniform distribution).

Figure 5.11: Standard deviation of response time vs. job size (w and h: uniform distribution).

As discussed earlier, a waiting time limit is introduced to resolve the potential starvation problem. Jobs which are not at the front of the waiting queue will not have the chance to be allocated if any job has exceeded its waiting time limit. To study the effect of the waiting time limit on the system performance, we conducted the following simulation: the IF and SA scheduling schemes are each combined with the BL allocation strategy. A uniform distribution is considered for the width and height of the requested submeshes. The load of the system is set to 0.5. The response time is measured against various waiting time limits. Figure 5.12 presents the simulation results. As the waiting time limit decreases, the effect of using the IF or SA scheduling policy becomes less significant and, hence, less performance improvement can be obtained.
Figure 5.12: Mean and standard deviation of response time vs. waiting time limit (w and h: uniform distribution).

Essentially, a scheduling policy reduces to the FCFS policy when the waiting time limit is zero. If the waiting time limit is greater than 500, all scheduling schemes perform as if the waiting time limit were infinite. Actually, for both the IF and SA scheduling schemes, setting the waiting time limit to 100 already gives very good performance gains. For both scheduling schemes, the performance difference between a waiting time limit of 300 and infinity is very small. From the results, we can see that the SA and IF scheduling schemes work well with a reasonable value of the waiting time limit.

5.4 Discussion

In this chapter, we presented two new job scheduling schemes, Immediate Fit and Scan All. A waiting time limit is introduced to avoid a starvation problem. Simulation results show that our scheduling schemes significantly improve the system utilization and improve the mean and standard deviation of response time for different load conditions and job characteristics. More precisely, our SA scheduling scheme can handle a load up to 0.7 and results in a 70% system utilization. For a moderate load of 0.5, the SA scheme reduces the mean response time of the FCFS policy by more than 75%. The simulation results show that we can obtain significant performance gains with a reasonable value of the waiting time limit. The Scan All scheme has greater overhead than the FCFS and IF schemes because it scans all jobs in the waiting queue. However, for uniformly distributed width and height, the mean queue length for a system load of 0.5 is only 3.6. IF and SA not only favor smaller jobs, but they also improve the performance of larger jobs by reducing the waiting queue length. So, the mean response times for jobs of all sizes are improved.

From the simulation results, we have observed that when the load is low to moderate, the performance difference among the various scheduling and processor allocation schemes is small. So, when the system is not busy, most jobs will be served promptly after they enter the system, no matter which processor allocation and job scheduling scheme we use. When the load is heavy, the performance difference among the various schemes increases, and better schemes yield bigger performance gains. So, we consider a scheme better than others if it gives better performance when the load is heavy. This observation assists in comparing our work with other recently reported results. As we presented in Chapter 2, the priority and reservation techniques for job scheduling and submesh allocation were proposed in ICPP '94 (August 1994) [23]. The priority technique alone, i.e., when reservation is not allowed, improves the performance. However, the reservation technique alone does not improve the performance, because a large number of free processors are wasted while reserved. A combination of the priority and reservation techniques results in a lower waiting delay than the priority technique alone for low to moderate loads. However, for high loads, the priority technique alone yields the best results. Since the priority technique gives better performance when the load is heavy, we use it as a basis for comparison.
The priority technique is very similar to the SA scheme that we proposed: when a job finishes and releases processors, an attempt is made to allocate the jobs from the waiting queue in an FCFS fashion; when a job arrives, the jobs in the waiting queue do not block the allocation process unless MAX_PRI subsequent jobs arriving after the job at the head of the queue have been allocated. This is done to avoid starvation of jobs requiring larger submeshes. Each job waiting in the queue keeps track of the number of jobs arriving later than it that have been allocated. We take a different approach to solving the starvation problem by imposing a waiting time limit. If the waiting time limit is infinity in SA and MAX_PRI in the priority scheme is also infinity, we have determined that SA and the priority scheme behave exactly the same.

Chapter 6

Job Scheduling Using Multiple Queues

In the Scan and Lazy job scheduling schemes for hypercube computers [39, 49], multiple queues are used. Instead of using a single job waiting queue, r + 1 queues are maintained for an r-dimensional hypercube system, one for each possible subcube dimension. These two schemes improve mean job response time dramatically relative to schemes that do not cluster equal-sized jobs. In a 2D mesh system, jobs request submeshes of various sizes and shapes. It is not trivial to cluster submesh requests into separate queues for job scheduling. In this chapter, we investigate the effect of various ways of using multiple queues for job scheduling in 2D mesh systems. A job scheduling scheme using multiple queues is developed to further improve the performance of the SA scheduling scheme presented in Chapter 5. Performance results show that the Multiple Queues scheme further reduces the mean response time under all load conditions and different job characteristics.
But in development environments, users rarely know how long their programs will execute. If users know that the system is designed to favor jobs with small estimated exe- cution times, they may give small estimates. However, the scheduler can be designed to remove this temptation. First, the user can be forewarned that if the job runs longer than estimated, it will be terminated and the user will be charged for the work. Second, the system can run the job for the estimated time plus a small per- centage extra, and then suspend it and it may be restarted at a later time. The user would pay for the suspending/ restarting overhead. Another solution is to run the job for the estimated time at normal billing rates, and then to charge a premium rate for additional execution time. Under this arrangement, the user providing low run-time estimates to obtain better service will pay a sharp premium. Neither number of processors nor execution time gives a complete indication of workload. The service demand (the product of execution time and number of pro- cessors required) gives a better indication of the workload. The Smallest-Service- Demand-Job-First scheme is evaluated in Section 6.3.2, alone with each of the schemes named above. A family of job scheduling schemes, called Scan, was proposed by Krueger, Lai, and Radiya in [39] for hypercube computers. It is similar to the C-Scan disk-scheduling discipline [53]. The basic idea behind the Scan strategies is to classify jobs according to their sizes and put them into separate queues. We can either scan up from the queues 77 with smaller jobs or scan down the queues with larger jobs. For an r-dimensional hypercube, it maintains r + 1 separate queues (queues 0 to r) for requested subcubes having dimensions 0 to r. When a job arrives, it goes to the end of the queue corresponding to its dimension. All jobs of a certain dimension will be allocated in the FCFS order until the queue is empty, and then the dispatcher will allocate jobs in the next queue. The queues are scanned from small dimension to large dimension, i.e., from queues 0 to r, (ScanUp) or from queues r to 0 (ScanDown) in a circular way. However, starvation is possible for both ScanUp and ScanDown. To solve this problem, before allocating jobs in a specific queue i, the job dispatcher moves all the jobs in queue i to another queue called the current-queue. It allocates jobs in the current_queue until it is empty. Then it moves the jobs in the next non-empty queue into the current_queue. Jobs arriving at queue i after the dispatcher moves the jobs from queue i to the current_queue must wait until the next time the job dispatcher visits queue i. In the next section, we introduce a job scheduling scheme using multiple queues for 2D mesh systems. Several methods of clustering job requests are proposed. Per- formance results are given in Section 6.3. 6.2 Multiple Queues Scheme Several problems arise when we try to establish a scheme using multiple queues for 2D mesh systems. First of all, what is the best criterion to classify jobs into different queues? Secondly, what is an optimal number of queues? Although, it is straightfor- 78 ward to have r + 1 queues for subcube requests from dimensions 0 to r, there is no natural way to decide how many queues are needed for 2D mesh systems to attain optimal performance. For hypercube systems, every job in the same queue has ex- actly the same size, so the allocation process stops when a job in the queue can not be allocated. 
For a 2D mesh system, we have different sizes and shapes of submesh requests in the same queue. Even two submesh requests having the same number of processors may have different shapes. For example, having a 2 X 6 submesh request in the front of the queue that can not be allocated does not mean a 3 X 4 submesh request can not be allocated either. For the Multiple Queues scheme, we propose three criteria for classification of submesh requests: total number of processors, the larger side length of the requested submesh, and the smaller side length. The performance of different criteria is eval- uated in Section 6.3.1. The optimal number of queues needed is also determined by the simulation performed in Section 6.3.1. Assume that we have Q queues (numbered 1 to Q) and N processors in the system. In the Multiple Queues scheduling scheme, the scan process always starts from queue 1. If we use scan up, and the number of processors requested is the job classification criterion, then a job which requests x processors will go to queue [AIL/Q]. Similarly, if we use scan down, then a job which requests x processors will go to queue Q— [N15] +1. If we use the larger (smaller) side of the requested submesh as the classification criterion, then the job with the larger (smaller) side length x will go to queue [la/L62] for scan up or queue Q — [la/LQl + l for scan down, where L is the length of the larger (smaller) side of the mesh system. In the Multiple Queues scheme, the job dispatcher scans through jobs in all queues 79 in a specific order, from queues of large jobs to queues of small jobs (scan down) or vice versa (scan up). Applying the Scan All scheme presented in Chapter 5 to each waiting queue, the allocation process does not block if the job in the front of the queue can not be allocated. If a job can not be allocated, the dispatcher attempts to allocate the next job in the queue. Additionally, when a job arrives, there might exist a large enough submesh for it. We try to allocate an incoming job immediately if it fits in a free submesh in the system (Immediate Fit). The incoming job is put in a queue only if there is no free submesh found for it. However, allowing a newly arrived job to enter the system reduces the chance of freeing processors for a waiting job having a large submesh request and, hence, results in a starvation problem. The possibility of encountering starvation is higher with a heavier system load. Our solution to this starvation problem, as in the schemes of Chapter 5, is to enforce a waiting time limit. If the job dispatcher tries to allocate a job in a waiting queue but no free submesh is found for it and it has exceeded its waiting time limit, the job dispatcher stops the scan process and does not allocate the next job in the queue. When a job arrives, the dispatcher tries to allocate the job only if there are no jobs in any waiting queue exceeding the waiting time limit. Otherwise, it simply puts the job at the end of the waiting queue corresponding to its size. The detailed algorithm for a system with Q waiting queues is listed in Figure 6.1. In the next section, we will perform simulation runs to see which is the best job classification criterion and what is the optimal number of queues needed. 80 When a job J arrives: l. initialize the scan from waiting queue i = 1; 2. if waiting queue i is empty, then goto step 4; else retrieve the job in the front of waiting queue i and let it be J 1’; 3. 
3. if Jf has exceeded its waiting time limit, then put J in the appropriate waiting queue and stop;
4. if i < Q (waiting queue i is not the last queue), then i = i + 1; goto step 2;
5. use the desired allocation strategy to find a free submesh for J; if a free submesh is available, then allocate processors for J; else put J in the appropriate waiting queue and stop.

Upon completion of a job J:
1. initialize the scan from waiting queue i = 1;
2. release the processors allocated to J;
3. retrieve the job in the front of waiting queue i and let it be Jf;
4. use the desired allocation strategy to find a free submesh for Jf; if a free submesh is available, then remove Jf from the waiting queue, allocate processors for Jf, and goto step 6;
5. if Jf has exceeded its waiting time limit, then stop;
6. if Jf is not the last job in waiting queue i, then retrieve the next job in waiting queue i, let it be Jf, and goto step 4;
   else if i < Q (waiting queue i is not the last queue), then i = i + 1; goto step 3;
   else stop;

Figure 6.1: The Multiple Queues scheduling scheme.

6.3 Performance Results

In this section, we conduct simulations to determine the best job classification criterion and the number of queues needed for the Multiple Queues scheme. Then we compare the performance of the Multiple Queues scheme with other scheduling schemes. In our simulation, the waiting time limit is set to be infinity. The width w and height h of requested submeshes are generated independently. We considered three distributions for w and h, namely, uniform, decreasing, and increasing distributions [13], to cover different characteristics of jobs. The processor allocation strategy we use is the BL strategy.

6.3.1 Performance of Variations of Multiple Queues Scheme

In the Multiple Queues scheme, jobs are classified into different queues using one of three criteria: total number of processors, the larger side of the requested submesh, or the smaller side. Two types of scan, ScanUp and ScanDown, are also considered. The following experiment is performed under a load of 0.5, and the widths and heights of jobs are distributed under the uniform distribution. The performance of the original ScanUp and ScanDown schemes (without scanning through all jobs in the waiting queues) is compared for various numbers of queues and different job classification criteria.

Figure 6.2 shows the mean response time versus number of queues for variations of the Scan schemes. When we have only a single queue, these schemes degrade to the FCFS scheme. If two queues are used, scan up and scan down will perform the same. The number-of-processors criterion performs the best among the three criteria. When the number of queues is larger than four, scan up performs a little better than scan down. This matches the results reported by Krueger, Lai, and Radiya [39]. As we can observe, using number of processors as the job classification criterion results in the best performance. If we use scan up and number of processors as the job classification criterion, the mean response time will be reduced by more than 40%. If we use the larger side as the job classification criterion, the queues for larger jobs contain many more jobs than the queues for smaller jobs. For example, if we have 32 queues, queue 1 contains only 1 × 1 jobs, and queue 32 contains jobs ranging from 1 × 32 to 32 × 32. Under such circumstances, jobs are unevenly distributed among the queues, so the effect of multiple queues diminishes.

Figure 6.2: Mean response time vs. # queues, load=0.5 (w and h: uniform distribution).
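The queue-assignment rules of Section 6.2 determine which waiting queue a job joins under each of the three classification criteria compared in Figure 6.2. The short Python sketch below shows one way that mapping might be computed; the function name and argument names are illustrative and are not taken from the simulator.

    import math

    def queue_index(w, h, Q, N, L, criterion="processors", scan_up=True):
        # Map a w x h submesh request to a waiting-queue index in 1..Q (sketch).
        # criterion "processors" uses w*h against the system size N;
        # "larger" and "smaller" use a side length against the mesh side length L.
        if criterion == "processors":
            x, bound = w * h, N
        elif criterion == "larger":
            x, bound = max(w, h), L
        else:  # "smaller"
            x, bound = min(w, h), L
        q = math.ceil(x * Q / bound)        # queue ceil(xQ/N) or ceil(xQ/L)
        return q if scan_up else Q - q + 1  # scan down reverses the numbering

For example, on a 32 × 32 mesh with Q = 32 queues and the number-of-processors criterion, a 3 × 4 request (12 processors) maps to queue 1 under scan up and to queue 32 under scan down.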
Next, we performed similar simulations to evaluate the performance of the Multiple Queues scheme presented in Figure 6.1, which scans through all jobs in the waiting queues. The width and height of jobs are under uniform, decreasing, and increasing distributions, and the results are illustrated in Figures 6.3, 6.4, and 6.5, respectively. Using only a single queue, the Multiple Queues scheme degrades to the Scan All scheme presented in Chapter 5. The performance of the SA scheme is much better than the FCFS scheme and the original Scan schemes because the SA scheme removes the blocking effect of the FCFS scheme. When the number of queues increases, ScanDown performs better than ScanUp. It is harder to find a free submesh for larger jobs than for smaller jobs, so the result is within our expectation. If we try to allocate smaller jobs first, it becomes even harder to allocate larger jobs, and the waiting time of large jobs will be much longer. Conversely, if we try to allocate larger jobs first, there might still exist some small free submeshes for smaller jobs. So, ScanDown is the better one to adopt for the Multiple Queues scheme. As the number of queues increases, the mean response time reduces if we use ScanDown, regardless of which job classification criterion is used.

Figure 6.3: Mean response time vs. # queues, load=0.5 (w and h: uniform distribution).

Figure 6.4: Mean response time vs. # queues, load=0.5 (w and h: decreasing distribution).

Figure 6.5: Mean response time vs. # queues, load=0.5 (w and h: increasing distribution).

When the system load is relatively light, it is difficult to observe performance differences. The results of increasing the load to 0.7 and repeating the experiments for ScanDown are plotted in Figures 6.6, 6.7, and 6.8.

Figure 6.6: Mean response time vs. # queues, load=0.7 (w and h: uniform distribution).

Apparently, larger side length is not a good criterion under heavy load. A job might request a submesh with a large value of the larger side length but very few processors. It then has priority to be allocated and can possibly block a larger job (in terms of number of processors) behind it from being allocated.
Number of processors and smaller side length have similar performance, and the former performs slightly better than the latter. In our opinion, number of processors gives a better indication of the real workload, and it performs the best among the three criteria, so we choose number of processors as the job classification criterion for the Multiple Queues scheme.

Figure 6.7: Mean response time vs. # queues, load=0.7 (w and h: decreasing distribution).

Figure 6.8: Mean response time vs. # queues, load=0.7 (w and h: increasing distribution).

As we can observe from Figures 6.6 to 6.8, the more queues we use, the better the performance. Also, the performance flattens out as the number of queues increases. As the number of queues increases from 1 to 32, the mean response time reduces from 79.91 to 56.57 time units for the uniform distribution, from 92.37 to 60.64 for the decreasing distribution, and from 98.66 to 76.50 for the increasing distribution. Using 32 queues for a 32 × 32 mesh system is just enough. So, the number of queues used in the Multiple Queues scheme is set to be the side length of the mesh system.

6.3.2 Comparing Multiple Queues Scheme with Other Schemes

In the following experiments, we compare the Multiple Queues scheme with other scheduling schemes under various load conditions. Here we use 32 queues, classify jobs by number of requested processors, and apply ScanDown to try to allocate large jobs first for the Multiple Queues scheme. The other scheduling schemes considered include: First-Come-First-Served (FCFS), Smallest-Job-First (SmallestJF), Largest-Job-First (LargestJF), Shortest-Job-First (SJF), and Smallest-Service-Demand-Job-First (SmallestSDJF). The mean response time using these six schemes for a load of 0.1 to 0.7 under uniform, decreasing, and increasing distributions is plotted in Figures 6.9, 6.10, and 6.11, respectively.

Figure 6.9: Mean response time vs. load (w and h: uniform distribution).

Figure 6.10: Mean response time vs. load (w and h: decreasing distribution).

Figure 6.11: Mean response time vs. load (w and h: increasing distribution).

The Multiple Queues scheme has the best performance. The main advantage of the Multiple Queues scheme is that it scans through all jobs in the waiting queues,
checking for possible allocation, while other schemes stop if a job in the queue cannot be allocated. For comparison purposes, we can modify the other schemes to scan all jobs in the waiting queue and to try to allocate jobs when they arrive. The FCFS scheme becomes the Scan All (SA) scheme proposed in Chapter 5, which is equivalent to the Multiple Queues scheme with a single queue. The results are shown in Figures 6.12, 6.13, and 6.14. The Shortest-Job-First and Smallest-Service-Demand-Job-First schemes have a slight advantage over the Multiple Queues scheme when the load is light. If the users can somehow estimate the execution times of their jobs and provide this information to the system, it is possible for the job scheduler to place shorter jobs first to achieve better performance. Overall, the Multiple Queues scheme performs the best if no execution time information is available.

Figure 6.12: Mean response time vs. load (w and h: uniform distribution).

Figure 6.13: Mean response time vs. load (w and h: decreasing distribution).

Figure 6.14: Mean response time vs. load (w and h: increasing distribution).

As discussed in Section 6.2, a waiting time limit is introduced to resolve the potential starvation problem. Jobs that are not in the front of the waiting queue will not be considered for allocation if any job has exceeded its waiting time limit. To study the actual effect of the waiting time limit on the system performance, we conducted the following experiment. A uniform distribution is considered for the width and height of the requested submeshes. The load of the system is set to 0.5. The response time is measured against various waiting time limits. Figure 6.15 presents the simulation results.

Figure 6.15: Mean response time vs. waiting time limit (w and h: uniform distribution).

As the waiting time limit decreases, more jobs will exceed their waiting time limits, rendering the scanning process ineffective in allocating other jobs. Hence, less performance improvement can be obtained. Setting the waiting time limit to 300 gives very good performance gains. If the waiting time limit is greater than 600, all scheduling schemes perform as if the waiting time limit is infinite. The performance difference between the waiting time limit being 300 and infinity is very small. Actually, the starvation problem could only happen when the system load is very heavy. Reducing the waiting time limit simply causes greater mean response time and longer queue length. From the results, we can see that the Multiple Queues scheme works well with a reasonable value of the waiting time limit.

6.4 Discussion

In this chapter, we have proposed a starvation-free job scheduling scheme, Multiple Queues, that can be used with any existing submesh allocation strategy.
It is based on the Scan All scheduling scheme presented in Chapter 5. The effect of various ways of using multiple queues for job scheduling in 2D mesh systems is investigated. Simulation results show that the Multiple Queues scheme significantly reduces the mean response time of existing allocation strategies under different load conditions and job characteristics. A higher system utilization and lower internal fragmentation are also achieved.

Babbar and Krueger recently proposed a Scan scheduling discipline for 2D mesh systems that also employs multiple queues [3]. However, it is only applicable when the mesh system and the requested submeshes are square meshes. For a square mesh of size L × L, it maintains L separate queues, one for each possible size of submesh request. On arrival, an incoming job joins the end of the queue corresponding to the size of the submesh it requires. All jobs in a certain queue are allocated before the scheduler moves to the next queue. This approach has a blocking effect similar to the FCFS scheme, that is, larger jobs may block smaller jobs from entering the system. For a system accepting jobs requesting arbitrary sizes and shapes, their approach maintains only a single ordered queue. For ScanUp, jobs are ordered from the smallest to the largest number of processors requested, while for ScanDown, jobs are ordered from largest to smallest. Actually, ScanUp and ScanDown are exactly the same as the Smallest-Job-First and Largest-Job-First scheduling policies, respectively, when only a single queue is maintained. On the other hand, the Multiple Queues scheme we presented is applicable when the mesh system and the requested submeshes are not square. Further, it performs better than Smallest-Job-First and Largest-Job-First under all circumstances.

The job scheduling schemes presented in Chapters 5 and 6 can be used with any processor allocation strategy and further improve allocation performance compared with the FCFS policy. In the next two chapters, we will focus on processor allocation strategies instead of job scheduling schemes.

Chapter 7
Estimated Execution Time Allocation Strategies

We have performed simulation to study the performance of existing strategies and found that the different completion times of the jobs in the system can result in busy processors being scattered throughout the mesh, thus causing large external fragmentation.

Execution time information is useful for job scheduling. For example, the Shortest-Job-First (SJF) scheduling discipline selects the waiting job with the smallest estimated run-time-to-completion to run next. Having an estimated execution time for the job submitted to the processor allocator, the system can try to allocate jobs with similar completion times in the same area. In practice, since most jobs running on parallel systems are programs submitted repeatedly [29], users usually can quite accurately estimate the execution time of their jobs.

In this chapter, we present a family of processor allocation strategies, called Estimated Execution Time (EET), which utilize the estimated execution time information. In the next section, we give the basic idea of the EET strategies. Then we provide detailed algorithms in Section 7.2. The performance results are given in Section 7.3.

7.1 Basic Idea and Heuristics

The basic idea of our EET strategies is to allocate a job in close proximity to jobs with similar estimated completion times.
The estimated completion time (ECT) of a job in the system is the time it was allocated plus its estimated execution time. The estimated completion time of the job which is currently being allocated is the current time plus its estimated execution time. The estimated completion time difference (ECTD) is calculated for each boundary node of a candidate submesh and its neighbor. If a neighbor is free, the ECTD is the estimated execution time of the job. If a neighboring node is busy, the ECTD is defined as the absolute value of the estimated completion time difference of the two neighboring jobs. If the neighboring node is beyond the boundary, the ECTD is zero. Because of inaccurate estimation, it is possible to encounter a situation in which the ECT of a neighboring job is less than the current time. If this happens, the current time is used to represent the ECT of the neighboring job, and the ECTD is merely the estimated execution time of the job currently being allocated. The total estimated completion time difference (TECTD) of a candidate submesh is the sum of the estimated completion time differences of all its neighboring nodes. Consider the 8 × 8 mesh in Figure 7.1.

Figure 7.1: A two-dimensional mesh M(8, 8).

Two jobs, X and Y, are currently in the system with estimated completion times x and y, respectively. Submeshes A and B are two candidate submeshes for a job Z requesting a 3 × 4 submesh with estimated completion time z, where z is the current time t plus the estimated execution time z′ of Z. If both x and y are greater than t, candidate submesh A has a total estimated completion time difference of 3 × |z − x| + 2 × |z − y| + 2 × z′. Candidate submesh B has a total estimated completion time difference of 3 × |z − x| + 4 × z′. If t is greater than x and t is less than y, A has a total estimated completion time difference of 3 × z′ + 2 × |z − y| + 2 × z′ and B has a total estimated completion time difference of 3 × z′ + 4 × z′. If both x and y are less than t, A has a total estimated completion time difference of 3 × z′ + 2 × z′ + 2 × z′ and B has a total estimated completion time difference of 3 × z′ + 4 × z′. The objective is to choose the candidate submesh with the smallest total estimated completion time difference.

7.2 Detailed Algorithms

We first propose the Estimated Execution Time-Search Boundaries (EET-SB) strategy. It searches candidate submeshes which are adjacent to the boundaries of the mesh system or to allocated submeshes. The searching procedure is similar to the BL strategy, but the EET-SB strategy will consider all possible candidate submeshes along the boundaries. After each successful submesh allocation, the system must record the estimated completion time of that job and update the busy list for future allocation consideration. After each job completes and releases its processors, the busy list again needs to be updated. The detailed algorithm is presented in Figure 7.2.

The Best Fit allocation strategy proposed by Zhu is designed to find corners in the coverage array. A major reason why the BF strategy performs worse than the BL strategy is that the BF strategy does not allow rotating the requested submesh by 90 degrees.
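Both EET strategies presented next rank candidate submeshes by this TECTD. The following Python sketch shows one way the score might be computed for a single candidate; the neighbor encoding, the ect dictionary, and the argument names are assumptions made for illustration, not the data structures of the EET implementation.

    def tectd(neighbors, ect, z_prime, now):
        # Total estimated completion time difference of one candidate (sketch).
        # neighbors: one (kind, job_id) pair per node bordering the candidate,
        #            where kind is "free", "busy", or "boundary".
        # ect:       estimated completion times of jobs already in the system.
        # z_prime:   estimated execution time of the job being allocated.
        # now:       current time, so the incoming job's ECT is now + z_prime.
        z = now + z_prime
        total = 0.0
        for kind, job_id in neighbors:
            if kind == "boundary":
                continue                   # beyond the mesh boundary: ECTD = 0
            elif kind == "free":
                total += z_prime           # free neighbor: ECTD = z'
            else:
                # Busy neighbor: clamp a stale ECT to the current time, so the
                # ECTD degenerates to z' exactly as described above.
                total += abs(z - max(ect[job_id], now))
        return total

The allocator would evaluate this score for every candidate submesh and keep the one with the smallest value.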
We propose the Estimated Execution Time-Search Corner (EET-SC) processor allocation strategy in Figure 7.3. The corner-searching procedure is similar to that in the BF strategy, but we allow rotating the requested submesh. Simulation results and performance comparisons are given in the next section.

Allocate J(w, h)
1. if busy_list = null, then allocate J(w, h) to <0, 0, w − 1, h − 1>; stop; else initialize min_TECTD = ∞ and TECTD = 0;
2. generate candidate submeshes for J(w, h) and J(h, w) on the four corners of the mesh system and put them in the candidate_list;
3. for each job B in the busy_list do generate candidate submeshes for J(w, h) and J(h, w) that lie within the boundaries of the mesh system along the four boundaries of B and put them in the candidate_list; end for;
4. get the first submesh S in the candidate_list;
5. for each job B in the busy_list do if S overlaps with B, then remove S from the candidate_list and goto step 8; else if S is adjacent to B, then multiply their ECTD by the length of their common boundary, add it to TECTD, and mark those neighboring nodes of S as busy; end for;
6. multiply the number of free neighboring nodes of S by the EET of J(w, h), and add it to TECTD;
7. if TECTD < min_TECTD, then choice_submesh = S; min_TECTD = TECTD;
8. if S is not the last submesh in the candidate_list, then get the next submesh in the candidate_list, let it be S, and goto step 5;
9. if the candidate_list is empty, then put J(w, h) in the waiting queue and stop; else allocate J(w, h) or J(h, w) to choice_submesh and stop;

Figure 7.2: The EET-SB allocation strategy.

Allocate J(w, h)
1. compute the coverage array for J(w, h);
2. scan all the rows in the coverage array and mark the two end elements of each sequence of consecutive 0's as end;
3. scan all the columns in the coverage array; if any element is an end element of a consecutive sequence of 0's in a column, and it is previously marked as end, then add it to the corner_list;
4. compute the TECTD for the candidate submesh based at each corner;
5. compute the coverage array for J(h, w);
6. follow steps 2, 3, and 4 to identify corners and their corresponding TECTD;
7. if the corner_list is empty, then put J(w, h) in the waiting queue and stop; else choose the candidate submesh with the smallest TECTD and stop;

Figure 7.3: The EET-SC allocation strategy.

7.3 Performance Results

In this section, we present the simulation results comparing the performance of our EET strategies with the BL and BF strategies. In order to achieve the peak performance of the BL strategy, we implement it in such a way that it will consider all free neighboring nodes and always find the free submesh with the highest boundary value. In order to have a fair comparison, the BF strategy is implemented in such a way that it can rotate the requested submesh as in the EET-SC strategy. In addition to the parameters we described in Chapter 3, the estimated execution time for each job is needed for this simulation of the EET strategies. We generate an estimated execution time x for each job according to the exact job execution time y of the job by a normal distribution with mean y and standard deviation c × y, where c is a coefficient indicating the accuracy of the estimation. In our experiments, c ranges from 0 (perfect estimation) to 1. If the estimated execution time generated is negative, it will be set to zero. We measure the system performance based on the mean response time. To have an accurate measurement, only stable systems are considered.
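The estimated execution times used in these experiments are easy to generate; the sketch below mirrors the description above (a normal distribution with mean y and standard deviation c × y, with negative draws set to zero). The function name is illustrative rather than taken from the simulation tool.

    import random

    def estimated_execution_time(y, c, rng=random):
        # Draw an estimate for a job whose exact execution time is y (sketch).
        # c = 0 reproduces y exactly; larger c gives noisier estimates.
        estimate = rng.gauss(y, c * y)
        return max(estimate, 0.0)   # negative draws are clipped to zero

With c = 1, roughly 16% of draws from such a distribution fall below zero (one standard deviation below the mean), so a noticeable fraction of jobs end up with a zero estimate in that setting.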
Figure 7.4 illustrates the mean response time under different workloads using the BL and EET-SB strategies with the coefficient c equal to 0.0, 0.5, and 1.0. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.3), all allocation strategies have similar performance. Most jobs will be served promptly after they enter the system. For load greater than 0.4, the EET-SB strategy shows a performance gain. When the load reaches 0.5, the EET-SB strategy reduces the mean response time compared to the BL allocation strategy by 18% to 27% under different estimation accuracies. When the load reaches 0.6, the system becomes unstable, so we report the results only up to a load of 0.5.

Figure 7.4: Mean response time vs. load (w and h: uniform distribution).

The EET-SB strategy outperforms the BL strategy under all workloads when the width and height of requested submeshes are uniformly distributed. As the execution time estimation error decreases, the mean response time becomes smaller. When the coefficient c is 1, the standard deviation of the estimated execution time equals the exact execution time. The estimation error is very large under this condition, and over 16% of the jobs will have their estimated execution time set to zero. In a real system, it is reasonable that some users have no idea about the execution times of their submitted jobs. The system can set a default value or zero for those jobs for the purpose of allocation. In our simulation, the EET-SB strategy outperforms the BL strategy even under the worst case of c = 1. When the coefficient c is 0, the exact execution time is used to select the candidate submesh and the EET-SB strategy performs the best.

Figure 7.5: Mean response time vs. load (w and h: increasing distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, increasing and decreasing distributions for the width w and height h of submeshes requested. The performance results of the mean response time for the BL and EET-SB strategies when the width and height of requested submeshes are distributed under the increasing distribution are presented in Figure 7.5. Performance results are similar to those of the uniform distribution case. However, the performance difference among the various strategies is not as significant as that of the previous case, for the same reason as we discussed earlier.

Figure 7.6 shows the mean response time versus load for the BL and EET-SB strategies under the decreasing distribution of job requests.

Figure 7.6: Mean response time vs. load (w and h: decreasing distribution).

With a decreasing distribution, the system has more small jobs than large ones. The processor allocation strategy plays an important role when the system load is heavy.
As the load reached 0.5, the EET-SB strategy reduced the mean response time by 15% to 55% compared to the BL strategy under different estimation accuracies. Figures 7.4 to 7.6 show that the EET strategy outperforms the BL strategy under all load conditions, various job characteristics, and different estimation accuracies.

Because the performance difference is small when the width and height of the requested submesh are distributed under the increasing distribution, only the results for the uniform and decreasing distributions are presented for the BF and EET-SC strategies. Figures 7.7 and 7.8 show the mean response time under various loads for the BF and EET-SC strategies when w and h are distributed under the uniform and decreasing distributions, respectively.

Figure 7.7: Mean response time vs. load (w and h: uniform distribution).

Figure 7.8: Mean response time vs. load (w and h: decreasing distribution).

As we can observe from Figures 7.7 and 7.8, the EET-SC strategy outperforms the BF strategy. In order to compare the performance of all the strategies discussed above, we combine Figure 7.4 with Figure 7.7 and Figure 7.6 with Figure 7.8 into Figures 7.9 and 7.10, respectively.

Figure 7.9: Mean response time vs. load (w and h: uniform distribution).

Figure 7.10: Mean response time vs. load (w and h: decreasing distribution).

We can see that the BL strategy outperforms the BF strategy under all conditions. The EET-SC and EET-SB strategies have similar performance. The EET-SC strategy has a slight advantage over the EET-SB strategy in the c = 1 case, while the EET-SB strategy performs better than the EET-SC strategy when c = 0. When the EET-SB strategy is used, we have all the candidate submeshes along the boundaries to choose from, while the EET-SC strategy considers only corners. It is reasonable that the EET-SB strategy can find a better fit than the EET-SC strategy when the execution time estimation is precise. In the case of c = 1, since the EET-SC strategy considers only corners and the EET-SB strategy considers all boundaries, the EET-SB strategy is more likely to choose a bad candidate submesh because of the large estimation error. Finally, the EET strategies outperform the BF and BL strategies under all conditions.

7.4 Discussion

In this chapter, we have investigated the effect of using estimated execution time to guide submesh allocation. We have developed two processor allocation strategies for two-dimensional mesh-connected systems, called the EET-SB and EET-SC strategies, which utilize the estimated execution time information provided by users to allocate jobs with similar finishing times in the same areas.
Extensive simulation runs have been performed to compare various strategies. The results indicate that both EET strategies have better performance than previously proposed strategies under all circumstances. Inaccurate estimation of the execution time will simply cause less overall system performance improvement, but it will not cause any execution failures.

As we have discussed in Chapter 6, systems using the SJF scheduling scheme may encounter some difficulties if the users know that the system is designed to favor jobs with small estimated execution times and give small estimates for their jobs. Unlike the Shortest-Job-First (SJF) scheduling scheme, the EET strategies do not assign priorities to jobs based on their execution times. Users who intentionally give smaller or larger estimates of the execution times will not change their priority of execution, but will degrade the overall system performance and result in a longer response time. So, users have no reason to fool the system by intentionally giving inaccurate estimates.

Basically, the EET-SC and EET-SB strategies improve the performance relative to strategies that do not use execution time information. Both EET-SC and EET-SB use the same candidate submesh selection heuristic, and the difference between them is the candidate submesh generation method and the detailed implementation. The performance of the EET-SC and EET-SB strategies is very similar, and the reason why they perform better is the candidate submesh selection heuristic using estimated time information. In the next chapter, we will introduce a new submesh allocation strategy and an efficient candidate submesh generation method. The new candidate submesh generation method can be adapted for the EET strategies. The job scheduling schemes presented in Chapters 5 and 6 can also be applied to further improve the performance.

Chapter 8
Busy Distance Inverse Allocation Strategy

Generally, submesh allocation strategies can be divided into two categories, first-fit and best-fit. A first-fit submesh allocation strategy allocates the job to the first free submesh it detects. A best-fit strategy identifies several candidate submeshes, and the job is allocated to the best-fit candidate based on a certain heuristic.

In general, a first-fit allocation strategy tends to divide a large free space into disjoint, small pieces, causing large external fragmentation. Therefore, a carefully designed best-fit strategy is the key to better performance. Existing best-fit allocation strategies improve upon the fragmentation problem but leave plenty of room for further improvement. In this chapter, we present a new best-fit processor allocation strategy called the Busy Distance Inverse (BDI) strategy for two-dimensional mesh systems. As with other best-fit strategies, it tends to allocate a job to a submesh that is in close proximity to other busy submeshes or mesh boundaries; however, it incorporates more information in its heuristic. Yet, it yields the least complexity among all best-fit strategies. The performance of the BDI strategy is compared with other best-fit strategies, including Busy-List and Best-Fit, under the FCFS scheduling discipline. In the following sections, we will give the basic ideas and detailed algorithms of the BDI strategy, and then present the performance results.

8.1 Basic Ideas

Intuitively, it is better to allocate a job to processors in the near neighborhood of other busy submeshes and leave a larger contiguous free space for future jobs, thus reducing the external fragmentation.
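The difference between the two categories comes down to how a candidate is chosen from the feasible free submeshes. The Python fragment below contrasts the two selection rules in the abstract; the candidate list and the score function stand in for whatever a particular strategy provides and are not taken from the dissertation.

    def first_fit(candidates):
        # Take the first feasible candidate submesh detected, or None (sketch).
        return next(iter(candidates), None)

    def best_fit(candidates, score):
        # Take the candidate with the best strategy-specific score (sketch).
        # For the BL strategy the score could be the boundary value and for BDI
        # the busy distance inverse (both maximized); a strategy that minimizes
        # a quantity, such as TECTD, can simply negate its score.
        return max(candidates, default=None, key=score)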
The BF strategy is based on the heuristic of choosing a corner in the coverage array having the largest number of 1's surrounding it as the base of the free submesh. The BL strategy selects the candidate submesh having the most busy neighbors. These two strategies consider only the number of 1's surrounding corners in the coverage array or the number of busy neighbors of candidate submeshes. Neither of these strategies considers busy processors two or more hops away in the mesh.

Consider the 10 × 10 mesh in Figure 8.1. A solid dot represents a busy processor and a circle represents a free processor.

Figure 8.1: A two-dimensional mesh M(10, 10).

The corresponding busy array B[10, 10] and the coverage array C[10, 10] (including the reject set) for an incoming job J(3, 3) follow from the figure. There are two 3-corners, C[0, 5] and C[6, 4], in the coverage array. The area of C[0, 5] is six while the area of C[6, 4] is three. So, C[6, 4] will be selected as the base of the submesh for the incoming job J(3, 3) if the Best Fit strategy is used. If we use the Busy List strategy, J(3, 3) could be allocated to either submesh <0, 5, 2, 7> or <6, 4, 8, 6> because both have the same highest boundary value of seven among all candidate submeshes. However, the lower left corner of the mesh might be a better choice for allocating J(3, 3), because it leaves a larger free space for later jobs.

We present a new processor allocation strategy, called Busy Distance Inverse (BDI), which considers busy processors two and even more hops away. It allocates the incoming job in the most crowded area and leaves a larger free space for later jobs. In order to make the busy processors not adjacent to the candidate submesh still have some effect on the best-fit submesh selection, we design the BDI strategy so that each corner processor of a candidate submesh calculates the distance to the nearest busy processor or mesh boundary (we call it the busy distance) in each direction. The larger the busy distances of a candidate submesh, the less it satisfies the best-fit heuristic. There are three major parts of the BDI strategy: (1) the best-fit heuristic, (2) the candidate submesh generation methods, and (3) an efficient algorithm to implement it.

8.2 Best-Fit Heuristic

We need the following definitions to help us describe the detailed algorithm.

Definition 8.1 The processors at the four corners, lower-left, lower-right, upper-left, upper-right, of a candidate submesh A are denoted as LL(A), LR(A), UL(A), UR(A), respectively.

Definition 8.2 The busy distance from a corner of a candidate submesh A is the number of hops from it to the nearest busy processor in a given direction. If the corner is adjacent to a boundary in a given direction, then that busy distance is assumed to have a value of one.
Definition 8.3 For a candidate submesh A, its busy distance vector has eight components, consisting of the two busy distances at each of the four corners of A, which are denoted: LLS(A), LLW(A), LRS(A), LRE(A), ULN(A), ULW(A), URN(A), and URE(A). They are depicted in Figure 8.2. LLS(A) refers to the busy distance from LL(A) to the nearest busy processor in the south direction. All other components are similarly defined.

Figure 8.2: Busy distance vectors.

Definition 8.4 The busy distance inverse (BDI) of a candidate submesh A, BDI(A), is defined as
BDI(A) = 1/LLS(A) + 1/LLW(A) + 1/LRS(A) + 1/LRE(A) + 1/ULN(A) + 1/ULW(A) + 1/URN(A) + 1/URE(A).

As an example, consider the 10 × 10 mesh given in Figure 8.2. If the incoming job requests a 3 × 3 submesh, there are several candidate submeshes available. Consider two candidate submeshes A, <0, 0, 2, 2>, and B, <0, 5, 2, 7>. Applying the above definitions to the busy processors shown in Figure 8.2, we can determine that BDI(A) = 6 and BDI(B) = 5.64. The best-fit selection heuristic is to choose the candidate submesh having the largest BDI value. Thus, in this example, A is better than B.

8.3 Candidate Submesh Generation

The BF strategy generates candidate submeshes on the corners in the coverage array and has a time complexity of O(N), where N is the number of processors in the system. The BL strategy generates candidate submeshes along the four corners of the allocated submeshes, and its complexity is O(N_A^3), where N_A is the number of allocated submeshes. The Lookahead strategies divide the free processors that can serve as the base of the requested submesh into rectangular free regions, and the candidate submeshes are based at the lower-left corners of these free regions. This generates O(N_A) candidate submeshes and has a time complexity of O(N_A^2) [5]. Usually the number of allocated submeshes, N_A, is much smaller than N, the number of processors in the system. Das Sharma and Pradhan [21] showed that the BL strategy has much lower average allocation overhead compared with the strategies having O(N) complexity. The Lookahead strategy offers even lower time complexity than the BL strategy does.

Based on the simulation results collected using our simulation tool (which we describe in detail in the next section), the BDI best-fit selection heuristic combined with the BF and BL strategies' candidate submesh generation methods offers similar performance. However, the Lookahead strategy's candidate submesh generation method performs poorly with the BDI heuristic. One possible reason is that this method restricts itself to checking only one corner of each free region, and this may disregard better candidate submeshes.
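Whichever generation method supplies the candidates, the BDI heuristic itself scores each one as in Definitions 8.2 through 8.4. The Python sketch below computes the busy distance vector and BDI value of a single candidate from the list of allocated submeshes; the (x1, y1, x2, y2) tuple representation and the helper names are assumptions made for illustration, not the dissertation's data structures.

    def bdi(candidate, busy_submeshes, W, H):
        # Busy distance inverse of one candidate submesh (sketch).
        # Submeshes are (x1, y1, x2, y2) with inclusive corners; W and H are
        # the mesh dimensions.
        x1, y1, x2, y2 = candidate

        def busy(px, py):
            return any(bx1 <= px <= bx2 and by1 <= py <= by2
                       for bx1, by1, bx2, by2 in busy_submeshes)

        def distance(px, py, dx, dy):
            # Hops to the first processor that is busy or lies beyond the mesh
            # boundary; a corner sitting on the boundary therefore gets 1.
            hops, qx, qy = 1, px + dx, py + dy
            while 0 <= qx < W and 0 <= qy < H and not busy(qx, qy):
                hops, qx, qy = hops + 1, qx + dx, qy + dy
            return hops

        components = [
            distance(x1, y1, 0, -1), distance(x1, y1, -1, 0),   # LLS, LLW
            distance(x2, y1, 0, -1), distance(x2, y1, 1, 0),    # LRS, LRE
            distance(x1, y2, 0, 1),  distance(x1, y2, -1, 0),   # ULN, ULW
            distance(x2, y2, 0, 1),  distance(x2, y2, 1, 0),    # URN, URE
        ]
        return sum(1.0 / d for d in components)

The allocator evaluates this value for each candidate and keeps the one with the largest BDI.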
Instead of using only the lower-left corners of free regions as the bases of candidate submeshes, if we consider all four corners of the free regions as possible bases, the BDI heuristic offers similar performance using all three strategies' candidate submesh generation methods. So, the preferred submesh generation method is the Lookahead strategy, since it offers lower complexity.

8.4 Detailed Algorithm

Using the Lookahead strategy's candidate submesh generation method, we can generate O(N_A) candidates in O(N_A^2) time. Each candidate must consider all N_A allocated submeshes to calculate the eight components of its busy distance vector. A candidate takes O(1) time to compare with an allocated submesh and update the eight components of its busy distance vector, so it takes O(N_A^2) time for O(N_A) candidates to compare with N_A allocated submeshes. It takes O(N_A) time to calculate the busy distance inverse values for O(N_A) candidates and choose the candidate with the largest BDI value. So, the overall complexity of the BDI strategy is O(N_A^2), the same as the Lookahead strategies. The detailed algorithm is presented in Figure 8.3.

1. let J(w, h) be the first job in the queue;
2. if the busy set is empty, then allocate J(w, h) to <0, 0, w − 1, h − 1> and stop;
3. generate coverage submeshes for J(w, h);
4. starting with a single free region <0, 0, W − w + 1, H − h + 1>, subtract the coverage submeshes for J(w, h) and generate a list of free regions;
5. if the list of free regions is not empty, then calculate the BDI values for all the candidate submeshes based on the four corners of all the free regions, choose the candidate submesh having the largest BDI value, and stop;
6. if w ≠ h, then repeat steps 3 to 5 for J(h, w);
7. put J(w, h) back at the front of the queue and stop;

Figure 8.3: The Busy Distance Inverse strategy.

8.5 Performance Results

Via simulation, we compared the performance of the BDI strategy with the BL and BF strategies. In order to have a fair comparison, the BF strategy is implemented in such a way that it can rotate the requested submesh by 90 degrees as in the BL and BDI strategies. We measure the system performance based on the mean and standard deviation of response time. To have an accurate measurement, only stable systems are considered.

Figure 8.4 illustrates the mean response time under different workloads using the BF, BL, and BDI strategies. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.3), all allocation strategies have similar performance. Most jobs are served promptly after they enter the system. For load greater than 0.4, the BDI strategy shows a performance gain. When the load reaches 0.5, the BDI strategy reduces the mean response time compared to the BL allocation strategy by 18%. When the load reaches 0.6, the system becomes unstable, so we report the results only up to a load of 0.5. Note that in a recent job characteristics study [29], the observed system utilization was above 40% for over 80% of the time on weekdays in a practical scientific parallel computing environment.

Figure 8.4: Mean response time vs. load (w and h: uniform distribution).

The standard deviation of the response time is presented in Figure 8.5.
Figures 8.4 and 8.5 indicate that the BDI strategy outperforms the BL and BF strategies under all workloads when the width and height of requested submeshes are uniformly distributed.

Figure 8.5: Standard deviation of response time vs. load (w and h: uniform distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, increasing and decreasing distributions for the width w and height h of submeshes requested. The simulation results of the mean and standard deviation of response time for the BF, BL, and BDI strategies when the width and height of requested submeshes are distributed under the increasing distribution are presented in Figures 8.6 and 8.7. Performance results are similar to those of the uniform distribution case. However, the performance difference among the various strategies is not as significant as that of the previous case. This moderate performance improvement for the increasing distribution of job requests is expected. The system tends to have more jobs requesting large submeshes in an increasing distribution and, hence, there is a reduced chance of allocating large jobs into the system while some jobs are already in the system. The extreme case is that all jobs request the full mesh, so no matter which processor allocation strategy is used, the performance remains the same.

Figure 8.6: Mean response time vs. load (w and h: increasing distribution).

Figure 8.7: Standard deviation of response time vs. load (w and h: increasing distribution).

Figures 8.8 and 8.9 show the mean and standard deviation of response time versus load, respectively, for the BF, BL, and BDI strategies under the decreasing distribution of job requests.

Figure 8.8: Mean response time vs. load (w and h: decreasing distribution).

With a decreasing distribution, the system has more small jobs than large ones. The processor allocation strategy plays an important role when the system load is heavy. As the load reaches 0.5, the BDI strategy reduces the mean response time by 55% compared to the BL strategy. Figures 8.4 to 8.8 show that the BDI strategy outperforms the BL strategy under all load conditions and various job characteristics. The main reason why the BDI strategy performs better is that it considers not only adjacent busy processors of a candidate submesh, but also busy processors two or more hops away, thus having a global view of the system and a superior best-fit heuristic.

Figure 8.9: Standard deviation of response time vs. load (w and h: decreasing distribution).

8.6 Discussion

In this chapter, we have presented a new processor allocation strategy, called the Busy Distance Inverse strategy.
It is a best-fit strategy that tries to allocate a submesh not only adjacent to busy submeshes or mesh boundaries, but also in close proximity to other busy submeshes or mesh boundaries.

Extensive simulation runs have been performed to compare the BDI strategy with various strategies. The results indicate that it performs better than the BF and BL strategies under all load conditions and job characteristics. Moreover, the BDI strategy can be effectively applied in practice. It can be implemented efficiently, is scalable (i.e., does not depend directly on the mesh size), and performs well under load conditions that occur in actual scientific parallel computing environments.

Chapter 9
Conclusions

Mesh connected parallel architectures have become increasingly popular in the design of multiprocessor systems in recent years. To allow the best usage of these systems, effective submesh allocation and job scheduling schemes are desirable. Continuous allocations and deallocations of various sized submeshes result in a fragmented mesh system. We have developed efficient and effective submesh allocation and scheduling schemes on 2D mesh systems. This research improves the system utilization and reduces the mean job response time significantly. Further, we have built a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers to evaluate the performance of existing schemes as well as our proposed schemes. In this chapter, we summarize the contributions of this research in Section 9.1. Then, we suggest some future research directions in Section 9.2.

9.1 Contributions

We have developed a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers. Sched-All is scalable in the sizes of mesh which can be simulated. It can evaluate the performance of various submesh allocation and job scheduling schemes on a mesh system of size up to 128 × 128. It includes a performance evaluation simulator, an experiment analyzer, and a performance plotter. It supports a comprehensive set of processor allocation and scheduling schemes, and its modular and object-oriented design makes it very easy to maintain and to integrate new schemes into it.

We first incorporated job scheduling for processor allocation on 2D mesh multicomputers. We have developed two new job scheduling schemes, Immediate Fit and Scan All. They can be used with any submesh allocation strategy. A waiting time limit is introduced to avoid a starvation problem. Simulation results show that our scheduling schemes significantly improve the system utilization and improve the mean and standard deviation of response time for different load conditions and job characteristics.

Based on the Scan All scheduling scheme, we developed a starvation-free job scheduling scheme, Multiple Queues, that can also be used with any existing submesh allocation strategy. The effect of various ways of using multiple queues for job scheduling in 2D mesh systems is investigated. Simulation results show that the Multiple Queues scheme further improves the performance of the SA scheme under different load conditions and job characteristics. A higher system utilization and lower internal fragmentation are also achieved.

Turning to processor allocation strategies, we have presented pioneering work to guide processor allocation on 2D mesh multicomputers with estimated execution time.
We have developed two processor allocation strategies for two-dimensional mesh-connected systems, called the EET-SB and EET-SC strategies, which utilize the estimated execution time information provided by users to allocate jobs with similar finishing times in the same areas. Again, extensive simulation runs have been performed to compare various strategies. The results indicate that both EET strategies have better performance than previously proposed strategies under all circumstances. Inaccurate estimation of execution time will simply cause less overall system performance improvement, but not cause any execution failures.

Although users usually can quite accurately estimate the execution time of their jobs, people may argue that execution time information is not always available. We have also developed a new processor allocation strategy, called the Busy Distance Inverse (BDI) strategy, which does not use the execution time information. It is a best-fit strategy that tries to allocate a submesh not only adjacent to busy submeshes or mesh boundaries, but also in close proximity to other busy submeshes or mesh boundaries. The consequence is that it performs the best among all submesh allocation strategies that do not use execution time information. Moreover, the BDI strategy has the lowest time complexity among all submesh allocation strategies and can be effectively applied in practice. It can be implemented efficiently, is scalable (i.e., does not depend directly on the mesh size), and performs well under load conditions that occur in actual scientific parallel computing environments.

9.2 Future Work

Our Sched-All tool can be extended to enhance its functionality. For example, it would be useful to integrate it with the visualization tool ProcSimity (described in Section 2.6). ProcSimity provides a dynamic display of allocation and deallocation events and performance metrics. We can modify the Sched-All simulator to generate tracefiles in the format readable by ProcSimity's visualization tool. Further, topologies other than the 2D mesh (for example, hypercubes) and their subcube allocation strategies could also be integrated in the Sched-All tool.

In this research as well as related research, a probabilistic workload is used to drive the simulator to evaluate the performance of various schemes. We considered uniform, decreasing, and increasing distributions of jobs in the simulation tool. Other distributions, for example, a bimodal distribution, could also be considered. Further, it is very desirable to collect actual workload data from a system serving multiple, mixed jobs to drive the simulator.

Prototypical implementation of the existing schemes and the new schemes that we have developed is necessary to measure the actual overheads to the system and the actual performance. We have obtained preliminary results for selected simulation runs as noted in Section 3.3. However, further results are needed from a more extensive implementation. This is essential prior to adopting a particular strategy in practice. The assumption of negligible overheads needs to be supported quantitatively.

Bibliography

[1] A. Al-Dhelaan and B. Bose, "A New Strategy for Processor Allocation in an N-cube Multiprocessor," Proc. International Phoenix Conference on Computers and Communications, pp. 114-118, 1989.

[2] R. Alverson, et al., "The Tera Computer System," Proc. International Conference on Supercomputing, pp. 1-6, Jun. 1990.

[3] D. Babbar and P.
Krueger, "A Performance Comparison of Processor Allocation and Job Scheduling Algorithms for Mesh-Connected Multiprocessors," Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 46-53, 1994.

[4] B. S. Baker, E. G. Coffman, Jr., and R. L. Rivest, "Orthogonal Packings in Two Dimensions," SIAM J. Comput., pp. 846-855, Nov. 1980.

[5] S. Bhattacharya and W.-T. Tsai, "Lookahead Processor Allocation in Mesh-Connected Massively Parallel Multicomputer," Proc. International Parallel Processing Symposium, pp. 868-875, 1994.

[6] G.-I. Chen and T.-H. Lai, "Scheduling Independent Jobs on Hypercubes," Proc. 5th Symp. on Theoretical Aspects of Computer Science, pp. 273-280, 1988.

[7] G.-I. Chen and T.-H. Lai, "Preemptive Scheduling of Independent Jobs on a Hypercube," Information Processing Letters, no. 28, pp. 201-206, 1988.

[8] M. S. Chen and K. G. Shin, "Processor Allocation in an N-cube Multiprocessor," IEEE Transactions on Computers, pp. 1396-1407, Dec. 1987.

[9] Y.-K. Chu and D. T. Rover, "An Effective Two-Dimensional Mesh Partitioning Strategy," Parallel Processing Letters, special issue on Partitioning and Scheduling for Parallel and Distributed Systems (accepted and to be published in Dec. 1995).

[10] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Processor Allocation in Mesh Systems Using Multiple Queues," Tech. Rep. CPS-93-33, Dept. of Computer Science, Michigan State University, June 1994.

[11] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Incorporating Job Scheduling for Processor Allocation on Two-Dimensional Mesh-Connected Systems," Proc. 7th International Conference on Parallel and Distributed Computing Systems, pp. 124-129, Oct. 1994.

[12] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Guiding Processor Allocation with Estimated Execution Time for Mesh Connected Multiple Processor Systems," Proc. 28th Hawaii International Conference of System Sciences, pp. 163-172, Jan. 1995.

[13] P.-J. Chuang and N.-F. Tzeng, "An Efficient Submesh Allocation Strategy for Mesh Computer Systems," Proc. International Conference on Distributed Computing Systems, pp. 256-263, May 1991.

[14] P.-J. Chuang and N.-F. Tzeng, "A Fast Recognition-Complete Processor Allocation Strategy for Hypercube Computers," IEEE Transactions on Computers, pp. 467-479, Apr. 1992.

[15] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation Algorithms for Bin-Packing - an Updated Survey," Algorithm Design for Computer System Design (G. Ausiello, M. Lucertini, P. Serafini, ed.), pp. 49-106, Springer-Verlag, New York, 1984.

[16] E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan, "Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms," SIAM J. Comput., pp. 808-826, Nov. 1980.

[17] E. G. Coffman, Jr., J. Y.-T. Leung, and D. Slutz, "On the Optimality of First-Fit and Level Algorithms for Parallel Machine Assignments and Sequencing," Proc. International Conference on Parallel Processing, pp. 95-99, 1977.

[18] D. Das Sharma, G. D. Holland, and D. K. Pradhan, "Subcube Level Time-Sharing in Hypercube Multicomputers," Proc. International Conference on Parallel Processing, vol. II, pp. 134-142, 1994.

[19] D. Das Sharma and D. K. Pradhan, "A Novel Approach for Subcube Allocation in Hypercube Multiprocessors," Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 336-345, Dec. 1992.

[20] D. Das Sharma and D. K. Pradhan, "Submesh Allocation in Mesh Multicomputers," Dept. of Computer Science, Texas A&M University, Tech. Rep. 93-043, 1993.

[21] D. Das Sharma and D. K.
Pradhan, “A Fast and Efficient Strategy for Sub- mesh Allocation in Mesh—Connected Parallel Computers,” Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 682-689, Dec. 1993. D. Das Sharma and D. K. Pradhan, “Fast and Efficient Strategies for Cubic and Non-Cubic Allocation in Hypercube Multiprocessors,” Proc. International Conference on Parallel Processing, vol. I, pp. 118-127, 1993. D. Das Sharma and D. K. Pradhan, “Job Scheduling in Mesh Multicomputers,” Proc. International Conference on Parallel Processing, vol. II, pp. 251-258, 1994. 130 [24] H. M. Deitel, An Introduction to Operating Systems, Addison-Wesley, 1983. [25] J. Ding and L. N. Bhuyan, “An Adaptive Submesh Allocation Strategy for Two— Dimensional Mesh Connected Systems,” Proc. 1993 International Conference on Parallel Processing, vol. II, pp. 193-200, 1993. [26] S. Dutt and J. P. Hayes, “On Allocating Subcubes in a Hypercube Multipro- cessor,” Proc. Third Conference on Hypercube Computers and Applications, pp. 801-810, 1988. [27] S. Dutt and J. P. Hayes, “Subcube Allocation in Hypercube Computers,” IEEE Transactions on Computer, pp. 341-352, Mar. 1991. [28] H. El—Rewini, T. G. Lewis, and H. H. Ali, Task Scheduling in Parallel and Dis- tributed Systems, Prentice Hall, 1994. [29] D. G. Feitelson and B. Nitzberg, “Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC / 860,” Tech Report RC 19773, IBM T. J. Watson Research Center, Oct. 1994. [30] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979. [31] V. P. Gopinath, T. A. Grotjohn, Y.-K. Chu, and D. T. Rover, “Three Dimen- sional Simulations of Plasmas Using MPPs,” Proc. 7th SIAM Conference on Parallel Processing for Scientific Computing, pp. 201-206, Feb. 1995. [32] V. P. Gopinath, T. A. Grotjohn, Y.-K. Chu, and D. T. Rover, “Parallelization and Performance of Three-Dimensional Plasma Simulation,” Proc. 5th Sympo- sium on the Frontiers of Massively Parallel Computation, pp. 148-155, Feb. 1995. [33] M. T. Heath and J. A. Etheridge, Visualizing performance of parallel programs, Tech. Rep. ORNL/TM-11813, Oak Ridge National Laboratory, Oak Ridge, TN, May. 1991. [34] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons,1991. [35] K. Kant, Introduction to Computer System Performance Evaluation, McGraw- Hill, 1992. [36] J. Kim, C. R. Das, and W. Lin, “A Processor Allocation Scheme for Hypercube Compu,” Proc. International Conference on Parallel Processing, vol. II, pp. 231- 238, 1989. [37] J. Kim, C. R. Das, and W. Lin, “A Top-Down Processor Allocation Scheme for Hypercube Computers,” IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 20-30, Jan. 1991. [381 [39] [40] [41] [4‘21 [43] [44] [451 [461 [47] [48] [49] [50] 131 K. C. Knowlton, “A Fast Storage Allocator,” Communications of the ACM, pp. 623-625, Oct. 1965. P. Krueger, T.-H. Lai, and V. A. Radiya, “Processor Allocation vs. Job Schedul- ing on Hypercube Computers,” Proc. International Conference on Distributed Computing Systems, pp. 394-401, May. 1991. P. Krueger, T.-H. Lai, and V. A. Radiya, “Job Scheduling is More Important than Processor Allocation for Hypercube Computers,” IEEE Transactions on Parallel and Distributed Systems, May. 1994. K. Li and K. H. Cheng, “Job Scheduling in Partitionable Mesh Connected Sys- tems,” Proc. International Conference on Parallel Processing, vol. II, pp. 65—72, Aug. 1989. K. Li and K. H. 
Cheng, “Complexity of Resource Allocation and Job Schedul- ing Problems in Partitionable Mesh Connected Systems,” Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 358-365, 1989. K. Li and K. H. Cheng, “Static Job Scheduling in Partitionable Mesh Connected Systems,” Journal of Parallel and Distributed Computing, pp. 152-159, Oct. 1990. K. Li and K. H. Cheng, “Job Scheduling in PMCS Using a 2DBS as the System Partitioning Scheme,” Proc. International Conference on Parallel Processing, vol. I, pp. 119-122, 1990. K. Li and K. H. Cheng, “A Two Dimensional Buddy System for Dynamic Re- source Allocation in a Partitionable Mesh Connected System,” Proc. ACM Com- puter Science Conference, pp. 22-28, Feb. 1990. K. Li and K. H. Cheng, “A Two—Dimensional Buddy System for Dynamic Re— source Allocation in a Partitionable Mesh Connected System,” Journal of Par- allel and Distributed Computing, pp. 79-83”, May 1991. K. Li and K. H. Cheng, “Job Scheduling in a Partitionable Mesh Using a Two- Dimensional Buddy System Partitioning Scheme,” IEEE Transactions on Par- allel and Distributed Systems, pp. 413-422”, 1991. W. Liu, V. Lo, K. Windisch, and B. Nitzberg, “Non-contiguous Processor Allo- cation Algorithms for Distributed Memory Multicomputers,” Proc. International Conference on Supercomputing, pp. 227-236, 1994. P. Mohapatra, C. Yu, C. R. Das, and J. Kim, “A Lazy Scheduling Scheme for Improving Hypercube Performance,” Proc. International Conference on Parallel Processing, vol. I, pp. 110-117, 1993. L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in Direct Networks,” Computer, vol. 26, pp. 62-76, Feb. 1993. 132 [51] J. K. Ousterhout, Tcl and the Tk Toolkit, Addison-Wesley, 1994. [52] H. Schwetman, CSIM User’s Guide, Microelectronics and Computer Technology Corporation, 1991. [53] T. J. Teorey and TB. Pinkerton, “A Comparative Analysis of Disk Scheduling Policies”, Communications of ACM, pp. 177-184, March, 1972. [54] T. Williams and C. Kelley, GUNPLOT: An Interactive Plottong Program, 1993. [55] K. Windisch, J. V. Miller, and V. Lo, “ProcSimity: An Experimental Tool for Processor Allocation and Scheduling in Highly Parallel Systems,” The Fifth Sym- posium on the Frontiers of Massively Parallel Computation, pp. 414-421, Feb., 1995. [56] C. Yu and C. R. Das, “Limit Allocation: An Efficient Processor Management Scheme for Hypercubes,” Proc. International Conference on Parallel Processing, vol. II, pp. 143-150, 1993. [57] Y. Zhu, “Efficient Processor Allocation Strategies for Mesh-Connected Parallel Computers,” Journal of Parallel and Distributed Computing, vol. 16, pp. 328-337, Dec. 1992. [58] Y. Zhu and M. Ahuja, “Preemptive Job Scheduling on a Hypercube,” Proc. International Conference on Parallel Processing, vol. I, pp. 301-304, 1990. [59] Y. Zhu and M. Ahuja, “Job Scheduling on a Hypercube,” Proc. International Conference on Distributed Computing Systems, pp. 510-516, 1990. [60] Paragon XP/S Product Overview, Intel Corporation, 1991. [61] A Touchstone DELTA System Description, Intel Corporation, 1991. "I111111111111111“