PROCESSOR SCHEDULING IN A DISTRIBUTED-MEMORY COMPUTING ENVIRONMENT

By

Stephen W. Turner

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

February 28, 1995

ABSTRACT

PROCESSOR SCHEDULING IN A DISTRIBUTED-MEMORY COMPUTING ENVIRONMENT

By Stephen W. Turner

In recent years, the development of large-scale distributed-memory computers has given the user community unprecedented levels of computing power. In order to effectively use the available computing power, processor scheduling algorithms have been developed that allow many users to share distributed computing resources while obtaining the best possible job turnaround time. However, not all existing scheduling techniques take full advantage of available computing power. For example, in hypercubes, a cluster must normally be allocated as an entire subcube, which can result in high internal fragmentation, as well as poor job performance. Although the distributed workstation environment has recently become popular as a choice for a distributed-memory parallel computer, the problem of scheduling specifically for parallel job execution has not been well studied in this environment.

In this thesis, we present cluster allocation and scheduling methods that can be used to improve performance, in the form of system throughput, in two classes of distributed-memory parallel computers: hypercubes, and the network of workstation-based clustered parallel computer (NOW-based CPC). For hypercubes, we present an improved cluster allocation technique that reclaims unused processors from 2D mesh allocation, resulting in improved system throughput. For the NOW-based CPC, we present new cluster allocation and processor scheduling techniques that result in improved system throughput without adversely affecting the job turnaround time (JTT) for individual users.

© Copyright February 28, 1995 by Stephen W. Turner
All Rights Reserved

To my parents, and my wife, Dawn

ACKNOWLEDGMENTS

I wish to thank my thesis advisors, Professors Lionel M. Ni and Betty H.C. Cheng. Without their direction and support, this dissertation would not be possible. They provided me with a great deal of encouragement, a direction for my research, and a great sense of organization. I have learned a great deal from them during my time in graduate school.
I also wish to acknowledge my committee members, Professors Anthony Wojcik and David Yen, for their support. As the Chair of our department, Dr. Wojcik has at many times in the past provided assistance. Professors Wojcik and Ni are primarily responsible for my decision to study at Michigan State University. Professors Richard Enbody, Philip McKinley, John Forsyth, and Donald Weinshank all deserve credit for their willingness to talk to me and provide assistance at numerous times in the past.

I also wish to thank the members of the staff of the Computer Science department, particularly Cathy Davison, Laura Mae Higbee, Linda Moore, and Michele Sidel. I cannot count the number of times they have helped me through the intricacies of the MSU bureaucracy. My assistantship supervisor for the past four years, Frank Northrup, also deserves credit for his support, as well as his flexibility with my work schedule.

Many fellow students and colleagues from the department have provided support and preserved my sanity. The lunch crowd, especially Dr. Reid Baldwin, Barbara Birchler, Dr. David Chesney, Barbara Czerny, Dr. Marie-Pierre Dubuisson, Edgar Kalns, Ron Sass, and Dr. Christian Trefftz, provided many hours of entertaining conversation and camaraderie when it was greatly needed. I regard all of them as good friends. Dr. Maureen Galsterer, whom I have known since I entered graduate school, deserves special acknowledgement for being a great friend during all these years. We have the special bond that comes from sharing a common enemy. Patrick Edsall, whom I knew in both undergraduate and graduate school, introduced me to my wife, and he was a good friend in the early days when I knew nobody else at MSU.

I wish to acknowledge the love and support of the many members of my extended family. My father, Professor Walter Turner, has been my role model all my life. He provided the standard to shoot for and the example to follow. My mother, M. Patricia Gioia, and her husband, Professor Anthony Gioia, deserve a great deal of credit for being there, providing support, and giving sage advice when I needed it. Of course, my siblings, Connie, Anne, Roberta, Nancy, Nick, and Jane (in order of seniority) all deserve mention for the influences they have had on my life. I wish to thank my wife's parents, Jacquelyn and David Moore, for welcoming me into their family and giving us so much help.

Foremost, I wish to acknowledge and thank the most important person in my life, my wife Dawn. She deserves the most credit, for putting up with me during all of my ups and downs, for loving me, and for providing support in times of need.

Last, but not least, I'd like to thank the Academy, especially all the little people...

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction and Motivations
1.1 Processor Scheduling
1.2 Thesis Statement and Research Overview
1.2.1 Research Objectives
1.2.2 Research Contributions
1.3 Organization of Dissertation
2 Problem Statement and Related Work
2.1 Processor Scheduling in NUMA Architecture Multiprocessors
2.1.1 Single Address Space Machines
2.1.2 Separate Address Space Machines
2.2 Cluster Allocation in Hypercubes
2.3 Processor Scheduling in Workstation Clusters
3 Cluster Allocation in Hypercubes
3.1 Problem Definition
3.2 Notation
3.3 The AFL Cluster Allocation Method
3.3.1 Partitioning Leftover Subcubes
3.3.2 Allocation of 2D Meshes
3.3.3 Deallocation of 2D Mesh Requests
3.4 Analysis of the AFL Cluster Allocation Method
3.4.1 Analysis of the Partitioning Algorithm
3.4.2 Analysis of the AFL Allocation Algorithm
3.4.3 Analysis of the Modified Free List Algorithm
3.4.4 Analysis of the Cluster Partitioning Algorithm
3.4.5 Analysis of the Coalesce Algorithm
3.4.6 Analysis of the AFL Deallocation Algorithm
4 Experimental Results for the AFL Cluster Allocation Method
4.1 Square and Rectangular Requests With Uniform Distributions
4.2 Interval Distribution, Small Dimension in [1, 8]
4.3 Interval Distribution, Small Dimension in [1, 16]
5 Timesharing in Workstation Clusters
5.1 Problem Definition
5.2 Characterization of Workload
5.2.1 Format of Experiments
5.2.2 Testbed
5.3 Analytical Model
5.3.1 Barrier with EP, Non-preemptive Kernel
5.3.2 Barrier with Barrier, Non-preemptive Kernel
5.3.3 Analytical Model With Preemptive Kernel
5.3.4 Discussion of Analysis
6 Elmer: A Scheduling Facility for a Network of Workstations
6.1 Implementation Issues
6.2 Facility Functions
6.2.1 User-Level Commands
6.2.2 Operation of Elmer
6.2.3 Operation of RSD
6.3 Cluster Allocation Algorithms
6.4 Evaluation of Elmer's Performance
6.5 Experimental Results
7 Conclusion
7.1 Hypercubes
7.2 Networks of Workstations
7.3 Contributions of this Thesis
7.4 Future Work
BIBLIOGRAPHY

LIST OF TABLES

2.1 Architecture classification according to memory access
4.1 Interval distributions simulated
5.1 Barrier parameters used to simulate different workloads

LIST OF FIGURES

1.1 System-Level Processor Scheduling
1.2 Node-Level Processor Scheduling
2.1 Memory Architecture and Memory Address Space Models
3.1 Communication interference in an embedded mesh in a subcube
3.2 The partitioning algorithm
3.3 Leftover regions from 2D mesh embeddings
3.4 Allocation of a 5 x 5 mesh
3.5 The AFL allocation algorithm
3.6 The modified free list allocation algorithm
3.7 The cluster partitioning algorithm
3.8 The coalesce algorithm
3.9 The auxiliary free list deallocation algorithm
4.1 JTT and system utilization vs. system load for constant workload
4.2 JTT and system utilization vs. system load for square mesh requests in a [1, 16] uniform distribution
4.3 JTT and system utilization vs. system load for rectangular mesh requests in a [1, 16] uniform distribution
4.4 JTT and system utilization vs. system load for square mesh requests in a [1, 32] uniform distribution
4.5 JTT and system utilization vs. system load for rectangular mesh requests in a [1, 32] uniform distribution
4.6 JTT and system utilization vs. system load for P[1,8] = λ, P[9,32] = 1 − λ interval distributions
4.7 JTT and system utilization vs. system load for P[1,8] = λ, P[9,32] = 1 − λ interval distributions
4.8 JTT and system utilization vs. system load for P[1,8] = λ, P[17,32] = 1 − λ interval distributions
4.9 JTT and system utilization vs. system load for P[1,8] = λ, P[17,32] = 1 − λ interval distributions
4.10 JTT and system utilization vs. system load for P[1,8] = λ, P[20,32] = 1 − λ interval distributions
4.11 JTT and system utilization vs. system load for P[1,8] = λ, P[20,32] = 1 − λ interval distributions
4.12 JTT and system utilization vs. system load for P[1,8] = λ, P[25,32] = 1 − λ interval distributions
4.13 JTT and system utilization vs. system load for P[1,8] = λ, P[25,32] = 1 − λ interval distributions
4.14 JTT and system utilization vs. system load for P[1,8] = λ, P[28,32] = 1 − λ interval distributions
4.15 JTT and system utilization vs. system load for P[1,8] = λ, P[28,32] = 1 − λ interval distributions
4.16 JTT and system utilization vs. system load for P[1,8] = λ, P[30,32] = 1 − λ interval distributions
4.17 JTT and system utilization vs. system load for P[1,8] = λ, P[30,32] = 1 − λ interval distributions
4.18 JTT and system utilization vs. system load for P[1,16] = λ, P[17,32] = 1 − λ interval distributions
4.19 JTT and system utilization vs. system load for P[1,16] = λ, P[17,32] = 1 − λ interval distributions
4.20 JTT and system utilization vs. system load for P[1,16] = λ, P[20,32] = 1 − λ interval distributions
4.21 JTT and system utilization vs. system load for P[1,16] = λ, P[20,32] = 1 − λ interval distributions
4.22 JTT and system utilization vs. system load for P[1,16] = λ, P[25,32] = 1 − λ interval distributions
4.23 JTT and system utilization vs. system load for P[1,16] = λ, P[25,32] = 1 − λ interval distributions
4.24 JTT and system utilization vs. system load for P[1,16] = λ, P[28,32] = 1 − λ interval distributions
4.25 JTT and system utilization vs. system load for P[1,16] = λ, P[28,32] = 1 − λ interval distributions
4.26 JTT and system utilization vs. system load for P[1,16] = λ, P[30,32] = 1 − λ interval distributions
4.27 JTT and system utilization vs. system load for P[1,16] = λ, P[30,32] = 1 − λ interval distributions
5.1 Difference Between Ideal and Observed Speedups
5.2 Barrier communication characterization
5.3 Programs running alone, execution time
5.4 Programs running alone, speedup
5.5 Experimental method
5.6 Execution pattern of Barrier with exclusive control of cluster
5.7 Execution pattern of Barrier and EP, equal priority
5.8 Execution pattern of Barrier and EP, EP at low priority
5.9 Execution pattern of two Barriers with both barrier processes on one CPU
5.10 Execution pattern of high-priority Barrier with low-priority Barrier when barrier-implementing processes are on different CPUs
6.1 Logical view of Elmer operations
6.2 Logical view of the Elmer main program
6.3 Logical view of elmer_submit
6.4 Logical view of elmer_query
6.5 Communication involved in elmer_kill
6.6 Logical view of the RSD main program
6.7 Format of the cluster data structure
6.8 The cluster allocation algorithm, version 1
6.9 The cluster allocation algorithm, version 2
6.10 Overhead (ms) for job initiation
6.11 Overhead (ms) for job termination
6.12 Experimental format, Barrier running with EP
6.13 Job turnaround time for B (v1) and EP
6.14 Job turnaround time for B (v2) and EP
6.15 Job turnaround time for B (v3) and EP
6.16 Experimental format, Barrier programs sharing same barrier processor
6.17 Job turnaround time for B (v2) and B (v2), barrier processors same
6.18 Job turnaround time for B (v2) and B (v3), barrier processors same
6.19 Job turnaround time for B (v3) and B (v3), barrier processors same
6.20 Experimental format, Barrier programs, barrier process shares CPU with non-barrier process
6.21 Job turnaround time for B (v2) and B (v3), barrier processors different
6.22 Job turnaround time for B (v1) and B (v3), barrier processors different
6.23 Experimental format, Barrier programs, barrier process occupies a CPU exclusively
6.24 Job turnaround time for B (v1) and B (v3), partially intersecting clusters
6.25 Experimental format, Barrier programs, low-priority Barrier occupies a subset of high-priority Barrier's cluster
6.26 Job turnaround time for B (v1 and v3), v3 having a subset cluster

Chapter 1

Introduction and Motivations

In recent years, computing power has been made available to the user community at a larger scale than ever before.
The rapid development of processor technology for the scientific workstation and personal computer markets has contributed to the development of scalable parallel computers (SPCs). SPCs are a viable platform on which to solve the so-called grand-challenge problems, and such systems are usually characterized by the distribution of memories among a collection of nodes. There is a wide variety of SPC architectures available, although they can be classified into two popular general approaches: (1) massively parallel computers (MPCs), including meshes, hypercubes, multistage interconnection networks (MINs), etc., and (2) networks of workstations (NOWs), also referred to here as workstation clusters.

In an MPC, the nodes, each of which consists of a processor, local memory, and other supporting devices, are connected by either a direct (point-to-point) or indirect (switch-based) network. Examples of direct network MPCs include the hypercube and 2D mesh. Examples of indirect network MPCs include MINs and crossbar networks. Nodes in a NOW are similar to those in an MPC, but the two architectures differ in two primary ways: (1) the network bandwidth available in a workstation cluster is generally lower than that available in an MPC, and (2) local disks may be easily added to individual workstation nodes, which is almost impossible in an MPC. Workstation clusters can be connected by either shared-medium or switch-based networks. Examples of shared-medium networks for workstation clusters include Ethernet and Token Ring. Examples of switch-based networks for workstation clusters include the DEC GIGAswitch and the ATM switch.

MPCs and NOWs are expandable and can achieve a proportional increase in performance without changing their basic architecture. By increasing the capacity of certain hardware components, performance increases can be achieved in computational capacity, memory bandwidth and capacity, internal interconnect (network) bandwidth, and I/O bandwidth and capacity. However, in order to efficiently utilize the increased performance offered by an SPC, an effective processor scheduling policy must be used.

In a parallel computer, processor scheduling involves ordering the execution of arriving jobs (job dispatching), as well as allocating sets of processors, called clusters, for the scheduled jobs (cluster allocation). For a given application, the optimal cluster size that achieves its lowest execution time is often much lower than the total number of processors available in the parallel computer. Tzen and Ni [1] demonstrated that exceeding the optimal number of processors can lead to performance degradation due to the overhead of communication among many processors, as well as the overhead of distributing programs and data to the processors in a distributed-memory machine. Therefore, since the optimal cluster size of many jobs is less than the total number of processors available, it is possible for many different user jobs to execute concurrently in a distributed-memory parallel computer.
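To make this point concrete, consider a toy timing model (our own illustration, not a model taken from this dissertation or from [1]): a job's work W is divided evenly among p processors, while communication and data-distribution overhead grows roughly linearly with p. Under these assumptions, turnaround time is minimized at an intermediate cluster size, and adding processors beyond that point slows the job down. The numbers below are arbitrary.

```python
# Toy model (illustrative only): execution time on p processors is modeled as
# compute time W/p plus a communication/startup overhead that grows with p.
def exec_time(p, work=6000.0, comm_cost=2.0):
    """Hypothetical execution-time model: T(p) = work/p + comm_cost * p."""
    return work / p + comm_cost * p

best_p = min(range(1, 257), key=exec_time)
print("optimal cluster size:", best_p)     # about sqrt(work/comm_cost), here 55
for p in (8, 32, best_p, 128, 256):
    print(p, round(exec_time(p), 1))       # time rises again past the optimum
```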
The MPC and NOW environments may also be characterized as high-performance computing environments, in which the primary concern is performance as perceived by the user. In this kind of environment, it is desirable to maximize system performance without adversely affecting the performance of user jobs.

1.1 Processor Scheduling

Classic work in processor scheduling for distributed-memory machines has focused on the static scheduling problem [2]. For this well-known NP-complete problem, specific job characteristics, such as the number of jobs, the cluster size, the execution time of each job, and the memory requirements of each job, can be determined in advance. A static scheduling algorithm considers some or all of these characteristics and determines a schedule that will minimize the total completion time for a set of pre-submitted jobs.

Static scheduling may be suitable for real-time and/or batch scheduling systems, in which overall completion time is the primary concern. However, in a dynamic environment, in which jobs arrive in a stochastic stream over time, it is impossible to know most job characteristics beforehand. For an incoming job, the system must rely on currently observable information to make scheduling decisions. Such information includes the system's current status, as well as a number of characteristics of executing jobs. These characteristics may include a job's cluster size, its degree of parallelism, its execution time, its message-passing behavior, its memory requirements, and its CPU utilization. A scheduler operating in a dynamic environment may use some or all of these observed characteristics to make job dispatching and cluster allocation decisions about incoming jobs. Due to their interaction, the cluster allocation and job dispatching algorithms may also depend on one another. For example, if the first-come, first-served (FCFS) queueing discipline is used by the job dispatcher, a situation may often occur in which there is no suitable cluster for the job being scheduled, yet there are suitable clusters for many jobs waiting in the queue.

There are two primary criteria for measuring the effectiveness of a system's processor scheduling algorithms: job turnaround time (JTT) and system throughput. JTT measures the time a user's job spends in the system from when it is enqueued to when it finishes its execution, and it is the primary measure of performance from the user's perspective. System throughput measures the number of jobs that the system executes within some unit of time. A good system scheduler in a dynamic environment maintains a high system throughput with a low average JTT.

In a parallel system, processor scheduling can be further classified as occurring at either the system or node level. Figure 1.1 illustrates the process of system-level scheduling. In the figure, solid arrows represent the flow of user jobs among major logical steps of the scheduling process. The job dispatcher first chooses among jobs in a global job queue, sending its choice to the cluster allocation algorithm. The cluster allocation algorithm reserves a cluster of processors for the use of the chosen job. The figure also shows that jobs may not necessarily be given exclusive access to their clusters. That is, one processor belongs to two different clusters, containing threads from both job A and job B.

Figure 1.1: System-Level Processor Scheduling.

Figure 1.2 illustrates an example of node-level scheduling, in which three jobs share access to a single processor through some form of processor sharing. The figure illustrates the process of context switching, which occurs when a job's thread either expires its time quantum or is blocked. This figure presents a simplified model, in which a thread simply returns to the process queue when it is removed from the processor.

Figure 1.2: Node-Level Processor Scheduling.
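As a concrete illustration of the node-level model in Figure 1.2, the following sketch (ours, not part of the dissertation's facility) simulates round-robin sharing of one processor among three job threads; the job names, CPU demands, and quantum are invented for the example.

```python
from collections import deque

# Toy round-robin model of Figure 1.2: each thread runs for one time quantum,
# is context-switched out, and re-enters the process queue until its remaining
# CPU demand is exhausted.
def round_robin(demands, quantum=10):
    queue = deque(demands.items())          # (job name, remaining CPU time)
    finish, clock = {}, 0
    while queue:
        job, remaining = queue.popleft()
        run = min(quantum, remaining)
        clock += run                        # the thread occupies the CPU
        if remaining > run:
            queue.append((job, remaining - run))   # back to the process queue
        else:
            finish[job] = clock             # relative turnaround time
    return finish

print(round_robin({"A": 30, "B": 15, "C": 25}))   # {'B': 45, 'A': 65, 'C': 70}
```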
The focus of the research presented here is how to design good, practical processor scheduling algorithms that increase performance from the standpoint of both the user and the system. We investigated both job dispatching and cluster allocation in two classes of distributed-memory multiprocessors: hypercubes and NOWs. First, we devised a new cluster allocation technique for hypercubes that increases system throughput and decreases observed job turnaround time as compared to existing subcube allocation methods.

Second, we studied cluster allocation techniques for NOWs that take advantage of processor sharing in a high-performance computing environment consisting of scientific workstations connected together by a high-speed network. At the node level, our work used existing Unix operating system facilities to develop processor sharing techniques that adapt to the execution of parallel jobs running on individual nodes. At the system level, the job dispatching and cluster allocation methods use information provided by individual workstations to make scheduling decisions that take advantage of idle CPU cycles on individual workstations.

1.2 Thesis Statement and Research Overview

Thesis Statement

The thesis of our research is that existing processor scheduling facilities, in the form of cluster allocation and job dispatching techniques, can be used to improve the performance of parallel jobs in two classes of distributed-memory architectures: the hypercube and the workstation cluster viewed as a clustered parallel computer (CPC). The performance of the parallel jobs improves in terms of both job turnaround time and system throughput. This performance improvement is realized by taking advantage of underutilized CPU resources.

1.2.1 Research Objectives

Our research can be divided into two primary areas: cluster allocation in hypercubes, and processor scheduling for networks of workstations. Our work with cluster allocation in hypercubes accomplishes two main objectives: increased performance, and practicality. In terms of performance, our cluster allocation method increases overall system utilization and decreases average JTT by eliminating the internal fragmentation that results from subcube-only cluster allocation. Our cluster allocation method, the auxiliary free list (AFL) method [3], allows unused processors in a subcube to be reclaimed by the system for use in other jobs. In terms of practicality, the AFL method can be implemented with any existing subcube allocation method on numerous existing hypercube multiprocessors.

Networks of workstations are very popular as a choice for high-performance parallel computation, primarily due to their flexibility, expandability, and low cost relative to massively parallel processors of similar computational capability. While the individual workstations may be used in the traditional manner as single-user computers, the cluster of workstations may also be used as a compute server. Our research examines the scheduling of resources in a workstation cluster from this standpoint, and our primary goal is to preserve the best possible JTT available to the user while achieving the secondary goal of improved system throughput.
In order to achieve the goals stated here, we have developed processor scheduling techniques that take advantage of idle time in CPUs. This includes system-level and node-level methods, and we present a study of their impact on system performance. The major components of this research include:

• The development of the auxiliary free list (AFL) algorithm for subcube allocation. This cluster allocation technique allows two-dimensional mesh requests of any size to be allocated in a hypercube (limited by the number of processors in the hypercube).

• A simulation study of the AFL method for subcube allocation in hypercubes. We demonstrated that the AFL method, when combined with the well-known free list subcube allocation method, can increase system utilization and decrease average JTT, relative to using the free list method alone. Furthermore, the AFL method can be implemented on any existing hypercube as a supplement to any existing subcube allocation method.

• A feasibility study that demonstrated that timeshared execution of parallel jobs in a cluster can be used to improve system throughput, as well as user JTT.

• An experimental cluster allocation facility, Elmer, that uses realtime scheduling mechanisms to allow the prioritized, timeshared scheduling of parallel jobs in a workstation cluster, as demonstrated in the feasibility study.

• Cluster allocation algorithms for use in a network of workstations. These algorithms are used by the Elmer facility to perform cluster allocation at the system level for NOW-based CPCs. The algorithms are characterized by their use of the current degree of multiprogramming, as well as CPU utilization, at individual nodes in the system.

Throughout our investigations, we have explicitly addressed the practicality of our proposed methods. This is illustrated in three major focuses of our work. First, we made realistic assumptions so that the results of our research may be implemented on existing machines. For example, it is well known that workstation clusters can be subject to periods of intense activity during the day, yet sit almost idle during the night [4, 5]. We concentrated on making more effective use of the workstation environment during periods of low usage, allowing users to log in without fear of having their workstation resources consumed by the computing demands of some parallel job. Second, much of our research involved an actual implementation, simplifying the task of parallel programming for users and providing an example implementation of our research. For example, the AFL method for cluster allocation in hypercubes can be implemented on any existing machine using any known subcube allocation method. Furthermore, our implementation of a scheduling facility for workstation clusters facilitates parallel programming. Currently, one of the most commonly available programming environments for distributed workstation clusters is PVM [6]. While PVM allows users to program heterogeneous computing environments as a single computational resource, it requires them to perform all scheduling and allocation of CPUs in workstation environments. Our facility eliminates this requirement. Third, our research involving workstation clusters is applicable to other architectures, including different workstation environments, as well as other classes of distributed-memory multiprocessors.
1.2.2 Research Contributions

The primary contribution of this research is the demonstration that it is possible for system resources to be reclaimed to improve system and user performance in a distributed computing environment. We developed scheduling techniques that improve system throughput and decrease observed user JTT. Parallel applications rarely, if ever, are able to fully utilize the hardware resources over which they have control. However, most distributed parallel programming environments give user jobs exclusive control over the cluster of processors on which they execute. In some cases, such as subcube allocation for hypercubes, the cluster allocation technique may allocate too many processors to a job. In other cases, such as in a workstation cluster environment, parallel jobs contain message-passing overhead that results in idle CPU time on individual processors. Both of these examples illustrate forms of internal system fragmentation, in which certain system resources (the CPUs) are underutilized. Our research presented methods that overcome these limitations by giving other jobs access to these underutilized processors.

Our research involving cluster allocation in hypercubes demonstrated that unneeded CPU resources can be reclaimed by the system without adversely affecting the performance of user jobs while increasing system utilization and throughput, thereby decreasing average JTT. Our research involving scheduling for networks of workstations demonstrated the feasibility of the timesharing of jobs in distributed-memory computing environments. Furthermore, we provided cluster allocation techniques that take advantage of timeshared job execution. We showed that improved system throughput can be attained at little or no cost to user JTT in a high-performance computing environment.

1.3 Organization of Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 defines the problem of processor scheduling and presents work relevant to our research. A classification of parallel architectures according to a layered architecture of parallel programming models [7] is presented. In addition, work related to our investigations in scheduling hypercubes and workstation clusters, including the general class of distributed-memory parallel machines, is discussed. Chapter 3 presents the AFL algorithm for cluster allocation in hypercubes, including the motivation for, development of, and complexity analyses of the algorithm. Chapter 4 presents the results of a simulation study that compared the AFL cluster allocation method to the free-list subcube allocation method for numerous different workloads in a 1024-processor hypercube. Chapter 5 presents our initial study into the feasibility of timesharing in a Clustered Parallel Computer (CPC). The results of this study verified the feasibility of using timeshared scheduling in distributed-memory parallel computers. Chapter 6 describes the scheduling facility that we implemented to accomplish the timeshared scheduling of jobs in a CPC, also presenting the experimental results used to verify the operation of our scheduling facility. Finally, Chapter 7 presents our conclusions and directions for future work.

Chapter 2

Problem Statement and Related Work

Processor scheduling in parallel machines consists of two major components: job dispatching and cluster allocation. The purpose of a job dispatching algorithm is to order the execution of enqueued user jobs. The purpose of the cluster allocation algorithm is to determine a set of processors on which to allocate the next scheduled job.

In a parallel system, the scheduling of work on processors occurs at both the system and node levels. At the system level, the job dispatching algorithm implements some form of queueing discipline for newly submitted user jobs. At the node level, the processor sharing algorithm determines the order of execution of jobs on individual processors.

The most common form of job dispatching algorithm is to place arriving jobs in an FCFS queue. This method ensures fairness to all jobs, because every job is guaranteed to eventually receive service. However, the disadvantage of FCFS is that a large job waiting at the head of the queue can block access to smaller jobs waiting behind it in the queue, even if the system has the resources to service the smaller jobs. Therefore, the use of an FCFS queue can result in lower system utilization.
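The head-of-line blocking effect can be illustrated with a small sketch (our own, with made-up job sizes and free-processor count): under strict FCFS nothing is dispatched because the job at the head does not fit, while a dispatcher that is allowed to scan past it could start the two smaller jobs immediately.

```python
# Hypothetical illustration of FCFS head-of-line blocking.
queue = [("J1", 16), ("J2", 4), ("J3", 4)]       # (job, requested cluster size)
free = 8                                         # free processors in the system

def fcfs_dispatch(queue, free):
    started = []
    for job, size in queue:
        if size > free:
            break                                # head of queue blocks the rest
        started.append(job)
        free -= size
    return started

def scan_dispatch(queue, free):
    started = []
    for job, size in queue:                      # skip jobs that do not fit
        if size <= free:
            started.append(job)
            free -= size
    return started

print(fcfs_dispatch(queue, free))                # []
print(scan_dispatch(queue, free))                # ['J2', 'J3']
```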
The processor sharing algorithm determines the degree of multiprogramming at individual nodes, which is the number of jobs that execute on one processor at once. Job dispatching and processor sharing algorithms can be highly dependent on the architecture of a parallel computer. For example, a hypercube may implement no processor sharing algorithm whatsoever, while a distributed workstation cluster might use a variation of the round-robin timesharing technique to implement processor sharing.

Cluster allocation techniques are also highly dependent on the architecture of the parallel computer. Due to the difficulty of connecting hundreds or thousands of processors together, the interconnection network of a machine may be constructed according to a regular geometry, such as a hypercube or 2D mesh. Such geometries impose restrictions on communication among processors, which makes it desirable for programs to execute on clusters of processors that are grouped closely together. For example, a cluster in a 2D mesh consists of a contiguous X × Y submesh, as opposed to a randomly picked set of X × Y processors. The geometry can also affect the number of processors allowed to be allocated to an application. For example, in hypercubes, cluster allocation is usually restricted to subcube allocation, in which a user job must execute on a subcube of 2^k processors (for some k ≥ 0).

In other systems, cluster allocation techniques are somewhat less complicated. For example, in a bus-based shared-memory multiprocessor, the communication medium is a shared bus coupled with a shared memory. In such a system, the limitation on communication exists only in the bandwidth of the bus and memory. Communication occurs at a uniform rate from one processor to another, as well as between any processor and any memory location. Therefore, cluster allocation in a bus-based system is as simple as obtaining a set of processors, although the total number of processors available may be limited by the bandwidth of the memory and bus.

The wide variety of parallel computer architectures available today implies that there is a wide variety of processor scheduling algorithms. Factors affecting scheduling decisions include the memory access speed, the method of accessing memory, and the type of interconnection network. We present a classification of different parallel architectures that will facilitate the study of the many different methods that have been developed.

Ni [7] defined a layered architecture of parallel programming models.
The first two levels of this model define the memory architecture and the memory address space, and are presented in Figure 2.1.

Figure 2.1: Memory Architecture and Memory Address Space Models.

At the first level, the memory architecture defines the method of memory access used by the system hardware. In NUMA (Non-Uniform Memory Access) architectures, the speed at which a word in memory is retrieved is dependent on the physical location of the memory word. For example, a processor in a NUMA machine might have access to both local and remote memory; local memory access can be considerably faster than access to remote memory. In UMA (Uniform Memory Access) architectures, memory access speed is not dependent on its physical location. That is, speed of access is the same for all physical locations.

The memory address space, which can be either single or separate, defines the memory as it is seen by an executing parallel program. In the separate address space model, each sub-process (or thread) of a parallel application has its own address space, to which no other thread has direct access. In the single address space model, the program views memory as one large, uninterrupted address space. Each thread of a parallel application may access any location in this address space. This model can be further classified as either static or dynamic binding. In a static binding single address space, there is exactly one mapping of a virtual memory address to a physical location (that is, once the virtual memory address is assigned a physical address, it remains that way for the duration of program execution). In a dynamic binding address space, virtual memory addresses may migrate from one physical address to another.

Memory Access Method   Address Space   Binding Type
NUMA                   single          static / dynamic
                       separate        static
UMA                    single          static
                       separate        static

Table 2.1: Architecture classification according to memory access.

From the model presented in Figure 2.1, we can derive a taxonomy of parallel architectures classified by their memory access methods, as listed in Table 2.1. For separate address space NUMA machines, in particular, the problem of processor scheduling still contains many open issues involving both job dispatching and cluster allocation. Sections 2.1 through 2.3 identify and discuss some of these issues as they pertain to hypercubes and networks of workstations. Section 2.1 first presents a review of the research in processor scheduling techniques for NUMA architecture multiprocessors, including single and separate address space machines. Section 2.2 then discusses the problem of cluster allocation in hypercube multiprocessors, presenting a review of literature relevant to the research presented in Chapters 3 and 4. Section 2.3 defines the issues of processor scheduling in a workstation cluster and presents a review of literature relevant to the research presented in Chapters 5 and 6.

2.1 Processor Scheduling in NUMA Architecture Multiprocessors

Non-Uniform Memory Access multiprocessors are commonly considered as the class of distributed-memory separate address space machines. However, numerous examples of shared-memory NUMA multiprocessors exist, such as the Kendall Square Research KSR1 [8], the BBN GP1000 and TC2000 [9, 10], and the Cray T3D [7].
In addition, many other distributed-memory machines maintain separate address space as their standard programming model, yet support a shared memory environment. Examples include the nCUBE 3 hypercube [11], and MIT's J-machine [12], a 3-D torus.

Most NUMA architecture multiprocessors can be characterized by the method of interconnection among processors, which is normally some structured topology (e.g., hypercube or mesh). Examples of NUMA multiprocessor interconnections include the multistage interconnection network (MIN), the hypercube, the 2D and 3D mesh or torus, the fat tree [13], and the ring. Examples of the MIN include the BBN GP1000 and TC2000 [9, 10], IBM's RP3 [14] and SP-1 [7], and the NYU Ultracomputer [15]. Examples of the hypercube include the Intel iPSC [16], the nCUBE 2 and 3 [17, 11], and the Thinking Machines Corporation's CM-2.

2.1.1 Single Address Space Machines

Static Binding Machines

In a single address space NUMA machine, all memory references are part of a single global address space. The implementation of this global address space may either be by hardware support for direct remote memory access, as in the case of the Cray T3D, or by a special network that separates processors from memories, as in the case of the BBN GP1000, BBN TC2000, IBM's RP3, and the NYU Ultracomputer.

The Cray T3D is a massively parallel computer consisting of a 3-dimensional torus that connects nodes together. In this machine, a node consists of two PEs (processor/memory pairs). Currently, no information is available as to processor scheduling techniques on the Cray T3D.

In the case of the BBN GP1000 and TC2000, as well as the IBM RP3 and SP-1, processors are connected together via a multistage interconnection network (MIN), in which each stage of the network consists of a line of small crossbar switches. The outermost two stages are connected on one side to processors and on the other to memories (or, alternatively, reconnected to the processors), while the inner stages of the MIN are connected so that a direct link may be established, by switching through the MIN, from one processor to another (or from a processor to a memory module). To our knowledge, there have been no substantive studies on processor scheduling in MINs.

Dynamic Binding Machines

In a single address space dynamic binding NUMA machine, the binding between a virtual address and a physical address is not permanent. This class of machines attempts to take advantage of the principle of locality by moving the physical location of data when necessary. That is, if a processor issues a memory reference, and the data it refers to is not stored locally, then the dynamic binding mechanism moves the data over to the local processor's memory. There is only one known example of this kind of machine, the Kendall Square Research KSR 1 [8].

The KSR 1 is implemented using a fat-tree [13] and is scalable from 8 to 1,088 processors. The lowest level of the tree connects up to 32 processors using a form of token ring. Cluster allocation on the KSR 1 is accomplished hierarchically according to the organization of the system, implying that, on a fully configured system, a process may allocate from 8 to 1,088 processors, increasing the size of the cluster by adding more leaves of the fat-tree.

2.1.2 Separate Address Space Machines

Separate address space parallel machines are often characterized by the geometry of their interconnection networks.
Although many different interconnection network geometries have been proposed, only a few have been implemented in actual machines. These include the hypercube [18], the k-dimensional mesh, and the multistage interconnection network (MIN) [9, 10]. The research presented in this section primarily considers cluster allocation techniques in meshes and MINs, while Section 2.2 presents research relevant to our investigations in hypercubes.

Cluster allocation in mesh-connected parallel computers is concerned with obtaining 2-dimensional submeshes for allocation to user applications. The first proposed method, by Li and Cheng, is known as the two-dimensional buddy (2DB) strategy [19]. This strategy is a generalization of the buddy strategy developed for subcube allocation in hypercubes. For this strategy to apply, meshes must be square, having side lengths that are a power of two. Additionally, each submesh request must be a square submesh whose side lengths are a power of two. This method suffers from two primary drawbacks: (1) large internal fragmentation results from forcing the allocation of square submeshes with side lengths that are a power of two; (2) it is not applicable to general 2-dimensional meshes.

Chuang and Tzeng [20] developed the frame sliding strategy to address the drawbacks of the 2DB strategy. In this method, the incoming request dimensions are treated as a frame used to search for available space in the mesh. The search begins at the lowest leftmost available processor. When a frame area is unavailable, the frame is slid horizontally or vertically with a stride equal to the width or height, respectively, of the frame. Zhu [21] recognized that the frame sliding method also suffers from external fragmentation, as well as the inability to recognize all available free submeshes in its searching technique. Therefore, he developed the BF and FF (best-fit and first-fit) strategies, which are capable of recognizing any available submesh.

For a w × h mesh, either of Zhu's BF or FF strategies requires O(w · h) time to allocate a submesh, which can be prohibitive in large meshes. To eliminate this problem, Bhattacharya et al. [22] developed a method based on computing free submesh regions. For a mesh containing n currently allocated submeshes, their method requires O(n^2) time to allocate a submesh. They point out that since n is often small, their method usually outperforms Zhu's, unless n ≥ sqrt(w · h).

For job scheduling on meshes, most authors assume FIFO queueing and use some submesh allocation algorithm to perform cluster allocation [19, 20, 21]. However, Bhattacharya et al. [22] examined lookahead processor scheduling in the context of submesh allocation. They modified two strategies, BF and WF (worst-fit), to examine the effects of scheduling if the contents of the job queue are considered. In all cases, their lookahead method achieved better performance than the comparable method with no lookahead.
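The flavor of these submesh searches can be sketched as follows. This is a deliberately naive first-fit scan over a boolean busy map, our own illustration rather than Zhu's actual data structures, and it makes visible the O(w · h) cost of examining every candidate base node as well as the external fragmentation a submesh request can encounter.

```python
# Simplified first-fit submesh search (illustrative sketch, not Zhu's algorithm):
# scan every possible base node of a W x H mesh and return the first base at
# which a w x h frame of free processors fits.
def first_fit_submesh(busy, w, h):
    W, H = len(busy), len(busy[0])               # busy[x][y] is True if allocated
    for x in range(W - w + 1):
        for y in range(H - h + 1):
            if all(not busy[x + i][y + j]
                   for i in range(w) for j in range(h)):
                return (x, y)                    # base of a free w x h submesh
    return None                                  # no contiguous frame available

mesh = [[False] * 8 for _ in range(8)]           # an 8 x 8 mesh, initially free
for i in range(4):                               # one 4 x 4 job in a corner
    for j in range(4):
        mesh[i][j] = True
print(first_fit_submesh(mesh, 5, 5))             # None: 48 free nodes, but no 5 x 5
print(first_fit_submesh(mesh, 4, 4))             # (0, 4): a 4 x 4 still fits
```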
2.2 Cluster Allocation in Hypercubes

Most research on cluster allocation in hypercubes is concerned with obtaining a subcube for allocation to a user application. In a hypercube of degree n (n ≥ 0), a subcube is a cluster of processors that is, itself, a hypercube of degree k (0 ≤ k ≤ n). Our investigations examined the question of whether processors left over from a subcube allocation can be reclaimed by the system for use by other jobs. The motivation for this work is that the optimal cluster size for a particular job may be much lower than the dimension of subcube required to obtain the required number of processors for the job. Because having too many processors can be detrimental to the performance of a job, we proposed and studied the auxiliary free list (AFL) strategy [3], in which 2D meshes are used as clusters in hypercubes. The AFL strategy achieves a more effective utilization of processors in hypercubes. Details of this method are given in Chapters 3 and 4.

The most well-known subcube allocation methods are the buddy system [23], the gray code method [24], and the free list method [25]. Other variations include the modified buddy strategy [26], the associative memory method [27], and the MPP method [28]. The buddy system was first proposed in 1965 as a memory storage allocation scheme. For a hypercube Q_n, it uses 2^n bits to keep track of the availability of the 2^n nodes. A k-cube (0 ≤ k ≤ n) is located by finding a contiguous sequence of 2^k bits whose addresses start with an integral multiple of 2^k. For a subcube of dimension k, the buddy system can recognize 2^{n-k} subcubes.

The gray code method is a variant of the buddy system. The buddy system searches a list of addresses that are arranged in ascending order from 0 to 2^n − 1 in an n-cube. By arranging the addresses as a gray code listing, the gray code method is able to detect twice as many subcubes as the buddy system. The so-called modified buddy strategy [26] was proposed as an improvement on the buddy system and gray code strategies. Like the gray code method, it works by making a more effective search of the list of processor addresses and their corresponding allocation bits.

The free list method is unique due to its method of search for subcubes. Instead of a list of processor addresses with allocation bits, the free list method only maintains a list of currently available subcubes. When a subcube of dimension k is requested, a free k-cube is either immediately allocated, or the free list is searched in higher dimensions until a larger subcube is found, decomposed (broken into smaller subcubes), and a k-cube allocated.

The free list method is the only practical method that provides complete subcube recognition. That is, for an n-cube and a k-cube request, there are 2^{n-k} C(n, k) possible recognizable subcubes, since there are C(n, k) ways to arrange the "don't care" bits of a k-cube, and 2^{n-k} ways to arrange the remaining n − k bits. For a comparable request, the buddy system is able to recognize 2^{n-k} subcubes, and the gray code method can recognize 2^{n-k+1} subcubes.
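A minimal sketch of the buddy search just described (our own illustration, not the implementation analyzed in this thesis): availability is kept as one bit per node, and a k-cube must occupy a run of 2^k consecutive addresses aligned on a multiple of 2^k, which is exactly why the buddy system recognizes only 2^{n-k} of the possible subcubes.

```python
# Buddy-system style k-cube search (illustrative sketch): free[i] is True if
# node i of the n-cube is available.  A k-cube corresponds to a block of 2**k
# consecutive addresses aligned on a multiple of 2**k.
def buddy_allocate(free, k):
    size = 1 << k
    for base in range(0, len(free), size):       # only aligned blocks are checked
        if all(free[base:base + size]):
            for i in range(base, base + size):   # mark the subcube allocated
                free[i] = False
            return base                          # nodes base .. base + 2**k - 1
    return None

free = [True] * 16                               # a 4-cube (16 nodes)
print(buddy_allocate(free, 2))                   # 0: nodes 0-3 are allocated
print(buddy_allocate(free, 3))                   # 8: the aligned block 8-15
```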
The associative memory method [27] also provides complete subcube recognition. However, it requires access to an associative memory to determine subcube availability. For a hypercube of dimension n, the memory required is 2^n n-bit words. For example, a 10-cube would require an associative memory of 1024 10-bit words. Although it has constant-time performance for subcube allocation, the hardware costs of this associative memory would be prohibitive in a large-scale hypercube.

Krueger et al. [29] studied different job scheduling techniques in hypercubes. They argued that job scheduling can have a greater effect on system throughput and job turnaround time than the choice of subcube allocation method. Instead of FCFS scheduling, they proposed and implemented a family of scheduling disciplines, called scan, that have a great effect on system performance. Their studies indicated that an effective job scheduling strategy is more important than the cluster allocation strategy in hypercubes.

2.3 Processor Scheduling in Workstation Clusters

A network of workstations (or workstation cluster) can be defined as a set of scientific workstations connected together by some form of network. Normally, workstations are characterized as containing their own disk drives, memory, operating system, keyboard, and monitor. That is, a workstation is an independent computer capable of functioning without the aid of services from other computers. Parallel programming in workstation clusters is typically performed using user-level software systems, such as PVM [6], Linda [30], P4 [31], or MPI [32]. Such software systems enable the user to treat networks of machines as single "virtual machines" by implementing message-passing and other protocols over existing network protocols. To date, very little research has been published on the subject of scheduling such parallel jobs on clusters of workstations.

Workstations were not originally intended for the execution of large-scale parallel jobs, and the use of standard network protocols, coupled with software systems like PVM, Linda, or P4, makes it likely that the message-passing performance of jobs in that environment will be lower than that on a purpose-built MPP. Although research is under way towards building low-overhead message-passing protocols for networked computer systems [33], workstation clusters are still at a disadvantage because they must be general enough to accommodate the software for which the network protocols were originally built. Lower message-passing performance can result in lower program performance, as well as significant amounts of idle CPU time. With this assumption, our research examined the effects of allowing multiple jobs to execute on individual workstations. Measurable improvements in system throughput were observed, often at little cost to the performance of individual jobs. Our research focused on scheduling algorithms, at the system and node level, that are optimized towards maximizing the benefit of allowing timeshared job execution in a cluster of workstations.

A central issue in research in scheduling for a workstation cluster is how it is perceived as a computational resource. There are two primary models. A workstation cluster can be treated as a collection of independent computers, each of which is "owned" by its primary user, or it may be treated as a compute server, available for the execution of any job. Under the ownership model, the quality of service to a workstation's owner is the primary concern. If the owner is currently using his/her workstation, then he/she should perceive no change in responses from the system. The compute server model views the set of workstations in much the same way as a massively parallel processor might be viewed, which is as a single computational resource on which to execute parallel jobs.

A number of studies have examined scheduling in workstation clusters. Most of these studies are concerned with the ownership of a workstation by its primary user, and are thus concerned with preserving the owner's quality of service while finding idle cycles with which to execute jobs [4, 5, 34, 35, 36]. Litzkow et al. [5] developed Condor as a means to exploit idle workstations during non-peak usage times.
In developing the scheduling algorithm for Condor, Mutka and Livny [4] addressed the problem of long-term scheduling for jobs that execute for long periods of time with little need for interaction or interprocess communication. Theimer and Lantz [34] examined the tradeoffs between centralized and decentralized scheduling facilities, concluding that although decentralized facilities are more fault tolerant, centralized facilities are more scalable and perform better. Stumm [35] developed a decentralized facility for automatically scheduling jobs in a distributed, heterogeneous computing environment. Stumm argued that parallel jobs require exclusive access to the clusters allocated to them in order to execute at optimum efficiency.

All of the above workstation scheduling methods use a conservative approach to preserving the owner's quality of service, which is to ensure that remote jobs are not executed on previously occupied workstations. In contrast, Krueger and Chawla [36] use priority resource allocation in the form of prioritized CPU scheduling, file system access, and memory management. Their argument is that a significant portion of idle workstation cycles go unused by other distributed schedulers simply because the cycles exist on previously occupied workstations. They showed that it is possible for remote and local jobs to coexist through judicious use of priority mechanisms.

One study has examined the workstation cluster in the context of high-performance parallel computing. Atallah et al. [37] examined the problem of using a network of workstations for computationally intensive parallel applications. They presented analytical arguments that co-scheduling (gang scheduling) is necessary to guarantee that subtasks start concurrently and execute at the same pace, although they presented no conclusions about the performance of gang scheduling in a workstation cluster.

In a study not specifically related to workstation clusters but relevant to our research, Setia et al. [38] compared the performance of simple round-robin scheduling with that of "run to the end of phase" scheduling, in which a job is suspended when it reaches the end of a computation/communication phase. They proposed a processor scheduling method in which incoming parallel processes share time on equal-sized clusters. By using analytical and simulation studies, they concluded that allowing processor sharing on individual nodes can be of some benefit to system throughput performance.

Chapter 3

Cluster Allocation in Hypercubes

This chapter discusses our investigations in the area of processor scheduling in hypercube multiprocessors. Specifically, we addressed the problem of cluster allocation [3]. Our research achieved an improvement in both response time and overall system utilization by allowing 2D mesh clusters to be allocated within a hypercube, making obsolete the traditional requirement that only subcubes be allocated. The technique we introduced, known as the auxiliary free list method, can be implemented on existing systems with any known subcube allocation method.

3.1 Problem Definition

The problem of subcube allocation has been studied extensively to maximize processor utilization and minimize system fragmentation in hypercubes. Several strategies have been proposed and implemented for subcube allocation, including the buddy strategy [23], the gray code (GC) strategy [24], the modified buddy strategy [26], the MPP method [28], and the free list strategy [25].
Of these approaches, only the MPP method and the free list strategy have been shown to perform optimally, since they provide perfect subcube recognition. Additionally, the free list strategy operates with lower overhead than the MPP method, and is therefore regarded as the best subcube allocation policy.

For hypercube machines, such as the nCUBE-2 and the nCUBE-3, the restriction of allocating subcubes causes low processor utilization. Although the hypercube is a powerful network topology [39], 2D and 3D meshes are more popular application topologies. For example, grid domain decomposition for solving partial differential equations is an application that can easily be implemented on 2D and 3D meshes. In addition, 2D and 3D meshes can be used more efficiently by allocating exactly the number of processors requested. For example, if the optimal number of processors for a task is 600, then the smallest subcube that can be allocated is 1024 processors, resulting in a waste of 424 processors, while a 2D mesh may allocate a 20 x 30 cluster.

Consider the 4-dimensional cube shown in Figure 3.1, in which one job is allocated a 2 x 5 mesh, and another job is allocated a 2 x 3 mesh. With a restriction to subcube allocation, both jobs cannot be executed simultaneously, even though the total number of processors, 16, is sufficient. Without the subcube restriction, both clusters may be allocated in the 4-cube, as shown in Figure 3.1. However, a closer look reveals that communication from node 0100 to node 1010 in the 2 x 5 cluster may cause link contention with communication between nodes 0110 and 0010 in the 2 x 3 cluster, if the popular deadlock-free E-cube routing is used [40].

If nothing is assumed about communication among processors within a cluster, then communication may occur between any two processors within any cluster. The resulting potential for contention can produce intercluster communication interference, which should be avoided. Many known processor allocation strategies have been developed to guarantee contention-free cluster allocation, such as the subcube allocation strategies for hypercubes [23, 24, 25, 26, 27, 28], the 2D mesh allocation techniques [19, 20, 21, 22], and the strategy used in the CM-5 [41] (fat-tree topology).

[Figure 3.1: Communication interference in an embedded mesh in a subcube. (a) Cube layout. (b) Mesh layout.]

Non-cubic (NC) cluster allocation in a hypercube is the process of allocating x connected nodes, where x is not a power of two (x ≠ 2^k). Kim et al [42, 25] performed preliminary investigations into the possibility of non-cubic cluster allocation. Their method requires that a non-cubic cluster be composed of adjacent cubes. For example, if a cluster request consists of 11 nodes, the free list allocation algorithm is used to find adjacent 3- and 2-cubes (or alternatively, a 3-cube, a 1-cube, and a 0-cube) to form a cluster of 11 processors. In their study of NC allocation, the effects of intercluster communication interference are not explicitly addressed.

It is well known that the hypercube may embed a k-dimensional mesh with dilation 1 [43]. The auxiliary free list method uses dilation-1 embeddings of meshes to address the problems of low processor utilization and intercluster communication interference. The allocation method is contention-free and uses existing deadlock-free E-cube routing.
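The link-contention scenario just described (the 2 x 5 job routing from 0100 to 1010 while the 2 x 3 job routes from 0110 to 0010) can be checked mechanically. The following Python sketch is illustrative only and is not part of the dissertation's simulator; it assumes the E-cube convention stated later in this chapter (the lowest-order differing bit is corrected first), and the helper name ecube_path is mine.

```python
def ecube_path(src, dst):
    """Hops taken by dimension-ordered (E-cube) routing, assuming the
    lowest-order differing bit is corrected first.  Addresses are equal-length
    bit strings; bit i of an address is the character at index len-1-i."""
    hops = []
    cur = list(src)
    for idx in range(len(src) - 1, -1, -1):   # index len-1 is dimension 0
        if cur[idx] != dst[idx]:
            nxt = cur[:]
            nxt[idx] = dst[idx]
            hops.append(("".join(cur), "".join(nxt)))
            cur = nxt
    return hops

# The Figure 3.1 example: a message of the 2 x 5 cluster (0100 -> 1010) and a
# message of the 2 x 3 cluster (0110 -> 0010).
p1 = ecube_path("0100", "1010")   # [('0100','0110'), ('0110','0010'), ('0010','1010')]
p2 = ecube_path("0110", "0010")   # [('0110','0010')]
print(set(p1) & set(p2))          # {('0110', '0010')} -- the same channel is used twice
```

Running the sketch shows that both messages traverse the channel from 0110 to 0010, which is exactly the kind of intercluster interference the auxiliary free list method is designed to rule out.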
In addition, it improves performance over the free list method by decreasing job turnaround time and increasing processor utilization. Our investigations address 2D mesh allocation. For applications not requiring a 2D mesh, the closest 2D mesh is assumed to be allocated.

3.2 Notation

Let Q^n denote an n-dimensional hypercube of 2^n nodes, in which the addresses of nodes and subcubes are represented by an n-bit string α = a_{n-1} a_{n-2} ... a_0, with a_i ∈ {0, 1, *} = Σ, where bit a_i corresponds to dimension i and '*' represents the "don't care" symbol. Σ^k is the set of k-bit address strings whose elements are in Σ. Σ^n can be used to uniquely represent addresses in an n-cube, and operations can be defined on cube addresses as follows.

Definition (Hamming Distance): For two address strings α = a_{k-1} a_{k-2} ... a_0 and β = b_{k-1} b_{k-2} ... b_0 in Σ^k for some integer k, the Hamming distance is defined as

H(α, β) = Σ_{i=0}^{k-1} h(a_i, b_i),

where h(a, b) = 1 if a, b ∈ {0, 1} and a ≠ b, and h(a, b) = 0 otherwise [42, 25].

Definition (Binary Reflected Gray Code, BRGC): Let G_n be the n-bit BRGC. G_n is defined recursively as follows:

G_1 = {0, 1},   G_i = {0 G_{i-1}, 1 G^r_{i-1}},

where G^r_i is the set obtained by reversing the order of G_i. For example, G_2 = {00, 01, 11, 10} and G^r_2 = {10, 11, 01, 00}; hence G_3 = {000, 001, 011, 010, 110, 111, 101, 100}. Furthermore, g_i will refer to the ith gray code in a sequence. For example, in G_3, g_4 = 110. A more general definition of gray codes is given in [24].

Definition (Adjacent and Complementary Cubes): Given two cubes identified by address strings α = a_{n-1} a_{n-2} ... a_0 and β = b_{n-1} b_{n-2} ... b_0, they are adjacent if H(α, β) = 1. The complement of α at j is α^(j) = a_{n-1} a_{n-2} ... a_{j+1} ā_j a_{j-1} ... a_0, where 0 ≤ j ≤ n - 1 and a_j, ā_j ∈ {0, 1}.

In order to introduce a lemma relevant to our new allocation scheme, a new notation for binary addresses, as well as communication channels, is presented. A node address S is represented by S = S_h s_c S_l, where S_h is a binary string representing its high-order bits (those bits that are higher-order than the communication channel), s_c is the one-bit string in the position of the communication channel, and S_l is a binary string representing the low-order bits. A communication channel is represented as C = C_h λ C_l, where λ ∈ {↑, ↓}. Here, ↑ indicates a bit changing from 0 to 1 and ↓ indicates the opposite. Thus, C = C_h ↑ C_l indicates a communication channel in which the direction is 0 → 1. For example, consider the following two node addresses that share a communication channel: N1 = 0111010 and N2 = 0110010. If communication is flowing from N1 to N2, then the channel is represented by C = 011 ↓ 010, with C_h = 011 and C_l = 010. For N1, s_c = 1, and for N2, s_c = 0.

The following lemma states important properties shared between source and destination nodes, as well as a communication channel used by messages sent from the source to the destination.

Lemma 1. Let S = S_h s_c S_l be a source address, and let D = D_h d_c D_l be a destination address, where s_c and d_c denote the bit used by communication channel C. Without loss of generality, let C = C_h ↑ C_l. Then E-cube routing from S to D will use channel C if the following three conditions hold: (1) s_c = 0 and d_c = 1, (2) S_h = C_h, and (3) D_l = C_l.

Proof: (1) Since C is 0 → 1, the channel bit on S must be 0, and the channel bit on D must be 1. (2) E-cube routing examines and (possibly) changes the lowest-order bit first, followed by the second lowest-order bit, and continues until the highest-order bit is examined. Assume that I_j is the intermediate address produced after the jth step of the E-cube algorithm. Also assume that S_h ≠ C_h and that C is an intermediate channel used by the E-cube routing algorithm. Before the first step, I_0 = S (I_0 = S_h s_c S_l). If |S_l| = k, then after k steps of the E-cube algorithm, I_k = S_h s_c D_l. Since C is the channel used, this condition contradicts the assumption that S_h ≠ C_h. Therefore, S_h must be the same as C_h. (3) D_l = C_l is seen from the proof of item (2), in which after k steps of the E-cube algorithm, I_k = S_h s_c D_l. Therefore, D_l = C_l. □

The following lemma states properties about the addresses of the leftover cubes.

Lemma 2. Let G_n be the n-bit BRGC, and let H_i be any subset of BRGC gray codes having 2^i elements (0 ≤ i ≤ n). If all gray codes in H_i have the same n - i most significant bits, then the cube address for this set of nodes may be represented by replacing the i least significant bits of any gray code in H_i with '*'.

Proof: The proof of this lemma is obtained by observing the n-bit BRGC (for any n > 0). □

3.3 The AFL Cluster Allocation Method

The primary objective of the auxiliary free list is to supplement existing subcube allocation methods by partitioning the unused nodes into subcubes, which are then stored in the auxiliary free list. Leftover nodes are partitioned into subcubes so that clusters may be guaranteed to be contention-free. In our investigations, the auxiliary free list was used to supplement the free list algorithm [25]. However, the AFL may be used in conjunction with any of the well-known subcube allocation methods. Furthermore, since this study restricts itself to the allocation of 2D meshes in subcubes, a modified version of the free list algorithm, reflecting this restriction, is presented.

The free list algorithm for subcube allocation has been shown to be the most effective algorithm for recognizing available subcubes. It maintains a list of currently available subcubes for each subcube dimension. When an incoming request requires a subcube of dimension k, the free list algorithm can check, in O(n) time (for a hypercube Q^n), whether a k-cube can be obtained.

There are three main issues to consider with auxiliary free list allocation: (1) how to partition leftover nodes, (2) allocation using the auxiliary free list, and (3) deallocation using the auxiliary free list.

In the free list algorithm, subcubes exist in one of two states: allocated (in use) or free (available for cluster allocation). Available subcubes are located by consulting a list of currently free subcubes, rather than by consulting a list of what is currently allocated. The AFL introduces a third state, in which a subcube is leftover from a 2D mesh embedding. Subcubes in this state are maintained in the auxiliary free list, which has the same format as the original free list.

To embed an X x Y mesh cluster, it is necessary to find an available k-cube (k = ⌈log2 X⌉ + ⌈log2 Y⌉) for a dilation-1 embedding. This embedding leaves 2^k - XY leftover nodes. The AFL allocation algorithm generates leftover subcubes that comprise these leftover nodes, placing them in the auxiliary free list. Each 2D mesh cluster that generates leftover subcubes is also associated with an ownership list, containing only the leftover subcubes that it generates. This ownership list is maintained so that the original k-cube may be quickly recovered, if possible, when the 2D mesh cluster is deallocated.
That is, the cube manager can quickly check whether the cubes in its ownership list are allocated to determine whether they may be concatenated back into the original k-cube. The ownership list is explained further in Section 3.3.3. 3.3.1 Partitioning Leftover Subcubes The partitioning algorithm (Figure 3.2) is the means by which the AFL algorithm extracts leftover subcubes from a 2D mesh allocation. For a dilation-1 2D mesh embedding, assume that t = [log2 X] and m = [log2 Y] (k = t + m). If G, is used to denote the i-bit binary- reflected grey-code (BRGC), then the first X and Y codewords of Cl and Gm, respectively, are used to address the embedded mesh in the X and Y dimensions. Let L; and Ly be the set Of unused codewords resulting from the embedding Of the X and Y dimensions, respectively (leI = 2‘ — X and [Ly] = 2’" — Y). In addition, let A; and A3, be the set of codewords used for X and Y dimension embedding, respectively (IAII = X and [Ag] = Y). Figure 3.3 illustrates the regions Of the subcube (in a mesh layout form) covered by the sets Ax, Ag, L3, and Ly. In addition, it illustrates two overlapping regions resulting from an embedding. Region (1) represents the area defined by ([Lzl x IGmI), and region (2) represents the area defined by (ILy| x IGgI). It is desirable for the algorithm to produce the largest possible leftover subcubes. Hence, there are two cases: (a) region (1) is larger than region (2); (b) region (2) is larger than region (1). The algorithm is essentially identical 33 for either case, the only difference being the sets Of codewords being used in steps (1) and (2) of the partitioning algorithm. Thus, for clarity, Figure 3.2 presents the algorithm as handling only case (a). Algorithm: Partitioning. Input: a subcube Of dimensionality 1:, an X x Y mesh request. Output: A set of leftover subcubes. Procedure: (1) For all elements Of LI: Compute z-component subcube addresses by changing low-order bits of subsets Of L3 tO *. Replace all L3 addresses with subcube addresses computed from L,. Add the concatenation of each :r-component with *m to the AFL. (2) (a) For all elements Of Ly: Compute y-component subcube addresses by changing low-order bits Of subsets Of Ly to tr. Replace all Ly addresses with subcube addresses computed from L1,. (b) For all elements of AI: Compute z-component subcube addresses by changing low-order bits of subsets of AC to at. Replace all A,B addresses with subcube addresses computed from Am. (C) For all elements Of A: and Ly: Compute subcube addresses by concatenating members Of Ly to members of AI. Add each computed address to the AFL. Figure 3.2: The partitioning algorithm. The partitioning algorithm takes advantage of the property given in Lemma 2, which enables the algorithm to easily compute cube addresses by using an address from some gray code subset and replacing the i least significant bits with ’*’. For example, in the 4-bit BRGC, the sequence Of 4 codes {1010, 1011, 1001, 1000} has “10” as its two most significant bits. The cube address of this sequence is therefore 10 * at. The results of the partitioning algorithm are illustrated in Figure 3.4, in which a 5 x 5 mesh has been embedded in a 6-cube. The cube is pictured in a mesh-layout form. As the figure illustrates, the leftover subcubes generated by the partitioning algorithm are disjoint, +—---.>----»I -/ A \\ \\\\ Figure 3.3: Leftover regions from 2D mesh embeddings. and many pairs of cubes are adjacent. 
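To make the partitioning concrete, the following Python sketch computes leftover subcube addresses for a dilation-1 X x Y embedding using the BRGC addressing of Section 3.2 and the prefix property of Lemma 2. It is illustrative only: the function names (brgc, aligned_blocks, cube_addresses, leftover_cubes) are mine, and only case (a) of the algorithm, where region (1) is the larger region, is handled.

```python
def brgc(n):
    """The n-bit binary-reflected Gray code G_n as a list of bit strings."""
    if n == 0:
        return [""]
    if n == 1:
        return ["0", "1"]
    prev = brgc(n - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

def aligned_blocks(start, stop):
    """Decompose the index range [start, stop) into maximal aligned
    power-of-two blocks; by Lemma 2, each block of 2**i consecutive, aligned
    gray codes shares its most significant bits and so forms an i-subcube."""
    blocks = []
    while start < stop:
        size = 1
        while start % (2 * size) == 0 and start + 2 * size <= stop:
            size *= 2
        blocks.append((start, size))
        start += size
    return blocks

def cube_addresses(dim, lo, hi):
    """Subcube addresses covering codewords lo..hi-1 of the dim-bit BRGC."""
    codes = brgc(dim)
    out = []
    for start, size in aligned_blocks(lo, hi):
        stars = size.bit_length() - 1           # size == 2**stars
        out.append(codes[start][:dim - stars] + "*" * stars)
    return out

def leftover_cubes(X, Y):
    """Leftover subcubes of an X x Y dilation-1 embedding (case (a) only)."""
    l = (X - 1).bit_length()                    # ceil(log2 X)
    m = (Y - 1).bit_length()                    # ceil(log2 Y)
    region1 = [ax + "*" * m for ax in cube_addresses(l, X, 2 ** l)]  # L_x rows, all of G_m
    region2 = [ax + ay
               for ax in cube_addresses(l, 0, X)                     # A_x rows
               for ay in cube_addresses(m, Y, 2 ** m)]               # L_y columns
    return region1 + region2

# The 20 x 30 request discussed above: an 8-cube, a 7-cube, a 5-cube, and a
# 3-cube are produced, accounting for all 424 leftover nodes.
for cube in leftover_cubes(20, 30):
    print(cube, "->", 2 ** cube.count("*"), "nodes")
```

Under the same assumptions, the sketch also reproduces the 5 x 5 example of Figure 3.4, whose 39 leftover nodes in the 6-cube fall into disjoint subcubes of dimensions 4, 3, 3, 2, 1, and 0.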
TO keep them in this form, they are kept separate from the normal free list, as the free list deallocation algorithm would rearrange this format. I Nodes inside the 2-D mesh. I 3 Generated Extra Cubes. 000 001 000 001 011 mo 110 Ill 10] 100 Figure 3.4: Allocation Of a 5 x 5 mesh. Intuitively, it can be seen that using the partitioning algorithm can increase system utilization: the previous example of a 20 x 30 mesh leaves 424 extra nodes that would not be allocated by the original free list algorithm. The partitioning algorithm will generate an 8-cube, a 7-cube, a 5-cube, and a 3—cube as available leftover cubes. 35 Theorem 1 states that the auxiliary free list algorithm will allow multiple 2D meshes to be allocated within one subcube without communication interference if E—cube routing is used. Theorem 1 The leftover cubes created by the 2D mesh partitioning algorithm suffer no communication interference from the 20 mesh cluster, and the 20 mesh cluster suffers no communication interference from the leftover cubes. Proof: First, it is necessary to Show that communication within a subcube never leaves that subcube. Assume, without loss of generality, that a k-cube Q], exists having the address an_1a,,_2-~ak+1aka_1---Xo, where 0:,- 6 {0,1} and XJ- = t. In addition, assume that S and D are source and destination nodes within (21,, respectively, and that the E-cube routing algorithm uses a channel, C, outside Of Qk, when routing from S to D. By Lemma 1, Sh = Ch. Because it exists within Qk, the highest-order n — k bits of S are an_1an_2 - - - ak+1ak. Since C is not contained within Qk, it must be true that C differs from S in at least one of these bits. Assume that this bit is bit m (k S m g n— 1). According to Lemma 1, D3 = Ce. This condition implies that the highest order n — k bits of D differ in at least one position from those of S. However, since D also exists within Qk, it must be true that the highest order n — k bits Of S and D are identical. Therefore, C cannot exist outside of Q], if E—cube routing is used. Therefore, the 2D mesh cluster suffers no communication interference from the leftover cubes. TO Show that the leftover cubes suffer no communication interference from the 2D mesh cluster, it is only necessary to show that, for any (S,D) pair in the 2D mesh cluster, the E—cube routing algorithm will not use any channels used by nodes within any of the leftover cubes. 36 Assume that (51, D1) are two nodes inside the 2D mesh cluster and that (Sz,D2) are two nodes inside a leftover cube generated by the partitioning algorithm. Also assume that C is a channel used by communication from 81 to D1, as well as by communication from 52 to D2. Then by condition 2 of Lemma 1, it must be true that 51 h = 52,, and by condition 3 Of Lemma 1, it must be true that D1, = D2,. Assume that, for a 2D mesh embedding, an address A may be broken into an X com- ponent and a Y component, where A = A x e Ay. For an X x Y mesh embedding, let k = t + m, where E = [logz X] and m = [logz Y]. A X will be 8 bits long, and Ay will be m bits long. Now assume that d, is an element Of either S; or S5 used to generate the X component of some leftover cube B, and let d,- = 2’. Then the X component of B will be Bx = Bg_1fig_2-»-fl,+1fl,-*,_1 '-°*0, where 3,, 6 {0,1} and *p = * (0 S p S t - 1) (recall that ‘*’ represents the don’t care bit). 
Similarly, if dj is an element of S7 used to generate the Y component Of the same leftover cube B, and dj = 23’, then the Y component Of B Will be By = Um_10m_2 ° ' '6j+10j *j—l - - - *0, where 9,, E {0, 1} and *p = * (0 S p S m— 1). The don’t care bits in the X and Y components of B define the communication channels within the leftover subcube. Therefore, there are two cases to consider: (1) C is defined by one of the “don’t care” bits in the X component of the leftover cube B, and (2) C is defined by one of the “don’t care” bits in the Y component of the leftover cube B. (Case 1): If C is defined by the X component of B, then the t — i most significant bits of C define a unique prefix to the address of C. If node 51 exists within the 2D mesh, it cannot have this prefix to the X component Of its address, since all 2i nodes having this prefix are, by definition, part of a generated leftover cube. Therefore, since 51,, 75 52,, 51 cannot use communication channel C in any message it sends. 37 (Case 2): If C is defined by the Y component Of B, then the (3 most significant bits of C define a unique prefix to the address Of C. As in case 1, if node 81 exists within the 2D mesh, it cannot have this prefix to the X component Of its address, since all 2i nodes having the same most significant (3 bits as C are by definition part Of a generated leftover cube. Therefore, since 51, ¢ 52,, 51 cannot use communication channel C in any message it sends. CI 3.3.2 Allocation Of 2D Meshes The AFL method is intended to supplement any existing subcube allocation algorithm. Therefore, cluster allocation using the AFL algorithm is a simple decision procedure, in which the AFL is checked first for an incoming cluster request. If it cannot satisfy an in- coming request, then control is passed to the regular subcube allocation algorithm (modified to make use of the partitioning algorithm). Again, note that a subcube contained in the AFL may also be partitioned to produce a set Of leftover subcubes. Assume that the X x Y request requires a k-cube to be satisfied (k = [log X] + [log2 Y]). The AFL allocation algorithm checks dimensions k through n Of the AFL for the ex- istence of a subcube to satisfy the X x Y mesh request. In the event that a k-cube is not available, but a higher dimensioned subcube is available, the algorithm applies a pro- cess identical to the procedure performed by the free list algorithm, in which a higher- dimensioned subcube is iteratively partitioned into lower-dimensioned subcubes until the required k-cube is Obtained and used. Figure 3.5 illustrates the AFL allocation algorithm. When cluster requests arrive, the AFL allocation algorithm is used to first consult the AFL. If no suitable subcube can be found in the auxiliary free list, the modified free 38 Algorithm: Auxiliary Free List Allocation. Input: An X x Y mesh request. Output: Allocated X x Y mesh (control is passed to the Modified FL algorithm if no k-cube can be found). Procedure: ( 1) If a subcube of dimension k, Qk, exists in the AFL, then (a) Partition the subcube using the Partitioning algorithm (b) Mark the X x Y region of Q], as allocated, and exit. Else, perform step (2). (2) Find the lowest dimension i, k + 1 _<_ i _<_ n, such that a subcube Q,- exists in the AFL. (3) If no such Q,- exists, then perform the normal subcube allocation algorithm (here, the Modified Free List Algorithm). Else, decompose the Obtained subcube Q,- iteratively as follows: (a) While i > k, do: Obtain a subcube Q, from the ith dimension AFL. 
Split Q,- into two subcubes of dimension i — 1, adding them to the AFL of dimension i. (b) Perform step (1). Figure 3.5: The AF L Allocation algorithm. list algorithm (Figure 3.6) is called. The primary difference between the modified free list algorithm and the original free list algorithm is that the modified free list algorithm, instead of simply allocating a subcube, performs the partitioning algorithm to allocate 2D mesh clusters. 3.3.3 Deallocation Of 2D Mesh Requests The deallocation of a 2D mesh request involves three major steps. First, the region embed- ding the mesh is partitioned intO a set Of subcubes in a manner similar to the partitioning algorithm (Figure 3.2). Second, the resulting list of subcubes is coalesced with the AFL subcubes produced when the 2D request was allocated. Third, any higher-dimensional sub- cubes resulting from this coalescing process are then “deallocated” by adding them to the 39 Algorithm: Modified Free List Allocation. Input: An X x Y mesh request. Output: Sends a subcube to the partitioning algorithm or returns nothing if the requested cluster cannot be allocated. Procedure: (1) If a subcube of dimension k, Qk, exists in the FL, then (a) Partition the subcube using the Partitioning algorithm (b) Mark the X x Y region of Q, as allocated, and exit. Else, perform step (2). (2) Find the lowest dimension i, k + 1 g i g n, such that a subcube Q,- exists in the FL. (3) If no such Q,- exists, then Exit (the subcube could not be allocated). Else, decompose the obtained subcube Q, iteratively as follows: (a) While 2' > k, do: Obtain a subcube Q,- from the ith dimension FL. Split Q,- into two subcubes of dimension 2' — 1, adding them to the FL of dimension i. (b) Perform step (1). Figure 3.6: The modified free list allocation algorithm. list Of subcubes to deallocate and, under certain conditions, performing the normal subcube deallocation procedure. The partitioning Of the embedded mesh is performed by the Cluster Partitioning Al- gorithm (Figure 3.7). This algorithm is Similar to the partitioning algorithm (Figure 3.2), except that it partitions the 2D mesh cluster, instead Of the region leftover from the 2D mesh cluster allocation. Leftover subcubes originally created by the partitioning algorithm are linked together by an ownership list that allows the deallocation procedure to quickly check (using a linked-list traversal) whether these leftover cubes are free or allocated. In the event that all of the leftover cubes are currently free when the 2D mesh is deallocated, the Coalesce algorithm (Figure 3.8) will restore the original k-cube. In this case, the normal subcube deallocation algorithm is performed on the k-cube. Otherwise, coalescing will produce several disjoint subcubes, which may be added to the database of free subcube regions for the normal 40 subcube allocation algorithm. In any event, the ownership list is destroyed when the 2D mesh is deallocated using the Auxiliary Free List Deallocation algorithm (Figure 3.9). Those owned subcubes that are still allocated will be deallocated using the AFL Deallocation algorithm when they become free. This process will eventually restore the original k-cube when all Of the leftover subcubes become free. Algorithm: Cluster Partitioning. Input: An X x Y cluster to be deallocated. Output: A set of subcubes composing the X x Y cluster. Procedure: (1) (a) For all elements of A3: Compute x-component subcube addresses by changing low-order bits of subsets of A, to :2. 
Replace all A, addresses with subcube addresses computed from A3. (b) For all elements of Ag: Compute y-component subcube addresses by changing low-order bits of subsets of Ag to at. Replace all Ay addresses with subcube addresses computed from Ay. (c) For all elements of A; and Ag: Compute subcube addresses by concatenating members of A,, to members of A3. Add each computed address to the AFL. Figure 3.7: The cluster partitioning algorithm. 3.4 Analysis Of the AFL Cluster Allocation Method This section presents the analysis of the algorithms associated with 2-D mesh allocation and deallocation. A complete analysis of the complexity of the original Free List algorithms is given by Kim et a1 [25]. For allocation and deallocation in an n-cube, assume that the original cluster request required a k-dimensional subcube. Both the allocation and deallocation Of 2D mesh clusters do not add noticeable overhead to the original free list algorithm. AFL allocation adds 0(k2) complexity to the FTee List algorithm, which is 41 Algorithm: Coalesce. Input: A list of subcubes, T, Of dimension 0 to k — 1. Output: A list of subcubes resulting from coalescing complementary cubes. Procedure: (1)Fori=0tok—1do Let c be the number Of subcubes of dimension i. If e Z 2, then For all pairs of previously un-coalesced subcubes of dimension i : If the two subcubes are complementary, then (a) Combine them into a subcube of dimension i + 1 (b) Remove them from T. (c) Add the new (i + 1)-subcube to T. Figure 3.8: The coalesce algorithm. Algorithm: Auxiliary Free List Deallocation. Input: A deallocating X x Y cluster C, with ownership list 0. Output: Adds subcube(s) to the AFL or the FL. Procedure: (1) Perform the Cluster Partitioning algorithm on C, placing the subcubes obtained in a temporary list T. (2) For all subcubes Q,- in 0: (a) Remove Q,- from 0 (“disown” it). (b) If Q, is not allocated, then remove it from the AFL and add it to T. (3) Perform the Coalesce algorithm on T. (4) If T contains one subcube of dimension k, then add that subcube to the FL, and perform the Pfee List deallocation algorithm; Else (T contains multiple subcubes), add all subcubes in T to the FL. Figure 3.9: The auxiliary free list deallocation algorithm. 42 0(n). More importantly, AFL deallocation adds no more than 0(k3) complexity to the Free List deallocation algorithm, which is normally at least 0(n3). The existence of the AFL deallocation algorithm makes it Often unnecessary to perform the Free List deallocation algorithm, resulting in a greater time savings. Therefore, the methods introduced in this paper do not introduce significant time complexity to the free list algorithms. A more complete analysis Of the individual algorithms involved in the AFL method is presented in the following subsections. 3.4.1 Analysis of the Partitioning Algorithm In step (1) Of the partitioning algorithm, the number Of elements Of L3c is 0(3) (recall that Z = [logz X ]) Each of the three substeps Of step (1) takes constant time for each element of L3. Thus, step (1) is 0(6). Similarly, steps (2)(a) and (2)(b) are 0(m) and 0(8), respectively. Because i and m can both range from 0 to k, any Operation that is 0(8) or 0(m) can be assumed to be 0(k). Step (2)(c) requires a two loops to perform Operations on two sets, each of which is 0(k). Therefore, this step is 0(k2), and the overall complexity Of the partitioning algorithm is 0(k2). Practically, t and m are normally much less than k, meaning that the complexity of 0(k2) for this algorithm is an overestimate. 
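As a concrete companion to the Coalesce algorithm of Figure 3.8, the short Python sketch below merges complementary subcubes pairwise. It is illustrative only; the helper names are mine, and it does not attempt the per-dimension bookkeeping of the algorithm as presented.

```python
def dimension(cube):
    """A subcube's dimension equals the number of don't-care bits in its address."""
    return cube.count("*")

def complementary(a, b):
    """Two equal-dimension cubes are complementary if their addresses differ in
    exactly one position, and that position holds a fixed (0/1) bit in both."""
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return (dimension(a) == dimension(b)
            and len(diffs) == 1
            and a[diffs[0]] != "*" and b[diffs[0]] != "*")

def combine(a, b):
    """Replace the single differing bit with '*' to form the (i+1)-cube."""
    j = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
    return a[:j] + "*" + a[j + 1:]

def coalesce(cubes):
    """Repeatedly merge complementary pairs until no further merge is possible."""
    cubes = list(cubes)
    merged = True
    while merged:
        merged = False
        for i in range(len(cubes)):
            for j in range(i + 1, len(cubes)):
                if complementary(cubes[i], cubes[j]):
                    a, b = cubes[i], cubes[j]
                    cubes = [c for k, c in enumerate(cubes) if k not in (i, j)]
                    cubes.append(combine(a, b))
                    merged = True
                    break
            if merged:
                break
    return cubes

# Two 3-cubes and a 4-cube that together form half of a 6-cube:
print(coalesce(["000***", "001***", "01****"]))   # ['0*****']
```

In the AFL deallocation procedure, this coalescing step is what allows the original k-cube to be restored once all of the leftover subcubes in a cluster's ownership list become free.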
3.4.2 Analysis Of the AFL Allocation Algorithm The auxiliary free list allocation algorithm is almost identical to the Free List algorithm for subcube allocation. The two primary differences are that it (1) examines the AFL and not the FL, and (2) it uses the partitioning algorithm when a 2-D mesh request requires less than a complete subcube. If the complexity due to the partitioning algorithm is ignored, the 43 complexity of the auxiliary free list algorithm is 0(n) (see the analysis Of the FL algorithm in [25]). 3.4.3 Analysis of the Modified Free List Algorithm The only difference between the modified free list algorithm and the original free list algo- rithm is that the modified algorithm calls the partitioning algorithm when a suitable k-cube is Obtained. In [42, 25], the authors showed that the complexity of the free list algorithm is 0(n), where n is the dimension Of the hypercube multiprocessor. Thus, if the overhead Of the partitioning algorithm is not considered, the Modified Free List algorithm is still 0(n). 3.4.4 Analysis Of the Cluster Partitioning Algorithm Step 1(a) of the cluster partitioning algorithm (Figure 3.7) performs a loop on all elements of the set Ax, which may have up to k elements. Each sub step of this lOOp requires a constant time to compute. Therefore, this step is 0(k). Step 1(b) is 0(k) for the same reason. Since step 1(c) requires a doubly-nested loop on two sets having up to k elements, it is 0(k2), implying that the entire algorithm is 0(k2). 3.4.5 Analysis of the Coalesce Algorithm This algorithm requires a triple-nested loop. The outer loop considers k dimensions Of subcubes. The inner two loops consider all possible pairings of subcubes Of some dimension i, where i is the dimension under consideration by the outer loop. It is possible for any particular i dimension to contain k subcubes. Therefore, the inner two loops may require the selection Of up to k2 pairs Of subcubes for any given dimension. Each substep ((a) to (c)) requires a constant time Operation. Therefore, the entire algorithm can be considered to be 44 0(k3). In practice, however, since there are no more than k2 total cubes in the temporary list, and Often many fewer than this, this algorithm does not take significant time to run. 3.4.6 Analysis Of the AFL Deallocation Algorithm The auxiliary free list deallocation algorithm is a decision procedure that determines whether to perform an auxiliary free list deallocation, a free list deallocation, or a 2-D mesh dealloca- tion, which uses the coalesce algorithm. Steps 1 and 3 are simply calls to other algorithms. Step 2 requires a linked-list traversal of the ownership list. Each substep is a constant time Operation, and the ownership list itself may contain up to k2 elements, because the partitioning algorithm creates up to k2 leftover subcubes. Step (4) takes either constant time (the If condition) or up to k2 steps (the Else condition). Therefore, the complexity of the AFL deallocation algorithm is 0(k2), if the cluster partitioning algorithm, the coalesce algorithm, and the free list deallocation algorithm are ignored. It is important to note that both the allocation and deallocation Of 2-D mesh clusters do not add significant overhead to the free list algorithm. The auxiliary free list allocation algo- rithm adds 0(k2) complexity to the free list allocation algorithm, which is 0(n). 
However, more importantly, the auxiliary free list deallocation algorithm adds no more than O(k^3) complexity to the free list deallocation algorithm, which is normally at least O(n^3). The existence of the auxiliary free list deallocation algorithm also makes it often unnecessary to perform the free list deallocation algorithm, resulting in a greater time savings over the course of allocating and deallocating many requests.

A real-world example from our simulations provides insight into the actual overhead incurred from using the auxiliary free list method. Simulations were performed on Sun SPARCstation 10 computers. On a SPARC 10, a typical job can be allocated and deallocated in approximately 800 microseconds.

Chapter 4
Experimental Results for the AFL Cluster Allocation Method

Extensive simulations have been conducted to compare the performance of the FL method alone with the FL method augmented with the AFL method for allocating user jobs. In our experiments, 2D mesh requests were generated and allocated using the free list and auxiliary free list algorithms on a 10-dimensional hypercube (1024 processors). The X and Y dimensions of incoming requests were varied under numerous distributions.

Simulation results depend on the workload distribution, including such factors as the mesh size, the mesh geometry (rectangular or square), and the job's service time. With this in mind, many simulations were run, including: square meshes, in which the X dimension was randomly generated from [1,16] and [1,32] uniform distributions; rectangular meshes, in which the X and Y dimensions were randomly generated from [1,16] and [1,32] uniform distributions; and numerous interval distributions [21] for rectangular meshes.

The interval distributions were used to simulate bipartite distributions, in which there are many small requests coupled with many very large requests. Interval distributions are based on uniform distributions, except that different ranges of values are generated with specific probabilities. The sum of all interval probabilities must equal 1.0. For example, an interval distribution might be P[1,8] = 0.3, P[9,24] = 0.5, and P[25,32] = 0.2, indicating that the probability of a value being generated over the interval [1,8] is 0.3, the probability of a value being generated over the interval [9,24] is 0.5, and the probability of a value being generated over the interval [25,32] is 0.2.

Table 4.1 lists the sets of interval distributions that were simulated for rectangular mesh requests. Here, "small dimension" refers to the interval over which small mesh request dimensions are generated, and "large dimension" refers to the large mesh request dimension intervals. Each bipartite distribution was run with P(small) = A, P(large) = 1 - A, for A in {0.10, 0.30, 0.50, 0.70, 0.90} (P(small) is the probability for small mesh dimensions, and P(large) is the probability for large mesh dimensions).

    small dimension in [1,8]      small dimension in [1,16]
    ------------------------      -------------------------
    large dimension in:           large dimension in:
      [9,32]                        [17,32]
      [17,32]                       [20,32]
      [20,32]                       [25,32]
      [25,32]                       [28,32]
      [28,32]                       [30,32]
      [30,32]

Table 4.1: Interval distributions simulated.

In these experiments, cluster requests are processed according to a first-come-first-serve (FCFS) queueing strategy, and the overhead of the cluster allocation and deallocation algorithms is ignored. The job interarrival time and service (execution) times are both assumed to have exponential distributions.
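The interval distributions described above are straightforward to realize. The following Python sketch is a purely hypothetical illustration (it is not the random-number machinery actually used in the simulator) of how one mesh dimension could be drawn from such a distribution, using the worked example P[1,8] = 0.3, P[9,24] = 0.5, P[25,32] = 0.2 and a bipartite entry from Table 4.1.

```python
import random

def sample_dimension(intervals, rng=random):
    """Draw one mesh dimension from an interval distribution given as
    {(lo, hi): probability, ...}; the probabilities must sum to 1.0.
    Within the chosen interval the value is uniform."""
    r = rng.random()
    acc = 0.0
    for (lo, hi), p in intervals.items():
        acc += p
        if r <= acc:
            return rng.randint(lo, hi)
    lo, hi = max(intervals)            # guard against floating-point round-off
    return rng.randint(lo, hi)

# The worked example from the text:
example = {(1, 8): 0.3, (9, 24): 0.5, (25, 32): 0.2}
# A bipartite distribution from Table 4.1 with A = 0.30:
bipartite = {(1, 8): 0.30, (9, 32): 0.70}
x, y = sample_dimension(bipartite), sample_dimension(bipartite)
print(f"rectangular mesh request: {x} x {y}")
```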
Service time is generated with a mean of 4.0 seconds, and interarrival time is varied with respect to the service time. Results presented are normalized with respect to the system load (service time/arrival time). For each separate simulation run, all random number streams are initialized with the same unique seed value, to guarantee that each allocation method receives the same input values. The interarrival time, service time, X dimension, and Y dimension, as well as the probability generator used to generate values in the interval distributions, are all independent. The results presented are generated with a 95% confidence interval on the job turnaround time (JTT) measure.

The results from an experiment using a constant workload illustrate the potential performance increase of the AFL method over the FL method. In this experiment, 50% of mesh requests were 4 x 6 meshes and the other 50% were 4 x 2 meshes. The workload was generated by alternating the 4 x 6 and 4 x 2 requests, with the first request being a 4 x 6 mesh (that is, requests are generated in the sequence 4 x 6, 4 x 2, 4 x 6, 4 x 2, ...). With this workload, the AFL method is able to allocate a 4 x 6 and a 4 x 2 mesh request in one 5-cube, because each 4 x 6 allocation generates a leftover 3-cube. The FL method, however, requires an entire 5-cube for each 4 x 6 mesh request, followed by a 3-cube for each 4 x 2 request. When the first 4 x 2 request is processed, the smallest available free-list subcube is a 5-cube, which is decomposed into a 4-cube and two 3-cubes. After allocation, the remaining 3- and 4-cubes cannot be used to allocate the incoming 4 x 6 request. Under very heavy system loads, this condition occurs after every 4 x 2 mesh allocation, effectively lowering the overall system utilization by 50%.

Figure 4.1 confirms the performance advantage of the AFL method over the FL method when applied to this workload. Here, the AFL method obtained 100% utilization, while the FL method obtained 50% utilization. In addition, the point at which JTT started to increase exponentially was a system load of 4.0 for the AFL method, while it was 2.0 for the FL method (the AFL method performed twice as well as the FL method, as expected). Furthermore, this experiment helped to verify the correctness of the simulator used in the experiments presented here.

[Figure 4.1: JTT and system utilization vs. system load for constant workload; FL and AFL curves.]

4.1 Square and Rectangular Requests With Uniform Distributions.

Figures 4.2 and 4.3 show the job turnaround time (JTT) versus the system load, as well as the system utilization versus the system load, for square and rectangular mesh loads in which mesh dimensions were generated in [1,16] uniform distributions. The performance of both methods is approximately the same. This is due to the fact that there are no mesh requests larger than 16 x 16. Any leftover subcubes generated by the AFL's partitioning algorithm are consequently of dimension at most 6, which results in fewer opportunities for the AFL method to exploit the existing leftover subcubes.
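The leftover-node arithmetic behind both the constant-workload example and the observation about [1,16] workloads is simple to restate. This small Python sketch is purely illustrative (the helper name embedding_cube is mine); it computes the subcube dimension required for a dilation-1 embedding and the number of nodes left over.

```python
from math import ceil, log2

def embedding_cube(x, y):
    """Dimension k of the subcube needed to embed an x-by-y mesh with
    dilation 1, and the count of leftover nodes 2**k - x*y."""
    k = ceil(log2(x)) + ceil(log2(y))
    return k, 2 ** k - x * y

for x, y in [(4, 6), (4, 2), (16, 16), (20, 30)]:
    k, left = embedding_cube(x, y)
    print(f"{x} x {y}: {k}-cube, {left} leftover nodes")

# 4 x 6   -> 5-cube with 8 leftover nodes (the leftover 3-cube the AFL reuses)
# 4 x 2   -> 3-cube with no leftovers
# 16 x 16 -> 8-cube with no leftovers (no spare nodes at the top of the [1,16] range)
# 20 x 30 -> 10-cube with 424 leftover nodes
```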
Figures 4.4 and 4.5 give the experimental results for square and rectangular mesh loads in which mesh dimensions were generated in [1,32] uniform distributions. In this case, the AFL method gives Slightly better average utilization and JTT values than the FL method alone. Since there is a greater variation in mesh sizes being generated (from 1 x 1 to [l.l6] Disuihution "ll ‘t T T I I I T 80 - 1 FL 4— l'l'l' AFL -+--- (sec) 60 - . 40 . .. 20 - - 0 l l 1 l l l J l 2 3 4 5 6 7 System Load (service/arrival) 50 ll.l6l Distribution 1m I I *1 I l I 8‘) "' at FL *— Sys AFL -*--- Util 60 - .. (‘76) 0 1 l 1 l l 1 J l 2 3 4 5 6 7 System Load (service/arrival) Figure 4.2: JTT and system utilization vs. system load for square mesh requests in a [1, 16] uniform distribution. [l.l6| Distribution I“) I T T I T T T I 80 - . FL +- JTI‘ AFL -+--- (sec) 60 - .. 40 >- a 20 ] . 0 1 l l l l l l l l 2 3 4 5 6 7 8 Syste Load (service/arrival) Figure 4.3: JTT and system utilization vs. [1, l6] uniform distribution. [ l .16] Distribution If” T V r 1 T I I I 80 - 0‘ FL *— Sys AFL -+--- Util 60 ~ .. (9.») > 40 .- . 20 - -» 0 l l I. 1 1 l l 1 l 2 3 4 5 6 7 8 System Load (service/arrival) system load for rectangular mesh requests in a 51 32 x 32), there are more instances in which a large mesh request generates many leftover subcubes, allowing the AFL method to satisfy a larger percentage Of incoming requests, resulting in the increased performance shown. [L32] Distribution [L32] Disuibution I“) r I“, I I 1T I I I I FL *— AFL-+~ 80 r 80 r -1 JTT Sys (sec) 60 -‘ U111 60 ’- -‘ m -,-e—--~+----e—--+-t 4(1 4 4() r 20 1 21) r . l 0 l J l l l l l 1.4 1.6 1.8 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 System Load (service/arrival) 0 0.8 l 1.2 0.2 0.4 0.6 System Load (service/arrival) Figure 4.4: JTT and system utilization vs. system load for square mesh requests in a [1, 32] unIform distribution. [1.32] Distribution [L32] Distribution Im I I I I I j I T I l(x, I I I I I I I I 5' FL *- l 5 AFL + 80 FL +— : - 80 - A AFL -+-- ; rrr ,f l Sys (sec) 60 '1' U111 60 '- -< 1' (‘3) _,se---4---<- 40 ~ 40 - ' ’ 20 - 20 - l l l () l l l l J. l L l I 1.2 1.4 1.6 1.3 2 0.2 0.4 0.6 0.8 I 1.2 1.4 L6 l.8 2 System Load (service/arrival) 00.2 0.4 0.6 0.8 System Load (service/arrival) Figure 4.5: JTT and system utilization vs. system load for rectangular mesh requests in a [1, 32] uniform distribution. 52 4.2 Interval Distribution, Small Dimension in [1, 8]. Figures 4.6 and 4.7 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [9,32] uniform interval. The interval distributions were assigned as follows: Pug] = A and 11932] = 1 — A, for A E {O.1,0.3,0.5,0.7,0.9}. We observe that as the percentage, A, of small jobs is increased, the overall JTT im- proves, yet the system utilization also decreases. This is due to the greater number of small mesh requests being processed as A increases. It results in a larger percentage of small leftover subcubes that cannot be used to allocate any incoming requests. This trend is also observable for all other results in this section (see Figures 4.8 through 4.17). A=O.10 A=0.10 100 v T 100 r u 80 " FL .0— AFL 4"- FL +- Afi. +-- Sys Util 60 1- .1 (‘1‘) 20 " .1 I I. 
I I 0 I I I I I I I 0.5 1.5 2 2.5 3 3.5 l 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) A=O.30 A=O.30 1(1) t i T I“) t I 80 - -« 80 - FL -O— AFL -+--- FL +- m AFL -,.-. Sys (sec) 60 '- Util 61) r 1%) 40 - i 4‘) W’ 20 - 20 .. 4 O I I I () I l I I I 1 I 0.5 l 1.5 2 2.5 3 3.5 l 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.6: JTT and system utilization vs. system load for 31,8] = A, P932] 2 1 — A interval distributions. 53 A = 0.50 A = 0.50 100 r f r r I I 1m r r T I I r I 80 - 5 . 80 - n. 4— . i AFL -+--- 3 FL ~— m 3 AFL -+-- Sys (sec) 60 5 -4 Util 60 r " g (‘1!) 40 . 20 .3 1 0 I I I I 0 I I I I I I I 0.5 1 1.5 2 2.5 3 3.5 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/anival) A = 0.70 A = 0.70 “n I I fi ,‘r I I l‘x) f I I I I I fi 80 1- E -1 80 1- FL -O-— -I 5 AFL -+-- i n. +— m 5 AFL -+--- Sys (sec) 60 r 5 -‘ Util 60 - - 3 (‘1') 4‘) I- g 1 m #- cl 20 '- 'i .4 1,! “r”, 0 I I 4L 1 I 0 I L I I I I I 0.5 I 1.5 2 2.5 3 3.5 l 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) A=0.90 A=0.90 1“) I f I I I I : 1m I I I I I I I :' . 80 ~ FL +— - I" AFL -+--- m Sys (sec) Util 60 1- - l 1%) q 40 " d ‘ 20 " 6L 0 I I I J I I I 1 1.2 1.4 1.6 1 8 2 2.2 0.5 1 1 5 2 2.5 3 3.5 . System Load (service/arrival) System Load (service/arrival) Figure 4.7: JTT and system utilization vs. system load for Pug] = A, P[9’32] —— 1 — A interval distributions. 54 Figures 4.8 and 4.9 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [17,32] uniform interval. The interval distributions were assigned as follows: P[1,8] = A and 13117.32] = 1 — A, for A E {O.1,0.3,0.5,0.7,0.9}. This set of experiments illustrates the effect of removing some of the “middle” range of mesh sizes in the experiments. As the lower limit on the upper interval increases, the overall system utilization and JT T improves (again, the effect can be seen by observing all figures in this Section). This effect is due to the fact that, from experiment to experiment, a greater percentage of leftover subcubes are being created by larger and larger generated meshes, which results in a greater number of the small mesh requests being satisfied by leftover subcubes. 55 A=0.10 1m 1 , I— I I I 80 > . FL *— JTT AFL .,.-- (sec) 60 - . 40 " .1 20 " d 0 I I I I I 0.5 1 1.5 2 2.5 3 System Load (servicelarrival) A=030 I y r I I § FL +— 111” 5 AFL «~- (scc) g r ,1 I I I I 0.5 1 1.5 2 2.5 3 System Load (service/arrival) Figure 4.8: JTT and system utilization vs. system load for Pug] = A, P[17.32] = 1 — A interval distributions. Sys U111 (‘11) Sys Util 1%) A=0.10 Im I I I I I r FL +- 30 - AFL +~~ - 60 *- . .nL"- - t- "r- f Av ¢ ¢ 40 1- . 20 '- “ 0 I I I I I I I 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) A=0.30 100 I I I I I I FL *- 8‘) I- AFI. 'fi'" 1 60 - . LI If g A : : : 40 f fi 7 v v " 20 " .1 0 I I I I I I I I 1.2 1.4 1.6 1 8 2 2.2 System Load (service/ariival) 56 A = 0.50 : I I I .5 . 5' FL +— 111" E AFL -+--- (sec) f . ,5 4 ,i y I I I 0.5 1 1.5 2 2.5 3 System Load (servicelarrival) A = 0.70 I I I I .5 a +— m i AFL -+-- (5m) 5 -1 s . 
,5 ‘ I I I 1.5 2 2.5 3 System Load (service/arrival) A = 0.90 .1 JTT (SOC) -4 0.5 2.5 I 1 .5 2 System Load (service/arrival) A=0_50 101) 1 u u 1 f I FL +— 80 *- AFL + ~ Sys Util 60 h - (%) 40 1- ""---4----+----+----+-----+ fl /¢ : Af : c c 20 - .. 0 I I I I I I I 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) A=0.70 Im I I I I I I I FL *— 80 - AFL ~+--- -1 Sys Util 60 - -1 (9’61 40 - -1 -'*---~+ --------- .0 --"*‘ ' 1 20 1- . () I I I I I 1.2 1.4 1.6 1.8 2.2 System Load (service/arrival) A=0.90 I“) I I I I I I I FL *— 80 '- AFL ,_ '1 Sys U111 60 ~ .1 (‘7c) 40 - 4 2t) [- . 0 I I I L L I I 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) Figure 4.9: JTT and system utilization vs. system interval distributions. 103d for P[l,8] = A, 31732] = 1 — A 57 Figures 4.10 and 4.11 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [20,32] uniform interval. The interval distributions were assigned as follows: 1311.8] = A and 1390,32] = 1 — A, for A e {0.1,0.3,0.5,0.7,0.9}. A=0.10 A:0.1(1 1‘” I , I I I I 1“) I I I I I I I 80 *- 4 80 - 4 FL +— m AFL -+--- Sys ,2" -+ . . (sec) 60 r « um 60 - ._, As If i ; : : 40 b 4 40 - H. +— - AFL +- 20 - -* 20 - e 0 l I I I I 0 I I I I I I I 0.5 l 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (scrvice/am'val) A=(1.30 A=0.30 I“) I I ; I I I III) I I I I I I 80 L 1' - 80 ~ « , FL *— J'IT E AFL -+--- Sys (sec) 60 . 5 ~ um 60 - , i (%) u"; ‘4‘ ‘ i : ¢ f 1 a'Ik—A A 4 A A A A 40 ~ 5 ‘] 40 / FL +— < 1 AH. «"- i 20 - E. -1 20 1- -4 ,2, 0 I"' I I I I 0 I I I I I I I 0.5 l 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.10: JTT and system utilization vs. system load for P[1,8] = A, 32032] = 1 — A interval distributions. 58 A=0.50 A=0.50 [m I I : I I I 1(1) I I I I r I I 80 ~ . 80 - ~ 5' FL .._ m .1 AFL 4'“ Sys (sec) 60 L 5 ~ U111 60 - 1 5 (96) :1 ’-’*-,--‘w“ ¢ 41 ¢ ; 40 h- .1 J 40 I- e' i AFL -+—-- 20 - ,‘1 - 20 ~ ~ ,4! 0"“, O I- I I I I 0 I I I I I I 4 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (servicelanival) System Load (service/arrival) A=0.70 A=0.70 lm I I I : I I 1m f f I I I I I so - ,5 - 80 - FL +— . i AFL -+-- 5 FL *- 1'1'1' 5 AFL -+-- Sys (sec) 60 - 1 - U111 60 - - 5 1%) 40 L :5 " 40 '- '-*-—',+-----O T 1 ”r" 1' WI 20 - ,i « 20 . '1’, ,l a"'"" 0 I I I I 0 I I I I I I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) A=0.90 A=0.90 Im I I r I p 'm I T I I I I - 80 - FL +— - AH. -+-- JTT Sys (sec) . U111 60 F 6 (‘56) - 40 1- -] 0 I L I; I I I I 0.5 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) 1 System Load (service/arrival) Figure 4.11: JTT and system utilization vs. system load for P[1,8] = A, 1390,32] = 1 — A interval distributions. 59 Figures 4.12 and 4.13 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [25,32] uniform interval. The interval distributions were assigned as follows: P[1,8] = A and P9532] = 1 — A, for A e {0.1,0.3,0.5,0.7,0.9}. 
A=0.10 A=O.10 1m fir I I I 1") I I r I I I 80 ’ ' 80 1’ '1 —"‘" ‘5 ‘r" ¢ ¢ r 4 FL -9— /' gc ; 3 v ¢ ¢ c m AFL "*'“ Sys f (560) 60 ’ .1 Util 60 ~ .1 (‘75) 40 1' I ‘1 40 " FL -0— '4 f AFL *- 20 [ -] 20 '- ~ 0 I I I I 0 I I I I I J I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (servicelarrival) System Load (servicelzmival) A=0.30 A=0.30 Iw fifi I : I I fl "X1 I I ‘I I fl I 80 [ g d 80 h 4 3 FL +- m ‘[ AFL -+-" Sys ',—"'""+ 3 4 t # - (sec) 60 ~ , - um 60 ~ lg: ; I c A - A < : (9“) f v I v 40 ~ - 40 - FL ~— - * AFL -¢--- 20 _ ,1 - 20 - ~ ,3 o I I I I I I 0 I I I I I I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.12: JTT and system utilization vs. system load for 1:)[1’8] = A, P9532] = 1 - A interval distributions. 60 A=0.50 A=0.50 II” I I : I I I I“) I fi' I I I I 80 ~ 4 80 ~ ~ E FL +— m 5 AFL -+-- Sys (sec) 60 - 5 -1 U111 60 - -4" A 4 .4 m x" ' s 'L' . ¢ - 40 - ,1 « 40 / FL +_ . i AFL -+-- 4‘ 20 - I . 20 ~ - l'l (O x’ 0 - I I I I 0 I I I I I I _I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrivaI) A=0.70 A=0.70 I“) f I I la) I I I I I I so :' . 80 F FL +— 4 5 AFL ...-. 3 FL *— m 3 AFL -*-'- Sys (sec) 60 ’ .1 Util 60 - - ; (‘75) r': _ 2.---+ ......... _ .40" 40 - 40 - —*"' -, 20 ~ 20 P .1 0 I I 0 I I I I I I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2.2 System Load (service/arrival) System Load (service/arrival) A=0.90 A=0.90 Im T I I I I 1(m I I fi r I I I 80 80 '- FL '0— '1 AFL -¢--- 111' Sys (sec) 60 Util 60 - -4 (95) a) m 1- c1 20 0 0 I I I I J I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.13: JTT and system utilization vs. system load for Pll’gl = A, 32532] = 1 — A interval distributions. 61 Figures 4.14 and 4.15 present the J TT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [28,32] uniform interval. The interval distributions were assigned as follows: P[1,8] = A and 1128,32] = 1 — A, for A e {0.1,0.3,0.5,0.7,0.9}. A=1).10 A=0.10 Im 1* :I I I I I“) I I T T 1 1' 1 80 80 "’"u‘." ‘- ¢ ; r 4. FL 4— // m AFL -+--- Sys (sec) 60 1- -< U111 60 - « 1%) 40 - « 40 - FL 4— . AFL 4"- 20 - d 20 r- « o L I I I I 0 I I I 4 I I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (service/mrival) System Load (service/arrival) A=0.30 A=0.30 lm I I I I la) f I I I I I I 80 1- ' .1 80 1- .1 I: h“ " ¢ ¢ ‘f 4 4 : FL +— r 1'11” 5 AFL 4'" Sys x"——¢ : c c c :fiA (sec) 60 - 5 « um 60 / « 5 1%) 40 .. E J 40 - FL -o—- «1 3 AFL -+ - 20 . - 20 ~ 4 ,1» 0 I — I I I I 0 I I I I I I I 0.5 1 1.5 2 2.5 3 1 1.2 1.4 1.6 1.8 2 2.2 System Load (servicelam'val) System Load (service/arrival) Figure 4.14: JTT and system utilization vs. system load for 1011.8] = A, 32832] = 1 -- A interval distributions. 62 A = 0.50 A = 0.50 I I if la) I I I I I ~ 80 r . FL -¢— JTT AFL -+--- Sys ,-¢ ¢ .+_ (see) ~ Util 60 *- xM'” ' F ~ (q') ,r'L’ t : ¢ A A A .1 40 / FL +_ ‘ AFL -+--- -4 20 r- -t I I I 0 I I I I I I l.5 2 2.5 3 l 1.2 L4 1.6 1.8 2.2 System Load (servicclanivnl) System Load (service/arrival) A=0.70 A=O.70 I : I I la) I I I I I I I g -I m u- R +—- q : AFL -+-- :' FL +— H'T E AFL -..-- Sys (sec) 3 « um 60 - - i (z) 3 . I I 0 I I I I I L I 0.5 1 L!) 
2.5 3 l 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) A=0.90 A=0.90 1m I I I :I I“) I I I I I I I 80 ~ FL § . so - n. +— « AFL -+ 5 AH. -+--- m 5' Sys (sec) 60 l- 5 d Util 60 ~ 4 g' (‘19) 40 v f < 40 - l 20 " "’1. -4 -*‘*"‘ o I I I I 0 I I I I I I 0.5 I l.5 2 2.5 3 1.2 L4 L6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.15: JTT and system utilization vs. system load for 111,8] = A, P[28,32] = l — A interval distributions. 63 Figures 4.16 and 4.17 present the J TT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,8] uniform interval, and the large mesh dimension is in the [30,32] uniform interval. The interval distributions were assigned as follows: IDU’S] = A and P[30,32] = 1 — A, for A E {0.1,0.3,0.5,0.7,0.9}. A=0.10 A=0.10 lm I : I 1' I I 1m I I I I I I I x"”°""““*“'+--- 4 4 e 1— 30 .- q 80 f v v v v 4 - ; - FL *— n'r AFL -+--- Sys (sec) 60 ~ - Util 60 ~ ~ (‘1) 40 u- -4 40 r- FL -0— -4 AFL -+-- 20 - 4 20 L . 0 I I I I I 0 I I I I I I I 0.5 I 1.5 2 2.5 3 l 1.2 1.4 1.6 L8 2 2.2 System Load (service/arrival) System Load (service/arrival) A=030 A=030 lm I I I l(x) I I I I I I i 30 I. .. 30 _ A A . H, '9— ."ih‘" v c t e A A Afi m AFL -+-- Sys / ' ' (sec) 60 - .. Util 60 .. (‘1) 4O - - 40 - FL 4— m -+-- l 20 - - 20 » J 0 I I I I 0 I I I J I I I 0.5 l 1.5 2 2.5 3 l 1.2 L4 L6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) Figure 4.16: JTT and system utilization vs. system load for .313] = A, [13032] = 1 — A interval distributions. 64 A=0.50 I“) I I 100 I I r I I fi 80 -4 80 - 4 FL 4— TIT AFL 4‘" Sys ”A. + ¢ 4. s ; (sec) 60 4 Util 60 - x/ A -« (IE / ' ' ' I : I 40 ~ 40 - FL +— . AFL -+-°- 20 -‘ 20 - - 0 I I O I I I I I I I 0.5 l 1.5 2 2.5 3 l 1.2 1.4 L6 1.8 2 2.2 System Load (service/arrival) System Load (service/am val) A=0.70 A=0.70 lm I I II I I la) I I I I j I I so - E « 80 ~ FL +— « s AFL -+-- 5 F1 *— JTT : All -+--- Sys (sec) 60 - 5 - Util 60 - - l (‘1) 5 "_-,-§-"'+"'+ ----------- > 4‘) i- s -1 4‘) I- .1 20 . i - 20 _ . 4” a”, 0 I I I I 0 I I I I I I I 0.5 l 1.5 2 2.5 3 I l.2 1.4 L6 1.8 2.2 System Load (service/anival) System Load (service/arrival) A=0.90 A=0.90 [m .7 I l“) I I I I I I so ,5 - so ~ FL +- . f AFL -+—-- m ,5 Sys (SOC) 60 .' - Util 60 ~ - :' m 40 I: J 40 b q j 20 r” d 0 I 0 I I I I I I I 0.5 l l 5 2 2.5 3 l 1.2 1.4 1.6 1.8 2 2.2 System Load (service/arrival) Figure 4.17: JTT and system utilization vs. System Load (service/arrival) interval distributions. system load for P[1,8] = A, 1130,32] = 1 — A 65 4.3 Interval Distribution, Small Dimension in [1,16]. Figures 4.18 and 4.19 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,16] uniform interval, and the large mesh dimension is in the [17,32] uniform interval. The interval distributions were assigned as follows: P[1,16] = A and 1117,32] = 1 — A, for A E {0.1,0.3,0.5,0.7, 0.9}. The conclusions to be drawn from these experiments are similar to those presented in Section 4.2. However, the change in the lower interval from [1, 8] to [1, 16] has the further effect of increasing user JTT, yet improving system utilization. 
This occurs because, for the lower range of mesh requests, a greater percentage of requests are larger (that is, the [1,16] interval can create larger mesh requests than the [1, 8] interval), thus using more of the system’s resources. However, the JTT is adversely affected because the difference in the lower range of mesh requests results in a lower number of leftover subcubes that are large enough to satisfy incoming requests. This effect is more pronounced as A increases. For example, compare Figures 4.9 and 4.19. (sec) (soc) Figure 4.18: JTT and system utilization vs. system load for P[1,16] = A, 1117,32] = 1 — A A = 0. l() I“) I . r r r I 80 >- FL +— AFL ~+~ a) .- 40 - 20 - ‘) I I I I I 0.5 I 1.5 2 2.5 System Load (service/arrival) A=0.30 III) I , I I I 80 3 E a + 5 AFL -4--- 60 5 40 20 E a 0 I I I I 0.5 I I.5 2 2.5 System Load (service/anival) interval distributions. 66 Sys Util (‘1) Sys Util (‘1?) A:0.IO 'm I I I I I 80 - . 60 ~ ~ 40 - FL 4— -‘ AFL -+—-- 20 >- . 0 I I I I I I I I L2 1.4 L6 I.8 2 2.2 System Load (service/arrival) A=0.30 lm I I I I I I I 60 " 1 Mad; ; 1 : . ; - 40 FL +— < AFL -+-- 2“ '- -I 0 I I I I I I I l 1.2 1.4 1.6 [.8 2.2 System Load (service/arrival) 67 A=0.50 A=0.50 1m t I f I 1(X) I I I I I I so ~ 80 ~ - i FL ~— m E AFL -+-- Sys (sec) 60 5 - um so - . E (‘56) .E "‘4.-- + 4 r ‘r ‘0 : -1 40 ,. x ' +— .3 3 AFL -+-- 20 i " 20 ‘ ‘ 0 I I I 0 I I I I I I I 0.5 I 1.5 2 2.5 3 I 1.2 1.4 L6 1.8 2 2.2 System Load (service/arrival) System Load (service/anival) A=0.70 A=0.70 If fl I lw I I I I I I I ,5 . 80 - FL 4— - E AFL -+--- 5 FL *- JTI‘ 5 AFL '+-- Sys (sec) g . Util 60 - .4 5 (‘1) g . .0 . .11 s --"' » ,i ,/ I .1 20 ‘ I I I 0 I I I I I I I 1.5 2 2.5 3 I 1.2 L4 1.6 1.8 2 2.2 System Load (service/arrival) System Load (service/arrival) A = 0.90 A = 0.90 1(x) I I I I I I " 80 '- FL 4— .. AFL -*--- JTT Sys (50¢) ‘ Util 60 - .t (W a 40 >- . 0 I I I I I Q I 0.5 I 1.5 2 2.5 3 I I2 [.4 1.6 LB 2 2.2 System Load (sewicelaifival) System Load (service/arrival) Figure 4.19: JTT and system utilization vs. system load for 111,16] 2 A, P[17,32] = 1 — A interval distributions. 68 Figures 4.20 and 4.21 present the J TT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,16] uniform interval, and the large mesh dimension is in the [20,32] uniform interval. The interval distributions were assigned as follows: P[1.16] = A and 32032] = 1 — A, for A e {0.1,0.3,0.5,0.7,0.9}. A=0.IO lm I 0' (sec) 60 ~ 20- 0 I I I r I I I 1 System 1.5 2 Load (service/arrival) A = 0.30 2.5 (sec) 60 I I I 1.5 2 System Load (service/arrival) 2.5 Sys Util (9F) Sys Util (9H 20 4O 20 A = 0.”) W7 I f I I I I r- '4 v- FL +— d AFL -+--- I I I I I I I I 1.2 1.4 L6 1.8 2 2.2 System Load (service/arrival) A = 0.30 I I I I I I I -—: : - : i : / FL +—- -I AFL -+--- b 1 I I I I I I I I 1.2 L4 1.6 I.8 2.2 System Load (service/arrival) Figure 4.20: JTT and system utilization vs. system load for P[1,16] = A, 32032] = 1 — A interval distributions. (5°C) (5°C) (sec) Figure 4.21: JTT and system utilization vs. system load for P[1,16] = A, Pl20,32l = 1 — A IOO 2O A = 0.50 T I I FL -¢— AFL -+—-- I I I 1.5 2 2.5 3 System Load (servicclarfival) I I FL +— AFL -+-- I 4 0.5 I 1.5 2 2.5 3 System Load (service/arrival) A=0.90 0.5 I I .5 2 System Load (sen'icelaxfival) interval distributions. 69 Sys U10 (9*) Sys Util (9P) Sys Util (%) A = 0.50 1(X) I I I I I I I 80 '- .4 60 '- -4 “”4- ‘7- :— -* A L 40 ~ ' FL -o— J AFL -+-- 20 - . 
Figures 4.22 and 4.23 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,16] uniform interval, and the large mesh dimension is in the [25,32] uniform interval. The interval distributions were assigned as follows: P[1,16] = A and P[25,32] = 1 - A, for A ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.

Figure 4.22: JTT and system utilization vs. system load for P[1,16] = A, P[25,32] = 1 - A interval distributions.

Figure 4.23: JTT and system utilization vs. system load for P[1,16] = A, P[25,32] = 1 - A interval distributions.

Figures 4.24 and 4.25 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,16] uniform interval, and the large mesh dimension is in the [28,32] uniform interval. The interval distributions were assigned as follows: P[1,16] = A and P[28,32] = 1 - A, for A ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.

Figure 4.24: JTT and system utilization vs. system load for P[1,16] = A, P[28,32] = 1 - A interval distributions.
Figure 4.25: JTT and system utilization vs. system load for P[1,16] = A, P[28,32] = 1 - A interval distributions.

Figures 4.26 and 4.27 present the JTT and system utilization versus the system load for meshes generated using bipartite interval distributions in which the small mesh dimension is in the [1,16] uniform interval, and the large mesh dimension is in the [30,32] uniform interval. The interval distributions were assigned as follows: P[1,16] = A and P[30,32] = 1 - A, for A ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.

Figure 4.26: JTT and system utilization vs. system load for P[1,16] = A, P[30,32] = 1 - A interval distributions.

Figure 4.27: JTT and system utilization vs. system load for P[1,16] = A, P[30,32] = 1 - A interval distributions.

Chapter 5

Timesharing in Workstation Clusters

This chapter discusses issues involving timeshared processor scheduling and cluster allocation in Networks of Workstations. We examined the problem of whether it is possible for high-performance parallel jobs to be scheduled in a timeshared fashion yet still accomplish two primary goals: (1) the JTT of the high-priority job is not affected; (2) system throughput is improved [44].

5.1 Problem Definition

In a NOW considered as a CPC, the primary performance concern is the job turnaround time (JTT) as perceived by the individual user. JTT measures the time a user's job spends in the system from the time it is submitted to the time it finishes its execution.
Normally, space sharing, in which a job is given exclusive access to its cluster of processors, is used to minimize JTT. Another measure of performance is system throughput, which measures the amount of work being performed by the system (that is, the number of processors occupied by jobs) within some unit of time.

The inherent flexibility of the workstation cluster makes it suitable for several divergent types of tasks. Its most common use is as a set of personal workstations for groups of primary interactive users. The use of workstation clusters for remote batch job execution has also been extensively examined [5, 34, 35, 36].

In a workstation-based CPC, JTT can be affected by a number of factors, such as message-passing latency, communication protocols, and the processor scheduling policy, as well as uneven workload distributions. These factors can contribute to sub-linear speedup for many user jobs, as illustrated in Figure 5.1. The difference between ideal and observed speedup often represents a certain amount of idle CPU cycles, which can be exploited through timeshared job execution.

Figure 5.1: Difference between ideal and observed speedups.

In an ideal CPC environment, timeshared execution of parallel jobs would be achieved by redesigning the operating system to produce a lower-overhead, more efficient system. For example, the common Unix operating system contains many daemon processes that reside in memory, creating extra overhead that may interfere with the execution of parallel jobs. An optimized CPC may eliminate many of these processes to achieve higher performance.

Classic work in processor scheduling has focused on static scheduling [2], in which the overall completion time of a set of pre-submitted jobs is the primary concern. For this well-known NP-complete problem, specific job characteristics, such as the number of jobs, the cluster size, the execution time, and the memory requirements of each job, can be determined in advance. A static scheduling algorithm may consider these characteristics to determine an optimal schedule in a batch or real-time environment.

We consider the environment in which a CPC exists to be dynamic. Jobs arrive in a stochastic stream over time, making it impossible for the system to know an incoming job's characteristics beforehand. Thus, scheduling decisions rely on observable information, including the characteristics of executing jobs, as well as the current state of the system.

In order to ensure fairness, most system-level schedulers use the FCFS queueing discipline. Since the user's perceived performance is the primary concern, a processor scheduling policy in which their job is preempted by a later-arriving job is unacceptable. For the system to maintain high performance for the highest-priority job, that job's execution must not be hindered by later-arriving, lower-priority jobs.

The approach of this study is to use a first-come, highest-priority (FCHP) queueing mechanism to schedule the timeshared execution of incoming user jobs. In a workstation cluster, the use of timeshared scheduling can measurably increase system throughput. Furthermore, the use of priority allows us to greatly reduce the penalty paid, by the highest-priority job, for having to share its cluster of processors with lower-priority jobs. In most cases, the high-priority job achieves performance at, or close to, the performance it achieved under space sharing.
Our scheduling techniques, implemented on an existing operating system, provide a simple testbed for processor scheduling experiments in NOW-based CPCs. Our study involves the use of existing, unmodified kernel scheduling primitives, with no other changes to the operating system or communication protocols.

5.2 Characterization of Workload

The organization of message-passing activity in a workstation cluster normally is non-structured, in the sense that there may be no benefit to using a particular kind of message-passing pattern. For example, a program in which each process communicates with its four logical nearest neighbors will not necessarily derive a benefit from executing on processors connected by an Ethernet or FDDI crossbar. Therefore, the experiments in our study are concerned not with the patterns of communication, but with the volume of communication that occurs.

For our experiments, we chose two programs that have varying communication-density characteristics. The first program is the Embarrassingly Parallel (EP) kernel from the NAS parallel benchmark suite.¹ In this program, there is an initialization period in which some communication is used to perform self load-balancing, followed by a long period of computation. Each thread of EP performs some initial computation, and work is allocated on each processor depending on how much each thread was able to finish during the initialization period. At the end of the program's execution, each thread sends its results to the initiating process. Thus, the EP kernel is characterized as having a high average processor utilization with very little interprocess communication.

¹We wish to thank Dr. V. Sunderam of Emory University for providing the code for the NAS Kernels used in our measurements.

The second program is an example of a barrier synchronization program, a common parallel programming model in which the execution can be divided into a set of phases, each of which consists of computation followed by communication that is used to synchronize processes and/or communicate intermediate results [45]. Since it is impossible to test many different programs, we implemented the barrier program to create an artificial load on processors in order to perform our experiments.

Figure 5.2 presents a space-time diagram of the message-passing activity of the barrier program executing on 6 processors in our testbed. The horizontal lines represent CPU activity for each thread of the parallel process, while the gaps in the horizontal lines represent times during which the CPU is idle due to communication. The angled lines connecting processors 1 through 5 to processor 0 represent messages being sent to/from the barrier. The figure shows that on processor 0 (the barrier processor), there is little idle time, yet on the non-barrier CPUs, there are significant amounts of CPU time that the program does not use.

Figure 5.2: Barrier communication characterization.

Our experiments with the barrier program used three versions differing according to the number of phases, the total amount of work per phase, and the amount of data passed at the end of each phase. The parameter p determines the number of phases performed. The amount of work per phase, w, determines the total amount of work to be divided among the P processors in each iteration. For example, if w = 900,000 and P = 6, then each processor performs 150,000 inner loop iterations per phase. The token length, t, determines the amount of data sent by processes at the end of each computational phase.

Each of our test programs uses a master/slave paradigm, in which the master program spawns P copies of the slave program, which perform the parallel computation (the master program resides on the same processor as one of the slave programs). After spawning the slaves and sending initial data to them, the master program goes to sleep, waiting for a "done" message from the slave program that implements the barrier.
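The Barrier source code is not reproduced here; as a rough sketch only, one slave's phase loop under the description above might be written in C with PVM 3 along the following lines. The message tags, the identity of the barrier task, the buffer size, and the artificial work loop are illustrative assumptions, not the actual implementation.

    #include "pvm3.h"

    #define PHASE_TAG 10   /* illustrative tag: end-of-phase token to the barrier */
    #define GO_TAG    11   /* illustrative tag: release message from the barrier  */

    /* One slave's phase loop: perform w/P iterations of artificial work,
     * send a token of t bytes to the task implementing the barrier, then
     * block until the barrier sends the release message.                  */
    void slave_phases(int barrier_tid, int p, long w, int P, int t)
    {
        char token[800] = {0};   /* the experiments kept the token at 800 bytes */
        volatile double x = 0.0;
        int phase;
        long i;

        for (phase = 0; phase < p; phase++) {
            for (i = 0; i < w / P; i++)      /* artificial computation */
                x += (double)i * 0.5;

            pvm_initsend(PvmDataDefault);    /* end-of-phase token */
            pvm_pkbyte(token, t, 1);
            pvm_send(barrier_tid, PHASE_TAG);

            pvm_recv(barrier_tid, GO_TAG);   /* wait for the barrier to release us */
        }
    }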
For example, if w = 900, 000 and P = 6, then each processor 81 performs 150,000 inner loop iterations per phase. The token length, t, determines the amount of data sent by processes at the end of each computational phase. Each of our test programs uses a master/ slave paradigm, in which the master program spawns P copies of the slave program, which perform the parallel computation (the master program resides on the same processor as one of the slave programs). After spawning the slaves and sending initial data to them, the master program goes to sleep, waiting for a “done” message from the slave program that implements the barrier. 5.2.1 Format of Experiments Space Sharing Performance Performance measurements for each program when it had exclusive use of the cluster (that is, under space sharing conditions) were taken to measure each job’s maximum possible performance in our environment. The experiments were run under both preemptive and non-preemptive kernels. In a preemptive kernel, a high-priority job that is ready to run forces any executing low-priority job to be moved off the CPU (that is, preempted). In a non-preemptive kernel, a high-priority job that is ready to run must wait for a low-priority job to either expire its time quantum or block. The results of these experiments are given in Figures 5.3 and 5.4. The relatively low speedup of EP is due primarily to two factors in our environment: (1) all of our experiments measured the program’s total execution time, including the load-balancing overhead built into the EP program; (2) our experiments used the small EP problem size (N = 222, where N defines the total number of outer loop iterations in the EP kernel). For example, the full problem size of EP (N = 228), running on 6 processors, achieved a speedup of nearly 6 in our environment. 82 non-preemptive kernel preemptive kemel j I I I I I I I 200 L 1 200 s - Exec [50 - gm: 3] I: - Exec [50 - 32:23 :5 f“ . Time Barrier v3 -e--- Time Barrier v3 -e--- (sec) 1‘. E? 4"“ (SCC) o I I I I 0 I I J I I 2 3 4 5 6 [ 2 3 4 5 6 Number of Processors Number of Pmcessors Figure 5.3: Programs running alone, execution time. non-preemptive kernel preemptive kernel 6 I I I I I 6 T I I I . .’ "I ,’ ’- .’ "' Barrier vI 4— Barrier vl 4— If, Barner v2 -+--- _,I’ ] Barrier v2 -+--- _,." 5 _ Bamer V3 "3‘” ’1" 5 ' Bam'er v3 -G~- ,-’ " 5? Alone “I’m“ ’1’ EP Alone ”x...” I]! [deal _ ..... ’_,- [deal - ..... i,- .I. ,' .I ’ f’,/1: .1" J: Speedup 4 '- 11 If" - Speedup 4 - I,» /_,.,-r' I.I x. I ’l ”,l I I I!“ y .’ ,-/x I .I. ,/ 1" //" ’ I ’ /,x"' ,1" I / If 3 I- " I /"H ”1+ __________ a] 3 .- 1 I ‘4. ---------- :r I.I /" """""" It" //// ....... [I x’ "',-P ’_I x “'r‘ v,' I "4‘ ‘I. -"o I ,/ ’a' ’ ',v _I' J!" "y ‘1' ’y‘ 2 >- ,./:,.-./“ ’4‘1’ . . ‘ '8 .......... a ------------ 1! 2 _- l I" "._..-' a ........ G ........ rx‘ a -------- ‘, ............ a -------- I: ,I' ’a' .I ’1' ' . a ,0." r’ ". ”:’ 4 >’. 1"" B n ' .’ ' I . I I J I 1 I . N l . l I I 2 3 4 5 6 I 2 3 4 5 6 Number of Processors Number of Processors Figure 5.4: [Programs running alone, speedup. The higher overhead of the preemptive kernel resulted in a slightly higher overall execu- tion time (and slightly lower speedup) for each barrier program. This overhead was incurred primarily because the barrier programs’ communication overhead resulted in greater over- head imposed by the preemptive kernel. Thus, the difference in execution time between the non-preemptive version and preemptive version grew as the number of processors grew. 
For example, the 5 processor execution time, on the non-preemptive kernel, for version 2 of Barrier was approximately 37.7 seconds. The same version required approximately 38.3 seconds when running under the preemptive kernel. The opposite effect occurred for the EP program because of its lack of communication. It ran slightly faster on the preemptive kernel, because it ran with lower context switching activity, as well as low message-passing activity. For example, its 5 processor execution time on the non-preemptive kernel was about 33.9 seconds, while it required about 31.5 seconds on the preemptive kernel.

Table 5.1 gives the parameters used with Barrier to simulate different workloads. The message size passed at the end of each computational phase was kept constant so that the expected program delay would be the same, under space sharing conditions, for each program. The variable parameters were chosen so that the total work performed by each version is approximately the same. In addition, the execution time of each version of Barrier on one processor is approximately the same as that of the EP program. The number of phases and the number of iterations per phase are varied so that the frequency of communication and the duration of each phase changes. Version 1 obtains a speedup close to that of EP. Version 3 simulates a program in which communication begins to dominate the execution of the program once a certain number of processors are used. As Figure 5.3 illustrates, its execution time increases from 5 to 6 processors. The parameters of version 2 were chosen so that its performance would fall between versions 1 and 3.

    Version   Phase iterations   Number of Phases   Message Size (bytes)
       1          1,350,000             830                  800
       2            890,000            1250                  800
       3            445,000            2500                  800

Table 5.1: Barrier parameters used to simulate different workloads.

Timesharing Experimental Methods

For this study, the purpose of using timeshared scheduling was to improve overall system throughput by allowing jobs to share time in the cluster. However, simply allowing an arriving job to begin execution on a previously occupied cluster of processors would adversely affect the performance of the previously executing job. Therefore, our experiments examined the use of priority scheduling to allow incoming jobs to use idle CPU cycles without severely affecting the JTT of existing high-priority jobs.

Figure 5.5 gives an illustrative example of our experimental method. In this figure, EP and a version of Barrier are run concurrently (using timeshared scheduling) in a cluster. When the first job to finish terminates, the remaining job is given exclusive control of the cluster (normally, the high-priority job is expected to finish first). The measured execution time of the programs gives an estimate of their JTT if they begin execution at approximately the same time. The figure also illustrates the fact that multiple measurements were taken for each experiment to improve confidence intervals.

Figure 5.5: Experimental method.

The purpose of this experimental method is to simulate the condition in which the second job to arrive suddenly becomes the high-priority job when the original high-priority job terminates. The first set of experiments, implemented on a non-preemptive kernel, used a system call embedded in the user code that reset the priority of the program at the beginning of execution on each CPU. The high-priority job retained its default priority, while the low-priority job issued the system call.
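The text does not show the priority-reset call itself; a minimal sketch of this kind of self-demotion, assuming the standard BSD-style setpriority() interface, is given below. The nice value of 10 is purely illustrative.

    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>

    /* Executed once, at the start of each process of the low-priority job,
     * so that the first-arriving (high-priority) job keeps its default
     * priority on every CPU.  The value 10 is an assumed nice level.      */
    static void demote_self(void)
    {
        if (setpriority(PRIO_PROCESS, 0, 10) != 0)   /* who = 0: calling process */
            perror("setpriority");
    }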
The second set of experiments was implemented on a preemptive kernel and performed using the facility described in Chapter 6. All timesharing experimental results are reported in Section 6.5.

5.2.2 Testbed

The platform on which we conducted our experiments is a homogeneous distributed workstation environment configured for high-performance parallel computing research. It is composed of 6 identical DEC Alpha 400 AXP workstations, connected by a DEC GIGAswitch crossbar network, as well as by a standard Ethernet connection. The limitation of 6 processors in our environment was overcome by using different versions of Barrier to vary its granularity. Each Alpha is configured with 32 MB of RAM and a 424 MB SCSI disk, on which resides the OSF/1 (v. 2.0) operating system. In order to ensure minimal interference from outside factors, all experiments using the GIGAswitch were performed when no other user jobs were running on the Alpha workstations.

All user jobs involved in our experiments are written in C with PVM 3.2 [6]. The programs were implemented so that direct routing, in which each thread of the program bypassed the PVM daemon after connections were established, was used whenever possible. Our initial experiments illustrated that not using direct routing in our environment would have a significant adverse effect on program performance. For example, running on the non-preemptive kernel, the space-shared execution time of version 1 of Barrier increased from approximately 31.4 seconds to 34.4 seconds on 6 processors. Furthermore, not using direct routing resulted in considerably more overhead and/or interference attributable to the PVM daemon.

The preemptive kernel referred to in the text was implemented by configuring the OSF/1 kernel for the realtime (RT) environment. This configuration enabled various realtime system calls to be used. Several scheduling classes were also enabled, allowing the use of the required preemptive scheduling by the operating system. Chapter 6 describes the facility implemented on the RT kernel that was used to perform the experiments that required the use of a preemptive kernel.

5.3 Analytical Model

The purpose of the analytical model presented here is to understand the penalty paid by high-priority jobs when they are run in a timeshared fashion with low-priority jobs. For the purposes of this study, we assumed that the operating system scheduling mechanism employed in the non-preemptive (non-RT) kernel is round-robin with multilevel feedback (RRMF) [46]. RRMF maintains a stratum of different job priorities, in which jobs at the same priority level are given timeshared access to the CPU using round-robin (RR) scheduling, while jobs at lower priority levels are not given CPU time unless all jobs at a higher priority level are either blocked or finished. Furthermore, we assumed that once a job occupies the CPU on the non-RT kernel, it may not be preempted by another user-level job. The average context-switch time is assumed to be very small relative to the total execution time of a program, and is therefore ignored in the models presented here.

Under these assumptions, a simple model was constructed to represent the execution time of a program in a timeshared workstation cluster. Consider a barrier program that has exclusive access to its cluster of processors.
In this case, the total program execution time, E_B, is determined by several factors: (1) the average execution time of each computational phase, e_B; (2) the average message-passing delay at each processor, δ; (3) the number of phases executed by a program, p; and (4) an initial setup time at the beginning of the program, s. Since no process of a Barrier program may proceed past the barrier before all other processes are finished with a given phase, each process is conceptually identical. Thus, a model of the activity on one CPU is sufficient to characterize the behavior of the program, as well as determine its approximate execution time. For example, Figure 5.6 presents an execution model of Barrier when it maintains exclusive control of its cluster. For the Barrier program, the execution time can then be represented with the equation:

E_B = s + p(e_B + δ)     (5.1)

Figure 5.6: Execution pattern of Barrier with exclusive control of cluster.

For our experiments involving the two jobs sharing time on processors, there are two major analytical models to consider: (1) EP running with Barrier; (2) Barrier running with Barrier.

5.3.1 Barrier with EP, Non-preemptive Kernel

Figure 5.7 presents an execution model of Barrier and EP running at the same priority. In this model, EP and Barrier execute in round-robin fashion during Barrier's execution phases, as well as during Barrier's startup phase (assume that context switching occurs when Barrier and EP are executing in round-robin fashion, and that no context switching occurs when EP executes during Barrier's message delay). This results in each phase of Barrier being (approximately) doubled in total time, since EP and Barrier are both given equal time quantums during each of Barrier's execution phases. When Barrier is blocked waiting for a message, EP is given exclusive access to the CPU, which results in a random additional delay, d, imposed on Barrier when its message arrives. In this case, d ≤ q_EP, where q_EP is the time quantum given to EP while Barrier is blocked (this value cannot be determined precisely). Using this model, the execution time of Barrier can be approximated by a simple equation:

E_B = 2s + p(2e_B + δ + d)     (5.2)

Figure 5.7: Execution pattern of Barrier and EP, equal priority.

At the processor implementing the barrier, the message delay is normally close to 0 (that is, δ << e_B). Thus, the overall execution time can be approximated by:

E_B = 2s + p(2e_B + 0 + d) ≈ 2s + p(2e_B) = 2(s + p e_B)     (5.3)

Since the progress of the entire Barrier program depends on its barrier process, Equation 5.3 estimates that the overall execution time of Barrier can be expected to approximately double, since the execution time of its barrier process will approximately double.

Figure 5.8 presents a model of a barrier program running with EP executing at a lower priority, in which EP is only given CPU time when Barrier is blocked due to the issuance of a message receive. In this case, the execution time of Barrier is expected to be close to that available when Barrier maintains exclusive control of its cluster, although one can also expect that the execution time will be somewhat higher due to the random extra delay, d, occurring whenever Barrier's message arrives.

E_B = s + p(e_B + δ + d)     (5.4)

Figure 5.8: Execution pattern of Barrier and EP, EP at low priority.
This model estimates that with EP running at a lower priority, the execution time of Barrier will be affected primarily by the random extra delay, d, without otherwise having its performance penalized. If the value of d is small enough, EP may "steal" CPU cycles on the processors on which Barrier is executing without severe penalty to the JTT of Barrier.

5.3.2 Barrier with Barrier, Non-preemptive Kernel

When considering one version of Barrier executing concurrently with another, an additional message delay is introduced into the model. However, the choice of keeping the message size constant for different versions of Barrier allows us to simplify the analysis, as we can assume that δ = δ_1 ≈ δ_2 ≈ δ_3 (where δ_i is the delay experienced by version i). Since we are primarily concerned with the execution time of the high-priority process when both processes are running at separate priority levels, we omit an analysis of the two programs running at the same priority. For all analyses in this subsection, we refer to the high-priority version of Barrier as "p1" and the low-priority version of Barrier as "p2" (e.g., the total execution time and phase execution time of p1 are denoted by E_p1 and e_p1, respectively).

The most informative means by which to analyze the performance of these programs under timeshared conditions is to examine the activity at the processor implementing the barrier process of p1. Since each of the other processes of the program depends on the barrier process to resume execution, the barrier process effectively dominates the progress of the program's execution. On the processor implementing the barrier, the program's CPU demand is high, and the message delay is close to zero (refer to the graph of processor 0 in Figure 5.2). For example, if the barrier processes of both programs are run on the same CPU, then the situation as illustrated by Figure 5.9 can occur. In this case, the message delay for p1 is initially almost zero, but p2 interferes with the execution of p1 by obtaining a quantum of CPU time every time p1 issues a message-receive. For P processors, the processor implementing p1's barrier issues P - 1 message-receives before issuing a multicast to the P - 1 other processors. Since the CPU demand by p2 is also high, it will likely obtain a time quantum every time p1 issues a message-receive. Therefore, the execution time of p1 in the cluster can be approximated by:

E_p1 = s + p(e_p1 + (P - 1)d)     (5.5)

where d is the time quantum given to p2 when p1 issues a message-receive.

Figure 5.9: Execution pattern of two Barriers with both barrier processes on one CPU.

Figure 5.10 presents the model of execution in which the processes implementing the barrier for each program are on different CPUs. The figure models the activity of the processor implementing p1's barrier. In this case, p2 is still able to steal some CPU cycles from p1. However, since the CPU demand of p2 is expected to be lower on non-barrier processors, the likelihood that p2 will require fewer time quantums when p1 reaches the barrier increases. Thus, the execution time is expected to be less than that predicted by Equation 5.5 but greater than that obtainable with exclusive access to the cluster. In addition, it is possible for idle CPU time to occur on processors not implementing p1's barrier, as illustrated in the figure.
Figure 5.10: Execution pattern of high-priority Barrier with low-priority Barrier when barrier-implementing processes are on different CPUs.

5.3.3 Analytical Model With Preemptive Kernel

The prior analyses considered the execution time of the high-priority job if it is scheduled using the RRMF scheduling mechanism. If preemptive scheduling is employed, two fixed-priority scheduling mechanisms can be used, including the FIFO and RR scheduling mechanisms. The FIFO mechanism is defined recursively as follows:

- When a FIFO process at the head of a process queue becomes ready to run, it preempts any process running at a lower priority. It will run until it blocks or terminates, or until it is preempted by a higher-priority process.
- When a FIFO process blocks, the CPU is given to the next process existing in the FIFO queue of the same priority.
- If no such processes exist, then the CPU is given to the process at the head of the queue at the next highest priority level.

The RR scheduling mechanism is defined similarly, except that processes are given time quantums instead of being allowed to run until they block. Therefore, when we consider a high-priority Barrier job running under a preemptive scheduling mechanism, with the additional assumption that no other processes exist at the same priority level, the execution of the Barrier job can be modeled by Figure 5.6, and its expected execution time is represented by Equation 5.1. Thus, on a system that uses preemptive scheduling methods, a high-priority process should be able to achieve its ideal execution time, regardless of whether lower-priority jobs exist in the system.

5.3.4 Discussion of Analysis

In order to simplify our model, and because context switching overhead is assumed to be relatively low when compared to the total execution time, our analysis ignored the effects of context switching. However, in real execution, the extra context switching occurring during the timeshared execution of programs can be expected to add some overhead to the execution times of the programs.

The analyses predicted that an improvement in system throughput should be obtainable if the relative priorities of two jobs are adjusted with respect to each other. For example, Figure 5.8 shows that under ideal conditions, the low-priority EP program is able to use the CPU while the Barrier program is waiting for message delivery. When two Barrier programs occupy the same cluster, however, the analyses predict that little, if any, throughput improvement may be obtained. Figures 5.9 and 5.10 illustrate that this low predicted improvement is due primarily to the high CPU demand of the high-priority job's barrier process.

Of further consequence is the random delay introduced when the low-priority job is executing when the high-priority job's message arrives. This delay, which exists only under a non-preemptive kernel, is a function of the time quantum given to the low-priority process, and may be as high as one time quantum. The analyses predicted that this delay can have a measurable effect on the JTT of the high-priority job when the second job is executing at a lower priority. If the random extra delay does not exist (as in the case of the preemptive kernel), then the analysis predicted that the high-priority job will obtain the same performance that is available under pure space sharing conditions.
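To make the predictions of Equations 5.1 through 5.5 concrete, the small C program below evaluates the models for one set of parameter values. Every number in it is an assumption chosen only for illustration; none of the values comes from our measurements.

    #include <stdio.h>

    /* Illustrative evaluation of the execution-time models of Section 5.3.
     * All parameter values are assumed for illustration only.              */
    int main(void)
    {
        double s     = 1.0;    /* initial setup time (seconds)                */
        double e_b   = 0.020;  /* average execution time of one phase         */
        double delta = 0.001;  /* average message-passing delay               */
        double d     = 0.002;  /* random extra delay, well below one quantum  */
        double q     = 0.010;  /* full time quantum taken by a busy process   */
        int    p     = 1250;   /* number of phases                            */
        int    P     = 6;      /* number of processors                        */

        double space_shared  = s + p * (e_b + delta);             /* Eq. 5.1 */
        double ep_equal_prio = 2*s + p * (2*e_b + delta + d);     /* Eq. 5.2 */
        double ep_low_prio   = s + p * (e_b + delta + d);         /* Eq. 5.4 */
        double two_barriers  = s + p * (e_b + (P - 1) * q);       /* Eq. 5.5 */

        printf("space sharing                  : %.1f s\n", space_shared);
        printf("EP at equal priority           : %.1f s\n", ep_equal_prio);
        printf("EP at lower priority           : %.1f s\n", ep_low_prio);
        printf("two Barriers, barriers on 1 CPU: %.1f s\n", two_barriers);
        return 0;
    }

With these assumed values the model reproduces the qualitative behavior discussed above: the low-priority EP case stays close to the space-shared time, while the equal-priority case roughly doubles it and the two-Barrier case is slower still.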
Chapter 6

Elmer: A Scheduling Facility for a Network of Workstations

In the previous chapter, we discussed the issues involved in employing timeshared scheduling to take advantage of idle CPU cycles in a workstation cluster. Our analyses made two predictions regarding the execution time of the high-priority job: 1) if a non-preemptive kernel is employed, then the high-priority job will obtain a JTT close to, but not exactly equal to, the JTT that can be obtained under space sharing conditions; 2) if a preemptive kernel is employed, then the high-priority job will obtain a JTT equal to the JTT that can be obtained under space sharing conditions. The analyses also showed that under certain conditions, system throughput can be improved when two jobs are allowed to share time in the system.

The predictions suggested a potential experimental technique, in which the source code of a low-priority job is modified to reset its priority with respect to the high-priority job. However, there are several drawbacks to this technique. First, it requires user involvement in the form of source-code modification, which is unrealistic in a real high-performance computing environment. Second, there is no automatic means to reset a job's actual priority if its effective priority is changed. That is, if the low-priority job suddenly became the high-priority job, its priority could not be automatically improved, relative to other processes. Third, there is no way to implement automatic cluster allocation. The code implementing the EP and Barrier programs requires that the programs perform their own cluster allocation.

Furthermore, the RRMF (round-robin with multilevel feedback) scheduling policy of a non-preemptive kernel can allow a low-priority job to steal a significant number of CPU cycles from a high-priority job. On a CPU that uses RRMF scheduling, a process at a low priority level is not scheduled on the CPU unless all processes at higher priority levels are unable to run. However, a lower-priority process can be given a longer round-robin time quantum than higher-priority processes, and no user-level process can be preempted by another user-level process, regardless of the relative priorities. Therefore, once a low-priority process is scheduled on a CPU, it may continue to execute even when a higher-priority process becomes ready to run. Processes scheduled using the RRMF policy are also not guaranteed to remain at the same priority level throughout their life, resulting in the remote possibility that a job designated as high-priority could end up executing at a lower priority than a job designated as low-priority. The drawbacks of RRMF revealed the need for the use of preemptive scheduling policies to guarantee that the higher-priority job executes whenever it is ready.

The preceding conclusions provided the motivation to create the facility described within this chapter. This facility, Elmer, was developed as a processor scheduling facility for a NOW-based CPC. It also serves as a testbed for experimental job dispatching and cluster allocation methods, in order for us to further study and improve processor scheduling methods within a NOW-based CPC. For job dispatching methods, which schedule the order of execution of incoming jobs, Elmer's queueing algorithm can be modified to allow for the prioritization of user jobs.
Elmer's cluster allocation facility, which allocates processors to incoming jobs, is designed to have its cluster allocation algorithm modified to take advantage of various forms of observable system information, such as individual CPU utilization and the overall degree of timesharing allowed.

The Elmer facility also provides a practical solution to the problems involved with modification of user code to reset priority. Elmer automatically schedules user jobs, and it automatically allocates clusters of processors for them. Minimal user-code modification is required in the form of a library call that registers the user job with the Elmer facility, enabling it to correctly determine the processors on which it can execute. Elmer also automatically sets and resets the priority of executing jobs, depending on when other jobs enter and exit the system.

6.1 Implementation Issues

The Elmer facility operates as follows: a job arrives in the system when a user executes a command that submits his/her job to be executed by the facility. Elmer runs a cluster allocation algorithm to determine whether sufficient resources exist for the job. If insufficient resources exist, then the job is placed in a queue. When the job is scheduled for execution, daemons on each CPU intercept it by resetting its priority with respect to previously executing jobs. The system daemons also monitor the progress of each job, automatically resetting the priority of executing jobs when other jobs enter and/or leave the system. Furthermore, the Elmer facility uses the realtime scheduling facilities of the OSF/1 kernel to implement preemptive process scheduling, ensuring that the highest-priority job can run whenever it is ready.

A central issue in research in scheduling for a NOW is how it is perceived as a computational resource. Two primary models exist. A NOW can be treated as a collection of independent computers, each of which is "owned" by its primary user, or it may be treated as a CPC, available for the execution of any job. Under the ownership model, the quality of service to a workstation's owner is the primary concern. If the owner is currently using his/her workstation, then he/she should perceive no change in responses from the system. The CPC model views the set of workstations in much the same way as a single parallel processor might be viewed, in which the entire system, consisting of many processors, memories, disks, and a network, is viewed as a single computational resource on which to execute parallel jobs.

The CPC model provides the premise under which Elmer operates. In a typical high-performance computing environment, jobs are submitted and scheduled through some form of system-wide scheduling mechanism to ensure fairness and high performance for individual jobs. However, a user operating within the typical workstation cluster environment performs parallel programming by using different types of user-level software, such as PVM [6], p4 [31], or MPI [32]. These systems often require the user to perform job scheduling and cluster allocation tasks, which are normally performed by the operating system in a CPC. Thus, the purpose of Elmer is to provide a scheduling facility with which high-performance jobs can be executed in a workstation cluster. Elmer provides a somewhat simpler interface for user-level parallel programming tools, as well as a centralized queueing point to reduce contention for system resources. Elmer is designed to run on a computer architecture consisting of a set of Unix-based [46] workstations connected by some form of network.

The primary goal of Elmer is to ensure that the highest-priority job receives the highest possible quality of service, which means that the highest-priority job should always be able
Elmer is designed to run on a computer architecture consisting of a set of Unix-based [46] workstations connected by some form of network. The primary goal of Elmer is to ensure that the highest-priority job receives the highest- possible quality of service, which means that the highest-priority job should always be able 98 to achieve the shortest possible JTT for the number of processors it uses. Traditional CPC scheduling methods ensure high-quality service by enforcing space sharing, in which a job exclusively occupies its cluster of processors. In contrast, Elmer resets the priority of pro— cesses on individual CPUs to allow them to coexist. Jobs are assigned their initial priority according to the queueing mechanism implemented within Elmer. As a simple and effective means to guarantee fairness, it currently enforces the first-come, first-served (FCFS) queue- ing discipline. It also extends this concept to include priority, in which the first-to-arrive job receives the highest priority (first-come, highest-priority, or FCHP). This concept is similar to the ownership model described above. The highest-priority job executing in a cluster is considered to be the owner of the cluster until it terminates. When an owner job terminates, the first job that arrived after the owner job is elevated to owner status, and its priority is updated accordingly. For timeshared parallel job execution, the primary implementation issue to consider is that of how to maintain a high level of service to the highest-priority process on an individ- ual CPU. Since minimizing JTT is of greatest importance, the highest-priority process must be able to execute whenever it is ready. Ideally, to ensure that lower-priority jobs cannot interfere with the progress of the high-priority job by infringing upon its CPU time, the high-priority job must always be able to preempt lower-priority jobs. Under many circum- stances, this would require a modification of the scheduling facilities within the Unix kernel. However, the OSF/ 1 operating system implements the Posix realtime standards [47], which contain the FIFO scheduling class. This class is recursively defined as follows: a Posix FIFO process running at a given priority level will continue to run until it blocks, voluntar- ily yields the CPU, or is preempted by a process running at a higher priority level. When a Posix FIFO process becomes ready to run and reaches the head of the queue for its priority 99 level, it preempts any lower-priority processes occupying the CPU. Thus, our implementa- tion uses realtime scheduling mechanisms to ensure that the highest-priority user job runs when it is ready. The realtime scheduling facilities also allowed us to implement a form of self gang- scheduling for the highest-priority job. At each processor, the highest-priority process pre- empts all other processes when it is ready to run. By definition, this approach is a form of gang-scheduling, because gang-scheduling requires that all processes of a user job be run- ning (occupying their respective CPUS) at the same time. However, this scheduling policy is somewhat weaker in its requirements than gang-scheduling, for two reasons: 1) none of the lower-priority processes can be guaranteed to receive gang-scheduling, 2) in order for this policy to be effective, all processes of a user job must exist at the same, highest, priority level. Any lower-priority processes will hinder the execution of the program. 
The need for all processes of a high-priority job to execute at the highest priority is illustrative of some of the cluster allocation issues that can be studied by using the Elmer facility. First, there are a number of cluster-allocation issues for lower-priority jobs. For example, it may be undesirable for a lower-priority job to occupy a set of processors contained in the clusters of two (or more) separate higher-priority jobs. Second, the CPU usage on individual processors must be considered. Third, our facility currently does not consider memory usage in its scheduling decisions. Fourth, the global degree of multiprogramming is a factor in making cluster allocation decisions. Fifth, the facility is currently implemented on a homogeneous network of workstations. The Elmer facility can be used to study issues involving heterogeneous facilities.

The queueing facility used to implement cluster allocation and job dispatching uses a centralized scheduling mechanism because this mechanism has been shown to be simple and efficient to operate. Furthermore, Theimer and Lantz [34] examined the tradeoffs between centralized and decentralized scheduling facilities, concluding that although decentralized facilities are more fault tolerant, centralized facilities are more scalable and perform better. Section 6.4 will present some measurements that examine the performance of Elmer's centralized facility.

We also considered the issue of robustness in the design of Elmer. In order for Elmer to be robust, it must be able to perform several system tasks without failing, which may include: 1) the cancellation of executing jobs; 2) the ability to monitor system activity, such as CPU usage and/or memory usage; 3) support for PVM and other user-level parallel programming software; 4) the ability to change the system queueing policy. When an executing job is canceled, the Elmer facility must be able to accordingly update the priority of lower-priority jobs that were sharing that job's cluster. That is, new jobs may not be allowed to preempt previously executing jobs. In monitoring CPU and/or memory usage, the Elmer facility provides itself with information that allows more intelligent cluster allocation and job dispatching decisions to be made. Support for user-level parallel programming software is essential in a NOW-based CPC, because very few user programs are written that use direct message-passing implementations. The ability for Elmer's queueing policy to be changed is important, because it will allow further experiments, involving advanced clustering and scheduling methods, to be conducted.

6.2 Facility Functions

The Elmer facility is composed of two root-level daemon processes, Elmer and RSD (ReScheduling Daemon), together with several user-level commands for interfacing with the facility. The Elmer daemon is a centralized queueing facility responsible for all job dispatching and cluster allocation decisions within a NOW, as well as communication with users. The RSD daemon acts as a slave server that performs actions requested by Elmer. It is responsible for intercepting incoming jobs and rescheduling their priorities, as well as other actions related to the local CPU. Figure 6.1 illustrates the communication between Elmer and RSD in a workstation cluster. Running on one workstation, the Elmer daemon accepts user requests, either queueing or executing jobs as appropriate.
When an action local to a workstation is required, Elmer sends a message to the RSD running on that workstation via a TCP connection. The RSD then executes the action, which may include: kill a running job, report CPU activity, reschedule an incoming job, and reschedule running jobs because a higher-priority job has finished execution.

Figure 6.1: Logical view of Elmer operations.

The operation of Elmer with RSD to schedule an incoming job works as follows: the user issues the esubmit command to submit their parallel program to be executed. Elmer runs the cluster allocation algorithm, which uses information passed from esubmit, together with state information from the system, to determine whether the job can be allocated within the cluster of workstations. If the cluster allocation algorithm allows it, then Elmer sends a message to RSD to inform it that a user job is entering the system. The message includes information such as the job's sequence number, its priority level, and the user identification. RSD then waits for the incoming job to begin executing, after which it collects system information to determine all PIDs of the newly executing job. When a list of PIDs is obtained, RSD resets the priority and/or scheduling class of the user job to an appropriate value. All user jobs execute at a lower priority level than Elmer or RSD.

Cluster allocation for a user job requires some interaction with the job itself. PVM programs require that the user's job specify a set of hosts on which to execute. Therefore, the Elmer interface for PVM requires a certain amount of user participation in the form of a library call that must be inserted into the user's PVM code. This library call simply reads an environment variable, set by Elmer, to determine the set of hosts on which the program may execute. The library call is implemented for C programs only. It is inserted into the user code as follows:

    #include "elmer_reg.h"

    main()
    {
        elmer_register();   /* Run this at the start of program execution */
    }

In the course of implementation, it was discovered that the user-level PVM daemon can have a great effect on the operation of the user-level program. That is, if a single user runs two separate PVM-based programs using the Elmer facility, then each program must interact with the same PVM daemon. It was discovered that this interaction greatly impaired the performance of both jobs, especially if realtime scheduling techniques were being used. Currently, there is not a solution for this problem. However, if two distinct users are running PVM-based programs, then the PVM daemons are also rescheduled with respect to each other and the user-level processes.

The PVM daemon is a helper process that facilitates communication among processors. It continues to run after the nominal user job terminates, requiring a special implementation with respect to the priority at which the PVM daemon runs. When a PVM-based user job enters the system, Elmer and RSD also reschedule that user's PVM daemon. When the user job terminates, the PVM daemon is restored to its original priority level and scheduling class.
This was required in the implementation for two reasons: 1) if the PVM daemon is not set to a FIFO scheduling class with a higher priority than the user's job, then it adversely affects the performance of the user job; 2) if the PVM daemon is not restored to its original priority and scheduling class when the user job terminates, then it adversely affects the performance of future jobs from the same user.

The operation of Elmer with RSD when a user job terminates works as follows: when Elmer detects that a user job has terminated, it runs the cluster deallocation algorithm, which restores that job's hosts to the list of available processors. It also sends a message to the RSD on each of that job's hosts, informing the RSD that the user job has terminated. RSD uses this information to update the priority of processes that still exist on its CPU.

6.2.1 User-Level Commands

User-level communication with Elmer and RSD is accomplished with a set of commands, including: esubmit, ekill, ecancel, and equery.

The esubmit command is used for submitting job requests to the Elmer facility. Its format is as follows:

    esubmit [-p hostfile] [-P nprocs]

or

    esubmit nprocs command

In the first command form, esubmit is used to submit PVM daemon requests to the system. Users may specify automatic generation of PVM hosts ("-P nprocs"), in which they include the number of processors they wish to use, or they may specify a default hostfile, which contains a list of hosts in the PVM-specified format. The second command form is the means by which the user submits a job to the system. The esubmit command requires the user to include the number of processors in their request so that Elmer may perform the required cluster allocation.

The equery command is used for requesting status information about the cluster. Its format is as follows:

    equery [-r] [-q]

If no options are given to equery, then it reports the status of running and queued jobs in the cluster (that is, "equery" with no options is equivalent to "equery -r -q"). The two options to the command are used to report only running or only enqueued jobs. The "-r" option causes Elmer to report only the currently running jobs (that is, the current status of the cluster). The "-q" option causes Elmer to report only the enqueued jobs.

The ecancel and ekill commands are used to cancel queued jobs and to terminate running jobs, respectively. Both commands take the job's global sequence number as their argument; their formats are as follows:

    ecancel <job sequence number>
    ekill <job sequence number>
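As an illustration of how these commands fit together, a user session might look like the following; the program name and processor count are made up for the example.

    esubmit -P 4                   start PVM daemons on 4 automatically chosen hosts
    esubmit 4 my_pvm_program       submit my_pvm_program as a 4-processor job
    equery -r                      list the jobs currently running in the cluster
    ekill <job sequence number>    terminate the submitted job if necessary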
TCP Connection {Signal Received} Kernel Signal gear Job Terminates} Connection Received} X“ Value eseage Data _— Elmer Job Queue {Parent} Message Date Cluster Deallocation Cluster Allocation Allocation Succede} eerJob Whild} essage Data Elmer Action Emulate Job Submittal Figure 6.2: Logical view Of the Elmer main program. The process labeled child_term is a signal handling facility. When the Elmer daemon receives the SIGCHILD signal, which indicates that a child process Of Elmer has terminated, 106 the child_term process is executed. Since all user jobs submitted to Elmer are executed by Elmer, it becomes their parent, enabling it to detect their termination. When the child_term process detects that a child process has terminated, it first checks to see whether the child process is actually a user job. It accomplishes this by reading the exit value given by the jOb (this exit value is discussed below in the description Of the elmer_submit procedure). If child that terminated was a user job, then child.term uses the exit value to read a shared memory location that tells it what the job sequence number and former job priority were. These values allow child_term to run the cluster deallocation algorithm (represented by the “Cluster Deallocation” process in the figure), which notifies the RSDs from that job’s cluster Of the job’s termination. After cluster deallocation is performed, child_term checks the global queue for waiting user jobs. If any exist, it runs the cluster allocation algorithm to determine whether it may execute one (represented by the “Emulate Job Submittal” process in the figure). If the cluster allocation algorithm is successful, child-term forks a child process that runs a procedure (emulate_elmer_submit) that emulates the elmersubmit procedure by reading the enqueued job’s data from a queueing directory and then executing the job. Operation of the esubmit Command Figure 6.3 presents a logical diagram Of the actions performed when a user job is submitted to the Elmer daemon. Elmer’s main program (Figure 6.2) receives a TCP connection with an initially-read action indicating a user jOb submission. It forks a child process, which parses the input and executes the elmer_submit procedure illustrated in Figure 6.3. If the data sent involves a user job submittal, then the procedure reads all relevant data sent by the esubmit command and runs the cluster allocation algorithm to determine whether the 107 job may be executed. This can only happen if: (1) there are no jobs previously enqueued, and (2) the cluster can satisfy the user jOb request. SDgta indicates User Job} 0 Cluster Allocatio {Allocation Fails} User Job {Data Indicates PVM Initialize} Initiate User’s Job PVM Daemons Enqueue Allocation Succeeds} ser Job Elmer Job {Parent} Wait for Queue Child to User Job Terminate Uighllgg LChildb Terminates} ser Jbo Exit with Com uted Exit alue Figure 6.3: Logical view of elmer_submit. If the job cannot currently be run, then the user’s job is enqueued by storing the data received from esubmit in a file in Elmer’s global queueing directory. If the job can be run, then another child process is forked, which executes it. The parent process Of the new child waits for the child process to terminate. When the child terminates, the parent continues by computing an exit value, e = j mod N, where j is the user job’s global job sequence number, and N is a prime number that determines the largest number Of user jobs allowed in the system at once. 
The esubmit command is also used to initiate user-level PVM daemon processes. When the esubmit data is received, elmer_submit initially determines whether the job is a PVM initialization or a normal user job submittal. When elmer_submit initiates PVM daemons, one of two options is specified by the user: (1) the user supplies the PVM host file, or (2) the Elmer daemon automatically generates the PVM host file. When the user supplies the PVM host file, the location of the host file is simply sent by esubmit to the Elmer daemon, which then starts the PVM daemon with the host file passed as a parameter. If automatic host file generation is specified by the user, then the Elmer daemon chooses enough hosts to satisfy the user's request, creating the host file and executing the PVM daemon with that host file as a parameter.

Operation of the equery Command

Figure 6.4 provides a logical view of the actions performed by Elmer for the equery command. The procedure illustrated in this figure, elmer_query, simply checks the status of user jobs in the system. The equery command accepts two parameters, which tell it whether to query running user jobs, queued user jobs, or both (the default). The elmer_query procedure simply reads the contents of the Elmer queue and/or its cluster and prints them out for the user.

Figure 6.4: Logical view of elmer_query. (If the options include "-r" or no options are given, the contents of the cluster are printed; if the options include "-q" or no options are given, the contents of the global job queue are printed.)

Operation of the ecancel and ekill Commands

The ecancel command requires a simple procedure to be performed by the Elmer daemon. When the initial action requesting "cancel" is read via TCP, Elmer calls the elmer_cancel procedure. This procedure checks the queue for the job that is being canceled. If it does not exist within the queue, then an error message is printed. Otherwise, the job is removed from the queue, and its associated data file is removed.

The ekill command is somewhat more complicated, because it requires that the Elmer daemon communicate with every RSD running on hosts that are involved in the user job. Figure 6.5 illustrates the communication actions that Elmer must perform when a job is killed by its user. When the action is read in by the Elmer daemon, it calls the elmer_kill procedure, which reads the sequence number of the job to be killed from the ekill command. The elmer_kill procedure then examines the cluster data structure. If no active jobs in the cluster match the job sequence number, then an error message is reported back to the user. Otherwise, the host names within this job's cluster are recorded, and the cluster is updated to reflect that the user's job has terminated. The actual updating of the cluster data structures is accomplished by the child_term signal handler: when the parent of the former user job terminates, it computes a special exit value after its child is killed by Elmer, and the child_term signal handler reads that exit value and proceeds as defined in Figure 6.2. When the hosts involved in this user's job have been identified, the elmer_kill procedure opens a TCP connection to the RSD at each of these hosts. This connection allows it to send the relevant job data to the individual RSDs, which actually kill the executing jobs.
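The per-host notification performed by elmer_kill follows a simple connect-and-send pattern, sketched below in C. The port number, message layout, and function names are assumptions for illustration; the same pattern also underlies the notify_alloc and notify_dealloc procedures described in the next subsection.

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define RSD_PORT 5433                   /* the RSD listening port is an assumption */

/* Hypothetical wire format sent from Elmer to an RSD. */
struct rsd_msg {
    int  action;                        /* e.g., kill, allocation, or deallocation notice */
    long job_seq;                       /* global job sequence number */
    int  uid;                           /* job owner's UID */
    int  priority;                      /* priority level (ignored for a kill request) */
};

/* Open a TCP connection to the RSD on one host; returns a socket or -1 on failure. */
static int connect_to_rsd(const char *hostname)
{
    struct hostent *he = gethostbyname(hostname);
    if (he == NULL)
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(RSD_PORT);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Send one message to the RSD on every host in a job's cluster; this is the
   pattern shared by elmer_kill, notify_alloc, and notify_dealloc. */
int notify_cluster(char **hosts, int nhosts, const struct rsd_msg *msg)
{
    int failures = 0;
    for (int i = 0; i < nhosts; i++) {
        int fd = connect_to_rsd(hosts[i]);
        if (fd < 0 || write(fd, msg, sizeof(*msg)) != (ssize_t)sizeof(*msg))
            failures++;                 /* RSD unreachable or the write failed */
        if (fd >= 0)
            close(fd);
    }
    return failures;
}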
Figure 6.5: Communication involved in elmer_kill. (Elmer opens a TCP connection to the RSD on each machine in the user's cluster; machines outside the cluster are not contacted.)

6.2.3 Operation of RSD

Figure 6.6 presents a logical diagram of the operation of RSD. When RSD is initiated, it first spawns a sub-process called util, which implements an infinite loop to continually update a region of shared memory (every few seconds) with the current CPU utilization. The primary program logic of RSD consists of a single infinite loop. As in the Elmer daemon, the main loop accepts TCP connections, although it communicates only with the Elmer daemon. When a connection is established, the main loop first forks a child, which reads the first value sent to determine what action is to be performed. The child then executes the appropriate procedure, while the parent continues in its infinite loop to accept further connections from the Elmer daemon. In the figure, "RSD Action" is one of the following: a job initiation (reschedule an incoming job), a job kill (terminate an executing job), a CPU report (report the current CPU utilization), or a job termination (an executing job has finished).

Figure 6.6: Logical view of the RSD main program. (RSD spawns the util process at startup and then forks a child to execute each action as connections are received.)

A job initiation action occurs when Elmer allocates the local processor to be part of an incoming job's cluster. When the elmer_submit or child_term procedures successfully receive a cluster from the cluster allocation algorithm, a child process is forked that executes the request. The parent of that child process then calls a procedure called notify_alloc, which opens a TCP connection with the RSD running on each host in the cluster. When the connection is established, notify_alloc communicates the job's priority level, its UID, and its global job sequence number to each of the RSDs. The RSD uses this information to establish the job's priority, issuing system calls to reset the job's priority and/or scheduling policy.

When Elmer's child_term procedure detects that one of its jobs has finished execution, it calls the notify_dealloc procedure, which performs actions similar to those performed by notify_alloc. That is, notify_dealloc opens a TCP connection with the RSD running on each host in the former job's cluster. When the connection is established, it sends the job's former priority level, UID, and global job sequence number to each of the RSDs. Each RSD uses this information to update its local scheduling information, which involves updating the scheduling policies and/or priorities of jobs currently executing on the local processor. If any lower-priority jobs exist locally when the RSD receives notification of a deallocation, then the priority of those jobs is increased so that they receive more favorable scheduling. This action also serves to enforce the policy that the earliest jobs to arrive have the highest priority. That is, RSD and Elmer ensure that no new jobs can arrive to take the departing job's high-priority spot.

When a user issues the ekill command, they must supply Elmer with the job's global sequence number. Elmer interacts with all RSDs in that job's cluster by supplying the job's global sequence number, and each RSD then terminates the job using the kill system call.
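The scheduling-policy adjustments that RSD issues when it is notified of an allocation or deallocation can be sketched with the POSIX realtime scheduling interface. This is only an illustration of the idea under stated assumptions: the implementation targets the realtime facilities of the particular operating system kernel in use, which might instead be driven through a vendor-specific interface such as priocntl, and the process-lookup stub below is hypothetical.

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Stub for illustration: the real RSD would scan the local process table for
   processes owned by `uid` that belong to job `job_seq`. */
static int find_job_pids(int uid, long job_seq, pid_t *pids, int max)
{
    (void)uid; (void)job_seq; (void)pids; (void)max;
    return 0;
}

/* Move a job's local processes into the realtime FIFO class when the job holds
   the highest-priority slot, or back to the ordinary timesharing class otherwise.
   RSD must run with sufficient privilege to change other users' scheduling class. */
int set_job_class(int uid, long job_seq, int make_realtime)
{
    pid_t pids[64];
    int n = find_job_pids(uid, job_seq, pids, 64);
    struct sched_param sp;

    for (int i = 0; i < n; i++) {
        int policy = make_realtime ? SCHED_FIFO : SCHED_OTHER;
        sp.sched_priority = make_realtime ? sched_get_priority_min(SCHED_FIFO) : 0;
        if (sched_setscheduler(pids[i], policy, &sp) < 0) {
            perror("sched_setscheduler");
            return -1;
        }
    }
    return 0;
}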
For the ecancel command, the user must likewise supply Elmer with the job's global sequence number, but no interaction with RSD is required; Elmer simply removes the job from its queue.

When the cluster allocation algorithm is executed by Elmer, it may or may not use the current CPU utilization of individual CPUs in the system, depending on its current implementation. When the cluster allocation algorithm requires the CPU utilization of individual processors, it sends a request to the RSD on each processor under consideration. Each RSD reads a shared memory location (updated every few seconds by the util process) and sends a reply back to Elmer.

6.3 Cluster Allocation Algorithms

In a NOW-based CPC in which the allowable degree of multiprogramming (the number of user jobs allowed on any host at once) is one, the cluster data structure might consist of a one-dimensional array, with each array element representing a host in the NOW. In Elmer, which considers timeshared scheduling, one can think of the cluster as a set of virtual clusters, with each virtual cluster representing a priority level and/or one of the degrees of multiprogramming allowed. Figure 6.7 illustrates the structure of the cluster as maintained by Elmer. A virtual cluster is maintained for each of the D degrees of multiprogramming (priority levels range from 0 to D - 1, with level 0 indicating the highest priority). If P hosts exist within the system, the cluster is essentially a D x P array of cluster entries. In order to uniquely identify different jobs, each cluster entry consists of a job's global sequence number and a job UID.

Figure 6.7: Format of the cluster data structure. (Each of the D priority levels, where D is the degree of multiprogramming, contains one entry per host for the P hosts in the system; an entry holds a job sequence number J and the job's UID U.)

Two primary cluster allocation algorithms are being considered for the Elmer facility. The first (most naive) version is illustrated in Figure 6.8. This algorithm relies on no information from the individual RSDs. Rather, it considers only the current degree of multiprogramming at each host, as well as a global allowable degree of multiprogramming, D, which is determined in advance. It examines the cluster data structure (Figure 6.7) at each priority level. If the values of J and U are -1 for a particular host at a particular priority level, then that host is considered free at that priority level. Version 1 of the cluster allocation algorithm examines the cluster at each priority level until it finds a cluster that satisfies the request or determines that no cluster can be allocated. For a job request of n hosts, if there are n or more free hosts at priority level i, then the request may be satisfied at that priority level. Since it is most desirable for a job to have the highest possible priority, the search proceeds from level 0 to level D - 1. If the algorithm finds no priority level at which n hosts are available, then it returns NULL to indicate that no cluster can be allocated. If D = 1, then this algorithm is equivalent to allowing only space-shared scheduling, in which only one user job occupies a processor at any given time (that is, the user job is given exclusive access to its cluster).

Algorithm: Cluster Allocation, v1.
Input: A cluster request consisting of its number of processors, n.
Output: A list of hosts and an integer priority level if the request is granted (NULL if not).
Procedure:
(1) Let i = 0 and j = -1.
    Let D = degree of multiprogramming.
    Let P = total hosts in system.
(2) While (i < D) and (j = -1):
    (a) Let c = the number of hosts with priority slot i free.
    (b) If (c >= n), then j = i, the priority level to allocate.
        Else, add 1 to i.
(3) If (j = -1), then return NULL (no allocation).
    Else:
        Randomly select n free hosts at priority j.
        Mark each of those hosts as allocated.
        Return the host list and the priority value.

Figure 6.8: The cluster allocation algorithm, version 1.

Version 1 of the cluster allocation algorithm is conceptually simple, and it has a low time complexity. Step (1) is a simple assignment that takes constant time. The While loop of step (2) considers D levels of priority. Substep (a) requires a simple For loop that considers up to P hosts. Substep (b) is a constant-time operation. Thus, step (2) requires O(DP) time. For the Else portion of step (3), a loop over up to P hosts is required to select the host list at priority j, marking each of the selected hosts as allocated. Thus, step (3) is O(P). If we consider that D is a constant selected at run time, the complexity of version 1 is O(P).
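To make the data structure and version 1's notion of a "free" slot concrete, here is a small C sketch of the D x P cluster table of Figure 6.7 and the per-level scan of substep (a). The names and the fixed array bounds are assumptions for illustration, not the actual Elmer code.

#define MAX_D  8          /* assumed bound on the degree of multiprogramming */
#define MAX_P 128         /* assumed bound on the number of hosts */

struct cluster_entry {
    long job_seq;         /* global job sequence number, -1 if the slot is free */
    int  uid;             /* owning user's UID, -1 if the slot is free */
};

/* cluster[i][h] describes priority level i on host h (Figure 6.7). */
struct cluster_entry cluster[MAX_D][MAX_P];

/* Mark every slot free; Elmer would call this once at startup. */
void init_cluster(int D, int P)
{
    for (int i = 0; i < D; i++)
        for (int h = 0; h < P; h++) {
            cluster[i][h].job_seq = -1;
            cluster[i][h].uid = -1;
        }
}

/* Version 1's definition of "free": both fields of the entry are -1. */
int slot_free(int level, int host)
{
    return cluster[level][host].job_seq == -1 && cluster[level][host].uid == -1;
}

/* Count the free hosts at one priority level, as in substep (a) of Figure 6.8. */
int count_free(int level, int P)
{
    int c = 0;
    for (int h = 0; h < P; h++)
        if (slot_free(level, h))
            c++;
    return c;
}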
The second version of the cluster allocation algorithm, illustrated in Figure 6.9, is similar to the first version, except that it uses CPU utilization gathered from the individual RSDs as extra information when making cluster allocation decisions. In this algorithm, D is still determined in advance, but the utilization levels at each host can be used to override the value of D at each host. The primary difference between the two algorithms is that the second version collects the current utilization from every host and uses that information as an additional criterion in determining whether a host is free or unallocatable at a particular priority level.

Algorithm: Cluster Allocation, v2.
Input: A cluster request consisting of its number of processors, n.
Output: A list of hosts and an integer priority level if the request is granted (NULL if not).
Procedure:
(1) Let i = 0 and j = -1.
    Let D = degree of multiprogramming.
    Let P = total hosts in system.
    Let t = the maximum CPU utilization threshold (0 < t < 1).
(2) For each host H, collect the current CPU utilization, u_H, at H.
(3) While (i < D) and (j = -1):
    (a) Let c = the number of hosts H with priority slot i free and u_H <= t.
    (b) If (c >= n), then j = i, the priority level to allocate.
        Else, add 1 to i.
(4) If (j = -1), then return NULL (no allocation).
    Else:
        Randomly select n free hosts at priority j.
        Mark each of those hosts as allocated.
        Return the host list and the priority value.

Figure 6.9: The cluster allocation algorithm, version 2.

If the overhead of computing the CPU utilization at each host is ignored, the second version of the cluster allocation algorithm has the same time complexity as the first version. This assumption is realistic, because version 2 only requires that a message be sent to each host inquiring about the current CPU utilization. Step 3(a) only requires an extra If comparison for each host in the loop, as compared to step 2(a) in version 1.

The second version of the cluster allocation algorithm illustrates a number of issues involved in cluster allocation in this environment. Version 2 can be expected to be more effective than version 1 because different parallel jobs use varying percentages of the total available CPU time. If a job executing at the highest priority level uses most of the CPU time on all of the hosts in its cluster, then another, lower-priority job should not be executed in that cluster, since the lower-priority job will receive little or no CPU time until the high-priority job is finished executing. In this case, space sharing is a more effective scheduling policy, because it provides the potential that the newly-arrived job will be scheduled earlier, if another cluster is made available.
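Building on the table sketched above, a C rendering of the version-2 scan may help show how the utilization threshold narrows the set of allocatable hosts. The utilization array, the return convention, and the deterministic host selection (a real implementation would randomize the choice, as in Figure 6.9) are assumptions for illustration.

/* Returns the priority level allocated (0..D-1) and writes the chosen host
   indices into hosts[0..n-1]; returns -1 if no cluster can be allocated.
   util[h] is the most recent CPU utilization reported by host h's RSD, and t
   is the maximum utilization threshold.  Uses the cluster table, slot_free,
   and init_cluster from the previous sketch. */
int allocate_cluster_v2(int n, int D, int P, const double *util, double t,
                        long job_seq, int uid, int *hosts)
{
    for (int i = 0; i < D; i++) {                /* highest priority level first */
        int c = 0;
        for (int h = 0; h < P; h++)
            if (slot_free(i, h) && util[h] <= t)
                c++;
        if (c < n)
            continue;                            /* try the next priority level */

        /* Select n qualifying hosts at level i and mark them allocated. */
        int chosen = 0;
        for (int h = 0; h < P && chosen < n; h++) {
            if (slot_free(i, h) && util[h] <= t) {
                cluster[i][h].job_seq = job_seq;
                cluster[i][h].uid = uid;
                hosts[chosen++] = h;
            }
        }
        return i;
    }
    return -1;                                   /* no level can satisfy the request */
}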
The algorithm also illustrates the variable definition of the term "free" as it applies to a host at a particular priority level. In the naive cluster allocation algorithm (version 1), a host is free at a priority level if that level has not been allocated. In version 2, the host is free only if that level has not been allocated and sufficient CPU resources are available.

It also reveals other issues involving the degree of multiprogramming, D. A workstation with a CPU-bound job is likely to have a maximum local degree of multiprogramming, d, equal to 1. However, if a workstation contains several highly communication-bound jobs, its d value could be greater than the global degree of multiprogramming, D. Version 2 of the cluster allocation algorithm does not take this issue into consideration; that is, on every host, the value of d is always less than or equal to D.

Yet another issue not considered by version 2 is whether to allow a lower-priority cluster to span multiple higher-priority clusters. As it is presented, version 2 allows the creation of a lower-priority cluster that may consist of hosts containing multiple separate, independent, high-priority jobs. Our prior research in [44] suggested that this may be detrimental to both user JTT and system throughput.

6.4 Evaluation of Elmer's Performance

Figure 6.10 illustrates the overhead, in milliseconds, that Elmer incurs when a job initiation occurs. The figure presents the time it takes for Elmer to accept the incoming job, allocate a cluster (using version 1 of the cluster allocation algorithm), and notify all of the RSDs in the cluster of the incoming job. The figure shows that our implementation requires an overhead ranging from a low of about 59 milliseconds (on one processor) to a high of about 103 milliseconds (on all 6 processors). By extrapolating the results, it can be observed that the performance shown is sufficient for Elmer to handle clusters with up to 107 processors, if a maximum request rate of 1 per second is allowed. This figure is sufficient for a maximum desired cluster size of 100 processors.

Figure 6.10: Overhead (ms) for job initiation, as a function of the number of processors.
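The 107-processor estimate can be reproduced with a simple linear fit, under the assumption (ours, for illustration) that the overhead grows roughly linearly in the number of hosts notified:

    overhead(P) ~ 59 + ((103 - 59) / 5) * (P - 1) = 59 + 8.8 * (P - 1) milliseconds.

Setting overhead(P) equal to the 1000-millisecond budget implied by a maximum request rate of one job per second gives P ~ 1 + 941 / 8.8, or roughly 108 processors, consistent with the estimate above.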
Figure 6.11 illustrates the overhead, in milliseconds, that Elmer incurs when a job termination occurs. The figure presents the time taken when Elmer performs several tasks upon detecting a user job termination. The tasks include performing the cluster deallocation algorithm, including updating the cluster status, and notifying the individual RSDs so that they can update their internal data structures. The figure shows that our implementation requires an overhead ranging from a low of about 13 milliseconds on one processor to a high of about 57 milliseconds on 6 processors. This overhead is much lower than that of job initiation, due mainly to the fact that the job initiation functions of Elmer perform more tasks and require a number of more expensive system calls. Again, the overhead of job termination in Elmer is sufficient for the maximum desired cluster size of 100 processors.

Figure 6.11: Overhead (ms) for job termination, as a function of the number of processors.

6.5 Experimental Results

This section describes the experiments involving the Elmer facility. It presents a comparison of a set of timesharing experiments run under both realtime (RT) and non-RT operating system kernels. The set of experiments run under the non-RT kernel was originally presented in [44]. These experiments present the performance of timeshared jobs in which non-preemptive scheduling methods were employed. The experiments run under the RT kernel present the performance of timeshared jobs in which preemptive scheduling methods were employed.

For the results presented in this section, all "SS" JTT figures represent the JTT of each job under space-sharing conditions, in which the high-priority job was run first to completion, followed by the low-priority job. All "TS" JTT figures represent the JTT of each job when both were submitted at the same time. JTT is represented as execution time in the figures.

Figure 6.12 presents the experimental format of the first set of experiments, in which Barrier (v1, v2, and v3, respectively) ran as the high-priority job, with EP running as the low-priority job.

Figure 6.12: Experimental format, Barrier running with EP.

Figures 6.13 through 6.15 present the results of those experiments. In each case, a measurable improvement in overall finishing time was observed. The improvement is illustrated by the fact that the execution time of the low-priority EP job, when run under timesharing conditions, improved with respect to the space-sharing case. Part of the observed improvement was likely due to the EP program's self load-balancing feature: the CPU containing the barrier process was given less work by the EP algorithm, allowing Barrier's execution to proceed relatively unhindered on that processor.

Of more interest, however, is the comparison of the non-RT experiments to similar experiments run using the RT kernel. In the non-RT experiments, the JTT of EP improved significantly over the space-sharing case, but the Barrier program was penalized somewhat in its JTT. On 1 or 2 processors, there was little benefit to decreasing EP's priority relative to Barrier, since Barrier incurred little or no communication overhead. Furthermore, in the non-RT case, significant degradation of the performance of the high-priority Barrier job occurred due to the system's scheduling algorithms, which allowed EP to steal CPU cycles from the Barrier job. When communication overhead became more significant (at 4 to 6 processors), the execution time of Barrier, with EP at a lower priority, approached the execution time that it obtained using space-sharing scheduling. The use of the preemptive scheduling facilities, however, allowed the high-priority Barrier program to execute whenever it was ready, achieving an execution time essentially identical to its space-sharing execution time for any number of processors.

Figure 6.16 presents the experimental format of the second set of experiments, in which different versions of Barrier were run together in the cluster. In this set of experiments, the barrier process of each job shared the same CPU.
Figure 6.13: Job turnaround time for B (v1) and EP, on the non-preemptive and preemptive kernels (execution time in seconds versus number of processors).

Figure 6.14: Job turnaround time for B (v2) and EP, on the non-preemptive and preemptive kernels.

Figure 6.15: Job turnaround time for B (v3) and EP, on the non-preemptive and preemptive kernels.

Figure 6.16: Experimental format, Barrier programs sharing the same barrier processor.

Figures 6.17 through 6.19 present the results of those experiments. In each case, the high-priority version of Barrier, running on the RT kernel, achieved an execution time identical to its execution time under space-sharing conditions. Running on the non-RT kernel, the high-priority job was again penalized somewhat in its execution time. In this case, the penalty is explained by the fact that the barrier processes of both programs shared the same CPU. As illustrated in Figure 5.2, the Barrier program's barrier process (executing on processor 0 in the figure) has a high CPU demand for each program. This high CPU demand results in contention for CPU resources by both jobs, and the low-priority job is able to steal CPU cycles as a result.

The high CPU demand of each job's barrier process also explains the very low improvement in throughput, as shown by the finishing time of the low-priority job. Since the high-priority job requires most of the cycles on the CPU containing its barrier process, it prevents the low-priority job from obtaining enough of the otherwise idle CPU cycles in the cluster, resulting in the low observed improvement in throughput.

The results of the previous experiments suggested the third set of experiments, as illustrated in Figure 6.20. Here, the barrier process of each job shared its CPU with a non-barrier process of the other job.

Figure 6.17: Job turnaround time for B (v2) and B (v2), barrier processors the same, on the non-preemptive and preemptive kernels.

Figure 6.18: Job turnaround time for B (v2) and B (v3), barrier processors the same, on the non-preemptive and preemptive kernels.
This set of experiments further illustrated the effects of the high CPU demand of the barrier process of each job. In this case, although a non-barrier process of the low-priority job occupied the same CPU as the barrier process of the high-priority job, the high CPU demand of the high-priority barrier process still prevented the low-priority job from obtaining significant CPU time. For the non-RT experiments, the high-priority job achieved slightly better performance than in the case in which two barrier processes shared the same CPU. However, the RT experiments still achieved the highest performance for the high-priority job, as expected.

Figure 6.19: Job turnaround time for B (v3) and B (v3), barrier processors the same, on the non-preemptive and preemptive kernels.

Figure 6.20: Experimental format, Barrier programs, barrier process shares its CPU with a non-barrier process of the other job.

The prior results provided motivation for the examination of more advanced cluster allocation techniques. Up to this point, all experiments had given the same set of processors to both jobs. Figure 6.23 illustrates the fourth variation on the experiments, in which the barrier process of each job was given exclusive access to its CPU. In this experiment, the programs were run on partially intersecting clusters, in which the barrier process of each program had exclusive access to a CPU (hence, the experiment could be run only for cluster sizes of 1 to 5 processors, and for a cluster size of 1, the programs used the same processor).

When each job was given exclusive access to the CPU implementing its barrier process, a significant improvement in overall JTT (hence, throughput) was observed, as illustrated in Figure 6.24. Furthermore, the RT-based experiment achieved the best possible performance for the high-priority job, while it allowed the low-priority job to achieve a significant improvement in its overall JTT.

Figure 6.21: Job turnaround time for B (v2) and B (v3), barrier processors different, on the non-preemptive and preemptive kernels.

Figure 6.22: Job turnaround time for B (v1) and B (v3), barrier processors different, on the non-preemptive and preemptive kernels.
Figure 6.23: Experimental format, Barrier programs, each barrier process occupies a CPU exclusively.

Figure 6.24: Job turnaround time for B (v1) and B (v3), partially intersecting clusters, on the non-preemptive and preemptive kernels.

Figure 6.25 illustrates the final variation on the experiments, in which the low-priority job was given a subset cluster of the high-priority job's cluster. In this experiment, the high-priority job's barrier process had exclusive use of its CPU, while the low-priority job's barrier process had to occupy a CPU from the high-priority job's cluster. In each case, the high-priority job ran using 6 processors, while the low-priority job varied from 1 to 5 processors (again, to preserve the exclusivity of the high-priority barrier process, the low-priority job could use only up to 5 processors).

Figure 6.25: Experimental format, Barrier programs, low-priority Barrier occupies a subset of the high-priority Barrier's cluster.

Figure 6.26 presents the results of that experiment. In both the non-RT and RT cases, the high-priority job's JTT was unaffected, while the low-priority job achieved an improvement in throughput by finishing ahead of the space-sharing case.

Figure 6.26: Job turnaround time for B (v1 and v3), with v3 having a subset cluster, on the non-preemptive and preemptive kernels.

These experiments have shown that the primary goal of not affecting the high-priority job's JTT can be (and has been) accomplished. They also suggest that more advanced cluster allocation techniques are required to ensure the secondary goal of higher throughput. For example, the experiments presented in Figures 6.17 through 6.22 illustrate that, while the high-priority job may be unaffected, the low-priority job may gain nothing if an ineffective cluster allocation is made. Figures 6.24 and 6.26 indicate that, for barrier programs, the CPU on which a barrier process exists should not be scheduled with other jobs.

Chapter 7

Conclusion

This thesis has described advanced processor scheduling methods for two classes of distributed-memory parallel computers. The techniques described here have been shown to decrease job turnaround time and increase system throughput.
Furthermore, they have been shown to be practical techniques that are easily integrated into existing systems. This chapter presents a brief summary of the two major aspects of our research, followed by our conclusions, as well as directions for future work.

7.1 Hypercubes

There are two primary conclusions that may be reached from the study of the AFL technique for cluster allocation in hypercubes. First, using the AFL with 2-D mesh requests generally increases system throughput and decreases average user JTT. When the AFL method is used in conjunction with the FL algorithm, jobs in allocated clusters suffer no communication interference from other jobs, yet they are not required to occupy a perfect subcube. The AFL provides an additional benefit to jobs that would typically suffer severe performance degradation due to having too many allocated processors. Second, the geometry of the multiprocessor architecture is not necessarily a restriction on the cluster allocation method. By using 2-D mesh allocation within subcubes, the AFL method is shown not only to increase system performance, but also to provide the potential to increase individual job performance. The AFL algorithm also has practical benefits: it may be used in conjunction with any existing subcube allocation algorithm. Therefore, it may be implemented on many existing installed systems that may not use the FL method of subcube allocation.

7.2 Networks of Workstations

Our initial feasibility study into timesharing techniques for user jobs in workstation clusters indicated that the prioritized execution of parallel jobs can improve system throughput without significant penalty to the observed user JTT. Our analytical models suggested that user code could be modified to reset a job's priority with respect to other jobs. However, one of the primary drawbacks to this technique was its requirement of user participation to realize an improvement in system throughput. The study also revealed that preemptive scheduling methods would be required so that the execution time of the high-priority job was unaffected.

The feasibility study provided the motivation for the creation of the Elmer clustering and scheduling facility. Elmer has allowed us to study experimental clustering and scheduling methods in NOW-based CPCs. Furthermore, it provides a simpler user interface for the prioritized execution of high-performance jobs. The Elmer facility is uniquely specialized to take advantage of the timeshared execution of parallel jobs. The cluster allocation algorithms used within Elmer were also developed for this purpose. The performance measurements obtained from the use of the Elmer facility allowed us to verify that our cluster allocation algorithms provide higher system throughput than pure space-sharing techniques.

7.3 Contributions of this Thesis

In a high-performance computing environment in which standard processor scheduling techniques are used, parallel applications are rarely able to fully utilize the hardware resources over which they have control. Standard scheduling techniques can result in two forms of under-utilization of system resources. First, inefficient cluster allocation can result in unused processors within a cluster. Second, space sharing of parallel jobs can result in under-utilization of CPUs due to the communication latency present in clusters of workstations. The research presented in this thesis addresses and solves these problems using unique, yet practical, methods.
Our research in hypercube cluster allocation has demonstrated a means to more effectively allocate clusters of processors in the hypercube architecture. Unneeded CPU resources, in the form of over-allocated clusters of processors, can contribute to decreased performance of individual parallel jobs due to communication overhead. Our cluster allocation technique, the auxiliary free list method, demonstrated improved performance over the free list subcube allocation method in two respects. First, it improved system throughput and decreased average user JTT. Second, it enabled specific user jobs to utilize only the resources necessary to achieve their optimum performance, eliminating the over-allocation of processors that can result in decreased performance for the specific job. Furthermore, our research has demonstrated that it is possible to allocate clusters of a specific size in hypercubes, eliminating the requirement of using subcube allocation.

Our research in scheduling for workstation clusters has demonstrated a means by which processor sharing can be employed to improve system throughput and individual CPU utilization without penalty to the JTT of user jobs. We first showed that timesharing of parallel jobs can be employed to improve system throughput. We then demonstrated a means by which preemptive scheduling can be employed so that the execution time of higher-priority jobs is not affected by the presence of lower-priority jobs. We have also developed a practical facility that uses realtime scheduling mechanisms to implement this prioritized scheduling technique. Our implementation also demonstrates that the use of preemptive scheduling techniques can be accomplished with little or no modification to the operating system.

7.4 Future Work

There are a number of issues yet to be addressed, specifically involving the allocation of clusters in a NOW-based CPC. Our research has presented a means by which processor sharing may take place in a CPC. However, our experimental facility has limited the scope of our experiments. First, the number of machines was limited to 6, which prevented us from examining issues involving cluster allocation when a larger number of processors is available; furthermore, we were unable to examine issues concerning the scalability of the Elmer facility. Second, our experiments did not examine the impact of memory usage on the allowable degree of multiprogramming at each node. Third, we presented two possible cluster allocation algorithms for NOW-based CPCs. Our preliminary results from experiments testing the throughput obtainable with these cluster allocation algorithms suggest that further investigations into cluster allocation, as well as job scheduling, are merited.

Bibliography

[1] T.-H. Tzen and L. M. Ni, "Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, pp. 87-98, January 1993.

[2] M. J. Gonzalez, "Deterministic processor scheduling," ACM Computing Surveys, vol. 9, pp. 173-204, September 1977.

[3] S. W. Turner, L. M. Ni, and B. H. C. Cheng, "Contention-free 2D-mesh cluster allocation in hypercubes," IEEE Transactions on Computers, to appear.

[4] M. W. Mutka and M. Livny, "Scheduling remote processing capacity in a workstation-processor bank network," in Proceedings of the 7th International Conference on Distributed Computing Systems, (Berlin, West Germany), pp. 2-9, September 1987.

[5] M. J. Litzkow, M. Livny, and M. W. Mutka, "Condor - a hunter of idle workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, (San Jose, California), pp. 104-111, June 1988.
[6] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, "PVM 3 user's guide and reference manual," Tech. Rep. ORNL/TM-12187, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, May 1993.

[7] L. M. Ni, Parallel Computer Architecture and Programming. East Lansing, MI: Unpublished manuscript, 1993.

[8] Kendall Square Research, KSR1 Technical Summary, 1993.

[9] BBN Advanced Computers Inc., Cambridge, Massachusetts, Inside the GP1000, 1989.

[10] BBN Advanced Computers Inc., Cambridge, Massachusetts, Inside the TC2000 Computer, 1990.

[11] B. Duzett and R. Buck, "An overview of the nCUBE 3 supercomputer," in Proceedings of Frontiers '92, pp. 458-464, IEEE, 1992.

[12] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Fyler, "The message-driven processor: A multicomputer processing node with efficient mechanisms," IEEE Micro, pp. 23-39, 1992.

[13] C. E. Leiserson, "Fat-trees: Universal networks for hardware-efficient supercomputing," IEEE Transactions on Computers, vol. C-34, pp. 892-901, October 1985.

[14] G. F. Pfister et al., "An introduction to the IBM research parallel processor prototype (RP3)," in Experimental Parallel Computing Architectures (J. J. Dongarra, Ed.), pp. 123-140, Elsevier Science Publishers B.V., Amsterdam, 1987.

[15] A. Gottlieb, "An overview of the NYU ultracomputer project," in Experimental Parallel Computing Architectures (J. Dongarra, Ed.), pp. 25-95, North Holland, 1987.

[16] R. Arlauskas, "iPSC/2 system: A second generation hypercube," in Proceedings of the Third Hypercube Conference, (Pasadena, CA), pp. 38-42, Association for Computing Machinery, January 1988.

[17] nCUBE Corporation, nCUBE 6400 Programmer's Guide. Beaverton, Oregon, 1990.

[18] C. L. Seitz, "The Cosmic Cube," Communications of the ACM, vol. 28, pp. 22-33, January 1985.

[19] K. Li and K.-H. Cheng, "A two-dimensional buddy system for dynamic resource allocation in a partitionable mesh connected system," Journal of Parallel and Distributed Computing, vol. 12, pp. 79-83, 1991.

[20] P.-J. Chuang and N.-F. Tzeng, "An efficient submesh allocation strategy for mesh computer systems," in Proceedings of the 11th International Conference on Distributed Computing Systems, (Arlington, Texas), pp. 256-263, 1991.

[21] Y. Zhu, "Efficient processor allocation strategies for mesh-connected parallel computers," Journal of Parallel and Distributed Computing, vol. 16, pp. 328-337, 1992.

[22] S. Bhattacharya, L. M. Ni, and W.-T. Tsai, "Lookahead processor allocation in mesh-connected massively parallel computers," Tech. Rep. 93-25, Computer Science Department, University of Minnesota, April 1993.

[23] K. C. Knowlton, "A fast storage allocator," Communications of the ACM, vol. 8, pp. 623-625, October 1965.

[24] M.-S. Chen and K. G. Shin, "Processor allocation in an n-cube multiprocessor using gray codes," IEEE Transactions on Computers, vol. C-36, pp. 1396-1407, December 1987.

[25] J. Kim, C. R. Das, and W. Lin, "A top-down processor allocation scheme for hypercube computers," IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 1, pp. 20-30, 1991.
[26] A. Al-Dhelaan and B. Bose, "A new strategy for processor allocation in an n-cube multiprocessor," in Proceedings of the International Phoenix Conference on Computers and Communications, March 1989.

[27] C.-H. Huang, T.-L. Huang, and J.-Y. Juang, "On processor allocation in hypercube multiprocessors," in Proceedings of COMPSAC '89: Computer Software and Applications Conference, (Orlando, Florida), pp. 16-23, IEEE Computer Society, September 1989.

[28] S. Dutt and J. P. Hayes, "On allocating subcubes in a hypercube multiprocessor," in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, (Pasadena, California), pp. 801-810, ACM, January 1988.

[29] P. Krueger, T.-H. Lai, and V. A. Radiya, "Processor allocation vs. job scheduling on hypercube computers," in Proceedings of the 11th International Conference on Distributed Computing Systems, (Arlington, Texas), pp. 394-401, 1991.

[30] N. Carriero and D. Gelernter, "Linda in context," Communications of the ACM, vol. 32, pp. 444-458, April 1989.

[31] R. Butler and E. Lusk, "User's guide to the P4 programming system," Tech. Rep. ANL/MCS-92/17, Argonne National Laboratory, Mathematics and Computer Science Division, Argonne, Illinois 60439-4844, 1992.

[32] D. W. Walker, "The design of a standard message passing interface for distributed memory concurrent computers," Tech. Rep. ORNL/TM-12512, Oak Ridge National Laboratory, Oak Ridge, TN, October 1993.

[33] S. White, A. Alund, and V. S. Sunderam, "The NAS parallel benchmarks on virtual parallel machines," tech. rep., Department of Mathematics and Computer Science, Emory University, Atlanta, Georgia, November 1993.

[34] M. M. Theimer and K. A. Lantz, "Finding idle machines in a workstation-based distributed system," in Proceedings of the 8th International Conference on Distributed Computing Systems, (San Jose, California), pp. 112-122, 1988.

[35] M. Stumm, "The design and implementation of a decentralized scheduling facility for a workstation cluster," in Proceedings of the Second IEEE Conference on Computer Workstations, (Santa Clara, California), pp. 12-22, March 1988.

[36] P. Krueger and R. Chawla, "The stealth distributed scheduler," in Proceedings of the 11th International Conference on Distributed Computing Systems, (Arlington, Texas), pp. 336-343, 1991.

[37] M. J. Atallah, C. Lock, and D. C. Marinescu, "Co-scheduling compute-intensive tasks on a network of workstations: Model and algorithms," in Proceedings of the 11th International Conference on Distributed Computing Systems, (Arlington, Texas), pp. 344-352, 1991.

[38] S. K. Setia, M. S. Squillante, and S. K. Tripathi, "Processor scheduling on multiprogrammed, distributed memory parallel computers," in Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, (Santa Clara, California), pp. 158-170, 1993.

[39] Y. Saad and M. H. Schultz, "Topological properties of hypercubes," IEEE Transactions on Computers, vol. C-37, pp. 867-872, July 1988.

[40] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62-76, February 1993.

[41] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S.-W. Yang, and R. Zak, "The network architecture of the Connection Machine CM-5," in Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, (San Diego, California), pp. 272-285, May 1992.