This is to certify that the thesis entitled

Parallel Network RAM: Effectively Utilizing Global Cluster Memory for Large Data-Intensive Programs

presented by Jonathan James Oleszkiewicz has been accepted towards fulfillment of the requirements for the degree of M.S. in Computer Science and Engineering.

Major Professor's Signature

Date

PARALLEL NETWORK RAM: EFFECTIVELY UTILIZING GLOBAL CLUSTER MEMORY FOR LARGE DATA-INTENSIVE PROGRAMS

By

Jonathan James Oleszkiewicz

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

2004

ABSTRACT

PARALLEL NETWORK RAM: EFFECTIVELY UTILIZING GLOBAL CLUSTER MEMORY FOR LARGE DATA-INTENSIVE PROGRAMS

By Jonathan James Oleszkiewicz

Large scientific parallel applications demand large amounts of memory space. Current parallel computing platforms schedule jobs without fully knowing their memory requirements. This leads to uneven memory allocation in which some nodes are overloaded. This, in turn, leads to disk paging, which is extremely expensive in the context of scientific parallel computing. To solve this problem, we propose a new peer-to-peer solution called "Parallel Network RAM". This approach avoids the use of disk, better utilizes available RAM resources, and allows larger problems to be solved while reducing the computational, communication and synchronization overhead typically involved in parallel applications.

To Patty, my constant source of inspiration and the love of my life.

Acknowledgments

This research was made possible by grants from the Michigan Space Grant Consortium and the National Science Foundation.

Table of Contents

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Problem Statement
  1.2 Terminology
  1.3 Background and Related Work
    1.3.1 Parallel Scheduling Algorithms
    1.3.2 Previous Solutions to the Memory Problem
    1.3.3 Network RAM
2 Parallel Network RAM
  2.1 Introduction
  2.2 Generic Description
    2.2.1 Clients
    2.2.2 Managers
    2.2.3 Servers
  2.3 Designs
    2.3.1 Client
    2.3.2 Centralized
    2.3.3 Local Managers
    2.3.4 Backbone
3 Methodology
  3.1 Simulator
    3.1.1 Workload Model
    3.1.2 System Models
    3.1.3 Job Behavior Models
    3.1.4 Scheduler Models
    3.1.5 Metrics
  3.2 Experimental Setup
    3.2.1 Experiment Table Detail
    3.2.2 Dimensions of the Experiment Table
4 Performance Evaluation
  4.1 64 Node Cluster
    4.1.1 Base Experiments
    4.1.2 RAM Variations
    4.1.3 Network Performance Variations
    4.1.4 Network Topology Variations
    4.1.5 Space Sharing Scheduler Experiments
  4.2 128 Node Cluster
    4.2.1 Base Experiments
    4.2.2 RAM Variations
    4.2.3 Network Performance Variations
    4.2.4 Network Topology Variations
    4.2.5 Space Sharing Scheduler Experiments
5 Analysis
  5.1 Discussion
    5.1.1 Memory Load
    5.1.2 Network
    5.1.3 Scheduling
    5.1.4 PNR Designs
  5.2 Conclusion
  5.3 Future Work
LIST OF REFERENCES

List of Tables

3.1 Table of Experiments.
5.1 Recommended PNR Design Usage.

List of Figures

2.1 Diagram of Parallel Network RAM. Application 2 is assigned to PEs P3, P4 and P5 but utilizes available memory space of P2, P6 and P7.
2.2 The architecture of a servent. Each servent can act in three roles.
2.3 Client-only design.
2.4 Centralized design.
2.5 Local manager design.
2.6 Backbone design.
4.1 Base experiment - 64 PE - several workloads - response time.
4.2 Base experiment - 64 PE - 4,000 jobs - response time.
4.3 Base experiment - 64 PE - 4,000 jobs - optimization ratio.
4.4 Base experiment - 64 PE - 5,000 jobs - response time.
4.5 Base experiment - 64 PE - 5,000 jobs - optimization ratio.
4.6 RAM experiments - 64 PE - 4,000 jobs - response time.
4.7 RAM experiments - 64 PE - 4,000 jobs - optimization ratio.
4.8 RAM experiments - 64 PE - 5,000 jobs - response time.
4.9 RAM experiments - 64 PE - 5,000 jobs - optimization ratio.
4.10 Network experiments - 64 PE - 4,000 jobs - response time.
4.11 Network experiments - 64 PE - 4,000 jobs - optimization ratio.
4.12 Network experiments - 64 PE - 5,000 jobs - response time.
4.13 Network experiments - 64 PE - 5,000 jobs - optimization ratio.
4.14 Topology experiments - 64 PE - 4,000 jobs - response time.
4.15 Topology experiments - 64 PE - 4,000 jobs - optimization ratio.
4.16 Topology experiments - 64 PE - 5,000 jobs - response time.
4.17 Topology experiments - 64 PE - 5,000 jobs - optimization ratio.
4.18 Space sharing experiments - 64 PE - 4,000 jobs - response time.
4.19 Space sharing experiments - 64 PE - 4,000 jobs - optimization ratio.
4.20 Space sharing experiments - 64 PE - 5,000 jobs - response time.
4.21 Space sharing experiments - 64 PE - 5,000 jobs - optimization ratio.
4.22 Base experiment - 128 PE - several workloads - response time.
4.23 Base experiment - 128 PE - 4,000 jobs - response time.
4.24 Base experiment - 128 PE - 4,000 jobs - optimization ratio.
4.25 Base experiment - 128 PE - 5,000 jobs - response time.
4.26 Base experiment - 128 PE - 5,000 jobs - optimization ratio.
4.27 RAM experiments - 128 PE - 4,000 jobs - response time.
4.28 RAM experiments - 128 PE - 4,000 jobs - optimization ratio.
4.29 RAM experiments - 128 PE - 5,000 jobs - response time.
4.30 RAM experiments - 128 PE - 5,000 jobs - optimization ratio.
4.31 Network experiments - 128 PE - 4,000 jobs - response time.
4.32 Network experiments - 128 PE - 4,000 jobs - optimization ratio.
4.33 Network experiments - 128 PE - 5,000 jobs - response time.
4.34 Network experiments - 128 PE - 5,000 jobs - optimization ratio.
4.35 Topology experiments - 128 PE - 4,000 jobs - response time.
4.36 Topology experiments - 128 PE - 4,000 jobs - optimization ratio.
4.37 Topology experiments - 128 PE - 5,000 jobs - response time.
4.38 Topology experiments - 128 PE - 5,000 jobs - optimization ratio.
4.39 Space sharing experiments - 128 PE - 4,000 jobs - response time.
4.40 Space sharing experiments - 128 PE - 4,000 jobs - optimization ratio.
4.41 Space sharing experiments - 128 PE - 5,000 jobs - response time.
4.42 Space sharing experiments - 128 PE - 5,000 jobs - optimization ratio.

Chapter 1

Introduction

1.1 Problem Statement

Many scientific computing applications demand a large amount of processing power, memory space and I/O accesses. Cluster systems with networked server nodes are becoming more popular for executing high-performance scientific computing applications, for both economic and technical reasons [16]. One standard approach to reducing the runtime of such applications is to parallelize them into multiple parallel processes so that many cluster nodes can run parts of the application simultaneously. An advantage of this approach is that the CPU and memory resources of the application are evenly distributed and used. However, this advantage may not serve the best performance interests of parallel processes, because a balanced workload distribution among parallel processes may result in unbalanced resource utilization in a cluster. Uniform use of resources at the parallel process level does not necessarily mean the system itself is evenly utilized.

We can attempt to adjust memory load by adjusting the number of processes each parallel process has, but there are trade-offs between memory usage and efficiency. To ensure each node has enough memory space to accommodate processes, we could partition parallel processes into a large number of processes. This results in less work and more required synchronization for each process. As a consequence, the CPU on each node may be underutilized. Parallel speedup is very hard to improve as the number of processes increases, due to increasing communication and synchronization overhead. This problem is sometimes referred to as the exponential performance degradation issue.

To ensure good CPU utilization we must limit the number of processes in a parallel process. But, as the problem size increases, nodes may run out of available memory and be forced to use the local disk as a swapping site [3, 4]. Performance will suffer from frequent page faults, since hard disks are orders of magnitude slower than RAM. Research has shown that disk paging results in unsatisfactory performance on parallel platforms and should be avoided [3, 4, 26]. We can increase the number of cluster nodes or the amount of RAM at each node, but doing so may be impossible or very expensive given the cluster setup.
In addition, it is unlikely that the additional nodes will offer any real benefit over the long term, since the users of the cluster will likely increase their usage of the system to match the new resources available. This leads back to the original problem.

The key problem is memory usage, and this problem has two parts: memory fragmentation and paging overhead. The aggregate memory capacity of the system may be enough to satisfy memory demands, but cluster memory is distributed into small chunks. Usage of these chunks may be uneven and inefficient. On the most heavily loaded nodes, disk paging is invoked, which incurs a high cost.

Network RAM [1, 18] has been proposed for use by sequential jobs in clusters to even out memory load and reduce paging overhead. This technique allows applications to allocate more memory than is available on the local machine while avoiding paging to disk, by allocating idle memory of other machines over a fast interconnecting network. This remote RAM is treated as a new layer in the memory hierarchy between RAM and disk. Resulting page accesses are slower than RAM, but faster than disk [1, 9, 18, 23, 39].

Existing network RAM techniques should not be directly applied to parallel jobs for performance gains. One issue is that processes from the same parallel job synchronize regularly. If each node hosting these processes seeks network RAM independently, they may be granted an uneven amount of network RAM. With this uneven allocation, the processes executing on these nodes will run at different speeds. However, the parallel job as a whole will only run at the speed of the slowest process, due to synchronization. The nodes with extra network RAM waste it, since their hosted processes will spend most of their time waiting for other processes. Therefore, coordination is required to grant overloaded nodes equal portions of memory, allowing hosted processes to run at equal speeds.

Another issue is network congestion. If parallel processes individually seek out network RAM with no coordination among themselves, a potentially large amount of unnecessary network traffic will result. This may induce congestion on the cluster. Parallel applications require high performance networks to run efficiently. Congestion could seriously impact the performance of jobs on the system.

We propose a new peer-to-peer solution, called Parallel Network RAM (PNR), so overloaded cluster nodes can utilize idle remote memory. In this scheme, each node may request memory resources from remote nodes and provide memory resources for others. Requests are indirect: each node contacts a manager (super-peer) node and requests that it allocate network RAM on its behalf. Managers coordinate the allocation of network RAM of several nodes and ensure that memory resources are distributed evenly to the nodes hosting parallel processes belonging to the same parallel job. PNR will allow more jobs to execute concurrently without resorting to disk paging. This will lead to decreased average response times and higher system throughput.

This thesis makes several contributions.

• We first identify the unbalanced resource utilization problem in a cluster with a mixed workload of jobs with different resource requirements. Existing techniques cannot maximize the performance gain of parallel jobs in such an environment in terms of both parallel speedup and execution time.

• We propose a novel and effective solution to this problem called Parallel Network RAM (PNR).
PNR makes it possible for parallel jobs in a cluster to utilize memory resources from available remote nodes. The CPU cycles will be provided by a small subset of nodes while the global memory space of the cluster is open to the memory demands of any parallel job. Since the speed gap between accessing local memory and remote memory is shrinking, and the speed gap between accessing local disk and remote memory continues to grow, the proposed scheme is expected to be beneficial for large-scale scientific computing applications now and in the future.

• We build a simulator that models a cluster and several proposed PNR algorithms. Conducting trace-driven simulations, we compare the performance of six different PNR designs to each other and to a disk paging-only solution. We then identify which proposed algorithms are superior and under what conditions.

This thesis is organized as follows. Chapter 1 describes the problem presented in cluster systems and reviews related work on this issue. Chapter 2 describes Parallel Network RAM, the algorithms used to implement PNR, and the different proposed PNR designs with their strengths and weaknesses. Chapter 3 describes the simulator we have created to test our PNR algorithms and the specific experiments set up for these tests. Chapter 4 describes the results of the experiments introduced in chapter 3. Chapter 5 discusses the results gathered and draws conclusions about the various designs.

1.2 Terminology

The systems described in this thesis include computing clusters and supercomputers. We describe individual computers in the clusters and CPUs in the supercomputers with the terms "PE" (Processing Element) and "node". The terms are equivalent and will be used interchangeably.

The terms "parallel process" and "job" are also equivalent and are used to describe programs that have multiple threads of execution that run on separate nodes. "Thread" and "process" denote the individual threads of execution of parallel processes.

We define four different Parallel Network RAM designs. We compare these designs to a system that uses only disk paging (no network RAM). For brevity, we will use acronyms to identify these designs. The following list defines each acronym.

DP - Disk Paging - applies to systems that do not use Parallel Network RAM.
PNR - Parallel Network RAM - a generic label for all Parallel Network RAM designs.
CEN - Centralized PNR - only one manager exists.
CLI - Clients-only PNR - a pure peer-to-peer design.
MAN - Local Managers PNR - a design that uses randomly selected local managers.
BBX - Backbone PNR - a design where a certain number of servents (represented by X) act as "backbone" managers.

1.3 Background and Related Work

The majority of work in this area has focused on parallel job scheduling. In this section, we describe various parallel job schedulers and previous solutions to the problem of overloaded memory on cluster systems.

1.3.1 Parallel Scheduling Algorithms

The primary duty of the scheduler on a cluster system is to ensure high system throughput and low overall response times of submitted jobs. The most simplistic scheduling model is the dedicated machine model. In this model, only one job runs on the system at a time. Each job is scheduled using a priority queue which may be sorted in a variety of ways, such as first-come first-serve (FCFS), estimated shortest job first, best fit, or worst fit. Each job runs until completion and may not be preempted.
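To make these queue orderings concrete, the sketch below shows how a dedicated-machine scheduler might be structured. It is our illustration, not code from the thesis simulator; the Job fields and comparator names are assumptions chosen to match the orderings named above.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Hypothetical job record; fields chosen to support the orderings above.
struct Job {
    double submitTime;       // arrival time (seconds)
    double estimatedRuntime; // user-supplied estimate (seconds)
    int    nodesRequested;   // PEs requested; used by best-fit/worst-fit orderings (not shown)
};

// FCFS: earliest submission runs first.
struct FcfsOrder {
    bool operator()(const Job& a, const Job& b) const {
        return a.submitTime > b.submitTime; // priority_queue pops the smallest submitTime
    }
};

// Estimated shortest job first: smallest runtime estimate runs first.
struct SjfOrder {
    bool operator()(const Job& a, const Job& b) const {
        return a.estimatedRuntime > b.estimatedRuntime;
    }
};

// Dedicated machine model: one job at a time, run to completion, no preemption.
template <typename Order>
double runDedicated(const std::vector<Job>& jobs) {
    std::priority_queue<Job, std::vector<Job>, Order> queue(Order{}, jobs);
    double clock = 0.0;
    while (!queue.empty()) {
        Job next = queue.top();
        queue.pop();
        clock = std::max(clock, next.submitTime); // wait for the job to arrive
        clock += next.estimatedRuntime;           // whole machine until completion
    }
    return clock; // makespan of the whole workload
}
```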
The dedicated machine model provides neither time sharing nor space sharing, and since most parallel jobs do not use all of the available nodes in the cluster, it can waste a large amount of resources.

An approach that makes better use of available resources is the space sharing model. Space sharing allows more than one job to be scheduled on the cluster at one time. Each node is devoted to one process, and each job runs until completion without preemption. The space sharing model is vulnerable to large, long-running jobs monopolizing the system, and a bad scheduling decision is difficult to correct. The dedicated machine model shares this problem [16].

Gang scheduling combines both space sharing and time sharing to avoid the problems associated with large jobs. Each job is allotted a time slot and the job may execute in its time slot. Nodes within each time slot are space shared. When a time quantum has expired, all jobs in the current slot are preempted and replaced with jobs in the next slot. The preemption process is called a "Parallel Context Switch" (PCS) and involves a certain amount of overhead. There is at most one process running on each node at any given time, although some schemes relax this constraint [36]. The time slot mechanism ensures no large job may monopolize the system for a long period of time. The maximum number of time slots allowed is known as the Multi-Programming Level (MPL). Setting the MPL to a low number is a convenient way to reduce PCS overhead and reduce overall memory load on nodes. A system with an MPL value of one is a simple space sharing system with no time sharing.

The length (in time) of time slots varies. It is intuitive to make each time slice the same length. Fixed-length time slots range from 1 second [4] to 120 seconds [29]. One disadvantage of uniform-length time slices is that excessive idleness can be caused by short jobs [10]. Some research has shown that variable-length time quanta may be desirable to increase efficiency [14, 34].

There are a variety of processor allocation strategies and variations of gang scheduling. These include first fit, best fit, left-right by size, left-right by slots, load-based allocation strategies, "buddy" systems such as distributed hierarchical control, and migration-based algorithms [10, 41].

One problem with gang scheduling is that the usage of nodes within time slots may become fragmented as jobs terminate and new jobs take their place. Two techniques can reduce node fragmentation: alternative scheduling and slot unification. Alternative scheduling discovers time slots that have idle nodes and finds jobs that run on these same nodes in other time slots. These jobs are allowed to run in both their original time slot and the partially idle slot. This effectively doubles the amount of time the selected jobs are allowed to run, increasing system utilization and reducing average response time. Slot unification attempts to unify time slots which use disjoint nodes. Typically, such situations occur after job termination [10]. This technique reduces the total number of time slots used by the scheduler and increases the speed of the system as perceived by the parallel processes.

Another available technique is backfilling. If users provide estimated job run times, the scheduler can identify idle areas in the schedule which can be filled with smaller, lower priority jobs. The smaller jobs must not interfere with the scheduled running time of jobs ahead in the queue. If a backfilled job runs for too long, it is killed.
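As a rough illustration of the idea (ours, in the spirit of EASY backfilling, not taken from any particular scheduler in the thesis), the check below decides whether a candidate job can be backfilled without delaying the reservation held by the job at the head of the queue. The type and field names are assumptions.

```cpp
// Hypothetical backfill eligibility check.
struct Reservation {
    double startTime; // when the queued head job is guaranteed to start
    int    freeNodes; // nodes idle until that start time
};

struct Candidate {
    double estimatedRuntime; // user-supplied estimate (seconds)
    int    nodesRequested;
};

// A job may be backfilled if it fits in the idle nodes and, by its own
// estimate, finishes before the head job's reserved start time. Jobs that
// exceed their estimate are killed, which keeps the reservation safe.
bool canBackfill(const Candidate& job, const Reservation& res, double now) {
    bool fitsInNodes = job.nodesRequested <= res.freeNodes;
    bool fitsInTime  = now + job.estimatedRuntime <= res.startTime;
    return fitsInNodes && fitsInTime;
}
```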
Backfilling has been shown to increase utilization and reduce response time. Several variations of backfilling have been studied [24, 28, 32, 33, 35, 40]. The MAUI scheduler, a particularly popular and mature scheduler for cluster systems, is a space-sharing, backfilling scheduler [20]. Research work has been done to implement gang schedulers on both supercomputing platforms [7] and clusters of workstations [19].

1.3.2 Previous Solutions to the Memory Problem

Previous studies agree that unmodified disk paging results in severely reduced performance on parallel systems [4, 30]. Generally speaking, previous solutions to this problem either attempt to avoid disk paging entirely or attempt to reduce its effect.

Various ways to avoid paging by altering the system scheduler have been suggested. If no memory information about incoming jobs is known, then the simplest solution is to keep the MPL to a minimum [24]. Another solution attempts to guess memory usage information based on information about the job provided by the user and information contained in the program executable. This guess is used in scheduling decisions [3]. Another solution uses speedup information known about jobs ahead of time to make scheduling decisions [26, 27]. Given this information, the scheduler can choose to give more processors to efficient jobs to increase utilization, or to memory-intensive jobs to increase throughput.

User-provided runtime and memory information can be used in scheduling decisions as well. In fact, the use of user estimates of runtimes is the basis of backfilling. However, it is well known that such information is unreliable, as users tend to estimate numbers that will ensure they get a good position in the scheduler queue. Some backfilling schemes attempt to take advantage of systematically inaccurate runtime estimations [28].

One method that is aimed at reducing the disk paging penalty is called block paging. In this scheme, the system groups sets of pages together and acts upon these groups as units. Groups are defined by the system based on the memory reference behavior of jobs [34].

1.3.3 Network RAM

Network RAM is a technique that reduces paging overhead. Much work has been done for sequential job scheduling with memory considerations. Regarding network RAM implementations, the Global Memory System (GMS) [9] and the Remote Memory Pager [23] attempt to reduce page fault overhead by using remote paging techniques. DoDo [1] is designed to improve system throughput by harvesting idle memory space in a distributed system. In DoDo, processes running on the local system have the highest priority for using CPUs and memory on their workstations. This divides the global memory system into different local regions. A memory ushering algorithm is used in MOSIX for memory load sharing [2]. This solution is a job-migration-based load sharing approach. Recently, several load sharing alternatives have been developed. These techniques consider both CPU and memory resources with known and unknown memory demands [37, 38]. The objective of the designs is to reduce the number of page faults caused by unbalanced memory allocations of distributed jobs so that overall performance can be significantly improved.

Chapter 2

Parallel Network RAM

2.1 Introduction

We propose a novel and effective technique called Parallel Network RAM (PNR) to better utilize both CPU and memory and minimize communication and synchronization overhead. We demonstrate the basic idea of PNR in Figure 2.1.
In this figure, application 2 runs on nodes P3, P4 and P5 but utilizes available memory space in nodes P2, P6 and P7, where the non-memory-intensive applications 1 and 3 reside. With PNR, instead of three nodes being overloaded while four are underloaded, all seven nodes are fully utilized.

Figure 2.1: Diagram of Parallel Network RAM. Application 2 is assigned to PEs P3, P4 and P5 but utilizes available memory space of P2, P6 and P7.

Our objective is to fundamentally improve the efficiency of large-scale scientific computing. Under our proposed solution, CPU and memory resource allocations are considered separately. CPU utilization requirements can be taken into account during the scheduling phase, and memory requirements can be handled as needed during execution. In this way, CPU usage can be optimized to maximize utilization and minimize communication and synchronization overhead, while memory usage can include both the local memory space from the assigned CPUs and the remote memory space in other nodes (as needed). With PNR, speedup can be scaled as the available remote memory increases, and performance can also be scaled as the problem size increases. Because CPU usage and memory usage are considered separately, PNR does not coordinate with or receive information from the assumed centralized scheduler of the system. This allows PNR to be implemented on a variety of platforms without changes to their existing scheduling policies.

2.2 Generic Description

In brief, all nodes in the system host PNR servents (see figure 2.2). All servents act as PNR clients and servers. Some servents may act as managers. Managers act as proxies for clients to communicate with servers. The purpose of managers is to coordinate client requests.

Figure 2.2: The architecture of a servent. Each servent can act in three roles.

2.2.1 Clients

A PNR client attempts to allocate and deallocate network RAM on behalf of its hosting node. The node uses allocated network RAM as additional virtual memory, just as it would with disk space. When a process starts execution on a node, it allocates the amount of memory it will use during its execution. If the node's client determines that this allocation will lead to disk usage, the client contacts a manager and requests network RAM. Once network RAM is allocated, the client is informed by the manager which machines are serving the network RAM and how much was allocated. The client may then start sending pages to the servers for storage and later retrieval.

Similarly, when a process stops execution it will deallocate its memory. If the client detects that previously allocated network RAM is no longer needed, then the client signals a manager that it may be deallocated. Clients deallocate chunks of memory first from servers believed to have the highest memory load.

2.2.2 Managers

PNR managers are the centerpiece of most of the proposed PNR algorithms. They do the majority of allocation and deallocation work and act as proxies between clients and servers.

Network RAM Request

Managers listen for network RAM requests from clients (including the client residing on the same node as the manager). Depending on the PNR strategy, the manager may act immediately on this request or wait for other requests to come in before acting. Most strategies require that all threads within a parallel job contact the same manager before the manager is allowed to act on any one of the threads' requests. For this to work, it is assumed that the manager can discover the total number of threads belonging to the parallel job. As each new request comes in, information on the aggregate memory request is stored in a request table.
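A minimal sketch of this wait-for-all-threads coordination step follows. It is our illustration rather than the thesis simulator's code, and the type names and fields are assumptions.

```cpp
#include <map>
#include <vector>

// Hypothetical request table entry: one per parallel job the manager
// is currently coordinating.
struct PendingJob {
    int               expectedThreads = 0; // discovered from the job itself
    std::vector<int>  clients;             // node ids that have reported so far
    long long         totalBytes = 0;      // aggregate network RAM requested
};

class RequestTable {
public:
    // Record one thread's request. Returns true once every thread of the
    // job has reported, i.e. the manager may now act on the aggregate.
    bool addRequest(int jobId, int clientNode, long long bytes, int threadCount) {
        PendingJob& entry = table_[jobId];
        entry.expectedThreads = threadCount;
        entry.clients.push_back(clientNode);
        entry.totalBytes += bytes;
        return static_cast<int>(entry.clients.size()) == entry.expectedThreads;
    }

    // Remove and return a completed entry once the manager acts on it.
    PendingJob take(int jobId) {
        PendingJob entry = table_[jobId];
        table_.erase(jobId);
        return entry;
    }

private:
    std::map<int, PendingJob> table_;
};
```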
Server Request and Response

When all requests are received from the relevant clients, the manager attempts to select a server from an availability list. Multiple selection schemes are possible. Currently, a randomized worst-fit scheme is used. That is, the server perceived to have the most RAM available for remote allocation (i.e., the server with the most idle RAM) is always selected. If servers tie for the highest amount of RAM available, one server is randomly chosen.

The server contacted may grant, partially grant, or deny the request. It is important to note that a server that was listed as having RAM available for allocation may not have the same amount available by the time the request message reaches it. The server's response is received by the PNR manager. If too little memory was allocated, the PNR manager will attempt to contact another server. The process is repeated until enough network RAM is granted by multiple servers or until no servers are left to query.

Network RAM Request Response

When either enough network RAM has been allocated or all possible servers have been queried, the manager will calculate the total amount of network RAM granted. Based on this, the manager will divide the network RAM evenly among the requesting clients. If a client requests less network RAM than it would get in its fair share, exactly as much as it requests is granted and the remainder is reserved for other clients. If a client requests more network RAM than it would get in its fair share, it gets its share and only receives more if other clients do not need all of their shares. Each client is informed of which servers are hosting the granted network RAM.

The even distribution of network RAM is done to ensure that each process in a parallel job runs at roughly the same speed. If network RAM were granted haphazardly, each process would run at a different relative speed. The job as a whole would run at the speed of the slowest process because of synchronization, and the fast processes would be held back.

Network RAM Deallocation Notification

It is assumed that deallocation notifications are only sent at job termination time. The manager will listen for deallocation notifications from clients. The deallocation amounts indicated in the messages may have nothing to do with how much was previously allocated to those clients: in some cases, no network RAM may be deallocated because it is still needed by the client. In most strategies, the manager will wait for each client associated with a parallel job to send a deallocation notification.

Server Notification and Response

When all deallocation notifications are received by the manager, the manager determines how much network RAM needs to be deallocated and on which servers. Each server is then notified that it may deallocate the specified amount of RAM. The server will respond to this message with an acknowledgment.

Broadcasting Availability Information

During each message-passing interaction, clients and servers piggyback current memory load information onto messages sent to managers. The manager receives this information in addition to the relevant payload. It uses this up-to-date memory load information to update its own network RAM availability table. It uses this table for server selection.
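The sketch below illustrates the two manager decisions described above: randomized worst-fit server selection from the availability table, and the fair-share split of whatever was granted. It is a simplified rendering under our own assumptions (names, types, byte units), not the simulator's implementation.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct ServerInfo {
    int       node;      // server node id
    long long idleBytes; // last reported idle RAM (may already be stale)
};

// Randomized worst-fit: pick the server with the most idle RAM,
// breaking ties uniformly at random. Returns -1 if no servers are known.
int selectServer(const std::vector<ServerInfo>& table, std::mt19937& rng) {
    if (table.empty()) return -1;
    long long best = -1;
    std::vector<int> ties;
    for (const ServerInfo& s : table) {
        if (s.idleBytes > best)       { best = s.idleBytes; ties = {s.node}; }
        else if (s.idleBytes == best) { ties.push_back(s.node); }
    }
    std::uniform_int_distribution<size_t> pick(0, ties.size() - 1);
    return ties[pick(rng)];
}

// Divide the granted network RAM evenly among requesting clients.
// Clients asking for less than their share keep only what they asked for;
// the remainder is redistributed among the still-needy clients.
std::vector<long long> fairShare(std::vector<long long> requests, long long granted) {
    std::vector<long long> given(requests.size(), 0);
    size_t unsatisfied = requests.size();
    while (granted > 0 && unsatisfied > 0) {
        long long share = granted / static_cast<long long>(unsatisfied);
        if (share == 0) break; // too little left to split further
        unsatisfied = 0;
        for (size_t i = 0; i < requests.size(); ++i) {
            if (requests[i] == 0) continue;
            long long take = std::min(requests[i], share);
            given[i]    += take;
            requests[i] -= take;
            granted     -= take;
            if (requests[i] > 0) ++unsatisfied;
        }
    }
    return given;
}
```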
Some PNR strategies may prefer to share this memory load information by broadcasting it to all servents. Other strategies may only broadcast to selected servents. Currently, broadcasts are only executed after network RAM has been allocated or deallocated.

It should be noted that memory load information may already be out of date before it is even broadcast. There is no way around this problem, since it is possible for memory loads to change as messages are being passed on the network. We believe this broadcasting solution provides an economical way to keep availability information reasonably up to date.

2.2.3 Servers

Servers receive requests from managers for network RAM. If the server has more unallocated RAM than a certain threshold, it will grant the network RAM request and allocate memory to the manager up to that threshold. Currently, the unallocated memory threshold is set to a low value: a tenth of a megabyte. This value can be adjusted and may be useful for future work. After the memory is allocated, servers receive requests to read and write the allocated network RAM directly from clients. Servers grant all valid deallocation attempts.

2.3 Designs

We propose four different PNR designs. Each design has a slightly different architecture and a different amount of communication overhead associated with it. This section describes these designs.

2.3.1 Client

In this strategy (CLI), each client uses its servent's manager to send allocation requests (see figure 2.3). The local manager does not wait for messages from other clients; it acts immediately upon the client's request. The manager attempts to allocate as much network RAM as possible for its client. When the client receives network RAM, it begins execution immediately. It does not wait for the other threads in the parallel job. The local manager does not share memory load information with other servents.

Figure 2.3: Client-only design.

This strategy allows clients to allocate network RAM quickly and eliminates all coordination overhead. It is scalable, since each client is responsible only for itself. It is also simple to implement, since there is no need for a manager at all.

However, this solution has major drawbacks. First, memory load information is not shared. Each client must discover memory load information for itself, and this may lead to a large number of ineffective network RAM allocation requests. Second, and more serious, the clients do not coordinate memory allocation with each other. Some clients may receive large amounts of network RAM if they get their messages to the servers first, while other clients get very little network RAM. This will not improve the performance of parallel jobs as a whole, since each job will only execute at the speed of its slowest thread. In fact, this scheme may worsen overall performance, since much of the network RAM allocated is wasted. Thus, no benefit is gained from network RAM, and higher memory loads on the servers may induce disk paging, the very problem we are trying to avoid!

2.3.2 Centralized

In this strategy (CEN), only one manager exists, and it receives all client requests (see figure 2.4). All servents know the identity of this manager. All clients will contact this manager and it will coordinate network RAM allocation. The manager uses memory load information sent in by clients and servers to make allocation decisions. Since only one active manager exists, the centralized manager does not broadcast any memory load information it receives.

Figure 2.4: Centralized design.
This scheme has the advantage of one agent having all of the information and making all of the decisions. No time or network bandwidth is wasted in sending coordination messages to other managers. The disadvantage of this strategy is that it is not scalable. As the system size grows, the network connections leading to the node hosting the manager will become a bottleneck and limit the performance of the system. In the real world, the computational and memory overhead of hosting this central manager would also become major factors. However, in our simulator, these are not taken into account.

2.3.3 Local Managers

In this strategy (MAN), each client contacts a "local" manager (see figure 2.5). Specifically, whenever a job starts or stops, one of the servents running on a node associated with that job will volunteer to act as the manager in addition to acting as a client. Each servent involved must agree on which servent will act as the manager. All clients from the involved servents will contact this manager. The manager will take their requests, allocate network RAM (if possible), and divide the received network RAM evenly among the requesting clients. At the end of each allocation and deallocation, the manager will broadcast memory load information to each servent in the system. This is necessary since each servent on the system can potentially act as a manager.

Figure 2.5: Local manager design.

Since no single node is loaded with all requests, this solution is scalable. It makes good use of allocated resources via the coordination of client requests. Broadcasting the memory load information should keep the tables of the servents relatively up to date.

However, the major drawback of this approach is the broadcasting step. Sending a message to each servent on the system can introduce congestion into the system, especially if no real broadcasting facility exists and broadcasting must be implemented via multiple point-to-point messages. The larger the system, the bigger this issue will become. Also, since there is no central authority on memory allocation information, servents may be more likely to act on outdated information. Like the centralized strategy, waiting time is incurred in the coordination step by waiting for all clients to send in their requests.
The backbone strategy also has the advantage of being customizable to the cluster setup. If the network is small, then a small backbone (perhaps a backbone of one) may be all that is required. As the network gets larger, then the number of servents in the backbone can be increased appropriately to manage scalability. 19 Chapter 3 Methodology 3. 1 Simulator To test our proposed PN R designs, we created a simulator that models a parallel platform, parallel jobs, and our PNR algorithms [25]. This simulator is written in C++ and compiles under UNIX systems. We used a modified version of Sim++ as our main simulation library. Sim++ is the C++ version of Simpack, a simulation library created by Paul Fishwick [6, 17]. Sim++ is open-source, and we modified the implementation to suit our needs and maximize performance. This chapter describes the models we implemented in our simulator and the experiments we ran on the simulator. 3. 1. 1 Workload Model Many workload traces and synthetic workload generators exist for use by simula- tors. However, no standard memory usage benchmarks currently exist [5]. We decided to use a large trace with memory allocation information that has been assembled and discussed by Feitelson [11]. The trace has been gathered from the CM-5 parallel platform at the Los Alamos 20 National Lab. It contains information about 201,387 jobs run through the majority of 1996. We will use a subset of jobs from this well-studied trace. Since the workload profile at a given site tends to be fairly stable over time [22], using a subset of jobs should be indicative of the general load on the system. However, using a workload trace may be biased toward that collected site’s policies and not representative of all workloads [5, 15, 22]. For example, in our trace no process attempts to allocate more memory than is physically available on a node and all jobs allocate PES in powers of two [13]. Obviously, not every cluster installation will have a workload with these characteristics. 3. 1.2 System Models To complement the CM-5 workload trace, we model a system architecture similar to the CM-5. Each node in thesimulated system runs at 33 Mhz and has 32 MB of local memory. The original CM-5 did not support paging [13] and we add this capability to our simulated system. Each node has a disk of infinite capacity that runs at 7200 RPM with a seek time of 9 ms and a transfer rate of 50 MBps. The interconnecting network is assumed to be a simple Ethernet 100 Mbps star topology. Each link has a latency of 50 nanoseconds and the central switch has a processing delay of 80 microseconds. The original CM-5 had 32 processors dedicated solely to system tasks, so it is assumed that the operating system of the CM-5 imposes no CPU or memory load on the nodes. Network Models The simulator provides simulated network links. A link connects either one computer to another computer (or set of computers) or a computer to a switch. Only one message may be transmitted on a link at a time. A link has a fixed latency time and a transmission time based on the bandwidth and the size of the message 21 transmitted. In our experiments, we set the latency of each link to be that of the speed of light through 10 meters of copper wire (50 nanoseconds). No collisions occur on the simulated links. Incoming messages are queued if the link is currently in use. A simulated network switch stores communications and forwards them to the appropriate link. Switches may only forward one message at a time and all other messages are queued. 
Switches are assumed to have a single infinite length queue sorted in FIFO order. Switches have a fixed length processing delay that can be set as a parameter. The base value of this delay is set to 80 microseconds. Point—to-point messages traverse links and switches only. Routes including nodes as intermediate points are not used. Some PNR methods use broadcasts to send information to servents. In our simulator, broadcasts are simulated using N point-to- point messages from the broadcasting node to each other node. 3.1.3 Job Behavior Models Each job is composed of multiple processes. It is assumed that each process allo— cates a static amount of memory at start time. Each process accesses memory at its node independently. Previous studies have shown that parallel scientific applications generate memory references every three to five CPU cycles and have a cache hit ratio that ranges from approximately 50% to 65% [8, 31]. We assume that our processes access memory every four CPU cycles and have a cache hit ratio of 50%. Synchronization Model All processes in a job synchronize with each other at regular intervals. The synchronization pattern is a simple master/worker pattern. One process is chosen as the ”master” process and all other threads in the job are ”workers”. After a certain amount of CPU time, each worker sends a message to the master. After the master receives all messages, the workers are allowed to proceed with execution. In 22 our experiments, we set the time interval between synchronizations to be one CPU second (once every 32 million cycles). This is not a heavy synchronization load. Future experiments should use more complex and heavier loads. Page Fault Detection Algorithm Originally, we used a memory access model similar to that presented in [3] to produce a low-locality memory access pattern. However, its usage became a bottle- neck in our simulator. We replaced its use with a new model based on an exponential function. Each thread has an average memory access rate. The page fault detection algo- rithm determines how much time will elapse between the current time and the time when the next page fault will occur. For example, if no pages of the currently running thread are loaded into memory, then the first memory access will be a page fault and the time until the next page fault will be the same as the average memory access rate. If all pages of the thread are loaded into memory, then the amount of time until the next page fault will be infinite. If some, but not all, pages are loaded into memory, there should be some time between page faults that is greater than the time between memory accesses, since some references will be to pages already loaded. In fact, this should happen often due to locality of memory references. We construct a model to mimic this behavior. To start, we define the time until the next page fault to be T. We then define the average memory access rate to be R. If we assume no pages are ever loaded into memory, then the next page fault will occur in exactly this amount of time. T = R The operating system will load a page into memory when a page fault occurs. It is very likely that subsequent memory accesses will reference the same page because memory accesses, in general, follow locality. If the program accessed memory linearly, 23 then the memory references will traverse the entire loaded page before encountering a new, unloaded page. We assume this is the case and multiply the memory access rate by the page size P. 
Network RAM Access Model

The paging model is used as the foundation of the network RAM access model. When a page fault occurs, the simulator checks what percentage of the virtual memory of the node is stored as network RAM and what percentage is stored on disk. A random number is chosen to decide whether network RAM or disk has just been referenced. If the disk is referenced, then the activity is no different than normal paging. If network RAM is used, a round-trip message occurs between the host and the server node storing its pages. It is assumed that the page is read from remote RAM instantaneously and the only time penalty is that caused by communication overhead.
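The sketch below illustrates that choice: a minimal rendering under assumed names, with the disk and round-trip costs left as parameters.

```cpp
#include <random>

// Hypothetical per-fault service time: the fault hits network RAM with
// probability equal to the fraction of paged-out memory held remotely.
double pageFaultServiceTime(double netRamBytes, double diskBytes,
                            double diskAccessTime, double roundTripTime,
                            std::mt19937& rng) {
    double total = netRamBytes + diskBytes;
    if (total <= 0.0) return 0.0; // nothing is paged out
    std::bernoulli_distribution hitsNetRam(netRamBytes / total);
    // Remote reads are treated as instantaneous; only communication costs time.
    return hitsNetRam(rng) ? roundTripTime : diskAccessTime;
}
```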
3.1.4 Scheduler Models

It is assumed there is one centralized scheduler for the system. In our simulations we experiment with two schedulers: a space sharing scheduler and a gang scheduler. For both schedulers, we use FCFS as our queuing discipline, since it has been shown to be a simple yet efficient ordering which guarantees fairness [24]. Most simple node packing schemes lead to identical performance, so we use best-fit packing to follow the example of [10]. It is assumed that the schedulers have no knowledge of the memory requirements of jobs. Neither scheduler takes these requirements into account when scheduling decisions are made.

Each time slice in the gang scheduler runs for a 60 second quantum, as suggested by [29]. The time required to perform a PCS is fixed at 4 ms [4]. The maximum number of time slices (MPL) is set to two. This number is conservative and limits paging activity [24]. Alternative scheduling and slot unification are provided.

3.1.5 Metrics

There are no universally valid and accepted metrics. In fact, different metrics can give contradictory results [12, 15, 21]. However, most installations use a few "de facto" standard metrics such as response time and utilization [5]. In this section we list and justify the metrics we use in our experiments.

The first metric is average response time, or total wallclock time from submission to finish. This metric is used very often and directly reflects the goal we are attempting to achieve: improved parallel process performance. One disadvantage of this metric is that it can overemphasize large jobs. In parallel workloads, small jobs account for the majority of jobs [10, 12, 15].

To directly compare DP to the various PNR designs, we create another metric, the "optimization ratio". This metric is based on average response time and represents the improvement of a PNR design over DP:

OptimizationRatio(PNRDesign) = (ResponseTime(DP) − ResponseTime(PNRDesign)) / ResponseTime(DP) × 100%
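As a quick worked example (the numbers are ours, chosen near the 4,000-job results reported in chapter 4): if DP has an average response time of 20,000 seconds and a PNR design averages 17,600 seconds, then

OptimizationRatio = (20,000 − 17,600) / 20,000 × 100% = 12%

A positive ratio means the PNR design beats DP; a negative ratio means it underperforms DP.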
We also calculate the average and standard deviation of node memory allocation and disk allocation by sampling the memory allocation information of each node every 50,000 simulated seconds.

Many other possibly useful metrics, such as utilization, throughput, slowdown, and bounded slowdown, are not reported here. We have briefly examined these metrics, and trends in response times appear to match well with trends in them. However, additional comparisons with these metrics may yield interesting data and could be the focus of future work.

3.2 Experimental Setup

To determine the effectiveness of PNR in comparison to DP, we define several experiments. The basic set of experiments is defined in table 3.1. These experiments reflect our main points of interest:

• Performance under varying memory loads.
• Performance under varying network speeds.
• Performance under different network topologies.
• Performance under different scheduling strategies.

The basic experiment table becomes multidimensional as we apply it to seven different paging methods, workloads of 4,000 and 5,000 jobs, and to 64 and 128 node systems. The total number of experiments performed for this thesis based on table 3.1 is 560. Some additional experiments are performed.

RAM     Network   Topology    Space Sharing
150%    10        Bus         50%
135%    100       Star        75%
125%    1,000     Connected   100%
115%    10,000
100%    100,000
85%
75%
65%
50%

Table 3.1: Table of Experiments.

3.2.1 Experiment Table Detail

RAM

To test each scheme under varying memory loads, we vary the amount of RAM available at each node while holding the memory demands of jobs constant. The parameter given signifies the relative amount of RAM present at each node in the experiment. For instance, for the 150% experiment, we adjust the default 32 MB of RAM to 48 MB (or 150% of the original value).

Network

To test each scheme under varying network performance, we alter link bandwidth and switch processing delay. The base value of the network performance is 100 Mbps with an 80 microsecond switch processing delay. The parameter value 1,000 signifies performance 10 times better than the base value, with link bandwidth at 1,000 Mbps and switch processing delay at 8 microseconds. The 10,000 and 100,000 values follow the same pattern. Parameter value 10 has a link bandwidth of 10 Mbps but does not alter the switch processing delay.

Topology

To test each scheme under different topologies, we define three topologies:

• Star - N links and 1 switch
• Bus - 1 link
• Connected - N(N − 1) links - a fully connected system with links in both directions.

The star is the base topology. Note that N is the total number of nodes in the system.

Scheduler

To test each scheme under different scheduling strategies, we define both a gang scheduler and a space sharing scheduler. The gang scheduler is used as the base scheduler. The space sharing scheduler is simply the gang scheduler with an MPL of one. When we use the space sharing scheduler, we run the scheduler under varying RAM workloads. Following the same pattern as the gang scheduling experiments, we vary the RAM ratio from 50% to 100%. We do not simulate anything beyond 100% RAM because of the CM-5 scheduling policy, which prevented any job larger than the available RAM from running.

3.2.2 Dimensions of the Experiment Table

Each experiment defined in table 3.1 is run on 4,000 and 5,000 job workloads. We discovered in early experiments that jobs start to become backlogged under the gang scheduler with the 5,000 job workload. By comparing the 4,000 and 5,000 job workloads we can understand how each solution performs under different types of system loads.

Each experiment is run on a 64 node cluster and a 128 node cluster. Running experiments on different-sized systems will indicate how well the various designs scale. Note that each workload was originally collected on a 1,024 node system; when the system is scaled down, the workload is scaled with it. For example, a job that asked for 512 processors under the original workload will ask for 32 processors in the 64 node system, and 64 processors in the 128 node system. Congestion on the larger networks will increase, since more communication is required for synchronization and broadcasts.

We run seven different paging methods against each experiment. The base is disk paging (DP). We test six PNR methods: the centralized method (CEN), the client-only method (CLI), the local managers method (MAN), and three variants of the backbone method (BB). The three backbone methods define the number of PNR managers to be 1/4, 1/8, and 1/16 of the total number of nodes in the system. For instance, on the 64 node system, the number of managers is defined as 16 (BB16), 8 (BB8), and 4 (BB4), respectively.

Chapter 4

Performance Evaluation

4.1 64 Node Cluster

This section describes the results of the experiments introduced in table 3.1 run on a simulated 64 node cluster. Note that we define the "base" set of experiments as experiments where all of the default parameters described in chapter 3 are set.

4.1.1 Base Experiments

In figure 4.1 we see the results of the base set of experiments for five different workloads. Response time for each design follows an exponential curve as workloads get larger. For the 5,000 job workload, jobs are starting to become backlogged. This increases average response time for all paging designs. This trend continues for the 6,000 job workload (not shown).
We focus on the 4,000 and 5,000 job workloads to observe the difference between backlogged and non-backlogged systems. Figure 4.2 shows the 4,000 job workload, where all PNR designs are close in performance. Each design has an optimization ratio of 10-12% over disk paging (figure 4.3). Interestingly, CLI has the best optimization ratio at 12.08%. This goes against the expectation that CLI will always lead to poor performance due to lack of servent coordination.

Figure 4.1: Base experiment - 64 PE - several workloads - response time.
Figure 4.2: Base experiment - 64 PE - 4,000 jobs - response time.
Figure 4.3: Base experiment - 64 PE - 4,000 jobs - optimization ratio.
Figure 4.4: Base experiment - 64 PE - 5,000 jobs - response time.
Figure 4.5: Base experiment - 64 PE - 5,000 jobs - optimization ratio.

The average memory usage for DP was 6.88 MB, with a standard deviation of 4.96 MB. All PNR designs had a standard deviation around 4.4 MB. CLI did not lead to significantly uneven memory allocation (as was expected).

Figure 4.4 demonstrates that, for the 5,000 job workload, the PNR designs are more differentiated. CLI was the only design with an average response time higher than DP's. Figure 4.5 shows that BB16 is the best design, at a 33.38% speedup over DP. This figure shows us that BB16 is at the top of a curve, where performance of the coordinating designs is worst at CEN, gets better as the solution becomes more distributed (BB4 and BB8), and degrades past a certain point of distribution (MAN). These results are more in line with our initial hypothesis that a hybrid of centralized and decentralized approaches would have the best performance.

All non-CLI PNR methods experienced approximately 70% or less of the number of page faults DP experienced. This can be attributed to the fact that, since PNR jobs finish faster, they experience fewer PCSs. This, in turn, leads to less page loading due to memory contention by multiple processes.

We observe some interesting statistics for memory allocation for each design. DP had the highest average memory allocation at 17.87 MB per node, with a standard deviation of 5.17 MB. All PNR designs had lower allocation averages, ranging from 14.45 MB to 16.37 MB. BB16 had the lowest average at 14.45 MB, but it also had the highest standard deviation of all designs at 5.76 MB! In fact, each PNR design except for MAN had a higher standard deviation than DP, ranging from 5.18 to 5.76 MB (with MAN at 4.98 MB). This result comes as a surprise, since PNR should be smoothing out memory usage, not creating more hotspots!
Further, it may be possible that BB16 had the lowest response time only by chance, since the hotspots it created were not in heavy use by the system.

4.1.2 RAM Variations

For both the 4,000 and 5,000 job workloads, we see response times for all designs converge as RAM is increased (figures 4.6 and 4.9). In the 4,000 job workload, response time converges earlier. In the 5,000 job workload, the designs converge later (due to backlog), at RAM ratio 135%. Data is omitted from the figures when the optimization ratio is 0%. As RAM is decreased, all designs increase response time exponentially. Presumably, response times will also converge when so little RAM is available that allocating network RAM is impossible.

At the 100% RAM ratio and above in the 4,000 job workload, all PNR methods use 0 MB of disk space on average. The standard deviation for each is 0 MB. This accounts for the similarity of results among the PNR methods.

For the 4,000 job workload, we see a very large initial boost in performance as RAM is decreased. The optimization ratio is among the highest observed for all experiments - in the 75% to 100% range (figure 4.7). This advantage quickly evaporates at 65%, where most designs actually underperform DP. At 50%, most of the PNR designs recover, but with a lower performance benefit (between 25% and 50%). CLI is the clear loser in each of the experiments where the RAM ratio is less than 100%. MAN consistently outperforms or performs as well as the other methods in each scenario. The other methods are more difficult to judge, since their performance fluctuates.

In the 5,000 job workload, one immediately obvious feature is the big performance boost PNR receives as RAM is initially increased (figure 4.9). The 4,000 job workload experienced no such benefit. The extra RAM must help PNR efficiently clear out the backlog of jobs. As RAM is increased past a certain point (around 135%), PNR and DP converge.

Figure 4.6: RAM experiments - 64 PE - 4,000 jobs - response time.

Figure 4.7: RAM experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.8: RAM experiments - 64 PE - 5,000 jobs - response time.

Figure 4.9: RAM experiments - 64 PE - 5,000 jobs - optimization ratio.

Interestingly, despite these favorable response times at RAM ratios 115% and 125%, the average allocated memory and standard deviation metrics are not unusually favorable. In fact, some PNR methods have an average and standard deviation higher than DP's. The differentiating characteristic between DP and PNR must be the average disk usage. While DP uses about 1.25 MB on average, each PNR method uses about 0.5 MB or less at RAM ratio 115%.
Similarly, at the 125% RAM ratio, DP uses 0.32 MB whereas most PNR methods use less than 0.1 MB.

When RAM becomes more scarce, MAN offers consistent performance benefits, but this time it is not always the best design. At 85% and 75% we observe smaller-scale versions of the performance boosts observed in the 4,000 job workload. At 65% there is another performance hit - but, again, at a smaller scale. At 50%, modest improvements are back.

4.1.3 Network Performance Variations

For all cluster setups and workloads, the performance of PNR at 10 Mbps of bandwidth was so poor that it was not included in the figures for these experiments. To give an idea of the performance, the total simulated time elapsed for the 5,000 job workload rose an order of magnitude, from 10^6 to 10^7 seconds. This result shows that PNR is extremely sensitive to low-performance networks.

Figures 4.10 and 4.11 show the 4,000 job workload, where the methods with the most coordination overhead suffer at bandwidth 100 Mbps. However, as bandwidth is increased beyond 100 Mbps, all PNR methods become equivalent in terms of response times and optimization ratios. No PNR method in these experiments uses disk paging at any bandwidth. The limiting factor here is coordination overhead. It makes sense that as communication becomes less costly, coordination overhead becomes such a small factor that the various methods converge to the same response time.

Figure 4.10: Network experiments - 64 PE - 4,000 jobs - response time.

Figure 4.11: Network experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.12: Network experiments - 64 PE - 5,000 jobs - response time.

Figure 4.13: Network experiments - 64 PE - 5,000 jobs - optimization ratio.

Figures 4.12 and 4.13 show that, for the 5,000 job workload, the general curve observed in the base test is preserved until bandwidth 100,000 Mbps, where the results are noisy. Increases in network performance benefit CEN the most but also tend to increase the performance of each method. CLI performs worse than every other PNR method in all cases. However, for bandwidth greater than 100 Mbps, CLI still manages to beat DP.
In fact, CLI even has lower average memory usage (5.46 MB) and standard deviation (2.02 MB) compared to DP (6.09 and 2.71 MB).

For this experiment, synchronization overhead must not be the only limiting factor. The backlog of jobs must be influencing the experiment and producing noisy results. The phenomenon may be explained as follows: if processes are becoming backlogged in the 5,000 job workload, and higher network performance results in higher job turnover, then longer-running jobs will tend to run together more frequently than under lower-performing networks. Longer-running jobs tend to demand more memory than shorter jobs, so, given this situation, memory load and response times will rise.

4.1.4 Network Topology Variations

For the 4,000 job workload (figures 4.14 and 4.15), PNR designs are superior to DP on all topologies. For the bus topology, it appears that less coordination overhead results in higher performance. This is not too surprising, since the potential for congestion on a bus network is very high. For the connected network, we observe results similar to those on the high bandwidth network. Each PNR design has roughly equivalent performance.

Since the connected network does not increase bandwidth, and since the optimization ratio is similar to that found on the high-performance networks, we can conclude that the significant factor in this experiment and in the network experiments is congestion. The problem with low-performing networks was not simply the message service time, but the queuing delay introduced by the slower-moving messages.

Figure 4.14: Topology experiments - 64 PE - 4,000 jobs - response time.

Figure 4.15: Topology experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.16: Topology experiments - 64 PE - 5,000 jobs - response time.

For the 5,000 job workload (figure 4.16) we have noisier results. In the bus topology, the general trend still appears to be the less synchronization, the better. For the connected topology, all designs (including DP) get a performance boost except MAN. BB16 is the superior method at a 60% optimization ratio. Notably, CLI still underperforms DP. It is difficult to say there is a definite pattern in these results.

4.1.5 Space Sharing Scheduler Experiments

For all workloads, recall that no individual job is larger than the CM-5's available RAM. So, with a space sharing scheduler, no two jobs will be loaded on one node and, at RAM ratio 100%, no node should ever be overloaded. All designs should experience nearly identical performance. This result is confirmed in all RAM ratio 100% space sharing experiments (with trivial differences resulting from PNR coordination methods).

Figure 4.18 shows that, for the 4,000 job workload, each method has approximately the same performance at the 75% RAM ratio, with CLI having a slight edge. At 50%, CLI loses its edge while all the rest of the methods gain a performance boost.
Figure 4.21 shows the same results for the 5,000 job workload. Each method has approximately the same performance at 75%. CLI again loses its performance benefits at 50%, while the rest of the PNR methods approach a respectable 45% optimization ratio.

Figure 4.17: Topology experiments - 64 PE - 5,000 jobs - optimization ratio.

Figure 4.18: Space sharing experiments - 64 PE - 4,000 jobs - response time.

Figure 4.19: Space sharing experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.20: Space sharing experiments - 64 PE - 5,000 jobs - response time.

Figure 4.21: Space sharing experiments - 64 PE - 5,000 jobs - optimization ratio.

4.2 128 Node Cluster

This section describes the results of experiments run on a simulated 128 node cluster. These experiments are primarily done to test scalability as compared to the 64 node cluster.

Figure 4.22: Base experiment - 128 PE - several workloads - response time.

Figure 4.23: Base experiment - 128 PE - 4,000 jobs - response time.

Figure 4.24: Base experiment - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.25: Base experiment - 128 PE - 5,000 jobs - response time.

4.2.1 Base Experiments

When we run the base set of experiments over various workloads (shown in figure 4.22), we see the same exponential curve in response time as in the 64 node system.

In the 4,000 job workload, the CLI optimization ratio beats the other PNR designs by a significant margin (figure 4.24). While every other PNR design is between 5% and 7.5% optimization ratio, CLI approaches 15%. CLI also had the lowest average memory allocated and the lowest standard deviation, at 5.52 MB and 4.35 MB, respectively. DP had an average of 7.54 MB and a standard deviation of 5.23 MB; MAN, for instance, had an average of 6.28 MB and a standard deviation of 4.65 MB.

In the 5,000 job workload there appear to be scalability problems. CLI and BB8 are the only methods that experience significant performance enhancement (about 5%). The others (CEN, BB16, MAN) experience performance degradation.
This result is not unexpected, since the underlying network topology is a star. The bottleneck resource, the central switch, becomes more congested as the network becomes larger and as more messages are sent. As observed earlier, PNR is very sensitive to poor network performance, so it makes sense that PNR performs poorly here. CLI has a natural advantage in larger networks because individual peers do not synchronize with each other. With every other design, coordination overhead must be incurred.

Interestingly, each PNR design had a lower standard deviation in memory usage than DP. Each PNR design also used less disk space than DP. However, some PNR designs had a significantly higher average memory allocation. All PNR designs still experienced fewer page faults. The number of page faults ranged from 50% to 80% that of DP.

Figure 4.26: Base experiment - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.27: RAM experiments - 128 PE - 4,000 jobs - response time.

Figure 4.28: RAM experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.29: RAM experiments - 128 PE - 5,000 jobs - response time.

Figure 4.30: RAM experiments - 128 PE - 5,000 jobs - optimization ratio.

4.2.2 RAM Variations

Note that data is missing at the 65% and 50% RAM ratios for both the 4,000 and 5,000 job workloads due to simulation time constraints.

For the 4,000 job workload (shown in figures 4.27 and 4.28), PNR converges as expected as RAM is increased. As in the 64 node cluster, large performance increases are observed for the 85% and 75% RAM ratios (approaching 100% and 50% improvement, respectively). The improvement at the 75% RAM ratio is not as pronounced as in the 64 node system. Interestingly, the same backbone variant experiences a performance hit in both the 64 node and 128 node systems. The performance penalty is more pronounced in the 128 node system.

For the 5,000 job workload (figures 4.29 and 4.30), we observe the PNR optimization ratios converging as expected. However, at the 125% RAM ratio, there is a large performance penalty felt by all of the PNR designs (with CLI affected the least). All PNR designs except CLI experience at least a -100% optimization ratio. All designs improved average response time from 115% to 125%, but DP improved more than each PNR method. It is not known whether this is a statistical fluke or something more interesting. More tests need to be performed.

As RAM becomes more scarce, the optimization ratios of the PNR methods do not reach as high as in the 64 node network (not much above 50%).
As in the 64 node network, the best performance gains are observed at the 85% and 75% RAM ratios.

4.2.3 Network Performance Variations

For the 4,000 job workload, PNR performance improves as soon as network performance is improved. For network bandwidths of 1,000 Mbps and higher, the average response time of each PNR design is equivalent (figure 4.31). Each PNR design is superior to using DP. The optimization ratio settles at about 12.5% (figure 4.32).

Figure 4.31: Network experiments - 128 PE - 4,000 jobs - response time.

Figure 4.32: Network experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.33: Network experiments - 128 PE - 5,000 jobs - response time.

For the 5,000 job workload, each method improves dramatically with higher bandwidth. Similar to the 64 node system, each PNR design outperforms DP at network bandwidths of 1,000 Mbps and higher. BB8 and BB16, in particular, have optimization ratios in the 50-75% range at bandwidths 1,000 Mbps and 10,000 Mbps (figure 4.34). However, they lose their lead at 100,000 Mbps, while CEN, BB32 and MAN all gain. It is difficult to say what the general trend here is, other than that better network performance tends to elevate the performance of each PNR design.

4.2.4 Network Topology Variations

For the 4,000 job workload, we do not show the results for the bus topology in figures 4.35 or 4.36. This is because the bus response times were orders of magnitude higher than those of the star and connected topologies. For the CEN method, for instance, the optimization ratio was a staggering -2,500%! Obviously the bus network does not scale well, and PNR is hit hard by this. With the connected topology, each method is close to equivalent.

Figures 4.37 and 4.38 show the 5,000 job workload results. Just as in the 4,000 job workload, the bus network causes all PNR designs to behave poorly. CLI is least affected, and CEN experiences a huge performance hit - almost reaching -300% (figure 4.38)! Interestingly, this penalty is less severe than the corresponding one in the 4,000 job workload. In the 4,000 job workload, the PNR methods all use less disk space than they do in the 5,000 job workload. We speculate that the bus network actually drives average network service time higher than disk service time, and that the PNR methods in the 4,000 job workload use network RAM more than in the 5,000 job workload.

The connected network gives results similar to the high bandwidth experiments. Here, all PNR designs have better performance than DP. BB8 and BB16 are the clear winners, with optimization ratios greater than 50%.

Figure 4.34: Network experiments - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.35: Topology experiments - 128 PE - 4,000 jobs - response time.
Figure 4.36: Topology experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.37: Topology experiments - 128 PE - 5,000 jobs - response time.

Figure 4.38: Topology experiments - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.39: Space sharing experiments - 128 PE - 4,000 jobs - response time.

Figure 4.40: Space sharing experiments - 128 PE - 4,000 jobs - optimization ratio.

4.2.5 Space Sharing Scheduler Experiments

The results for both workloads echo those of the 64 node network and are shown in figures 4.39, 4.40, 4.41, and 4.42. In both sets of results, the designs are equivalent at RAM ratio 100%, CLI has an advantage at RAM ratio 75%, and CLI loses its advantage at RAM ratio 50%. At RAM ratios 75% and 50%, all non-CLI PNR designs experience almost identical performance.

Figure 4.41: Space sharing experiments - 128 PE - 5,000 jobs - response time.

Figure 4.42: Space sharing experiments - 128 PE - 5,000 jobs - optimization ratio.

Chapter 5

Analysis

5.1 Discussion

We have described our results in chapter 4. We now interpret these results to discover the general models of performance PNR follows. We also identify platforms and configurations on which PNR would be useful.

5.1.1 Memory Load

Given our observations, it is clear that both PNR and DP follow an exponential curve as memory load is changed. As memory load increases, PNR and DP both tend toward infinite response times. As memory load decreases, PNR and DP converge on a constant. PNR does not offer a fundamentally different performance curve, but PNR's curve is lower than DP's by a constant factor.

This model implies that adding PNR to lightly loaded systems (with high performance networks) should not harm performance, and that under moderate memory loads PNR can lead to large improvements over disk paging. As observed, this improvement can approach 100%. However, as memory becomes too scarce to share, PNR performance will converge to DP performance.

One surprising result of our simulations is that the proposed PNR designs do not necessarily smooth out memory usage as measured by the standard deviation of system memory usage. For some experiments, PNR memory usage was even more non-uniform than DP's memory usage. This indicates that more work must be done to ensure PNR itself does not create more overloaded nodes.
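Stated compactly (in our own notation; the thesis does not give a closed form), the model above says that for memory load m the observed behavior is consistent with

    T_{\mathrm{DP}}(m) \approx a\, e^{b m}, \qquad
    T_{\mathrm{PNR}}(m) \approx c \cdot T_{\mathrm{DP}}(m), \qquad 0 < c < 1,

where both curves diverge as m grows, flatten toward a constant as m shrinks, and c drifts toward 1 once memory becomes too scarce for any node to lend RAM - the point at which the two methods converge.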
Also, some of the exceptions to our proposed model are troubling. Specifically, the 125% RAM ratio experiment on the 128 node cluster unexpectedly had most PNR methods at an optimization ratio of -100% or worse. Other experiments had occasional incidents where all PNR methods do very well except for one (e.g., the 85% RAM ratio on the 128 node system with the 4,000 job workload). To determine whether this is a problem with PNR or just an artifact, we need to re-run our experiments with different random number seeds, different portions of our collected workload, and different workloads.

5.1.2 Network

PNR is very sensitive to network performance. The general shape of the results is another exponential curve, where PNR response time tends toward infinity as network service time is increased and converges to some constant as service time decreases. Since low network performance results in low PNR performance, PNR should not be considered on systems with low bandwidth or high message round-trip times. A standard 100 Mbps network should be considered a minimum for satisfactory performance.

Topology is also an important consideration. Our experiments show that congestion is a major source of waiting time. In order for PNR to scale, it will have to run on a network that scales well as nodes are added. Bottlenecks, such as the central switch in the star topology, will quickly degrade PNR performance as nodes are added. Switching to a network topology that has no bottlenecks has the same effect as upgrading a topology with bottlenecks to a very high speed network. To get maximum performance, PNR should be used on a fast network with few (or no) bottlenecks.

When network performance is high and RAM is relatively plentiful, we observe that each PNR design has equivalent performance. This is because coordination of memory resources is not a crucial issue when RAM is plentiful, and coordination overhead is less important with a fast network. These are the two main differentiating factors among the PNR designs, and when they are eliminated, all designs perform similarly.

5.1.3 Scheduling

On a space sharing system, as RAM becomes more scarce, each PNR method except CLI appears to perform equivalently. The main limiting factor on the space sharing system is the coordination of network RAM allocation. Network speed is less important because PCSs are eliminated on this system, which eliminates a large portion of network activity (i.e., the large number of page faults after a PCS). This reduces overall congestion. All of the schemes that use coordination appear to work equally well in each space sharing system tested. Depending on the workload and cluster setup, the best performance gains can be anywhere between 15% and 60%. If only a light load is expected, CLI is the best choice for a space sharing system.

On gang scheduling systems, however, network performance is crucial to PNR performance. PCSs introduce large, regular bursts of network activity, and PNR designs that use the network to coordinate allocation suffer. When a high-performance network is available, however, PNR can produce pronounced performance gains. Under the 4,000 job workloads, the maximum speedup using PNR was typically in the 10-12.5% range, although when RAM became scarce, the optimization ratio could reach as high as 100%. For the 5,000 job workload, the maximum optimization ratio observed was 75%, but large ratios were more common than in the 4,000 job workload.
It is interesting to note that the absolute response times of DP and PNR under the space sharing scheduler are better than under the gang scheduler for the 5,000 job workload. Backlogged jobs introduce a huge paging penalty for both PNR and DP. However, for the 4,000 job workload, gang scheduling results in better response times, since no backlogging occurs and the benefits of time sharing outweigh the paging penalties.

For heavily loaded systems, PNR can significantly reduce the response time of jobs as compared to DP. However, it may make sense to attempt to avoid paging entirely by using a space sharing system instead. Using a small MPL in conjunction with PNR may produce results better than either alone. For moderately to lightly loaded systems, the advantages of time sharing outweigh the disadvantages of increased paging, and PNR is recommended to reduce average response times.

5.1.4 PNR Designs

CLI does surprisingly well in certain situations. If RAM is plentiful, CLI benefits since it requires very little overhead to allocate network RAM. When the network is slow, CLI performs better than its counterparts because it contributes relatively little to network congestion. If it is known that a system has a slow network and will be lightly loaded, CLI may be the best method to implement due to its simplicity and low overhead costs.

However, CLI performs poorly in heavy-load situations where RAM is scarce. In these situations, it becomes necessary to coordinate network RAM allocation to avoid wasting it. That leaves us with CEN, BBX, and MAN. When the network is fast, all three can potentially be used effectively. When the network is slow, only BBX and MAN should be considered, since CEN will introduce significant congestion on the network resources that service the central server.

It is difficult to choose between the BBX and MAN methods. There appears to be a tendency for BB16 to be the superior method in the 64 node network (1/4 of the nodes) and for BB8/BB16 to be superior in the 128 node network (1/16 and 1/8 of the nodes). However, no definite conclusion can be reached from the results gathered. We will tentatively claim that BBX has an advantage over MAN in slow networks because fewer messages must be broadcast for synchronization purposes.

If the network is fast and RAM is plentiful, then the PNR designs are equivalent and any one could be used. It may make sense to use CLI in this situation, since it eliminates the need for a manager (proxy) entirely. A summary of these recommendations is in table 5.1.

                Low Load    High Load
Fast Network    ANY         CEN, BBX, MAN
Slow Network    CLI         BBX

Table 5.1: Recommended PNR Design Usage.

5.2 Conclusion

In this thesis we identified a novel way of reducing page fault service time and better utilizing memory resources in a cluster system running parallel processes. This method, which we call Parallel Network RAM, uses remote idle RAM as another tier in the memory hierarchy for parallel jobs in clusters. We proposed several different PNR designs and evaluated the performance of each under different conditions.

We discovered that different designs are appropriate in different situations and that simply applying PNR to an existing system is no guarantee of increased performance. Applied correctly to lightly loaded systems, performance gains between 10% and 25% can be achieved. Applied correctly to heavily loaded systems, performance gains can be as high as 100%. Applied incorrectly, PNR can result in extreme performance losses.
The largest loss observed in our simulations was approximately -2,500%.

Parallel Network RAM is probably best applied in conjunction with other methods to reduce page load. For instance, a combination of job migration, Parallel Network RAM, heuristics used to guess memory usage at the scheduler, and other scheduler techniques would likely result in better performance than PNR alone.

5.3 Future Work

This thesis only took a survey of PNR performance over a variety of cluster configurations and did not focus on any particular experiment. Our experiments' statistical significance should be confirmed by re-running them with different random number seeds and with different subsets of the workload trace.

A surprising result of our work was that no single PNR design worked well under all conditions. This indicates that, to create a single, unified PNR design that works under a variety of cluster conditions, a design that can intelligently change its behavior based on current cluster conditions should be developed. Such an algorithm could simply use the designs proposed in this thesis, but switch among them based on the current state of the cluster (see the sketch at the end of this section). This would make the implementation of PNR much easier, since only one system would need to be developed for all clusters.

This thesis only used a specific subset of a single trace from a particular parallel system. To our knowledge, traces such as the one we used that include memory information are very rare. It would be invaluable to collect a new trace on a modern parallel system running cutting-edge parallel programs. Not only would this allow us to compare how PNR acts on a modern system, it would also allow us to calibrate our simulator and ensure that the models used (e.g., the page fault models) are realistic.

The simulator only explores jobs that use a static amount of memory determined at runtime. It would be useful to know how PNR performs given a more realistic dynamic memory workload. The simulator also simulates only a very simplistic job synchronization model. A more intensive and complex model would give us more realistic results.

Our approach only allocates and deallocates network RAM at job start and stop time. It would be interesting to know whether allowing the PNR system to allocate and deallocate memory at additional times would lead to increased performance. A more periodic approach may better accommodate memory usage.

This approach does not consider job migration. A system which is allowed to choose between job migration and PNR to balance load would be very interesting and may potentially combine the advantages of both methods.

A comparison of PNR methods against methods which attempt to avoid paging by guessing memory usage would also be interesting. One method may be superior to the other or, more likely, the combination of the two may yield increased performance.
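As a concrete illustration of the adaptive idea above, a selector could map observed cluster state onto the recommendations of table 5.1. This is a sketch only; the thresholds are placeholders for illustration, not values derived from our experiments:

    def choose_pnr_design(memory_load, bandwidth_mbps,
                          heavy_load=0.85, fast_network_mbps=1000):
        """Pick candidate PNR designs per the recommendations of table 5.1.

        memory_load    -- fraction of total cluster RAM currently requested
        bandwidth_mbps -- link bandwidth of the cluster network

        Both thresholds are illustrative placeholders; a real implementation
        would calibrate them against the target cluster.
        """
        fast = bandwidth_mbps >= fast_network_mbps
        if memory_load >= heavy_load:
            # Coordination pays off under memory pressure; avoid CEN's
            # central bottleneck when the network is slow.
            return ("CEN", "BBX", "MAN") if fast else ("BBX",)
        # Light load: CLI's low overhead wins on a slow network; on a fast
        # network any design performs about equally well.
        return ("ANY",) if fast else ("CLI",)

    print(choose_pnr_design(memory_load=0.95, bandwidth_mbps=100))  # ('BBX',)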