This is to certify that the thesis entitled

Parallel Network RAM: Effectively Utilizing Global Cluster Memory for Large Data-Intensive Programs

presented by Jonathan James Oleszkiewicz has been accepted towards fulfillment of the requirements for the degree of M.S. in Computer Science and Engineering.

Major Professor's Signature

Date

PARALLEL NETWORK RAM: EFFECTIVELY UTILIZING GLOBAL CLUSTER MEMORY FOR LARGE DATA-INTENSIVE PROGRAMS

By

Jonathan James Oleszkiewicz

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Department of Computer Science

2004

ABSTRACT

PARALLEL NETWORK RAM: EFFECTIVELY UTILIZING GLOBAL CLUSTER MEMORY FOR LARGE DATA-INTENSIVE PROGRAMS

By Jonathan James Oleszkiewicz

Large scientific parallel applications demand large amounts of memory space. Current parallel computing platforms schedule jobs without fully knowing their memory requirements. This leads to uneven memory allocation in which some nodes are overloaded. This, in turn, leads to disk paging, which is extremely expensive in the context of scientific parallel computing. To solve this problem, we propose a new peer-to-peer solution called "Parallel Network RAM". This approach avoids the use of disk, better utilizes available RAM resources, and allows larger problems to be solved while reducing the computational, communication and synchronization overhead typically involved in parallel applications.

To Patty, my constant source of inspiration and the love of my life.

Acknowledgments

This research was made possible by grants from the Michigan Space Grant Consortium and the National Science Foundation.

Table of Contents

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Problem Statement
  1.2 Terminology
  1.3 Background and Related Work
    1.3.1 Parallel Scheduling Algorithms
    1.3.2 Previous Solutions to the Memory Problem
    1.3.3 Network RAM
2 Parallel Network RAM
  2.1 Introduction
  2.2 Generic Description
    2.2.1 Clients
    2.2.2 Managers
    2.2.3 Servers
  2.3 Designs
    2.3.1 Client
    2.3.2 Centralized
    2.3.3 Local Managers
    2.3.4 Backbone
3 Methodology
  3.1 Simulator
    3.1.1 Workload Model
    3.1.2 System Models
    3.1.3 Job Behavior Models
    3.1.4 Scheduler Models
    3.1.5 Metrics
  3.2 Experimental Setup
    3.2.1 Experiment Table Detail
    3.2.2 Dimensions of the Experiment Table
4 Performance Evaluation
  4.1 64 Node Cluster
    4.1.1 Base Experiments
    4.1.2 RAM Variations
    4.1.3 Network Performance Variations
    4.1.4 Network Topology Variations
    4.1.5 Space Sharing Scheduler Experiments
  4.2 128 Node Cluster
    4.2.1 Base Experiments
    4.2.2 RAM Variations
    4.2.3 Network Performance Variations
    4.2.4 Network Topology Variations
    4.2.5 Space Sharing Scheduler Experiments
5 Analysis
  5.1 Discussion
    5.1.1 Memory Load
    5.1.2 Network
    5.1.3 Scheduling
    5.1.4 PNR Designs
  5.2 Conclusion
  5.3 Future Work
LIST OF REFERENCES

List of Tables

3.1 Table of Experiments.
5.1 Recommended PNR Design Usage.

List of Figures

2.1 Diagram of Parallel Network RAM. Application 2 is assigned to PEs P3, P4 and P5 but utilizes available memory space of P2, P6 and P7.
2.2 The architecture of a servent. Each servent can act in three roles.
2.3 Client-only design.
2.4 Centralized design.
2.5 Local manager design.
2.6 Backbone design.
4.1 Base experiment - 64 PE - several workloads - response time.
4.2 Base experiment - 64 PE - 4,000 jobs - response time.
4.3 Base experiment - 64 PE - 4,000 jobs - optimization ratio.
4.4 Base experiment - 64 PE - 5,000 jobs - response time.
4.5 Base experiment - 64 PE - 5,000 jobs - optimization ratio.
4.6 RAM experiments - 64 PE - 4,000 jobs - response time.
4.7 RAM experiments - 64 PE - 4,000 jobs - optimization ratio.
4.8 RAM experiments - 64 PE - 5,000 jobs - response time.
4.9 RAM experiments - 64 PE - 5,000 jobs - optimization ratio.
4.10 Network experiments - 64 PE - 4,000 jobs - response time.
4.11 Network experiments - 64 PE - 4,000 jobs - optimization ratio.
4.12 Network experiments - 64 PE - 5,000 jobs - response time.
4.13 Network experiments - 64 PE - 5,000 jobs - optimization ratio.
4.14 Topology experiments - 64 PE - 4,000 jobs - response time.
4.15 Topology experiments - 64 PE - 4,000 jobs - optimization ratio.
4.16 Topology experiments - 64 PE - 5,000 jobs - response time.
4.17 Topology experiments - 64 PE - 5,000 jobs - optimization ratio.
4.18 Space sharing experiments - 64 PE - 4,000 jobs - response time.
4.19 Space sharing experiments - 64 PE - 4,000 jobs - optimization ratio.
4.20 Space sharing experiments - 64 PE - 5,000 jobs - response time.
4.21 Space sharing experiments - 64 PE - 5,000 jobs - optimization ratio.
4.22 Base experiment - 128 PE - several workloads - response time.
4.23 Base experiment - 128 PE - 4,000 jobs - response time.
4.24 Base experiment - 128 PE - 4,000 jobs - optimization ratio.
4.25 Base experiment - 128 PE - 5,000 jobs - response time.
4.26 Base experiment - 128 PE - 5,000 jobs - optimization ratio.
4.27 RAM experiments - 128 PE - 4,000 jobs - response time.
4.28 RAM experiments - 128 PE - 4,000 jobs - optimization ratio.
4.29 RAM experiments - 128 PE - 5,000 jobs - response time.
4.30 RAM experiments - 128 PE - 5,000 jobs - optimization ratio.
4.31 Network experiments - 128 PE - 4,000 jobs - response time.
4.32 Network experiments - 128 PE - 4,000 jobs - optimization ratio.
4.33 Network experiments - 128 PE - 5,000 jobs - response time.
4.34 Network experiments - 128 PE - 5,000 jobs - optimization ratio.
4.35 Topology experiments - 128 PE - 4,000 jobs - response time.
4.36 Topology experiments - 128 PE - 4,000 jobs - optimization ratio.
4.37 Topology experiments - 128 PE - 5,000 jobs - response time.
4.38 Topology experiments - 128 PE - 5,000 jobs - optimization ratio.
4.39 Space sharing experiments - 128 PE - 4,000 jobs - response time.
4.40 Space sharing experiments - 128 PE - 4,000 jobs - optimization ratio.
4.41 Space sharing experiments - 128 PE - 5,000 jobs - response time.
4.42 Space sharing experiments - 128 PE - 5,000 jobs - optimization ratio.

Chapter 1

Introduction

1.1 Problem Statement

Many scientific computing applications demand a large amount of processing power, memory space and I/O accesses. Cluster systems with networked server nodes are becoming more popular for executing high-performance scientific computing applications, for both economic and technical reasons [16]. One standard approach to reducing the runtime of such applications is to parallelize them into multiple parallel processes so that many cluster nodes can run parts of the application simultaneously. An advantage of this approach is that the CPU and memory resources of the application are evenly distributed and used. However, this advantage may not serve the best performance interests of parallel processes, because a balanced workload distribution among parallel processes may result in unbalanced resource utilization in a cluster. Uniform use of resources at the parallel process level does not necessarily mean the system itself is evenly utilized.

We can attempt to adjust memory load by adjusting the number of processes each parallel process has, but there are trade-offs between memory usage and efficiency. To ensure each node has enough memory space to accommodate processes, we could partition parallel processes into a large number of processes. This results in less work and more required synchronization for each process. As a consequence, the CPU on each node may be underutilized. Parallel speedup is very hard to improve as the number of processes increases, due to increasing communication and synchronization overhead. This problem is sometimes referred to as the exponential performance degradation issue.

To ensure good CPU utilization we must limit the number of processes in a parallel process. But, as the problem size increases, nodes may run out of available memory and be forced to use the local disk as a swapping site [3, 4]. Performance will suffer from frequent page faults, since hard disks are orders of magnitude slower than RAM. Research has shown that disk paging results in unsatisfactory performance on parallel platforms and should be avoided [3, 4, 26]. We can increase the number of cluster nodes or the amount of RAM at each node, but doing so may be impossible or very expensive given the cluster setup.
In addition, it is unlikely that the additional nodes will offer any real benefit over the long term, since the users of the cluster will likely increase their usage of the system to match the new resources available. This leads back to the original problem.

The key problem is memory usage, and this problem has two parts: memory fragmentation and paging overhead. The aggregate memory capacity of the system may be enough to satisfy memory demands, but cluster memory is distributed into small chunks. Usage of these chunks may be uneven and inefficient. On the most heavily loaded nodes, disk paging is invoked, which incurs a high cost.

Network RAM [1, 18] has been proposed for use by sequential jobs in clusters to even out memory load and reduce paging overhead. This technique allows applications to allocate more memory than is available on the local machine while avoiding paging to disk, by allocating idle memory of other machines over a fast interconnecting network. This remote RAM is treated as a new layer in the memory hierarchy between RAM and disk. Resulting page accesses are slower than RAM, but faster than disk [1, 9, 18, 23, 39].

Existing network RAM techniques should not be directly applied to parallel jobs for performance gains. One issue is that processes from the same parallel job synchronize regularly. If each node hosting these processes seeks network RAM independently, they may be granted an uneven amount of network RAM. With this uneven allocation, the processes executing on these nodes will run at different speeds. However, the parallel job as a whole will only run at the speed of the slowest process, due to synchronization. The nodes with extra network RAM waste it, since their hosted processes will spend most of their time waiting for other processes. Therefore, coordination is required to grant overloaded nodes equal portions of memory, allowing hosted processes to run at equal speeds.

Another issue is network congestion. If parallel processes individually seek out network RAM with no coordination among themselves, a potentially large amount of unnecessary network traffic will result. This may induce congestion on the cluster. Parallel applications require high performance networks to run efficiently. Congestion could seriously impact the performance of jobs on the system.

We propose a new peer-to-peer solution, called Parallel Network RAM (PNR), so overloaded cluster nodes can utilize idle remote memory. In this scheme, each node may request memory resources from remote nodes and provide memory resources for others. Requests are indirect: each node contacts a manager (super-peer) node and requests that it allocate network RAM on its behalf. Managers coordinate the allocation of network RAM of several nodes and ensure that memory resources are distributed evenly to the nodes hosting parallel processes belonging to the same parallel job. PNR will allow more jobs to execute concurrently without resorting to disk paging. This will lead to decreased average response times and higher system throughput.

This thesis makes several contributions.

• We first identify the unbalanced resource utilization problem in a cluster with a mixed workload of jobs with different resource requirements. Existing techniques cannot maximize the performance gain of parallel jobs in such an environment in terms of both parallel speedup and execution time.

• We propose a novel and effective solution to this problem called Parallel Network RAM (PNR).
PNR makes it possible for parallel jobs in a cluster to utilize memory resources from available remote nodes. The CPU cycles will be provided by a small subset of nodes while the global memory space of the cluster is open to the memory demands of any parallel job. Since the speed gap between accessing local memory and remote memory is shrinking, and the speed gap between accessing local disk and remote memory continues to grow, the proposed scheme is expected to be beneficial for large-scale scientific computing applications now and in the future.

• We build a simulator that models a cluster and several proposed PNR algorithms. Conducting trace-driven simulations, we compare the performance of six different PNR designs to each other and to a disk paging-only solution. We then identify which proposed algorithms are superior and under what conditions.

This thesis is organized as follows. Chapter 1 describes the problem presented in cluster systems and reviews related work on this issue. Chapter 2 describes Parallel Network RAM, the algorithms used to implement PNR, and the different proposed PNR designs with their strengths and weaknesses. Chapter 3 describes the simulator we have created to test our PNR algorithms and the specific experiments set up for these tests. Chapter 4 describes the results of the experiments introduced in chapter 3. Chapter 5 discusses the results gathered and draws conclusions about the various designs.

1.2 Terminology

The systems described in this thesis include computing clusters and supercomputers. We describe individual computers in the clusters and CPUs in the supercomputers with the terms "PE" (Processing Element) and "node". The terms are equivalent and will be used interchangeably.

The terms "parallel process" and "job" are also equivalent and are used to describe programs that have multiple threads of execution that run on separate nodes. "Thread" and "process" denote the individual threads of execution of parallel processes.

We define four different Parallel Network RAM designs. We compare these designs to a system that uses only disk paging (no network RAM). For brevity, we will use acronyms to identify these designs. The following list defines each acronym.

DP - Disk Paging - applies to systems that do not use Parallel Network RAM.
PNR - Parallel Network RAM - a generic label for all Parallel Network RAM designs.
CEN - Centralized PNR - only one manager exists.
CLI - Clients-only PNR - a pure peer-to-peer design.
MAN - Local Managers PNR - a design that uses randomly selected local managers.
BBX - Backbone PNR - a design where a certain number of servents (represented by X) act as "backbone" managers.

1.3 Background and Related Work

The majority of work in this area has focused on parallel job scheduling. In this section, we describe various parallel job schedulers and previous solutions to the problem of overloaded memory on cluster systems.

1.3.1 Parallel Scheduling Algorithms

The primary duty of the scheduler on a cluster system is to ensure high system throughput and low overall response times of submitted jobs. The most simplistic scheduling model is the dedicated machine model. In this model, only one job runs on the system at a time. Each job is scheduled using a priority queue which may be sorted in a variety of ways, such as first-come first-serve (FCFS), estimated shortest job first, best fit, or worst fit. Each job runs until completion and may not be preempted.
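To make these queue orderings concrete, the sketch below shows how a dedicated-machine scheduler might be structured. It is our illustration, not code from the thesis simulator; the Job fields and comparator names are assumptions chosen to match the orderings named above.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Hypothetical job record; fields chosen to support the orderings above.
struct Job {
    double submitTime;       // arrival time (seconds)
    double estimatedRuntime; // user-supplied estimate (seconds)
    int    nodesRequested;   // PEs requested; used by best-fit/worst-fit orderings (not shown)
};

// FCFS: earliest submission runs first.
struct FcfsOrder {
    bool operator()(const Job& a, const Job& b) const {
        return a.submitTime > b.submitTime; // priority_queue pops the smallest submitTime
    }
};

// Estimated shortest job first: smallest runtime estimate runs first.
struct SjfOrder {
    bool operator()(const Job& a, const Job& b) const {
        return a.estimatedRuntime > b.estimatedRuntime;
    }
};

// Dedicated machine model: one job at a time, run to completion, no preemption.
template <typename Order>
double runDedicated(const std::vector<Job>& jobs) {
    std::priority_queue<Job, std::vector<Job>, Order> queue(Order{}, jobs);
    double clock = 0.0;
    while (!queue.empty()) {
        Job next = queue.top();
        queue.pop();
        clock = std::max(clock, next.submitTime); // wait for the job to arrive
        clock += next.estimatedRuntime;           // whole machine until completion
    }
    return clock; // makespan of the whole workload
}
```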
The dedicated machine model provides neither time sharing nor space sharing, and since most parallel jobs do not use all of the available nodes in the cluster, it can waste a large amount of resources.

An approach that makes better use of available resources is the space sharing model. Space sharing allows more than one job to be scheduled on the cluster at one time. Each node is devoted to one process, and each job runs until completion without preemption. The space sharing model is vulnerable to large, long-running jobs monopolizing the system, and a bad scheduling decision is difficult to correct. The dedicated machine model shares this problem [16].

Gang scheduling combines both space sharing and time sharing to avoid the problems associated with large jobs. Each job is allotted a time slot and the job may execute in its time slot. Nodes within each time slot are space shared. When a time quantum has expired, all jobs in the current slot are preempted and replaced with jobs in the next slot. The preemption process is called a "Parallel Context Switch" (PCS) and involves a certain amount of overhead. There is at most one process running on each node at any given time, although some schemes relax this constraint [36]. The time slot mechanism ensures no large job may monopolize the system for a long period of time. The maximum number of time slots allowed is known as the Multi-Programming Level (MPL). Setting the MPL to a low number is a convenient way to reduce PCS overhead and reduce overall memory load on nodes. A system with an MPL value of one is a simple space sharing system with no time sharing.

The length (in time) of time slots varies. It is intuitive to make each time slice the same length. Fixed-length time slots range from 1 second [4] to 120 seconds [29]. One disadvantage of uniform-length time slices is that excessive idleness can be caused by short jobs [10]. Some research has shown that variable-length time quanta may be desirable to increase efficiency [14, 34].

There are a variety of processor allocation strategies and variations of gang scheduling. These include first fit, best fit, left-right by size, left-right by slots, load-based allocation strategies, "buddy" systems such as distributed hierarchical control, and migration-based algorithms [10, 41].

One problem with gang scheduling is that the usage of nodes within time slots may become fragmented as jobs terminate and new jobs take their place. Two techniques can reduce node fragmentation: alternative scheduling and slot unification. Alternative scheduling discovers time slots that have idle nodes and finds jobs that run on these same nodes in other time slots. These jobs are allowed to run in both their original time slot and the partially idle slot. This effectively doubles the amount of time the selected jobs are allowed to run, increasing system utilization and reducing average response time. Slot unification attempts to unify time slots which use disjoint nodes. Typically, such situations occur after job termination [10]. This technique reduces the total number of time slots used by the scheduler and increases the speed of the system as perceived by the parallel processes.

Another available technique is backfilling. If users provide estimated job run times, the scheduler can identify idle areas in the schedule which can be filled with smaller, lower priority jobs. The smaller jobs must not interfere with the scheduled running time of jobs ahead in the queue. If a backfilled job runs for too long, it is killed.
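As a rough illustration of the idea (ours, in the spirit of EASY backfilling, not taken from any particular scheduler in the thesis), the check below decides whether a candidate job can be backfilled without delaying the reservation held by the job at the head of the queue. The type and field names are assumptions.

```cpp
// Hypothetical backfill eligibility check.
struct Reservation {
    double startTime; // when the queued head job is guaranteed to start
    int    freeNodes; // nodes idle until that start time
};

struct Candidate {
    double estimatedRuntime; // user-supplied estimate (seconds)
    int    nodesRequested;
};

// A job may be backfilled if it fits in the idle nodes and, by its own
// estimate, finishes before the head job's reserved start time. Jobs that
// exceed their estimate are killed, which keeps the reservation safe.
bool canBackfill(const Candidate& job, const Reservation& res, double now) {
    bool fitsInNodes = job.nodesRequested <= res.freeNodes;
    bool fitsInTime  = now + job.estimatedRuntime <= res.startTime;
    return fitsInNodes && fitsInTime;
}
```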
Backfilling has been shown to increase utilization and reduce response time. Several variations of backfilling have been studied [24, 28, 32, 33, 35, 40]. The MAUI scheduler, a particularly popular and mature scheduler for cluster systems, is a space-sharing, backfilling scheduler [20]. Research work has been done to implement gang schedulers on both supercomputing platforms [7] and clusters of workstations [19].

1.3.2 Previous Solutions to the Memory Problem

Previous studies agree that unmodified disk paging results in severely reduced performance on parallel systems [4, 30]. Generally speaking, previous solutions to this problem either attempt to avoid disk paging entirely or attempt to reduce its effect.

Various ways to avoid paging by altering the system scheduler have been suggested. If no memory information about incoming jobs is known, then the simplest solution is to keep the MPL to a minimum [24]. Another solution attempts to guess memory usage information based on information about the job provided by the user and information contained in the program executable. This guess is used in scheduling decisions [3]. Another solution uses speedup information known about jobs ahead of time to make scheduling decisions [26, 27]. Given this information, the scheduler can choose to give more processors to efficient jobs to increase utilization, or to memory-intensive jobs to increase throughput.

User-provided runtime and memory information can be used in scheduling decisions as well. In fact, the use of user estimates of runtimes is the basis of backfilling. However, it is well known that such information is unreliable, as users tend to estimate numbers that will ensure they get a good position in the scheduler queue. Some backfilling schemes attempt to take advantage of systematically inaccurate runtime estimations [28].

One method that is aimed at reducing the disk paging penalty is called block paging. In this scheme, the system groups sets of pages together and acts upon these groups as units. Groups are defined by the system based on the memory reference behavior of jobs [34].

1.3.3 Network RAM

Network RAM is a technique that reduces paging overhead. Much work has been done for sequential job scheduling with memory considerations. Regarding network RAM implementations, the Global Memory System (GMS) [9] and the Remote Memory Pager [23] attempt to reduce page fault overhead by using remote paging techniques. DoDo [1] is designed to improve system throughput by harvesting idle memory space in a distributed system. In DoDo, processes running on the local system have the highest priority for using CPUs and memory on their workstations. This divides the global memory system into different local regions. A memory ushering algorithm is used in MOSIX for memory load sharing [2]. This solution is a job-migration-based load sharing approach. Recently, several load sharing alternatives have been developed. These techniques consider both CPU and memory resources with known and unknown memory demands [37, 38]. The objective of the designs is to reduce the number of page faults caused by unbalanced memory allocations of distributed jobs so that overall performance can be significantly improved.

Chapter 2

Parallel Network RAM

2.1 Introduction

We propose a novel and effective technique called Parallel Network RAM (PNR) to better utilize both CPU and memory and minimize communication and synchronization overhead. We demonstrate the basic idea of PNR in Figure 2.1.
In this figure, application 2 runs on nodes P3, P4 and P5 but utilizes available memory space in nodes P2, P6 and P7, where the non-memory-intensive applications 1 and 3 reside. With PNR, instead of three nodes being overloaded while four are underloaded, all seven nodes are fully utilized.

Figure 2.1: Diagram of Parallel Network RAM. Application 2 is assigned to PEs P3, P4 and P5 but utilizes available memory space of P2, P6 and P7.

Our objective is to fundamentally improve the efficiency of large-scale scientific computing. Under our proposed solution, CPU and memory resource allocations are considered separately. CPU utilization requirements can be taken into account during the scheduling phase, and memory requirements can be handled as needed during execution. In this way, CPU usage can be optimized to maximize utilization and minimize communication and synchronization overhead, while memory usage can include both the local memory space from the assigned CPUs and the remote memory space in other nodes (as needed). With PNR, speedup can be scaled as the available remote memory increases, and performance can also be scaled as the problem size increases. Because CPU usage and memory usage are considered separately, PNR does not coordinate with or receive information from the assumed centralized scheduler of the system. This allows PNR to be implemented on a variety of platforms without changes to their existing scheduling policies.

2.2 Generic Description

In brief, all nodes in the system host PNR servents (see figure 2.2). All servents act as PNR clients and servers. Some servents may act as managers. Managers act as proxies for clients to communicate with servers. The purpose of managers is to coordinate client requests.

Figure 2.2: The architecture of a servent. Each servent can act in three roles.

2.2.1 Clients

A PNR client attempts to allocate and deallocate network RAM on behalf of its hosting node. The node uses allocated network RAM as additional virtual memory, just as it would with disk space. When a process starts execution on a node, it allocates the amount of memory it will use during its execution. If the node's client determines that this allocation will lead to disk usage, the client contacts a manager and requests network RAM. Once network RAM is allocated, the client is informed by the manager which machines are serving the network RAM and how much was allocated. The client may then start sending pages to the servers for storage and later retrieval.

Similarly, when a process stops execution it will deallocate its memory. If the client detects that previously allocated network RAM is no longer needed, then the client signals a manager that it may be deallocated. Clients deallocate chunks of memory first from servers believed to have the highest memory load.

2.2.2 Managers

PNR managers are the centerpiece of most of the proposed PNR algorithms. They do the majority of allocation and deallocation work and act as proxies between clients and servers.

Network RAM Request

Managers listen for network RAM requests from clients (including the client residing on the same node as the manager). Depending on the PNR strategy, the manager may act immediately on this request or wait for other requests to come in before acting. Most strategies require that all threads within a parallel job contact the same manager before the manager is allowed to act on any one of the threads' requests. For this to work, it is assumed that the manager can discover the total number of threads belonging to the parallel job. As each new request comes in, information on the aggregate memory request is stored in a request table.
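A minimal sketch of this wait-for-all-threads coordination step follows. It is our illustration rather than the thesis simulator's code, and the type names and fields are assumptions.

```cpp
#include <map>
#include <vector>

// Hypothetical request table entry: one per parallel job the manager
// is currently coordinating.
struct PendingJob {
    int               expectedThreads = 0; // discovered from the job itself
    std::vector<int>  clients;             // node ids that have reported so far
    long long         totalBytes = 0;      // aggregate network RAM requested
};

class RequestTable {
public:
    // Record one thread's request. Returns true once every thread of the
    // job has reported, i.e. the manager may now act on the aggregate.
    bool addRequest(int jobId, int clientNode, long long bytes, int threadCount) {
        PendingJob& entry = table_[jobId];
        entry.expectedThreads = threadCount;
        entry.clients.push_back(clientNode);
        entry.totalBytes += bytes;
        return static_cast<int>(entry.clients.size()) == entry.expectedThreads;
    }

    // Remove and return a completed entry once the manager acts on it.
    PendingJob take(int jobId) {
        PendingJob entry = table_[jobId];
        table_.erase(jobId);
        return entry;
    }

private:
    std::map<int, PendingJob> table_;
};
```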
Server Request and Response

When all requests are received from the relevant clients, the manager attempts to select a server from an availability list. Multiple selection schemes are possible. Currently, a randomized worst-fit scheme is used. That is, the server perceived to have the most RAM available for remote allocation (i.e., the server with the most idle RAM) is always selected. If servers tie for the highest amount of RAM available, one server is randomly chosen.

The server contacted may grant, partially grant, or deny the request. It is important to note that a server that was listed as having RAM available for allocation may not have the same amount available by the time the request message reaches it. The server's response is received by the PNR manager. If too little memory was allocated, the PNR manager will attempt to contact another server. The process is repeated until enough network RAM is granted by multiple servers or until no servers are left to query.

Network RAM Request Response

When either enough network RAM has been allocated or all possible servers have been queried, the manager will calculate the total amount of network RAM granted. Based on this, the manager will divide the network RAM evenly among the requesting clients. If a client requests less network RAM than it would get in its fair share, exactly as much as it requests is granted and the remainder is reserved for other clients. If a client requests more network RAM than it would get in its fair share, it gets its share and only receives more if other clients do not need all of their shares. Each client is informed of which servers are hosting the granted network RAM.

The even distribution of network RAM is done to ensure that each process in a parallel job runs at roughly the same speed. If network RAM were granted haphazardly, each process would run at a different relative speed. The job as a whole would run at the speed of the slowest process because of synchronization, and the fast processes would be held back.

Network RAM Deallocation Notification

It is assumed that deallocation notifications are only sent at job termination time. The manager will listen for deallocation notifications from clients. The deallocation amounts indicated in the messages may have nothing to do with how much was previously allocated to those clients: in some cases, no network RAM may be deallocated because it is still needed by the client. In most strategies, the manager will wait for each client associated with a parallel job to send a deallocation notification.

Server Notification and Response

When all deallocation notifications are received by the manager, the manager determines how much network RAM needs to be deallocated and on which servers. Each server is then notified that it may deallocate the specified amount of RAM. The server will respond to this message with an acknowledgment.

Broadcasting Availability Information

During each message-passing interaction, clients and servers piggyback current memory load information onto messages sent to managers. The manager receives this information in addition to the relevant payload. It uses this up-to-date memory load information to update its own network RAM availability table. It uses this table for server selection.
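The sketch below illustrates the two manager decisions described above: randomized worst-fit server selection from the availability table, and the fair-share split of whatever was granted. It is a simplified rendering under our own assumptions (names, types, byte units), not the simulator's implementation.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct ServerInfo {
    int       node;      // server node id
    long long idleBytes; // last reported idle RAM (may already be stale)
};

// Randomized worst-fit: pick the server with the most idle RAM,
// breaking ties uniformly at random. Returns -1 if no servers are known.
int selectServer(const std::vector<ServerInfo>& table, std::mt19937& rng) {
    if (table.empty()) return -1;
    long long best = -1;
    std::vector<int> ties;
    for (const ServerInfo& s : table) {
        if (s.idleBytes > best)       { best = s.idleBytes; ties = {s.node}; }
        else if (s.idleBytes == best) { ties.push_back(s.node); }
    }
    std::uniform_int_distribution<size_t> pick(0, ties.size() - 1);
    return ties[pick(rng)];
}

// Divide the granted network RAM evenly among requesting clients.
// Clients asking for less than their share keep only what they asked for;
// the remainder is redistributed among the still-needy clients.
std::vector<long long> fairShare(std::vector<long long> requests, long long granted) {
    std::vector<long long> given(requests.size(), 0);
    size_t unsatisfied = requests.size();
    while (granted > 0 && unsatisfied > 0) {
        long long share = granted / static_cast<long long>(unsatisfied);
        if (share == 0) break; // too little left to split further
        unsatisfied = 0;
        for (size_t i = 0; i < requests.size(); ++i) {
            if (requests[i] == 0) continue;
            long long take = std::min(requests[i], share);
            given[i]    += take;
            requests[i] -= take;
            granted     -= take;
            if (requests[i] > 0) ++unsatisfied;
        }
    }
    return given;
}
```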
Some PNR strategies may prefer to share this memory load information by broadcasting it to all servents. Other strategies may only broadcast to selected servents. Currently, broadcasts are only executed after network RAM has been allocated or deallocated.

It should be noted that memory load information may already be out of date before it is even broadcast. There is no way around this problem, since it is possible for memory loads to change as messages are being passed on the network. We believe this broadcasting solution provides an economical way to keep availability information reasonably up to date.

2.2.3 Servers

Servers receive requests from managers for network RAM. If the server has more unallocated RAM than a certain threshold, it will grant the network RAM request and allocate memory to the manager up to that threshold. Currently, the unallocated memory threshold is set to a low value: a tenth of a megabyte. This value can be adjusted and may be useful for future work. After the memory is allocated, servers receive requests to read and write the allocated network RAM directly from clients. Servers grant all valid deallocation attempts.

2.3 Designs

We propose four different PNR designs. Each design has a slightly different architecture and a different amount of communication overhead associated with it. This section describes these designs.

2.3.1 Client

In this strategy (CLI), each client uses its servent's manager to send allocation requests (see figure 2.3). The local manager does not wait for messages from other clients; it acts immediately upon the client's request. The manager attempts to allocate as much network RAM as possible for its client. When the client receives network RAM, it begins execution immediately. It does not wait for the other threads in the parallel job. The local manager does not share memory load information with other servents.

Figure 2.3: Client-only design.

This strategy allows clients to allocate network RAM quickly and eliminates all coordination overhead. It is scalable, since each client is responsible only for itself. It is also simple to implement, since there is no need for a manager at all.

However, this solution has major drawbacks. First, memory load information is not shared. Each client must discover memory load information for itself, and this may lead to a large number of ineffective network RAM allocation requests. Second, and more serious, the clients do not coordinate memory allocation with each other. Some clients may receive large amounts of network RAM if they get their messages to the servers first, while other clients get very little network RAM. This will not improve the performance of parallel jobs as a whole, since each job will only execute at the speed of its slowest thread. In fact, this scheme may worsen overall performance, since much of the network RAM allocated is wasted. Thus, no benefit is gained from network RAM, and higher memory loads on the servers may induce disk paging, the very problem we are trying to avoid!

2.3.2 Centralized

In this strategy (CEN), only one manager exists, and it receives all client requests (see figure 2.4). All servents know the identity of this manager. All clients will contact this manager and it will coordinate network RAM allocation. The manager uses memory load information sent in by clients and servers to make allocation decisions. Since only one active manager exists, the centralized manager does not broadcast any memory load information it receives.

Figure 2.4: Centralized design.
This scheme has the advantage of one agent having all of the information and making all of the decisions. No time or network bandwidth is wasted in sending coordination messages to other managers. The disadvantage of this strategy is that it is not scalable. As the system size grows, the network connections leading to the node hosting the manager will become a bottleneck and limit the performance of the system. In the real world, the computational and memory overhead of hosting this central manager would also become major factors. However, in our simulator, these are not taken into account.

2.3.3 Local Managers

In this strategy (MAN), each client contacts a "local" manager (see figure 2.5). Specifically, whenever a job starts or stops, one of the servents running on a node associated with that job will volunteer to act as the manager in addition to acting as a client. Each servent involved must agree on which servent will act as the manager. All clients from the involved servents will contact this manager. The manager will take their requests, allocate network RAM (if possible), and divide the received network RAM evenly among the requesting clients. At the end of each allocation and deallocation, the manager will broadcast memory load information to each servent in the system. This is necessary since each servent on the system can potentially act as a manager.

Figure 2.5: Local manager design.

Since no single node is loaded with all requests, this solution is scalable. It makes good use of allocated resources via the coordination of client requests. Broadcasting the memory load information should keep the tables of the servents relatively up to date.

However, the major drawback of this approach is the broadcasting step. Sending a message to each servent on the system can introduce congestion into the system, especially if no real broadcasting facility exists and broadcasting must be implemented via multiple point-to-point messages. The larger the system, the bigger this issue will become. Also, since there is no central authority on memory allocation information, servents may be more likely to act on outdated information. Like the centralized strategy, waiting time is incurred in the coordination step by waiting for all clients to send in their requests.
The backbone strategy also has the advantage of being customizable to the cluster setup. If the network is small, then a small backbone (perhaps a backbone of one) may be all that is required. As the network gets larger, then the number of servents in the backbone can be increased appropriately to manage scalability. 19 Chapter 3 Methodology 3. 1 Simulator To test our proposed PN R designs, we created a simulator that models a parallel platform, parallel jobs, and our PNR algorithms [25]. This simulator is written in C++ and compiles under UNIX systems. We used a modified version of Sim++ as our main simulation library. Sim++ is the C++ version of Simpack, a simulation library created by Paul Fishwick [6, 17]. Sim++ is open-source, and we modified the implementation to suit our needs and maximize performance. This chapter describes the models we implemented in our simulator and the experiments we ran on the simulator. 3. 1. 1 Workload Model Many workload traces and synthetic workload generators exist for use by simula- tors. However, no standard memory usage benchmarks currently exist [5]. We decided to use a large trace with memory allocation information that has been assembled and discussed by Feitelson [11]. The trace has been gathered from the CM-5 parallel platform at the Los Alamos 20 National Lab. It contains information about 201,387 jobs run through the majority of 1996. We will use a subset of jobs from this well-studied trace. Since the workload profile at a given site tends to be fairly stable over time [22], using a subset of jobs should be indicative of the general load on the system. However, using a workload trace may be biased toward that collected site’s policies and not representative of all workloads [5, 15, 22]. For example, in our trace no process attempts to allocate more memory than is physically available on a node and all jobs allocate PES in powers of two [13]. Obviously, not every cluster installation will have a workload with these characteristics. 3. 1.2 System Models To complement the CM-5 workload trace, we model a system architecture similar to the CM-5. Each node in thesimulated system runs at 33 Mhz and has 32 MB of local memory. The original CM-5 did not support paging [13] and we add this capability to our simulated system. Each node has a disk of infinite capacity that runs at 7200 RPM with a seek time of 9 ms and a transfer rate of 50 MBps. The interconnecting network is assumed to be a simple Ethernet 100 Mbps star topology. Each link has a latency of 50 nanoseconds and the central switch has a processing delay of 80 microseconds. The original CM-5 had 32 processors dedicated solely to system tasks, so it is assumed that the operating system of the CM-5 imposes no CPU or memory load on the nodes. Network Models The simulator provides simulated network links. A link connects either one computer to another computer (or set of computers) or a computer to a switch. Only one message may be transmitted on a link at a time. A link has a fixed latency time and a transmission time based on the bandwidth and the size of the message 21 transmitted. In our experiments, we set the latency of each link to be that of the speed of light through 10 meters of copper wire (50 nanoseconds). No collisions occur on the simulated links. Incoming messages are queued if the link is currently in use. A simulated network switch stores communications and forwards them to the appropriate link. Switches may only forward one message at a time and all other messages are queued. 
Switches are assumed to have a single infinite length queue sorted in FIFO order. Switches have a fixed length processing delay that can be set as a parameter. The base value of this delay is set to 80 microseconds. Point—to-point messages traverse links and switches only. Routes including nodes as intermediate points are not used. Some PNR methods use broadcasts to send information to servents. In our simulator, broadcasts are simulated using N point-to- point messages from the broadcasting node to each other node. 3.1.3 Job Behavior Models Each job is composed of multiple processes. It is assumed that each process allo— cates a static amount of memory at start time. Each process accesses memory at its node independently. Previous studies have shown that parallel scientific applications generate memory references every three to five CPU cycles and have a cache hit ratio that ranges from approximately 50% to 65% [8, 31]. We assume that our processes access memory every four CPU cycles and have a cache hit ratio of 50%. Synchronization Model All processes in a job synchronize with each other at regular intervals. The synchronization pattern is a simple master/worker pattern. One process is chosen as the ”master” process and all other threads in the job are ”workers”. After a certain amount of CPU time, each worker sends a message to the master. After the master receives all messages, the workers are allowed to proceed with execution. In 22 our experiments, we set the time interval between synchronizations to be one CPU second (once every 32 million cycles). This is not a heavy synchronization load. Future experiments should use more complex and heavier loads. Page Fault Detection Algorithm Originally, we used a memory access model similar to that presented in [3] to produce a low-locality memory access pattern. However, its usage became a bottle- neck in our simulator. We replaced its use with a new model based on an exponential function. Each thread has an average memory access rate. The page fault detection algo- rithm determines how much time will elapse between the current time and the time when the next page fault will occur. For example, if no pages of the currently running thread are loaded into memory, then the first memory access will be a page fault and the time until the next page fault will be the same as the average memory access rate. If all pages of the thread are loaded into memory, then the amount of time until the next page fault will be infinite. If some, but not all, pages are loaded into memory, there should be some time between page faults that is greater than the time between memory accesses, since some references will be to pages already loaded. In fact, this should happen often due to locality of memory references. We construct a model to mimic this behavior. To start, we define the time until the next page fault to be T. We then define the average memory access rate to be R. If we assume no pages are ever loaded into memory, then the next page fault will occur in exactly this amount of time. T = R The operating system will load a page into memory when a page fault occurs. It is very likely that subsequent memory accesses will reference the same page because memory accesses, in general, follow locality. If the program accessed memory linearly, 23 then the memory references will traverse the entire loaded page before encountering a new, unloaded page. We assume this is the case and multiply the memory access rate by the page size P. 
Network RAM Access Model

The paging model is used as the foundation of the network RAM access model. When a page fault occurs, the simulator checks what percentage of the virtual memory of the node is stored as network RAM and what percentage is stored on disk. A random number is chosen to decide whether network RAM or disk has just been referenced. If the disk is referenced, then the activity is no different than normal paging. If network RAM is used, a round-trip message occurs between the host and the server node storing its pages. It is assumed that the page is read from remote RAM instantaneously and the only time penalty is that caused by communication overhead.
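The sketch below illustrates that choice: a minimal rendering under assumed names, with the disk and round-trip costs left as parameters.

```cpp
#include <random>

// Hypothetical per-fault service time: the fault hits network RAM with
// probability equal to the fraction of paged-out memory held remotely.
double pageFaultServiceTime(double netRamBytes, double diskBytes,
                            double diskAccessTime, double roundTripTime,
                            std::mt19937& rng) {
    double total = netRamBytes + diskBytes;
    if (total <= 0.0) return 0.0; // nothing is paged out
    std::bernoulli_distribution hitsNetRam(netRamBytes / total);
    // Remote reads are treated as instantaneous; only communication costs time.
    return hitsNetRam(rng) ? roundTripTime : diskAccessTime;
}
```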
3.1.4 Scheduler Models

It is assumed there is one centralized scheduler for the system. In our simulations we experiment with two schedulers: a space sharing scheduler and a gang scheduler. For both schedulers, we use FCFS as our queuing discipline, since it has been shown to be a simple yet efficient ordering which guarantees fairness [24]. Most simple node packing schemes lead to identical performance, so we use best-fit packing to follow the example of [10]. It is assumed that the schedulers have no knowledge of the memory requirements of jobs. Neither scheduler takes these requirements into account when scheduling decisions are made.

Each time slice in the gang scheduler runs for a 60 second quantum, as suggested by [29]. The time required to perform a PCS is fixed at 4 ms [4]. The maximum number of time slices (MPL) is set to two. This number is conservative and limits paging activity [24]. Alternative scheduling and slot unification are provided.

3.1.5 Metrics

There are no universally valid and accepted metrics. In fact, different metrics can give contradictory results [12, 15, 21]. However, most installations use a few "de facto" standard metrics such as response time and utilization [5]. In this section we list and justify the metrics we use in our experiments.

The first metric is average response time, or total wallclock time from submission to finish. This metric is used very often and directly reflects the goal we are attempting to achieve: improved parallel process performance. One disadvantage of this metric is that it can overemphasize large jobs. In parallel workloads, small jobs account for the majority of jobs [10, 12, 15].

To directly compare DP to the various PNR designs, we create another metric, the "optimization ratio". This metric is based on average response time and represents the improvement of a PNR design over DP:

OptimizationRatio(PNRDesign) = (ResponseTime(DP) − ResponseTime(PNRDesign)) / ResponseTime(DP) × 100%
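As a quick worked example (the numbers are ours, chosen near the 4,000-job results reported in chapter 4): if DP has an average response time of 20,000 seconds and a PNR design averages 17,600 seconds, then

OptimizationRatio = (20,000 − 17,600) / 20,000 × 100% = 12%

A positive ratio means the PNR design beats DP; a negative ratio means it underperforms DP.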
We also calculate the average and standard deviation of node memory allocation and disk allocation by sampling the memory allocation information of each node every 50,000 simulated seconds.

Many other possibly useful metrics, such as utilization, throughput, slowdown, and bounded slowdown, are not reported here. We have briefly examined these metrics, and trends in response times appear to match well with trends in them. However, additional comparisons with these metrics may yield interesting data and could be the focus of future work.

3.2 Experimental Setup

To determine the effectiveness of PNR in comparison to DP, we define several experiments. The basic set of experiments is defined in table 3.1. These experiments reflect our main points of interest:

• Performance under varying memory loads.
• Performance under varying network speeds.
• Performance under different network topologies.
• Performance under different scheduling strategies.

The basic experiment table becomes multidimensional as we apply it to seven different paging methods, workloads of 4,000 and 5,000 jobs, and to 64 and 128 node systems. The total number of experiments performed for this thesis based on table 3.1 is 560. Some additional experiments are performed.

RAM     Network   Topology    Space Sharing
150%    10        Bus         50%
135%    100       Star        75%
125%    1,000     Connected   100%
115%    10,000
100%    100,000
85%
75%
65%
50%

Table 3.1: Table of Experiments.

3.2.1 Experiment Table Detail

RAM

To test each scheme under varying memory loads, we vary the amount of RAM available at each node while holding the memory demands of jobs constant. The parameter given signifies the relative amount of RAM present at each node in the experiment. For instance, for the 150% experiment, we adjust the default 32 MB of RAM to 48 MB (or 150% of the original value).

Network

To test each scheme under varying network performance, we alter link bandwidth and switch processing delay. The base value of the network performance is 100 Mbps with an 80 microsecond switch processing delay. The parameter value 1,000 signifies performance 10 times better than the base value, with link bandwidth at 1,000 Mbps and switch processing delay at 8 microseconds. The 10,000 and 100,000 values follow the same pattern. Parameter value 10 has a link bandwidth of 10 Mbps but does not alter the switch processing delay.

Topology

To test each scheme under different topologies, we define three topologies:

• Star - N links and 1 switch
• Bus - 1 link
• Connected - N(N − 1) links - a fully connected system with links in both directions.

The star is the base topology. Note that N is the total number of nodes in the system.

Scheduler

To test each scheme under different scheduling strategies, we define both a gang scheduler and a space sharing scheduler. The gang scheduler is used as the base scheduler. The space sharing scheduler is simply the gang scheduler with an MPL of one. When we use the space sharing scheduler, we run the scheduler under varying RAM workloads. Following the same pattern as the gang scheduling experiments, we vary the RAM ratio from 50% to 100%. We do not simulate anything beyond 100% RAM because of the CM-5 scheduling policy, which prevented any job larger than the available RAM from running.

3.2.2 Dimensions of the Experiment Table

Each experiment defined in table 3.1 is run on 4,000 and 5,000 job workloads. We discovered in early experiments that jobs start to become backlogged under the gang scheduler with the 5,000 job workload. By comparing the 4,000 and 5,000 job workloads we can understand how each solution performs under different types of system loads.

Each experiment is run on a 64 node cluster and a 128 node cluster. Running experiments on different-sized systems will indicate how well the various designs scale. Note that each workload was originally collected on a 1,024 node system; when the system is scaled down, the workload is scaled with it. For example, a job that asked for 512 processors under the original workload will ask for 32 processors in the 64 node system, and 64 processors in the 128 node system. Congestion on the larger networks will increase, since more communication is required for synchronization and broadcasts.

We run seven different paging methods against each experiment. The base is disk paging (DP). We test six PNR methods: the centralized method (CEN), the client-only method (CLI), the local managers method (MAN), and three variants of the backbone method (BB). The three backbone methods define the number of PNR managers to be 1/4, 1/8, and 1/16 of the total number of nodes in the system. For instance, on the 64 node system, the number of managers is defined as 16 (BB16), 8 (BB8), and 4 (BB4), respectively.

Chapter 4

Performance Evaluation

4.1 64 Node Cluster

This section describes the results of the experiments introduced in table 3.1 run on a simulated 64 node cluster. Note that we define the "base" set of experiments as experiments where all of the default parameters described in chapter 3 are set.

4.1.1 Base Experiments

In figure 4.1 we see the results of the base set of experiments for five different workloads. Response time for each design follows an exponential curve as workloads get larger. For the 5,000 job workload, jobs are starting to become backlogged. This increases average response time for all paging designs. This trend continues for the 6,000 job workload (not shown).
We focus on the 4,000 and 5,000 job workloads to observe the difference between backlogged and non-backlogged systems. Figure 4.2 shows the 4,000 job workload, where all PNR designs are close in performance. Each design has an optimization ratio of 10-12% over disk paging (figure 4.3). Interestingly, CLI has the best optimization ratio at 12.08%. This goes against the expectation that CLI will always lead to poor performance due to lack of servent coordination.

Figure 4.1: Base experiment - 64 PE - several workloads - response time.
Figure 4.2: Base experiment - 64 PE - 4,000 jobs - response time.
Figure 4.3: Base experiment - 64 PE - 4,000 jobs - optimization ratio.
Figure 4.4: Base experiment - 64 PE - 5,000 jobs - response time.
Figure 4.5: Base experiment - 64 PE - 5,000 jobs - optimization ratio.

The average memory usage for DP was 6.88 MB, with a standard deviation of 4.96 MB. All PNR designs had a standard deviation around 4.4 MB. CLI did not lead to significantly uneven memory allocation (as was expected).

Figure 4.4 demonstrates that, for the 5,000 job workload, the PNR designs are more differentiated. CLI was the only design with an average response time higher than DP's. Figure 4.5 shows that BB16 is the best design, at a 33.38% speedup over DP. This figure shows us that BB16 is at the top of a curve, where performance of the coordinating designs is worst at CEN, gets better as the solution becomes more distributed (BB4 and BB8), and degrades past a certain point of distribution (MAN). These results are more in line with our initial hypothesis that a hybrid of centralized and decentralized approaches would have the best performance.

All non-CLI PNR methods experienced approximately 70% or less of the number of page faults DP experienced. This can be attributed to the fact that, since PNR jobs finish faster, they experience fewer PCSs. This, in turn, leads to less page loading due to memory contention by multiple processes.

We observe some interesting statistics for memory allocation for each design. DP had the highest average memory allocation at 17.87 MB per node, with a standard deviation of 5.17 MB. All PNR designs had lower allocation averages, ranging from 14.45 MB to 16.37 MB. BB16 had the lowest average at 14.45 MB, but it also had the highest standard deviation of all designs at 5.76 MB! In fact, each PNR design except for MAN had a higher standard deviation than DP, ranging from 5.18 to 5.76 MB (with MAN at 4.98 MB). This result comes as a surprise, since PNR should be smoothing out memory usage, not creating more hotspots!
Further, it may be possible that BB16 had the lowest response time only by chance, since the hotspots it created were not in heavy use by the system.

4.1.2 RAM Variations

For both the 4,000 and 5,000 job workloads, we see response times for all designs converge as RAM is increased (figures 4.6 and 4.9). In the 4,000 job workload, response time converges earlier. In the 5,000 job workload, the designs converge later (due to backlog), at RAM ratio 135%. Data is omitted from the figures when the optimization ratio is 0%. As RAM is decreased, all designs increase response time exponentially. Presumably, response times will also converge when so little RAM is available that allocating network RAM is impossible.

At the 100% RAM ratio and above in the 4,000 job workload, all PNR methods use 0 MB of disk space on average. The standard deviation for each is 0 MB. This accounts for the similarity of results among the PNR methods.

For the 4,000 job workload, we see a very large initial boost in performance as RAM is decreased. The optimization ratio is among the highest observed for all experiments - in the 75% to 100% range (figure 4.7). This advantage quickly evaporates at 65%, where most designs actually underperform DP. At 50%, most of the PNR designs recover, but with a lower performance benefit (between 25% and 50%). CLI is the clear loser in each of the experiments where the RAM ratio is less than 100%. MAN consistently outperforms or performs as well as the other methods in each scenario. The other methods are more difficult to judge, since their performance fluctuates.

In the 5,000 job workload, one immediately obvious feature is the big performance boost PNR receives as RAM is initially increased (figure 4.9). The 4,000 job workload experienced no such benefit. The extra RAM must help PNR efficiently clear out the backlog of jobs. As RAM is increased past a certain point (around 135%), PNR and DP converge.

Figure 4.6: RAM experiments - 64 PE - 4,000 jobs - response time.

Figure 4.7: RAM experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.8: RAM experiments - 64 PE - 5,000 jobs - response time.

Figure 4.9: RAM experiments - 64 PE - 5,000 jobs - optimization ratio.

Interestingly, despite these favorable response times at RAM ratios 115% and 125%, the average allocated memory and standard deviation metrics are not unusually favorable. In fact, some PNR methods have an average and standard deviation higher than DP's. The differentiating characteristic between DP and PNR must be the average disk usage. While DP uses about 1.25 MB on average, each PNR method uses about 0.5 MB or less at RAM ratio 115%.
Similarly, at the 125% RAM ratio, DP uses 0.32 MB whereas most PNR methods use less than 0.1 MB.

When RAM becomes more scarce, MAN offers consistent performance benefits, but this time it is not always the best design. At 85% and 75% we observe smaller-scale versions of the performance boosts observed in the 4,000 job workload. At 65% there is another performance hit - but, again, at a smaller scale. At 50%, modest improvements are back.

4.1.3 Network Performance Variations

For all cluster setups and workloads, the performance of PNR at 10 Mbps of bandwidth was so poor that it was not included in the figures for these experiments. To give an idea of the performance, the total simulated time elapsed for the 5,000 job workload rose an order of magnitude, from 10^6 to 10^7 seconds. This result shows that PNR is extremely sensitive to low-performance networks.

Figures 4.10 and 4.11 show the 4,000 job workload, where the methods with the most coordination overhead suffer at bandwidth 100 Mbps. However, as bandwidth is increased beyond 100 Mbps, all PNR methods become equivalent in terms of response times and optimization ratios. No PNR method in these experiments uses disk paging at any bandwidth. The limiting factor here is coordination overhead. It makes sense that as communication becomes less costly, coordination overhead becomes such a small factor that the various methods converge to the same response time.

Figure 4.10: Network experiments - 64 PE - 4,000 jobs - response time.

Figure 4.11: Network experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.12: Network experiments - 64 PE - 5,000 jobs - response time.

Figure 4.13: Network experiments - 64 PE - 5,000 jobs - optimization ratio.

Figures 4.12 and 4.13 show that, for the 5,000 job workload, the general curve observed in the base test is preserved until bandwidth 100,000 Mbps, where the results are noisy. Increases in network performance benefit CEN the most but also tend to increase the performance of each method. CLI performs worse than every other PNR method in all cases. However, for bandwidth greater than 100 Mbps, CLI still manages to beat DP.
In fact, CLI even has lower average memory usage (5.46 MB) and standard deviation (2.02 MB) compared to DP (6.09 and 2.71 MB).

For this experiment, synchronization overhead must not be the only limiting factor. The backlog of jobs must be influencing the experiment and producing noisy results. The phenomenon may be explained as follows: if processes are becoming backlogged in the 5,000 job workload, and higher network performance results in higher job turnover, then longer-running jobs will tend to run together more frequently than under lower-performing networks. Longer-running jobs tend to demand more memory than shorter jobs, so, given this situation, memory load and response times will rise.

4.1.4 Network Topology Variations

For the 4,000 job workload (figures 4.14 and 4.15), PNR designs are superior to DP on all topologies. For the bus topology, it appears that less coordination overhead results in higher performance. This is not too surprising, since the potential for congestion on a bus network is very high. For the connected network, we observe results similar to those on the high bandwidth network. Each PNR design has roughly equivalent performance.

Since the connected network does not increase bandwidth, and since the optimization ratio is similar to that found on the high-performance networks, we can conclude that the significant factor in this experiment and in the network experiments is congestion. The problem with low-performing networks was not simply the message service time, but the queuing delay introduced by the slower-moving messages.

Figure 4.14: Topology experiments - 64 PE - 4,000 jobs - response time.

Figure 4.15: Topology experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.16: Topology experiments - 64 PE - 5,000 jobs - response time.

For the 5,000 job workload (figure 4.16) we have noisier results. In the bus topology, the general trend still appears to be the less synchronization, the better. For the connected topology, all designs (including DP) get a performance boost except MAN. BB16 is the superior method at a 60% optimization ratio. Notably, CLI still underperforms DP. It is difficult to say there is a definite pattern in these results.

4.1.5 Space Sharing Scheduler Experiments

For all workloads, recall that no individual job is larger than the CM-5's available RAM. So, with a space sharing scheduler, no two jobs will be loaded on one node and, at RAM ratio 100%, no node should ever be overloaded. All designs should experience nearly identical performance. This result is confirmed in all RAM ratio 100% space sharing experiments (with trivial differences resulting from PNR coordination methods).

Figure 4.18 shows that, for the 4,000 job workload, each method has approximately the same performance at the 75% RAM ratio, with CLI having a slight edge. At 50%, CLI loses its edge while all the rest of the methods gain a performance boost.
Figure 4.21 shows the same results for the 5,000 job workload. Each method has approximately the same performance at 75%. CLI again loses its performance benefits at 50%, while the rest of the PNR methods approach a respectable 45% optimization ratio.

Figure 4.17: Topology experiments - 64 PE - 5,000 jobs - optimization ratio.

Figure 4.18: Space sharing experiments - 64 PE - 4,000 jobs - response time.

Figure 4.19: Space sharing experiments - 64 PE - 4,000 jobs - optimization ratio.

Figure 4.20: Space sharing experiments - 64 PE - 5,000 jobs - response time.

Figure 4.21: Space sharing experiments - 64 PE - 5,000 jobs - optimization ratio.

4.2 128 Node Cluster

This section describes the results of experiments run on a simulated 128 node cluster. These experiments are primarily done to test scalability as compared to the 64 node cluster.

Figure 4.22: Base experiment - 128 PE - several workloads - response time.

Figure 4.23: Base experiment - 128 PE - 4,000 jobs - response time.

Figure 4.24: Base experiment - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.25: Base experiment - 128 PE - 5,000 jobs - response time.

4.2.1 Base Experiments

When we run the base set of experiments over various workloads (shown in figure 4.22), we see the same exponential curve in response time as in the 64 node system.

In the 4,000 job workload, the CLI optimization ratio beats the other PNR designs by a significant margin (figure 4.24). While every other PNR design is between 5% and 7.5% optimization ratio, CLI approaches 15%. CLI also had the lowest average memory allocated and the lowest standard deviation, at 5.52 MB and 4.35 MB, respectively. DP had an average of 7.54 MB and a standard deviation of 5.23 MB; MAN, for instance, had an average of 6.28 MB and a standard deviation of 4.65 MB.

In the 5,000 job workload there appear to be scalability problems. CLI and BB8 are the only methods that experience significant performance enhancement (about 5%). The others (CEN, BB16, MAN) experience performance degradation.
This result is not unexpected, since the underlying network topology is a star. The bottleneck resource, the central switch, becomes more congested as the network becomes larger and as more messages are sent. As observed earlier, PNR is very sensitive to poor network performance, so it makes sense that PNR performs poorly here. CLI has a natural advantage in larger networks because individual peers do not synchronize with each other. With every other design, coordination overhead must be incurred.

Interestingly, each PNR design had a lower standard deviation in memory usage than DP. Each PNR design also used less disk space than DP. However, some PNR designs had a significantly higher average memory allocation. All PNR designs still experienced fewer page faults. The number of page faults ranged from 50% to 80% that of DP.

Figure 4.26: Base experiment - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.27: RAM experiments - 128 PE - 4,000 jobs - response time.

Figure 4.28: RAM experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.29: RAM experiments - 128 PE - 5,000 jobs - response time.

Figure 4.30: RAM experiments - 128 PE - 5,000 jobs - optimization ratio.

4.2.2 RAM Variations

Note that data is missing at the 65% and 50% RAM ratios for both the 4,000 and 5,000 job workloads due to simulation time constraints.

For the 4,000 job workload (shown in figures 4.27 and 4.28), PNR converges as expected as RAM is increased. As in the 64 node cluster, large performance increases are observed for the 85% and 75% RAM ratios (approaching 100% and 50% improvement, respectively). The improvement at the 75% RAM ratio is not as pronounced as in the 64 node system. Interestingly, the same backbone variant experiences a performance hit in both the 64 node and 128 node systems. The performance penalty is more pronounced in the 128 node system.

For the 5,000 job workload (figures 4.29 and 4.30), we observe the PNR optimization ratios converging as expected. However, at the 125% RAM ratio, there is a large performance penalty felt by all of the PNR designs (with CLI affected the least). All PNR designs except CLI experience at least a -100% optimization ratio. All designs improved average response time from 115% to 125%, but DP improved more than each PNR method. It is not known whether this is a statistical fluke or something more interesting. More tests need to be performed.

As RAM becomes more scarce, the optimization ratios of the PNR methods do not reach as high as in the 64 node network (not much above 50%).
As in the 64 node network, the best performance gains are observed at the 85% and 75% RAM ratios.

4.2.3 Network Performance Variations

For the 4,000 job workload, PNR performance improves as soon as network performance is improved. For network bandwidths of 1,000 Mbps and higher, the average response time of each PNR design is equivalent (figure 4.31). Each PNR design is superior to using DP. The optimization ratio settles at about 12.5% (figure 4.32).

Figure 4.31: Network experiments - 128 PE - 4,000 jobs - response time.

Figure 4.32: Network experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.33: Network experiments - 128 PE - 5,000 jobs - response time.

For the 5,000 job workload, each method improves dramatically with higher bandwidth. Similar to the 64 node system, each PNR design outperforms DP at network bandwidths of 1,000 Mbps and higher. BB8 and BB16, in particular, have optimization ratios in the 50-75% range at bandwidths 1,000 Mbps and 10,000 Mbps (figure 4.34). However, they lose their lead at 100,000 Mbps, while CEN, BB32 and MAN all gain. It is difficult to say what the general trend here is, other than that better network performance tends to elevate the performance of each PNR design.

4.2.4 Network Topology Variations

For the 4,000 job workload, we do not show the results for the bus topology in figures 4.35 or 4.36. This is because the bus response times were orders of magnitude higher than those of the star and connected topologies. For the CEN method, for instance, the optimization ratio was a staggering -2,500%! Obviously the bus network does not scale well, and PNR is hit hard by this. With the connected topology, each method is close to equivalent.

Figures 4.37 and 4.38 show the 5,000 job workload results. Just as in the 4,000 job workload, the bus network causes all PNR designs to behave poorly. CLI is least affected, and CEN experiences a huge performance hit - almost reaching -300% (figure 4.38)! Interestingly, this penalty is less severe than the corresponding one in the 4,000 job workload. In the 4,000 job workload, the PNR methods all use less disk space than they do in the 5,000 job workload. We speculate that the bus network actually drives average network service time higher than disk service time, and that the PNR methods in the 4,000 job workload use network RAM more than in the 5,000 job workload.

The connected network gives results similar to the high bandwidth experiments. Here, all PNR designs have better performance than DP. BB8 and BB16 are the clear winners, with optimization ratios greater than 50%.

Figure 4.34: Network experiments - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.35: Topology experiments - 128 PE - 4,000 jobs - response time.
Figure 4.36: Topology experiments - 128 PE - 4,000 jobs - optimization ratio.

Figure 4.37: Topology experiments - 128 PE - 5,000 jobs - response time.

Figure 4.38: Topology experiments - 128 PE - 5,000 jobs - optimization ratio.

Figure 4.39: Space sharing experiments - 128 PE - 4,000 jobs - response time.

Figure 4.40: Space sharing experiments - 128 PE - 4,000 jobs - optimization ratio.

4.2.5 Space Sharing Scheduler Experiments

The results for both workloads echo those of the 64 node network and are shown in figures 4.39, 4.40, 4.41, and 4.42. In both sets of results, the designs are equivalent at RAM ratio 100%, CLI has an advantage at RAM ratio 75%, and CLI loses its advantage at RAM ratio 50%. At RAM ratios 75% and 50%, all non-CLI PNR designs experience almost identical performance.

Figure 4.41: Space sharing experiments - 128 PE - 5,000 jobs - response time.

Figure 4.42: Space sharing experiments - 128 PE - 5,000 jobs - optimization ratio.

Chapter 5

Analysis

5.1 Discussion

We have described our results in chapter 4. We now interpret these results to discover the general models of performance PNR follows. We also identify platforms and configurations on which PNR would be useful.

5.1.1 Memory Load

Given our observations, it is clear that both PNR and DP follow an exponential curve as memory load is changed. As memory load increases, PNR and DP both tend toward infinite response times. As memory load decreases, PNR and DP converge on a constant. PNR does not offer a fundamentally different performance curve, but PNR's curve is lower than DP's by a constant factor.

This model implies that adding PNR to lightly loaded systems (with high performance networks) should not harm performance, and that under moderate memory loads PNR can lead to large improvements over disk paging. As observed, this improvement can approach 100%. However, as memory becomes too scarce to share, PNR performance will converge to DP performance.

One surprising result of our simulations is that the proposed PNR designs do not necessarily smooth out memory usage as measured by the standard deviation of system memory usage. For some experiments, PNR memory usage was even more non-uniform than DP's memory usage. This indicates that more work must be done to ensure PNR itself does not create more overloaded nodes.
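Stated compactly (in our own notation; the thesis does not give a closed form), the model above says that for memory load m the observed behavior is consistent with

    T_{\mathrm{DP}}(m) \approx a\, e^{b m}, \qquad
    T_{\mathrm{PNR}}(m) \approx c \cdot T_{\mathrm{DP}}(m), \qquad 0 < c < 1,

where both curves diverge as m grows, flatten toward a constant as m shrinks, and c drifts toward 1 once memory becomes too scarce for any node to lend RAM - the point at which the two methods converge.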
Also, some of the exceptions to our proposed model are troubling. Specifically, the 125% RAM ratio experiment on the 128 node cluster unexpectedly had most PNR methods at an optimization ratio of -100% or worse. Other experiments had occasional incidents where all PNR methods do very well except for one (e.g., the 85% RAM ratio on the 128 node system with the 4,000 job workload). To determine whether this is a problem with PNR or just an artifact, we need to re-run our experiments with different random number seeds, different portions of our collected workload, and different workloads.

5.1.2 Network

PNR is very sensitive to network performance. The general shape of the results is another exponential curve, where PNR response time tends toward infinity as network service time is increased and converges to some constant as service time decreases. Since low network performance results in low PNR performance, PNR should not be considered on systems with low bandwidth or high message round-trip times. A standard 100 Mbps network should be considered a minimum for satisfactory performance.

Topology is also an important consideration. Our experiments show that congestion is a major source of waiting time. In order for PNR to scale, it will have to run on a network that scales well as nodes are added. Bottlenecks, such as the central switch in the star topology, will quickly degrade PNR performance as nodes are added. Switching to a network topology that has no bottlenecks has the same effect as upgrading a topology with bottlenecks to a very high speed network. To get maximum performance, PNR should be used on a fast network with few (or no) bottlenecks.

When network performance is high and RAM is relatively plentiful, we observe that each PNR design has equivalent performance. This is because coordination of memory resources is not a crucial issue when RAM is plentiful, and coordination overhead is less important with a fast network. These are the two main differentiating factors among the PNR designs, and when they are eliminated, all designs perform similarly.

5.1.3 Scheduling

On a space sharing system, as RAM becomes more scarce, each PNR method except CLI appears to perform equivalently. The main limiting factor on the space sharing system is the coordination of network RAM allocation. Network speed is less important because PCSs are eliminated on this system, which eliminates a large portion of network activity (i.e., the large number of page faults after a PCS). This reduces overall congestion. All of the schemes that use coordination appear to work equally well in each space sharing system tested. Depending on the workload and cluster setup, the best performance gains can be anywhere between 15% and 60%. If only a light load is expected, CLI is the best choice for a space sharing system.

On gang scheduling systems, however, network performance is crucial to PNR performance. PCSs introduce large, regular bursts of network activity, and PNR designs that use the network to coordinate allocation suffer. When a high-performance network is available, however, PNR can produce pronounced performance gains. Under the 4,000 job workloads, the maximum speedup using PNR was typically in the 10-12.5% range, although when RAM became scarce, the optimization ratio could reach as high as 100%. For the 5,000 job workload, the maximum optimization ratio observed was 75%, but large ratios were more common than in the 4,000 job workload.
It is interesting to note that the absolute response times of DP and PNR under the space sharing scheduler are better than under the gang scheduler for the 5,000 job workload. Backlogged jobs introduce a huge paging penalty for both PNR and DP. However, for the 4,000 job workload, gang scheduling results in better response times, since no backlogging occurs and the benefits of time sharing outweigh the paging penalties.

For heavily loaded systems, PNR can significantly reduce the response time of jobs as compared to DP. However, it may make sense to attempt to avoid paging entirely by using a space sharing system instead. Using a small MPL in conjunction with PNR may produce results better than either alone. For moderately to lightly loaded systems, the advantages of time sharing outweigh the disadvantages of increased paging, and PNR is recommended to reduce average response times.

5.1.4 PNR Designs

CLI does surprisingly well in certain situations. If RAM is plentiful, CLI benefits since it requires very little overhead to allocate network RAM. When the network is slow, CLI performs better than its counterparts because it contributes relatively little to network congestion. If it is known that a system has a slow network and will be lightly loaded, CLI may be the best method to implement due to its simplicity and low overhead costs.

However, CLI performs poorly in heavy-load situations where RAM is scarce. In these situations, it becomes necessary to coordinate network RAM allocation to avoid wasting it. That leaves us with CEN, BBX, and MAN. When the network is fast, all three can potentially be used effectively. When the network is slow, only BBX and MAN should be considered, since CEN will introduce significant congestion on the network resources that service the central server.

It is difficult to choose between the BBX and MAN methods. There appears to be a tendency for BB16 to be the superior method in the 64 node network (1/4 of the nodes) and for BB8/BB16 to be superior in the 128 node network (1/16 and 1/8 of the nodes). However, no definite conclusion can be reached from the results gathered. We will tentatively claim that BBX has an advantage over MAN in slow networks because fewer messages must be broadcast for synchronization purposes.

If the network is fast and RAM is plentiful, then the PNR designs are equivalent and any one could be used. It may make sense to use CLI in this situation, since it eliminates the need for a manager (proxy) entirely. A summary of these recommendations is in table 5.1.

                Low Load    High Load
Fast Network    ANY         CEN, BBX, MAN
Slow Network    CLI         BBX

Table 5.1: Recommended PNR Design Usage.

5.2 Conclusion

In this thesis we identified a novel way of reducing page fault service time and better utilizing memory resources in a cluster system running parallel processes. This method, which we call Parallel Network RAM, uses remote idle RAM as another tier in the memory hierarchy for parallel jobs in clusters. We proposed several different PNR designs and evaluated the performance of each under different conditions.

We discovered that different designs are appropriate in different situations and that simply applying PNR to an existing system is no guarantee of increased performance. Applied correctly to lightly loaded systems, performance gains between 10% and 25% can be achieved. Applied correctly to heavily loaded systems, performance gains can be as high as 100%. Applied incorrectly, PNR can result in extreme performance losses.
The largest loss observed in our simulations was approximately -2,500%.

Parallel Network RAM is probably best applied in conjunction with other methods to reduce page load. For instance, a combination of job migration, Parallel Network RAM, heuristics used to guess memory usage at the scheduler, and other scheduler techniques would likely result in better performance than PNR alone.

5.3 Future Work

This thesis only took a survey of PNR performance over a variety of cluster configurations and did not focus on any particular experiment. Our experiments' statistical significance should be confirmed by re-running them with different random number seeds and with different subsets of the workload trace.

A surprising result of our work was that no single PNR design worked well under all conditions. This indicates that, to create a single, unified PNR design that works under a variety of cluster conditions, a design that can intelligently change its behavior based on current cluster conditions should be developed. Such an algorithm could simply use the designs proposed in this thesis, but switch among them based on the current state of the cluster (see the sketch at the end of this section). This would make the implementation of PNR much easier, since only one system would need to be developed for all clusters.

This thesis only used a specific subset of a single trace from a particular parallel system. To our knowledge, traces such as the one we used that include memory information are very rare. It would be invaluable to collect a new trace on a modern parallel system running cutting-edge parallel programs. Not only would this allow us to compare how PNR acts on a modern system, it would also allow us to calibrate our simulator and ensure that the models used (e.g., the page fault models) are realistic.

The simulator only explores jobs that use a static amount of memory determined at runtime. It would be useful to know how PNR performs given a more realistic dynamic memory workload. The simulator also simulates only a very simplistic job synchronization model. A more intensive and complex model would give us more realistic results.

Our approach only allocates and deallocates network RAM at job start and stop time. It would be interesting to know whether allowing the PNR system to allocate and deallocate memory at additional times would lead to increased performance. A more periodic approach may better accommodate memory usage.

This approach does not consider job migration. A system which is allowed to choose between job migration and PNR to balance load would be very interesting and may potentially combine the advantages of both methods.

A comparison of PNR methods against methods which attempt to avoid paging by guessing memory usage would also be interesting. One method may be superior to the other or, more likely, the combination of the two may yield increased performance.
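As a concrete illustration of the adaptive idea above, a selector could map observed cluster state onto the recommendations of table 5.1. This is a sketch only; the thresholds are placeholders for illustration, not values derived from our experiments:

    def choose_pnr_design(memory_load, bandwidth_mbps,
                          heavy_load=0.85, fast_network_mbps=1000):
        """Pick candidate PNR designs per the recommendations of table 5.1.

        memory_load    -- fraction of total cluster RAM currently requested
        bandwidth_mbps -- link bandwidth of the cluster network

        Both thresholds are illustrative placeholders; a real implementation
        would calibrate them against the target cluster.
        """
        fast = bandwidth_mbps >= fast_network_mbps
        if memory_load >= heavy_load:
            # Coordination pays off under memory pressure; avoid CEN's
            # central bottleneck when the network is slow.
            return ("CEN", "BBX", "MAN") if fast else ("BBX",)
        # Light load: CLI's low overhead wins on a slow network; on a fast
        # network any design performs about equally well.
        return ("ANY",) if fast else ("CLI",)

    print(choose_pnr_design(memory_load=0.95, bandwidth_mbps=100))  # ('BBX',)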