This is to certify that the dissertation entitled PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS presented by Yung-Kang Chu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Electrical Engineering.

PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS

By Yung-Kang Chu

A DISSERTATION submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Electrical Engineering, 1995

ABSTRACT

PROCESSOR ALLOCATION AND JOB SCHEDULING IN TWO-DIMENSIONAL MESH SYSTEMS

By Yung-Kang Chu

Mesh-connected parallel architectures have become increasingly popular in the design of multiprocessor systems in recent years. Many partitionable two-dimensional mesh systems have been built or are currently being developed. To allow the best usage of these systems, effective submesh allocation and job scheduling schemes are desirable. Continuous allocations and deallocations of various sized submeshes result in a fragmented mesh system. The goal of this research is to reduce the system fragmentation and the mean job response time. We have developed efficient and effective submesh allocation and scheduling schemes for two-dimensional mesh systems. A simulation tool that we developed is used to evaluate the performance of existing schemes as well as our proposed schemes. The tool provides a flexible graphical user interface for performing and analyzing simulations. The new schemes deliver better performance than the best existing schemes.

© Copyright 1995 by Yung-Kang Chu. All Rights Reserved.

To my parents

ACKNOWLEDGMENTS

I would like to express my appreciation and acknowledge the contributions of the following persons. This dissertation could not have been completed without their participation and help. First, I would like to thank Dr. Diane T. Rover, my major advisor. I have benefited from her involvement throughout my doctoral work. Her very positive influence on my growth as a professional and as a person is invaluable. I could not thank her enough for her guidance. I would also like to thank the other members of my doctoral program guidance committee: Dr.
David Fisher, Dr. Lionel Ni, and Dr. David Yen, for their gener- ous comments and participation. They have helped to improve the quality of this dissertation both in its technical content and organization. I am indebted to Dr. I—Ling Yen for her efforts during the early phases of this research. Her insightful discussion with me helped me to motivate this work. Finally, thanks go to my family, friends, and colleagues. Their love and support accompanied me through the years of this work. TABLE OF CONTENTS LIST OF TABLES viii LIST OF FIGURES ix 1 Introduction 1 1.1 Problem Statement .............................. 2 1.2 Motivation and Research Goal ........................ 3 1.3 Organization ................................. 5 2 The Problem and Its Background 6 2.1 Job Scheduling and Processor Allocation Problems in Multiprocessor Sys- tems .................................... 8 2.2 The 2D Mesh Architecture .......................... 10 2.3 Problem Description ............................. 13 2.4 Submesh Allocation Strategies ........................ 13 2.4.1 Two-Dimensional Buddy System ..................... 14 2.4.2 Frame Sliding Strategy ........................... 15 2.4.3 First Fit Strategy .............................. 16 2.4.4 Best Fit Strategy .............................. 18 2.4.5 Adaptive Scan Strategy .......................... 19 2.4.6 Busy List Strategy ............................. 20 2.4.7 Lookahead Strategies ............................ 21 2.4.8 Non-contiguous Processor Allocation Algorithms ............ 24 2.5 Dynamic Job Scheduling Schemes ...................... 25 2.5.1 Scan Scheduling Schemes for Hypercubes ................. 25 2.5.2 Lazy Scheduling Scheme for Hypercubes ................. 26 2.5.3 Scan Scheduling Scheme for 2D Meshes .................. 26 2.5.4 Priority Scheduling Scheme for 2D Meshes ................ 27 2.6 Performance Evaluation Tools for Processor Allocation and Scheduling . 29 3 Performance Evaluation Methodology and Tool 32 3.1 Simulation Description ............................ 33 3.2 Workload Characterization .......................... 34 3.3 Performance Metrics ............................. 36 3.4 Simulation and Experiment Analysis Tool ................. 39 vi vii 4 Cost-Performance Assessment of Schemes 48 4.1 Processor Allocation Strategies ....................... 49 4.1.1 Performance Analysis ........................... 49 4.1.2 Complexity Analysis ............................ 52 4.2 Scheduling Schemes .............................. 57 5 Job Scheduling Schemes 58 5.1 Immediate Fit Strategy ........................... 59 5.2 Scan All Strategy ............................... 61 5.3 Performance Results ............................. 61 5.4 Discussion ................................... 72 6 Job Scheduling Using Multiple Queues 74 6.1 Job Scheduling Schemes ........................... 75 6.2 Multiple Queues Scheme ........................... 77 6.3 Performance Results ............................. 81 6.3.1 Performance of Variations of Multiple Queues Scheme ......... 81 6.3.2 Comparing Multiple Queues Scheme with Other Schemes ........ 87 6.4 Discussion ................................... 92 7 Estimated Execution Time Allocation Strategies 95 7.1 Basic Idea and Heuristics .......................... 96 7.2 Detailed Algorithms ............................. 98 7.3 Performance Results ............................. 101 7.4 Discussion ................................... 
107 8 Busy Distance Inverse Allocation Strategy 110 8.1 Basic Ideas .................................. 111 8.2 Best-Fit Heuristic ............................... 114 8.3 Candidate Submesh Generation ....................... 116 8.4 Detailed Algorithm .............................. 117 8.5 Performance Results ............................. 118 8.6 Discussion ................................... 123 9 Conclusions 124 9.1 Contributions ................................. 125 9.2 Future Work ................................. 127 BIBLIOGRAPHY 128 3.1 3.2 4.1 4.2 LIST OF TABLES Submesh allocation strategies supported by the simulator. ........ 40 Job scheduling schemes supported by the simulator ............. 41 Limitations of existing submesh allocation strategies. ........... 53 New submesh allocation strategies. ..................... 56 viii 2.1 3.1 3.2 3.3 3.4 3.5 3.6 3.7 4.1 4.2 4.3 4.4 4.5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 6.1 6.2 6.3 LIST OF FIGURES A two dimensional mesh M (5, 4). ...................... 12 The Sched-All simulator ............................ 41 The Sched—All experiment planner. ..................... 43 The Sched-All experiment analyzer ...................... 44 An example data file .............................. 45 The generated GNUPLOT program. .................... 46 The generated plot (screen dump). ..................... 46 The generated plot ............................... 47 Mean response time vs. load (w and h: uniform distribution). ...... 50 Mean response time vs. load (w and h: increasing distribution). ..... 51 Mean response time vs. load (w and h: decreasing distribution) ...... 51 Performance vs. complexity of submesh allocation strategies ........ 54 Performance vs. complexity of submesh allocation strategies ........ 56 The IF scheduling scheme ........................... 60 The SA scheduling scheme. ......................... 62 System utilization vs. load (w and h: uniform distribution). ....... 63 Mean response time vs. load (w and h: uniform distribution). ...... 65 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 65 Mean response time vs. load (w and h: decreasing distribution) ...... 67 Standard deviation of response time vs. load (w and h: decreasing distri- bution). .................................. 67 Mean response time vs. load (w and h: increasing distribution). ..... 68 Standard deviation of response time vs. load (w and h: increasing distri- bution). .................................. 68 Mean response time vs. job size (w and h: uniform distribution). . . . . 70 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 70 Mean and standard deviation of response time vs. waiting time limit (w and h: uniform distribution). ...................... 71 The Multiple Queues scheduling scheme. .................. 80 Mean response time vs. # queues, load=0.5, w and h: uniform distribution. 82 Mean response time vs. # queues, load=0.5, (w and h: uniform distribu- tion) ..................................... 83 ix 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 6.15 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 7.10 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 X Mean response time vs. # queues, load=0.5, (w and h: decreasing distri- bution). .................................. 84 Mean response time vs. # queues, load=0.5, (w and h: increasing distri- bution). .................................. 84 Mean response time vs. 
# queues, load=0.7, (w and h: uniform distribu- tion) ..................................... 85 Mean response time vs. # queues, load=0.7, (w and h: decreasing distri- bution). .................................. 86 Mean response time vs. # queues, load=0.7, (w and h: increasing distri— bution). .................................. 86 Mean response time vs. load (w and h: uniform distribution). ...... 88 Mean response time vs. load (w and h: decreasing distribution) ...... 88 Mean response time vs. load (w and h: increasing distribution). ..... 89 Mean response time vs. load (w and h: uniform distribution). ...... 90 Mean response time vs. load (w and h: decreasing distribution) ...... 90 Mean response time vs. load (w and h: increasing distribution). ..... 91 Mean response time vs. waiting time limit (w and h: uniform distribution). 92 A two-dimensional mesh M (8, 8). ...................... 97 The EET-SB allocation strategy. ...................... 99 The EET-SC allocation strategy. ...................... 100 Mean response time vs. load (w and h: uniform distribution). ...... 102 Mean response time vs. load (w and h: increasing distribution). ..... 103 Mean response time vs. load (w and h: decreasing distribution) ...... 104 Mean response time vs. load (w and h: uniform distribution). ...... 105 Mean response time vs. load (w and h: decreasing distribution) ...... 106 Mean response time vs. load (w and h: uniform distribution). ...... 106 Mean response time vs. load (w and h: decreasing distribution) ...... 107 A two dimensional mesh M(10, 10) ...................... 112 Busy distance vectors. ............................ 115 The Busy Distance Inverse strategy ...................... 118 Mean response time vs. load (w and h: uniform distribution). ...... 119 Standard deviation of response time vs. load (w and h: uniform distribu- tion) ..................................... 120 Mean response time vs. load (w and h: increasing distribution). ..... 121 Standard deviation of response time vs. load (w and h: increasing distri- bution). .................................. 121 Mean response time vs. load (w and h: decreasing distribution) ...... 122 Standard deviation of response time vs. load (w and h: decreasing distri- bution). .................................. 123 Chapter 1 Introduction Massively parallel computers (MPCS) with many processors are considered by many as the most promising technology to achieve high computational power. MPCs speed up computations and simulations, and moreover, they can solve a much larger problem size than other platforms can [31, 32]. A technologically feasible mean of building MPCs is using distributed memory and message-passing networks. A distributed memory parallel computer system consists of many nodes where each node has its own processor(s), local memory, and routing devices. In systems we considered in this work, each node has only one processor, so we use the terms node and processor in a topology or network interchangeably in this dissertation. Processors are connected to one another based on a certain network topology. For example, in a two-dimensional (2D) mesh system, each processor is connected to its four nearest orthogonal neighbors. The mesh topology has become increasingly popular in the design of multiprocessor systems as wormhole routing techniques have advanced [50]. It has advantages such as simple and regular structure, 1 2 easy layout, and good scalability. Many 2D mesh systems have been built or are under development recently. 
The Intel Paragon [60], Intel Touchstone Delta [61], and Tera Computer System [2] are three examples. A partitionable system may be partitioned into smaller subsets of processors, where each subset can be allocated to independent jobs concurrently. After comple- tion of a job, the allocated processors are released for future jobs. For example, a partitionable 2D mesh system can be divided into submeshes of different sizes and run different jobs. To fully utilize such a system and maximize overall system throughput, we need to consider the problem of how to effectively schedule and allocate jobs to processors. 1 .1 Problem Statement Job scheduling problems may be divided into two categories, static and dynamic. The static job scheduling problem is to minimize the finish time of the last completed job for a given job list. For each job in the list, the required number of processors and its processing time are given (we use the terms processing time, execution time, and service time interchangeably in this dissertation). On the other hand, for the dynamic job scheduling problem, jobs arrive in a stream with a certain arrival rate and are placed in a global job queue. The job scheduler chooses the next job to be allocated based on some scheduling scheme, while the processor allocator tries to find a set of free processors for the job using a processor allocation strategy. In this dissertation, we are concerned with dynamic 3 job scheduling and processor allocation on two—dimensional mesh systems. Each job requests a fix sized rectangular submesh, and the mesh system is partitionable to run multiple jobs at the same time. 1.2 Motivation and Research Goal Similar to conventional memory allocation resulting in a fragmented memory, contin— uous allocation and deallocation of various sized submeshes in the 2D mesh system result in a fragmented mesh. The more fragmented a mesh is, the harder it is to find a free submesh for an incoming job. From the user’s point of view, this results in longer waiting delay and response time, Conversely, from the system’s point of view, this results in lower system throughput and utilization. Our goal in this research is to improve the overall system utilization and reduce the mean job response time for the dynamic job scheduling/ processor allocation problem in 2D mesh systems. To achieve this goal, we have several approaches, all of which are based upon identify- ing potential improvements in existing scheduling and allocation schemes. First, we propose to develop a more effective and efficient processor allocation strategy that can further reduce the system fragmentation and mean job response time by finding a best-fit location for each job. Then, we plan to develop advanced job scheduling schemes other than the First-Come-First-Served (FCFS) policy to avoid the potential performance loss due to its blocking effect. Under FCFS scheduling policy, the job dispatcher allocates the incoming job only if the waiting queue is empty and a free submesh is found; otherwise it puts the job at the end of the waiting queue. Upon 4 the completion of a job, the processors in the submesh that were allocated to the job are released and the dispatcher checks if jobs in the front of the waiting queue can be allocated. When a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system and resulting a longer waiting delay and wasted resources. 
Further, we have observed that the different completion times of jobs in the system can result in busy processors being arbitrarily scattered throughout the mesh, thus causing large external fragmentation. As indicated by Feitelson and Nitzberg [29], about two thirds of the jobs running on parallel systems are programs submitted re- peatedly. The approximate execution time of these programs can be quite accurately estimated by the users. By providing an estimated execution time to the proces- sor allocator, the system can allocate jobs with similar finishing times in the same areas. We will investigate the effect of using estimated execution times to guide sub- mesh allocation and develop processor allocation strategies that utilize the estimated execution time information. In order to achieve all of the above research goals, a simulation tool that supports experimentation and comparison of performance is needed to help in evaluating ex- isting schemes and developing new ones. We will develop a tool for simulation and performance analysis and display. Using this tool, we can compare various schemes under different load conditions and job characteristics. 1.3 Organization The rest of this dissertation is organized as follows. Chapter 2 gives the background and discusses the problems of job scheduling and processor allocation in multiproces- sor systems in more details. Related research is also summarized in this chapter. In Chapter 3, we present the performance evaluation methodology, describe the simula- tion model and the workload, and introduce the software tool that we have developed for simulation and performance evaluation. Then, we present simulation results of the performance of existing submesh allocation strategies, and assess cost-performance of the strategies in Chapter 4. In Chapter 5, we present two new job scheduling schemes called Immediate Fit and Scan All [11]. They dramatically improve the performance of the FCFS scheme. Then, we present an advanced job scheduling scheme using multiple queues in Chapter 6. It gives further performance improvement. In Chapter 7, we present a family of new submesh allocation strategies called Estimated Exe- cution Time strategies [12], which utilize estimated execution time information for better processor allocation. In Chapter 8, we present a new submesh allocation strat- egy called Busy Distance Inverse strategy [9], which gives the lowest time complexity and yields the best performance among all submesh allocation strategies that do not utilize estimated execution time information. Finally, we give the conclusions and future research in Chapter 9. Chapter 2 The Problem and Its Background Scheduling is a classical field with several interesting problems. In general, the scheduling problem assumes a set of resources and a set of consumers serviced by those resources according to a certain policy. A job in a factory and a customer in a bank are examples of consumers. A machine in a factory and a teller in a bank are examples of resources. The goal of this problem is to find an efficient policy to determine a sequence and an assignment of consumers to resources to optimize some performance measures. Generally speaking, determining a sequence or order is equivalent to scheduling, and determining an assignment, allocation. First-Come— First-Served (FCF S) is an example of a scheduling policy. Scheduling policy per- formance varies with different circumstances. 
While FCFS may be appropriate in a bank environment, it may not be the best policy to be applied to jobs in a factory. In the arena of parallel and distributed computing, scheduling problems have gained the attention of many researchers. One interesting problem area is concerned with scheduling parallel program tasks on parallel and distributed systems. In this 6 7 problem, a program can be viewed as a collection of tasks which may run serially or in parallel. The tasks are the consumers and are represented using directed graphs called task graphs. The processing elements (or processors) are the resources and their interconnection networks are represented using undirected graphs. An optimal schedule determines both the allocation and the execution order of each task to op- timize some performance criterion, for example, the maximum finishing time of a set of programs. Interested readers can refer to [28] for more details of this problem. Another interesting problem area is concerned with job scheduling and processor allocation on multiprocessor systems. In this problem, each job requires a fixed number of processors and has a fixed execution time. Because of the various topologies of target machines, we must develop and apply different processor allocation and job scheduling schemes on different topologies. In this dissertation, we are concerned with the job scheduling and processor allocation problem on a popular topology called the two-dimensional (2D) mesh. In this chapter, we first introduce the problems of job scheduling and processor allocation on multiprocessor systems in Section 2.1. Then we discuss the 2D mesh architecture in Section 2.2. In Section 2.3, we describe the dynamic job schedul- ing/ processor allocation problems on 2D mesh systems. Sections 2.4 and 2.5 summa- rize previous submesh allocation strategies and job scheduling schemes, respectively. In Section 2.6, recent software tool development in this area is introduced. 8 2.1 Job Scheduling and Processor Allocation Problems in Multiprocessor Systems Generally speaking, job scheduling and allocation problems can be divided into two categories: static and dynamic. The objective of static approaches is to minimize the finishing time of the last completed job for a given job list. For each job in the list, the required number of processors and its processing time are given. On the other hand, the dynamic approaches assume a stochastic stream of arriving jobs, with no a priori information about their characteristics. Furthermore, the approaches have targeted hypercube and low-dimensional meshes, which are popular topologies in the design of multiprocessor systems. In a hypercube system, each job requires a subcube of a fixed dimension. In a 2D mesh system, each job requires a rectangular submesh. Effective allocation strategies are needed to find free subcubes and submeshes, respectively, to reduce fragmentation. Static job scheduling problems in multiprocessor systems have been studied ex- tensively. The multiprocessor scheduling problem [30] is to find an optimal non- preemptive schedule of n independent jobs on m identical processors to minimize the finishing time of the last completed job. In this problem, each job has a fixed execution time t and requires only one processor. This problem has been proven to be NP-hard for m 2 2. This problem is equivalent to a variant of the bin packing problem [17], which is to minimize the maximum height of m bins required to pack a given list of n objects of various heights. 
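Because the problem is NP-hard for m ≥ 2, practical schedulers rely on heuristics. As a concrete illustration (this sketch is ours and is not taken from any of the cited works; job_times and the machine count m are its only inputs), the classical longest-processing-time-first rule sorts the jobs by decreasing execution time and greedily places each one on the processor that currently finishes earliest:

import heapq

def lpt_schedule(job_times, m):
    """Greedy LPT list scheduling: assign each job (longest first) to the
    machine with the smallest current finishing time.  Returns the makespan
    and the per-machine assignment of job indices."""
    machines = [(0.0, i) for i in range(m)]        # (finish_time, machine_id)
    heapq.heapify(machines)
    assignment = [[] for _ in range(m)]
    for job, t in sorted(enumerate(job_times), key=lambda x: -x[1]):
        finish, mid = heapq.heappop(machines)      # least-loaded machine so far
        assignment[mid].append(job)
        heapq.heappush(machines, (finish + t, mid))
    makespan = max(f for f, _ in machines)
    return makespan, assignment

# Example: seven jobs scheduled on three identical processors.
print(lpt_schedule([5, 7, 3, 9, 2, 4, 6], 3))

A heuristic of this kind does not guarantee an optimal finishing time, but it illustrates the greedy assignment style that many of the schemes surveyed below follow.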
The bin packing problem is generalized to the two-dimensional packing problem 9 [16, 4, 15], which is to pack a list of n rectangles of various widths and heights into an open-ended rectangle of a certain width W to minimize the overall height. Here, the packing is oriented, i.e., a rotation of 90° (interchange width w and height h) is not allowed. This problem is analogous to statically scheduling a list of n jobs non—preemptively, where each job requires w units of contiguous block of resource for h units of time, in a system having a partitionable resource of W units arranged contiguously in a linear fashion. Another generalization of the packing problem is the three-dimensional packing problem [42], which is to pack a list of n rectangular boxes into a big rectangular box having infinite height and a bottom of size W x H and minimize the overall height of the packing. Each box has a fixed size of w X h x t, where w x h is the bottom size and t is the height of the box. Here the packing is oriented in the time dimension, but may not be oriented in the other two dimensions. In other words, a rotation of 90° of small boxes in the bottom is allowed. This problem is analogous to statically scheduling/ allocating jobs in a 2D mesh system of size W x H, where each job requires a rectangular submesh of fixed width w, fixed height h, and processing time t. The 2D mesh architecture is described in more detail in the next section. Several static job scheduling and subcube allocation schemes for hypercube sys— tems have been proposed [6, 7, 58, 59]. Several static job scheduling and submesh allocation schemes have been proposed. Li and Cheng proposed a mesh partition- ing strategy, layer by layer (LL), and a static scheduling policy, longest processing time first, in [41, 43]. Li and Cheng also proposed another scheme, Two-Dimensional Buddy System (2DBS), for mesh partitioning in [44, 47]. 10 For the static job scheduling problem in a hypercube system, each job requires a subcube of a fixed dimension. A subcube allocation strategy is needed for the processor allocator to find a free subcube. Several static job scheduling and subcube allocation schemes for hypercube systems were proposed in [59, 58, 6, 7]. For the static job scheduling problem in a 2D mesh system of size W x H, each job requires a rectan- gle submesh of fixed width w, height h, and a processing time of t. This problem is equivalent to the three-dimensional packing problem [42], In real applications, jobs ar- rive randomly in a stochastic stream. A dynamic job scheduling/ processor allocation scheme, which does not require information about future jobs, is more realistic than a static scheme, and it is more applicable to computer systems. A number of dynamic processor allocation strategies using the First-Come-First-Served (FCFS) scheduling scheme have been proposed for hypercube systems [8, 26, 27, 1, 36, 37, 14, 19, 22]. Before we discuss the dynamic job scheduling/processor allocation schemes for 2D mesh systems, we introduce the 2D mesh architecture in more detail in the next sec- tion. We will discuss more dynamic scheduling schemes for hypercubes and meshes in Section 2.5. 2.2 The 2D Mesh Architecture Mesh architectures have become increasing popular in the design of multiprocessor systems because of their advantages such as simple structure and good scalability. In particular, low-dimensional meshes are favored over hypercubes in certain systems. Formally, an n-dimensional mesh has k0 x k1 x . . . 
x k_{n-1} nodes, k_i nodes along each dimension i, arranged in an n-dimensional grid. A hypercube, or a binary n-cube, is a special case of the n-dimensional mesh in which k_i = 2 for all i, 0 ≤ i ≤ n - 1. A two-dimensional mesh consists of W x H nodes arranged in a two-dimensional grid. The bisection density, that is, the product of the bisection width and the channel bandwidth, can be used as a measure of the network cost, where the bisection width is the minimum number of channels that must be removed to partition the network into two equal subnetworks. Consider a 2^n x 2^n 2D mesh and a 2n-cube having the same number of nodes N = 2^(2n). The bisection widths of these two topologies are 2^n and 2^(2n-1), respectively. If both networks have the same bisection density, then the channel bandwidth of the 2D mesh is 2^(n-1) times larger than that of the 2n-cube. In other words, for the same number of processors and the same network cost, the 2D mesh can offer higher channel bandwidth than a hypercube does. However, a 2D mesh has a larger diameter, which is the maximum distance between two nodes. For systems in which the network latency depends on the path length, the hypercube is a better choice. For systems that support wormhole routing, the network latency is almost independent of the path length when there is no contention and the message length is relatively large. Low-dimensional meshes have become increasingly popular for such systems since the disadvantage of large diameter diminishes as wormhole routing techniques have advanced [50]. In the remainder of this section, we focus on the 2D mesh architecture and some necessary definitions.

A 2D mesh, denoted by M(W, H), consists of W x H nodes arranged in a W x H two-dimensional grid. Each node represents a processor and each edge is a communication link. The node in column i and row j can be represented by the address < i, j >. A non-boundary node < i, j >, where 0 < i < W - 1 and 0 < j < H - 1, is connected to the four neighboring nodes < i ± 1, j > and < i, j ± 1 > (a boundary node has fewer neighbors). A two-dimensional submesh in M(W, H), denoted by S(w, h), is a subgrid of M(W, H) such that 1 ≤ w ≤ W and 1 ≤ h ≤ H. The address of a submesh is denoted by a quadruple < x1, y1, x2, y2 >, where < x1, y1 > indicates the lower left corner and < x2, y2 > indicates the upper right corner of the submesh. The base of a submesh is the node at the lower left corner. Figure 2.1 shows a rectangular mesh M(5, 4). Nodes < 2,1 >, < 3,1 >, < 4,1 >, < 2,2 >, < 3,2 >, and < 4,2 > form a submesh S(3, 2) with address < 2, 1, 4, 2 >.

Figure 2.1: A two-dimensional mesh M(5, 4).

A node or submesh is busy (allocated) if it is assigned to a job. A node or submesh is free if it is not busy. A free submesh is a submesh whose nodes are all free. A busy (allocated) submesh is a submesh whose nodes are all assigned to a job.
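To make these definitions concrete, the short sketch below (our illustration, not part of any cited strategy; the busy-array representation anticipates the First Fit strategy of Section 2.4.3) stores the allocation state of M(W, H) as a W x H busy array and tests whether the submesh with a given address is free:

def make_mesh(W, H):
    """Busy array for M(W, H): busy[y][x] is True if node <x, y> is allocated."""
    return [[False] * W for _ in range(H)]

def is_free(busy, x1, y1, x2, y2):
    """True if every node of the submesh with address <x1, y1, x2, y2> is free."""
    return all(not busy[y][x]
               for y in range(y1, y2 + 1)
               for x in range(x1, x2 + 1))

def allocate(busy, x1, y1, x2, y2):
    """Mark the submesh busy (the caller has already verified that it is free)."""
    for y in range(y1, y2 + 1):
        for x in range(x1, x2 + 1):
            busy[y][x] = True

# The S(3, 2) submesh of Figure 2.1, with base <2, 1> and address <2, 1, 4, 2>.
mesh = make_mesh(5, 4)
print(is_free(mesh, 2, 1, 4, 2))   # True: all nodes are free initially
allocate(mesh, 2, 1, 4, 2)
print(is_free(mesh, 2, 1, 4, 2))   # False: the submesh is now busy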
2.3 Problem Description

In this work, we deal with the dynamic job scheduling/processor allocation problem in a 2D mesh system. A two-dimensional mesh-connected system M(W, H) has W x H processors. Jobs arrive randomly and request submeshes of various widths and heights. The system tries to find a free submesh for incoming jobs following a specific scheduling scheme and a submesh allocation strategy. If a free submesh is found for a job, the system allocates the job to the free submesh. When a job is done, the system deallocates the job and releases the submesh. The goal of the job scheduling/processor allocation schemes is to minimize the internal and external fragmentation so that the system utilization is high and the mean job response time is low. The existing submesh allocation strategies and job scheduling schemes are discussed briefly in the following sections.

2.4 Submesh Allocation Strategies

The submesh allocation strategy plays an important role in improving the performance of a 2D mesh multiprocessor system servicing multiple jobs. Many attempts have been made to develop high-performance, low-complexity submesh allocation strategies. The existing dynamic submesh allocation strategies are discussed briefly in the following subsections.

2.4.1 Two-Dimensional Buddy System

The Two-Dimensional Buddy System (2DBS) proposed by Li and Cheng [45, 46] is derived from the conventional one-dimensional buddy allocation scheme [38]. The 2DBS is applicable only when the system is a square mesh with the length of each side being exactly a power of two. Also, it can allocate only square submeshes with side lengths that are exact powers of two. A square block (submesh) is specified as B(x, y, k), where < x, y > is the base of the submesh and 2^k is its size (side length). If k > 0, then B(x, y, k-1), B(x + 2^(k-1), y, k-1), B(x, y + 2^(k-1), k-1), and B(x + 2^(k-1), y + 2^(k-1), k-1) are also submeshes, which are buddies of each other. These four square submeshes of size 2^(k-1) form a larger submesh of size 2^k. For a mesh system of size 2^R (2^R x 2^R), R + 1 free block lists (FBLs) are maintained, and FBL_i contains all available blocks of size 2^i, where 0 ≤ i ≤ R.

For an incoming job requesting a w x h submesh, the system tries to allocate a square submesh of size 2^i for it, where i = ⌈log2(max(w, h))⌉. The search starts from FBL_i for the smallest j such that j ≥ i and FBL_j is not empty. If no non-empty FBL is found, the job returns to the queue front; otherwise, the first block B(x, y, j) is removed from FBL_j. If j > i, the block B(x, y, j) is split into four buddies of size j - 1: the job is allocated to the first buddy and the remaining three buddies are appended to FBL_(j-1), the splitting being repeated until a block of size i is obtained. The worst-case complexity of the allocation procedure is O(R) or O(log N), where N = W x H = 2^(2R) is the number of processors in the system.

When a job finishes, it releases the block B(x, y, k), which is appended to FBL_k. If all three other buddies of B are also in FBL_k, the four buddies are removed from FBL_k, combined into a block of size k + 1, and appended to FBL_(k+1). Further possible combinations of buddies are checked in the same way. The worst-case complexity of the deallocation procedure is also O(log N).

This strategy tends to give large internal fragmentation when the requested submesh size is not a power of two. Hence, a significant amount of processor resources could be wasted.
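The core of the 2DBS allocation step can be sketched as follows (our simplified illustration; recombination of buddies on deallocation is omitted, and tdbs_allocate is a name we introduce here):

import math

def tdbs_allocate(fbl, w, h):
    """2DBS allocation sketch.  fbl[k] lists the free square blocks (x, y) of
    side 2**k.  Returns (x, y, k) for the allocated block, or None."""
    i = math.ceil(math.log2(max(w, h, 1)))           # requested block size 2**i
    j = next((k for k in range(i, len(fbl)) if fbl[k]), None)
    if j is None:
        return None                                   # no free block is large enough
    x, y = fbl[j].pop()
    while j > i:                                      # split into four buddies
        j -= 1
        half = 2 ** j
        # keep the lower-left buddy; the other three buddies become free blocks
        fbl[j].extend([(x + half, y), (x, y + half), (x + half, y + half)])
    return (x, y, i)

# Example: an 8 x 8 mesh (R = 3) that initially holds a single free 2**3 block.
R = 3
fbl = [[] for _ in range(R + 1)]
fbl[R].append((0, 0))
print(tdbs_allocate(fbl, 3, 3))    # splits the 8x8 block and returns a 4x4 block
print(tdbs_allocate(fbl, 2, 2))    # splits one freed 4x4 buddy into 2x2 blocks

The first request illustrates the internal fragmentation noted above: a 3 x 3 job occupies a full 4 x 4 block.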
2.4.2 Frame Sliding Strategy

The Frame Sliding (FS) strategy proposed by Chuang and Tzeng [13] introduced the ideas of the busy set, the coverage set, and the reject set. The busy set is the collection of all busy submeshes. The coverage set is the union of the coverage submeshes of all busy submeshes, where the coverage submesh of a busy submesh with respect to an incoming requested submesh is a submesh whose nodes cannot serve as the base for the requested submesh. For an allocated submesh < x1, y1, x2, y2 >, the coverage submesh for an incoming job J(w, h) is < max(0, x1 - w + 1), max(0, y1 - h + 1), x2, y2 >, where max(i, j) returns the larger value of i and j. The reject set with respect to an incoming job J(w, h) for a mesh system M(W, H) consists of two reject submeshes, < W - w + 1, 0, W - 1, H - 1 > and < 0, H - h + 1, W - 1, H - 1 >, in the rightmost and the topmost parts of the mesh, respectively. Nodes in the reject set can never serve as the base of any free submesh for the incoming job.

For a job requesting a w x h submesh, the FS strategy first computes the coverage set and reject set according to w, h, and the busy set. Starting from the lowest leftmost free processor, it compares the base node of every candidate frame (submesh) in sequence against the coverage set and reject set. If this node belongs to any element of the two sets, the frame slides w nodes to the right. If it exceeds the right boundary, the frame moves up h nodes and then slides from right to left. The sliding continues until a free submesh is found or the entire mesh is searched. The FS strategy has a complexity of O(N_A x N / (w x h)), where N_A is the number of currently allocated submeshes.

The FS strategy can be applied to rectangular meshes with arbitrary width and height. It allocates a job to a free submesh with the same size as that of the job, thus eliminating internal fragmentation completely. However, due to the use of fixed vertical and horizontal strides, this strategy may not be able to find a free submesh even if there is one available.

2.4.3 First Fit Strategy

The First Fit (FF) strategy was proposed by Zhu [57]. Instead of using a busy set, a busy array B[W, H] is constructed according to each processor's status, where B[i, j] is 1 (0) if processor < i, j > is busy (free). Instead of computing coverage submeshes individually, a coverage array C[W - w + 1, H - h + 1] can be computed from the busy array, where C[i, j] is 0 if node < i, j > can be a base for the incoming job. The reject set is no longer needed because the dimensions of the coverage array are reduced by w - 1 and h - 1 in the x and y directions, respectively.

For an incoming request of a w x h submesh, the FF strategy computes the coverage array in two steps: (1) fill the coverage to the left of busy submeshes, and (2) fill the coverage below the busy submeshes. To fill the left coverage, the FF strategy scans each row in the busy array B[W, H] from right to left and initializes left_cov = W. At element B[i, j], if B[i, j] = 1, then it sets left_cov = max(i - w + 1, 0). Now, if i ≥ left_cov, then C[i, j] = 1; else C[i, j] = 0. To fill the bottom coverage, the FF strategy scans each column of the coverage array from top to bottom, starting at the leftmost column, and initializes btm_cov = H. At element C[i, j], if C[i, j] = 1, then it sets btm_cov = max(j - h + 1, 0). Now, if j ≥ btm_cov, then C[i, j] = 1; else C[i, j] = 0. The first zero found from the upper left corner of the coverage array is used as the base for the requested submesh. The FF strategy requires O(W x H) or O(N) time steps, where N is the number of processors in the system.

This approach eliminates the allocation miss encountered in the FS strategy. Moreover, if rotation of the requested submesh by 90° is allowed, the performance can be improved further. However, the FF strategy allocates a job to the first available submesh.
Thus, a large contiguous area may be divided and allocated for a small submesh request, resulting in potential external fragmentation. On the other hand, a best-fit strategy tries to allocate a job to a “best-fit” submesh such that large contiguous space can be retained for incoming jobs with large requested sizes. Consequently, a best fit method may improve the system utilization and, hence, the overall system performance. 18 2.4.4 Best Fit Strategy In [57], Zhu also proposed a Best Fit (BF) allocation strategy. Instead of always allocating the job to the submesh at the upper left corner of the mesh system, the BF strategy chooses among the available submeshes for allocation based on a certain heuristic that tries to allocate a job at the best corner to reduce the external frag- mentation. A corner is an element in the coverage array that serves as the end of a consecutive sequence of 0’s in both the row and column where it is located. The heuristic is to choose the corner in the coverage array with the largest number of busy neighbors (1’s) surrounding it. For a submesh request, the coverage array is computed following the procedure described for the FF strategy. The coverage array is then scanned as follows to identify corners: (1) Scan each row in the coverage array in any order and mark two end elements of each sequence of consecutive 0’s as left and right, respectively; and (2) Scan each column in the coverage array in any order, and for each consecutive sequence of 0’s in a column, mark the two end elements as bottom and top, respectively. If any top or bottom is also marked by left or right, it is a corner. The number of busy neighbors (1’s) of each corner is then computed. A corner located in one of the four corners of the coverage array is considered as having two additional 1’s surrounding it, while a corner located in the boundaries of the coverage array, e.g., C [0, 1], is considered as having one additional 1. The corner with largest number of busy neighbors is chosen for allocation. If there are ties, the corner with the smallest area is selected, where the area of a corner is the product of the numbers of consecutive 19 0’s in its row and column starting from the corner. The BF strategy also requires 0(W x H) or 0(N) steps. Although the BF strategy is designed to reduce the fragmentation and subse- quently to improve the system performance, it does not show a better performance compared with the FF strategy in most circumstances. Zhu indicated in [57] that the BF strategy may cause jobs being scattered in the mesh, while the FF strategy tends to allocate jobs toward the upper left corner. Consequently, the mesh becomes more fragmented when using the BF strategy. Thus, a best—fit submesh allocation strategy may perform worse than or similar to a first-fit strategy and scatter allocated submeshes throughout the mesh if the best-fit selection heuristic is inferior. 2.4.5 Adaptive Scan Strategy The Adaptive Scan (AS) strategy proposed by Ding and Bhuyan [25] is an improve- ment over the Frame Sliding strategy discussed in Section 2.4.2. It is the first strategy that allows the rotating of the submesh request by 90° if a free submesh is not found in the first scan. After constructing the coverage set according to the submesh request, it scans the entire mesh starting from the lowest leftmost node < 0,0 >, moving left to right, bottom to top, to node < W — w + 1, H — h + 1 > with adaptive horizontal strides to skip impossible nodes. The stride is determined as follows. 
If the current node belongs to a coverage submesh in the coverage set, let rm” be the maximal :1: index 20 of this coverage submesh. The scan jumps from the current node to the :1:me + 1 position in the same row until the x index exceeds W — w + l or a free submesh is found. If a free base is not found, it scans the next row from left to right until the y index exceeds H — h + 1 or a free submesh is found. If it can not find a free base after scanning the whole mesh, it rotates the requested submesh from S (w, h) to S (h, w) and scans the mesh one more time. The AS strategy has the worst case complexity 0f 0(N X NA). The AS strategy always finds a free submesh if there is one available, so it has complete submesh recognition capability, and it gives the best performance among all first-fit allocation strategies assuming an FCFS scheduling policy is used. 2.4.6 Busy List Strategy The Busy List (BL) strategy was proposed by Das Sharma and Pradhan in [21]. It is a best fit strategy with complete submesh recognition capability. In this algorithm, a list of all allocated submeshes, or busy list (busy set), is maintained. For an incoming job J (w, h), the algorithm generates initial candidate submeshes for both J (w, h) and J (h, w) along the four corners of each allocated submesh within the mesh boundaries. The initial candidate submeshes generated from each allocated submesh are then considered one by one to check for overlap with any of the other allocated submeshes. An initial candidate submesh slides along the boundary of the allocated submesh that generated it until a free submesh is found only if the candidate submesh overlaps with another allocated submesh. At the same time, the candidate 21 submesh increases its boundary value by the length of the common boundary if it is adjacent to some allocated submesh. The boundary value of a free node is the number of its busy neighbors. The boundary value of a free submesh is the sum of the boundary values of all the nodes in the submesh boundary. In other words, the boundary value of a free submesh is the number of busy neighbors it has. The algorithm also generates candidate submeshes which are at the four corners of the mesh system. However, they are discarded in case of an overlap with some allocated submesh. The best fit submesh is the one with the highest boundary value. The worst case complexity of the BL strategy is 0(Nfi). The consequence of the BL strategy is that jobs tend to gather either at the four corners of the mesh system or adjacent to allocated submeshes. However, if an initial candidate submesh does not overlap with any allocated submesh (i.e., it is free), it does not slide. So, the BL allocation strategy does not consider all free submeshes adjacent to the allocated submeshes. As indicated by Das Sharma and Pradhan [21], the BL strategy may not always be able to find the free submesh with the highest boundary value. It can be modified to consider all the boundary nodes of all allocated submeshes with the penalty of longer processing overheads. 2.4.7 Lookahead Strategies Bhattacharya and Tsai proposed the Lookahead strategies [5], which allow examining the queue of waiting jobs to retain the largest possible contiguous free space for large sized waiting jobs. When a submesh request J (w, h) arrives in the system M (W, H), 22 the prohibited region (coverage submesh) for each allocated submesh is computed. 
Then, starting with a single free region < 0, 0, W — w + 1, H — h + 1 >, the prohibited regions are subtracted from the free regions one by one. At each subtraction step, free regions may be divided into smaller free regions or be totally eliminated. The resulting free regions after completing the subtraction steps form the free space. If the free space is empty, no submesh is available. If one or more free regions are obtained, each processor in the free regions can be selected as the base for the submesh requested. The authors first proposed three allocation policies without looking into the wait- ing queue, namely, First-Fit (FF), Best-Fit (BF), and Worst-Fit (WF). First-Fit policy selects the free region with the least x-valued base. If there are ties, it chooses the one with the least y-value. The base of the selected free region serves as the base of the submesh allocating the the job. Best-Fit policy selects the free region with the least area, while Worst—Fit policy selects the free region with the largest area. The performance of these three policies are very similar. Each policy has its advantage under certain job characteristics and load. In the process of computing free regions, the resulting free regions are not unique. The subtraction order of the prohibited regions and the division orientation (hori- zontally or vertically) might lead to different results. Further, allocating a job to the smallest free region does not necessarily result in leaving largest possible free space for later jobs. Actually, the smallest free region might be in the middle of a large free space, and allocating a job to it causes large external fragmentation. The lookahead heuristic selects a free region that leaves a larger space for later 23 jobs. The k-lookahead heuristic allocates the current submesh request to a free region such that a maximum number of k largest waiting jobs in the queue can also be allocated. The l-lookahead Best-Fit scheme, BFL(1), starts the allocation process by checking whether allocating the current job to the smallest free region would leave enough space for the largest waiting job in the queue. If not, it tries to allocate the current job to the next larger free region and checks again, repeating until a free submesh is found for the largest waiting job or all the free regions are tested. If this allocation process fails to find a free submesh for the largest waiting job, it allocates the current job to the smallest free region. On the other hand, in the 1-lookahead Worst-Fit scheme, WFL(1), the allocation process proceeds from the largest free region to the smallest free region. The proposed strategies require 0(Nj) steps. Based on the results of the simulation we performed, the BFL(1) and WFL(1) have very similar performance. Furthermore, both lookahead strategies have negligible performance differences compared with the FF, BF, and WF policies they proposed. Comparing the FF strategy proposed here with the Adaptive Scan strategy and Zhu’s FF strategy, we found that the only difference among their first-fit search processes is the searching directions. The FF strategy selects the available base with the smallest x-value (if there are ties, selects the one with the least y-value). On the other hand, the Adaptive Scan strategy first finds the available base which has the smallest y- value with the smallest :r-value. However, Zhu’s FF strategy finds the first available base which has the smallest x-value with the largest y-value. 
If we allow the rotation of the requested submesh by 90° when a free submesh is not found in the first pass for these three strategies, they have similar performance. So, the BFL(1) and WFL(1) 24 have very similar performance to the first-fit strategies, such as AS or FF. The lookahead strategies try to select a best free region for allocation among the free regions generated as described above by looking into the waiting queue. However, the number of free regions is O(n) [5], where n is the number of currently allocated submeshes. The maximum number of submeshes allocated in the system depends on the size of the mesh and the sizes of the allocated submeshes, and it is usually a small number. So, the resulting number of free regions is also very small. No matter which best-fit heuristic they use, from a small number of free regions, it is difficult to find a good submesh comparable to the submesh selected among a large number of candidate submeshes using other best—fit strategies. 2.4.8 Non-contiguous Processor Allocation Algorithms All the submesh allocation strategies mentioned above allocate a contiguous block of processors that physically form a submesh. Liu, Lo, and Windisch proposed three non-contiguous processor allocation strategies [48] that lift the restriction on contigu- ity of processors in order to eliminate the system fragmentation completely. However, non-contiguous processor allocation introduces potential problems due to message— passing contention, yielding communication interference among jobs. Non-contiguous strategies are beyond the scope of this work. 25 2.5 Dynamic Job Scheduling Schemes Job scheduling schemes for partitionable multiprocessor systems play an important role for improving system performance. Several dynamic job scheduling schemes for hypercubes and 2D meshes have been proposed. We will discuss them briefly in the following subsections. 2.5.1 Scan Scheduling Schemes for Hypercubes Krueger, Lai, and Radiya proposed a family of scheduling schemes, Scan, for hyper- cubes [39, 40]. For a hypercube of dimension r, r + 1 separate queues, Qk, 0 S k S r, are maintained for requested subcubes having dimensions 0 to r. We can scan the queues from small dimension to large dimension (Scan Up) or vice versa (ScanDown) in a circular way. All jobs of a certain dimension are allocated in the F CFS order until the queue is empty, and then the dispatcher allocates jobs in the next queue. Starvation is possible for both ScanUp and ScanDown. To solve this problem, two starvation free (SF) algorithms, Scan UpSF and ScanDownSF are proposed. For both algorithms, before the scheduler begins allocating jobs from queue i, it moves the entries found in the queue to CurrentQueue. The scheduler allocates jobs from CurrentQueue until it’s empty, then moves on to the next non-empty queue. Jobs that arrive at queue i after its entries have been moved to CurrentQueue must wait until the scheduler returns to queue i after allocating jobs from all other non-empty queues. 26 2.5.2 Lazy Scheduling Scheme for Hypercubes Mohapatra, Yu, Das, and Kim proposed a Static Partitioning scheme and a Lazy Scheduling scheme [49]. Static Partitioning scheme divides the r-cube system into one (r — 1)—cube, one (r — 2)-cube, ..., one l-cube, and two 0-cubes. There are r queues maintained for jobs requesting k-subcubes, where 0 S k S r — 1. Each queue is scheduled on a FCFS basis. Lazy Scheduling scheme maintains r +1 queues for an r-cube system. 
Each queue has a threshold length denoted as Nk, where 0 S k S r. Initially, Nk is set to zero. M. is increased by one upon a k-cube allocation and decreased by one upon a k-cube deallocation. In other words, M, denotes the number of allocated k—cubes currently in the system. When a job requesting a k-cube arrives, it goes to the end of queue k. If the number of waiting jobs in queue k is less than Nk, the job in the queue front gets allocated when a job completes and releases a subcube of the same dimension. If the number of waiting jobs in queue k is greater than Nk, the scheduler allocates the job in the queue front using a subcube allocation strategy if a free subcube is found. 2.5.3 Scan Scheduling Scheme for 2D Meshes Babbar and Krueger proposed a Scan scheduling discipline for 2D mesh systems in [3]. It differs from the Scan scheduling discipline for hypercubes only in the number of queues it maintains. It is only applicable when the mesh system and the requested submeshes are square meshes. For a square mesh of size L x L, if all jobs request square submeshes, Scan main- 27 tains L separate queues, one for each possible size of submesh request. On arrival, an incoming job joins the end of the queue corresponding to the size of the submesh it requires. All jobs in a certain queue are allocated before the scheduler moves to the next queue. Based on the direction of scanning, there are two possible alternatives, Scan Up and ScanDown. However, for jobs requesting arbitrary sizes and shapes, Scan maintains only a single ordered queue. For ScanU p, jobs are ordered from the smallest to the largest number of processors requested, while for ScanDown, jobs are ordered from largest to smallest. Actually, ScanUp and ScanDown are exactly the same as Smallest Job First and Largest Job First scheduling policies [39], respectively, when only a single queue is maintained. 2.5.4 Priority Scheduling Scheme for 2D Meshes Das Sharma and Pradhan proposed a job scheduling scheme that combines a priority technique with a reservation technique [23] for 2D meshes. The scheduling scheme can be implemented with any submesh allocation strategy. It tries to find a free submesh for an incoming job using the submesh allocator. If a free submesh is not available, the scheme searches to find a submesh that can be reserved for the incoming job. A processor may not be reserved twice and a reserved processor may not be allocated to another job, even though it may be free. All possible submeshes are considered in order to select a submesh for reservation. Among all the submeshes that are available for reservation, the one with the minimum number of 28 free processors is selected for reservation. If there is a tie, the candidate that overlaps with the least number of allocated submeshes is selected. A job that successfully reserves a submesh is appended to a reservation list. If a submesh can not be reserved for the incoming job, the job is appended to the waiting list. The jobs in the waiting list do not block the allocation/ reservation process unless a certain number of subsequent jobs arriving after the job in the head of the waiting list have been allocated or reserved. For each job in the waiting list, the scheme keeps a count of the number of jobs arriving later than it that have been allocated. To avoid the starvation problem, the job in the front of the waiting list has the highest priority to be allocated if more than MAXJ’RI jobs arriving after it have been allocated. 
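The bookkeeping behind this aging rule is straightforward. The sketch below is our illustration only and is deliberately simplified: it tracks the bypass count for the head job alone, whereas the scheme in [23] keeps such a count for every waiting job, and max_pri stands for the MAX_PRI threshold on the number of later arrivals allowed to bypass the head of the waiting list (its value here is arbitrary):

class WaitingList:
    """Aging sketch for the priority scheduling scheme: the head job gains
    absolute priority once more than max_pri later arrivals have bypassed it."""
    def __init__(self, max_pri=8):            # threshold value is illustrative
        self.jobs = []                         # FCFS waiting list
        self.bypassed = 0                      # later arrivals allocated so far
        self.max_pri = max_pri

    def add(self, job):
        self.jobs.append(job)

    def note_bypass(self):
        """Call whenever a job that arrived after the head job is allocated."""
        if self.jobs:
            self.bypassed += 1

    def head_has_priority(self):
        """True if the head job must now be allocated before any newer job."""
        return bool(self.jobs) and self.bypassed > self.max_pri

    def pop_head(self):
        self.bypassed = 0
        return self.jobs.pop(0)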
If we do not allow any reservations, the priority scheduler behaves like a non- preemptive priority scheduler. If a job can not be allocated, it joins the waiting list. Jobs from the reservation and waiting lists are considered for allocation in an FCFS manner, without blocking. When a job finishes and releases its processors, an attempt is made to allocate as many jobs waiting in the reservation list followed by the waiting list in the FCFS order. This proceeds in three phases. In the first phase, the entire reservation list is searched to find out the reserved submeshes whose processors are all free and allocate them to corresponding jobs. In the second phase, the entire reservation list is searched again to determine if any of the reserved submeshes can be allocated due to the processors released by the finished job or the free processors released by the reserved submeshes (after the jobs that reserved them got allocated 29 during the second phase itself). In the third phase, an attempt is made to allocate the jobs in the waiting list, and then try to reserve submeshes for the jobs in the waiting list. The priority technique alone, i.e., reservation is not allowed, improves the perfor- mance. However, the reservation technique alone does not improve the performance, due to the waste of large number of reserved free processors. A combination of the priority and reservation techniques results in a lower waiting delay than the prior- ity technique alone for low to moderate loads. However, for high loads, the priority technique alone yields best results. 2.6 Performance Evaluation Tools for Processor Allocation and Scheduling A great deal of effort has been directed towards developing software tools for the problems of scheduling parallel program tasks on parallel and distributed systems [28]. However, software tools for the study of processor allocation and job scheduling problems are in limited supply. Windisch, Miller, and Lo have developed a software tool called ProcSimity [55] for research in the area of processor allocation and scheduling for distributed memory multicomputers. It includes two main components: (1) a multicomputer simulator and (2) a performance analysis / visualization tool. The simulator currently supports a two-dimensional mesh topology ranging in sizes 30 up to 1024 nodes (32 x 32) and k-ary n-cubes ranging in sizes up to 1024 nodes for selected values of k and n. It supports the following scheduling schemes: First-Come- First-Served, Last-Come-First-Served, Shortest (Longest) Service Demand, Shortest (Longest) Hold Time, and Smallest (Largest) Job First. Processor allocation strate- gies for 2D meshes supported include selected submesh allocation strategies: Frame Sliding, First Fit, Best Fit, and non-contiguous allocation strategies (job requests for number of processor needed) proposed in [48]. Jobs running in the system can be simulated at two different levels. First, jobs can be delayed for a randomly gener- ated service time, and then depart, i.e., only service time is modeled and simulated. Second, jobs are simulated by executing specific parallel application communication patterns, i.e., only communication is modeled without detailed simulation of compu— tation. When the ProcSimity simulator is executed, it begins by reading a detailed set of parameters from the terminal’s stdin. However, it is easier to invoke the simulator by redirecting a file to the simulator. The file contains a list of all the parameters for the run. 
Each execution of the simulator produces output to stdout containing three types of information: (1) the parameters used, (2) the results of each individual run, and (3) the final results, averaged over all runs. Performance information is provided to the visualization tool through trace files generated by the simulator. The visualization tool provides a dynamic display of allocation and deallocation events and performance metrics. The Topology window displays the network topology and the dynamic activity of jobs entering and leaving the system. The SimInfo window provides information such as topology dimensions, job size distribution, and the processor allocation and scheduling algorithms. In a typical running session, a single trace file is loaded, and allocation and deallocation activities of jobs are displayed.

Chapter 3

Performance Evaluation Methodology and Tool

Simulation is a useful technique for computer systems performance analysis. During the design stage of an algorithm, a simulation model provides an easy way to predict the performance or compare several alternatives. Further, a simulation model allows various algorithms to be compared under a wide variety of workloads. As we discussed in the previous chapter, processor allocation and scheduling problems on a 2D mesh system are NP-hard. Due to the complexity of these problems, a discrete event simulator was built to evaluate the performance of various processor allocation and scheduling schemes. In this chapter, we first present the simulation model in Section 3.1. Then, we give the workload characterization and performance metrics in Section 3.2 and Section 3.3, respectively. In Section 3.4, we describe the performance evaluation tool that we developed.

3.1 Simulation Description

In discrete event simulation, there are three basic operations that a simulation program needs to perform: event generation, event handling, and statistics collection. Event generation can be trace-driven or distribution-driven. In trace-driven simulation, the time of occurrence of the event and the magnitude of parameters associated with the event are obtained from measurements on a real system and then used directly in the simulation. Trace-driven simulations may be necessary if detailed characteristics of the input must be preserved. In most other cases, distribution-driven simulation provides a compact, reproducible, and easily modifiable model of the workload [35]. Event handling is usually performed by advancing the simulation clock to the occurrence of the next event, updating the system state correspondingly, and then enabling some other events.
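As a minimal, self-contained illustration of this clock-advance/state-update cycle, the fragment below shows a bare-bones distribution-driven event loop. It is not the code of our simulator (which is built on CSIM, as described below); all names in it are illustrative.

#include <queue>
#include <vector>
#include <functional>

// Illustrative only: a bare-bones discrete-event loop.
struct Event {
    double time;                      // simulated time of occurrence
    std::function<void()> handle;     // updates system state, may schedule new events
};

struct Later {
    bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
};

std::priority_queue<Event, std::vector<Event>, Later> future_events;
double clock_now = 0.0;

void schedule(double delay, std::function<void()> handler) {
    future_events.push({clock_now + delay, std::move(handler)});
}

void run(double end_time) {
    while (!future_events.empty() && future_events.top().time <= end_time) {
        Event e = future_events.top();
        future_events.pop();
        clock_now = e.time;   // advance the simulation clock to the next event
        e.handle();           // update state; handlers may call schedule() again
    }
}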
Selecting a proper language is probably the most important step in the process of developing a simulation model. Special-purpose simulation languages have built-in facilities for time-advancing, random number generation, statistical data collection, and report generation. While these languages may save development time for experienced programmers, they usually are available only through special purchase and consume more system resources, e.g., CPU time. Further, beginners must spend some time learning the languages. On the other hand, a general-purpose language offers faster run-time speed, but programmers have to spend extra time developing routines for event handling, statistical data collection, and so forth. Another alternative is an extension of a general-purpose language. The extension consists of a collection of routines to handle tasks that are commonly required in simulations. It provides a compromise in terms of efficiency, flexibility, and portability. CSIM [52] is an example of such an approach.

Simulation of processor allocation and scheduling schemes requires a description of the system. The underlying target machine is a two-dimensional mesh-connected multiprocessor system. This simulation considers various mesh sizes ranging from 16 × 16 to 128 × 128. However, the simulation results for mesh systems of various sizes follow a similar trend; thus, we only report the performance results for a 32 × 32 mesh system in this dissertation. Each job J(w, h) arrives randomly in a stochastic stream and requests a rectangular submesh S(w, h). The system tries to find a free submesh for incoming jobs following a specific scheduling scheme and a submesh allocation strategy. If a free submesh is found for a job, it is allocated to the job until the job is finished. Then, the system deallocates the finished job and releases its processors. In this simulator, we have implemented various existing submesh allocation schemes as well as new allocation and scheduling schemes that we developed. In the next section, we describe the workload characterization in detail.

3.2 Workload Characterization

Simulation involves constructing a model for the behavior of the system and driving it with an appropriate abstraction of the workload. In our simulation model, basic input workload parameters include the scheduling discipline, the submesh allocation strategy, job inter-arrival times and service times, job sizes (widths and heights), and mesh sizes. Job inter-arrival times are assumed to have an exponential distribution with mean IATM. That is, the jobs arrive in a Poisson process. The Poisson process is commonly used to model independent jobs arriving randomly in the computer system. Job service (execution) times are also distributed exponentially, with a mean of EXTM. The width w and height h of the requested submesh are generated independently. We considered three distributions for w and h, namely, uniform, decreasing, and increasing distributions [13], to cover different characteristics of jobs. In the uniform distribution, w and h are equally likely to take any value from 1 through 32. In the decreasing distribution, the range of 1 through 32 is divided into four intervals, [1,4], [5,8], [9,16], and [17,32]. The distribution of w and h within each interval is still uniform. However, the probability that w or h falls into one of the intervals decreases as the values in the interval increase. The probabilities that the height or width falls in these intervals are P[1,4] = 0.4, P[5,8] = 0.2, P[9,16] = 0.2, and P[17,32] = 0.2. Essentially, a decreasing distribution on w and h represents a system with more jobs requesting small submeshes. Similarly, in an increasing distribution, the probability that w or h falls into one of the intervals increases as the values in the interval increase. So, for the increasing distribution, we have intervals [1,16], [17,24], [25,28], and [29,32], and the probabilities that the height or width falls in these intervals are P[1,16] = 0.2, P[17,24] = 0.2, P[25,28] = 0.2, and P[29,32] = 0.4. If w and h follow the increasing distribution, the system will have a greater chance of receiving jobs requesting large submeshes.
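The following sketch shows one way such a job stream could be drawn, using the decreasing distribution defined above together with exponential inter-arrival and service times. It is an illustration only; the constants mirror the parameters just described, and the generator layout is not taken from our simulator's source.

#include <random>

// Illustrative workload generator for the parameters described above.
struct Job { double arrival, service; int w, h; };

std::mt19937 rng(12345);   // a fixed seed gives every scheme the same job stream

int decreasing_side() {
    static const int    lo[] = {1, 5, 9, 17};
    static const int    hi[] = {4, 8, 16, 32};
    static const double p[]  = {0.4, 0.2, 0.2, 0.2};   // decreasing distribution
    std::discrete_distribution<int> pick(p, p + 4);
    int i = pick(rng);
    return std::uniform_int_distribution<int>(lo[i], hi[i])(rng);
}

Job next_job(double prev_arrival, double IATM, double EXTM) {
    std::exponential_distribution<double> interarrival(1.0 / IATM);
    std::exponential_distribution<double> service(1.0 / EXTM);
    return { prev_arrival + interarrival(rng),   // Poisson arrival stream
             service(rng),                       // exponential execution time
             decreasing_side(),                  // requested width
             decreasing_side() };                // requested height
}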
In order to compare the performance of various strategies under different loads, we define the load as follows:

    load = (n × EXTM) / (N × IATM),

where n is the mean requested submesh size (number of processors) of jobs, and N is the total number of processors in the 2D mesh system. In our simulation, we fix EXTM to be 10.0 time units and adjust IATM according to the desired load. The system will become unstable with too small an IATM value. For each separate simulation run, all random number streams are initialized with the same seed value to guarantee that each allocation and scheduling method receives the same stream of jobs.

3.3 Performance Metrics

We observe the behavior of the simulated system by calculating and analyzing various statistics, or performance metrics, which are outputs from the simulation experiment. The metrics of interest include: mean response time, standard deviation of response time, and system utilization. The response time of a job is defined as the time that elapses from the moment the job arrives in the system until it completes execution. In other words, it is the sum of the waiting time and the service time. Mean job response time provides a good, user-oriented perspective on performance. On the other hand, the standard deviation of response time complements the mean response time as a measure of performance. A scheme that reduces the mean response time may result in an increased standard deviation of response time, that is, the response time is more unpredictable. A scheme is considered better if it not only reduces the mean response time, but also reduces the standard deviation of response time under various job characteristics and loads.

The utilization of a resource is measured as the fraction of time the resource is busy servicing requests. Thus, for a single processor, the utilization with respect to time is defined as follows:

Definition 3.1 The utilization (with respect to time) of a single processor system over a given time period is

    utilization_time = time_busy / time_elapsed.

That is, it is the ratio of busy time to total elapsed time over a given period. The system we focus on is a multiprocessor system, and only a fraction of the processors may be used at a given time. For a multiprocessor system, the utilization is therefore usually defined with respect to space, i.e., processors, of the system at a time instant.

Definition 3.2 The utilization (with respect to space) of a multiprocessor system at a given time is

    utilization_space = (# busy processors) / (total # processors).

However, this utilization can further be averaged over time to give the overall system utilization over a given period of time. The overall utilization is a good measure from the system's point of view.

We record the activities (arrivals, departures, etc.) of all jobs and processors and the response times of all jobs. The results reported in this thesis are reasonably accurate. The error is less than 5% with a 90% confidence level. The method of independent replications [34] is used to achieve this level of precision. Each independent run performs 50,000 job allocations/deallocations, and data collected from the first 1000 jobs are discarded to eliminate the initial transient effect. As in studies by Chuang, Tzeng, Ding, Bhuyan, Li, Cheng, and Zhu [13, 25, 45, 57], the overheads of the allocation and scheduling algorithms are assumed to be negligible with respect to job execution time in our simulation study.
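The overall utilization reported in our results is the time average of the space utilization in Definition 3.2. One way such bookkeeping could be accumulated between allocation and deallocation events is sketched below; the structure and names are illustrative and are not taken from our simulator.

// Illustrative bookkeeping for the overall (time-averaged) utilization defined above.
struct UtilizationMeter {
    int    total_processors;           // N
    int    busy_processors = 0;        // currently allocated processors
    double last_event_time = 0.0;
    double busy_processor_time = 0.0;  // integral of busy processors over time

    void advance_to(double now) {      // call before each allocation/deallocation
        busy_processor_time += busy_processors * (now - last_event_time);
        last_event_time = now;
    }
    void allocate(double now, int submesh_size)   { advance_to(now); busy_processors += submesh_size; }
    void deallocate(double now, int submesh_size) { advance_to(now); busy_processors -= submesh_size; }

    // Overall utilization over [0, now] = time average of (busy processors / N).
    double overall(double now) {
        advance_to(now);
        return busy_processor_time / (total_processors * now);
    }
};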
To assess the validity of the negligible-overhead assumption, we measured the overhead of various strategies based on prototype implementations on a SUN Sparcstation 10. The actual average CPU time needed per allocation for the various strategies ranges from 0.2 ms for the Busy Distance Inverse strategy (to be introduced in Chapter 8) to 10 ms for the Best Fit strategy on a 32 × 32 mesh system. In contrast, the mean job execution time used for the simulation is 10 seconds. Thus, the assumption seems reasonable.

We have made extensive efforts to verify and validate the correctness of our simulation tool. We have self-consistency check points throughout the program; for example, we check that the arrival time, waiting time, execution time, and finishing time of each job match, and we compare the resulting system utilization with the input workload. We have cross-checked the results against published results where possible, both qualitatively and quantitatively. The simulation tool provides a valid comparison of various strategies. The next step, left as future work, is to validate the results against an actual implementation with real workloads.

3.4 Simulation and Experiment Analysis Tool

We have developed a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers. Sched-All is scalable in the size of mesh that can be simulated. It can evaluate the performance of various submesh allocation and job scheduling schemes on a mesh system of size up to 128 × 128. It includes three main components: (1) a simulator, (2) an experiment planner, and (3) an experiment analyzer. The core of the simulator is implemented in C++ using the CSIM simulation library [52]. CSIM is a process-oriented discrete-event simulation package for use with C or C++ programs. It provides a runtime library to support the necessary simulation procedures. The graphical user interface is designed using Tcl/Tk [51]. Tcl stands for Tool Command Language. It is a scripting language, and users can extend it by creating new commands implemented in C. Tk is the associated X windows toolkit of Tcl. Users can also implement custom widgets in C.

Sched-All supports a wide variety of processor allocation and scheduling schemes. Table 3.1 lists the current library of submesh allocation strategies, which includes some strategies discussed in Chapter 2 and two advanced strategies we developed (to be presented in Chapters 7 and 8). We also include three modified strategies in the simulation tool. We modify the First Fit and Best Fit strategies to consider 90° rotation of an incoming job, and we modify the Busy List strategy to consider all possible candidate submeshes along the boundaries.

Allocation strategy           Category    Chapter
Adaptive Scan                 first-fit   2
First Fit                     first-fit   2
First Fit (modified)          first-fit   2
Best Fit                      best-fit    2
Best Fit (modified)           best-fit    2
Busy List                     best-fit    2
Busy List (modified)          best-fit    2
Lookahead BFL(0)              best-fit    2
Lookahead BFL(1)              best-fit    2
Busy Distance Inverse         best-fit    8
Estimated Execution Time      best-fit    7

Table 3.1: Submesh allocation strategies supported by the simulator.

Table 3.2 lists the current library of job scheduling schemes, which includes some schemes discussed in Chapter 2 and two advanced strategies we developed (to be presented in Chapters 5 and 6). Figure 3.1 shows a screen dump of the main window of the simulator. The top portion of the window lists a set of buttons for specific functions. The bottom portion of the window is a status box.
A user can select the desired mesh size, allocation strategy, scheduling scheme, job characteristics, and load by pressing the corresponding button. After pressing the "Run" button, a message is displayed in the status box to indicate whether the simulation started successfully. By pressing the "Status" button, the current status of all simulation runs (either "finished", "still running", or "terminated") is displayed in the status box. For each simulation run, the simulator generates an output file that contains information such as the mean and variance of response time and the system utilization.

Table 3.2: Job scheduling schemes supported by the simulator.

Figure 3.1: The Sched-All simulator.

Figure 3.2 shows the screen dump of the main window of the experiment planner. An experiment is defined for our purpose as a set of simulation runs in which the runs are to be compared with one another via plotting on the same graph axes. The experiment planner is designed to set up the collection of data for an experiment comparing the performance of various processor allocation and scheduling schemes under different system loads. Users can use the "Load" button to specify the load range of interest. Then, similar to the simulator, users can use the "Mesh Size", "Allocation", "Scheduling", and "Job Char" buttons to select a desired environment and scheme, and then press the "Add" button to add this scheme to the current experiment. After selecting all the schemes to be compared, the desired data file for the experiment analyzer is generated by pressing the "Generate" button under "File". A pop-up window lets the user input a filename for the data file.

The experiment analyzer uses GNUPLOT [54] for plotting experimental data. It reads in the data file generated by the experiment planner, creates a GNUPLOT program which generates a postscript file of the plot, and then displays the plot on the screen using the postscript viewer utility Ghostview. Figure 3.3 shows a screen dump of the experiment analyzer and some pop-up windows for user inputs. Users can input the data filename, the GNUPLOT program name, and the postscript filename. Users can also specify the plot title, the x-axis and y-axis labels, and, for each scheme, the line type (solid, dashed, or dotted), marker type (diamond, plus, square, x, triangle, or asterisk), and text label (tag). Under the "Options" button, users can set the range of each axis, the tic positions, the key position, and log scaling. When the "Plot" button is pressed, a GNUPLOT program is created to generate a postscript plot file, and the plot is displayed on the screen. The experiment analyzer can be invoked from within the simulator or the experiment planner. It can also be used stand-alone, as a graphical user interface for GNUPLOT.

Figure 3.2: The Sched-All experiment planner.

Figure 3.3: The Sched-All experiment analyzer.

Figure 3.4 shows the content of an example data file, plot.dat. The first column corresponds to the x-axis values, and each additional column represents one curve to be plotted. Figure 3.5 shows the GNUPLOT program generated by the Sched-All experiment analyzer illustrated in Figure 3.3. Note that the x and y tics, the key position, and the x and y ranges are not specified in this example, so the default values are used. Figure 3.6 shows the screen dump of the plot.
Because of the poor resolution of the screen dump, the corresponding postscript version of the plot is also shown in Figure 3.7.

0.1 50.0 52.0 54.0 56.0 58.0 60.0 62.0
0.2 40.0 42.0 44.0 46.0 48.0 50.0 52.0
0.3 30.0 32.0 34.0 36.0 38.0 40.0 42.0
0.4 20.0 22.0 24.0 26.0 28.0 30.0 32.0
0.5 10.0 12.0 14.0 16.0 18.0 20.0 22.0

Figure 3.4: An example data file.

set term postscript eps 24
set output 'plot.ps'
set title 'My Title'
set xlabel 'My X-label'
set ylabel 'My Y-label'
set xtics
set ytics
set key
set nologscale x
set nologscale y
plot [:] [:] \
'plot.dat' using 1:2 title 'line 1' with linespoints 0 1, \
'plot.dat' using 1:3 title 'line 2' with linespoints 1 2, \
'plot.dat' using 1:4 title 'line 3' with linespoints 8 3, \
'plot.dat' using 1:5 title 'line 4' with linespoints 1 4, \
'plot.dat' using 1:6 title 'line 5' with linespoints 1 5, \
'plot.dat' using 1:7 title 'line 6' with linespoints 1 6, \
'plot.dat' using 1:8 title 'line 7' with linespoints 1 1

Figure 3.5: The generated GNUPLOT program.

Figure 3.6: The generated plot (screen dump).

Figure 3.7: The generated plot.

Chapter 4

Cost-Performance Assessment of Schemes

We have discussed some previous work in the area of submesh allocation and job scheduling in Chapter 2. While these submesh allocation strategies improve the overall system performance and reduce the system fragmentation, the actual processor utilization and external fragmentation resulting from these strategies still leave plenty of room for further improvement. Furthermore, some strategies reduce the mean job response time, but they have to pay the price of higher time complexity. It is very desirable to achieve high performance and low complexity at the same time. In this chapter, we first compare the performance of current processor allocation strategies and discuss the complexity versus performance issues in Section 4.1. Then, we discuss the limitations of the FCFS scheduling scheme and the possible performance gain due to scheduling schemes in Section 4.2.

4.1 Processor Allocation Strategies

Various submesh allocation strategies have been proposed. Performance comparison and analysis of these strategies are very desirable. In Section 4.1.1, we perform simulation experiments, present the results, and give a comprehensive performance comparison and analysis of existing submesh allocation strategies. Further, the limitations of these strategies and the complexity versus performance issues are discussed in Section 4.1.2.

4.1.1 Performance Analysis

The Two-Dimensional Buddy System (2DBS) is the first dynamic submesh allocation strategy ever developed. It causes large internal fragmentation if the requested submesh is not a square mesh with all sides being exactly powers of 2. The Frame Sliding (FS) strategy eliminates the internal fragmentation, but it sometimes misses an available free submesh and causes unnecessary external fragmentation. While these two strategies provided the foundation of this research area, their performance is surpassed by the succeeding strategies. We perform the following experiments and introduce selected results to study the performance of existing strategies.
Performance comparison and analysis of the existing submesh allocation strategies form the basis of our further research in this area. In the following experiments, two first-fit strategies, FF and AS, and three best-fit strategies, BF, BL, and BFL(1), are considered. We performed simulations to evaluate the performance of these five processor allocation strategies under different workloads and job characteristics. Figures 4.1, 4.2, and 4.3 illustrate the mean response time versus load for the uniform, increasing, and decreasing job distributions, respectively, in a 32 × 32 mesh system. Note that some of the strategies we compare here do not consider rotating the incoming jobs by 90°. So, in order to have a fair comparison, all requested submeshes are square submeshes.

Figure 4.1: Mean response time vs. load (w and h: uniform distribution).

Under the uniform job distribution (Figure 4.1), we can observe that the three best-fit allocation strategies (BF, BL, and BFL(1)) yield better performance compared with the two first-fit strategies (FF and AS). Under the increasing distribution (Figure 4.2), we can observe that the performance difference among the various strategies is not as significant as in the previous case. When jobs are distributed with the increasing distribution, the system tends to have more jobs requesting large submeshes. Eighty percent of the jobs request submeshes of size greater than or equal to 16 × 16 in a 32 × 32 system. Consequently, the effect on performance improvement due to the processor allocation strategy is diminished.

Figure 4.2: Mean response time vs. load (w and h: increasing distribution).

Figure 4.3: Mean response time vs. load (w and h: decreasing distribution).

Under the decreasing distribution, only 20% of the jobs request submeshes of size greater than or equal to 16 × 16 in a 32 × 32 system. So, the processor allocation strategy plays a critical role in improving the performance. As we can observe from Figure 4.3, the performance difference among the various strategies becomes much larger than in the previous two cases. The BL strategy has the best performance among all the strategies. The BF strategy has very similar performance compared with BL. The other best-fit strategy, Lookahead BFL(1), however, has similar performance compared with the first-fit AS strategy.

First-fit strategies allocate jobs to the first available location obtained, while best-fit strategies try to allocate jobs to the best location available. As we can observe from Figures 4.1 to 4.3, best-fit strategies yield better performance than first-fit strategies. This observation motivates our research in new and more advanced best-fit submesh allocation strategies. However, performance is not the only concern of strategies. We will discuss the complexity versus performance issues in the next section.
4.1.2 Complexity Analysis

As we have discussed in Chapter 2, various submesh allocation strategies have been proposed. Some schemes reduce the mean job response time, but they pay the price of higher time complexity. It is very desirable to achieve high performance and low complexity at the same time. Intuitively, best-fit strategies are more complicated than first-fit strategies. However, a well designed best-fit strategy may outperform first-fit strategies not only in performance, but also in complexity.

Table 4.1 summarizes the limitations and time complexity of various submesh allocation strategies. Figure 4.4 shows relative performance versus complexity trends of recent allocation strategies. In Figure 4.4 and Table 4.1, N denotes the number of processors in the mesh system, N_A denotes the number of jobs currently running in the system, and w and h denote the width and height, respectively, of the job being allocated.

Strategy     Category    Considers job rotation    Considers execution time    Complexity
2DBS         first-fit   no                        no                          O(log N)
FS           first-fit   no                        no                          O(N·N_A/(w·h))
FF           first-fit   no                        no                          O(N)
BF           best-fit    no                        no                          O(N)
AS           first-fit   yes                       no                          O(N × N_A)
BL           best-fit    yes                       no                          O(N_A³)
Lookahead    best-fit    no                        no                          O(N_A²)

Table 4.1: Limitations of existing submesh allocation strategies.

Figure 4.4: Performance vs. complexity of submesh allocation strategies.

The Two-Dimensional Buddy System (2DBS) is applicable only when the system is a square mesh with all sides being exactly powers of 2. Additionally, it causes large internal fragmentation if the requested submesh does not also satisfy this constraint. It has a time complexity of O(log N). The Frame Sliding (FS) strategy is applicable to any rectangular mesh system with arbitrary widths and heights, and it eliminates the internal fragmentation. The cost is that FS has a much higher time complexity, O(N·N_A/(w·h)). Moreover, the FS strategy sometimes misses an available free submesh and causes unnecessary external fragmentation. The AS strategy allows rotating the submesh request by 90° and searching the mesh system one more time if a free submesh is not found in the first pass. The AS strategy eliminates the allocation miss of the FS strategy, but its time complexity, O(N × N_A), is also higher than that of the FS strategy. The FF and BF strategies also eliminate the allocation miss encountered by the FS strategy, and each has a time complexity of O(N). However, they do not consider rotating the submesh request by 90° for possible allocation. The Busy List (BL) strategy has a time complexity of O(N_A³), and it is the first strategy whose time complexity is not related to N, the mesh size. However, its best-fit heuristic considers the number of busy neighbors only, neglecting busy nodes two or more hops away. Usually, N is much larger than N_A, so the goal of recent research is to develop strategies having complexity independent of the mesh size as well as having higher performance. The Lookahead strategies take a different approach: they examine the queue of waiting jobs and try to retain a free space large enough for the largest waiting job. The time complexity is further reduced to O(N_A²), but their performance is not as good as that of the BF and BL strategies.
While the submesh allocation problem is certainly well studied, the actual processor utilization and external fragmentation resulting from these strategies still leave plenty of room for further improvement. We argue that performance gains are still possible at little or no cost in additional overhead. Looking ahead to the new Estimated Execution Time (EET) and Busy Distance Inverse (BDI) strategies we propose in Chapters 7 and 8, respectively, we present their relative complexity and performance in Table 4.2 and Figure 4.5.

Strategy    Category    Considers job rotation    Considers execution time    Complexity
EET-SC      best-fit    yes                       yes                         O(N)
EET-SB      best-fit    yes                       yes                         O(N_A³)
BDI         best-fit    yes                       no                          O(N_A²)

Table 4.2: New submesh allocation strategies.

Figure 4.5: Performance vs. complexity of submesh allocation strategies.

4.2 Scheduling Schemes

Traditionally, submesh allocation strategies allocate jobs using an FCFS scheme. The allocation process always starts from the front of the waiting queue, so the scheduling complexity of the FCFS scheme is O(1). However, when a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system. Krueger, Lai, and Radiya have claimed that job scheduling is more important than processor allocation for hypercube computers [40]. They have developed a family of job scheduling schemes called Scan, which puts jobs into separate waiting queues according to their dimensions. All jobs of a certain dimension are allocated in FCFS order until the queue is empty, and then the dispatcher allocates jobs in the next queue. It takes constant time to put a job in the waiting queue corresponding to its dimension, and it also takes constant time to allocate jobs in a queue in FCFS order. So, the scheduling complexity of the Scan scheduling schemes is also O(1). However, the Scan scheduling schemes improve performance dramatically over the FCFS scheme, both by reducing the mean job response time and by increasing the overall system utilization. This work motivates our research on developing effective and efficient job scheduling schemes for 2D meshes. In the following two chapters, we concentrate on job scheduling issues and introduce our development of advanced job scheduling schemes for 2D mesh systems. In Chapters 7 and 8, we present our development of submesh allocation and mesh partitioning strategies.

Chapter 5

Job Scheduling Schemes

Krueger, Lai, and Radiya have claimed that job scheduling is more important than processor allocation for hypercube computers [40]. They developed a family of job scheduling schemes called Scan, which improve mean job response time dramatically compared with the FCFS and some other job scheduling schemes. However, job scheduling schemes for hypercube computers are not applicable to two-dimensional mesh computers. Traditionally, submesh allocation strategies allocate jobs by an FCFS policy. When a job arrives, the dispatcher allocates the job only if the waiting queue is empty and a free submesh is found; otherwise it puts the job at the end of the waiting queue.
Upon the completion of a job, the processors in the system that were allocated to the job are released, and the dispatcher checks whether jobs at the front of the waiting queue can be allocated. When a job with a large submesh request is to be allocated, it usually has to wait until almost all the allocated jobs leave the system, thus blocking later jobs from entering the system. To reduce the external fragmentation and avoid the potential performance loss due to blocking, different job scheduling policies need to be considered.

In this chapter, we propose two starvation-free job scheduling schemes, Immediate Fit (IF) and Scan All (SA), which improve the mean response time and system utilization dramatically for 2D mesh systems. The detailed strategies and performance results under different load conditions and job characteristics are given in the following sections.

5.1 Immediate Fit Strategy

When a job arrives in the system, there may exist a large enough submesh to be allocated to it. However, this job will not be allocated under the FCFS policy if the waiting queue is not empty. Incorporating job scheduling strategies into processor allocation can allow a job to be allocated when possible. In the Immediate Fit (IF) strategy, we allocate the incoming job immediately if it fits in a free submesh in the system. However, allowing a newly arrived job to enter the system will reduce the chance of freeing processors for a waiting job with a large submesh request and, hence, result in a starvation problem. The possibility of encountering starvation will be higher with a heavier system load. One solution to this starvation problem is to enforce a waiting time limit. When a job arrives, the dispatcher tries to allocate the job only if there are no jobs in the waiting queue exceeding the waiting time limit. Otherwise, it simply puts the job at the end of the waiting queue. When a job completes execution and processors are released, the dispatcher checks whether jobs at the front of the waiting queue can be allocated, and allocates them when possible. The algorithm for the Immediate Fit strategy is given in Figure 5.1.

When a job J arrives:
1. check if the waiting queue is empty;
   if the waiting queue is not empty, then
      retrieve the job, J_f, at the front of the waiting queue;
      if J_f has exceeded its waiting time limit, then
         put J in the waiting queue and stop;
      endif;
   endif;
2. use the desired allocation strategy to find a free submesh for J;
   if a free submesh is available, then
      allocate processors for J;
   else
      put J in the waiting queue;
   endif;

Upon completion of a job J:
1. release processors allocated to J;
2. retrieve the job, J_f, at the front of the waiting queue;
3. use the desired allocation strategy to find a free submesh for J_f;
   if a free submesh is available, then
      remove J_f from the waiting queue and allocate processors for J_f;
      goto step 2;
   endif;

Figure 5.1: The IF scheduling scheme.

5.2 Scan All Strategy

When a job completes execution, a submesh is freed. Jobs which previously could not be allocated can now be considered for possible allocation. In the Immediate Fit strategy, the allocation process will be blocked if the job at the front of the queue can not be allocated. Here, we discuss a Scan All (SA) scheme, which scans the entire queue for possible allocation of any job in the queue upon the completion of a job. When the system identifies a job that can be allocated, it allocates the appropriate submesh to it.
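To make the control flow of Figure 5.1 concrete, the fragment below sketches the IF arrival path and the waiting-time-limit guard in C++. The data structures and helper names (find_free_submesh, allocate, WAIT_LIMIT) are illustrative stand-ins, not the interfaces used in our simulator.

#include <deque>
#include <optional>

// Illustrative sketch of the IF scheme of Figure 5.1.
struct Submesh { int x, y, w, h; };
struct Job { int w, h; double arrival; };

constexpr double WAIT_LIMIT = 500.0;   // waiting time limit (Section 5.3 uses 500)

std::deque<Job> waiting_queue;

// Stand-ins for the chosen allocation strategy (e.g., FF, BF, or BL).
std::optional<Submesh> find_free_submesh(const Job&) { return std::nullopt; }
void allocate(const Job&, const Submesh&) {}

void on_job_arrival(const Job& j, double now) {
    // A waiting job that has exceeded its limit blocks newcomers (starvation guard).
    if (!waiting_queue.empty() &&
        now - waiting_queue.front().arrival > WAIT_LIMIT) {
        waiting_queue.push_back(j);
        return;
    }
    if (auto s = find_free_submesh(j)) {   // Immediate Fit: try the newcomer at once
        allocate(j, *s);
    } else {
        waiting_queue.push_back(j);
    }
}

void on_job_completion() {
    // IF stops at the first job at the head that cannot be placed.
    while (!waiting_queue.empty()) {
        auto s = find_free_submesh(waiting_queue.front());
        if (!s) break;
        allocate(waiting_queue.front(), *s);
        waiting_queue.pop_front();
    }
}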
The only difference between the SA scheme and the IF scheme is the procedure followed upon completion of a job J. This is shown in Figure 5.2.

Upon completion of a job J:
1. release processors allocated to J;
2. retrieve the job, J_f, at the front of the waiting queue;
3. use the desired allocation strategy to find a free submesh for J_f;
   if a free submesh is not available, then
      if J_f has exceeded its waiting time limit, then
         stop;
      endif;
   else
      remove J_f from the waiting queue and allocate processors for J_f;
   endif;
4. if J_f is the last job in the waiting queue, then
      stop;
   else
      retrieve the next job in the waiting queue and let it be J_f;
   endif;
5. go to step 3;

Figure 5.2: The SA scheduling scheme.

5.3 Performance Results

We performed simulations to evaluate the performance of the IF and SA scheduling schemes compared with the FCFS policy. The scheduling policies are combined with various processor allocation strategies. The Adaptive Scan (AS) and Busy List (BL) schemes were chosen to represent the first-fit and best-fit allocation strategies.

Figure 5.3 illustrates the system utilization against load using the AS and BL allocation strategies with the FCFS, IF, and SA scheduling schemes. Here, the waiting time limit is set to 500. A uniform distribution is considered for the height and width of the requested submeshes. The maximum system utilization achieved using the AS or BL allocation strategy with the FCFS scheduling scheme is 51.3% and 53.7%, respectively. When the AS and BL allocation strategies are combined with the IF scheduling scheme, the system utilization is improved to 53.9% and 55.8%, respectively. The SA scheduling scheme improves the system utilization much more significantly. The system remains stable with a load up to 0.7.

Figure 5.3: System utilization vs. load (w and h: uniform distribution).

It is interesting to compare the curves as follows. Observe the FCFS curves (dashed and solid lines with square markers). Note the differential, which is due to the allocation strategy. Next, compare the dashed lines only. Similarly, compare the solid lines only. These differentials are due to the scheduling strategy. In most cases, the latter differentials (due to scheduling) are greater.

We measure the system performance based on the mean response time. To have an accurate measurement, only stable systems are considered. From Figure 5.3, a reasonable range of load is chosen to guarantee system stability. Figure 5.4 illustrates the mean response time under different loads (from 0.1 to 0.5) using the AS and BL allocation strategies combined with the various scheduling schemes. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.2), all allocation strategies have similar performance. Most jobs will be served promptly after they enter the system. For loads greater than 0.3, the IF and SA scheduling schemes show significant performance gains. When the load reaches 0.5, the IF scheduling scheme reduces the mean response time by more than 48% for both the AS and BL allocation strategies, and the SA scheduling scheme reduces it by more than 73% for both. The standard deviation of the response time is illustrated in Figure 5.5.
We can observe that the standard deviations are about the same magnitude as their corresponding means. The high variability of the response times is in the nature of this problem: some jobs may get serviced promptly, while others may encounter long waiting delays in the queue. When evaluating a system of considerable variability, we should consider not only the mean response time, but also its standard deviation. One scheduling scheme may result in a lower mean response time but a higher standard deviation. As we can conclude from Figure 5.4 and Figure 5.5, the performance gained by using the IF and SA scheduling schemes versus the FCFS policy is far more significant than the improvement obtained due to the use of a better allocation strategy.

Figure 5.4: Mean response time vs. load (w and h: uniform distribution).

Figure 5.5: Standard deviation of response time vs. load (w and h: uniform distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, the decreasing and increasing distributions for the width w and height h of the submeshes requested. These distributions are described in Section 3.2. The system becomes unstable if the load reaches 0.5 when w and h are under the decreasing distribution and FCFS scheduling is used. Thus, we report the results up to a load of 0.4 for the decreasing distribution case. Figures 5.6 and 5.7 show the mean and standard deviation of response time versus load, respectively, for the various scheduling schemes under the decreasing distribution of job requests. The relative performance among the various scheduling strategies remains the same, except that the IF scheme yields a larger standard deviation than the FCFS policy. With a decreasing distribution, the system has more small jobs than large ones. As more small jobs come into the system and get allocated by the IF scheduling policy, large jobs at the front of the waiting queue face potentially longer delays in getting allocated.

The simulation results for the various strategies under the increasing distribution of job requests are presented in Figures 5.8 and 5.9. Performance results similar to those of the uniform distribution are observed here: the IF scheduling scheme performs better than the FCFS scheme, and the SA scheduling scheme performs better than the IF scheme. However, for the increasing distribution, the performance difference among the various strategies is not as significant as in the previous two cases (uniform and decreasing). This moderate performance improvement for the increasing distribution of job requests is expected. The system tends to have more jobs requesting large submeshes in an increasing distribution and, hence, there is a reduced chance of successfully allocating a job that is not at the front of the queue. Consequently, the effect on performance improvement due to the scheduling policy is diminished.

The main reason why the IF and SA scheduling schemes drastically improve the performance is that these schemes give priority to jobs with smaller submesh requests.
Figure 5.6: Mean response time vs. load (w and h: decreasing distribution).

Figure 5.7: Standard deviation of response time vs. load (w and h: decreasing distribution).

Figure 5.8: Mean response time vs. load (w and h: increasing distribution).

Figure 5.9: Standard deviation of response time vs. load (w and h: increasing distribution).

Thus, it would be interesting to analyze the effects of these scheduling policies on jobs of different requested sizes. Consider a 32 × 32 mesh. The maximum number of nodes a job can request is 1024. We divided the jobs into four groups according to their requested submesh sizes, namely, jobs requesting 1 to 256 nodes, 257 to 512 nodes, 513 to 768 nodes, and 769 to 1024 nodes. The mean and standard deviation of response time are measured for each job group under the different scheduling schemes. In this simulation, we assume the width and height of the requested submesh are distributed under the uniform distribution. The Busy List strategy is adopted for submesh allocation. The results are shown in Figure 5.10 and Figure 5.11 for a load of 0.5. We can see that the IF and SA scheduling schemes reduce the mean and standard deviation of response time for all job groups, though with more improvement for jobs with smaller submesh requests.

Figure 5.10: Mean response time vs. job size (w and h: uniform distribution).

Figure 5.11: Standard deviation of response time vs. job size (w and h: uniform distribution).

As discussed earlier, a waiting time limit is introduced to resolve the potential starvation problem. Jobs which are not at the front of the waiting queue will not have the chance to be allocated if any job has exceeded its waiting time limit. To study the effect of the waiting time limit on the system performance, we conducted the following simulation: the IF and SA scheduling schemes are each combined with the BL allocation strategy. A uniform distribution is considered for the width and height of the requested submeshes. The load of the system is set to 0.5. The response time is measured against various waiting time limits. Figure 5.12 presents the simulation results. As the waiting time limit decreases, the effect of using the IF or SA scheduling policy becomes less significant and, hence, less performance improvement can be obtained.
Figure 5.12: Mean and standard deviation of response time vs. waiting time limit (w and h: uniform distribution).

Essentially, a scheduling policy reduces to the FCFS policy when the waiting time limit is zero. If the waiting time limit is greater than 500, all scheduling schemes perform as if the waiting time limit were infinite. Actually, for both the IF and SA scheduling schemes, setting the waiting time limit to 100 already gives very good performance gains. For both scheduling schemes, the performance difference between a waiting time limit of 300 and infinity is very small. From the results, we can see that the SA and IF scheduling schemes work well with a reasonable value of the waiting time limit.

5.4 Discussion

In this chapter, we presented two new job scheduling schemes, Immediate Fit and Scan All. A waiting time limit is introduced to avoid a starvation problem. Simulation results show that our scheduling schemes significantly improve the system utilization and improve the mean and standard deviation of response time for different load conditions and job characteristics. More precisely, our SA scheduling scheme can handle a load up to 0.7 and results in a 70% system utilization. For a moderate load of 0.5, the SA scheme reduces the mean response time of the FCFS policy by more than 75%. The simulation results show that we can obtain significant performance gains with a reasonable value of the waiting time limit. The Scan All scheme has greater overhead than the FCFS and IF schemes because it scans all jobs in the waiting queue. However, for uniformly distributed width and height, the mean queue length for a system load of 0.5 is only 3.6. IF and SA not only favor smaller jobs, but they also improve the performance of larger jobs by reducing the waiting queue length. So, the mean response times for jobs of all sizes are improved.

From the simulation results, we have observed that when the load is low to moderate, the performance difference among the various scheduling and processor allocation schemes is small. So, when the system is not busy, most jobs will be served promptly after they enter the system, no matter which processor allocation and job scheduling scheme we use. When the load is heavy, the performance difference among the various schemes increases, and better schemes yield bigger performance gains. So, we consider a scheme better than others if it gives better performance when the load is heavy. This observation assists in comparing our work with other recently reported results. As we presented in Chapter 2, the priority and reservation techniques for job scheduling and submesh allocation were proposed in ICPP '94 (August 1994) [23]. The priority technique alone, i.e., when reservation is not allowed, improves the performance. However, the reservation technique alone does not improve the performance, because a large number of free processors are wasted while reserved. A combination of the priority and reservation techniques results in a lower waiting delay than the priority technique alone for low to moderate loads. However, for high loads, the priority technique alone yields the best results. Since the priority technique gives better performance when the load is heavy, we use it as a basis for comparison.
The priority technique is very similar to the SA scheme that we proposed: when a job finishes and releases processors, an attempt is made to allocate the jobs from the waiting queue in an FCFS fashion; when a job arrives, the jobs in the waiting queue do not block the allocation process unless MAX_PRI subsequent jobs arriving after the job at the head of the queue have been allocated. This is done to avoid starvation of jobs requiring larger submeshes. Each job waiting in the queue keeps track of the number of jobs arriving later than it that have been allocated. We take a different approach to solving the starvation problem by imposing a waiting time limit. If the waiting time limit is infinity in SA and MAX_PRI in the priority scheme is also infinity, we have determined that SA and the priority scheme behave exactly the same.

Chapter 6

Job Scheduling Using Multiple Queues

In the Scan and Lazy job scheduling schemes for hypercube computers [39, 49], multiple queues are used. Instead of using a single job waiting queue, r + 1 queues are maintained for an r-dimensional hypercube system, one for each possible subcube dimension. These two schemes improve mean job response time dramatically relative to schemes that do not cluster equal-sized jobs. In a 2D mesh system, jobs request submeshes of various sizes and shapes. It is not trivial to cluster submesh requests into separate queues for job scheduling. In this chapter, we investigate the effect of various ways of using multiple queues for job scheduling in 2D mesh systems. A job scheduling scheme using multiple queues is developed to further improve the performance of the SA scheduling scheme presented in Chapter 5. Performance results show that the Multiple Queues scheme further reduces the mean response time under all load conditions and different job characteristics.
But in development environments, users rarely know how long their programs will execute. If users know that the system is designed to favor jobs with small estimated exe- cution times, they may give small estimates. However, the scheduler can be designed to remove this temptation. First, the user can be forewarned that if the job runs longer than estimated, it will be terminated and the user will be charged for the work. Second, the system can run the job for the estimated time plus a small per- centage extra, and then suspend it and it may be restarted at a later time. The user would pay for the suspending/ restarting overhead. Another solution is to run the job for the estimated time at normal billing rates, and then to charge a premium rate for additional execution time. Under this arrangement, the user providing low run-time estimates to obtain better service will pay a sharp premium. Neither number of processors nor execution time gives a complete indication of workload. The service demand (the product of execution time and number of pro- cessors required) gives a better indication of the workload. The Smallest-Service- Demand-Job-First scheme is evaluated in Section 6.3.2, alone with each of the schemes named above. A family of job scheduling schemes, called Scan, was proposed by Krueger, Lai, and Radiya in [39] for hypercube computers. It is similar to the C-Scan disk-scheduling discipline [53]. The basic idea behind the Scan strategies is to classify jobs according to their sizes and put them into separate queues. We can either scan up from the queues 77 with smaller jobs or scan down the queues with larger jobs. For an r-dimensional hypercube, it maintains r + 1 separate queues (queues 0 to r) for requested subcubes having dimensions 0 to r. When a job arrives, it goes to the end of the queue corresponding to its dimension. All jobs of a certain dimension will be allocated in the FCFS order until the queue is empty, and then the dispatcher will allocate jobs in the next queue. The queues are scanned from small dimension to large dimension, i.e., from queues 0 to r, (ScanUp) or from queues r to 0 (ScanDown) in a circular way. However, starvation is possible for both ScanUp and ScanDown. To solve this problem, before allocating jobs in a specific queue i, the job dispatcher moves all the jobs in queue i to another queue called the current-queue. It allocates jobs in the current_queue until it is empty. Then it moves the jobs in the next non-empty queue into the current_queue. Jobs arriving at queue i after the dispatcher moves the jobs from queue i to the current_queue must wait until the next time the job dispatcher visits queue i. In the next section, we introduce a job scheduling scheme using multiple queues for 2D mesh systems. Several methods of clustering job requests are proposed. Per- formance results are given in Section 6.3. 6.2 Multiple Queues Scheme Several problems arise when we try to establish a scheme using multiple queues for 2D mesh systems. First of all, what is the best criterion to classify jobs into different queues? Secondly, what is an optimal number of queues? Although, it is straightfor- 78 ward to have r + 1 queues for subcube requests from dimensions 0 to r, there is no natural way to decide how many queues are needed for 2D mesh systems to attain optimal performance. For hypercube systems, every job in the same queue has ex- actly the same size, so the allocation process stops when a job in the queue can not be allocated. 
For a 2D mesh system, we have different sizes and shapes of submesh requests in the same queue. Even two submesh requests having the same number of processors may have different shapes. For example, having a 2 X 6 submesh request in the front of the queue that can not be allocated does not mean a 3 X 4 submesh request can not be allocated either. For the Multiple Queues scheme, we propose three criteria for classification of submesh requests: total number of processors, the larger side length of the requested submesh, and the smaller side length. The performance of different criteria is eval- uated in Section 6.3.1. The optimal number of queues needed is also determined by the simulation performed in Section 6.3.1. Assume that we have Q queues (numbered 1 to Q) and N processors in the system. In the Multiple Queues scheduling scheme, the scan process always starts from queue 1. If we use scan up, and the number of processors requested is the job classification criterion, then a job which requests x processors will go to queue [AIL/Q]. Similarly, if we use scan down, then a job which requests x processors will go to queue Q— [N15] +1. If we use the larger (smaller) side of the requested submesh as the classification criterion, then the job with the larger (smaller) side length x will go to queue [la/L62] for scan up or queue Q — [la/LQl + l for scan down, where L is the length of the larger (smaller) side of the mesh system. In the Multiple Queues scheme, the job dispatcher scans through jobs in all queues 79 in a specific order, from queues of large jobs to queues of small jobs (scan down) or vice versa (scan up). Applying the Scan All scheme presented in Chapter 5 to each waiting queue, the allocation process does not block if the job in the front of the queue can not be allocated. If a job can not be allocated, the dispatcher attempts to allocate the next job in the queue. Additionally, when a job arrives, there might exist a large enough submesh for it. We try to allocate an incoming job immediately if it fits in a free submesh in the system (Immediate Fit). The incoming job is put in a queue only if there is no free submesh found for it. However, allowing a newly arrived job to enter the system reduces the chance of freeing processors for a waiting job having a large submesh request and, hence, results in a starvation problem. The possibility of encountering starvation is higher with a heavier system load. Our solution to this starvation problem, as in the schemes of Chapter 5, is to enforce a waiting time limit. If the job dispatcher tries to allocate a job in a waiting queue but no free submesh is found for it and it has exceeded its waiting time limit, the job dispatcher stops the scan process and does not allocate the next job in the queue. When a job arrives, the dispatcher tries to allocate the job only if there are no jobs in any waiting queue exceeding the waiting time limit. Otherwise, it simply puts the job at the end of the waiting queue corresponding to its size. The detailed algorithm for a system with Q waiting queues is listed in Figure 6.1. In the next section, we will perform simulation runs to see which is the best job classification criterion and what is the optimal number of queues needed. 80 When a job J arrives: l. initialize the scan from waiting queue i = 1; 2. if waiting queue i is empty, then goto step 4; else retrieve the job in the front of waiting queue i and let it be J 1’; 3. 
3. if Jf has exceeded its waiting time limit, then put J in the appropriate waiting queue and stop;
4. if i < Q (waiting queue i is not the last queue), then i = i + 1; goto step 2;
5. use the desired allocation strategy to find a free submesh for J; if a free submesh is available, then allocate processors for J; else put J in the appropriate waiting queue and stop.

Upon completion of a job J:
1. initialize the scan from waiting queue i = 1;
2. release the processors allocated to J;
3. retrieve the job in the front of waiting queue i and let it be Jf;
4. use the desired allocation strategy to find a free submesh for Jf; if a free submesh is available, then remove Jf from the waiting queue, allocate processors for Jf, and goto step 6;
5. if Jf has exceeded its waiting time limit, then stop;
6. if Jf is not the last job in waiting queue i, then retrieve the next job in waiting queue i, let it be Jf, and goto step 4;
   else if i < Q (waiting queue i is not the last queue), then i = i + 1; goto step 3;
   else stop;

Figure 6.1: The Multiple Queues scheduling scheme.

6.3 Performance Results

In this section, we conduct simulations to determine the best job classification criterion and the number of queues needed for the Multiple Queues scheme. Then we compare the performance of the Multiple Queues scheme with other scheduling schemes. In our simulation, the waiting time limit is set to be infinity. The width w and height h of requested submeshes are generated independently. We considered three distributions for w and h, namely, uniform, decreasing, and increasing distributions [13], to cover different characteristics of jobs. The processor allocation strategy we use is the BL strategy.

6.3.1 Performance of Variations of Multiple Queues Scheme

In the Multiple Queues scheme, jobs are classified into different queues using one of three criteria: total number of processors, the larger side of the requested submesh, or the smaller side. Two types of scan, ScanUp and ScanDown, are also considered. The following experiment is performed under a load of 0.5, and the widths and heights of jobs are distributed under the uniform distribution. The performance of the original ScanUp and ScanDown schemes (without scanning through all jobs in the waiting queues) is compared for various numbers of queues and different job classification criteria.

Figure 6.2 shows the mean response time versus number of queues for variations of the Scan schemes. When we have only a single queue, these schemes degrade to the FCFS scheme. If two queues are used, scan up and scan down will perform the same. The number-of-processors criterion performs the best among the three criteria. When the number of queues is larger than four, scan up performs a little better than scan down. This matches the results reported by Krueger, Lai, and Radiya [39]. As we can observe, using number of processors as the job classification criterion results in the best performance. If we use scan up and number of processors as the job classification criterion, the mean response time will be reduced by more than 40%. If we use the larger side as the job classification criterion, the queues for larger jobs contain many more jobs than the queues for smaller jobs. For example, if we have 32 queues, queue 1 contains only 1 × 1 jobs, and queue 32 contains jobs ranging from 1 × 32 to 32 × 32. Under such circumstances, jobs are unevenly distributed among the queues, so the effect of multiple queues diminishes.

Figure 6.2: Mean response time vs. # queues, load=0.5 (w and h: uniform distribution).
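The queue-assignment rules of Section 6.2 determine which waiting queue a job joins under each of the three classification criteria compared in Figure 6.2. The short Python sketch below shows one way that mapping might be computed; the function name and argument names are illustrative and are not taken from the simulator.

    import math

    def queue_index(w, h, Q, N, L, criterion="processors", scan_up=True):
        # Map a w x h submesh request to a waiting-queue index in 1..Q (sketch).
        # criterion "processors" uses w*h against the system size N;
        # "larger" and "smaller" use a side length against the mesh side length L.
        if criterion == "processors":
            x, bound = w * h, N
        elif criterion == "larger":
            x, bound = max(w, h), L
        else:  # "smaller"
            x, bound = min(w, h), L
        q = math.ceil(x * Q / bound)        # queue ceil(xQ/N) or ceil(xQ/L)
        return q if scan_up else Q - q + 1  # scan down reverses the numbering

For example, on a 32 × 32 mesh with Q = 32 queues and the number-of-processors criterion, a 3 × 4 request (12 processors) maps to queue 1 under scan up and to queue 32 under scan down.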
Next, we performed similar simulations to evaluate the performance of the Multiple Queues scheme presented in Figure 6.1, which scans through all jobs in the waiting queues. The width and height of jobs are under uniform, decreasing, and increasing distributions, and the results are illustrated in Figures 6.3, 6.4, and 6.5, respectively. Using only a single queue, the Multiple Queues scheme degrades to the Scan All scheme presented in Chapter 5. The performance of the SA scheme is much better than the FCFS scheme and the original Scan schemes because the SA scheme removes the blocking effect of the FCFS scheme. When the number of queues increases, ScanDown performs better than ScanUp. It is harder to find a free submesh for larger jobs than for smaller jobs, so the result is within our expectation. If we try to allocate smaller jobs first, it becomes even harder to allocate larger jobs, and the waiting time of large jobs will be much longer. Conversely, if we try to allocate larger jobs first, there might still exist some small free submeshes for smaller jobs. So, ScanDown is the better one to adopt for the Multiple Queues scheme. As the number of queues increases, the mean response time reduces if we use ScanDown, regardless of which job classification criterion is used.

Figure 6.3: Mean response time vs. # queues, load=0.5 (w and h: uniform distribution).

Figure 6.4: Mean response time vs. # queues, load=0.5 (w and h: decreasing distribution).

Figure 6.5: Mean response time vs. # queues, load=0.5 (w and h: increasing distribution).

When the system load is relatively light, it is difficult to observe performance differences. The results of increasing the load to 0.7 and repeating the experiments for ScanDown are plotted in Figures 6.6, 6.7, and 6.8.

Figure 6.6: Mean response time vs. # queues, load=0.7 (w and h: uniform distribution).

Apparently, larger side length is not a good criterion under heavy load. A job might request a submesh with a large value of the larger side length but very few processors. It then has priority to be allocated and can possibly block a larger job (in terms of number of processors) behind it from being allocated.
Number of processors and smaller side length have similar performance, and the former performs slightly better than the latter. In our opinion, number of processors gives a better indication of the real workload, and it performs the best among the three criteria, so we choose number of processors as the job classification criterion for the Multiple Queues scheme.

Figure 6.7: Mean response time vs. # queues, load=0.7 (w and h: decreasing distribution).

Figure 6.8: Mean response time vs. # queues, load=0.7 (w and h: increasing distribution).

As we can observe from Figures 6.6 to 6.8, the more queues we use, the better the performance. Also, the performance flattens out as the number of queues increases. As the number of queues increases from 1 to 32, the mean response time reduces from 79.91 to 56.57 time units for the uniform distribution, from 92.37 to 60.64 for the decreasing distribution, and from 98.66 to 76.50 for the increasing distribution. Using 32 queues for a 32 × 32 mesh system is just enough. So, the number of queues used in the Multiple Queues scheme is set to be the side length of the mesh system.

6.3.2 Comparing Multiple Queues Scheme with Other Schemes

In the following experiments, we compare the Multiple Queues scheme with other scheduling schemes under various load conditions. Here we use 32 queues, classify jobs by number of requested processors, and apply ScanDown to try to allocate large jobs first for the Multiple Queues scheme. The other scheduling schemes considered include: First-Come-First-Served (FCFS), Smallest-Job-First (SmallestJF), Largest-Job-First (LargestJF), Shortest-Job-First (SJF), and Smallest-Service-Demand-Job-First (SmallestSDJF). The mean response time using these six schemes for a load of 0.1 to 0.7 under uniform, decreasing, and increasing distributions is plotted in Figures 6.9, 6.10, and 6.11, respectively.

Figure 6.9: Mean response time vs. load (w and h: uniform distribution).

Figure 6.10: Mean response time vs. load (w and h: decreasing distribution).

Figure 6.11: Mean response time vs. load (w and h: increasing distribution).

The Multiple Queues scheme has the best performance. The main advantage of the Multiple Queues scheme is that it scans through all jobs in the waiting queues,
checking for possible allocation, while other schemes stop if a job in the queue cannot be allocated. For comparison purposes, we can modify the other schemes to scan all jobs in the waiting queue and to try to allocate jobs when they arrive. The FCFS scheme becomes the Scan All (SA) scheme proposed in Chapter 5, which is equivalent to the Multiple Queues scheme with a single queue. The results are shown in Figures 6.12, 6.13, and 6.14. The Shortest-Job-First and Smallest-Service-Demand-Job-First schemes have a slight advantage over the Multiple Queues scheme when the load is light. If the users can somehow estimate the execution times of their jobs and provide this information to the system, it is possible for the job scheduler to place shorter jobs first to achieve better performance. Overall, the Multiple Queues scheme performs the best if no execution time information is available.

Figure 6.12: Mean response time vs. load (w and h: uniform distribution).

Figure 6.13: Mean response time vs. load (w and h: decreasing distribution).

Figure 6.14: Mean response time vs. load (w and h: increasing distribution).

As discussed in Section 6.2, a waiting time limit is introduced to resolve the potential starvation problem. Jobs that are not in the front of the waiting queue will not be considered for allocation if any job has exceeded its waiting time limit. To study the actual effect of the waiting time limit on the system performance, we conducted the following experiment. A uniform distribution is considered for the width and height of the requested submeshes. The load of the system is set to 0.5. The response time is measured against various waiting time limits. Figure 6.15 presents the simulation results.

Figure 6.15: Mean response time vs. waiting time limit (w and h: uniform distribution).

As the waiting time limit decreases, more jobs will exceed their waiting time limits, rendering the scanning process ineffective in allocating other jobs. Hence, less performance improvement can be obtained. Setting the waiting time limit to 300 gives very good performance gains. If the waiting time limit is greater than 600, all scheduling schemes perform as if the waiting time limit is infinite. The performance difference between the waiting time limit being 300 and infinity is very small. Actually, the starvation problem could only happen when the system load is very heavy. Reducing the waiting time limit simply causes greater mean response time and longer queue length. From the results, we can see that the Multiple Queues scheme works well with a reasonable value of the waiting time limit.

6.4 Discussion

In this chapter, we have proposed a starvation-free job scheduling scheme, Multiple Queues, that can be used with any existing submesh allocation strategy.
It is based on the Scan All scheduling scheme presented in Chapter 5. The effect of various ways of using multiple queues for job scheduling in 2D mesh systems is investigated. Simulation results show that the Multiple Queues scheme significantly reduces the mean response time of existing allocation strategies under different load conditions and job characteristics. A higher system utilization and lower internal fragmentation are also achieved.

Babbar and Krueger recently proposed a Scan scheduling discipline for 2D mesh systems that also employs multiple queues [3]. However, it is only applicable when the mesh system and the requested submeshes are square meshes. For a square mesh of size L × L, it maintains L separate queues, one for each possible size of submesh request. On arrival, an incoming job joins the end of the queue corresponding to the size of the submesh it requires. All jobs in a certain queue are allocated before the scheduler moves to the next queue. This approach has a blocking effect similar to the FCFS scheme, that is, larger jobs may block smaller jobs from entering the system. For a system accepting jobs requesting arbitrary sizes and shapes, their approach maintains only a single ordered queue. For ScanUp, jobs are ordered from the smallest to the largest number of processors requested, while for ScanDown, jobs are ordered from largest to smallest. Actually, ScanUp and ScanDown are exactly the same as the Smallest-Job-First and Largest-Job-First scheduling policies, respectively, when only a single queue is maintained. On the other hand, the Multiple Queues scheme we presented is applicable when the mesh system and the requested submeshes are not square. Further, it performs better than Smallest-Job-First and Largest-Job-First under all circumstances.

The job scheduling schemes presented in Chapters 5 and 6 can be used with any processor allocation strategy and further improve allocation performance compared with the FCFS policy. In the next two chapters, we will focus on processor allocation strategies instead of job scheduling schemes.

Chapter 7
Estimated Execution Time Allocation Strategies

We have performed simulation to study the performance of existing strategies and found that the different completion times of the jobs in the system can result in busy processors being scattered throughout the mesh, thus causing large external fragmentation.

Execution time information is useful for job scheduling. For example, the Shortest-Job-First (SJF) scheduling discipline selects the waiting job with the smallest estimated run-time-to-completion to run next. Having an estimated execution time for the job submitted to the processor allocator, the system can try to allocate jobs with similar completion times in the same area. In practice, since most jobs running on parallel systems are programs submitted repeatedly [29], users usually can quite accurately estimate the execution time of their jobs.

In this chapter, we present a family of processor allocation strategies, called Estimated Execution Time (EET), which utilize the estimated execution time information. In the next section, we give the basic idea of the EET strategies. Then we provide detailed algorithms in Section 7.2. The performance results are given in Section 7.3.

7.1 Basic Idea and Heuristics

The basic idea of our EET strategies is to allocate a job in close proximity to jobs with similar estimated completion times.
The estimated completion time (ECT) of a job in the system is the time it was allocated plus its estimated execution time. The estimated completion time of the job which is currently being allocated is the current time plus its estimated execution time. The estimated completion time difference (ECTD) is calculated for each boundary node of a candidate submesh and its neighbor. If a neighbor is free, the ECTD is the estimated execution time of the job. If a neighboring node is busy, the ECTD is defined as the absolute value of the estimated completion time difference of the two neighboring jobs. If the neighboring node is beyond the boundary, the ECTD is zero. Because of inaccurate estimation, it is possible to encounter a situation in which the ECT of a neighboring job is less than the current time. If this happens, the current time is used to represent the ECT of the neighboring job, and the ECTD is merely the estimated execution time of the job currently being allocated. The total estimated completion time difference (TECTD) of a candidate submesh is the sum of the estimated completion time differences of all its neighboring nodes. Consider the 8 × 8 mesh in Figure 7.1.

Figure 7.1: A two-dimensional mesh M(8, 8).

Two jobs, X and Y, are currently in the system with estimated completion times x and y, respectively. Submeshes A and B are two candidate submeshes for a job Z requesting a 3 × 4 submesh with estimated completion time z, where z is the current time t plus the estimated execution time z′ of Z. If both x and y are greater than t, candidate submesh A has a total estimated completion time difference of 3 × |z − x| + 2 × |z − y| + 2 × z′. Candidate submesh B has a total estimated completion time difference of 3 × |z − x| + 4 × z′. If t is greater than x and t is less than y, A has a total estimated completion time difference of 3 × z′ + 2 × |z − y| + 2 × z′ and B has a total estimated completion time difference of 3 × z′ + 4 × z′. If both x and y are less than t, A has a total estimated completion time difference of 3 × z′ + 2 × z′ + 2 × z′ and B has a total estimated completion time difference of 3 × z′ + 4 × z′. The objective is to choose the candidate submesh with the smallest total estimated completion time difference.

7.2 Detailed Algorithms

We first propose the Estimated Execution Time-Search Boundaries (EET-SB) strategy. It searches candidate submeshes which are adjacent to the boundaries of the mesh system or to allocated submeshes. The searching procedure is similar to the BL strategy, but the EET-SB strategy will consider all possible candidate submeshes along the boundaries. After each successful submesh allocation, the system must record the estimated completion time of that job and update the busy list for future allocation consideration. After each job completes and releases its processors, the busy list again needs to be updated. The detailed algorithm is presented in Figure 7.2.

The Best Fit allocation strategy proposed by Zhu is designed to find corners in the coverage array. A major reason why the BF strategy performs worse than the BL strategy is that the BF strategy does not allow rotating the requested submesh by 90 degrees.
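Both EET strategies presented next rank candidate submeshes by this TECTD. The following Python sketch shows one way the score might be computed for a single candidate; the neighbor encoding, the ect dictionary, and the argument names are assumptions made for illustration, not the data structures of the EET implementation.

    def tectd(neighbors, ect, z_prime, now):
        # Total estimated completion time difference of one candidate (sketch).
        # neighbors: one (kind, job_id) pair per node bordering the candidate,
        #            where kind is "free", "busy", or "boundary".
        # ect:       estimated completion times of jobs already in the system.
        # z_prime:   estimated execution time of the job being allocated.
        # now:       current time, so the incoming job's ECT is now + z_prime.
        z = now + z_prime
        total = 0.0
        for kind, job_id in neighbors:
            if kind == "boundary":
                continue                   # beyond the mesh boundary: ECTD = 0
            elif kind == "free":
                total += z_prime           # free neighbor: ECTD = z'
            else:
                # Busy neighbor: clamp a stale ECT to the current time, so the
                # ECTD degenerates to z' exactly as described above.
                total += abs(z - max(ect[job_id], now))
        return total

The allocator would evaluate this score for every candidate submesh and keep the one with the smallest value.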
We propose the Estimated Execution Time-Search Corner (EET-SC) processor allocation strategy in Figure 7.3. The corner-searching procedure is similar to that in the BF strategy, but we allow rotating the requested submesh. Simulation results and performance comparisons are given in the next section.

Allocate J(w, h)
1. if busy_list = null, then allocate J(w, h) to <0, 0, w − 1, h − 1>; stop; else initialize min_TECTD = ∞ and TECTD = 0;
2. generate candidate submeshes for J(w, h) and J(h, w) on the four corners of the mesh system and put them in the candidate_list;
3. for each job B in the busy_list do generate candidate submeshes for J(w, h) and J(h, w) that lie within the boundaries of the mesh system along the four boundaries of B and put them in the candidate_list; end for;
4. get the first submesh S in the candidate_list;
5. for each job B in the busy_list do if S overlaps with B, then remove S from the candidate_list and goto step 8; else if S is adjacent to B, then multiply their ECTD by the length of their common boundary, add it to TECTD, and mark those neighboring nodes of S as busy; end for;
6. multiply the number of free neighboring nodes of S by the EET of J(w, h), and add it to TECTD;
7. if TECTD < min_TECTD, then choice_submesh = S; min_TECTD = TECTD;
8. if S is not the last submesh in the candidate_list, then get the next submesh in the candidate_list, let it be S, and goto step 5;
9. if the candidate_list is empty, then put J(w, h) in the waiting queue and stop; else allocate J(w, h) or J(h, w) to choice_submesh and stop;

Figure 7.2: The EET-SB allocation strategy.

Allocate J(w, h)
1. compute the coverage array for J(w, h);
2. scan all the rows in the coverage array and mark the two end elements of each sequence of consecutive 0's as end;
3. scan all the columns in the coverage array; if any element is an end element of a consecutive sequence of 0's in a column, and it is previously marked as end, then add it to the corner_list;
4. compute the TECTD for the candidate submesh based at each corner;
5. compute the coverage array for J(h, w);
6. follow steps 2, 3, and 4 to identify corners and their corresponding TECTD;
7. if the corner_list is empty, then put J(w, h) in the waiting queue and stop; else choose the candidate submesh with the smallest TECTD and stop;

Figure 7.3: The EET-SC allocation strategy.

7.3 Performance Results

In this section, we present the simulation results comparing the performance of our EET strategies with the BL and BF strategies. In order to achieve the peak performance of the BL strategy, we implement it in such a way that it will consider all free neighboring nodes and always find the free submesh with the highest boundary value. In order to have a fair comparison, the BF strategy is implemented in such a way that it can rotate the requested submesh as in the EET-SC strategy. In addition to the parameters we described in Chapter 3, the estimated execution time for each job is needed for this simulation of the EET strategies. We generate an estimated execution time x for each job according to the exact job execution time y of the job by a normal distribution with mean y and standard deviation c × y, where c is a coefficient indicating the accuracy of the estimation. In our experiments, c ranges from 0 (perfect estimation) to 1. If the estimated execution time generated is negative, it will be set to zero. We measure the system performance based on the mean response time. To have an accurate measurement, only stable systems are considered.
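The estimated execution times used in these experiments are easy to generate; the sketch below mirrors the description above (a normal distribution with mean y and standard deviation c × y, with negative draws set to zero). The function name is illustrative rather than taken from the simulation tool.

    import random

    def estimated_execution_time(y, c, rng=random):
        # Draw an estimate for a job whose exact execution time is y (sketch).
        # c = 0 reproduces y exactly; larger c gives noisier estimates.
        estimate = rng.gauss(y, c * y)
        return max(estimate, 0.0)   # negative draws are clipped to zero

With c = 1, roughly 16% of draws from such a distribution fall below zero (one standard deviation below the mean), so a noticeable fraction of jobs end up with a zero estimate in that setting.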
Figure 7.4 illustrates the mean response time under different workloads using the BL and EET-SB strategies with the coefficient c equal to 0.0, 0.5, and 1.0. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.3), all allocation strategies have similar performance. Most jobs will be served promptly after they enter the system. For load greater than 0.4, the EET-SB strategy shows a performance gain. When the load reaches 0.5, the EET-SB strategy reduces the mean response time compared to the BL allocation strategy by 18% to 27% under different estimation accuracies. When the load reaches 0.6, the system becomes unstable, so we report the results only up to a load of 0.5.

Figure 7.4: Mean response time vs. load (w and h: uniform distribution).

The EET-SB strategy outperforms the BL strategy under all workloads when the width and height of requested submeshes are uniformly distributed. As the execution time estimation error decreases, the mean response time becomes smaller. When the coefficient c is 1, the standard deviation of the estimated execution time equals the exact execution time. The estimation error is very large under this condition, and over 16% of the jobs will have their estimated execution time set to zero. In a real system, it is reasonable that some users have no idea about the execution times of their submitted jobs. The system can set a default value or zero for those jobs for the purpose of allocation. In our simulation, the EET-SB strategy outperforms the BL strategy even under the worst case of c = 1. When the coefficient c is 0, the exact execution time is used to select the candidate submesh and the EET-SB strategy performs the best.

Figure 7.5: Mean response time vs. load (w and h: increasing distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, increasing and decreasing distributions for the width w and height h of submeshes requested. The performance results of the mean response time for the BL and EET-SB strategies when the width and height of requested submeshes are distributed under the increasing distribution are presented in Figure 7.5. Performance results are similar to those of the uniform distribution case. However, the performance difference among the various strategies is not as significant as that of the previous case, for the same reason as we discussed earlier.

Figure 7.6 shows the mean response time versus load for the BL and EET-SB strategies under the decreasing distribution of job requests.

Figure 7.6: Mean response time vs. load (w and h: decreasing distribution).

With a decreasing distribution, the system has more small jobs than large ones. The processor allocation strategy plays an important role when the system load is heavy.
As the load reached 0.5, the EET-SB strategy reduced the mean response time by 15% to 55% compared to the BL strategy under different estimation accuracies. Figures 7.4 to 7.6 show that the EET strategy outperforms the BL strategy under all load conditions, various job characteristics, and different estimation accuracies.

Because the performance difference is small when the width and height of the requested submesh are distributed under the increasing distribution, only the results for the uniform and decreasing distributions are presented for the BF and EET-SC strategies. Figures 7.7 and 7.8 show the mean response time under various loads for the BF and EET-SC strategies when w and h are distributed under the uniform and decreasing distributions, respectively.

Figure 7.7: Mean response time vs. load (w and h: uniform distribution).

Figure 7.8: Mean response time vs. load (w and h: decreasing distribution).

As we can observe from Figures 7.7 and 7.8, the EET-SC strategy outperforms the BF strategy. In order to compare the performance of all the strategies discussed above, we combine Figure 7.4 with Figure 7.7 and Figure 7.6 with Figure 7.8 into Figures 7.9 and 7.10, respectively.

Figure 7.9: Mean response time vs. load (w and h: uniform distribution).

Figure 7.10: Mean response time vs. load (w and h: decreasing distribution).

We can see that the BL strategy outperforms the BF strategy under all conditions. The EET-SC and EET-SB strategies have similar performance. The EET-SC strategy has a slight advantage over the EET-SB strategy in the c = 1 case, while the EET-SB strategy performs better than the EET-SC strategy when c = 0. When the EET-SB strategy is used, we have all the candidate submeshes along the boundaries to choose from, while the EET-SC strategy considers only corners. It is reasonable that the EET-SB strategy can find a better fit than the EET-SC strategy when the execution time estimation is precise. In the case of c = 1, since the EET-SC strategy considers only corners and the EET-SB strategy considers all boundaries, the EET-SB strategy is more likely to choose a bad candidate submesh because of the large estimation error. Finally, the EET strategies outperform the BF and BL strategies under all conditions.

7.4 Discussion

In this chapter, we have investigated the effect of using estimated execution time to guide submesh allocation. We have developed two processor allocation strategies for two-dimensional mesh-connected systems, called the EET-SB and EET-SC strategies, which utilize the estimated execution time information provided by users to allocate jobs with similar finishing times in the same areas.
Extensive simulation runs have been performed to compare various strategies. The results indicate that both EET strategies have better performance than previously proposed strategies under all circumstances. Inaccurate estimation of the execution time will simply cause less overall system performance improvement, but it will not cause any execution failures.

As we have discussed in Chapter 6, systems using the SJF scheduling scheme may encounter some difficulties if the users know that the system is designed to favor jobs with small estimated execution times and give small estimates for their jobs. Unlike the Shortest-Job-First (SJF) scheduling scheme, the EET strategies do not assign priorities to jobs based on their execution times. Users who intentionally give smaller or larger estimates of the execution times will not change their priority of execution, but will degrade the overall system performance and result in a longer response time. So, users have no reason to fool the system by intentionally giving inaccurate estimates.

Basically, the EET-SC and EET-SB strategies improve the performance relative to strategies that do not use execution time information. Both EET-SC and EET-SB use the same candidate submesh selection heuristic, and the difference between them is the candidate submesh generation method and the detailed implementation. The performance of the EET-SC and EET-SB strategies is very similar, and the reason why they perform better is the candidate submesh selection heuristic using estimated time information. In the next chapter, we will introduce a new submesh allocation strategy and an efficient candidate submesh generation method. The new candidate submesh generation method can be adapted for the EET strategies. The job scheduling schemes presented in Chapters 5 and 6 can also be applied to further improve the performance.

Chapter 8
Busy Distance Inverse Allocation Strategy

Generally, submesh allocation strategies can be divided into two categories, first-fit and best-fit. A first-fit submesh allocation strategy allocates the job to the first free submesh it detects. A best-fit strategy identifies several candidate submeshes, and the job is allocated to the best-fit candidate based on a certain heuristic.

In general, a first-fit allocation strategy tends to divide a large free space into disjoint, small pieces, causing large external fragmentation. Therefore, a carefully designed best-fit strategy is the key to better performance. Existing best-fit allocation strategies improve upon the fragmentation problem but leave plenty of room for further improvement. In this chapter, we present a new best-fit processor allocation strategy called the Busy Distance Inverse (BDI) strategy for two-dimensional mesh systems. As with other best-fit strategies, it tends to allocate a job to a submesh that is in close proximity to other busy submeshes or mesh boundaries; however, it incorporates more information in its heuristic. Yet, it yields the least complexity among all best-fit strategies. The performance of the BDI strategy is compared with other best-fit strategies, including Busy-List and Best-Fit, under the FCFS scheduling discipline. In the following sections, we will give the basic ideas and detailed algorithms of the BDI strategy, and then present the performance results.

8.1 Basic Ideas

Intuitively, it is better to allocate a job to processors in the near neighborhood of other busy submeshes and leave a larger contiguous free space for future jobs, thus reducing the external fragmentation.
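The difference between the two categories comes down to how a candidate is chosen from the feasible free submeshes. The Python fragment below contrasts the two selection rules in the abstract; the candidate list and the score function stand in for whatever a particular strategy provides and are not taken from the dissertation.

    def first_fit(candidates):
        # Take the first feasible candidate submesh detected, or None (sketch).
        return next(iter(candidates), None)

    def best_fit(candidates, score):
        # Take the candidate with the best strategy-specific score (sketch).
        # For the BL strategy the score could be the boundary value and for BDI
        # the busy distance inverse (both maximized); a strategy that minimizes
        # a quantity, such as TECTD, can simply negate its score.
        return max(candidates, default=None, key=score)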
The BF strategy is based on the heuristic of choosing a corner in the coverage array having the largest number of 1's surrounding it as the base of the free submesh. The BL strategy selects the candidate submesh having the most busy neighbors. These two strategies consider only the number of 1's surrounding corners in the coverage array or the number of busy neighbors of candidate submeshes. Neither of these strategies considers busy processors two or more hops away in the mesh.

Consider the 10 × 10 mesh in Figure 8.1. A solid dot represents a busy processor and a circle represents a free processor.

Figure 8.1: A two-dimensional mesh M(10, 10).

The corresponding busy array B[10, 10] and the coverage array C[10, 10] (including the reject set) for an incoming job J(3, 3) follow from the figure. There are two 3-corners, C[0, 5] and C[6, 4], in the coverage array. The area of C[0, 5] is six while the area of C[6, 4] is three. So, C[6, 4] will be selected as the base of the submesh for the incoming job J(3, 3) if the Best Fit strategy is used. If we use the Busy List strategy, J(3, 3) could be allocated to either submesh <0, 5, 2, 7> or <6, 4, 8, 6> because both have the same highest boundary value of seven among all candidate submeshes. However, the lower left corner of the mesh might be a better choice for allocating J(3, 3), because it leaves a larger free space for later jobs.

We present a new processor allocation strategy, called Busy Distance Inverse (BDI), which considers busy processors two and even more hops away. It allocates the incoming job in the most crowded area and leaves a larger free space for later jobs. In order to make the busy processors not adjacent to the candidate submesh still have some effect on the best-fit submesh selection, we design the BDI strategy so that each corner processor of a candidate submesh calculates the distance to the nearest busy processor or mesh boundary (we call it the busy distance) in each direction. The larger the busy distances of a candidate submesh, the less it satisfies the best-fit heuristic. There are three major parts of the BDI strategy: (1) the best-fit heuristic, (2) the candidate submesh generation methods, and (3) an efficient algorithm to implement it.

8.2 Best-Fit Heuristic

We need the following definitions to help us describe the detailed algorithm.

Definition 8.1 The processors at the four corners, lower-left, lower-right, upper-left, upper-right, of a candidate submesh A are denoted as LL(A), LR(A), UL(A), UR(A), respectively.

Definition 8.2 The busy distance from a corner of a candidate submesh A is the number of hops from it to the nearest busy processor in a given direction. If the corner is adjacent to a boundary in a given direction, then that busy distance is assumed to have a value of one.
Definition 8.3 For a candidate submesh A, its busy distance vector has eight components, consisting of the two busy distances at each of the four corners of A, which are denoted: LLS(A), LLW(A), LRS(A), LRE(A), ULN(A), ULW(A), URN(A), and URE(A). They are depicted in Figure 8.2. LLS(A) refers to the busy distance from LL(A) to the nearest busy processor in the south direction. All other components are similarly defined.

Figure 8.2: Busy distance vectors.

Definition 8.4 The busy distance inverse (BDI) of a candidate submesh A, BDI(A), is defined as
BDI(A) = 1/LLS(A) + 1/LLW(A) + 1/LRS(A) + 1/LRE(A) + 1/ULN(A) + 1/ULW(A) + 1/URN(A) + 1/URE(A).

As an example, consider the 10 × 10 mesh given in Figure 8.2. If the incoming job requests a 3 × 3 submesh, there are several candidate submeshes available. Consider two candidate submeshes A, <0, 0, 2, 2>, and B, <0, 5, 2, 7>. Applying the above definitions to the busy processors shown in Figure 8.2, we can determine that BDI(A) = 6 and BDI(B) = 5.64. The best-fit selection heuristic is to choose the candidate submesh having the largest BDI value. Thus, in this example, A is better than B.

8.3 Candidate Submesh Generation

The BF strategy generates candidate submeshes on the corners in the coverage array and has a time complexity of O(N), where N is the number of processors in the system. The BL strategy generates candidate submeshes along the four corners of the allocated submeshes, and its complexity is O(N_A^3), where N_A is the number of allocated submeshes. The Lookahead strategies divide the free processors that can serve as the base of the requested submesh into rectangular free regions, and the candidate submeshes are based at the lower-left corners of these free regions. This generates O(N_A) candidate submeshes and has a time complexity of O(N_A^2) [5]. Usually the number of allocated submeshes, N_A, is much smaller than N, the number of processors in the system. Das Sharma and Pradhan [21] showed that the BL strategy has much lower average allocation overhead compared with the strategies having O(N) complexity. The Lookahead strategy offers even lower time complexity than the BL strategy does.

Based on the simulation results collected using our simulation tool (which we describe in detail in the next section), the BDI best-fit selection heuristic combined with the BF and BL strategies' candidate submesh generation methods offers similar performance. However, the Lookahead strategy's candidate submesh generation method performs poorly with the BDI heuristic. One possible reason is that this method restricts itself to checking only one corner of each free region, and this may disregard better candidate submeshes.
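Whichever generation method supplies the candidates, the BDI heuristic itself scores each one as in Definitions 8.2 through 8.4. The Python sketch below computes the busy distance vector and BDI value of a single candidate from the list of allocated submeshes; the (x1, y1, x2, y2) tuple representation and the helper names are assumptions made for illustration, not the dissertation's data structures.

    def bdi(candidate, busy_submeshes, W, H):
        # Busy distance inverse of one candidate submesh (sketch).
        # Submeshes are (x1, y1, x2, y2) with inclusive corners; W and H are
        # the mesh dimensions.
        x1, y1, x2, y2 = candidate

        def busy(px, py):
            return any(bx1 <= px <= bx2 and by1 <= py <= by2
                       for bx1, by1, bx2, by2 in busy_submeshes)

        def distance(px, py, dx, dy):
            # Hops to the first processor that is busy or lies beyond the mesh
            # boundary; a corner sitting on the boundary therefore gets 1.
            hops, qx, qy = 1, px + dx, py + dy
            while 0 <= qx < W and 0 <= qy < H and not busy(qx, qy):
                hops, qx, qy = hops + 1, qx + dx, qy + dy
            return hops

        components = [
            distance(x1, y1, 0, -1), distance(x1, y1, -1, 0),   # LLS, LLW
            distance(x2, y1, 0, -1), distance(x2, y1, 1, 0),    # LRS, LRE
            distance(x1, y2, 0, 1),  distance(x1, y2, -1, 0),   # ULN, ULW
            distance(x2, y2, 0, 1),  distance(x2, y2, 1, 0),    # URN, URE
        ]
        return sum(1.0 / d for d in components)

The allocator evaluates this value for each candidate and keeps the one with the largest BDI.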
Instead of using only the lower-left corners of free regions as the bases of candidate submeshes, if we consider all four corners of the free regions as possible bases, the BDI heuristic offers similar performance using all three strategies' candidate submesh generation methods. So, the preferred submesh generation method is the Lookahead strategy, since it offers lower complexity.

8.4 Detailed Algorithm

Using the Lookahead strategy's candidate submesh generation method, we can generate O(N_A) candidates in O(N_A^2) time. Each candidate must consider all N_A allocated submeshes to calculate the eight components of its busy distance vector. A candidate takes O(1) time to compare with an allocated submesh and update the eight components of its busy distance vector, so it takes O(N_A^2) time for O(N_A) candidates to compare with N_A allocated submeshes. It takes O(N_A) time to calculate the busy distance inverse values for O(N_A) candidates and choose the candidate with the largest BDI value. So, the overall complexity of the BDI strategy is O(N_A^2), the same as the Lookahead strategies. The detailed algorithm is presented in Figure 8.3.

1. let J(w, h) be the first job in the queue;
2. if the busy set is empty, then allocate J(w, h) to <0, 0, w − 1, h − 1> and stop;
3. generate coverage submeshes for J(w, h);
4. starting with a single free region <0, 0, W − w + 1, H − h + 1>, subtract the coverage submeshes for J(w, h) and generate a list of free regions;
5. if the list of free regions is not empty, then calculate the BDI values for all the candidate submeshes based on the four corners of all the free regions, choose the candidate submesh having the largest BDI value, and stop;
6. if w ≠ h, then repeat steps 3 to 5 for J(h, w);
7. put J(w, h) back at the front of the queue and stop;

Figure 8.3: The Busy Distance Inverse strategy.

8.5 Performance Results

Via simulation, we compared the performance of the BDI strategy with the BL and BF strategies. In order to have a fair comparison, the BF strategy is implemented in such a way that it can rotate the requested submesh by 90 degrees as in the BL and BDI strategies. We measure the system performance based on the mean and standard deviation of response time. To have an accurate measurement, only stable systems are considered.

Figure 8.4 illustrates the mean response time under different workloads using the BF, BL, and BDI strategies. The width and height of the requested submeshes are uniformly distributed from 1 to 32. When the load is light (less than 0.3), all allocation strategies have similar performance. Most jobs are served promptly after they enter the system. For load greater than 0.4, the BDI strategy shows a performance gain. When the load reaches 0.5, the BDI strategy reduces the mean response time compared to the BL allocation strategy by 18%. When the load reaches 0.6, the system becomes unstable, so we report the results only up to a load of 0.5. Note that in a recent job characteristics study [29], the observed system utilization was above 40% for over 80% of the time on weekdays in a practical scientific parallel computing environment.

Figure 8.4: Mean response time vs. load (w and h: uniform distribution).

The standard deviation of the response time is presented in Figure 8.5.
Figures 8.4 and 8.5 indicate that the BDI strategy outperforms the BL and BF strategies under all workloads when the width and height of requested submeshes are uniformly distributed.

Figure 8.5: Standard deviation of response time vs. load (w and h: uniform distribution).

The same experiment is conducted to consider different characteristics of job requests, namely, increasing and decreasing distributions for the width w and height h of submeshes requested. The simulation results of the mean and standard deviation of response time for the BF, BL, and BDI strategies when the width and height of requested submeshes are distributed under the increasing distribution are presented in Figures 8.6 and 8.7. Performance results are similar to those of the uniform distribution case. However, the performance difference among the various strategies is not as significant as that of the previous case. This moderate performance improvement for the increasing distribution of job requests is expected. The system tends to have more jobs requesting large submeshes in an increasing distribution and, hence, there is a reduced chance of allocating large jobs into the system while some jobs are already in the system. The extreme case is that all jobs request the full mesh, so no matter which processor allocation strategy is used, the performance remains the same.

Figure 8.6: Mean response time vs. load (w and h: increasing distribution).

Figure 8.7: Standard deviation of response time vs. load (w and h: increasing distribution).

Figures 8.8 and 8.9 show the mean and standard deviation of response time versus load, respectively, for the BF, BL, and BDI strategies under the decreasing distribution of job requests.

Figure 8.8: Mean response time vs. load (w and h: decreasing distribution).

With a decreasing distribution, the system has more small jobs than large ones. The processor allocation strategy plays an important role when the system load is heavy. As the load reaches 0.5, the BDI strategy reduces the mean response time by 55% compared to the BL strategy. Figures 8.4 to 8.8 show that the BDI strategy outperforms the BL strategy under all load conditions and various job characteristics. The main reason why the BDI strategy performs better is that it considers not only adjacent busy processors of a candidate submesh, but also busy processors two or more hops away, thus having a global view of the system and a superior best-fit heuristic.

Figure 8.9: Standard deviation of response time vs. load (w and h: decreasing distribution).

8.6 Discussion

In this chapter, we have presented a new processor allocation strategy, called the Busy Distance Inverse strategy.
It is a best-fit strategy that tries to allocate a submesh not only adjacent to busy submeshes or mesh boundaries, but also in close proximity to other busy submeshes or mesh boundaries.

Extensive simulation runs have been performed to compare the BDI strategy with various strategies. The results indicate that it performs better than the BF and BL strategies under all load conditions and job characteristics. Moreover, the BDI strategy can be effectively applied in practice. It can be implemented efficiently, is scalable (i.e., does not depend directly on the mesh size), and performs well under load conditions that occur in actual scientific parallel computing environments.

Chapter 9
Conclusions

Mesh connected parallel architectures have become increasingly popular in the design of multiprocessor systems in recent years. To allow the best usage of these systems, effective submesh allocation and job scheduling schemes are desirable. Continuous allocations and deallocations of various sized submeshes result in a fragmented mesh system. We have developed efficient and effective submesh allocation and scheduling schemes on 2D mesh systems. This research improves the system utilization and reduces the mean job response time significantly. Further, we have built a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers to evaluate the performance of existing schemes as well as our proposed schemes. In this chapter, we summarize the contributions of this research in Section 9.1. Then, we suggest some future research directions in Section 9.2.

9.1 Contributions

We have developed a simulation and experiment analysis tool called Sched-All for job scheduling and allocation on 2D mesh multicomputers. Sched-All is scalable in the sizes of mesh which can be simulated. It can evaluate the performance of various submesh allocation and job scheduling schemes on a mesh system of size up to 128 × 128. It includes a performance evaluation simulator, an experiment analyzer, and a performance plotter. It supports a comprehensive set of processor allocation and scheduling schemes, and its modular and object-oriented design makes it very easy to maintain and to integrate new schemes into it.

We first incorporated job scheduling for processor allocation on 2D mesh multicomputers. We have developed two new job scheduling schemes, Immediate Fit and Scan All. They can be used with any submesh allocation strategy. A waiting time limit is introduced to avoid a starvation problem. Simulation results show that our scheduling schemes significantly improve the system utilization and improve the mean and standard deviation of response time for different load conditions and job characteristics.

Based on the Scan All scheduling scheme, we developed a starvation-free job scheduling scheme, Multiple Queues, that can also be used with any existing submesh allocation strategy. The effect of various ways of using multiple queues for job scheduling in 2D mesh systems is investigated. Simulation results show that the Multiple Queues scheme further improves the performance of the SA scheme under different load conditions and job characteristics. A higher system utilization and lower internal fragmentation are also achieved.

Turning to processor allocation strategies, we have presented pioneering work to guide processor allocation on 2D mesh multicomputers with estimated execution time.
We have developed two processor allocation strategies for two-dimensional mesh-connected systems, called the EET-SB and EET-SC strategies, which utilize the estimated execution time information provided by users to allocate jobs with similar finishing times in the same areas. Again, extensive simulation runs have been performed to compare various strategies. The results indicate that both EET strategies have better performance than previously proposed strategies under all circumstances. Inaccurate estimation of execution time will simply cause less overall system performance improvement, but not cause any execution failures.

Although users usually can quite accurately estimate the execution time of their jobs, people may argue that execution time information is not always available. We have also developed a new processor allocation strategy, called the Busy Distance Inverse (BDI) strategy, which does not use the execution time information. It is a best-fit strategy that tries to allocate a submesh not only adjacent to busy submeshes or mesh boundaries, but also in close proximity to other busy submeshes or mesh boundaries. The consequence is that it performs the best among all submesh allocation strategies that do not use execution time information. Moreover, the BDI strategy has the lowest time complexity among all submesh allocation strategies and can be effectively applied in practice. It can be implemented efficiently, is scalable (i.e., does not depend directly on the mesh size), and performs well under load conditions that occur in actual scientific parallel computing environments.

9.2 Future Work

Our Sched-All tool can be extended to enhance its functionality. For example, it would be useful to integrate it with the visualization tool ProcSimity (described in Section 2.6). ProcSimity provides a dynamic display of allocation and deallocation events and performance metrics. We can modify the Sched-All simulator to generate tracefiles in the format readable by ProcSimity's visualization tool. Further, topologies other than the 2D mesh (for example, hypercubes) and their subcube allocation strategies could also be integrated in the Sched-All tool.

In this research as well as related research, a probabilistic workload is used to drive the simulator to evaluate the performance of various schemes. We considered uniform, decreasing, and increasing distributions of jobs in the simulation tool. Other distributions, for example, a bimodal distribution, could also be considered. Further, it is very desirable to collect actual workload data from a system serving multiple, mixed jobs to drive the simulator.

Prototypical implementation of the existing schemes and the new schemes that we have developed is necessary to measure the actual overheads to the system and the actual performance. We have obtained preliminary results for selected simulation runs as noted in Section 3.3. However, further results are needed from a more extensive implementation. This is essential prior to adopting a particular strategy in practice. The assumption of negligible overheads needs to be supported quantitatively.

Bibliography

[1] A. Al-Dhelaan and B. Bose, "A New Strategy for Processor Allocation in an N-cube Multiprocessor," Proc. International Phoenix Conference on Computers and Communications, pp. 114-118, 1989.

[2] R. Alverson, et al., "The Tera Computer System," Proc. International Conference on Supercomputing, pp. 1-6, Jun. 1990.

[3] D. Babbar and P.
Krueger, "A Performance Comparison of Processor Allocation and Job Scheduling Algorithms for Mesh-Connected Multiprocessors," Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 46-53, 1994.

[4] B. S. Baker, E. G. Coffman, Jr., and R. L. Rivest, "Orthogonal Packings in Two Dimensions," SIAM J. Comput., pp. 846-855, Nov. 1980.

[5] S. Bhattacharya and W.-T. Tsai, "Lookahead Processor Allocation in Mesh-Connected Massively Parallel Multicomputer," Proc. International Parallel Processing Symposium, pp. 868-875, 1994.

[6] G.-I. Chen and T.-H. Lai, "Scheduling Independent Jobs on Hypercubes," Proc. 5th Symp. on Theoretical Aspects of Computer Science, pp. 273-280, 1988.

[7] G.-I. Chen and T.-H. Lai, "Preemptive Scheduling of Independent Jobs on a Hypercube," Information Processing Letters, no. 28, pp. 201-206, 1988.

[8] M. S. Chen and K. G. Shin, "Processor Allocation in an N-cube Multiprocessor," IEEE Transactions on Computers, pp. 1396-1407, Dec. 1987.

[9] Y.-K. Chu and D. T. Rover, "An Effective Two-Dimensional Mesh Partitioning Strategy," Parallel Processing Letters, special issue on Partitioning and Scheduling for Parallel and Distributed Systems (accepted and to be published in Dec. 1995).

[10] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Processor Allocation in Mesh Systems Using Multiple Queues," Tech. Rep. CPS-93-33, Dept. of Computer Science, Michigan State University, June 1994.

[11] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Incorporating Job Scheduling for Processor Allocation on Two-Dimensional Mesh-Connected Systems," Proc. 7th International Conference on Parallel and Distributed Computing Systems, pp. 124-129, Oct. 1994.

[12] Y.-K. Chu, I.-L. Yen, and D. T. Rover, "Guiding Processor Allocation with Estimated Execution Time for Mesh Connected Multiple Processor Systems," Proc. 28th Hawaii International Conference of System Sciences, pp. 163-172, Jan. 1995.

[13] P.-J. Chuang and N.-F. Tzeng, "An Efficient Submesh Allocation Strategy for Mesh Computer Systems," Proc. International Conference on Distributed Computing Systems, pp. 256-263, May 1991.

[14] P.-J. Chuang and N.-F. Tzeng, "A Fast Recognition-Complete Processor Allocation Strategy for Hypercube Computers," IEEE Transactions on Computers, pp. 467-479, Apr. 1992.

[15] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, "Approximation Algorithms for Bin-Packing - an Updated Survey," Algorithm Design for Computer System Design (G. Ausiello, M. Lucertini, P. Serafini, ed.), pp. 49-106, Springer-Verlag, New York, 1984.

[16] E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan, "Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms," SIAM J. Comput., pp. 808-826, Nov. 1980.

[17] E. G. Coffman, Jr., J. Y.-T. Leung, and D. Slutz, "On the Optimality of First-Fit and Level Algorithms for Parallel Machine Assignments and Sequencing," Proc. International Conference on Parallel Processing, pp. 95-99, 1977.

[18] D. Das Sharma, G. D. Holland, and D. K. Pradhan, "Subcube Level Time-Sharing in Hypercube Multicomputers," Proc. International Conference on Parallel Processing, vol. II, pp. 134-142, 1994.

[19] D. Das Sharma and D. K. Pradhan, "A Novel Approach for Subcube Allocation in Hypercube Multiprocessors," Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 336-345, Dec. 1992.

[20] D. Das Sharma and D. K. Pradhan, "Submesh Allocation in Mesh Multicomputers," Dept. of Computer Science, Texas A&M University, Tech. Rep. 93-043, 1993.

[21] D. Das Sharma and D. K.
Pradhan, “A Fast and Efficient Strategy for Sub- mesh Allocation in Mesh—Connected Parallel Computers,” Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 682-689, Dec. 1993. D. Das Sharma and D. K. Pradhan, “Fast and Efficient Strategies for Cubic and Non-Cubic Allocation in Hypercube Multiprocessors,” Proc. International Conference on Parallel Processing, vol. I, pp. 118-127, 1993. D. Das Sharma and D. K. Pradhan, “Job Scheduling in Mesh Multicomputers,” Proc. International Conference on Parallel Processing, vol. II, pp. 251-258, 1994. 130 [24] H. M. Deitel, An Introduction to Operating Systems, Addison-Wesley, 1983. [25] J. Ding and L. N. Bhuyan, “An Adaptive Submesh Allocation Strategy for Two— Dimensional Mesh Connected Systems,” Proc. 1993 International Conference on Parallel Processing, vol. II, pp. 193-200, 1993. [26] S. Dutt and J. P. Hayes, “On Allocating Subcubes in a Hypercube Multipro- cessor,” Proc. Third Conference on Hypercube Computers and Applications, pp. 801-810, 1988. [27] S. Dutt and J. P. Hayes, “Subcube Allocation in Hypercube Computers,” IEEE Transactions on Computer, pp. 341-352, Mar. 1991. [28] H. El—Rewini, T. G. Lewis, and H. H. Ali, Task Scheduling in Parallel and Dis- tributed Systems, Prentice Hall, 1994. [29] D. G. Feitelson and B. Nitzberg, “Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC / 860,” Tech Report RC 19773, IBM T. J. Watson Research Center, Oct. 1994. [30] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979. [31] V. P. Gopinath, T. A. Grotjohn, Y.-K. Chu, and D. T. Rover, “Three Dimen- sional Simulations of Plasmas Using MPPs,” Proc. 7th SIAM Conference on Parallel Processing for Scientific Computing, pp. 201-206, Feb. 1995. [32] V. P. Gopinath, T. A. Grotjohn, Y.-K. Chu, and D. T. Rover, “Parallelization and Performance of Three-Dimensional Plasma Simulation,” Proc. 5th Sympo- sium on the Frontiers of Massively Parallel Computation, pp. 148-155, Feb. 1995. [33] M. T. Heath and J. A. Etheridge, Visualizing performance of parallel programs, Tech. Rep. ORNL/TM-11813, Oak Ridge National Laboratory, Oak Ridge, TN, May. 1991. [34] R. Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons,1991. [35] K. Kant, Introduction to Computer System Performance Evaluation, McGraw- Hill, 1992. [36] J. Kim, C. R. Das, and W. Lin, “A Processor Allocation Scheme for Hypercube Compu,” Proc. International Conference on Parallel Processing, vol. II, pp. 231- 238, 1989. [37] J. Kim, C. R. Das, and W. Lin, “A Top-Down Processor Allocation Scheme for Hypercube Computers,” IEEE Transactions on Parallel and Distributed Systems, vol. 2, pp. 20-30, Jan. 1991. [381 [39] [40] [41] [4‘21 [43] [44] [451 [461 [47] [48] [49] [50] 131 K. C. Knowlton, “A Fast Storage Allocator,” Communications of the ACM, pp. 623-625, Oct. 1965. P. Krueger, T.-H. Lai, and V. A. Radiya, “Processor Allocation vs. Job Schedul- ing on Hypercube Computers,” Proc. International Conference on Distributed Computing Systems, pp. 394-401, May. 1991. P. Krueger, T.-H. Lai, and V. A. Radiya, “Job Scheduling is More Important than Processor Allocation for Hypercube Computers,” IEEE Transactions on Parallel and Distributed Systems, May. 1994. K. Li and K. H. Cheng, “Job Scheduling in Partitionable Mesh Connected Sys- tems,” Proc. International Conference on Parallel Processing, vol. II, pp. 65—72, Aug. 1989. K. Li and K. H. 
Cheng, “Complexity of Resource Allocation and Job Schedul- ing Problems in Partitionable Mesh Connected Systems,” Proc. IEEE Symp. on Parallel and Distributed Processing, pp. 358-365, 1989. K. Li and K. H. Cheng, “Static Job Scheduling in Partitionable Mesh Connected Systems,” Journal of Parallel and Distributed Computing, pp. 152-159, Oct. 1990. K. Li and K. H. Cheng, “Job Scheduling in PMCS Using a 2DBS as the System Partitioning Scheme,” Proc. International Conference on Parallel Processing, vol. I, pp. 119-122, 1990. K. Li and K. H. Cheng, “A Two Dimensional Buddy System for Dynamic Re- source Allocation in a Partitionable Mesh Connected System,” Proc. ACM Com- puter Science Conference, pp. 22-28, Feb. 1990. K. Li and K. H. Cheng, “A Two—Dimensional Buddy System for Dynamic Re— source Allocation in a Partitionable Mesh Connected System,” Journal of Par- allel and Distributed Computing, pp. 79-83”, May 1991. K. Li and K. H. Cheng, “Job Scheduling in a Partitionable Mesh Using a Two- Dimensional Buddy System Partitioning Scheme,” IEEE Transactions on Par- allel and Distributed Systems, pp. 413-422”, 1991. W. Liu, V. Lo, K. Windisch, and B. Nitzberg, “Non-contiguous Processor Allo- cation Algorithms for Distributed Memory Multicomputers,” Proc. International Conference on Supercomputing, pp. 227-236, 1994. P. Mohapatra, C. Yu, C. R. Das, and J. Kim, “A Lazy Scheduling Scheme for Improving Hypercube Performance,” Proc. International Conference on Parallel Processing, vol. I, pp. 110-117, 1993. L. M. Ni and P. K. McKinley, “A Survey of Wormhole Routing Techniques in Direct Networks,” Computer, vol. 26, pp. 62-76, Feb. 1993. 132 [51] J. K. Ousterhout, Tcl and the Tk Toolkit, Addison-Wesley, 1994. [52] H. Schwetman, CSIM User’s Guide, Microelectronics and Computer Technology Corporation, 1991. [53] T. J. Teorey and TB. Pinkerton, “A Comparative Analysis of Disk Scheduling Policies”, Communications of ACM, pp. 177-184, March, 1972. [54] T. Williams and C. Kelley, GUNPLOT: An Interactive Plottong Program, 1993. [55] K. Windisch, J. V. Miller, and V. Lo, “ProcSimity: An Experimental Tool for Processor Allocation and Scheduling in Highly Parallel Systems,” The Fifth Sym- posium on the Frontiers of Massively Parallel Computation, pp. 414-421, Feb., 1995. [56] C. Yu and C. R. Das, “Limit Allocation: An Efficient Processor Management Scheme for Hypercubes,” Proc. International Conference on Parallel Processing, vol. II, pp. 143-150, 1993. [57] Y. Zhu, “Efficient Processor Allocation Strategies for Mesh-Connected Parallel Computers,” Journal of Parallel and Distributed Computing, vol. 16, pp. 328-337, Dec. 1992. [58] Y. Zhu and M. Ahuja, “Preemptive Job Scheduling on a Hypercube,” Proc. International Conference on Parallel Processing, vol. I, pp. 301-304, 1990. [59] Y. Zhu and M. Ahuja, “Job Scheduling on a Hypercube,” Proc. International Conference on Distributed Computing Systems, pp. 510-516, 1990. [60] Paragon XP/S Product Overview, Intel Corporation, 1991. [61] A Touchstone DELTA System Description, Intel Corporation, 1991. "I111111111111111“