PARALLEL COMPUTATION MODELS: REPRESENTATION, ANALYSIS AND APPLICATIONS

By

Xian-He Sun

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science

1990

ABSTRACT

PARALLEL COMPUTATION MODELS: REPRESENTATION, ANALYSIS AND APPLICATIONS

By Xian-He Sun

Although there are various multiprocessors available, both software support and algorithm development for these machines are far behind their hardware counterpart. This is especially true for multicomputers, in which each processor has its own local memory. A new concept, compute-exchange computation, is proposed for efficient parallel algorithm design in multicomputers. Based on this new concept, the most frequently used scientific and engineering applications can be described by a simple structured representation. Structured design and rethinking, therefore, become possible. The basic building blocks of these structures, called parallel computation models, are identified and studied. A performance metric of parallel processing, parallel speedup, is carefully examined. A new model of speedup is presented which considers the memory capacity as an influential factor and provides the quantum of the tradeoff between time and space.

These research results are useful in many areas of computer science. The dissertation focuses on two areas: efficient algorithm design and performance prediction. Examples chosen from real applications will be used in both areas to illustrate the concept and usefulness of the computation models.
The applications used for efficient algorithm design are solving large scale tridiagonal systems and solving electrical power flow problems. Several novel algorithms are developed for these two applications based on combinations of computation models. Implementation results confirm that these algorithms are competitive with any existing algorithm in the field.

Copyright by Xian-He Sun, 1990

Dedicated to my parents Yu Lin and Chang-Xiang Sun and to my family.

ACKNOWLEDGMENTS

I would like to express my sincere appreciation to all of the faculty members who have helped and contributed to make this work a reality. This includes Professor Lionel M. Ni, my thesis advisor, for his support, guidance and inspiration, and Professors Richard Enbody, Moon Jung Chung, and Richard Hill for their insightful comments and for serving on my Ph.D. committee, and Professors Nabil Kamel, Abdol Esfahanian and Fathi Salam for their encouragement. I would like to acknowledge all of my fellow students who have given me help and friendship during my stay at MSU. In particular, I wish to thank Dr. Chung-Ta King, Jayashree Ramanathan, Dongyul Ra, Paul Wolberg, Xian-La Lin, and Honda Shing for their help and valuable discussions, and David Robinson, Dr. Shixiong Guo and Ten Hwan Tzen for their cooperation and wonderful work. I am very grateful to my parents and parents-in-law for their years of support, concern and understanding. I am indebted to my wife Hong Zhang-Sun for her constant encouragement, assistance and care during the course of my graduate studies. This research has been supported in part by the National Science Foundation under grant ECS-88-14027.

Table of Contents

List of Tables
List of Figures
1 INTRODUCTION
1.1 Multicomputers
1.2 Parallel Algorithm Design Considerations
1.2.1 Sources of Degradation
1.2.2 Algorithm Characteristics
1.3 Motivation and Problem Statement
1.4 Thesis Organization
2 PERFORMANCE METRIC OF PARALLEL ALGORITHMS
2.1 Preliminary
2.2 Models of Speedup
2.3 Simplified Models of Speedup
2.4 Comparison Study
3 PARALLEL COMPUTATION MODELS FOR SCIENTIFIC COMPUTING
3.1 Structured Representation
3.2 Parallel Computation Models
3.2.1 Local Computation Model
3.2.2 Global-Exchange Computation Model
3.2.3 Compute-Aggregate-Broadcast Computation Model
3.2.4 Divide-and-Conquer Computation Model
3.2.5 Domain Decomposition Model
3.2.6 Pipeline (Ring) Computation Model
3.2.7 Recursive Doubling Computation Model
3.3 Application Considerations
4 APPLICATION-DRIVEN ALGORITHM DESIGN
4.1 Preliminary
4.2 Power Flow Application
4.3 Homotopy Method
4.4 Implementation Issues and Results
5 ARCHITECTURE-DRIVEN ALGORITHM DESIGN
5.1 Linear Tridiagonal Solvers
5.1.1 The LU Decomposition Method
5.1.2 The Parallel Prefix Method
5.1.3 A Novel Matrix Partitioning Technique
5.2 A Compute-Aggregate-Broadcast Computation Solver
5.3 A Global-Exchange Computation Model
5.4 A Solver With More Than One Computation Model
5.5 Comparison and Experimental Results
6 PREDICTION OF PERFORMANCE
6.1 Performance Formulations
6.2 Structured Prediction
6.3 The Influence of Problem Size on Speedup
7 CONCLUSION AND FUTURE RESEARCH
7.1 Summary and Major Contributions
7.2 Future Research Directions
Bibliography

List of Tables

5.1 Computation and Communication Counts of Tridiagonal Solvers

List of Figures

1.1 A Generic Multicomputer Architecture
2.1 Parallelism Profile of an Application
2.2 Shape of the Application
2.3 Amdahl's Law
2.4 Gustafson's Scaled Speedup
2.5 Simplified Memory-Bounded Scaled Speedup
2.6 Matrix Multiplication with Local Computation
2.7 Matrix Multiplication with Global Computation
2.8 Amdahl's law, Gustafson's speedup and SMB speedup
3.1 Compute-Exchange Computation
3.2 Multicast Data Exchange
3.3 Conjunctive Data Exchange
3.4 Partitioning of Graphs
3.5 FFT (Butterfly) Computation
3.6 Odd-Even Cyclic Reduction
3.7 Local Computation Model
3.8 Global Exchange Computation Model
3.9 CAB Computation Model
3.10 Divide-and-Conquer Compute-Exchange Computation
3.11 Domain Decomposition Computation Model
3.12 Pipelined Computation Model
3.13 Recursive Doubling Computation Model
4.1 The Main Components of Power System
4.2 The Homotopy Method
4.3 Global-Exchange Model for Merge Checking
4.4 Compute-Aggregate-Broadcast Model for Merge Checking
4.5 Speedup Versus Different Number of Processors
5.1 The Communication Pattern of the PPT Algorithm
5.2 Tree Reduction
5.3 The Communication Pattern of the GE Computation
5.4 The Communication Pattern of the PPH Algorithm
5.5 The Speedup Over The LU Decomposition Method
5.6 Efficiency Over The LU Decomposition Method
6.1 Parallel Prefix Method
6.2 Measured and Estimated Speedup of Parallel Prefix Method

Chapter 1

INTRODUCTION

Powerful computer systems able to handle computationally intensive problems are increasingly in demand in many scientific and engineering areas. To cope with these computational requirements, a natural trend is to construct systems consisting of multiple processors.
This approach has been shown in recent years to be the most straightforward and cost-effective approach for achieving high performance. One system consisting of many processors commu- nicating through an interconnection mechanism is called a multiprocessor system [28]. These processors are usually homogeneous, and the communication latency between processors is relatively small but non-negligible. The communication latency, along with other degrada- tions, makes parallel computation much more complex than sequential computation. Much effort has been exerted and is currently being exerted to obtain highly efficient parallel com- putation. Although there are various multiprocessors available, software support and efficient. algorithm development for these machines are far behind their hardware counterpart. In scientific computing, there are several issues related to parallelization that do not arise in a serial context. The first issue is task partitioning, i.e., the breakdown of the total work— load into smaller tasks, some or all of which, can be processed concurrently. This includes the proper sequencing of the tasks when some tasks are interdependent and cannot be executed simultaneously. The second issue is the synchronization of concurrent processes. In some methods, processes must wait at predetermined points for the completion of certain computa- tions or for the arrival of certain data. The mechanisms used to enforce such synchronization have an important effect on algorithmic performance. A third issue is the communication of interim results between the processes. The objective here is to carry out the communica- tion efficiently, which is problem dependent as well as machine dependent. The fourth issue is to obtain a good match between algorithmic requirements and architectural capabilities, i.e., the pairing of applications and architectures. From a parallel algorithm design point of view, the issue is how to explore all the execution potential of the underlying architecture, which includes the determination of the number of processors needed as well as task and data allocation across the processors. From the architecture point of view, architectures can be tailored for the execution of a fixed set of algorithms if the architectural requirement of . the algorithms are understood. All of the above issues are problem dependent. It is dificult, if not impossible, to derive any acceptable solution for parallel computation as a whole. For this reason the issues have been traditionally studied problem by problem and solved by brute force methods. How- ever, the issues are important in a general context. A more general study is needed that does not reference a special problem. In this dissertation, a new concept, compute-exchange computation, is proposed. Based on this new concept, most of the frequently used scien- tific and engineering applications can be represented by a simple structured representation. Structured design and rethinking, therefore, become possible. The basic building blocks of these structures, called parallel computation models, are identified and studied. Issues of performance metrics for parallel computation are also examined. The results of this research are useful in many areas of computer science. The study presented will focus on two areas, efficient algorithm design and performance prediction. This chapter is organized as follows. Section 1.1 will address the parallel systems con- sidered in this study. 
Section 1.2 will discuss the parallel programming considerations. The motivation of the research will be given in Section 1.3. Finally, an overview of the dissertation will be found in Section 1.4. 1.1 Multicomputers The parallel systems considered in this study are multicomputers. Multicomputers are mes- sage passing distributed-memory multiprocessors [28]. They are organized as an ensemble of small programmable computers, called nodes, and communicate through a point-to-point interprocessor communication network. Each memory module is physically associated with each processor. When the number of processors increases, the memory capacity also in- creases. Multicomputers with hundreds or thousands of processors are available commer- cially. Examples of first generation multicomputers include J PL’s MARK III, Intel’s iPSC (up to 128 processors), Ncube’s NCUBE (up to 1024 processors), Ametek’s S/ 14 (up to 256 processors) and FPS’s T series (up to 2“ processors) [55] [23] [21]. Note that the Connection Machine [25], though hypercube interconnected, is a bit-serial SIMD machine. New multi- computers are also emerging. Both Intel’s iPSC-2 [46] and Ametek’s 2010 [3] are classified as second generation multicomputers. The iWARP [6], being developed by Intel and CMU, is also an example of a high performance multicomputer. All first generation multicomputers adopt the store-and-forward communication mechanism and the hypercube topology. Second generation multicomputers have more advanced communication mechanisms. Although the processors are physically interconnected as a hypercube or a 2D-mesh, they are logically fully connected [3]. The structured representation proposed in this study aims at these new com— munication mechanisms. Multicomputers hold a promising potential for providing massive parallelism. In addition to architectural enhancement through an increase in computational capability and a decrease in communication latency, the full potential of a multicomputer cannot be exploited without a careful design of algorithms. The results of this research will benefit this careful design and should be useful in the design of future multicomputers to help accommodate those frequently used parallel computation models. A generic architecture of multicomputers is depicted in Figure 1.1. Processor Processor Processor Memory Memory Memory T I I 1 I J I I ' I ' hirerprocessor Communication Network ] Figure 1.1: A Generic Multicomputer Architecture 1.2 Parallel Algorithm Design Considerations During the past two decades, the computing community has witnessed a rapid growth of interest in parallel processing. The motivation behind this increasing interest is not that parallel computing is easier to program, but that there are physical and technological limits in uniprocessors. Modern VLSI technologies enable multiprocessors to be built and imple- mented at a very cost-effective rate for solving computationally intensive problems. However, the expected superior performance can be achieved only when all the advantages of the un- derlying architecture are exploited. These include partitioning an application in such a way that load balancing can be maximized and allocating the tasks in such a way that the communication overhead can be minimized. Neither this maximum nor minimum can be achieved easily. In fact, it has been shown that implementing parallel algorithms is much more difficult than most computer scientists had expected [30]. 
Parallelism degradations exist that could degrade the performance of parallel algorithms. Depending on their char- acteristics, different algorithms may be influenced by different degradations. In this section we shall address possible degradations and characteristics of parallel algorithms. 1.2.1 Sources of Degradation Parallel algorithms rarely achieve linear speedup for several reasons [18]. In designing efficient parallel algorithms, we have to know the sources of degradation and how to reduce their influence or remove them completely whenever possible. In this section sources of degradation are discussed. In the next section, parallel algorithms are characterized according to their inherent degradations. a Communication delay. From a computation point of view, in addition to the transmis- sion and propagation delay, there are two other communication delays [5]: - Communication processing time. This is the time required to prepare information for transmission. - Queuing time. Once information is assembled into packets for transmission on some communication link, it must wait in a queue prior to the start of its trans- mission. Queuing time is generally difficult to quantify precisely, but simplified models are often adequate to obtain valuable qualitative and quantitative conclu- sions [34]. 4 Communication overhead can be reduced by overlapping communication with compu- tation. The queuing, transmission, and propagation delays can possibly be overlapped with computations, but communication processing cannot. Communication cost is a determining factor for the complexity of parallel algorithms. Various techniques have been developed to reduce communication overhead. These include designing efficient networks, overlapping communication with computation, and developing algorithms carefully. In second generation multicomputers, the number of hops from a source node to a destination node is not a major factor in determining the message delay times. However, there are three classical ways to reduce communication delay by restructuring al- gorithms which still hold true [31]: 1) reducing the quantity of information that must be communicated, 2) reducing the frequency with which messages must be sent, 3) overlap- ping communication with computation. In practice the above goals may not be mutually compatible. 0 Uneven Allocation. It is desirable to distribute the computation load evenly so that all processors have an identical amount of work. However, load balancing is very difficult to achieve. High level module load balancing may lead to unbalanced computation due to the underlying data structure. Those processors that finish a step of computa- tion earlier may have to wait until others are complete before continuing. Thus, the processing power of those processors is wasted. Communication delay and uneven allocation are the main sources of degradation in par- allel algorithms. Reducing these two degradations is difficult and may sometimes raise other degradations. Such tradeoffs between degradations can be used as a design technique to improve the overall performance. 0 Redundant Work. Transferring intermediate results is more costly than computation for certain problems. Instead of transferring the results, we may wish to recompute the results. This kind of redundant work is adopted in solving eigenvalue problems [40] where eigenvalues are transferred and corresponding eigenvectors are recomputed. 
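To make the combined effect of these two degradations concrete, the following small sketch (an illustration added here, not part of the original text; the workloads and cost constants are hypothetical) models the elapsed parallel time as the load of the most heavily loaded processor plus a simple per-message communication charge.

# Illustrative model of the two main degradations discussed above:
# uneven allocation (the most heavily loaded processor sets the pace) and
# communication delay (a fixed processing cost per message plus a
# per-byte transmission cost).  All numbers are hypothetical.

def parallel_time(task_loads, msgs_per_proc, bytes_per_msg,
                  startup=1e-4, per_byte=1e-8, rate=1e8):
    """Estimated elapsed time along one processor's critical path.

    task_loads    -- list of operation counts, one entry per processor
    msgs_per_proc -- number of messages each processor sends
    bytes_per_msg -- size of each message in bytes
    startup       -- communication processing time per message (seconds)
    per_byte      -- transmission time per byte (seconds)
    rate          -- operations per second of one processor
    """
    compute = max(task_loads) / rate                 # uneven allocation
    communicate = msgs_per_proc * (startup + bytes_per_msg * per_byte)
    return compute + communicate

if __name__ == "__main__":
    total_work = 8.0e8                               # operations
    serial_time = total_work / 1e8

    # Perfectly balanced vs. mildly unbalanced partitions on 8 processors.
    balanced = [total_work / 8] * 8
    unbalanced = [1.5e8] + [(total_work - 1.5e8) / 7] * 7

    for label, loads in [("balanced", balanced), ("unbalanced", unbalanced)]:
        t = parallel_time(loads, msgs_per_proc=100, bytes_per_msg=8_000)
        print(f"{label:10s}: time = {t:.3f} s, speedup = {serial_time / t:.2f}")

In this simple model, packing the same data into fewer, larger messages lowers the start-up term, which corresponds to the second of the three classical remedies listed above.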
Another kind of redundant work comes from the load balancing: To reach agreement on shared data, an algorithm can either let one processor compute it while all the other processors wait or let all the processors compute it concurrently. Since the former involves waiting and receiving, the later is more favorable. This kind of redundant work is used in solving linear tridiagonal systems [60]. Redundant work decreases the communication overhead by increasing the computation requirement. 0 Reduced Knowledge. If performing calculations improves the knowledge base of the al- gorithm, simultaneous calculations cannot make use of the most recent knowledge but must rely on information that is to some extent out of date. This is often the case in iter- ative methods where each iteration advances the knowledge base. For example, Jacobi algorithms are widely used in multicomputers, while Gauss-Seidel algorithms, which use the latest available information, usually converge faster than the corresponding Jacobi algorithm. Receiving the newest information requires frequent communication and synchronization, making Gauss-Seidel algorithms difficult to implement efficiently on multicomputers. o Non-optimal. With more than one processor computing concurrently, the computa- tional complexity might be misleading, since an algorithm could do more work and still be fast. The algorithm with the lowest computation count is very often sequential in nature. To achieve a high degree of parallelism, we may have to develop parallel al- gorithms which require more computation than the optimal sequential algorithms. We reduce the uneven allocation degradation by increasing the computation requirement. This increase in computation requirement will lower the performance gain of parallel processing, at least in the sense of absolute speedup (see Chapter 2). 1.2.2 Algorithm Characteristics Parallel algorithm characteristics have been studied for mapping parallel algorithms into parallel or distributed architectures [30]. In this section, parallel algorithms are characterized from a different point of view. We study how these characteristics degrade the performance of parallel algorithms. 0 Data Parallelism versus functional parallelism. Parallelism can be achieved by di- viding the data among the processors, or by decomposing the algorithm into segments that can be assigned to different processors. We call the former data parallelism and the latter functional parallelism. Data parallelism is the parallelism which comes from simultaneous operations across a set of data [26]. Thus, for data parallel algorithms each processor executes the same instructions on a different data set. Task partitioning is more flexible and load balancing is relatively easy to achieve. Algorithms based on functional parallelism typically use a small number of processors. 0 Granularity. The granularity or grain size deals with average subtask size, which is measured in the number of instructions executed. It has a bearing on data alloca- tion, communication requirements, processing capability, and memory requirements. Fine-grain algorithms often require frequent communication. Large-grain algorithms typically reduce the communication / computation ratio and have less communication penalty. This communication penalty reduction makes some computation models, that may be intolerable for fine-grain algorithms, favorable for large-grain algorithms. 
For example, with fine-grain parallelism, both pipelined Jacobi iteration and Gauss-Seidel iteration are unacceptable. With large-grain parallelism, these methods can achieve a near linear speedup [49], [5]. 0 Degree of Parallelism. We call the number of ’ processes that can operate indepen- dently the degree of parallelism. Load balancing will be much more easily achieved for algorithms with a static degree of parallelism. Unfortunately, parallel algorithms with a dynamic degree of parallelism are not uncommon. An example is parallel Gaussian elimination algorithm. To eliminate the lower triangular elements, n — 1 independent processes are available for elimination in the first column, n - 2 independent processes are available for elimination in the second column, ..., and finally one process is avail- able for elimination in the n — 1th column. Another example is parallel Given Rotation [47]. For an n dimensional matrix, the degree of parallelism of Given Rotation goes up from 1 to n/2 and then goes down from n/2 to 1. o Concurrency and Pipelining. From the vieWpoint of scheduling, a parallel computation can be characterized as either concurrent or pipelined [29]. Concurrency exploits spatial parallelism by utilizing several processors executing multiple independent tasks simul- taneously. Pipelining exploits temporal parallelism in which each processor (called a stage) behaves like a filter or transformer which operates on its input data and passes its output data to the succeeding processor as the later’s input. A systolic array is a typical example of a pipelined computation in a synchronized environment. Through pipelining, communication occurs only between fixed and neighboring stages. Thus a very high ratio of computation to I/O rate will result. A more detailed study on pipeline computation can be found in [34] [49]. 0 Data Dependencies. Data dependencies have a major impact on the design of parallel algorithms. They influence data granularity, degree of parallelism and synchronization. Data dependency is so important that two nontraditional parallel machines, namely graph reduction machines and dataflow machines, have been proposed in which the order of instruction executions is implied by data dependencies [61]. By representing the operations as vertices and using directed edges to indicate the dependencies, data dependency can be represented by a directed dependency graph. Much information can be obtained by coloring the dependency graph. Operations with different colors cannot execute concurrently. From a design point of view, wewould like to distribute operations with the different color to the same processor. The multicoloring method [5], which has been successfully employed in the solution of partial differential equations and in image processing, is a good example of grouping the operations according to data dependency. 0 Methodology. There are a variety of methods for solving systems of equations, usually classified as direct and iterative methods. Direct methods follow fixed procedures and find the exact solution with a finite number of operations. Iterative methods do not obtain an exact solution in finite steps, they converge to a solution asymptotically. Nevertheless, iterative methods often yield a solution, within acceptable precision, af- ter a relatively small number of iterations. In this case, they are preferred to direct methods. 
This is usually the case for linear systems when n is very large and the coefficient matrix is sparse, such as the system A1: = b which arises in the discretiza- tion-of a linear partial differential equation and in many other applications. Iterative methods may also have smaller storage requirements than direct methods when the matrix A is sparse. The computation count of iterative methods depends on the num- ber of iterations, which is problem dependent. Curve tracing is a special method for obtaining solutions, in which we start at one end of the curve and want to get to the other end point. The intermediate results along the curve are unimportant. A curve tracing method for solving power flow problems will be presented in Chapter 4. Several newly developed direct methods for solving tridiagonal linear systems will be given in Chapter 5. 1.3 Motivation and Problem Statement The complex interaction of the many architectural, hardware, and software features of par- allel systems results in a larger space of possible performance behavior and an increase in algorithm design complexity. Designing efficient parallel algorithms requires that users to un- derstand the performance characteristics of parallel machines and to modify their algorithm accordingly. These modifications are problem dependent. Therefore, parallel algorithms have had to be fine-tuned case by case to achieve higher performance. The painful, elusive design process has excluded casual users and restricted parallel computers to a rather small professional community. This situation needs to be changed to make parallel computers usable for other scientists. We would like to reduce the burden of parallel algorithm design and make the design process more systematic. This raises the obvious questions: What are the techniques for developing eflicient parallel algorithms? How could these techniques be used on a given application? To answer the first question, Nelson and Snyder [44] have proposed the concept of parallel paradigm and identified several paradigms. Paradigms are recognized high level 9 methodologies common to many of the effective parallel algorithms. They are important, because they provide good examples and may help users understand parallel computation. However, these paradigms are described verbally and are isolated from each other. How to connect these paradigms with general applications is unknown. In this dissertation, I approach these two questions from a different angle. I study parallel algorithm design from a general point of view. First, a representation methodology, struc- tured representation, is developed. With this representation, most of the frequently used scientific and engineering applications can be represented by simple formulations. These formulations are combinations of some simple data structures, called parallel computation models, and provide adequate information about performance degradations. Parallel compu- tation models are the basic building blocks of structured representations. Since both parallel computation models and parallel paradigms are commonly used data structures, it is not surprising that they share some similarities. Some of them even share the same name. A major advantage of parallel computation models over parallel paradigms is that computation models are based on mathematical formulations, and they are the constructing components of general scientific applications. 
The structured representation of most frequently used sci- entific and engineering applications consists of computation models. The design techniques used in computation models are the techniques needed for general scientific algorithm design, and the design techniques are used in a similar way. A mathematical foundation is necessary in moving toward a systematic design methodol- ogy for parallel algorithms. Based on that foundation, different applications can be classified and manipulated. Forming structured representation and acquiring computation models are the first step in developing such a foundation. After computation models have been well identified and studied, the performance of an application can be predicted during its design stage, and systematically mapping algorithms onto architectures becomes possible. Structured representation and computation models are important. They provide a fast way of estimating and understanding the performance of parallel programs in a parallel system. They lead to efficient parallel algorithm design for scientific and engineering appli- cations. They provide guidelines for programming tools and systematic parallel algorithm design methodology. Also, they suggest how each multicomputer should be used in applica- tions and which multicomputer is the best for a given application. This study focuses on scientific computations. The models are built for scientific appli- cations and examples are chosen from scientific applications. We are interested in scientific applications since scientific and engineering applications are the major driving force behind 10 parallel processing and the data structures of scientific applications are relatively simple. With some effort, structured representation and computation models could possibly be ex- tended to general non—numerical applications. 1.4 Thesis Organization This dissertation is organized as follows. In Chapter 2, we study the performance issues of parallel processing. The study focuses on one performance metric, parallel speedup. Three models of speedup are studied. They are fired-size speedup, fired-time speedup and memory- bounded speedup. Memory-bounded speedup considers memory capacity as an influential factor and provides the quantum of the tradeoff between time and space. Two sets of speedup formulations are derived for these three models. One set requires more information and provides more accurate estimations. Another set considers a simplified case and provides a clear picture of the possible performance gain of parallel processing. Structured representation is proposed in Chapter 3. Two different classes of data- exchange are identified. One is regular data-exchange. Another is conjunctive regular data-ezchange. Notation is developed to represent these two classes of data-exchanges. A new concept, compute-exchange computation, is introduced. By this concept, every parallel computation can be divided into two parts, compute and data-exchange. There- fore, a parallel algorithm can be written in terms of data exchanges and computations, which is called structured representation. Some basic components of structured representa- tion, called computation models, are also studied in Chapter 3. They are Local Computa- tion Model, Global-Exchange Computation Model, Compute-Aggregate-Broadcast Computa- tion Model, Divide-and-Conquer Computation Model, Domain Decomposition Computation Model, Pipeline (Ring) Computation Model, and Recursive Doubling Computation Model. 
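As a concrete taste of the compute-exchange style, the short sketch below is added for illustration; it is not taken from the dissertation and only simulates, on a single machine, one common form of the recursive doubling pattern named above. Each of N = 2^d nodes alternates a local combine step with a pairwise data exchange, and after d steps every node holds the global result.

# Minimal simulation of a recursive doubling compute-exchange pattern:
# in step k, node i exchanges its partial result with node i XOR 2**k and
# combines the two values.  After log2(N) compute-exchange steps every
# node holds the same global result.  The node values below are arbitrary.

def recursive_doubling(values, combine=lambda a, b: a + b):
    n = len(values)
    assert n & (n - 1) == 0, "node count must be a power of two"
    vals = list(values)
    step = 1
    while step < n:
        # Exchange phase: every node i pairs with partner i XOR step.
        exchanged = [vals[i ^ step] for i in range(n)]
        # Compute phase: combine the local value with the received one.
        vals = [combine(vals[i], exchanged[i]) for i in range(n)]
        step <<= 1
    return vals

if __name__ == "__main__":
    partial_results = [3, 1, 4, 1, 5, 9, 2, 6]      # one value per node
    print(recursive_doubling(partial_results))       # every node ends with 31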
Parallel algorithms may be modified for different reasons to achieve high performance. The modifications, however, mainly follow two approaches: explore the inherent parallelism of the application to increase the concurrency and take advantage of the underlying architec- ture to reduce the communication overhead. Two applications are studied to illustrate how structured representation and computation models can be used in efficient algorithm design. The power flow problem is studied in Chapter 4 to demonstrate how structured representa- tion can help in application-driven algorithm design. In Chapter 5, using the design process of linear tridiagonal solvers, we show how the parallel architecture will influence the choice of computation models. The power flow problem determines how electrical power generation 11 and distribution should be changed with customer demand, so that the power system can be operated safely. We adopt the homotopy mathematical method to solve the power flow problem. Our first approach is an algorithm using the local computation model. Then, we change to a global-exchange computation. After learning more about the homotopy method, we modify our algorithm into a compute-aggregate—broadcast computation. Solving linear tridiagonal systems is a fundamental problem of scientific computation. We have developed tridiagonal solvers on a first generation NCUBE/ 7 hypercube multicomputer. To take‘advan- tage of the hypercube architecture, we change our design from compute-aggregate-broadcast computation to global-exchange computation. Then we change to a hybrid computation, which is a combination of two computation models. The last one gives the best perfor- mance. All the algorithms are implemented on an NCUBE multicomputer and the results can be found in the chapters that follow. A new performance prediction methodology is presented in Chapter 6. This methodology adopts a divide-and-combine strategy. For a given application we write down its structured representation and find out the contained computation models. We then use the well-studied computation models to predict the performance of the given application. Examples are given and the performance measurements of identified computation models are studied. Instead of studying the performance on a given architecture, the performance is studied from a dynamic point of view. We study the influence of problem size on performance. Chapter 7 gives a summary of my major contributions and the direction of future research. Chapter 2 PERFORMANCE METRIC OF PARALLEL ALGORITHMS To study efficient parallel algorithm design we first have to understand the performance metrics. For sequential machines, elapsed time and memory space are the metrics by which algorithms are measured. With parallel machines, the performance measurement becomes more difficult. In additionito time and space, the performance parameters for parallel algo- rithms include the number of processors available, communication overhead and the inherent parallelism of the given application. For this reason, despite the fact that parallel processing has become a common approach for achieving high performance, performance evaluation techniques of parallel processing are weak. There is no well-established metric to measure the performance gain. This weakness is one of the reason which leads to confusion and limits the growth of parallel computation. In this chapter, we shall study issues of performance ' metrics and seek a better understanding of one metric, parallel speedup. 
The most commonly used performance metric for parallel processing is speedup, which gives the performance gain of parallel processing versus sequential processing. However, with different emphases, speedup has been defined differently. One definition focuses on how much faster a problem can be solved with N processors. Thus, it compares the best sequential algorithm with the parallel algorithm under consideration. This definition is referred to as absolute speedup. Absolute speedup has two different definitions, one which considers machine resources and one which does not. In the first definition, speedup is defined as the ratio of the elapsed time of the best sequential algorithm on one processor over the elapsed time of the parallel algorithm on N processors. In the second definition, the absolute speedup is defined as the ratio of the elapsed time of the best sequential algorithm on the fastest sequential machine over the elapsed time of the parallel algorithm on the parallel machine [47]. Another type of speedup, called relative speedup, deals with the inherent parallelism of the parallel algorithm under consideration. It is defined as the ratio of the elapsed time of the parallel algorithm on one processor over the elapsed time of the parallel algorithm on N processors. Relative speedup is used because the performance of parallel algorithms varies with the number of available processors. Comparing the algorithm itself with different numbers of processors gives information on the variations and the degradations of parallelism. Absolute speedup and relative speedup are two commonly used speedup measures. Other definitions of speedup also exist. For instance, considering the construction costs of processors, a cost-related speedup is defined in [4].

Among all of the defined speedups, relative speedup is probably the one which has had the most influence on parallel processing. Two well known speedup formulations have been proposed based on relative speedup. One is Amdahl's law [2] and another is Gustafson's scaled speedup [20]. Both of these speedup formulations use a single parameter, the sequential portion of a parallel algorithm. They are simple and give much insight into the degradations of parallelism. Amdahl's law assumes a fixed problem size and is interested in how fast the response time could be. It suggests that massively parallel processing may not gain high speedup. Under the influence of Amdahl's law, many parallel computers have been built with a small number of processors. Gustafson [20] approaches the problem from another point of view. He fixes the response time and is interested in how large a problem could be solved within this time. The argument of Gustafson is that the problem size should be increased to meet the available computation power for better results. Experimental results based on his argument show that speedup can increase linearly with the number of processors available [22].

In this chapter we shall provide a careful study of relative speedup and reserve the word speedup for relative speedup, unless explicitly stated otherwise. We first study three models of speedup: fixed-size speedup, fixed-time speedup and memory-bounded speedup. All three models are based on relative speedup. With both uneven allocation and communication overhead considered, general speedup formulations will be derived for all three models.
When communication overhead is not considered and the workload only consists of sequential and perfectly parallel portions, the simplified fixed-size speedup is Amdahl's law; the simplified fixed-time speedup is Gustafson's scaled speedup; and, with one more parameter, the simplified memory-bounded speedup contains both Amdahl's law and Gustafson's speedup as its special cases. Therefore, from different points of view, the three models of speedup are unified.

2.1 Preliminary

The parallelism in an application can be characterized in different ways for different purposes, such as the data dependency graph [61], the task precedence graph [11], Petri nets [41] and average parallelism [15]. For simplicity, speedup formulations generally use very few parameters and consider very high level characterizations of the parallelism. In our study we consider two main degradations of parallelism, uneven allocation and communication latency. The former degradation is application dependent. The latter degradation depends on both the application and the parallel computer under consideration. To give an accurate estimate, both of the degradations need to be considered. Uneven allocation is measured by degree of parallelism.

Definition 1 The degree of parallelism of a program is an integer which indicates the number of processors that are busy computing at a particular instant in time, given an unbounded number of available processors.

The degree of parallelism is a function of time. By drawing the degree of parallelism over the execution time of an application, a graph can be obtained. We refer to this graph as the parallelism profile. Software tools are available to determine the parallelism profile of large scientific and engineering applications [36]. Figure 2.1 is the parallelism profile of a hypothetical divide-and-conquer computation (see Section 3.2). By accumulating the time spent at each degree of parallelism, the profile can be rearranged to form the shape (see Figure 2.2) of the application [53].

Figure 2.1: Parallelism Profile of an Application

Figure 2.2: Shape of the Application

Definition 2 The average parallelism is the average number of processors that are busy during the execution of the program in question, given an unbounded number of available processors.

By definition, average parallelism is the ratio of the total service demand to the execution time with an unbounded number of available processors. This is equal to the speedup, given an unbounded number of available processors and without considering communication latency. Therefore, average parallelism can be defined equivalently as follows [15].

Definition 3 The average parallelism is the speedup, given an unbounded number of available processors and without considering communication latency.

Let W be the amount of work (computation) of an application. Let $W_i$ be the amount of work executed with degree of parallelism i, and let m be the maximum degree of parallelism. Thus, $W = \sum_{i=1}^{m} W_i$. The execution time for computing $W_i$ with a single processor will be

$$t_i(1) = \frac{W_i}{\Delta}, \qquad (2.1)$$

where $\Delta$ is the computing capacity of each processor. If there are i processors available, the execution time will be

$$t_i(i) = \frac{W_i}{i\Delta}.$$

With an infinite number of processors available, the execution time will be

$$t_i = t_i(\infty) = \frac{W_i}{i\Delta} \quad \text{for } 1 \le i \le m.$$

Therefore, without considering communication latency, the response times on a single processor and on an infinite number of processors will be

$$T(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}, \qquad (2.2)$$

$$T(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}. \qquad (2.3)$$

The speedup will be

$$S_{\infty} = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}}. \qquad (2.4)$$

The average parallelism, A, can be computed in terms of $t_i$,

$$A = \frac{\sum_{i=1}^{m} i\, t_i}{\sum_{i=1}^{m} t_i}. \qquad (2.5)$$

Notice that $t_i$ is the time for executing $W_i$. When an unbounded number of processors are available, $t_i = \frac{W_i}{i\Delta}$. Substituting $t_i = \frac{W_i}{i\Delta}$ into Eq. (2.5), we have

$$A = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i}} = S_{\infty}. \qquad (2.6)$$

This gives a formal proof of the equivalence of Definition 2 and Definition 3. Average parallelism is a very important factor for speedup and efficiency. It has been carefully examined in [15].

$S_{\infty}$ gives the best possible speedup based on the inherent parallelism of an algorithm; no machine dependent factors are considered. With only a limited number of available processors and with communication latency considered, the speedup will be less than the best speedup, $S_{\infty}$. If there are N processors available and N < i, then some processors have to do $\lceil i/N \rceil$ pieces of work of size $W_i/i$ and the rest of the processors will do $\lfloor i/N \rfloor$ such pieces. In this case, assuming $W_i$ and $W_j$ cannot be solved simultaneously for $i \ne j$, the elapsed time will be

$$t_i(N) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil$$

and

$$T(N) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{N} \right\rceil. \qquad (2.7)$$

The speedup is

$$S_N = \frac{T(1)}{T(N)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil}. \qquad (2.8)$$

Communication latency is an important factor contributing to the complexity of a parallel algorithm. Unlike degree of parallelism, communication latency is machine dependent. It depends on the communication network, the routing scheme, and the adopted switching technique. For instance, the switching technique used in first generation multicomputers is store-and-forward. Second generation multicomputers adopt circuit switching or wormhole routing switching techniques. These new switching techniques reduce the communication cost considerably. Let $Q_N$ be the communication overhead when N processors are used in parallel processing; the general speedup becomes

$$S_N = \frac{T(1)}{T(N)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N}. \qquad (2.9)$$

2.2 Models of Speedup

In the last section we developed a general speedup formula and showed how the number of processors and the degradation parameters influence the performance. However, speedup is not only dependent on these parameters. It is also dependent on how we view the problem. With different points of view, we will get different models of speedup and different speedup formulations.

One viewpoint emphasizes shortening the time it takes to solve a problem by parallel processing. With more and more computation power available, the problem can be solved in less and less time. With more processors available, the system will provide a fast turnaround time and the user will have a shorter waiting time. A speedup formulation based on this philosophy is called fixed-size speedup. In the previous section, we adopted fixed-size speedup implicitly. Equation (2.9) is the general speedup formula for fixed-size speedup. Fixed-size speedup is suitable for many applications.
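The following short sketch (added for illustration; the parallelism profile and overhead value are hypothetical) evaluates the average parallelism of Eqs. (2.5)-(2.6) and the bounded-processor speedups of Eqs. (2.8) and (2.9) for a given workload profile $W_1, \ldots, W_m$, taking $\Delta = 1$ so that work and time units coincide.

from math import ceil

# W[i] is the amount of work executed with degree of parallelism i+1,
# i.e. the parallelism profile of Figure 2.2 in histogram form.
# The profile and the overhead figure below are hypothetical; Delta = 1.

def average_parallelism(W):
    """A = sum(W_i) / sum(W_i / i), Eqs. (2.5)-(2.6)."""
    return sum(W) / sum(w / i for i, w in enumerate(W, start=1))

def speedup(W, N, Q_N=0.0):
    """S_N of Eqs. (2.8)-(2.9): each W_i needs ceil(i/N) rounds of W_i/i
    work, plus a lumped communication overhead Q_N (same units as work)."""
    T1 = sum(W)
    TN = sum((w / i) * ceil(i / N) for i, w in enumerate(W, start=1)) + Q_N
    return T1 / TN

if __name__ == "__main__":
    W = [10.0, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0, 80.0]   # W_1, ..., W_8
    print(f"A   = {average_parallelism(W):.2f}")
    for N in (2, 4, 8):
        print(f"S_{N} = {speedup(W, N):.2f}  "
              f"(with overhead Q_N = 5: {speedup(W, N, Q_N=5.0):.2f})")

With N equal to the maximum degree of parallelism and no overhead, the computed speedup coincides with the average parallelism, as Eq. (2.6) predicts.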
For some applications we may have a time limitation, but we may not want to obtain the solution in the quickest way. If we have more computation power, we may want to increase the problem size, carry out more operations, get a more accurate solution, and keep the execution time unchanged. This viewpoint leads to a new model of speedup, called fixed-time speedup. Many scientific and engineering applications can be represented by partial differential equations, which can be discretized for different choices of grid spacing. Coarser grids demand less computation, but finer grids give more accurate solutions. If more accurate solutions are desired, this kind of application fits the fixed-time speedup model. One good example is weather forecasting. With more computation power, we may not want to give the forecast earlier. Rather, we may wish to add more factors into the weather model, increasing the problem size and getting a more accurate solution, thus providing a more precise forecast.

For fixed-time speedup the workload is scaled up with the number of processors available. Let $W_i'$ be the amount of scaled up work executed with degree of parallelism i and let $m'$ be the maximum degree of parallelism of the scaled up problem when N processors are available. In order to keep the same turnaround time as the sequential version, we must have

$$\sum_{i=1}^{m} W_i = \sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N.$$

Thus, the general speedup formula for fixed-time speedup is

$$S_N' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N} = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i}. \qquad (2.10)$$

From our experience in using multicomputers, we have found that memory capacity plays a very important role in performance. Existing multicomputers do not support virtual memory, and memory is distributed and associated with each node. The memory associated with each node is relatively small. When solving an application with one processor, the problem size is more often bounded by the memory limitation than by the execution time limitation. With more nodes available, instead of keeping the execution time fixed, we may want to meet the memory capacity and increase the execution time. In general, the question is: if you want to increase the problem size, do you have enough memory for the size increase? If you do have adequate memory space for the size increase, and after the problem size is increased to meet the time limit you still have memory space available, do you want to increase the problem size further by using this unused memory space and get an even better solution? For memory-bounded speedup the answer is yes. Like fixed-time speedup, memory-bounded speedup is a scaled speedup. The problem size is scaled up with the system size. The difference is that in fixed-time speedup the execution time is the dominant factor, and in memory-bounded speedup the memory capacity is the dominant factor. Most of the applications which fit fixed-time speedup will fit memory-bounded speedup when more accurate solutions are the goal. A good application for memory-bounded speedup is simulation. If we simulate a nuclear power plant, obtaining an accurate solution will probably be the highest priority.

With memory capacity considered as a factor of performance, the requirement of solving an application contains two parts. One is the computation requirement, which is the workload, and the other is the memory requirement. For a given application, these two requirements are related to each other, and the workload can be seen as a function of the memory requirement. Let M represent the memory requirement and let g represent the function; we have W = g(M), or M = g^{-1}(W), where g^{-1} is the inverse function of g. Under different architectures the memory capacity will change differently with the number of processors available. For multicomputers, the memory capacity increases linearly with the number of nodes available.
If $W = \sum_{i=1}^{m} W_i$ is the workload for sequential execution and $W^* = \sum_{i=1}^{m^*} W_i^*$ is the scaled workload when N processors are available, where $m^*$ is the maximum degree of parallelism of the scaled problem, then the memory limitation for multicomputers can be stated as: the memory requirement for any active node is less than or equal to $g^{-1}\left(\sum_{i=1}^{m} W_i\right)$. Here the main point is that the memory occupation on each node is fixed. Equation (2.11) is the general speedup formula for memory-bounded speedup.

$$S_N^* = \frac{\sum_{i=1}^{m^*} W_i^*}{\sum_{i=1}^{m^*} \frac{W_i^*}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N}. \qquad (2.11)$$

2.3 Simplified Models of Speedup

Three general speedup formulations have been proposed for the three models of speedup. These formulations contain both the uneven allocation and the communication latency degradations. They are closer to actual speedup and give better upper bounds on the performance of parallel algorithms. On the other hand, these formulations are problem dependent and difficult to understand. They give more detailed information for each application, but lose the global view of the possible performance gain. In this section, we study a simplified case for speedup, which is the special case studied by Amdahl and Gustafson. We do not consider the communication overhead, so $Q_N = 0$, and we assume that the allocation only contains two parts, a sequential part and a perfectly parallel part. That is, $W_i = 0$ for $i \ne 1$ and $i \ne N$. We also assume that the sequential part is independent of the system size, i.e., $W_1 = W_1' = W_1^*$. Under this simplified case, the general fixed-size speedup formulation Eq.(2.9) becomes

$$S_N = \frac{W_1 + W_N}{W_1 + \frac{W_N}{N}}, \qquad (2.12)$$

which is known as Amdahl's law. From Eq.(2.12) and Fig. 2.3 we can see that when the number of processors increases, the load on each processor decreases. Eventually, the sequential part will dominate the performance and the speedup is bounded by $\frac{W_1 + W_N}{W_1}$.

Figure 2.3: Amdahl's Law

Under the simplified conditions, $\sum_{i=1}^{m'} W_i' = W_1 + W_N'$ and $\sum_{i=1}^{m'} \frac{W_i'}{i} \left\lceil \frac{i}{N} \right\rceil + Q_N = W_1 + \frac{W_N'}{N}$. Therefore, for fixed-time speedup, we have $W_1 + W_N = W_1 + \frac{W_N'}{N}$. Since $W_1 = W_1'$, we have $W_N = \frac{W_N'}{N}$, that is, $W_N' = N W_N$. Equation (2.10) becomes

$$S_N' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m} W_i} = \frac{W_1' + W_N'}{W_1 + W_N} = \frac{W_1 + N W_N}{W_1 + W_N}. \qquad (2.13)$$

The simplified fixed-time speedup formula Eq.(2.13) is Gustafson's scaled speedup, which was proposed by Gustafson in 1988 [20]. From Eq.(2.13) we can see that the parallel portion of the application is scaled up linearly with the system size, and the speedup increases linearly with the system size. The relation of workload and elapsed time for Gustafson's scaled speedup is depicted in Figure 2.4, where $T_1$ is the execution time for the sequential portion of the work and $T_N$ is the execution time for the parallel portion of the work.

Figure 2.4: Gustafson's Scaled Speedup

We need some preparation before deriving the simplified formulation for memory-bounded speedup.

Definition 4 A function g is a semihomomorphism if there exists a function $\bar{g}$ such that for any real number c and any variable x, $g(cx) = \bar{g}(c)\, g(x)$.

One class of semihomomorphism functions is the power function $g(x) = x^b$, where b is a rational number. In this case, $\bar{g}$ is the same as the function g. Another class of semihomomorphism functions is the single-term polynomial $g(x) = a x^b$, where a is a real constant and b is a rational number. For this kind of semihomomorphism function, $\bar{g}(x) = x^b$, which is not the same as $g(x)$.
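A quick numerical check of Definition 4 for these two classes follows (added here only as an illustration; the exponent, coefficient and test values are arbitrary).

# Numerical check of the semihomomorphism property g(c*x) = gbar(c)*g(x)
# for the two classes mentioned above.  The exponent b, coefficient a and
# the test values of c and x are arbitrary.

from math import isclose

b, a = 1.5, 2.0

def g_power(x):  return x ** b         # g(x) = x^b,   gbar = g
def g_poly(x):   return a * x ** b     # g(x) = a*x^b, gbar(x) = x^b
def gbar(x):     return x ** b         # gbar for both classes

for c in (2.0, 3.0, 10.0):
    for x in (1.0, 4.0, 7.5):
        assert isclose(g_power(c * x), gbar(c) * g_power(x))
        assert isclose(g_poly(c * x),  gbar(c) * g_poly(x))
print("g(c*x) = gbar(c) * g(x) holds for both classes at all test points")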
The sequential portion of the workload, $W_1$, is independent of the system size. If we do not consider the influence of memory on the sequential portion, we have the following theorem:

Theorem 1 If W = g(M) for some semihomomorphism function g, $g(cx) = \bar{g}(c)\, g(x)$, then, with all data being shared by all the available processors and using all the available memory space, the simplified memory-bounded speedup is

$$S_N^* = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \qquad (2.14)$$

Proof: As mentioned before, $W_N$ is the parallel portion of the workload when one processor is used. Let the memory requirement of $W_N$ be M, $W_N = g(M)$. M is the memory requirement when one node is available. With N nodes available, the memory capacity will increase to NM. Using all of the available memory, for the scaled parallel portion $W_N^*$, $W_N^* = g(NM) = \bar{g}(N)\, g(M)$. Therefore, $W_N^* = \bar{g}(N)\, W_N$ and

$$S_N^* = \frac{W_1 + W_N^*}{W_1 + \frac{W_N^*}{N}} = \frac{W_1 + \bar{g}(N)\, W_N}{W_1 + \frac{\bar{g}(N)}{N} W_N}. \qquad (2.15) \quad \square$$

In the proof of Theorem 1, we claimed that $W_N^* = g(NM) = \bar{g}(N)\, g(M)$. This claim is true under two assumptions: 1) the data is shared by all available processors, and 2) all the available memory space is used for better solutions. A computation with the first property is called global computation. Equation (2.14) is the simplified memory-bounded speedup for global computation when memory is fully used.

Figure 2.5: Simplified Memory-Bounded Scaled Speedup

In general, data may be duplicated on different nodes and the available memory may not be fully used for increased problem size. Replacing the function g by a general function G, that is $W_N^* = G(N)\, W_N$, a more generalized theorem will be

Theorem 2 If $W_N^* = G(N)\, W_N$ for some function G, then

$$S_N^* = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N} W_N}. \qquad (2.16)$$

Proof: If $W_N^* = G(N)\, W_N$ for some integer N, then

$$S_N^* = \frac{W_1 + W_N^*}{W_1 + \frac{W_N^*}{N}} = \frac{W_1 + G(N)\, W_N}{W_1 + \frac{G(N)}{N} W_N}. \qquad (2.17) \quad \square$$

Equation (2.16) will be referred to as simplified memory-bounded (SMB) scaled speedup. SMB scaled speedup is determined by the function G(N), which represents how the change of memory will influence the change of problem size. When the problem size is independent of the system size, the problem size is fixed, G(N) = 1. In this case, SMB scaled speedup is the same as Amdahl's law, i.e., Eq.(2.16) and Eq.(2.12) are equivalent. The local computation model is one computation model studied in Section 3.2. In the local computation model, when more processors are available, work will be replicated on these available processors. Computation is done locally on each node, and communication between nodes is not required. In this case, when memory is increased N times, the workload also increases N times, i.e., G(N) = N. In this case, SMB scaled speedup is the same as Gustafson's scaled speedup. SMB scaled speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. For most scientific and engineering applications, the computation requirement increases faster than the memory requirement. For these applications, $\bar{g}(N) > N$ and memory-bounded speedup will likely give a higher speedup than fixed-time speedup.

The proposed scaled speedup formulation, Eq.(2.16), may not be easy to fully understand at first glance. Here we use matrix multiplication to illustrate it.
Enlarging the matrix size generally leads to a more accurate solution for the continuum. For matrices of dimension n, the computation requirement of matrix multiplication is $2n^3$ and the memory requirement is roughly $3n^2$. Thus

$$W_N = 2\left(\frac{M}{3}\right)^{3/2}.$$

This means that

$$W_N = g(M) = 2\left(\frac{M}{3}\right)^{3/2}, \qquad \bar{g}(N) = N^{3/2}. \qquad (2.18)$$

The simplified memory-bounded speedup for global computation is therefore

$$S_N^* = \frac{W_1 + N^{3/2} W_N}{W_1 + N^{1/2} W_N}. \qquad (2.19)$$

Global computation uses the distributed local memories as one large shared memory. All the data is distributed and shared. When these local memories are used locally without sharing, the computation is local computation and $W_N^* = N g(M)$. This means that G(N) = N, and the speedup is

$$S_N^* = \frac{W_1 + N W_N}{W_1 + W_N},$$

which is Gustafson's scaled speedup.

For matrix multiplication C = AB, let $A_i$ be the i-th row of A, i = 1, ..., n, and let $B_j$ be the j-th column of B, j = 1, ..., n. The local computation and global computation of the matrix multiplication are shown in Figures 2.6 and 2.7, respectively. In global computation, the rows of matrix A are rotated after each row-column multiply.

Figure 2.6: Matrix Multiplication with Local Computation (node i computes $A B_i$)

Figure 2.7: Matrix Multiplication with Global Computation (node i computes $A_i B_i$, with the rows of A rotated among nodes)

We have studied two cases of memory-bounded scaled speedup, global computation and local computation. Most applications are some combination of these two computation styles: data are distributed in one part and duplicated in the other part. The duplication may be required by inherent properties of the given application, or may be added deliberately to reduce communication. Speedup formulations for these applications depend on the ratio of the global and the local computation. Deriving a speedup formulation for these combined applications is difficult, not only because we face a more complicated situation, but also because of the uncertainty of the ratio. The duplicated part might not increase with system size. It might increase, but at a rate different from that of the global part. Also, an application may start as global computation, but, when the computation power increases, duplication may be added as part of the effort to obtain a better solution. In general, G(N) is application dependent. We derive G(N) for a special case as an example. The structure of this derivation can be used as a guideline for general applications.

Lemma 1 If the function g is a semihomomorphism, $g(cx) = \bar{g}(c)g(x)$, and $g^{-1}$ exists and is also a semihomomorphism, $g^{-1}(cx) = h(c)g^{-1}(x)$ for some function h, then $\bar{g}$ has an inverse and $\bar{g}^{-1} = h$.

Proof: Since $cy = g[g^{-1}(cy)] = g[h(c)g^{-1}(y)] = \bar{g}[h(c)]\,g[g^{-1}(y)] = \bar{g}[h(c)]\,y$, we have

$$\bar{g}[h(c)] = c \quad \text{for any real number } c. \qquad (2.20)$$

Also, since $cy = g^{-1}[g(cy)] = g^{-1}[\bar{g}(c)g(y)] = h[\bar{g}(c)]\,y$, we have

$$h[\bar{g}(c)] = c \quad \text{for any real number } c. \qquad (2.21)$$

By Eq. (2.20) and Eq. (2.21), the function $\bar{g}$ has an inverse and $\bar{g}^{-1} = h$. □

Theorem 3 Assume $W = g(M)$ for some semihomomorphism function g, where $g(cM) = \bar{g}(c)g(M)$, and the inverse of g exists and is a semihomomorphism. If the workload is first scaled up to meet the time limitation with global computation, and the rest of the unused memory space is then used to increase the problem size further with local computation, we have

$$G(N) = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]. \qquad (2.22)$$

Proof: By the fixed-time speedup, after the number of nodes changes from 1 to N, the parallel portion of the work increases from $W_N$ to $N W_N$ (see Figure 2.4). The storage requirement is given by the function $g^{-1}$.
For operation requirement $N W_N$, the memory requirement is $g^{-1}(N W_N) = \bar{g}^{-1}(N)g^{-1}(W_N)$. Let M represent the size of the memory associated with each node which can be used for parallel processing. Then, when the number of nodes equals 1, the total memory available is M, which is equal to $g^{-1}(W_N)$. When the number of nodes equals N, the total memory available changes to NM. We first fix the execution time and increase the problem size to meet the time limitation. After the fixed-time scale-up, the unused memory space is the difference between the currently available memory and the current memory requirement, which equals

$$NM - g^{-1}(N W_N) = NM - \bar{g}^{-1}(N)g^{-1}(W_N) = NM - \bar{g}^{-1}(N)M = \left(N - \bar{g}^{-1}(N)\right)M.$$

The unused space at each node is $\left[N - \bar{g}^{-1}(N)\right]M/N = \left[1 - \frac{\bar{g}^{-1}(N)}{N}\right]M$. The problem size can be further scaled by using this unused memory space. The further scaled computation on each node is given by the function g, and it is equal to

$$g\!\left(\left[1 - \frac{\bar{g}^{-1}(N)}{N}\right]M\right) = \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)g(M) = \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)W_N. \qquad (2.23)$$

Therefore, the computation on each node becomes the original operations on the node plus the added operations on the node,

$$W_N + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)W_N = \left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]W_N, \qquad (2.24)$$

and, for the scaled parallel computation $W_N^*$,

$$W_N^* = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]W_N.$$

Thus, we have

$$G(N) = N\left[1 + \bar{g}\!\left(1 - \frac{\bar{g}^{-1}(N)}{N}\right)\right]. \qquad (2.25)$$

□

Figure 2.8 depicts the speedup difference among the fixed-size model, the fixed-time model and the memory-bounded model. The function $\bar{g}$ used in Figure 2.8 is the one for matrix multiplication, $\bar{g}(N) = N^{3/2}$. As most matrix computations have the same function $\bar{g}(N) = N^{3/2}$, the speedup relation depicted by Figure 2.8 is in general true for a large class of applications.

Figure 2.8: Amdahl's law, Gustafson's speedup and SMB speedup, where $W_1 = 0.3$ and $\bar{g}(N) = N^{3/2}$ (speedup versus number of nodes; curves shown are SMB-Global, SMB-Combined, Gustafson, and Amdahl's law)
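The curves of Figure 2.8 can be reproduced numerically from Eqs. (2.12), (2.13), (2.16), (2.19) and (2.25). The sketch below assumes the workload is normalized so that $W_1 + W_N = 1$ with $W_1 = 0.3$, as in the figure; it is a simple illustration, not part of the original analysis.

    # speedup models of Chapter 2 with W1 = 0.3, WN = 0.7 and g_bar(N) = N**1.5
    W1, WN = 0.3, 0.7
    g_bar     = lambda N: N ** 1.5            # matrix-computation scaling, Eq. (2.18)
    g_bar_inv = lambda N: N ** (2.0 / 3.0)

    def smb(N, G):                            # Eq. (2.16), SMB scaled speedup
        return (W1 + G * WN) / (W1 + G / N * WN)

    for N in (4, 64, 1000):
        amdahl    = (W1 + WN) / (W1 + WN / N)                        # Eq. (2.12), G(N) = 1
        gustafson = (W1 + N * WN) / (W1 + WN)                        # Eq. (2.13), G(N) = N
        smb_glob  = smb(N, g_bar(N))                                 # Eq. (2.19), global computation
        smb_comb  = smb(N, N * (1 + g_bar(1 - g_bar_inv(N) / N)))    # Eq. (2.25), combined case
        print(N, round(amdahl, 2), round(gustafson, 1),
              round(smb_comb, 1), round(smb_glob, 1))

At N = 1000 this prints roughly 3.3, 700, 812 and 988, matching the ordering of the four curves in Figure 2.8.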
Amdhal’s law suggests that the sequential portion of the workload will dominate the per- formance when the number of processors is large. Gustafson’ scaled speedup claims that the influence of the sequential portion is independent of system size. Simplified memory- bounded scaled speedup declares that the sequential fraction will change with the system size. Since the computation requirement increases faster than the memory requirement for most applications, the sequential fraction could be reduced when the number of processors increases. The three models of speedup, fixed-size speedup, fixed-time speedup and memory-bounded speedup, are based on different vieWpoints and are suitable for different classes of applica- tions. Applications exist which do not fit any of the models of speedup, but satisfy some combination of the models. Chapter 3 PARALLEL COMPUTATION MODELS FOR SCIENTIFIC COMPUTING With more than one processor operating concurrently, parallel processing enlarges the range of possible performance and makes efficient algorithm design more important. We have studied performance metrics in the previous chapter. In this chapter, we focus on designing ' parallel algorithms, with emphasis on methodology and analysis. One of the factors that makes parallel algorithm design difficult is the lack of guidelines. Different applications are rarely related to each other in the design process. Therefore, they have to be handled one by one in an ad hoc fashion. The experience of experts hardly benefits general users. In Section 3.1, we propose a representation system, called structured representation, for scientific and engineering applications. With this representation, the sim- ilarities between applications become obvious and applications can be compared, classified, and manipulated based on their structures. Developing structured representation is the first step toward the ultimate goal, replacing the ad hoc development of parallel applications with a rational methodology based on analysis and measurement. A parallel algorithm cannot be efficient without considering the architectural aspects of the underlying multiprocessor. This is especially true for multicomputers where communi- cation overhead is a major consideration. Mapping an application onto an architecture is both application dependent and architecture dependent. To lead to a systematic mapping, the basic building blocks of structured representation are identified and studied in Section 3.2. With structured representation, an application can be decomposed into a set of com- putational models which can be mapped onto the architecture using predefined strategies. We can modify our design by changing certain computation models. In this way, structured design and rethinking become possible. Casual users can develop efficient algorithms based on experts’ experience. A general guideline for efficient algorithm design is provided. 29 30 3.1 Structured Representation Parallelism can be achieved by dividing a given application into pieces, called subtasks, and solving these pieces concurrently. Ideally, these subtasks can be solved independently, where the exchange of intermediate result is negligible. Some scientific and engineering applications have this nice, easy parallelism property. For these applications, it is natural to solve each of the subtasks locally on a different processor. This model of computation is called a local computation model. IDmExchanae' . > I_. 
Figure 3.1: Compute-Exchange Computation The design of local computation algorithms is straightforward. Unfortunately, most appli- cations do not have this easy parallelism property [1]. For most applications communication is necessary for exchanging data and coordinating activities. Although various asynchronous techniques have been designed to reduce the communication overhead, most communication must be achieved in a synchronous fashion, that is the receiving node must receive the com- municated message before continuing. This synchronous communication requirement makes eficient algorithm design very difficult. The load needs to be balanced between synchroniza- tions and special care has to be taken to reduce the communication overhead. Figure 3.1 depicts the general parallel computation pattern with synchronous communication consid- ered. It shows that the solving process consists of two phases, compute and data-exchange. These two phases occur alternatively and repetitively, and, therefore, form the compute— ezchange computation. The data-exchange phase involves communication between compute phases. The communication patterns vary largely from application to application, and may be represented by a notation. To simplify this notation, we restrict ourselves to certain 31 classes of communication, which are large enough for our purposes - describing the most frequently used scientific and engineering applications. A processor sending a message in a communication is called a sender. A processor receiving message in a communication is called a receiver. A processor could be both sender and receiver in a communication. A graph G(V, E) is a structure which consists of a set of vertices V = {v1,vg,...} and a set of edges E = {81,62,...} [17]. If we let processors in a communication bevertices in a graph and add directed edge (v, w) from v to w if processor v sends a message to processor w; a digraph (directed graph) is formed. This digraph is called the communication digraph. Following the notations of graph theory [17], the outdegree of a vertex v is the number of edges which have v as their start-vertex. In other words, the outdegree of a vertex v is equal to the number of destinations that v sends its message to. For this reason we also call the outdegree of a vertex the degree of a sender. The indegree of a vertex and the degree of a receiver are defined similarly. The degree of a receiver is the number of sources from which it receives messages. Definition 5 A regular communication is a communication in which all senders have the same degree and all receivers have the same degree. For a given undirected graph, if for every two vertices u and v there exists a path whose starting vertex is u and whose ending vertex is v, then the graph is connected. A con- nected subgraph G'( V’, E’) is a connected component if there is no other connected subgraph containing C(V’, E’) as its proper subgraph. The underlying (undirected) graph Of a digraph is the graph resulting from the digraph if the direction of the edges is ignored. A connected component of a digraph is the corre- sponding subdigraph of the connected component of its underlying graph. A connected component of the communication digraph is called a pattern of the commu- nication. Definition 6 A regular communication is a regular data-exchange if it consists of one or more copies of the same pattern. By our definition, the communication requirement of a regular data-exchange is given by the number of patterns it contains. 
The pattern of a communication is described by the number of senders and the number of receivers in this pattern. The complexity of each sender and of each receiver is given by its degree. Thus, a regular data-exchange can be represented using five parameters as 32 Pl5(D), 13(4)], (3-1) - where P is the number of instance of the pattern, S is the number of senders in each instance of the pattern, and D is the degree of the senders. Similarly, R is the number of receivers in each instance of the pattern, and d is the degree of each receiver. An example of using this notation for presenting communication is given in Figure 3.2. Notation (3.1) describes a communication by five parameters. Since broadcast is not provided in neither first generation nor second generation multicomputers, messages must be sent one at a time. The number of times messages are sent and received is the dominant factor in communication cost. Notation (3.1) indicates the characteristics of a communication. More information may be needed when implementation is considered. '9 '0 \ \‘s ‘ \ \‘ \‘~~ \ \‘s‘ \ \‘s‘ \‘ \ “ \‘ ‘ “ \ ‘ “ ‘~ ‘ “ 2[1(3)3(1>1 ‘\ \‘ ‘s \\ \ ‘~ ' \ \ ‘ \ \ s \‘ \‘ s‘ \‘ \\ “ \ i a ‘3 1 \__‘a. Figure 3.2: Multicast Data Exchange The second class of data-exchange which we want to identify is called conjunctive reg- ular data-exchange. We use the same five parameters to identify conjunctive regular data- exchange. The difference between regular data-exchange and conjunctive regular data- exchange is that in conjunctive regular data-exchange the patterns are not disjoint, they conjoin one another. Consider two special cases: conjunction at the sender side only and conjunction at the receiver side only. We have two general notations, ”5(0). R.(d)], (3-2) and P [51(0), 3(4)], (3-3) 33 where the subscript, c, points out which side has conjunctions. An example of conjunctive regular data-exchange is given in Figure 3.3, in which the receiver side has conjunction. Q I A " In\ I“\ H ’ \ I \ I \ ’ ‘ \ I \ ”I X ’I \‘\ ’I \\ I, “ I ~ I t ,' X I ~ mm, 2 (1)] I \ , I \ ’ ‘ ’ ‘ ’ ‘v ’ \ Figure 3.3: Conjunctive Data Exchange A graph G'(V’,E’) is a partition graph of G(V, E) if G’(V’, E’) is formed by splitting a subset of vertices {v1,vg,...v,.} E V into two subsets of vertices as {v;,v;,...v:,} and {v;’, v.1! , ...vz}, where v}, v,” are the vertices formed by splitting vertex v;, and having edges (v£,v;-) and (vf’,v§’ when edge (v;,v,-) exists in graph G(V, E) These'divided vertices are called partition vertices. Figure 3.4 shows two partition graphs of the given graphs. With this terminology, conjunctive regular data-exchange can be defined more mathematically as follows. Definition 7 A regular communication is a conjunctive regular data-exchange if one of its partition graphs consists of one or more copies of the same pattern. If all partition vertices are senders, the conjunctive regular data-exchange is a sender conjunctive regular data-exchange. If all partition vertices are receivers, the conjunctive regular data-exchange is a receiver conjunctive regular data-exchange. Since a regular communication patterns could have more than one partition graph which consists of one or more copies of the same pattern, a conjunctive regular data-exchange could have more than one notation. Once the data-exchange phase has been identified in an application, we can describe the application in terms of data-exchange. An application might have different data-exchange phases. 
Writing these data-exchange phases together in order by using the Z: symbol and adding in the compute phases, we have a formula, called a structured representation, for each 34 Original Graph Partitoned Graph Partition Vertices Original Graph Partitioned Graph Case Two Figure 3.4: Partitioning of Graphs 35 application. Figure 3.5 shows how to represent the Fast Fourier Transform (FFT) compu- tation in terms of regular data-exchange. The communication divides the computation into layers. The total computation depends on the number of layers as well as the computation requirement at each layer. X ,- is the computation work on each processor between data- exchange phase i - 1 and i, if we have even allocation. X.- is the computation work of the processor which has the largest workload among all the working processors in the compute phase i, if we have uneven allocation. Notice that X; is not equal to W,- which is defined in Chapter 2. W,- is the total workload with degree of parallelism i. E!!!!!!! x: :x: 3‘ ‘ :x: 4[2(1),2(1)] 223?:1'. :3: s: _, ~. 4[2(1).2(1)] 4l2(1).2(1)l i(2k'1[2(1),2(1)] + Xi) lc = log(N) = 3 i=1 Figure 3.5: FFT (Butterfly) Computation 36 We can see from Figure 3.5 that different communications could have the same data- exchange representation. The reason is that our notation. is a high level notation. It pro- vides the communication complexity for. the data-exchanges. It does not contain detailed information about how the communication takes place. We assume that the processors which participate in the data communication also partic- ipate in the computation, and the processors which do not participate in the data communi- cation do not participate in the computation. With this assumption, the second application, odd—even cyclic reduction (see Figure 3.6), shows how the structured representation illustrats the uneven allocation degradation. Using the data-exchange information P[S' (D), R(d)] at data-exchange phase i, we can see that there are P x S processors working at compute phase i — 1 and there are P x R processors working at compute phase i. If the data-exchange is a conjunctive regular data-exchange, P[SC(D), R(d)], the number of working processors at compute phase i -— 1 will be P x (S — l) + 1. These can be confirmed by observing Figure 3.6. Similarly, for conjunctive regular data-exchange P[S (D), Rc(d)], the number of working processors at compute phase i will be P x (R - 1) + 1. Structured representation provides information about uneven allocation at an abstract level. It provides the number of proces- sors which participate in the computation at each compute phase. It does not provide the workload information for each of the working processors. Odd-even cyclic reduction is a commonly used method for scientific applications. A well known parallel algorithm for tridiagonal linear systems is based on odd-even cyclic reduction [33]. From Figure 3.6 we can see that the odd-even cyclic reduction application contains two different structures. The upper half of Figure 3.6 is one structure and the lower half is another structure. This is a common phenomenon of scientific applications. Most of the frequently used scientific applications are combinations of a few simple structures, which we call computation models. The information in computation models can be used in general applications. Studying computation models will lead to a general algorithm design guideline for scientific applications. 
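The bookkeeping described above — reading off the number of working processors from a data-exchange term — is mechanical enough to encode directly, as shown in the sketch below. The encoding, its field names, and the example term (read here as the first reduction step 7[3_c(1), 1(3)] of Figure 3.6) are assumptions of this illustration, not a notation defined in the text.

    # one possible encoding of the data-exchange notation P[S(D), R(d)];
    # sender_conj / receiver_conj mark the conjunctive cases P[S_c(D), R(d)] and P[S(D), R_c(d)]
    from dataclasses import dataclass

    @dataclass
    class DataExchange:
        P: int            # number of pattern instances
        S: int            # senders per pattern, each of degree D
        D: int
        R: int            # receivers per pattern, each of degree d
        d: int
        sender_conj: bool = False
        receiver_conj: bool = False

        def working_before(self):   # processors active in the compute phase before the exchange
            return self.P * (self.S - 1) + 1 if self.sender_conj else self.P * self.S

        def working_after(self):    # processors active in the compute phase after the exchange
            return self.P * (self.R - 1) + 1 if self.receiver_conj else self.P * self.R

    step = DataExchange(P=7, S=3, D=1, R=1, d=3, sender_conj=True)
    print(step.working_before(), step.working_after())   # 15 processors before, 7 after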
In the next section we will present some of the identified computation models. 3.2 Parallel Computation Models Structured representation is a precise and perspicuous medium to express computation mod- els. With structured representation, computation models can be defined at different levels for different purposes. In our study, we follow the conventional research of parallel paradigms 37 7[3 (1) 1(3)] I [3cm 1(3)] 1 [1(2).2(1)] * 2[1(2).2(1)] “1(2). 2(1)] E((2*-‘-1)I3.(1).1(3)1+ X.)+ 2?;1-"(2‘-*[1(2).2(1)1+ x.) k = [109(Nll = 4 Figure 3.6: Odd-Even Cyclic Reduction 38 and define the computation models from a coarse point of view. We identify the computa- tion models based on design techniques, communication patterns, characteristies, sources of degradation, and define them precisely with structured representations. 3.2.1 Local Computation Model Parallelism can be achieved by dividing a given problem into pieces and solving these pieces concurrently. In general, these pieces need to communicate in order to exchange data and coordinate their activates. However sometimes the problem can be divided into almost com- pletely independent subtasks where the exchange of the intermediate results is negligible. Some computational problems have this ideal parallel property. The use of the Local Com- putation Model is natural in solving these problems on a multicomputer. The characteristic of the local model is asynchrony. _ILJLJLJIIULI Figure 3.7: Local Computation Model Local computation for algorithms with data parallelism can be illustrated by matrix multiplication. Given n x n matrices A and B, we want to compute their product, C. Assuming N is less than n, we partition matrix B by columns evenly into N submatrices Bo, Bl, ..., B~-1. Each node is loaded with one submatrix B.- (i = 0, ..., N - 1) together with the whole matrix A. The resultant products C.- = A x B,- are then obtained on each processor independently. The common input A shows another characteristic of the local computation model. The processors give different parts of the output, but they may share some input. The Homotopy method [9] (see also Section 4.3) solves systems of equations by curve tracing. To solve a non-trivial equation P(Z) = 0, a homotopy function H (Z,t) is defined as H(Z,t) = (1 — t)Q(Z) + tP(Z) t 6 [0,1] (3.4) It can be seen from Eq.(3.5) that when t = 0, H(Z, 0) = Q(Z) and when t = l, H(Z,1) = P(Z). Mathematical results have shown that if P(Z) 6 C", Q(Z) E C’, then P(Z) = 0 can be solved by tracing the solution set of the the following equation H(Z,t) = o, (3.5) i.e., the equation P(Z) = 0 can be solved by 39 o solving the equation Q(Z) = 0, a following the solution curves of Eq.(3.5) from t = 0 to t = l. Each’solution of Q(Z) will lead to a solution curve of H (Z,t). By broadcasting the homotopy function to each of the processors and by starting them from different solutions of Q(Z), the solution curves of Eq. 3.5 can be traced locally on each processor. There is no communication needed for coordination. The curves are traced by following discrete points. The amount of computation on each point and the points themselves is determined at run time. The local computation model is a special case of compute-exchange computation where the data-exchange phase does not exist. 3.2.2 Global-Exchange Computation Model In the global-exchange computation model each processor executes a computational phase and then performs a total data-exchange. 
At each data-exchange phase, each processor sends messages to all the other processors and receives messages from all the other processors. The communication requirement at each data-exchange phase is high. The Jacobi iterative method for solving linear systems is an application which is based on global-exchange computation. The characteristic of global-exchange computation is synchronism. If one processor does not finish its computation in the current compute phase, then no other processor can enter the next compute phase. This synchronization requirement and the high communication overhead make designing efficient algorithms for global-exchange computation applications very difficult. Figure 3.8 depicts the global-exchange computation model and gives the structured representation for this model.

Figure 3.8: Global Exchange Computation Model (structured representation $\sum_i \left([N(N), N(N)] + X_i\right)$)

3.2.3 Compute-Aggregate-Broadcast Computation Model

The Compute-Aggregate-Broadcast (CAB) computation model was first proposed in [44]. It is composed of three basic phases bearing those names: a compute phase which performs the computation, an aggregation phase which combines local data into one or a few global values, and a broadcast phase which returns global information back to each processor. This model is characterized by synchronous communication. The synchronism requirement may be caused by data dependency or may be demanded by global control for better performance. The compute phase varies widely from application to application. For a multicomputer we would like the compute phase to be dominant. The aggregation phase can be a tree-based computation that combines data from the processors and produces a single piece of global information, such as the convergence signal for solving PDE problems, or it can be a data-gathering communication pattern, in which all the computed results are sent to a special node that processes them and coordinates the next round of computation.

Both the global-exchange computation model and the compute-aggregate-broadcast computation model are commonly used for iterative methods. Comparing the two, the CAB computation model has a lower communication requirement. However, it also suffers degradation caused by uneven allocation: at some compute phases only one processor works and all the others are idle.

Figure 3.9: CAB Computation Model
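A serial sketch of a CAB-style iteration is given below: each logical node performs a Jacobi-type sweep on its slice of a 1-D grid (compute), the local residuals are combined into one global value (aggregate), and the resulting continue/stop decision is returned to every node (broadcast). The model problem, the node count, and the use of numpy are assumptions of this illustration only.

    import numpy as np

    def cab_jacobi(u, nodes=4, tol=1e-8, max_iter=10_000):
        n = len(u)
        chunks = np.array_split(np.arange(1, n - 1), nodes)   # interior points per logical node
        for _ in range(max_iter):
            new = u.copy()
            local_res = []
            for idx in chunks:                                 # compute phase: X_i on each node
                new[idx] = 0.5 * (u[idx - 1] + u[idx + 1])
                local_res.append(np.max(np.abs(new[idx] - u[idx])))
            done = max(local_res) < tol                        # aggregate phase: one global value
            u = new                                            # broadcast phase: decision to all nodes
            if done:
                break
        return u

    u0 = np.zeros(33); u0[-1] = 1.0          # fixed boundary values, zero interior guess
    print(round(cab_jacobi(u0)[16], 4))      # converges toward the linear profile, about 0.5

In a multicomputer implementation the aggregation would be a tree reduction or a gather to a special node, and the broadcast would return the single flag; the compute phase is the only part that scales with the problem size.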
This model is also the basis for several classes of parallel algorithms, including a number of sorting and searching algorithms [48]. In general the recursive divide—and-conquer computation model has two shortcomings for multicomputers: l. Synchronization: The algorithm requires synchronization at each merge stage. 2. Degree of Parallelism: The algorithm has a dynamic degree of parallelism. I I \ I \‘ I \ I , \ ,’ ‘s I \ ‘ I \ I \ I ‘ I \ I ‘ I \ I ‘ I \ I ‘t‘ I \ I. s ”’ Ic-l . 2k-1 . 2(2'I1(2),2(1)1+ x.) + 2 (2"‘"'+"[2(1),1(2)1 + x.), k = IogN i=0 i=k Figure 3.10: Divide-and-Conquer Compute-Exchange Computation The following result shows the beauty of recursive divide-and-conquer technique [5]. Proposition 1 A non-singular triangular linear system A: = b can be solved in 0(log2 n ) time, using n3 processors, where n is the dimension of A. 43 Proof: It is sufficient to prove that A"1 can be found in 0(log2 n) time with n3 processors. We partition the matrix A into blocks: A; 0 A: [A2 As], i (3.6) where A; is of size [n/ 2] x [rt/2]. Since A, and A3 are lower triangular matrices. Moreover, it is easily shown that A?! 0 ] (3.7) A'1 = [ —A;‘A2A," .4? Based on the above decomposition, we obtain the following algorithm. Given an n x n triangular matrix A: 1. If n = 1, then A'1 can be found directly. 2. If n > 1, then partition A as indicated above and do the following: (a) Find the inverse of A1 and A3 concurrently. (Notice that A1 and A3 are lower triangular matrices, they can be inverted by using the same algorithm recursively.) (b) Multiply A; by A5,“ on the left to obtain A; 1A2. (c) Right-multiply the result of (b) by Afl. Both steps (b) and (c) take 0(log n) time using n3 processors. Thus, if T(n) denotes the time required by the algorithm for inverting a matrix of dimensions n x n, we have T(n) = T( [n/2]) + 0(log n), which yields T(n) = 0(log2 n) computation using n3 proces- sors. C] The method presented here does not make things simpler than the presumably easier task of solving the system A2: = b. Furthermore, if communication overhead is properly taken into account, the time requirement can be much higher than 0(log2 n) for certain cOmputer architectures. For this reason, and in view of its requirement for a large number of processors, the algorithm is theoretically interesting but impractical. A more practical method will be described later in the pipeline model. Partition models are more practical than recursive divide-and-conquer models, especially for N < n. Three partition methods for solving tridiagonal linear systems on multicomputers will be given in Chapter 5. 44 3.2.5 Domain Decomposition Model Domain decomposition reduces the original large problem into a set of subproblems through decomposition of the problem domain. This technique was originally developed to reduce overall computation time and storage requirements on sequential machines. However, it becomes much more important in parallel algorithm design. The decomposed subproblems may be independent of each other or may be dependent in some way. The goal of the design is to reduce the dependence of the subproblems in order to achieve maximal parallelism. The domain decomposition model is well suited to the finite difference method for solving PDEs where each subproblem is dependent on a fixed number of others [19], [31]. V s I w \ I V V I“ ’I \ IA\ I, ‘s\ I s I“ I \\‘ 0‘ ”(4)33” ‘ I \ ’,’ ’I \ '3' ‘2." 
3.2.5 Domain Decomposition Model

Domain decomposition reduces the original large problem into a set of subproblems through decomposition of the problem domain. This technique was originally developed to reduce overall computation time and storage requirements on sequential machines. However, it becomes much more important in parallel algorithm design. The decomposed subproblems may be independent of each other or may be dependent in some way. The goal of the design is to reduce the dependence of the subproblems in order to achieve maximal parallelism. The domain decomposition model is well suited to the finite difference method for solving PDEs, where each subproblem depends on a fixed number of others [19], [31].

Figure 3.11: Domain Decomposition Computation Model (structured representation $\sum_i\left([N(4), N(4)] + X_i\right)$)

One problem faced by domain decomposition is how to partition the domain. In order to solve Laplace's equation,

$$\frac{\partial^2 z}{\partial x^2} + \frac{\partial^2 z}{\partial y^2} = 0, \qquad (3.8)$$

by the finite difference method, we decompose the xy domain into subregions and assign one region to each processor. The regions can be chosen either as strips or as rectangles. With strip decomposition, each processor communicates with two neighbors. With rectangular regions, each processor has to communicate with three or four neighbors, but has shorter messages. The optimum decomposition depends on the computation/communication ratio of the underlying machine and the number of grid points in each region. Domain decomposition computation is characterized by easily achieved load balance and by neighboring communication. However, if global convergence checking is adopted for the finite difference method, then global communication is needed.

Problems can also be solved in parallel by decomposing the solution domain. The bisection method for solving eigenvalue and polynomial problems belongs to this category. This kind of decomposition has a different communication pattern, and we do not consider it part of the domain decomposition computation model.

3.2.6 Pipeline (Ring) Computation Model

Pipeline (ring) computation increases concurrency by dividing a computation into a number of steps and allowing a number of tasks in various stages to be executed at the same time. The characteristic of this model is that the intermediate result moves in a fixed direction and the final result emerges in the last stage. Communication is automatically overlapped with computation; this is the main advantage of this model [34]. One primary consideration in pipelined computation is to keep the processing speeds of all stages roughly equal; otherwise, the slowest stage becomes the bottleneck of the pipeline.

There are two types of pipeline computation models, the functional pipeline model and the data pipeline model. Traditionally, the term pipelining refers to function pipelining: different stages in the pipeline perform different functions, and, as data flows through the stages, it is modified along the way. In the data pipelining model, processors in the pipeline perform the same function in a synchronous fashion. Back substitution is a well-accepted pipeline paradigm for solving triangular linear systems. Under the assumption that A is a lower triangular matrix, the i-th equation of the system Ax = b is

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{ii}x_i = b_i. \qquad (3.9)$$

The following parallel version of the back substitution algorithm employs n processors. Suppose that at the beginning of the i-th stage, the values of the variables $x_1, \ldots, x_{i-1}$ and the expressions $a_{j1}x_1 + \cdots + a_{j,i-1}x_{i-1}$ for each $j \ge i$ are available. Then the i-th processor evaluates $x_i$ by solving Eq. (3.9):

$$x_i = \frac{1}{a_{ii}}\left(b_i - a_{i1}x_1 - \cdots - a_{i,i-1}x_{i-1}\right). \qquad (3.10)$$

Finally, each processor j, with $j \ge i+1$, evaluates the expression $a_{j1}x_1 + \cdots + a_{ji}x_i$ by adding $a_{ji}x_i$ to the previously available expression $a_{j1}x_1 + \cdots + a_{j,i-1}x_{i-1}$. The algorithm terminates at the end of the n-th stage, when all the variables $x_1, \ldots, x_n$ have been computed. Clearly, the parallel time required for each stage is constant. Therefore, the total time required by this version of back substitution is O(n) using n processors. Here the communication cost is excluded.
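A serial simulation of this data pipeline is sketched below (numpy assumed): the loop index plays the role of the pipeline stage, and the array named partial holds the running expressions that each later processor would keep locally.

    import numpy as np

    def pipelined_forward_solve(A, b):
        """Serial simulation of the n-stage data pipeline of Eqs. (3.9)-(3.10)
        for a lower triangular system Ax = b."""
        n = len(b)
        x = np.zeros(n)
        partial = np.zeros(n)              # partial[j] = a_j1*x_1 + ... accumulated so far
        for i in range(n):                 # stage i of the pipeline
            x[i] = (b[i] - partial[i]) / A[i, i]
            # each processor j > i adds a_ji * x_i to its running expression
            partial[i + 1:] += A[i + 1:, i] * x[i]
        return x

    A = np.tril(np.random.rand(5, 5)) + np.eye(5)
    b = np.random.rand(5)
    assert np.allclose(A @ pipelined_forward_solve(A, b), b)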
Connecting the output of a later stage to the input of an earlier stage of a pipeline model, we get the ring computation model. The ring model is commonly used in iterative methods; wrapping the pipeline around allows successive iterations to continue.

Figure 3.12: Pipelined Computation Model

3.2.7 Recursive Doubling Computation Model

The best known example of parallelism is the computation of the summation $a_0 + a_1 + \cdots + a_{n-1}$ in $O(\log n)$ time by n/2 processors. It follows the process of recursive doubling, which consists of the evaluation of subterms of size $2^i$ for $i = 0, \ldots, \log(n/2)$. Computation of the sum and of all the partial sums leads to two different communication patterns. Pattern 1 can be seen as the conquer part of the recursive divide-and-conquer model, known as tree reduction. Pattern 2 is quite different (see Figure 3.13); it is called a prefix computation by some authors. Recursive doubling is applicable to a large number of applications. The following result is due to Kogge and Stone [35].

Proposition 2 The recursive function

$$x_1 = b_1, \qquad x_i = F(x_{i-1}) = f\!\left(b_i, g(a_i, x_{i-1})\right), \quad 2 \le i \le n \qquad (3.11)$$

can be computed by a recursive doubling parallel algorithm in $O(\log n)$ time, where $b_i$ and $a_i$ are arbitrary constants and f and g are index-independent functions that satisfy the following three restrictions:

1. f is associative: $f(x, f(y, z)) = f(f(x, y), z)$.

2. g distributes over f: $g(x, f(y, z)) = f(g(x, y), g(x, z))$.

3. g is semiassociative, that is, there exists some function h such that $g(x, g(y, z)) = g(h(x, y), z)$.

A large number of linear recursive functions can be reduced to the above form by using a matrix representation. However, this reduction very often increases the sequential computation count. Functions suitable for recursive doubling are given in [57].

Proposition 3 The LU decomposition of an $n \times n$ tridiagonal linear system can be computed in $O(\log n)$ time, using n processors.

Proof: Let A be a tridiagonal matrix of order n:

$$A = \begin{bmatrix} b_0 & c_0 & & & \\ a_1 & b_1 & c_1 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-2} & b_{n-2} & c_{n-2} \\ & & & a_{n-1} & b_{n-1} \end{bmatrix}. \qquad (3.12)$$

When A = LU and L is chosen to have its diagonal elements equal to unity, the matrices L and U take the forms

$$L = \begin{bmatrix} 1 & & & \\ m_1 & 1 & & \\ & \ddots & \ddots & \\ & & m_{n-1} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} u_0 & c_0 & & \\ & u_1 & c_1 & \\ & & \ddots & \ddots \\ & & & u_{n-1} \end{bmatrix}.$$

Note that the superdiagonal elements of U are identical to those of A. Thus we only need to compute $m_i$ and $u_i$ to obtain this factorization. Observing the equation A = LU, we can find the recursive relation

$$u_0 = b_0, \qquad u_i = b_i - \frac{a_i c_{i-1}}{u_{i-1}}, \qquad m_i = \frac{a_i}{u_{i-1}}, \qquad 1 \le i \le n-1. \qquad (3.13)$$

To solve for $u_i$ and $m_i$ in parallel, Stone introduced the quantities $q_i$ ($-1 \le i \le n-1$) that satisfy the recurrence equations

$$q_{-1} = 1, \qquad q_0 = b_0, \qquad q_i = b_i q_{i-1} - a_i c_{i-1} q_{i-2}. \qquad (3.14)$$

Let $u_i = q_i / q_{i-1}$ for $i \ge 0$ and write Eq. (3.14) in matrix form:

$$\begin{bmatrix} q_i \\ q_{i-1} \end{bmatrix} = \begin{bmatrix} b_i & -a_i c_{i-1} \\ 1 & 0 \end{bmatrix}\begin{bmatrix} q_{i-1} \\ q_{i-2} \end{bmatrix} \qquad (3.15)$$

or

$$Q_i = B_i Q_{i-1}, \qquad 1 \le i \le n-1, \qquad (3.16)$$

where $Q_i = (q_i, q_{i-1})^T$ and $Q_0 = (q_0, q_{-1})^T = (b_0, 1)^T$. In this form all $Q_i$ ($1 \le i \le n-1$) can be expressed in terms of $Q_0$:

$$Q_i = B_i B_{i-1} \cdots B_2 B_1 Q_0, \qquad i = 1, 2, \ldots, n-1. \qquad (3.17)$$

Thus each $Q_i$ can be computed by evaluating the product of $2 \times 2$ matrices in parallel, and it is not necessary to compute $Q_i$ only after $Q_{i-1}$ has been computed. Figure 3.13 indicates the computation and communication process for n = 8. □

Figure 3.13: Recursive Doubling Computation Model (structured representation $\sum_{i=1}^{k}\left((N - 2^{i-1})[1(1), 1(1)] + X_i\right)$, k = log N)
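The sketch below evaluates the recurrence of Eqs. (3.14)-(3.17) by organizing the 2x2 products as a recursive-doubling prefix computation, the pattern of Figure 3.13. It runs serially; on N nodes each doubling step would become one data exchange. The sample coefficients are arbitrary, and numpy is assumed.

    import numpy as np

    def recursive_doubling_prefix(mats):
        """All left prefix products P_i = B_i @ B_{i-1} @ ... @ B_1, computed
        with the recursive-doubling pattern (serial simulation)."""
        pref = list(mats)                        # span-1 products
        n, offset = len(mats), 1
        while offset < n:
            nxt = list(pref)
            for i in range(offset, n):           # these products double their span
                nxt[i] = pref[i] @ pref[i - offset]
            pref, offset = nxt, offset * 2
        return pref

    # tridiagonal LU via Eq. (3.14): q_i = b_i q_{i-1} - a_i c_{i-1} q_{i-2}
    a = np.array([0.0, 1.0, 1.0, 1.0, 1.0])      # sub-diagonal (a_0 unused)
    b = np.array([4.0, 4.0, 4.0, 4.0, 4.0])      # diagonal
    c = np.array([1.0, 1.0, 1.0, 1.0, 0.0])      # super-diagonal
    B = [np.array([[b[i], -a[i] * c[i - 1]], [1.0, 0.0]]) for i in range(1, len(b))]
    Q0 = np.array([b[0], 1.0])                   # (q_0, q_-1)
    Q = [P @ Q0 for P in recursive_doubling_prefix(B)]
    u = [b[0]] + [qi[0] / qi[1] for qi in Q]     # u_i = q_i / q_{i-1}, Eq. (3.13)
    print(np.round(u, 6))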
In the hypercube connection topology, the communication pattern shown in Figure 3.13 can be mapped such that the dilation cost is no greater than 2 [33]. This means that the communication is either neighboring communication or passes through one intermediate node.

3.3 Application Considerations

We have identified seven computation models for parallel computing: the local computation model, global-exchange computation model, compute-aggregate-broadcast computation model, divide-and-conquer computation model, domain decomposition computation model, pipelined computation model and recursive doubling computation model. We have found that various scientific applications are combinations of these models. Most of the examples given in this chapter are simple applications; complicated applications may involve more than one computation model. For designing efficient parallel algorithms we have to understand the problem under consideration, which includes understanding the given application, the applied mathematical method, and the underlying computer architecture. The computation model concept provides a way to gain an insightful view and to understand the relation between applications and architectures. Based on the performance information of computation models, parallel algorithms may be deliberately switched from one model to another to achieve better performance.

The performance of computation models is problem dependent as well as machine dependent. It depends on the topology, the switching technique and the routing scheme of the underlying multicomputer. Topology determines which nodes are directly connected by physical links. The hypercube is the popular topology for first generation multicomputers. The 2-D mesh topology has become a strong alternative to the hypercube in second generation multicomputers. The switching technique determines how two non-adjacent nodes communicate. First generation multicomputers adopted the store-and-forward switching technique, while second generation multicomputers use circuit switching or the wormhole routing switching technique. The routing scheme decides the sequence of channels (called paths) used for communication between any pair of nodes. In first generation multicomputers, the routing scheme is the shortest path. In second generation multicomputers, a fixed-path routing scheme is used. Different multicomputers may adopt different topologies, switching techniques and routing schemes, so performance considerations may differ from machine to machine, and computation models should be studied on different multicomputers. Most of my experience is on a first generation NCUBE multicomputer. It is a hypercube multicomputer with store-and-forward switching and shortest path routing.

Chapter 4

APPLICATION-DRIVEN ALGORITHM DESIGN

Designing efficient parallel algorithms is difficult. The structured representation and computation model concepts provide a guideline for efficient algorithm design. With structured representation, structured rethinking becomes possible. The bottlenecks of an algorithm are easier to find and high parallelism is easier to explore. The performance information of the computation models suggests which structure is better for a given multicomputer. Therefore, structured design becomes possible.
We have used structured representation and computation models on several applications. These applications include solving electrical power flow problem, solving tridiagonal linear systems and solving dense linear systems. The first application is one of the most import problems in the electrical power field. The second application is a fundamental problem of scientific computing. The structured representation and computation model concept helps us in changing our design from one structure to another. For solving the power flow problem, we started with a local computation design, and changed to a global-exchange computation design. After studying the structure of the global-exchange algorithm, we adopted a compute-aggregate- broadcast design which gives the best performance [59]. The algorithm design for the power flow application is mainly influenced by the chosen mathematical method, the homotopy method. This kind of algorithm design is called application-driven algorithm design. We also developed three algorithms for solving tridiagonal systems. They are all based on the same matrix modification formula. The algorithm design for solving tridiagonal systems is mainly influenced by the topology of the underlying multicomputer [60]. This kind of algorithm design is called architecture-driven algorithm design. Some of the design changes which we have made are only suitable for the hypercube topology. More information concerning the power flow application can be found in the rest of this chapter and a detailed study of solving tridiagonal systems can be found in the next chapter. 52 53 4.1 Preliminary Determining steady state solutions of large-scale interconnected power systems is by far one of the most important problems facing theoretical as well as applied researchers in the field. It has been formulated mathematically [64] and commonly referred to as the power flow, or load flow, problem. The most frequent and acceptable numerical procedures for finding solutions to the load flow problem are variants of the Newton method. It is well known, however, that the Newton method does not always converge to a solution, and if it does, there is no guarantee that it would find all possible solutions to the load flow problem. The Newton method is a local algorithm because it depends on the initial guess and, therefore, it is dependent on the type of nonlinearity of the governing equations, i.e., the load flow equations. Our work aims at solving the load flow problem using the globally convergent probability- one homotopy method originally proposed in [9]. The homotopy method has been shown to have the following distinct features [52], [45], [51], [10], [37], [38]. First, numerical compu- tation of the homotopy method can be systematically implemented. Second, the homotopy method is globally convergent, i.e., one may choose any initial guesses and the homotopy method is guaranteed to converge to all solutions with probability one. However, for solv- ing large-scale load flow problems, the computation requirements of the homotopy method increases exponentially. The exponential increasing of computation requirements is the case for most scientific and engineering applications. Large-scale scientific computing is one of the major driving forces behind parallel computation, but the move into parallel territory requires new conceptual strategies for formulating a problem, and new algorithms to shape the problem for parallel computers. 
This implies that a brute-force approach to solving a large—scale problem, such as the load flow of power systems, on parallel computers does not render the problem tractable. The power flow problem is a good example for illustrating the design process of large scale scientific computing and how the structured representation and computation model concept can be used to help the design. The design process of the power flow application consists of four parts - finding an appropriate mathematical model for the application, choosing a suitable numerical method to solve this model, modifying the mathematical method to reduce the computation, and changing the algorithm design to achieve high performance. 54 4.2 Power Flow Application The power flow, or load flow, problem has been modeled on the sinusoidal steady state nodal analysis of circuit theory. Figure 4.1 depicts the general model of a power system. It consists of three main components: generators (suppliers), loads (consumers), and a transmission network that connects generators and loads. Consumers demand electrical power to be supplied. Suppliers generate the power and transmit the power to the consumers. The demand changes with time. The suppliers, therefore, have to vary their production to fit the demand. Changing supply according to demand is done not only for economic reasons, but, more importantly, for safety considerations. In order to operate a power system safely, power generation must satisfy a set of inequality constraints. A solution which satisfies this set of constraints is called a steady state solution. The power flow problem is a matter of determining the steady state solutions. 1 l 1? Generators: 0 (Pv'busesl: 'Il‘ansmission Loads (PQ-buses) 0 Network 0 o o - +1 Slack Bus—l- '- Figure 4.1: The Main Components of Power System Let the power system have N +1 buses or nodes with a voltage reference (or datum). Let the load buses be subscripted from 1 to NL, and let the generator buses be subscripted from N1,“ to N. We choow to subscript the slack bus by 0 and denote the N x N node or bus complex admittance matrix by [Y], where its lei“ component is y)..- = G)..- + j E)..- (j = \/-—1). Let E). denote the complex voltage of bus k. In polar coordinates it is 55 E). = Vpeje‘, (4.1) where V). represents its voltage amplitude and 6). represents its phase angle. In rectangular coordinates, it becomes E). = VwosOk + ijsinOk = X]: + in, (4.2) where X). represents the real part of the complex voltage E), and Y), represents its imaginary part. Let S), = p). + jq). be the complex injected power at node 1:, then the injected complex power balance equation may be written as [13-] [Y] E — 5: = 0 (4.3) or N N E); Z yin-E: + E; Emu-E; — 2P]: = 0, k = I, 2, ..., N (4.4) i=0 i=0 N N El: 2 yziE: - E; Z ykiEi — 2lec z 0) k =1: 2, "'3 NL (4'5) i=0 i=0 as; - v: = o, k = NL+1,...,N, (4.6) where the superscript * denotes the complex conjugate. The above system of equations is a polynomial system. They are the full load equations in complex form. The full load equations can be represented in other forms. They can be written in rectangular coordinates as, N Z[G,.,-(X,.X,- + my.) + B.,(X,-Y,. — X).Y.-)] - P). = 0, k = 1, 2, N (4.7) i=0 xi + Y: - v: = o, k = 1,2, ...,NL (4.8) N Banana. - in’.) — B),,-(X.-X,, + 15.14)] — Q), = o, k = NW, N. 
(4.9) i=0 Or, it can be written in equivalent rectangular coordinate equations as N N XI: 2(GHXI: - BkiYi) + Yk 2(BkiXi - Gaye) " P1: = 0, k = 1,2, mr N (4-10) i=0 i=0 xi + Y: — V,2 = o, k = 1,2, ...,NL (4.11) N N Y): 2(Gkixi - Buys) — XI: 2(Blcixi + Guys) - Q]: = 0, k = NL+1, ..., N, (4-12) i=0 i=0 56 where the additional equations (4.8) and (4.11) represent the constraints on the amplitude of the voltage of the PV-buses. The full load equations also can be represented in terms of trigonometric functions, called standard power flow equations, as N 2 V;,V.-(G'k.-cosOk,- + BkgsinOkg) — P). = 0, I: = 1, 2, ..., N (4.13) i=0 N Z VLVXGIn-sinOh- - thosOH) -- Q); = 0, lc = NL+1, ..., N. (4.14) i=0 There are three types of buses in a power system network: 1. PQ bus: a bus where the real and reactive powers are specified. 2. PV bus: a bus where the real power and the voltage amplitude are specified. 3. A Slack bus: a fictitious concept whereby one of the generator buses has only its complex voltage specified. One purpose of this bus is to guarantee that the total power injection into the network equals the total power consumed. It is conventional to model loads as PQ-buses, one generator as a slack bus, and the rest of the generators as PV buses, as depicted in Figure 4.1. With some assumptions, two simplified models are also proposed for power flow problems [51], which can be used to obtain approximate solutions with reduced computation. [4.3 Homotopy Method The homotopy method [9] (see Section 3.2.1) is able to find all the complex roots of a system of polynomials with probability one. Let P(x) = 0 denote a system of n equations in n unknowns. ’ P1021, 22, ...Zn) P(Z)= P’(z""""’") =0, (4.15) _ P..(21,zg, ...zn) 4 where Z is an n-dimensional complex vector (21, ..., z“). The homotopy method solves this system by solving a trivial system 57 0(2) = 0. (4.16) and then follows the solution curves of H(Z,t) = (1 — t)Q(Z) + tP(Z) = 0, (4.17) from t = 0 to t = 1. If Q(Z) = 0 is chosen correctly, the following properties hold [39]: o Triviality The solutions of Q(Z) = 0 are known. a Smoothness The solution set of H (Z,t) = 0 for 0 S t < 1 consists of a finite number of smooth paths, each parameterized by t in [0,1). 0 Accessibility Every isolated solution of H (Z, 1) = P(Z) = 0 is reached by some path originating at t = 0. It follows that this path starts at a solution of H (Z,0) = Q(Z) = 0. Let the degree of P,- be d, for 1 S i S n. In general, Eq.(4.15) has at most d = d; x d; x x d,, isolated roots. It has been proven [9] with random complex numbers b1, b2, ..., bn that 2;“ —- b1 62(2) = = o (4.18) 2:" — b,, can be used to find all the solutions of (4.15) by tracing the solution curves of equation (4.17) from t = 0 to t = 1 with probability 1. The robustness, stability, and accuracy properties of the homotopy method make it the best known choice for power flow problems. The number d = 2 x 2 x x 2 gives the upper bound of the possible solutions of the power load system (4.4)—(4.6). In practice, power flow systems have many fewer less solutions than this upper bound. Many of the homotopy curves cannot reach t=1, because they go to infinity when they approach t=l (see Figure 4.2). These divergent curves generally require more computation than convergent curves and consume most of the computation time. Not only does the system of equations (4.4)-—(4.6) have many fewer solutions than the upper bound, but also, in practical consequences, the majority of polynomial systems have the same property. 
These kinds of systems are called deficient by Li et a1. [37] and called m-homogeneous by Morgan and Sommese [43]. 58 Based on [43], we conclude that equation (4.17) is 2-homogeneous and, instead of 22’" solutions, it has at most [N : 2N ] (i.e., N out of 2N) solutions. The following modified initial function is proposed to trace the [N : 2N ] possible solutions, and the correctness of the function is followed by the theoretical results of the general deficient and m-homogeneous systems [37] [51]. For the general complex form (4.4) - (4.6), the initial polynomial system is chosen as (E). + a.)(z£§'__, y;,E: + 3).), k = 1, 2...N, Q(Z) = (2.1-:1 ykiEki + aN+k)(E£ + .BN+I:), ’6 = 1,2. ..., NL, (4-19) (B). + 0N+Nn+k)(Ei + BNsl-NL'H‘)’ 1‘ = NL+1, "'1 N where (1;, ,6;,i = 1, ...,2N are random complex scalars. For “almost all” (01, ..., am) 6 CZN, (fir. "'3162N) 6 CzN, Eq. (4.19) has [N : 2N] solutions. The equations of the polynomial system Q(Z) are products of linear equations. Q(Z) is easy to solve. It has [N : 2N ] distinct complex roots and all the solutions of equations (4.4) — (4.6) can be obtained by tracing the solutions curves of equation (4.17). With the specified initial system of equations (4.19), the computation time has been reduced considerably. That is one reason for choosing the complex full flowed model for our study. 4.4 Implementation Issues and Results With the homotopy method, the solution process for system (4.4) - (4.6) is the process of curve tracing. We need to trace the solution curves, which are also called homotopy curves, of the homotopy function (4.17), starting from t = 0 and gradually increasing t until it becomes 1. Let to,t1, ...,tk be the set of sampling instants, where to = 0 and t), = 1. Define s.- = t,- — t,--1 as the stepping distance from t;-1 to the next sampling instant t,- for 1 S i S k. Given Z,(t,-_1), we have to solve Z,-(t.-) for t,- = t,-_1 + s,-. Several different techniques, such as Newton’s iterative method, the ODE-based method, and normal flow method [63] may be used for the curve tracing. Theoretically, all the homotopy curves can be traced concurrently and independently. Therefore, the power flow problem is readily solved by the local computation model. Though homotopy curves have been proven disjoint, it is possible that one curve merges to another curve [8] in the numerical implementation of curve tracing. This curve merging will lead to the Occurrence of curve missing, and consequently, solutions being missed. The stepping distances have to be very small to reduce the possibility of curve merging. Small 59 stepping distances increase the computation time significantly. The performance of the local computation approach is not very encouraging. H H(z,t) = (1-t)Q(Z)+ tP(Z> ‘ Some homotopy curves diverge End . 0 Points Q(Z)=o ° 2 P(Z)=O Start 9 9 Points : Some curves may erge 3 O / "V Figure 4.2: The Homotopy Method To reduce the computation cost of curve tracing, we adopt a dynamic stepping strategy [10]. Run time information is used to adjust the stepping distance to make it relatively large. A predictor-corrector method is adopted for curve tracing. This curve tracing method contains two steps, first predict the value Z (t1) based on the value and derivative at t.--1, then use Newton’s iterative method to correct the predicted value until it satisfies Eq. (4.17). If after a certain number (threshold) of iterations, the predicted value does not converge to a solution of equation (4.17), we say the tracing at t,_1 has failed. 
When this occurs, the stepping distance 3,- = t,- - t,-_1 is reduced and the procedure is repeated. If the predicted value converges to a solution within the threshold, the tracing is successful. The stepping distance is increased for the next step. The iteration threshold is an important parameter for the dynamic curve tracing method. Curve merging checks are also performed. We introduce a series of c merge checking instants, 0 = to < t; < t; < < t; < l, where these c merge 60 checking instants are a subset of the sampling instants. Note that tracing different homotopy curves dynamically may lead to different sets of sampling instants and a different number of sampling instants, lc. However, for the purpose of merge checking, these c merge checking instants must be included in any set of sampling instants. PE- .." ,’I I I ' \ she fig, :1’ \ ” Q‘ : ”333% “ WE‘} y 4 i. . [8(8).8(8)] "5‘J’ve‘g? $3; ‘gé £53.: ‘:\\\‘ , t ' ’9 It ‘ \ I‘ 559! .9cf‘ k1 fé’uifle‘s" . \ V“ a a¢.gg:‘e‘: : : :t’so 8 ‘ W5: I ’9: ”92‘ ‘ fla‘lzu \ 53: s: ‘4‘ I A‘ } I" T J aaagfi ,v‘fi [861.8(8)] I ‘a 1;. Figure 4.3: Global-Exchange Model for Merge Checking Merge checking requires global information. Communication becomes unavoidable at merge checking instants. This communication requirement changes our computation from a local computation to a compute-exchange computation. Initially, this compute-exchange computation is implemented in a synchronous fashion, which corresponds to the global- exchange computation model (see Figure 4.3). All the nodes synchronize at merge checking instants, and merge sort is used at these points to sort the intermediate values Z(t;'). If more than one curve has the same value, curve merging has occurred. The merged curves will then be retraced from t‘ _, with a more conservative iteration threshold. With the merge checking strategy, the stepping distance can be relatively large. If only a few retracings have been done, performance should be improved. This is true when the synchronous merge checking algorithm is implemented with a single processor. However, with more processors, the merge checking algorithm does not show any superiority over conservative tracing without merge checking. To analyze degradations of parallelism we use the notation introduced in Chapter 3 to write down the synchronous merge checking algorithm. With eight processors, it is 61 an +‘Z([8(8),8(8)1+X.-). (4.20) i=1 when no curve merging happens. When curve merging does occur at j of the c merge checking points, it becomes Xai +§([8(8),8(8)1+X.). (4.21) i=1 where X,- is the workload of the node with the heaviest load during the two contiguous synchronization. Note that we use dynamic stepping. Each curve, depending on its shape, may require a different number of sampling instances from t:_, to t:. X,- is the workload of the curve with the biggest number of sampling instances between t;_, and t2“. Equations (4.20) and (4.21) also show that when curve merging occurs, some nodes are busy doing retracing while all the others are waiting. This reveals that the load is not balanced between synchronization, and that curve retracing makes the load imbalance even worse. ’ 7 7 v v 3' 3' I I " a" ,o ’-o I ' a v ’ ’ I’ ’’’’’’ I I ’ I ooooo I I ' at o I ’ I ” """"" I [’I’a'o=:::" I I ’a"o’ D — )_1 E D — —. . — ‘ \\s~“ \ s \ “ \ \~ \ s‘ \ “ T} shjjih- Figure 4.4: Compute-Aggregate—Broadcast Model for Merge Checking To overcome load imbalancing, the algorithm is redesigned to be asynchronous (see Fig- ure 4.4). 
We choose one node as the checking node. All the other nodes work as the computing nodes. Computing nodes do the curve tracing and send the computed value at t; to the checking node. The checking node does the merge checking. When curve merging 62 45 I l I l l l ....... > 1. The PDD algorithm is a fast algorithm for a special case. We will not study the PDD algorithm in this chapter. Communication mechanisms have a great impact on the performance of multicomputers. Thus, the communication pattern of parallel algorithms should be carefully designed to reduce the communication complexity. The communication consideration may lead to design modification. In this study, the algorithms are evaluated based on both computation and communication complexities. 5.1 Linear Tridiagonal Solvers Two known methods will be used in our algorithms. For completeness, they are briefly presented in this section. 5.1.1 The LU Decomposition Method The LU decomposition method is also called the Gaussian elimination method without piv- oting. The LUP decomposition method is the Gaussian elimination method with pivoting. They are the most commonly accepted sequential algorithms and are used in the linear sys- tem package, LINPACK [13]. The pivoting referred to in this paper is partial pivoting. The LINPACK routine SGTSL solves Eq. (5.1) using the LUP decomposition method. 66 The algorithm first factors the matrix A into a product form A = LU (or PA = LU when pivoting is involved), (5.3) where L and U are lower and upper triangular matrices and P is the product of all of the row permutations. Then A1: = d (or PAx = Pd) can be solved by solving both Ly = d (or Ly = Pd) and Us: = y. (5.4) It can be easily verified that the LU and LUP algorithms can solve Eq. (5.1) in 8n — 7 and lln - 12 arithmetic operations for the non-pivoting and pivoting cases, respectively. 5.1.2 The Parallel Prefix Method The RCD method uses 0(log n) parallel computation time to solve Eq. (5.1) on a parallel computer with n processors (see Section 3.2). The PP method [16] modifies the RCD method for N processors, where N < n. Equation (5.1) can be written as a,-:c.--1 + b,-:c,- + car,“ = d,- , 0 S i _<_ n — 1 , (5.5) where 3.; = 2,, = 0. Solving for mg“, we have $i+1 = (--:i)$r + (-§)3;-1 + (Ei) = 051'; + fli-‘Pi—l + 7i- (55) Here c,- aé 0 is assumed. Equation (5.6) can be written in matrix form as $i+1 Ge ,3; 7i 1'; '1'; = 1 0 0 25.1 (5.7) 1 0 0 1 1 Define $i+1 X,“ = z,- with $4 = 23,. = 0 , (5.8) 1 J 0i [3; 7i - B.- = 1 0 o (5.9) 0 o 1, 67 Equation (5.6) becomes X5.“ = B;X,' , 0 S i S n—l , (5.10) and X.- (1 S i S n) can be expressed in terms of X0 as X; = B;_1Bg_2...BlBoXo , 1 S 2 S n . (5.11) Now solving Eq. (5.1) reduces to finding the partial products of matrices B; for 0 S i S n — 1. If N < n, evenly distribute matrices B.- to N processors, perform sequential matrix multiplication on each processor, then use the prefix method on N pro- cessors. There are (log N) + 1 parallel communication steps for implementing the prefix method with N processors. Let C? = B,-B,-+1...Bj. Then 0,1 is a 3 x 3 matrix. Since the last row of C? always equals ' [0,0,1], only six entries of C? need to be transferred at each parallel communication. Parallel recursive doubling computation can be used to obtain all the partial matrix products. The actual communication complexity of the PP method depends on the mapping of computation units to processors, the. interconnection topology of the multicomputer, and the underlying communication mechanism. 
In the case of the hypercube topology, the communication pattern shown in Figure (3.13) can be mapped such that the dilation cost is no greater than 2 [33]. 5.1.3 A Novel Matrix Partitioning Technique Our parallel algorithms are based on the divide and conquer model of parallel computation. In the divide part, matrix A is partitioned into submatrices. For convenience we assume that n = Nm. Matrix A in Eq. (5.1) can be written as A = A + AA, (5.12) where the submatrices, A.- (i = 0, ..., N — 1), are m x m tridiagonal matrices. Let e,- be a - column vector with its i“ (0 S i S n - 1) element being 1 and all the other entries being zero. We have 68 — I I I — — I I I -'l I I I I I .- --L-—-—---L-q ---Q~Jl ------- I.-. . I I I a I I ”4"". ------- Lu r-q--l6m---L-. . i i ' l : al'fi- : A: : : : AA: : :'x : I I I I I ' I I I I I I ‘ I I I ' I I I ' ~--.--r-------r-- "nut ------- Hahn-1 _ : : :AA-I _ ' : “1“?" F 55—1 63. AA = amema cm—lem—la a2me2m9 C2m—le2m-lan-I C(N—1)m—1€(N-1)m-1] - = VET, C(IV-im-l CT (N—l)m where both V and E are n x 2(N - 1) matrices. Thus we have A = A + VET. (5.13) Based on the matrix modification formula originally defined by Sherman and Morrison [54] for rank-one changes and generalized by Woodbury [14], Eq. (5.1) can be solved by z = A-ld = (A + VET)-1d = A-Id — A-IV( I + ETA-1V )‘1ETA-‘d . Note that I is a 2(N—1)x2(N—1) identity matrix and I + ETA" V is a 2(N—1)x2(N—1) band matrix with bandwidth 5. We introduce a 2(N—1)x2(N—1) elementary transformation matrix P such that (5.14) P z = (21, zo, 23, 22,..., 22N-3, 22”,-” )T for all z E RAN-1). (5.15) Based on the property that P'1 = P, Eq. (5.14) becomes 5 = A-ld — A“VP( P + ETA"VP)“ETA"1d. (5.16) 69 Note that Z = (P + ETA-IVP) is a 2(N — 1) x 2(N — 1) tridiagonal matrix. Let A5 = d, (5.17) AV = VP, (5.18) h = E7513, (5.19) Z = P + ETY, (5.20) Zy = h , (5.21) A2: = Yy. (5.22) From Eqs. (5.17)—(5.22), (5.16) becomes :1: = 51': - A3, (5.23) where A is an n x n matrix, Y, V, and E are n x 2(N— 1) matrices, Z and P are 2(N — 1) x 2(N — 1) matrices, h and y are 2(N - 1) X 1 vectors, and A2, 5:, and d are n x 1 vectors. Based on Eqs. (5.17)-(5.23), the computation required to solve Eq. (5.1) in a sequential computer is described below. In Eqs. (5.17) and (5.18), :‘E and Y are solved by the LU decomposition method. By the structure of A and VP, this is equivalent to solving A,- [§:(‘), v“), mm] = [d(‘), aimeo, c(,-+1),,,_1em-1] , i = 0,..., N—l . (5.24) Here we have a0 = c,._1 = 0, e0, em-1 6 R’”, 5:“) and d“) are the i‘h block of 5: and d, respectively, and v“), w“) are possible non-zero column vectors of the i“ row block of Y. Equation (5.24) implies that we only need to solve three linear systems of order m with the same LU decomposition for each i (i = 0, ..., N — 1). In addition, we can skip the first m - 1 forward substitutions for the third system since the first m — 1 components of the vector at the right hand side are all zeros. Equation (5.19) only picks 2(N — 1) specified components from the vector 5:. There is no computation or communication involved. The evaluation of Eq. (5.20) uses those possible non-zero entries of specified 2(N - 1) rows of Y together with P to form matrix Z. Again there is no computation or communication involved. 70 Equation (5.21) solves a 2(N —— l) x 2(N - 1) tridiagonal system, which is the major computation involved in the conquer part of our algorithms. Since Y has at most two non- zero entries at each row, the evaluation of Eq. (5.22) takes four arithmetic operations per row. 
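Before stating the complexity result, it may help to see the partitioning in action. The following NumPy sketch builds a small tridiagonal system, forms the splitting A = Ā + V Eᵀ at the block interfaces, and solves it through the sequence (5.17)-(5.23) in a dense, single-processor form (the permutation P is folded into a dense solve, so Z is not kept tridiagonal here); it then checks the result against a direct solve. All function and variable names are illustrative, and the dense linear algebra stands in for the block LU solves an actual implementation would use.

```python
import numpy as np

def partition_solve(a, b, c, d, N):
    """Solve a tridiagonal system via the A = A_bar + V E^T splitting.

    a, b, c are the sub-, main- and super-diagonals (a[0] = c[-1] = 0),
    d is the right-hand side, N the number of blocks with n = N*m.
    Dense illustration of Eqs. (5.17)-(5.23), not the distributed code.
    """
    n = len(b)
    m = n // N
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)

    # A_bar: block-diagonal part, obtained by zeroing the 2(N-1) couplings.
    A_bar = A.copy()
    V = np.zeros((n, 2 * (N - 1)))
    E = np.zeros((n, 2 * (N - 1)))
    for i in range(1, N):
        k = i * m
        A_bar[k, k - 1] = 0.0
        A_bar[k - 1, k] = 0.0
        V[k, 2 * (i - 1)] = A[k, k - 1]          # a_{im} e_{im}
        E[k - 1, 2 * (i - 1)] = 1.0
        V[k - 1, 2 * (i - 1) + 1] = A[k - 1, k]  # c_{im-1} e_{im-1}
        E[k, 2 * (i - 1) + 1] = 1.0
    assert np.allclose(A, A_bar + V @ E.T)

    x_tilde = np.linalg.solve(A_bar, d)          # A_bar x_tilde = d
    Y = np.linalg.solve(A_bar, V)                # A_bar Y = V
    h = E.T @ x_tilde
    Z = np.eye(2 * (N - 1)) + E.T @ Y            # I + E^T A_bar^{-1} V
    y = np.linalg.solve(Z, h)
    return x_tilde - Y @ y                       # Eq. (5.14)

rng = np.random.default_rng(0)
n, N = 24, 4
b = 4.0 + rng.random(n); a = rng.random(n); c = rng.random(n)
a[0] = c[-1] = 0.0
d = rng.random(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(partition_solve(a, b, c, d, N), np.linalg.solve(A, d)))
```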
Based on the above observations, together with a careful scaling process, the compu- tational complexity required to solve Eq. (5.1) in a sequential processor is stated in the following theorem. Theorem 4 Equation (5.1) can be solved using Eqs. (5.17)-(5.23) with 1711 -— 6N — 23 operations without pivoting and 24n - 13N — 34 operations with pivoting. 5.2 A Compute-Aggregate-Broadcast Computation Solver Based on the matrix partitioning technique described previously, a parallel algorithm can be easily developed with a compute-aggregate-broadcast computation. This algorithm is called the Parallel Partition LU (PPT) algorithm and is described in this section. Using N processors, the PPT algorithm to solve Eq. (5.1) consists of the following steps: Step 0. Allocate A,-, d“) and elements aim and c(,-+1),,,_1 to the 1"“ node, where 0 S i S N —1. Step 1. Use the LU decomposition method described in Section 5.1 to solve Eq. (5.24). All N computations can be executed in parallel and independently on N processors. Step 2. Send 5:9), 5:51,, v9), v31“ my), 1125,11 (0 S i S n — l) to a special node to form matrix Z and vector h (Eqs. (5.19) and (5.20)). Hereafter the subindex indicates the component of the vector. Step 3. Use the LU decomposition method to solve Zy = h (Eq. (5.21)) on that special node. Note that Z is a 2(N — 1) dimensional tridiagonal matrix. Step 4. Send yg;_1 and yg; to the i“ node, (i = 0, ..., N— 1). Here we set y-1 = 31207-1) = 0. Step 5. Compute Eqs. (5.22) and (5.23). We have Ax“) = [5“), mm] [mi-ll ’ (5 25) V2; 30') = 50') _ Ax“) , 71 This step is executed in parallel on N processors. As mentioned in Section 1.2, the underlying communication mechanism of multicomput- era has a great impact on the performance of parallel algorithms. Before we address the communication complexity issue, let’s take a close look at different communication mecha- nisms. In the store-and-forward mechanism, which is used in all first generation hypercube multicomputers, the message transfer time between two adjacent processors can be expressed as a + ,33, where a is the communication latency, fl is the transmission time per byte, and S is the number of bytes in the message. If a message is delivered to a node h hops away, the message transfer time can be roughly estimated to be h(a + 65). In second genera- tion multicomputers, advanced communication mechanisms are adopted, such as the circuit switching used in iPSC-2 [46] and the wormhole routing used in Ametek 2010 [3]. In these new communication paradigms, the message transfer time is almost independent of the dis- tance (number of hops) between processors. Thus, in second generation multicomputers, the transfer time of a message with 3 bytes can always be expressed as o: + 35 regardless of the distance the message has to traverse. The PPT algorithm has two communication patterns. The data gathering communica- tion required in Step 2 has to transfer 6 data elements per processor. The data scattering communication required in Step 4 has 2 data elements per processor. Figure 5.1 shows these two communication patterns for the case when N = 8 and processor 0 is the special node for solving Eq. (5.21). The best way to handle the data gathering and scattering communications is to use the tree reduction technique, as shown in Figure 5.2 for the case of data scattering when N = 8. The data gathering communication is simply an inverse of the data scattering. It has been shown that the spanning tree can be perfectly embedded in a hypercube [50]. 
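The log N structure of this tree-based scattering can be made explicit with a small counting model. The sketch below simulates one common realization, recursive halving on a hypercube of N = 2^k nodes, and returns the resulting cost in the α, β model introduced above. It performs no actual message passing; the parameter values in the example are only illustrative (α and β are the NCUBE measurements reported later in Section 5.5).

```python
import math

def hypercube_scatter_cost(N, S, alpha, beta):
    """Model the cost of scattering S bytes to each of N = 2**k hypercube nodes.

    Recursive halving: at step j, every node that still holds data destined
    for other nodes forwards half of it to its neighbour across dimension j.
    The root therefore sends N/2, N/4, ..., 1 items over log N steps: it pays
    the start-up alpha once per step and forwards (N-1)*S bytes in total.
    """
    k = int(math.log2(N))
    holding = {0: N}                     # node -> items (own + to forward)
    root_bytes = 0
    for j in range(k):
        nxt = {}
        for node, items in holding.items():
            send = items // 2            # half crosses dimension j
            keep = items - send
            partner = node ^ (1 << j)    # hypercube neighbour
            nxt[node] = keep
            nxt[partner] = send
            if node == 0:
                root_bytes += send * S
        holding = nxt
    assert root_bytes == (N - 1) * S
    return alpha * k + beta * root_bytes # alpha*logN + (N-1)*beta*S

# 8 nodes, 8 bytes per node's share, NCUBE-like alpha and beta
print(hypercube_scatter_cost(N=8, S=8, alpha=5.0, beta=0.013))
```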
Thus the communication time required to scatter messages is logN units for N processors. Note that in the first data transfer, the message has to contain the data for the rest of the processors to be visited. Each processor, upon receiving a message, has to strip its own data and forward the rest of the data to the following processors. As a result, the communication time required in data scattering (and also in data gathering) can be estimated to be alog N + (N —1)55. (5.26) Equation (5.26) can be applied to all second generation multicomputers and to first genera- tion hypercube multicomputers using the embedding scheme described in [50]. Now we are ready to state the computation and communication complexities of the PPT algorithm. 72 WWW? DataGthering ) IDDDDDDD A3 5555 E Figure 5.1: The Communication Pattern of the PPT Algorithm Theorem 5 The PPT algorithm solves Eq. (5.1) in 17(n/N) + 16N — 45 and 24(n/N) + 22N — 69 parallel arithmetic operations for non-pivoting and pivoting, respectively. With 4 bytes per data element, it requires 2(alogN + 16(N — 1);?) communication time units, where N is the number of processors and n = mN. Let Tw, T5”, and TppT be the time required to solve Eq. (5.1) using the sequential LU decomposition algorithm, the sequential partitioning algorithm (Eqs. (5.17)-(5.23)), and the PPT algorithm, respectively. Let rm, represent the unit of a computation operation normalized to the communication time. From those theorems shown previously, we have T LU = (8n -— 7km, without pivoting (5.27) Twp = (11n — 12km, with pivoting (5.28) Tspr = (l7n — 6N — 23km, without pivoting (5.29) TSPT = (24n — 13N - 34)rm, with pivoting (5.30) TPPT = (17% + 16N — 45) Team, + 2(alogN + 16(N — 1);?) without pivoting (5.31) T PPT = (24% + 22N — 69) Team, + 2(alogN + 16(N — 1);?) with pivoting (5.32) 73 Dividing the time required for a sequential algorithm by the time required to execute the PPT algorithm, an estimated speedup of the PPT algorithm over other sequential algorithms can be obtained. In Section 5.5, the theoretical results obtained here will be compared with the measured results obtained from experiments on a 64-node NCUBE. 5.3 A Global-Exchange Computation Model From Figure 5.1 we can see that the PPT method is a compute-aggregate—broadcast compu- tation. Step 1 of PPT is the first parallel computation phase. Step 2 is the data aggregation phase. Step 3 is the single node computation phase and step 4 is the broadcast phase. Step 5 is the second parallel computation phase. On hypercube multicomputers, the data aggrega- tion (gathering) and data broadcast (scattering) are achieved by tree reduction and inverse tree reduction. Both tree reduction and inverse tree reduction take 0(logN) steps, where N is the number of nodes involved in the computation, on hypercube architectures (see Figure 1391:9999 Ix“. \‘x \ 411(1),1(1)] E3 Q r W l \‘x, [{1ij 211mm» éfibfisifid l ............ [1(1).1(1)] 111111 DUDE] FD Figure 5.2: Tree Reduction 74 Using the communication pattern shown in Figure 3.5, the total data-exchange also can be achieved in 0(logN) time on hypercube architectures. The 0(logN) time communication suggests that we can add redundant computation, reduce the communication overhead and, therefore, improve the overall performance. Step 2 of the PPT method can be changed to total data exchange and step 4 of the PPT method can be removed. Step 3 will be redundant on every working node. This redundant computation combined with step 5 of the PPT method forms a computation phase. 
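These timing expressions are straightforward to evaluate. The short sketch below does so for the non-pivoting case, using the machine parameters reported in Section 5.5 (α = 5.0, β = 0.013, τ_comp = 0.08) purely as illustrative inputs; the helper names are not part of any library or of the implementation described here.

```python
import math

def t_lu(n, tau, pivot=False):
    """Sequential LU / LUP time, Eqs. (5.27)-(5.28)."""
    return (11 * n - 12) * tau if pivot else (8 * n - 7) * tau

def t_ppt(n, N, tau, alpha, beta, pivot=False):
    """PPT time on N processors, Eqs. (5.31)-(5.32)."""
    comp = (24 * n / N + 22 * N - 69) if pivot else (17 * n / N + 16 * N - 45)
    comm = 2 * (alpha * math.log2(N) + 16 * (N - 1) * beta)
    return comp * tau + comm

tau, alpha, beta, n = 0.08, 5.0, 0.013, 6400
for N in (2, 4, 8, 16, 32, 64):
    print(N, round(t_lu(n, tau) / t_ppt(n, N, tau, alpha, beta), 2))
```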
With this total data-exchange strategy, we achieve a global-exchange computation method which is called the Parallel Partition Global-exchange (PPG) method (see Figure 5.3). gamma; [N(N) N(N)] ddibihhh Figure 5.3: The Communication Pattern of the G-E Computation Using P processors, the PPG algorithm for solving Eq. (5.1) consists of the following steps: Step 0. Allocate Ag, (1“) and elements a;,,,, c(,+1)m_1 to the 2"“ node, where 0 S i S N - 1. Step 1. Use the LU decomposition method described in Section 5.1 to solve Eq. (5.24). All the N computations can be executed in parallel and independently on N processors. Step 2. Send 30') , $5,111, v35), v,(,',)_,, 109), w(,,',)_ (0 S i S N — 1) to every other node to form matrix Z and vector h (Eqs. (5.19) and (5.20)). Step 3. Use the LU decomposition method to solve Zy = h (Eq. (5.21)) on each node and compute Eqs. (5.22) and (5.23) in parallel on N processors. The computation requirement of the PPG algorithm is the same as the PPT method while the communication requirement is reduced from 2(alogN + 16(N - l)fi) to (01le + 16(N - 1))6). This reduction is not significant for our machine where N = 64, but it may have a larger influence when more processors become available. 75 5.4 A Solver With More Than One Computation Model The PPT and PPG algorithms discussed above provide better performance than most of the existing parallel algorithms for solving Eq. (5.1) when N < n. However, the efficiency of the PPT algorithm decreases as the number of processors, N, increases, because the major computation in the conquer part (Step 3 in Section 5.2) of the PPT algorithm is sequential. For this reason, we use the PP method (see Section 5.1), the limited processor version of the RCD method, to solve Eq. (5.21). In order to apply the PP method, all the superdiagonal elements of the coefficient matrix must be non-zero. The following theorem is needed to apply the PP method to solve Eq. (5.21). Theorem 6 If all the superdiagonal elements of the matrix A are non-zero, the tridiagonal matrix Z = P + ETA-1 VP has non-zero superdiagonal elements. Proof: The superdiagonal elements of Z are either equal to l or the first components of the solutions A55“) = c(,-+1),,,-1e,,,_1, i = 1,..., N—2, (5.33) where w“), em-1 6 12'". Suppose ing) = 0, then to?) = 0 for j = 1,..., m — 1, since the superdiagonal elements of A,- are non-zero. Therefore, we have Agw“) = 0, which is a contradiction to Eq. (5.33). D The new algorithm, namely the parallel partition hybrid (PPH) algorithm, consists of two computation models, the recursive doubling computation model and the compute-aggregate- broadcast computation model. The PPH algorithm is similar to the PPT algorithm except for the following changes. After Step 1 of the PPT algorithm, the (2i — I)“ and 2i“, for i = 0, ..., N - 1, rows of the tridiagonal matrix Z are stored in the i“ node (here the (--1)‘h and 2(N - 1)“ rows of Z are assumed to be zero). Step 2 in the PPT algorithm which performs data gathering is eliminated. Step 3 of the PPT algorithm is then replaced by the PP method as described in Section 2. Step 4 performs the following data transfer. Step 4. Send ygg.” from the i“ node to the (i + 1)“ node for i = 0, ..., N - 2. In Step 3 the PP method has two communication patterns. The first communication pattern computes all partial matrix products (see Figure 3.13). The second communication pattern is a broadcast which broadcasts the computed to (or X0 with another two known components, 0,1) to all other nodes. 
Broadcast has a communication pattern similar to data scattering, as shown in Figure 5.2; in a broadcast, however, all nodes receive the same data. Here we assume each floating point number has 4 bytes. For second generation multicomputers, based on the communication model discussed in Section 5.2, the communication time required to obtain all partial products of the B_i (Figure 3.13) is (α + 24β)(1 + log N). Here S = 24 because each message transfer includes six data elements. The broadcast communication takes log N steps and the message is one data element, which takes (α + 4β)log N time. The shift communication required in Step 4 takes α + 4β time. The communication pattern of the PPH algorithm is shown in Figure 5.4 for N = 8. Counting the arithmetic operations and communication times, we have the following theorem.

Theorem 7 With n = Nm, the PPH algorithm solves Eq. (5.1) in 17(n/N) + 20 log N + 17 and 24(n/N) + 20 log N + 4 parallel arithmetic operations for non-pivoting and pivoting, respectively. It requires ((α + 4β) + (α + 24β))(1 + log N) communication time.

Figure 5.4: The Communication Pattern of the PPH Algorithm

Let T_SPH be the time required to execute the PPH algorithm on a sequential processor and T_PPH be the time required to execute the PPH algorithm on N processors. We then have

T_SPH = (17n + 8N - 41) \tau_{comp}    without pivoting
T_SPH = (24n - 5N - 41) \tau_{comp}    with pivoting
T_PPH = (17\frac{n}{N} + 20\log N + 17) \tau_{comp} + ((\alpha + 4\beta) + (\alpha + 24\beta))(1 + \log N)    without pivoting
T_PPH = (24\frac{n}{N} + 20\log N + 4) \tau_{comp} + ((\alpha + 4\beta) + (\alpha + 24\beta))(1 + \log N)    with pivoting

It is interesting to note that the PPH algorithm can reach a speedup of 2 over the PP algorithm for 1 < N < n. When N = n, no matrix partitioning is needed and the PPH algorithm is virtually the RCD algorithm. When N = 1, there is no conquer part and, since the LU decomposition method is used in the divide part, the algorithm becomes the LU decomposition algorithm.

5.5 Comparison and Experimental Results

In this section we compare our methods with some existing methods for solving Eq. (5.1). The arithmetic operation counts and communication steps of those methods are summarized in Table 5.1.

Algorithm               | Computation             | Communication
Wang's Partition        | 21n/N - 12 - 8/N *      | 2(N-1)(α + 10β) + (α + 16β)
Recursive Doubling (PP) | 35n/N + 20 log N - 29   | (1 + log N)(α + 24β) + log N (α + 4β)
Cyclic Reduction        | 20n/N                   | O(log n) **
PPT (non-pivoting)      | 17n/N + 16N - 45        | 2(α log N + 16(N-1)β)
PPT (pivoting)          | 24n/N + 22N - 69        | 2(α log N + 16(N-1)β)
PPH (non-pivoting)      | 17n/N + 20 log N + 17   | ((α + 4β) + (α + 24β))(1 + log N)
PPH (pivoting)          | 24n/N + 20 log N + 4    | ((α + 4β) + (α + 24β))(1 + log N)

* scalar count of Table V of [62], divided by N
** the exact number of communication steps for the cyclic reduction method is not known, but it is at least of complexity O(log n)

Table 5.1: Computation and Communication Counts of Tridiagonal Solvers

Wang's partition method and the cyclic reduction method are well known. However, it is difficult to compare these two methods with our methods on multicomputers. Although we believe that limited-processor versions of these two methods will increase the computation complexity, we list their computation counts in Table 5.1 assuming they can be perfectly partitioned, i.e., the computation count is scaled down by a factor of N. The communication cost of cyclic reduction is on the order of O(log n) steps, and each step takes at least α + βS time.
Michielse and Vorst modified Wang's algorithm for local memory systems. The communication count is based on their result [42]. However, the computation count of their modification is slightly greater than that of Wang's algorithm and thus is not listed in the table.

As shown in Table 5.1, the computation counts of all our methods are better than those of the other methods. In terms of communication counts, the PPH method has the same order as the PP method. This is reasonable because they both use the communication pattern shown in Figure 3.13. The cyclic reduction method has a high communication count and is therefore better suited to vector machines than to multicomputers. For our other methods, the communication complexity is less than that of Wang's method.

In the following discussion, we compare our methods with respect to the number of processors. Figure 5.5 shows the estimated and measured speedup of the PP, PPT, and PPH algorithms with respect to the SGTSL routine (see Section 5.1). These algorithms are implemented on a 64-node NCUBE multicomputer. NCUBE is a first generation multicomputer adopting the store-and-forward communication mechanism. As indicated in Section 5.2, the communication pattern of the parallel prefix method shown in Figure 3.13 cannot be perfectly embedded in a hypercube, and the worst dilation cost is 2. Thus, in the communication counts of the PP and PPH algorithms, the factor (α + 24β) is doubled. On our NCUBE machine, the following system parameters are measured: α = 5.0, β = 0.013, and τ_comp = 0.08 (without normalization). The dimension of matrix A is chosen as n = 6400. This value is limited by the memory constraint of the individual processors. As N increases, the PPH algorithm will outperform the PPT algorithm, because the dimension of the matrix Z (Eq. (5.21)) increases with N, which favors the parallel approach used in the PPH algorithm. Again, due to the limited number of processors available in our NCUBE, the maximum number of processors used is N = 64. The performance of the PPH algorithm appears to be underestimated compared with the measured results. This is mainly caused by assuming a dilation of 2 for all communications occurring in Figure 3.13; in the actual implementation, some communications have a dilation of 1.

As n goes to infinity, the asymptotic speedup of all our methods, compared with the LU decomposition method, is 0.471N. The asymptotic speedup of the PP method is 0.229N. Dividing the speedup by the number of processors, N, Figure 5.6 shows the efficiency of our methods. For all methods, the efficiency decreases as the number of processors N increases. This is mainly the result of the increasing ratio of communication to computation.

Figure 5.5: The Speedup Over the LU Decomposition Method when n = 6400 ((a) estimated speedup, (b) measured speedup; speedup versus number of nodes)

As mentioned in Chapter 2, there are two commonly accepted measures for the speedup and the efficiency of parallel algorithms [47]. One focuses on how much faster a problem can be solved with N processors.
Thus, it compares the best serial algorithm with the parallel algorithm under consideration, and is defined as

S_N = \frac{\text{execution time using the fastest sequential algorithm on one processor}}{\text{execution time using the parallel algorithm on } N \text{ processors}}.    (5.34)

Both Figures 5.5 and 5.6 are based on the above measure, and the best sequential algorithm chosen is the LINPACK subroutine SGTSL. Another measure is interested in the parallelism inherent in an algorithm and is defined as

S'_N = \frac{\text{execution time using one processor}}{\text{execution time using } N \text{ processors}}.    (5.35)

The latter is the relative speedup. Since each node of the NCUBE multicomputer has only 512K bytes of memory, single node execution for a tridiagonal system of dimension 6400 is impossible. The estimated self speedup and efficiency of our methods can be found in [60].

Figure 5.6: Efficiency Over the LU Decomposition Method when n = 6400 ((a) estimated efficiency, (b) measured efficiency; efficiency versus number of nodes)

One of the advantages of our approach is its diversity. In the divide part, pivoting may or may not be used. There has been a tacit assumption for the first three methods listed in Table 5.1 that no pivoting is required. In fact, it does not appear to be feasible to incorporate a pivoting strategy into those algorithms (Wang's algorithm allows very limited pivoting). However, pivoting can be used by the PPT, PPG and PPH algorithms with slightly higher operation counts. Thus, the PPT, PPG and PPH algorithms are more stable than the others in cases where pivoting is required. In the conquer part, the major computation requirement is devoted to solving Eq. (5.21). The methods used in this part are independent of the methods used in the divide part, and other strategies may also be used to solve Eq. (5.21).

Unless the matrix A is positive definite or diagonally dominant, there is no general rule to guarantee that all the submatrices A_i (i = 0, ..., N - 1) of A (see Section 5.1) are non-singular. The small matrix Z in the conquer part may be singular too, although the latter case is unlikely to happen. However, for a given class of matrices, very often we can find a criterion to avoid those singularities and make the algorithms work.

In this chapter, using the solution of tridiagonal linear systems as an example, we have shown how to change the algorithm design to fit the architecture and to reduce communication overhead. We have changed our design from compute-aggregate-broadcast computation to global-exchange computation and to hybrid computation for better performance. In general, we may have different approaches for a given application. The performance of the computation models suggests which approach is best for a given architecture, so that we can achieve a more efficient algorithm design. Another new parallel algorithm, the Parallel Diagonal Dominant (PDD) algorithm, is also proposed for solving diagonally dominant tridiagonal systems. Detailed information about the PDD algorithm can be found in [60].

Chapter 6

PREDICTION OF PERFORMANCE

Performance measurement techniques can be divided into three categories: pre-run, run-time and post-run [12].
Pre-run performance measurement is also called performance prediction. It identifies the computationally intensive parts of an algorithm and predicts the perfor- mance of an algorithm. Run-time performance measurement collects execution data that can be analyzed on line. The post-run performance environment provides the measured im- plementation results. Comparing with run-time and post-run performance measurements, performance prediction is more economical for algorithm evaluation, and, without actual implementation, it is not restricted by physical limitations which may be imposed by the number of available processors and / or the maximum memory capacity. The predicted results not only can be used for evaluating algorithms, they also can be compared with the imple- mentation results. The prediction and implementation results comparison often leads to a better understanding of the algorithm and to a possible performance enhancement. Further, performance prediction is very important for real time systems. In real time systems, the machines have to respond within a given deadline. Within the time limitation,'we would like to obtain the most accurate solution possible. Knowing the turnaround time before execution is the key to achieving this goal. However, providing an accurate prediction of the performance of parallel algorithms is difficult. As we have learned from previous chap- ters, the performance of a parallel algorithm is dependent on the application, the number of processors available and the problem size. Execution time can be predicted by using a parallelism profile (see Figure 2.1). A par- allelism profile provides information about uneven allocation degradation. The amount of work executed with degree of parallelism i, VVg, can be obtained from this information and a lower bound of elapsed time can be derived based on I’Vg. Equation 2.7 T(N) = 55% 731 (6.1) i=1 is the predicted elapsed time based on the parallelism profile. T(N) is a lower bound of the 81 82 execution time, but it certainly is not a greatest lower bound. Communication latency is a very important factor influencing the performance. By considering communication overhead, a better lower bound for execution time is T(N) = ‘=‘ ‘ 2"] + Q" , (6.2) where QN is the communication overhead when N processors are used (see Eq.2.9). Structured representation provides adequate information for both uneven allocation and communication latency degradations. With some measured data, the performance of an application can be predicted based on its structured representation and Eq.(6.2). This performance prediction can be obtained by collecting information from each compute and data-exchange phase. In this approach, the parallel computation requirement of each com- pute phase should be known and the performance of each communication primitive should be premeasured. This performance prediction can also be obtained through a modular approach by using the performance information of computation models. This modular approach has many advantages over the detailed approach, which we will discuss in the next section. This chapter is organized as follows. In Section 6.1, we will show the performance formu- lations of two computation models, the domain decomposition computation model and the recursive doubling computation model. An example is presented in Section 6.2 to illustrate how the computation models and the memory-bounded speedup model can be used in per- formance prediction. 
The influence of problem size on the speedup of computation models is analyzed in Section 6.3. 6.1 Performance Formulations The most commonly used performance metrics for parallel algorithms are elapsed time and speedup. In this section, the performance formulations of two parallel computation models are derived based on their structured representations, in terms of elapsed time and in terms of speedup. Performance formulations for other computation models can be derived by following similar arguments. These performance formulations of computation models can be used to predict the performance of a large class of scientific applications. Using the performance of computation models to predict the performance of a given ap- plication has advantages over predicting the performance of the given application directly through its structured representation. With the computation model approach, an application is divided into parts. The performance of these parts, or computation models, will suggest 83 which section of your algorithm is the bottleneck. Second, the performance of an application is not only dependent on the application and the machine architecture, but also dependent on how the application is mapped onto the architecture. The performance information of com- putation models contains the mapping information. It provides a more accurate estimate. Also, the computation models are naturally associated with commonly used mathematical methods. With the modular approach, when a mathematical method is chosen, the struc- tured representation of a given application will be known, and the performance prediction of the given application is available. The computation model based modular approach is easy to understand, has less error accumulation, requires less communication and less architectural knowledge. Figure 3.11 in Section 3.2 shows a structured representation for a domain decomposi- tion application. In general, the structured representation of the domain decomposition computation model is 20W), N001 + X0, i=1 where N is the number of processors available, I: is an integer, which sometime depends on the convergence check, and d is the degree of each sender and each receiver. Studying the domain decomposition computation more carefully, we find that it has the following characteristics. 0 Communication overhead is independent of system size The degrees of senders and receivers are fixed for domain decomposition applications. It is independent of problem size and system size. The communication pattern is also independent of problem size and system size. For second generation multicomputers, the communication latency of the domain decomposition computation model is fixed. In fact, even for first generation multicomputers we generally can manage to map a domain decomposition onto an architecture such that the communication delay is independent of the system size. We use Q to denote the common communication latency. 0 Perfect degree of parallelism The domain decomposition applications do not have uneven allocation degradation. All the processors do useful work at each compute phase and the loads are evenly distributed. Domain decomposition computations are good candidates for parallel processing. 84 0 With a fixed number of iterations at each compute phase, the computation requirement at each compute phase is the same, X = X; for i = 1, ..., It. This is the case for most domain decomposition applications. 
The execution time of domain decomposition applications can be derived from the struc- tured representation directly. The sequential execution time is N(Zf=1 Xi) A 3 where A is the computing capacity of each processor (see Section 2.1). Notice that, as T(1) = defined in Section 2.1, WN = N (E?=1 X,-) for domain decomposition computations. The parallel execution time for domain decomposition computations will be Xg)+Q A . Three different speedup formulations can be derived for the domain decomposition model by following the three different models of speedup given in Section 2.2. For the fixed-size speedup model the speedup is WN S = . 6.3 N 57%: ‘ ) For the fixed-time speedup model, we have WN = #vk + Q Wiv =N(WN-Q) where Wiv is the scaled workload for fixed-time speedup (see Section 2.1). Thus, the fixed- time speedup is Wiv ___ N(WN-Q) J + Q W” (6.4) For the memory-bounded speedup model, if we let the integer k be fixed, we have W12, = N W”, where W5, is the scaled workload (see Section 2.1). The memory-bounded speedup is _ NWN - Q'l'WN. 5;. (6.5) 85 The recursive doubling computation model is another computation model presented in Section 3.2 (see Figure 3.13). The structured representation of the recursive doubling com- putation model is 33“?" - 2")[1(1).1(1)] + Xe). k = logN. (6.6) i=0 The recursive doubling computation model has interesting properties. It has a dynamic degree of parallelism and has a fixed computation requirement at each compute phase while the number of compute phases increases with the system size. More precisely, this model has the following characteristics. 0 The computation requirements at each compute phase are the same, X = X,- for i = 1, ..., k. 0 Communication at data-exchange phases is regular data-exchange. The communication patterns at each data-exchange phase are the same. Although the number of disjoint patterns varies, communication latency at each data-exchange phase is the same. 0 The number of compute and data-exchange phases will increase as a function of log(system size). 0 In both fixed-size and fixed-time speedup models, if Q is fixed, X will decrease when N increases. 0 The degree of parallelism varies from phase to phase. P x R gives the degree of parallelism at each compute phase, which is equal to 2" - 2‘ at the i“ compute phase. This means that W2h_gi = (2" —2‘)X = (2" —2‘)W1, and IV,- = 0 for j at 2" - 2‘, where iemnvwk—n. Based on the structured representation (6.6), the sequential execution time of the recur- sive doubling computation is 15-! I: _ i T(l) = i=0 (2A 2 )X The parallel execution time is k(X+Q). T(N) = A 86 Just as with the domain decomposition computation model, three speedup formulations can be obtained corresponding to the three models of speedup previously discussed. Follow- ing the notation defined in Section 2.1, we let W: 21-"! W,- be the work load when the application is running on a single processor, and define W,’ and IV: correspondingly. For the fixed-size speedup model, we have W i=7} [1? + 01'], W W hollWI + Q] =le1 + Q] W ”#53,"; + Q] SN = j=2k—2‘ Thus, notice here that 21:3 (2" -- 2‘) = lcN - (N — 1), _ _ N—l iw 5” — (N 1: )w+oIZ:i-':.‘(2*-2-'n (6'7) = (N " NI:- 1)W+[(k-W)N+I]Q’ (6'8) By definition, the speedup formulation of the recursive doubling model for the fixed-time speedup is Ic-l WI . 51" = zi=0 j = 2k _ 2s ’-‘:‘ 5+ 01-] ;:1(2"— 2')W’ N -1 Wi Ic-l =(N - ) I ‘ .0[W1 + Q] k W1 + Q Notice that k—l W' I W= Zl<-’+Q]= k(W +0) and W kQ W; = l: , we have N l—W Wk =(N - —) Q. 
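Written out, Eqs. (6.3)-(6.5) are S_N = W_N/(W_N/N + Q), S'_N = N(W_N - Q)/W_N, and S*_N = N W_N/(W_N + Q). The following sketch evaluates the three models side by side for one set of numbers; the variable names and the values of W and Q are assumptions made only for this illustration.

```python
def speedup_fixed_size(W, Q, N):
    """Eq. (6.3): fixed problem size W (= W_N), total exchange cost Q."""
    return W / (W / N + Q)

def speedup_fixed_time(W, Q, N):
    """Eq. (6.4): workload scaled so the parallel time stays fixed, W'_N = N(W - Q)."""
    return N * (W - Q) / W

def speedup_memory_bounded(W, Q, N):
    """Eq. (6.5): workload scaled with the available memory, W*_N = N * W."""
    return N * W / (W + Q)

W, Q = 1000.0, 10.0            # illustrative work and communication costs
for N in (4, 16, 64):
    print(N,
          round(speedup_fixed_size(W, Q, N), 1),
          round(speedup_fixed_time(W, Q, N), 1),
          round(speedup_memory_bounded(W, Q, N), 1))
```

For these values the fixed-size model gives the smallest and the memory-bounded model the largest speedup, in line with the discussion in Chapter 2.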
(6.9) Similarly, for memory-bounded speedup we also have N-l W,- 3,1,:(1v— k )WHQ' 87 The workload on each processor will not change with the number of processors available in memory-bounded speedup. Thus, the memory-bounded speedup is N - 1) W k W + 62' From Eq. (6.8) we can see that with the degree of parallelism considered, the speedup 5;, = (N — (6.10) formulation can be completely different from the simplified speedup formulas presented in Section 2.3. The structured representation of a given application contains adequate information about degradations, and provides more information than a parallelism profile. By using the domain decomposition and recursive doubling computation models, we have shown how elapsed time and speedup can be predicted based on structured representations. We also have shown how to derive speedup formulations based on structured representations for different models of speedup. Performance formulations of other computation models can be derived similarly. The performance formulations of computation models can be used to predict the performance of general scientific applications. 6.2 Structured Prediction We have identified computation models and have shown performance formulations for some of the identified computation models. We would like to predict the performance of general scientific applications by using the performance information of parallel computation models. The idea is that a given application can be seen as a combination of computation models. These computation models can be found by studying the structured representation of the application. The performance information of these computation models can then be com- bined to provide a performance prediction of this application. If the measured performance data of computation models are applied to a performance prediction, performance data can be estimated for this application. If the performance formulations of computation models are applied, a performance formula can be obtained for the application which shows how different parts of the application will vary with system size and problem size. Therefore, difl'erent parts can be modified for different situations to achieve the best performance. We call this kind of analysis structured analysis. The combination of elapsed times is straightforward, but the combination of speedups is somewhat more complex. The combination formulas for the elapsed times and speedups are shown in Eq.(6.11) and Eq.(6.12) respectively for the combination of two models. The 88 combination formulas in the case of n models can be easily obtained by using Eq. (6.11) and Eq.(6.12) recursively. Execution time = Execution time of part one + Execution time of part two . (6.11) Speedup = Speedup of part one x Ratio of part one (6 12) + Speedup of part two x Ratio of part two , where the ratio is the ratio over total execution time. Speedup can be derived from the execution time. We would like to have speedup as a function of the speedups of the com- putation models, since the variations of each model are easier to observe. Also, determining speedup from Eq. (6.12) is simpler than determining speedup from Eq. (6.11), when the ratio is independent of problem size and system size. We use one simple example, the parallel prefix method for solving tridiagonal systems (see Figure 6.1), to illustrate this performance prediction approach. The parallel prefix method has the following structured representation x. 
+ Ea" - 2‘)([1(1).1(1)1+ x...)+[1(N).N(1)1+ 26..., (6.13) k- Recursitnoubling 4 where the conpute phases are counted from 0 to k + 1. From this representation we can see that the parallel prefix method consists of three parts, the preprocessing part, the recursive doubling part and the post processing part. The preprocessing part and the post processing part are local computations in which the workload is evenly distributed among the available processors. The middle part is a recursive doubling computation. The degree of parallelism of the postprocessing part is equal to P x S, where P is the number of patterns in the broadcast data-exchange phase, and S is the number of senders in each of the patterns. Since P = 1 and S' = N in this case, P x S = N. The degree of parallelism of the preprocessing part is also equal to N. Since recursive doubling computation models have equal load at each compute phase, X1 = X,,i = 1, ..., k - 1. Also, since recursive doubling computations do not achieve perfect parallelism at any compute phase, WN = N (X0 + Xk+l)- Using the recursive part as a basic structure, the sequential execution time of the prefix method is [NXo + 2319" - 2")(Xx + Q) + QN + NXml A 9 (6.14) 89 E QE—Q—Q—Qu -- .-'-"-'"-'-'-"""'|-""-'|"'-. Iii l_LT (fini- Parallel Prefix Method Figure 6.1 90 and the predicted parallel execution time of the prefix method is [X0 + k(X1 + Q) 'l' QN + Xk-Hl A , where A is the computing capacity of each processor, Q is the communication overhead (6.15) of data-exchange [1(1), 1(1)] and QN is the communication overhead of data-exchange [1(N ),N (1)] Q is independent of N, the number of processors available, and QN is a function of N. After removing the computing capacity A, Eq.(6.15) becomes [X0 + k(Xl + Q) ‘l' QN + XH—ll- (6.16) Using Eq.(6.12) the speedup of the prefix method is NX x N- x, kx+Q Nx- x- #[quamfl + (N - 7:1)an Eq.16.16 + XL: [Eq.(6.i6)] (6'17) _ X H! - N-l x, k X + "' Nr_J]qu.(6.I6) + (N " T)X,+o Eq.(6.16) (6'18) where X1 is fixed. It is independent of system size and problem size. Equation (6.18) shows that the prefix method contains two parts, the local computation part and the recursive doubling computation part. The local computation part has speedup N and the recursive doubling part has speedup (N - Efllx—fifi' The local computation part has a higher speedup. We would like the local computation part to dominate the total computation. From Eq.(6.18) we can see that when the problem size increases so does the ratio of local computation, and when the system size increases the ratio of local computation will decrease. Figure 6.2 shows the estimated and measured speedup of the prefix method. We can see that they are very close to each other. By the properties of fixed computation at each compute phase, the speedup formulation used for the recursive doubling part is the memory-bounded speedup. Thus, using one simple example, we have shown how the computation models and memory-bounded speedup can be used in performance predictions. 6.3 The Influence of Problem Size on Speedup In the last two sections we showed how the performance of general applications can be predicted using the performance information of computation models. We also derived per- formance formulations for some computation models. In this section we will study the 91 25 I r l I r I 20 ~ .. 15 *- _ Speedup M d 10 - 933% O 1111 5 _ 09 """ Estimated 4 :::::3g: ..... 0 Q L 4 L l l 1 0 10 20 30 40 50 60 No. 
of Nodes Figure 6.2: Measured and Estimate Speedup of Parallel Prefix Method when n = 6400 performance of computation models in more depth. We will study how problem size influ- ences performance. The study starts with general formulas and then special cases will be noted. These special cases are those that are suitable for some computation models. As shown in Chapter 2, speedup is a function of both problem size and the number of processors available. The study is three fold. First, we only consider the simplified condition discussed in Section 2.3. Then, communication overhead is considered. Finally, uneven allocation degradation is added in to give a more accurate analysis. According to Eq. (2.13), the simplified fixedcsize speedup is W1 + WN S = . 6.19 N m ‘ ) Notice that by Eq. (2.1) W,- t.‘(1) = K. We can rewrite Eq. (6.19) as _ t1(1) +tN(1) (6.20) N- t1(1)+3”7$9' In general, we have 92 ti=t,(1)=% forlsigm, where m is the maximum degree of parallelism (see Section 2.1). When ti = 0 for i 75 1 and i aé N, the simplified speedup Eq. (6.19) can be rewritten as a function of workload W, T1(W) _ t1(W)+tN(W) T~(W) ’ t1(W) +tN(W)/N' Equation (6.21) is equivalent to Eq. (6.19), but execution time has been written as a function SN(W) = (6.21 ) of problem size. Therefore, Eq. (6.21) is easier to use to analyze the influence of problem size. In this section, instead of writing speedup in terms of workload, we write speedup formulations in terms of execution time t,-(1), where the subscript i indicates the execution of workload W,- and 1 indicates that only one processor is used in the execution. If the sequential execution time t1 and the perfectly parallel portion of the execution time tN increase with the problem size W at the same rate, then when the problem size increases c times T1(cW) = c1t1(W) + cltN(W) TN(cW) c1t1(W) + cltN(W)/N Thus, in this case, speedup Eq. (6.21) is independent of problem size. SN(cW) = = SN(W). (6.22) The problem becomes more complicated when communication overhead is considered and the communication overhead is a function of problem size. In this case we could let the sequential execution time, t1, be independent of the problem size or we could let it vary with the problem size. In the former case, we have 151 + thwl t1 + tN(Wl/N + Q(W)' Assume that tN is proportional to W with coefficient c1 and Q is proportional to W with SN(W) = (6.23) coefficient c3, then after the problem size is changed from W to cW, we have t1 + ctN(W) t1 + CtN(W)/N + cQ(W)' The variation of speedup is difficult to understand through Eq. (6.24). We need to use a new approach, the derivative of SN, to study the variation. By Eq. (6.23), the derivative of SN on W will be 5N(CW) = (6.24) 61(t1 + Q + tN/N) ‘ (03 + Cl/Nth + tN). WW) = (t. + mm + Q)’ (6.25) 93 The sign of S}, is determined by its numerator. Notice that under our assumption, we have le = catN. Thus the numerator 61(t1 + Q + tN/N) - (Ca + c1/N)(t1 + tN) = c1(t1(1—1/N)+ Q) - c3(t1 + tN) = t1[61(1—1/N)— C3]. When c1 > 171%63’ 51“, > 0. Therefore, SN will increase with the problem size. If c1 < Nix—leg, Sj‘l, < 0, speedup will decrease as the problem size increases. If e; = FIX—163, S}, = 0, speedup is independent of the problem size. This gives the following results. Theorem 8 If the sequential execution time is independent of the problem size, and the parallel execution time and the communication overhead are proportional to the problem size with coeflicient c1,c3 respectively, then, the speedup Eq. 
(6.23) will increase with problem size when c1 > 171%1'63' It will decrease with an increase in problem size when c1 < N—IZTC3 and will be independent of problem size when cl = NIX—{cw The following theorem gives an upper bound of the influence of problem size. Theorem 9 Under the same assumptions of Theorem 8, the speedup is bounded by SN(I’V)+ tN/ Q Proof: _ h+tN(cW) _ 81+C8N SN(CW) - tl+Q(CW)+tN(cW)/N — 81+OQ+ctNIN = t £1 + “N < ‘1+‘N in (6.26) 1 +CQ+ctN /N 81 +OQ+ctN IN 81 +OQ+ctN [N cQ < SN(W) + tN/Q with the assumption c 2 1 It is easy to see that if the sequential execution time is also proportional to the problem size then the speedup (6.23) is independent of the problem size. The performance model (6.23) 'fits the domain decomposition computation model (see Section 3.2) well. In the domain decomposition model, there is no uneven allocation degra- dation. The problem can be perfectly partitioned and the degree of parallelism will be N with N processors available. Each node has a fixed number of communication requirements. The communication latency is independent of system size and is a function of problem size. The sequential workload, W1, may or may not change with the problem size depending on the loading of the initial data. However, the communication overhead is proportional to the problem size. Commonly, it is a rational power function of the problem size. For instance, 94 using the finite difference method to solve Laplace’s equation, Eq. (3.8), each node will solve a sublinear system Ax, = b,- and send the result to its four neighbors. If A,- is an n by n matrix, then the length of the result is n, and, with the Gaussian elimination method, the solving process has a complexity 0(n3). Theorem 10 gives a prediction of the speedup when communication overhead is a power function of problem size. Theorem 10 If the sequential execution time, t1, is independent of the problem size, the parallel operation is proportional with the problem size, tN = Cl W, and the communication overhead is a power function of the problem size, Q = aW”, then when c3 < “121:1;111331351 , the speedup (6.23) will increase with the problem size; when 63 > [t1([i-I:I{::)l)gltN , the speedup will decrease with an increase in problem size; and the speedup is independent of the problem . __ [t1(1-1/N)+Q]tn size when C3 — “N“llq . Proof: We prove the theorem by observing the sign of the derivative of SN(W) on the variable W. The sign is determined by the numerator of Sfi(W), which is equal to c1(t1 + tN/N + Q) - (61/N + caawa-lxtl + tN) = ~36- (C1W(t1 + tN/N + Q) — («W/N + caaWs-‘Wxn + tn» = 5% (tN(t1 + tN/N + Q) - (tN/N + caQ)(t1 + tNll = 5 (1,1,,(1 — 1/N) + tNQ - c2061 + tn))- Therefore, Sj‘v(W) will be less than zero when c3 > [‘1°2:;I/+1:)14)'31h~1. va(W) will be greater than zero when c3 < “1003173: It” and Sj‘(,(W) will be equal to zero if C3 = [_tfil - l/N) + QltN D (tN'l'tllo Notice here that when the sequential workload is equal to zero, the sequential execution time t1 is also equal to zero. In this case, [t1(1-1/N)+ QltN =1 “N + tllQ ’ and Theorem 10 becomes easier to understand. Theorem 11 gives the upper bound of the influence of problem size when the communication latency Q is a rational power function of the problem size. Theorem 11 Under the assumptions of Theorem 10, for any real number c 2 1, we have _c t w 5,,(cW) 5 SN(W)+c‘ a-ggT,’ 95 Proof: t «H (CW) 5~(6W) = W ‘L‘N‘Wl = t1+cc3Q(“;)+cl(iw(W )7N + t1+c‘30(W)-I;ct%9;V)/N < SN(W) +—N—— é,q(w)_ -SN(W) + chm-315,7. 
If we let the sequential execution time, t1, vary with the problem size, by similar argu- ments we have the following corresponding theorems. Theorem 12 Under the assumptions in Theorem 10, if we also let the sequential execution time be proportional to the problem size, with coefl‘icient c2, then when C3 < 1, the speedup (6.23) will increase with the problem size; when c3 > 1, the speedup will decrease with an increase in problem size; and when c3 = 1 the speedup is independent of the problem size. Theorem 13 Under the assumptions of Theorem 12, for any real number c 2 1, when 63 ..>. 1, 5N(CW) S S'N(W); and when C3 < 1, S'N(cW) S cl’QSN(W). Proof: By Theorem 12, we only need to prove the c3 < 1 case. When c3 < 1, _ ct1+ctu SN(CW) - ctl-l-c‘3Q-l-ctN/N t1+tN = c‘3 “(a-ca t1+Q+c1-¢a tN/N) 5 cl":3 SN(W). D All of the above six theorems, Theorem 8 to Theorem 13, are based on speedup (6.23). With the consideration of the sequential execution and communication latency degradations, they show how the speedup is influenced by workload and what is the limitation of this influ- ence. They are useful prediction formulations for domain decomposition computations and for other scientific and engineering applications. In the remaining portion of this subsection, we study the influence of problem size with degree of parallelism considered. Recall that t; is the execution time for workload IV,- when one processor is available. Assuming that each t;, i = 1, ...,m is proportional to the problem size, then the speedup will be independent of the problem size from the degree of parallelism point of view. This is confirmed by the following equation, €i=1flti(cwl Em ti(W) 51v(6W)= (cw)/22,-";1 t1(Wl/ =5~(W). (6.27) 96 Most of the frequently used applications do not distribute load evenly. For achieving high performance, we would like workload with a high degree of parallelism to increase with the problem size, and workload with lower degrees of parallelism to be fixed, independent of problem size. In general, we can divide the integer set 1, 2, ..., m into two disjoint subsets i1, i2, ..., im, j1,j3, ..., jm, such that m1 + m2 = m and that workload in degree of parallelism it, I: = 1, ..., m1 is independent of the problem size, and the workload in degree of parallelism jk, lc = 1, ...,mg increases with the problem size. By this partition, we have _ m,-,t;(W) 221.11 t1,+222'_1t5 (W) S”(W)’ .:.t.(W)/i =2?..t.,/i.+zr:.t1:(W)/j, Extending the average parallelism concept introduced in Section 2.1, we have t1*+tn‘(W) . t . +tN.(W)/A. S A . (6.28) where t1 = 216:1 tik/ika tN'(W) =X:,,_1 tjk(W)’ and A‘= 2,51 tjk/[zkgl tjk/jkl is the extended average parallelism. Thus, when the number of processors is fixed, we return to Eq. (6.21) and the results for Eq. (6.21) can be directly used for Eq. (6.28). Both the super sequential execution time t1' and the extended average parallelism A‘ of Eq. (6.28) SN(W) = are independent of problem size. However, both of them are functions of system size. The prediction formulation Eq. (6.28) is useful for divide-and-conquer computations. For certain divide-and-conquer computations the divide-and-conquer parts have a fixed number of operations which are not influenced by the problem size. Only the computation part scales up with the problem size. Since the computation part has N as the degree of parallelism, by Eq. (6.28), near perfect speedup can be achieved when the problem size is large enough. 
Chapter 7 CONCLUSION AND FUTURE RESEARCH The design, representation and performance considerations of parallel algorithms on mul- ticomputers have been presented in this thesis. In this chapter, the work reported in this thesis is summarized and the major contributions of this dissertation are highlighted. Next, improvements to the current work are suggested, and future plans and research directions will also be discussed. 7.1 Summary and Major Contributions The central issues in the development of efficient parallel algorithms involve exploring the inherent parallelism in applications and the matching of applications to architectures. The present study is an attempt to better understand of these issues. The contributions of this study include developing a notation for describing the structures of the most frequently used scientific and engineering applications, identifying the basic building blocks of these structures, providing a guideline for the design of efficient parallel algorithms, developing and implementing efficient parallel algorithms for several applications, proposing a new approach for predicting the performance of parallel algorithms, and presenting a more general speedup model, namely, memory-bounded speedup. Traditionally, parallel algorithms have been designed by brute force methods and fine tuned on each architecture to achieve high performance. Rather than studying the devel- opment and matching case by case, a systematic approach was proposed in my research by studying the basic building blocks, called computation models, of frequently used ap- plications. Anotation was first developed in Chapter 3. Using this notation, most of the frequently used scientific and engineering applications can be represented by simple formulas. These formulas constitute the structured representation of the corresponding applications. The structured representations are simple, adequate and easy to understand. They also contain sufficient information about uneven allocation and communication latency degra- dations. With the structured representations, applications can be compared, classified and 97 98 partitioned. There are innumerable applications. However, most of the applications are combinations of some computation models. Structured representations relate general ap- plications to computation models. Studying computation models leads to a guideline for efficient parallel algorithm design for general applications. Seven models have been identified and presented in Chapter 3. They are the Local Com- putation Model, the Global-Exchange Computation Model, the Compute-Aggregate-Broadcast Computation Model, the Divide-and-Conquer Computation Model, the Domain Decompo- sition Computation Model, the Pipelined Computation model and the Recursive Doubling Computation Model. The structured representation of each of the computation models was presented, and the computation and communication patterns of the computation models were studied. With the computation model concept, the matching of application to archi- tectures becomes transparent. The performance information of computation models suggests which structure is best suited for a specific multicomputer architecture, and which part of an algorithm is the performance bottleneck. Therefore, structured design and rethinking become possible. Structured representation and computation models have been used on the development of parallel algorithms for several scientific and engineering applications. 
These applications include solving the electrical power flow problem [52], solving tridiagonal linear systems [60], solving dense linear systems [49] and solving tridiagonal symmetric algebraic eigenvalue problems. (The work on the last two applications, published in [49] and [40], is not included in this dissertation.) We used different computation models for different applications. The algorithm for solving dense linear systems is based on the pipelined computation model. The tridiagonal symmetric algebraic eigenvalue problems were solved with local computation and divide-and-conquer computation approaches. The newly proposed divide-and-conquer algorithm for solving eigenvalue problems is very efficient; it competes well with any existing parallel algorithm for solving eigenvalue problems. Solving tridiagonal linear systems is a fundamental problem of scientific computing, and solving power flow problems is, by far, one of the most important problems facing researchers in the electrical power field. Structured representation and computation model concepts helped us in changing from one design to another. Four new algorithms were developed for solving tridiagonal linear systems. Three of the four algorithms are presented in this thesis: the partition LU, the partition global-exchange and the partition hybrid algorithms. The partition LU algorithm is based on compute-aggregate-broadcast computation. The partition global-exchange algorithm uses global-exchange computation. The partition hybrid algorithm is a combination of compute-aggregate-broadcast computation and recursive doubling computation. All of the developed parallel algorithms have been implemented on an NCUBE/7 multicomputer. Our algorithms yield a higher speedup and efficiency compared to existing algorithms. Using structured representation, we modified our design for solving the power flow problem from the local computation model to the global-exchange model, and then to the compute-aggregate-broadcast computation model. The last one gave the best performance. Based on the compute-aggregate-broadcast algorithm, we have developed a package, called PowerCube [45], which adopts the most rigorous mathematical techniques and is able to obtain all the steady-state solutions of power systems. The package currently runs on NCUBE and BBN GP-1000 multicomputers. The related details of solving electrical power flow problems and tridiagonal linear systems can be found in Chapter 4 and Chapter 5, respectively.

In association with the study of efficient algorithm design, the performance issues of parallel algorithms were also studied [58]. Three models of speedup were discussed and analyzed in Chapter 2: fixed-size speedup, fixed-time speedup and memory-bounded speedup. Two sets of speedup formulations were derived for these three models. One set requires more information and gives more accurate estimates. The other set considers a simplified case and provides a clearer picture of the possible performance gain with parallel processing. The simplified fixed-size speedup constitutes Amdahl's law, whereas the simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as its special cases. This study proposes a new metric for performance evaluation and leads to a better understanding of parallel processing.
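For reference, the three simplified models summarized above can be written compactly in the serial-fraction notation common in the speedup literature. This is only a restatement under assumed notation (the symbols used in Chapter 2 may differ): f is the sequential fraction of the uniprocessor workload, N the number of processors, and G(N) the factor by which the parallel portion of the workload grows when the memory of N processors is filled.

\[
S_{\text{fixed-size}} = \frac{1}{f + (1-f)/N}, \qquad
S_{\text{fixed-time}} = f + (1-f)\,N, \qquad
S_{\text{memory-bounded}} = \frac{f + (1-f)\,G(N)}{f + (1-f)\,G(N)/N}.
\]

Setting G(N) = 1 recovers the fixed-size form (Amdahl's law) and G(N) = N recovers the fixed-time form (Gustafson's scaled speedup), which is the sense in which the simplified memory-bounded speedup contains both as special cases.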
Performance prediction techniques have been developed based on structured representation and on some precollected performance data. The performance predictions consider both communication overhead and uneven allocation as degradations of parallelism, and give a good bound on performance. Performance prediction is very important in real-time systems, where we want to obtain the most accurate solution within a time limit. The predicted performance is also very helpful for efficient algorithm design. Using structured representation, performance can be predicted at different levels. The lower level accumulates the predicted performance of each compute and data-exchange phase. The higher level uses computation models as the basic modules and provides a more accurate and automatic approach. The performance formulations of two of the computation models were derived in Chapter 6. The performance of a large class of applications could be predicted by combining these formulations and by using the higher-level approach. The analysis process also showed how we could derive the performance formulations of a given application through the lower-level approach. An example was given to show the performance prediction techniques. Results obtained from our implementation of this example matched the predicted performance well.

7.2 Future Research Directions

In this research, the structured representation method was developed to describe the most frequently used scientific and engineering applications. The basic data structures of these applications, called computation models, were identified and studied, and a new metric for the performance of parallel algorithms was proposed. Applying the structured representation, the identified computation models and the new metric of performance, the issues in efficient algorithm design and in performance prediction were studied. All these topics, namely, the representation, the basic data structure and the metric of performance, are fundamental problems in computer science. Because of their fundamental properties, the results presented in this dissertation can be applied to other areas of computer science. Further, these problems are difficult to solve perfectly because of their fundamental nature. Future research can be identified along three directions: improving the current results on these fundamental problems; continuing the research of applying these fundamental results to efficient algorithm design and performance prediction; and extending the application of these fundamental results to other areas of computer science.

Several questions remain open along the first direction of research. For structured representation, we have only developed notation for two classes of data-exchange, i.e., regular data-exchange and conjunctive regular data-exchange. Communication phases do exist that do not belong to either of these two kinds of data-exchange. Can we extend our notation to cover more complicated communications? Structured representation provides information about how many processors participate in each data-exchange phase. If we assume that a processor computes if and only if it participates in data-exchange, then we know how many processors do useful work at each compute phase and, therefore, have some knowledge of the uneven allocation degradation. However, the workload of each processor is unknown. Processors may compute in the same compute phase but with different workloads.
Can we extend or modify our current notation to contain the workload information? Furthermore, we assume that communication is achieved in a synchronous fashion; can we extend our current notation to cover asynchronous communication? Could we use graph-theoretical results and terminology to make structured representation more general and more fruitful? For instance, results on bipartite graphs could be used to further divide a communication pattern into smaller pieces. Also, further study of computation models is needed. This includes studying more applications and identifying more models, studying each model on different architectures, and deciding how to identify a computation model. With structured representation, a computation model can be identified at different levels, finer or coarser. Which level is better is uncertain at this time. For example, divide-and-conquer is a commonly used technique, which we identified in the conventional way (see Section 3.2). However, it can also be identified as two tree structures, which may be easier to measure and understand. Three models of speedup, namely fixed-size speedup, fixed-time speedup and memory-bounded speedup, were studied for the performance of different classes of applications. Applications exist which do not fit any of these models of speedup, but satisfy some combination of the models.

The study of efficient algorithm design and performance prediction is still at a preliminary stage. This is especially true for the research in performance prediction. Only four applications have been studied, and only on a single architecture, the hypercube. We would like to study more applications on different architectures and to explore dynamic load balancing issues. We need more experimental data to support the theoretical results. Our representation assumes advanced communication technology, so it does not consider the distance between two processors as an important factor. The algorithms which have been implemented on the first-generation NCUBE multicomputer should be tested further on second-generation multicomputers. More algorithms should be implemented on second-generation multicomputers to determine the accuracy of our predictions and to improve our prediction schemes. Structured representation contains adequate information about an application, but finding the structured representation of an application is difficult; it requires a deep understanding of the application itself. How to detect the structured representation of a given application automatically and systematically is an interesting research issue. Some tools exist for detecting the degree of parallelism of a given application [36]. These tools provide starting points for developing tools that can automatically detect the structures and computation models contained within applications. The proposed performance prediction method is based on structured representation and computation models. Performance tools can be developed for performance estimation and prediction by applying the method. Further study of the performance prediction methodology is needed before such a tool can be developed, since there are numerous implementation issues involved. This is another interesting research issue which can be studied based on the results of this dissertation. During my dissertation research, I have studied several fundamental problems and applied the results to different areas of computer science. Further research can be continued along these directions.
This research can also be extended to new areas to attack untouched problems. For instance, structured representation could be modified for use with shared-memory multiprocessors; the computation model concept may be extended to non-numerical computations; and kernel programs for machine benchmarks can be developed based on the identified computation models. I plan to continue my current research and am always looking forward to meeting new challenges.

Bibliography

[1] G. Almasi and A. Gottlieb. Highly Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., 1989.
[2] G. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Conf., pages 483-485, 1967.
[3] W. Athas and C. Seitz. Multicomputers: Message-passing concurrent computers. Computer, pages 9-25, Aug. 1988.
[4] M. Barton and G. Withers. Computing performance as a function of the speed, quantity, and cost of the processors. In Proc. Supercomputing '89, pages 759-764, 1989.
[5] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[6] S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proc. of Supercomputing '88, pages 330-339, 1988.
[7] G. Brassard and P. Bratley. Algorithmics: Theory and Practice. Prentice-Hall, Inc., 1987.
[8] S. Chang, X. Sun, and L. Ni. A projection tool for visualizing multidimensional curves. Technical Report MSU-CPS-ACS-16, Michigan State University, 1989.
[9] S. Chow, J. Mallet-Paret, and J. Yorke. Finding zeroes of maps: Homotopy methods that are constructive with probability one. Math. Comp., 32:887-899, 1978.
[10] S. Chow, L. Ni, and Y. Shen. A parallel homotopy method for solving a system of polynomial equations. In Proc. of the 3rd SIAM Int'l Conf. on Parallel Processing for Scientific Computing, pages 121-125, Chicago, 1987.
[11] E. Coffman. Introduction to deterministic scheduling theory. In E. G. Coffman, editor, Computer and Job/Shop Scheduling Theory, pages 1-50. John Wiley and Sons, 1976.
[12] F. Darema. Parallel applications performance methodology. In M. Simmons, R. Koskela, and I. Bucher, editors, Instrumentation for Future Parallel Computing Systems, pages 49-58. The ACM Press, 1989.
[13] J. Dongarra, J. Bunch, C. Moler, and G. Stewart. LINPACK Users' Guide. SIAM, Philadelphia, 1979.
[14] I. Duff, A. Erisman, and J. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.
[15] D. Eager, J. Zahorjan, and E. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, pages 403-423, March 1989.
[16] O. Egecioglu, D. Koc, and A. Laub. A recursive doubling algorithm for solution of tridiagonal systems on hypercube multiprocessors. Technical Report TRC88-1, Dept. of Computer Science, Univ. of California, Santa Barbara, 1988.
[17] S. Even. Graph Algorithms. Computer Science Press, 1979.
[18] R. Finkel. Large-grain parallelism: Three case studies. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms, pages 21-63. The MIT Press, 1987.
[19] W. Gropp and D. Keyes. Complexity of parallel implementation of domain decomposition techniques for elliptic partial differential equations. SIAM J. on SSTC, 9(2), March 1988.
[20] J. Gustafson. Reevaluating Amdahl's law. CACM, 31:532-533, May 1988.
[21] J. Gustafson, S. Hawkinson, and K. Scott. The architecture of a homogeneous vector supercomputer. In Proc. of 1986 Int'l Conf.
on Parallel Processing, pages 649-652, Chicago, 1986.
[22] J. Gustafson, G. Montry, and R. Benner. Development of parallel methods for a 1024-processor hypercube. SIAM J. on SSTC, 9(4), July 1988.
[23] J. Hayes, T. Mudge, Q. Stout, S. Colley, and J. Palmer. Architecture of a hypercube supercomputer. In Proc. of 1986 Int'l Conf. on Parallel Processing, pages 653-660, 1986.
[24] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Review, 20:740-777, Oct. 1978.
[25] W. Hillis. The Connection Machine. MIT Press, Cambridge, Mass., 1985.
[26] W. Hillis and G. Steele. Data parallel algorithms. Communications of the ACM, 29, Dec. 1986.
[27] R. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. J. ACM, 12:95-113, 1965.
[28] K. Hwang. Advanced parallel processing with supercomputer architectures. Proc. of the IEEE, pages 33-47, Oct. 1987.
[29] K. Hwang and F. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill Book Co., 1984.
[30] L. Jamieson. Characterizing parallel algorithms. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms, pages 65-100. The MIT Press, 1987.
[31] J. Saltz, V. Naik, and D. Nicol. Reduction of the effects of the communication delays in scientific algorithms on message passing MIMD architectures. SIAM J. Sci. Stat. Comput., 8(1), Jan. 1987.
[32] S. Johnsson. Communication efficient basic linear algebra computations on hypercube architectures. J. of Parallel and Distributed Computing, 4:133-172, 1987.
[33] S. Johnsson. Solving tridiagonal systems on ensemble architectures. SIAM J. on SSTC, 8(3):354-392, May 1987.
[34] C. King, W. Chow, and L. Ni. Pipelined data parallel algorithms: Concept and modeling. In ACM Int'l Conf. on Supercomputing, 1988.
[35] P. Kogge and H. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE-TC, C-22:786-793, 1973.
[36] M. Kumar. Measuring parallelism in computation-intensive scientific/engineering applications. IEEE-TC, 37(9):1088-1098, Sep. 1988.
[37] T. Li, T. Sauer, and J. Yorke. Numerical solution of a class of deficient polynomial systems. SIAM J. Numer. Anal., 24(2), Apr. 1987.
[38] T. Li, T. Sauer, and J. Yorke. The random product homotopy and deficient polynomial systems. Numer. Math., 51:481, 1987.
[39] T. Li and X. Wang. A more efficient homotopy for solving deficient polynomial systems. Submitted for publication.
[40] T. Li, H. Zhang, and X. Sun. Parallel homotopy algorithm for the symmetric tridiagonal eigenvalue problem. Accepted to appear in SIAM J. of Scientific and Statistical Computing.
[41] M. A. Marsan, G. Balbo, and G. Conte. A class of generalized stochastic Petri nets for the performance analysis of multiprocessor systems. ACM TOCS, pages 93-122, May 1984.
[42] P. Michielse and H. van der Vorst. Data transport in Wang's partition method. Parallel Computing, 7:87-95, 1988.
[43] A. Morgan and A. Sommese. A homotopy for solving general polynomial systems that respects m-homogeneous structures. Appl. Math. and Comp., 24:101-113, 1987.
[44] P. Nelson and L. Snyder. Programming paradigms for nonshared memory parallel computers. In L. Jamieson, D. Gannon, and R. Douglass, editors, The Characteristics of Parallel Algorithms. The MIT Press, 1987.
[45] L. Ni, F. Salam, T. Tzen, X. Sun, and S. Guo. PowerCube: A software package for solving load flow problems. In Proc. of the 32nd Midwest Symposium on Circuits and Systems, Illinois, Aug. 1989.
[46] S. Nugent.
The iPSC/2 direct-connect communication technology. In Proc. of the Third Conf. on Hypercube Concurrent Computers and Applications, Jan. 1988.
[47] J. Ortega and R. Voigt. Solution of partial differential equations on vector and parallel computers. SIAM Review, June 1985.
[48] M. Quinn. Designing Efficient Algorithms for Parallel Computers. McGraw-Hill, 1987.
[49] D. Robinson, X. Sun, and R. Enbody. A pipelined parallel approach to solving dense linear systems. In Proc. of the Fourth Conf. on Hypercube Concurrent Computers and Applications, March 1989.
[50] Y. Saad and M. Schultz. Data communication in hypercubes. Research Report YALEU/DCS/RR-42, Computer Science Department, Yale University, Oct. 1985.
[51] F. Salam, L. Ni, S. Guo, and X. Sun. Parallel processing for the load flow of large-scale power systems: The approach and application. In Proc. of the 28th IEEE Conf. on Decision and Control, Tampa, Florida, Dec. 1989.
[52] F. Salam, L. Ni, X. Sun, and S. Guo. Parallel processing for the steady state solutions of large-scale nonlinear models of power systems. In Proc. of the 1989 IEEE Int'l Symposium on Circuits and Systems (ISCAS), Portland, Oregon, May 1989.
[53] K. Sevcik. Characterizations of parallelism in applications and their use in scheduling. In Proc. of ACM SIGMETRICS and Performance '89, May 1989.
[54] J. Sherman and W. Morrison. Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix. Ann. Math. Stat., 20:621, 1949.
[55] Y. Shih and J. Fier. Hypercube systems and key applications. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence. McGraw-Hill, 1988.
[56] H. Stone. An efficient parallel algorithm for the solution of a tridiagonal linear system of equations. J. of ACM, 20(1):27-38, Jan. 1973.
[57] H. Stone. Parallel computers. In Introduction to Computer Architecture. SRA Inc., Chicago, Ill., 2nd edition, 1980.
[58] X. Sun and L. Ni. Another view on parallel speedup. In Proc. of Supercomputing '90, New York, NY, Nov. 1990.
[59] X. Sun, L. Ni, F. Salam, and S. Guo. Compute-exchange computation for solving power flow problems: The model and application. In Proc. of the Fourth SIAM Conf. on Parallel Processing for Scientific Computing, Dec. 1989.
[60] X. Sun, H. Zhang, and L. Ni. Parallel algorithms for solution of tridiagonal systems on multicomputers. In Proc. of the 1989 ACM Int'l Conf. on Supercomputing, Crete, Greece, June 1989.
[61] A. Veen. Dataflow machine architectures. ACM Computing Surveys, 18(4):365-396, Dec. 1986.
[62] H. Wang. A parallel method for tridiagonal equations. ACM Trans. Math. Software, 7:170-183, June 1981.
[63] L. Watson. Numerical linear algebra aspects of globally convergent homotopy methods. SIAM Review, pages 529-545, Dec. 1986.
[64] F. Wu. Stability, security, and reliability of interconnected power systems. Large Scale Systems: Theory and Applications, 7:99-113, 1984.