This is to certify that the dissertation entitled Effects of Communication Costs on the Design and Implementation of Parallel Numerical Algorithms presented by Christian Trefftz has been accepted towards fulfillment of the requirements for the PhD degree in Computer Science. Major professor. Date. MSU is an Affirmative Action/Equal Opportunity Institution.

Effects of Communication Costs on the Design and Implementation of Parallel Numerical Algorithms

By Christian Trefftz

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Computer Science Department

1994

ABSTRACT

Effects of Communication Costs on the Design and Implementation of Parallel Numerical Algorithms

By Christian Trefftz

An insatiable demand for more computational capacity characterizes many computer applications that model various physical phenomena, forecast the behavior of natural and artificial systems, and explore different design options in manufacturing. Solving large numerical applications has traditionally been one of the tasks best suited for supercomputers. Over the last few years, supercomputer architectures have migrated from large vector mainframes to scalable parallel architectures, which are designed to offer corresponding increases in performance as the number of processors is increased. Such systems encompass both massively parallel computers (MPCs) and clusters of high-performance workstations interconnected by high-speed networks. Both MPCs and clusters are characterized by the distribution of memory among processor nodes. In such systems, the performance of applications depends on the efficiency of communication, which in turn depends on the underlying communications architecture. The wide variety of communications platforms for scalable systems has led to an increasing interest in the development of communications libraries, which can increase portability and usually offer better performance than communication routines embedded in the applications software. The operations included in typical communications libraries include not only point-to-point primitives, but also collective communication primitives, which involve more than two processes.
Collective com- munication is receiving increasing attention due a better understanding of its wide applicability in parallel processing. Thus has arisen the need for the research described in this dissertation: to study the effects of interprocess communication costs, particularly that of collective oper- ations, on the design, implementation, and performance of parallel numerical algo- rithms. The thesis can be stated as: The use of communication operations, designed to exploit properties of new generation communications architectures for distributed- memory platforms, can significantly improve the performance of parallel numerical al- gorithms; moreover, the redesign of such algorithms to account for point-to-point and collective communication costs explicitly, can result in further performance improve- ment. The dissertation makes several specific contributions, including: comparisons of cluster and MPC environments for numerical algorithm design; study of the effects on performance of communication operations that are optimized for new generation parallel architectures; and case studies for specific highly parallel numerical linear algebra algorithms that have recently been designed. - To my wife Ana Cristina iv ACKNOWLEDGMENTS I am extremely grateful to Dr. Philip McKinley, my advisor, for his guidance, gen- erosity, and patience. His dedication and commitment to research are remarkable. A good advisor can motivate his/her students to try to do their best and that was certainly the case with Dr. McKinley. I also would like to express my appreciation to the members of my committee, Dr. Li, Dr. Ni and Dr. Mutka, for their comments and suggestions. It was a privilege to be in their classes. Dr. Zeng, from the math department, deserves a special mention for all his help with mathematical aspects of the programs. I am also indebted to Dr. Carl Page for encouraging me to come to Michigan State in the first place, to Dr. Diane Rover for her class on visualization which was instrumental for this research and to Dr. W.J. Hsu who guided my first steps in research in parallel systems. My dissertation has benefited from the interaction with many of my fellow stu- dents at the department of Computer Science at Michigan State: Edgar Kalns, Ron Sass, Hong Xu, Jon Engelsma, Steve Walsh, Joe Sharnowski, Bill Ryan, Isaac Tsai, Chengchang Huang, Eric Kasten, Dan Judd, Marwan Krunz. I am also very grateful to Ms. Lora Mae Higbee for all her opportune help regarding departmental and university procedures. TABLE OF CONTENTS LIST OF TABLES LIST OF FIGURES 1 Introduction 2 Motivations and Related Work 2.1 Communication Architectures for MPCs ................ 2.2 Clusters of Workstations ......................... 2.3 Collective Communication ........................ 2.4 Communication Libraries for Distributed-Memory Environments . . . 2.5 Libraries for Numerical Analysis ..................... 2.6 Analysis of Collective Communication Costs .............. 2.6.1 General Analysis ......................... 2.6.2 Analysis of Specific Numerical Algorithms ........... 2.6.3 Scalability Analysis ........................ 3 A Parallel Eigenvalue Solver on an MPC 3.1 Split-Merge Algorithm .......................... 3.2 Simple Parallel Implementation ....... ' .............. 3.3 Use of Dynamic Load Balancing ..................... 3.4 Reducing Communication Costs ..................... 3.5 Performance Study .............. . .............. 3.6 Related Work ............................... 
3.7 Summary ................................. 4 The Split-merge Algorithm on a Conventional Cluster 4.1 PVM Implementation ........................... 4.2 Load Balancing in Local Stages ..................... 4.3 Effects of the Cluster Environment ................... 4.4 Performance Study ............................ 4.5 Summary ................................. 5 Performance on Switch-Based Clusters 5.1 Cluster Environments ........................... 5.2 Experiments Using Original PVM .................... vi viii ix 11 14 19 21 25 25 26 27 29 30 35 39 45 48 54 56 57 58 60 62 66 71 73 74 76 5.3 Using IP—Multicast ............................ 79 5.4 An Improved Broadcast Implementation in PVM ........... 81 5.5 Experiments with ATM Multicast .................... 83 5.6 Summary ................................. 85 6 A Model of Split-merge Performance 88 6.1 Parameters ................................ 88 6.2 Iteration Characteristics ......................... 90 6.3 Waiting Time ............. ~ .................. 91 6.4 Broadcast Time .............................. 93 6.5 Gather and Scatter Times ........................ 95 6.6 Scalability ................................. 96 6.7 Summary of Eigenvalue Study ...................... 97 7 A Parallel Singular Value Algorithm 99 7.1 A New Algorithm ........ - ..................... 101 7.2 MPC SVD Study ............................. 106 7.3 Cluster implementation .......................... 110 7.4 A Model of the Performance of the SVD Algorithm .......... 116 7.5 Related work ............................... 118 7.6 Summary of Singular Value Study .................... 119 8 Conclusions and Future Work 121 BIBLIOGRAPHY 125 vii 3.1 3.2 3.3 4.1 6.1 7.1 LIST OF TABLES Execution times on the nCUBE with different broadcast algorithms . Types of test matrices .......................... Execution times (in seconds) for matrices of type 1 through 12 ‘on an nCUBE—2 ................................. Execution times (in seconds) for matrices of type 1 through 12 on a cluster of workstations .......................... Parameters considered in the model ................... Execution times on a cluster with different broadcast algorithms and different interconnecting networks .................... viii 89 116 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 4.1 4.2 4.3 4.4 4.5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 LIST OF FIGURES Examples of MPC topologies (Taken with permission from[2]) ..... 8 A node on a MPC (Taken with permission from[2]) ........... 10 Diagram of a cluster of workstations. .................. 13 Libraries and their structure ........................ 24 Matrices in split-merge eigenvalue solver ................ 31 Split-merge algorithm. .......................... 33 Abstract representation of the split-merge algorithm .......... 33 Different representations of hypercubes ................. 34 Sinks and clients in a 3-cube ....................... 35 Trace of initial parallel algorithm .................... 38 Distribution of the number of iterations needed in the Laguerre routine 39 Trace of load balancing algorithm .................... 43 Execution Times and Speedups for a sample of 10 matrices ...... 44 Traces of Broadcasts ........................... 47 Execution times of parallel eigenvalue solvers on an nCUBE-2. . . . . 52 Execution times of parallel eigenvalue solvers on an nCUBE-2 ..... 
53 Execution traces of parallel eigenvalue solver on network of workstations 62 Comparison of load balancing approaches for a matrix or order 4096. 63 Effect of workstation load on execution times. ............. 65 Execution times of parallel eigenvalue solvers on a cluster. ...... 68 Execution times of parallel eigenvalue solvers on a cluster. ...... 69 ATM cluster testbed at Michigan State University ........... 75 GIGAswitch cluster testbed at Michigan State University ....... 76 Comparison of the performance of Ethernet and ATM ......... 77 Comparison of the performance of a cluster of DEC and a cluster of SUN workstations ............................. 78 Comparison of the performance of the program on Ethernet and the GIGAswitch ................................ 79 Comparison of the performance of the program using IP-Multicast and regular PVM bcast ............................ 81 Comparison of the performance of regular PVM and the enhanced PVM over Ethernet ............................ 82 Comparison of the performance of regular PVM and the enhanced PVM over ATM .............................. 83 ix 5.9 5.10 6.1 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 Comparison of the performance of the program using unreliable ATM- Multicast and regular PVM boast .................... 85 Comparison of the performance of the program using reliable ATM- Multicast and regular PVM bcast .................... 86 Efficiency as a function of the number of processors with different broadcasting functions ........................... 95 Bidiagonal matrix ........... ‘ .................. 101 Symmetric Tridiagonal Matrix T .................... 102 Numerical example of a bidiagonal matrix and the associated matrices 103 Algorithm to calculate singular values .................. 105 Representation of the intervals of eigenvalues of interest ........ 108 Comparison of the two versions of the SVD program on an nCUBE . 109 Execution times as a function of the workload size ........... 111 Traces of the program with different workload sizes .......... 113 Execution times for random matrices of different sizes on a cluster of workstations ................................ 1 15 CHAPTER 1 Introduction Computers have become an essential tool for researchers in many different sciences. Programs are being used to model various physical phenomena, forecast the behavior of natural and artificial systems, and explore different design options in manufactur- ing. An insatiable demand for more computational capacity characterizes many of ' these applications. Many of the most demanding of such applications are numerical algorithms from fields of study that include computational chemistry, structural biol- ogy, fluid and combustion dynamics, petroleum reservoir and groundwater modeling, and high-energy physics. Solving large numerical applications has traditionally been one of the tasks best suited for supercomputers. Supercomputing has undergone many changes over the years, however. Pushing aside traditional vector supercomputer architectures, parallel processing, which can greatly decrease the time required to solve many problems, has begun to dominate the supercomputer industry [1]. Scalable parallel architectures are designed to offer corresponding increases in per- formance as the number of processors is increased. Scalable computing platforms include both massively parallel computers (MPCs) and collections or clusters of high- performance workstations interconnected by high-speed networks. 
Although they are different architectures, both platforms are scalable because they share the “distributed memory” property. In MPCs, memory is distributed among an ensemble of nodes, which are often interconnected by a point-to-point, or direct network; nodes commu- nicate by passing messages though the network [2]. In clusters, memory is naturally distributed among the autonomous workstations. As new networking technologies mature, the difference in communication latency between distributed-memory multi- processors and networks of workstations is diminishing. Hence, clusters are considered promising platforms to compete with MPCs in solving computationally intensive ap- plications [3]. In distributed-memory computing environments, the efficiency of communication is critical to performance. Even if the computational workload is distributed evenly among the processors, communication overhead can severely limit performance. Min- imizing communication overhead, in turn, requires efficient transfer of data among memory modules and efficient coordination of processors. In numerical scientific computing, this problem typically involves the decomposition and alignment of arrays among local memories, sending of partial results from one node to several other nodes, implementation of synchronization points (barriers), and communication needed in methods used to balance the computation load on processors. Efficient implementation of such communication operations depends on the un- derlying communications architecture. However, a wide variety of communications platforms are currently in use for parallel computing. Not only is there little consen- sus among vendors regarding MPC interconnections, but cluster computing, by its very nature, must accommodate heterogeneous networking components. In the past, such architectural diversity has implied that a new version of a particular algorithm had to be developed for each new architecture. Hence, an increasing amount of recent research is addressing communications libraries for parallel computing. The advantages of libraries include increased portability, better software modularity, and less redundancy among different projects. In addition, use of libraries can lead to improved performance because the systems programmer who implements the library is usually more familiar with the architecture and therefore more likely to exploit its characteristics than is the applications programmer. The operations found in typical communications libraries include not only point- to-point primitives, which involve a single source and single destination, but also collective communication primitives, which involve more than two processes. Col- lective communication, also known as group communication, is receiving increasing attention due a better understanding of its wide applicability in parallel processing. Examples of collective communication primitives include multicast, in which the same data is sent to multiple destinations; scatter, in which different data is sent to mul- tiple destinations; and reduction, in which a commutative and associative operation, such as max or sum, is performed on data that resides at different nodes. Barrier synchronization, the most popular coordination mechanism in parallel programming, is also an example of collective communication. The development and evolution of numerical libraries has been ongoing for many years. 
Beginning nearly 20 years ago, numerical libraries have been written for spe- cific problem domains, such as eigenvalue problems and linear systems of equations. In fact, numerical libraries have become useful in many different application fields. Some libraries are produced by computer manufacturers and are optimized for specific architectures; others are in the public domain and are designed to be portable across multiple computing platforms. In fact, new versions of libraries are designed to exe- cute on different distributed-memory architectures. Since the primary goal of many such efforts is portability, the software uses de facto standard communication libraries that are not optimized for specific architectures. While these libraries sometimes of- fer interfaces to collective operations, those operations may actually be implemented inefficiently with simple unicast communication primitives. Thus has arisen the need for the research described in this dissertation: to study the effects of interprocess communication costs, particularly that of collective oper- ations, on the design, implementation, and performance of parallel numerical algo- rithms. The thesis can be stated as: The use of collective communication operations, designed to exploit properties of new generation communications architectures for distributed-memory platforms, can significantly improve the performance of paral- lel numerical algorithms; moreover, the redesign of such algorithms to account for point-to-point and collective communication costs explicitly, can result in further performance improvement. This dissertation makes the following contributions: 1. Compares the performance of MPC and cluster environments, the effects of communication costs in each, and methods to accommodate them in numerical algorithm design. 2. Measures and models the relative effects of optimized collective operations, par- ticularly multicast and broadcast, versus unicast implementations, with partic- ular emphasis on the scale of the system. 3. Studies the use of point-to-point and collective communication designed specif- ically for new generation communications architectures, namely, wormhole- routed networks and switch-based LANs. 4. Produces case studies for particular highly parallel numerical linear algebra algorithms that have recently been designed, which will result not only in an understanding of how collective operations affect their performance, but also in efficient implementations of those algorithms for use in numerical libraries. The remainder of this dissertation is structured as follows. Chapter 2 presents background material on the areas that motivate the proposed research. Chapter 3 describes the implementation of a parallel eigenvalue algorithm on an MPC. The experiences and results obtained in porting the same program to a cluster of worksta- tions are described in chapter 4. Chapter 5 contains the results of experiments with the same program conducted on switch-based clusters and with different broadcast alternatives. A model of the performance of this program is presented in chapter 6. In chapter 7, a second study case is presented: A new algorithm for singular value decomposition (SVD) was also implemented on an MPC and a cluster platform. Chapter 8 summarizes the dissertation. CHAPTER 2 Motivations and Related Work This chapter is intended to describe the state of parallel computing on distributed- memory platforms as it relates to the issues addressed in the dissertation. 
Six topics are covered: the wide variety of communication architectures for MPCs, emerging designs for workstation clusters, collective communication algorithms, programming support for parallel computing on distributed memory platforms, the evolution of numerical libraries, and the analysis of the effects of collective communications on the performance of different algorithms.

2.1 Communication Architectures for MPCs

Communication architectures of MPCs are characterized by several parameters, including topology, routing, switching, port model, and startup latency [2]. The topology of a network defines how the nodes are interconnected by channels. Routing determines the path selected by a packet in order to reach its destination. Flow control deals with the allocation of channels and buffers to a packet as it travels along a path through the network, while switching is the mechanism that removes data from one channel and places it on another channel along the path. In some systems, communication-related tasks are handled by a separate router at each node; the port model of a system refers to the number of communication channels connecting the local processor to the router. Startup latency is the system call time required for handling of the message at both the source and destination nodes.

Several researchers have concentrated on the problem of determining the most appropriate topology given certain communication parameters and under certain assumptions about data traffic. Agarwal [4] presents a mathematical model of the performance of interconnection networks and validates his model with simulations. He concludes that a 3D-mesh is the best topology. Dally analyzes the performance of k-ary n-cube networks that use wormhole routing in [5] and concludes that low-dimensional networks offer better performance. While there has been some consensus on the solutions to some of these design issues, there has been little or no consensus on others. The large number of feasible combinations of these factors has led to a wide variety of parallel communication architectures, which in turn has hindered the progress towards portable parallel programming and motivated the use of communication libraries.

Many MPC systems use direct network architectures, in which each node has a point-to-point, or direct, connection to some number of other nodes, called neighboring nodes. The popularity of direct networks stems from their ability to scale well, that is, as the number of nodes in the system increases, the total communication bandwidth of the network also increases [2]. Although the topologies of commercial MPCs vary, many of them are special cases of either n-dimensional meshes or k-ary n-cubes. These classes of topologies, which include hypercubes, meshes, and tori as special cases, are popular in part because they lend themselves to very simple routing algorithms. In recent years, low-dimensional meshes and tori have attracted larger followings than hypercubes, in part due to their simpler physical layouts and better scalability [2]. A notable exception to mesh-based direct networks is an indirect network called a fat tree [6], which is also claimed to scale well with simple routing. Figure 2.1 depicts several MPC topologies.

Figure 2.1. Examples of MPC topologies (taken with permission from [2]): (a) 2-ary 4-cube (hypercube); (b) 3-ary 2-cube (torus); (c) 3 x 3 x 3 3D mesh; (d) binary fat tree (8 nodes).

At the present time, routing in most commercial MPCs is deterministic, that is, the path followed by a message is determined solely by the source and destination addresses. In meshes and tori, the routing method most commonly used is dimension-ordered routing, in which a message is forwarded through dimensions of the topology in strictly ascending (alternatively, descending) order. Research in adaptive routing algorithms, which improve performance by accounting for current network conditions, is very active [7, 8]. Generally, however, those results are too recent to have infiltrated commercial designs at this time.

The switching techniques employed in distributed-memory computers have evolved over the years. Early systems used store-and-forward switching, whereby each intermediate node completely received and stored a packet before forwarding it along the path to its destination. Subsequent machines used circuit switching [9], where a physical circuit is established between the source and destination nodes; the circuit is reserved exclusively for traffic between the source and the destination. Many new MPCs use the wormhole routing switching strategy [10], in which a packet is divided into flits (flow control digits). The route is determined by the header flit, and the remaining flits follow the head flit in pipelined fashion. Using wormhole routing, communication latency is nearly distance-insensitive in lightly-loaded networks [2]. Furthermore, since blocked messages remain in the network, only very small buffers are required at the routers. Virtual channels [11] have been proposed to improve the performance of wormhole routing. In this approach, multiple channels, each with its own buffer, are multiplexed on each physical channel. The use of virtual channels reduces a situation that occurs in regular wormhole routing, where a channel is available for use by a message but the corresponding buffer is occupied by a second message that is waiting on a channel not needed by the first message [11].

Another parameter that affects the performance of a communication subsystem is the port model. Figure 2.2 depicts a typical node in a wormhole-routed network. The external channels interconnect routers in order to establish the network topology. The internal channels, or ports, provide the node with access to the network. The port model determines how many messages a node can send/receive simultaneously. This number may be of little consequence when a node is sending and receiving infrequent unicast messages. However, when a node participates in a collective operation, it is often required to send/receive several messages at nearly the same time. In these situations, the port model is critical to performance.

Finally, as communication speeds have increased and wormhole routing has minimized the effect of distance in the total communication cost, startup latency has become an important factor in communication cost. In some current systems, startup latency actually dominates the time needed to send a message across the network. Claims of much lower startup latencies in recently announced systems illustrate the importance placed on this parameter by manufacturers [12, 13].

Figure 2.2. A node on an MPC (taken with permission from [2]): the processor, local memory, and other functional units attach to the router through a local interconnect and internal input/output channels (ports), while external input/output channels connect the router to neighboring routers.
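To make the dimension-ordered routing described at the beginning of this section concrete, the following C fragment is a minimal sketch, written only for illustration here and not taken from any machine's router, of e-cube routing on a hypercube: at every hop the message crosses the lowest dimension in which the current node's address still differs from the destination's address, so the path is fixed by the two addresses alone.

    #include <stdio.h>

    /* Dimension-ordered (e-cube) routing on a hypercube: cross the
     * lowest dimension in which the current address still differs
     * from the destination address.                                */
    static int next_hop(int current, int dest)
    {
        int diff = current ^ dest;      /* bits that still differ     */
        if (diff == 0)
            return current;             /* already at the destination */
        int dim = diff & -diff;         /* lowest differing dimension */
        return current ^ dim;           /* cross that dimension       */
    }

    int main(void)
    {
        /* Trace the path from node 5 (101) to node 6 (110) in a 3-cube. */
        int node = 5, dest = 6;
        printf("%d", node);
        while (node != dest) {
            node = next_hop(node, dest);
            printf(" -> %d", node);
        }
        printf("\n");                   /* prints 5 -> 4 -> 6         */
        return 0;
    }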
Several architectures have either found their way to the commercial market or have made significant contributions as research prototypes. Hypercubes are still being offered by nCUBE [14, 12], 2D—meshes are produced by Intel [15] and were produced by Symultek [16], 3D-meshes are used in the Cray T3D [17] and on the experimental J-machine [13], and fat trees are used on the CM-5 [6]. Some manufacturers are offering machines that provide hardware support for distributed shared memory, as the Kendall Square’s KSR-l and Convex Exemplar system. Most recent machines use microprocessors designed by workstation manufacturers. The wide variety of MPC architectures illustrates the lack of a single solution to providing efficient communication primitives. This research explores the effects of new collective communication implementations, optimized to exploit specific architectural properties, on the design of numerical algorithms. 11 2.2 Clusters of Workstations As an alternative to MPCs, there has recently emerged increasing interest in using clusters of workstations for parallel scientific computing. The reasons for the popu- larity of these so-called “distributed supercomputers” are several [18]. First, clusters are often more economical than either traditional vector-based supercomputers or MPCs. Second, the memory capacity of each workstation is typically much greater than that of an MPC node, allowing large problem instances to be addressed using simpler programming methods. Third, the I/O capacity of the system is larger than that of an MPC because each workstation has its own disks and monitor. Fourth, a cluster is more flexible than an MPC: additional computing and communication capacity, in the form of new workstation models and faster networks, can be easily configured into the system. Finally, clusters can be used for parallel computing at the same time that they are used as workstations to meet the computing needs of individuals. The IBM SP1 [19] is an example of this kind of systems. IBM RS / 6000’s are stacked in frames and interconnected by one or more high-speed networks. However, such sharing of resources (both processors and communication links) can also be considered a drawback of clusters, since it is likely to reduce the performance of a particular application. Another disadvantage of clusters is that communication latency is often much longer than in an MPC, particularly if conventional networking technology is employed. New high-speed local area networks (LAN 3) are being used to address this problem. A major issue in the design of cluster systems software, such as communication libraries, concerns portability versus performance. Communication latency depends on the physical network architecture, network protocols and their implementations, and the architecture and software of the network interface. Implementing workstation clusters on top of Ethernet LANs with communication primitives based on U N I X TM 12, sockets and TCP/IP is quite common due to the ubiquity of such environments. However, shared physical media and inefficient protocol implementations can pro- duce interprocess communication latencies that preclude effective execution of many applications that are communication-intensive. In order to reduce the difference in communication latency between MPCs and clusters, higher-speed networks, more efficient protocol implementations and stream- lined communications interfaces are needed. 
Many sites have already installed 100 Mbps FDDI (Fiber Distributed Data Interface) rings. Recently, several switch-based interconnects have been proposed. These include the High Performance Parallel Interface (HiPPI) standard and the Fibre Channel Standard (FCS) to interconnect supercomputers and workstations, respectively. The nodes on an IBM SP1 can be connected using a high performance switch that uses a MIN design. In addition, fiber-optic LANs using Asynchronous Transfer Mode (ATM) protocols [20] are now commercially available [21]. Although ATM was primarily designed for future Broadband Integrated Services Digital Networks (B-ISDN), the high data transfer rates and the interoperability of standardized protocols make it attractive for use in high-speed LANs. A major advantage of switch-based interconnects over shared media is that they can provide an enormous aggregate bandwidth because multiple packets can simultaneously be passed through the switch at the full channel rate.

Figure 2.3 shows a typical cluster configuration; the workstations are interconnected by both a switch (for parallel computing) and a regular LAN (for traditional networking). Several tools that are employed to use a cluster of workstations as a platform for parallel computing are based on TCP/IP protocols. Experience has shown that the traditional implementations of those protocols introduce a significant amount of overhead and that when the physical medium is replaced by a faster one, the increase in performance is not proportional to the increase in capacity at the physical level [22].

Figure 2.3. Diagram of a cluster of workstations: the nodes are attached to both a high-speed switch and a "traditional" network (e.g., Ethernet, FDDI).

This problem is being attacked in several ways: better implementations of TCP/IP [23]; new network interfaces that avoid copying between the user and the kernel and then again to the network interface, by instead copying directly from the user space to the network interface [22]; and, for the case of ATM networks, new libraries that bypass TCP/IP completely and use the ATM protocols directly while providing enhanced functionality [24]. The Nectar project [25] is pursuing several of these approaches simultaneously: new network boards and switches have been developed, as well as software for both the interfaces and the host workstations.

Several computer vendors have either recently brought cluster-based products to the market or have announced such systems for release in the near future. The commercial offerings came after several research projects produced tools that allowed existing groups of workstations, interconnected through some network, to be used as a single computer. That is, the research projects created software tools that stirred interest in the user community, which in turn led the manufacturers to offer new products. Many manufacturers include such software tools with their products. Clusters of workstations are being offered by DEC, IBM and Convex/Hewlett-Packard. The clusters from these manufacturers can be interconnected via regular Ethernet or using proprietary high-speed switches.

Clearly, cluster platforms are being increasingly used for parallel scientific computing. The research so far has focused on taking advantage of the computing power of the machines involved and on taking advantage of faster networks. However, with the exception of multicast in Nectar, collective communication on clusters has not yet received significant attention.
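One data distribution mechanism that is already available on conventional LANs is IP multicast. The following C fragment is a sketch written only for this discussion: a receiver joins a multicast group through the standard BSD socket interface; the group address, port, and buffer size are arbitrary choices for the example, and error checking is omitted. Because the underlying service is an unreliable datagram service, any reliability guarantees must be layered above it, a point revisited in the next section.

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    /* Join an IP multicast group and wait for one datagram.  The group
     * address and port below are arbitrary values for this sketch.    */
    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5000);                /* hypothetical port */
        bind(sock, (struct sockaddr *)&addr, sizeof(addr));

        struct ip_mreq mreq;                        /* group membership  */
        mreq.imr_multiaddr.s_addr = inet_addr("224.0.1.20");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        char buf[1024];
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
        printf("received %ld multicast bytes\n", (long)n);

        close(sock);
        return 0;
    }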
Many of these cluster environments can potentially support efficient collective operations, particularly data distribution operations such as multicast and scatter. For example, Ethernet-based systems support IP-multicast [26], and most switch-based LANs support some type of hardware multicast. We report the results of using both broadcasting alternatives.

2.3 Collective Communication

The advent and popularity of distributed-memory parallel platforms have produced increasing attention from the research community on efficient implementations of collective communication operations. This section discusses several of the most important collective operations and describes research efforts into their implementation on particular communication architectures. Collective communication operations are also included in the Message Passing Interface (MPI) standardization effort, which is discussed later.

Collective operations are often defined in terms of a group of processes. The group may constitute all or a subset of the processes in the parallel application. While the abstract quality of such definitions is useful for studying the semantics of operations, it may be difficult to study their performance, which depends on the physical relationships between the group and the system. For example, the number of processes per node and the distribution of processes in the network affect the performance of collective operations on the group. Therefore, in this dissertation, we will generally discuss a particular collective operation in terms of the physical network architecture and the specific messages that constitute that operation. The distinction will become more clear in the following, which describes the functionality and possible uses of various collective communications operations.

Broadcast is perhaps the most fundamental collective communication primitive. In this operation, the same data is sent from one source node to all other nodes in the network. Broadcast is used in many parallel numerical algorithms, including matrix multiplication, Gaussian elimination, LU-factorization, and Householder transformations [27]. An even more common use of broadcast is at the initiation of most parallel algorithms; broadcast is used to send to all processes those input data and other parameters that are needed for the execution of the algorithm. As will be discussed in Chapter 6, broadcasts, if implemented inefficiently, can severely degrade scalability and performance. Broadcast is actually a special case of multicast, in which the same data is sent from a source node to a subset of nodes in the network. Some confusion may arise here, because the MPI standard does not explicitly refer to multicast; rather, MPI would describe multicast as a broadcast to a set of processes that happen to reside on only a subset of the nodes in the network. We will discuss multicast in terms of sending to nodes, rather than processes. The applications of multicast are similar to those of broadcast; in fact, sending to a subset of nodes (a parallel application may only execute on a subset of nodes) is more common than sending to all the nodes in the network. Henceforth, we will use the term multicast, except for cases when the message is actually transmitted to all nodes in the network.
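As an illustration, in the MPI interface discussed in Section 2.4 the broadcast just described takes the form sketched below (a minimal example of our own; the buffer contents and sizes are arbitrary). Using a communicator that contains only a subset of the processes turns the same call into a multicast in the sense used here.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        double params[4] = {0.0, 0.0, 0.0, 0.0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* root fills in the input parameters */
            params[0] = 1.0e-12;    /* e.g., a convergence tolerance      */
            params[1] = 4096.0;     /* e.g., a matrix order               */
        }

        /* The same data is delivered to every process in the
         * communicator; a sub-communicator instead of MPI_COMM_WORLD
         * would make this a multicast to a subset of the nodes.         */
        MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("process %d received tolerance %g\n", rank, params[0]);
        MPI_Finalize();
        return 0;
    }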
Other collective communication operations include the following: multinode broadcast/multicast, in which every node sends a piece of data to all members of the group; scatter, in which the source node distributes different data to each member of the group; gather, in which different data from all members of the group are collected by one member; global exchange, where every member of a group sends different data to every other node in the group; and global operations, which include both reduction and scan (also known as parallel prefix). In reduction, an associative and commutative operation is applied across data items from each member of the group. Examples include sum, max, min, bitwise operations, and so on. In a scan operation, a parallel prefix with respect to an associative and commutative operator is performed. Every process has a rank i, and the result of this operation returns to the calling process the value of the reduction of the data of processes 0 through i. Formally, given processes p1, p2, ..., pn and data items d1, d2, ..., dn, an associative operator ⊕ is applied such that the result at process pi is d1 ⊕ d2 ⊕ ... ⊕ di.

The operations described thus far can be classified as data movement operations. Collective communication may also be used to implement control operations. An important example is barrier synchronization. A synchronization barrier is a logical point in the control flow of an algorithm at which all the members of a subset of the processes must arrive before any of the processes in the subset are allowed to proceed further. Barrier synchronization occurs frequently in programs for parallel applications, especially for those problems that can be solved by using iterative methods, because it is useful in supporting parallel loop synchronization. In a distributed-memory environment, barrier synchronization must be implemented by a gather operation followed by a multicast operation [28]. Each node involved in the barrier sends a message to the barrier processor. Upon receiving all reduction messages, the barrier process must instruct all processes waiting on the barrier that they may proceed by multicasting a distribution message to them.

The communication characteristics of each combination of topology, port model, and switching technique are different, and collective communication service must be implemented differently in order to exploit those particular characteristics. Research has been conducted on different architectures. Most of the research to date on collective communication has addressed store-and-forward architectures; much of that work is not relevant to new generation distributed-memory architectures, specifically, wormhole-routed systems and switch-based clusters. In the following, we concentrate primarily on multicast and broadcast operations.

Saad and Schultz [29] studied hypercube communications and developed algorithms for broadcasting, multinode broadcasting, scattering, and gathering. Although their work assumed a store-and-forward switching model, since they considered only full subcubes and nearest neighbor communication, many of these results pertain to wormhole-routed systems as well. Hillis and Steele [30] survey several algorithms that are used in the CM-2. In their survey, there are versions of both global operations and scan operations based on recursive doubling. Johnsson and Ho [31, 32, 27] have worked on the particular problems of broadcasting and "personalized communications" (one-to-all and all-to-all) in hypercubes, in both one-port and n-port models. They have presented optimal algorithms in the one-port model using only nearest neighbor communication. Their approach uses n edge-disjoint spanning trees of the hypercube; the message is partitioned into n segments, each of which is transmitted along a different spanning tree. The spanning trees are defined in such a way that there are no edge (channel) conflicts. Again, these algorithms can also be used effectively in wormhole-routed networks. Barnett, Payne, and Van de Geijn [33, 34] have studied both the broadcast and reduction operations in one-port wormhole-routed 2D meshes. The broadcast primitive embeds a minimum spanning tree in a mesh and uses it for broadcasting. The more general problem of multicast (to an arbitrary subset of nodes) was originally studied for hypercubes and 2D-meshes by Lan, Esfahanian and Ni [35]. Their proposed distributed greedy heuristic algorithm, called LEN, was intended to be used to support multicast communication in hardware in virtual cut-through switched systems; it could also be used in software in store-and-forward networks. McKinley, Xu, Esfahanian, and Ni [36] developed software (unicast-based) multicast algorithms for one-port wormhole-routed meshes and hypercubes. Their U-cube and U-mesh algorithms exploit the distance-insensitivity of wormhole routing, and are optimal for one-port architectures. Unicast-based collective communication for multiport systems has come under study recently. We [37] have proposed the double-tree (DT) algorithm, a broadcast algorithm for hypercubes that conform to the all-port model. In the first step, the sender sends the message to the node that is farthest away, as well as to the neighboring nodes except one. By exploiting the distance insensitivity of wormhole routing, the DT algorithm requires only half the number of message-passing steps of the usual SBT algorithm. Robinson et al. [38] have recently proposed a multicast algorithm for all-port wormhole-routed hypercubes.

Most of the approaches described above are designed for implementation in software; collective operations can also be implemented in hardware. For example, the CM-5 [6] control network supports broadcast and reduction, but only for small messages. Hardware-supported wormhole multicast can be based on trees similar to those described above. Tree-based routing is used to support broadcast and a restricted form of multicast in the nCUBE-2 [14], but is susceptible to deadlock. Lin et al. [39] have developed a deadlock-free approach to hardware wormhole multicast, called path-based routing, in which each of multiple worms distributes the message to certain subsets of destinations.

Collective communication has been studied in networks as well, thereby affecting parallel computing on clusters. A significant amount of work has been done at the network layer of the IP protocol to support multicast routing of packets [26], and a multicast service is currently available on the IP protocol. Reliable multicast protocols built atop an underlying unreliable multicast service have been the subject of much research in the last several years. For example, Tanenbaum et al. [40] have proposed a simple reliable multicast protocol to be used in local area networks. For message-passing programming, however, ordering may not be needed because either only one
For message- passing programming, however, ordering may not be needed because either only one 19 message is multicast at a time or because messages can be distinguished by some type checking [14]. With the growing popularity of cluster-based parallel computing, distributed implementations of other collective operations are receiving attention. For example, Huang and McKinley [24] are presently studying implementations of collective operations on switch-based LAN s. We have investigated the effects on performance and scalability of different im- plementations of broadcast on the nCUBE-2. On a cluster environment, we have explored the effects of implementing multicast operations using IP-Multicast, a broad- cast primitive available on the ATM environment [41], and an enhanced version of PVM. 2.4 Communication Libraries for Distributed- Memory Environments In the past, the programmer of a distributed-memory system has invoked system primitives to send messages among processes executing on different nodes. While such low-level control over communication allows the user to exploit characteristics of the architecture, message-passing programming is usually tedious and error-prone. Fur- thermore, the increasing variety of distributed-memory parallel computing platforms, and particularly the emergence and rapid growth of cluster-based computing, can potentially require that a new version of a particular algorithm had to be developed for each new architecture. One method that has been used to address these problems is to construct com- munication libraries. Libraries hide the details of the underlying architecture and vendor-specific interfaces from the user but provide a common interface across mul- tiple platforms, permitting user code to be more easily ported among machines. The correctness of the library routines can be verified independently of the applications 20, using them, thus shortening the application development cycle. In addition, use of libraries can lead to improved performance because the systems programmer who implements the library is generally more familiar with the architecture than is the applications programmer. A variety of research groups and commercial companies offer communication li- braries for distributed-memory environments. Some of them started as programming facilities for multicomputers and have been ported later to cluster environments. Others have migrated in the opposite direction. PVM [42] and P4 [43] are widely . used public domain tools available from national laboratories. Another library for distributed-memory platforms, being developed locally at Michigan State University, is ComPaSS [44]. Many communication libraries other than those mentioned above have been developed, including the CHIMP project [45], Zipcode [46], Express [47, 48], nCUBE’s Vertex [14], and ORNL’s PICL [49] and Linda [50]. The use of data parallel languages offers many advantages over message-passing programming, and may represent the long-term direction for parallel programming. However, despite its drawbacks, message passing is used for parallel programming, and will likely continue to be used for the foreseeable future. As shown above, a wide variety of communications libraries have been produced. Many of the organizations involved in these efforts have recognized the need for portability, and several of the libraries have been ported to multiple platforms. These events indicate that the time has come for standardization of message passing operations. 
Recently, a volunteer group has been organized to develop the Message Passing Interface (MPI) [51], which represents the first effort to bring about such a standard. The MPI forum seeks to make use of the most attractive features of a number of existing message passing systems, rather than selecting one of them and adopting it as the standard. Thus, MPI has been strongly influenced by PVM, Express, Vertex, P4, and other libraries.

The research described in this document complements ongoing research and development of communications libraries. We use some of the existing tools and evaluate the performance of two implementations of numerical algorithms on top of those tools. Furthermore, we investigate alternative implementations of multicast operations and their effect on performance of the implementations of the algorithms.

2.5 Libraries for Numerical Analysis

Problems in numerical analysis are, usually, clearly defined, and the specific circumstances in which those problems are solved are relatively standard. That is, there exists a relatively small set of variations on a given problem that are commonly used, and those variations can be accounted for in standard methods. These properties, along with the advantages of modular software, have led to the development of libraries of methods for solving various numerical problems. In fact, such numerical libraries have become useful in many different fields.

Numerical libraries are available from several sources. For example, computer manufacturers often provide libraries that are optimized for their products. Software houses that specialize in libraries for specific domains have been founded; examples include Numerical Algorithms Group Limited (NAG) and International Mathematical and Statistical Libraries (IMSL). Other libraries are in the public domain and are designed to be portable across multiple computing platforms. The Collected Algorithms of the ACM are an early example of this approach.

Numerical libraries have been written for specific domains. For example, EISPACK [52] was written to solve eigenvalue problems. Routines in EISPACK find the eigenvalues and the eigenvectors for several classes of matrices: complex general, complex Hermitian, real general, real symmetric, real symmetric banded, and real symmetric tridiagonal. Other routines can solve the generalized eigenproblem of symmetric and non-symmetric real matrices and the singular value decomposition of an arbitrary matrix. LINPACK [53] is intended to solve systems of linear equations and to perform matrix factorizations. It was developed, in Fortran, from 1976 to 1979. The routines in this library are used to solve linear systems for several classes of matrices: general, banded, symmetric indefinite, symmetric positive definite, and triangular. Two of the LINPACK routines (one that factors a matrix and one that solves a system of equations) are used for benchmarking high performance computers [54]. LAPACK [55] integrates and refines most of the procedures available in EISPACK and LINPACK. The routines in LAPACK solve systems of linear equations, linear least squares, eigenvalue and singular value problems. There are also routines for estimating condition numbers and matrix factorizations. The routines are provided for banded and dense matrices and for real and complex matrices. For nonlinear equations and nonlinear least squares problems, MINPACK [56] is used. QUADPACK [57] is utilized in the domain of numerical quadrature (numerical integration). Several other packages are designed for other specific problem domains: fishpack, for separable elliptic partial differential equations; fitpack, for splines under tension; fftpack, for fast Fourier transforms; hompack, for nonlinear equations by the homotopy method; itpack, for iterative linear system solutions; and odepack, for ordinary differential equations.

Developers of these libraries, LINPACK and LAPACK in particular, have found that many primitives are used repeatedly while implementing a library. This in turn has led to the creation of lower level libraries, which contain primitive operations for particular domains. For the specific case of linear algebra, a set of low-level routines called BLAS (Basic Linear Algebra Subroutines) has been developed. Three different levels (1, 2, and 3) of BLAS routines have been written, containing functions that perform vector-vector, vector-matrix and matrix-matrix operations, respectively [58, 59, 60]. Efficient implementation of the BLAS routines helps to obtain good performance on specific machines. For example, the BLAS routines have been specially compiled for the Intel i860 and are available for programs that run on the Intel Paragon [15].

As massively parallel computers and networks of workstations become more widely used for solving numerical problems, it becomes necessary to port the existing libraries to new machines. A project called ScaLAPACK [61] is targeted at producing a distributed-memory version of LAPACK. Currently, a subset of the routines in LAPACK (Cholesky decomposition, QR, LU, and Hessenberg and tridiagonal reduction for the algebraic eigenvalue problem) has been ported to distributed-memory parallel computers [62]. In the current version of ScaLAPACK, square block-scattered decomposition of the matrices over the processors is used. The block size and the size of the rectangle sides are parameters that the user can adjust. There is a beta version of ScaLAPACK available which can run on Paragon and CM-5 machines as well as on clusters of workstations that use PVM. This beta version is implemented in Fortran 77. The different architectures and programming paradigms found in parallel computers create the need for other low-level libraries or sets of macros that facilitate the porting of the existing libraries to the new environments. In ScaLAPACK, simplicity is maintained by using distributed versions of the BLAS routines, contained in a library called PB-BLAS. PB-BLAS uses the communications services offered by a communication library called BLACS and calls BLAS routines for the local operations. BLACS (Basic Linear Algebra Communication Subprograms) [63] is designed to provide basic matrix-related communications operations for MIMD message-passing machines. The routines offer communication primitives and global operators for general rectangular matrices and trapezoidal matrices. The four communication operations are: send a message, receive a message, broadcast a message, and receive a broadcast message. Three global operators are provided: SUM, MAX, and MIN. BLACS has been implemented atop at least three lower level communication packages: the Intel communication primitives [64], PVM [65], and the CM-5 communication primitives [6]. An object-oriented programming approach is being pursued in ScaLAPACK++ [66], a version of ScaLAPACK designed to be callable from C++. Matrices are treated as objects in ScaLAPACK++.
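As an aside on the block-scattered (block-cyclic) decomposition mentioned above, the following C helpers illustrate, in one dimension, how a global index is assigned to an owning process and a local index. This is our own sketch; ScaLAPACK's actual indexing routines and argument conventions differ. Applying the same map independently to rows and columns yields the two-dimensional distribution, with the block size playing the role of the user-adjustable parameter described above.

    #include <stdio.h>

    /* Block-cyclic ("block-scattered") mapping in one dimension:
     * global index g is assigned to a process and a local index,
     * given block size nb and p processes.                         */
    static int owner_of(int g, int nb, int p)
    {
        return (g / nb) % p;
    }

    static int local_index(int g, int nb, int p)
    {
        return ((g / nb) / p) * nb + (g % nb);
    }

    int main(void)
    {
        int nb = 2, p = 3;              /* block size 2, 3 processes */
        for (int g = 0; g < 12; g++)
            printf("global %2d -> process %d, local %d\n",
                   g, owner_of(g, nb, p), local_index(g, nb, p));
        return 0;
    }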
Figure 2.5 depicts graphically the evolution and relationships among the numerical libraries mentioned above. EISPACK LINPACK LAPACK ScaLAPACK PB—BLAS BLASZ BLACS BLA82 BLAS] BLAS] Comm. BLAS 1 Time Figure 2.4. Libraries and their structure. With the migration from traditional supercomputers to distributed-memory plat- forms comes the need to port existing libraries (or variations of them) to the new environments. The conversion should be made with portability and efficiency as objectives, and this in turn suggests the use of communication libraries with a well- defined set of primitives, carefully optimized for each environment. Our experiences in implementing different numerical algorithms show that performance depends heavily on efficient implementation of the communication primitives. 25 The new algorithms that were parallelized as part of this research might become in the future part of a numerical library. In the same vein, the parallel versions could, after further refinement, become part of a library like ScaLAPACK. 2.6 Analysis of Collective Communication Costs Some aspects of the effect of communications on the performance of parallel algo- rithms have been studied previously. The research projects reviewed here fall into three categories: general analysis of families of algorithms; analysis of particular nu- merical algorithms; and analysis of the effect of communication on scalability. Each area is discussed in turn. 2.6.1 General Analysis Scherson and Corbett [67] examine the effect of communications on the expected speedup of applications on multidimensional meshes. They provide expressions for the effect of communication on the maximum speedup of an application under the assumption of uniform distribution of communications. Their model assumes that communication time is directly proportional to the distance between the communi- cating processes, an assumption which has become less important with the advent of wormhole routing. In contrast, our analysis considers the characteristics of both wormhole-routed systems and clusters, and is based on case studies of specific nu- merical algorithms. Furthermore, our analysis emphasizes the effects of different implementations of collective communications, whereas their work is concerned with communications in general without distinguishing between point-to-point and collec- tive communications. 26, 2.6.2 Analysis of Specific Numerical Algorithms Cannon and van Rosendale [68] examined the effects of communications on three numerical algorithms: FFTs, the solution of tridiagonal systems of equations, and the solution of PDEs. Their analysis covers both shared-memory and distributed- memory systems, specifically, multistage interconnection networks (MINs) and direct networks. They calculate limits on the performance of the algorithms as a func- tion of the communication costs. Their model assumes store-and-forward switching, which was prevalent at the time of the research. Our analysis is based on different algorithms, and considers new generation architectures. We do not consider shared- memory systems since these are generally not considered to scale well. Brochard [69] studied and modeled the (point-to-point) communication and syn- chronization costs of several basic matrix and vector operations, an iterative method for solving partial differential equations, and a multigrid method. Brochard imple- mented the algorithms on an iPSC hypercube and compared theexperimental results with the models. 
Johnsson [70] has analyzed the performance of FFT and BLAS 2 routines for a library for data-parallel languages and has studied the most appropriate collective communication operations, available on the Connection Machine, for those routines. His research was conducted in the context of a SIMD system, a CM-2, which was based on a hypercube topology. Dongarra and van de Geijn [71] implemented a parallel reduction to Hessenberg form on a 2D mesh (Paragon). They studied the amount of time spent in communication and observed that a significant percentage of the overall execution time is spent in all-to-all broadcast operations. Although they analyzed the composition of the communication overhead, they did not explore the impact of different broadcast alternatives. In our work, we also validate our model against experimental results, but, again, we focus on different algorithms and on the costs of collective communication operations.

2.6.3 Scalability Analysis

Amdahl's law [72] states that the sequential fraction of an algorithm represents a limit to its scalability. Let f be the fraction of an algorithm that has to be performed sequentially, where 0 <= f <= 1. Then the maximum speedup S achievable by a machine with p processors is:

    S \le \frac{1}{f + (1-f)/p}.

Over the years, various other models have been developed to address the scalability of parallel algorithms, accounting for properties not handled by Amdahl's law. Gustafson, Montry, and Benner [73] implemented several applications on a 1K-node nCUBE-2. They realized that with a large number of processors, it is possible to solve larger problems than would have been possible on a single processor or on a small number of processors. When using a large number of processors, they scale the total size of the problem, keeping constant the size of the subproblem assigned to each processor. This alternative view of speedup is called scaled speedup. More recently, Sun and Ni [74] observed that memory on the nodes of an MPC can be a limiting factor on the size of the problems that are solvable. Based on their experience in using multicomputers, they developed the memory-bounded speedup model, which takes into account the inherent parallelism of the application, the computational power, and the memory capacity of a given MPC. Worley [75], based on experiments with partial differential equation solvers, observed that if one has a time constraint, then, for certain problems, it might not be feasible simply to scale up the size of the problem in order to use a larger number of processors efficiently, as the execution time would exceed the given limit. The size of other problems, though, can be increased without violating the time constraint. Kumar and Rao [76] proposed a scalability metric called the isoefficiency function. This function relates the size of the problem, W, to the number of processing elements, p, in a parallel system. They define T_o as the total parallel overhead. If the problem size is kept fixed as p increases, T_o grows as well, and the efficiency decreases. To maintain a constant efficiency as p increases, W must grow at a certain rate. The isoefficiency function is the rate at which W must grow, in terms of p, in order to keep the efficiency constant. The function varies from one algorithm to another and from one architecture to another. T_o includes the communication overhead; hence, the efficiency of the communications is certain to affect the isoefficiency function.
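As a purely illustrative instance of Amdahl's bound above (the numbers are not drawn from any experiment reported in this dissertation): a sequential fraction of only five percent already caps the speedup on 64 processors well below the processor count,

    S \le \frac{1}{0.05 + 0.95/64} \approx 15.4,
    \qquad\text{and}\qquad
    S \le \frac{1}{f} = 20 \text{ as } p \to \infty .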
The isoefficiency models for workstation clusters have assumed that the cost of a broadcast operation is linear in the number of processors involved. If the cost becomes constant or logarithmic, the isoefficiency of a given algorithm on clusters is likely to improve. This is one of the results of this research: in the new switch-based clusters of workstations, it is feasible to use broadcast functions whose cost is logarithmic in the number of participating nodes, and the scalability of such systems improves accordingly.

The work by Xu, McKinley, and Ni [28] deserves special mention, as it inspired some of the research reported here. They report the design and implementation of an efficient barrier synchronization primitive on an nCUBE-2. Their experiments compare two different implementations of barrier synchronization, one based on separate addressing and the other based on their U-cube tree algorithm, and indicate the better scalability of the program that uses the U-cube-based version. The research reported here is not based on an artificial workload but on an actual algorithm, and the emphasis is not on barrier synchronization but on broadcast and load balancing. Also, the work reported here includes experiments on clusters of workstations. This research, like theirs, emphasizes the importance of efficient implementations of collective communication to the scalability of given programs on actual systems.

CHAPTER 3

A Parallel Eigenvalue Solver on an MPC

As quantitative analysis becomes increasingly important in the sciences and engineering, the need grows for faster and more efficient methods to solve eigenvalue problems. Large eigenvalue problems occur in a wide variety of applications, including the dynamic analysis of large-scale structures such as aircraft and ships, prediction of structural responses in solid and soil mechanics, the study of solar convection, modal analysis of electronic circuits, and the statistical analysis of data. Solving for the eigenvalues of large systems is a computationally intensive task that may need to be carried out many times within a particular application; reducing its execution time will improve the performance of the application.

The research described in this dissertation has its roots in a project to parallelize a new algorithm for finding the eigenvalues of real symmetric tridiagonal matrices. The split-merge eigenvalue algorithm, proposed by Li and Zeng [77], has a high potential for successful parallelization while preserving the accuracy and stability found in other algorithms. Nevertheless, realizing the potential parallelism of the algorithm has illuminated many issues regarding the effects of communication in parallel numerical algorithms. The split-merge algorithm has been implemented and studied in two different environments: an nCUBE-2 hypercube multicomputer and a cluster of Sparc-10 workstations interconnected via Ethernet and via an Asynchronous Transfer Mode (ATM) switch. This chapter describes the results obtained in the implementation of the algorithm on the nCUBE-2 and describes the impact of collective communication on its efficiency and speed in this particular environment.

3.1 Split-Merge Algorithm

The problem of finding the eigenvalues of a matrix can be stated as follows: find the values lambda that satisfy the equation Ax = lambda*x for some nonzero vector x, which is called an eigenvector. The values lambda are the eigenvalues. The problem can be rewritten as follows: given (A - lambda*I)x = 0, solve f(lambda) = det(A - lambda*I) = 0.
Symmetric tridiagonal (ST) matrices have the form shown in Figure 3.1(a). All nonzero entries occur on either the main diagonal, the superdiagonal, or the subdiagonal. Furthermore, the superdiagonal is identical to the subdiagonal. Finding the eigenvalues of symmetric matrices is a very common problem in many different fields. A procedure frequently used to solve this problem is to reduce the original "full" matrix to a tridiagonal matrix (reduced or tridiagonal form). This is achieved by premultiplying the matrix by an orthogonal matrix U and postmultiplying it by U^T. The matrix U is chosen so that it introduces zeros in the original matrix everywhere except the main diagonal, the superdiagonal, and the subdiagonal. The eigenvalues of the original matrix and the reduced matrix are the same. Thus, finding the eigenvalues of a symmetric tridiagonal matrix is crucial in the process of finding the eigenvalues of symmetric matrices. As one of the most fundamental problems of computational mathematics, the symmetric tridiagonal eigenvalue problem continues to receive considerable attention in the literature due to its wide applicability.

The new split-merge algorithm was designed by Li and Zeng [77] originally for shared-memory parallel architectures. This algorithm is inherently parallel and takes advantage of a fast iteration technique, namely, Laguerre's method [78] (pp. 263-266).

Figure 3.1. Matrices in split-merge eigenvalue solver: (a) a symmetric tridiagonal matrix A, with diagonal entries alpha_1, ..., alpha_n and off-diagonal entries beta_2, ..., beta_n; (b) use of the separation property, in which one off-diagonal entry is replaced by 0, splitting A into the block-diagonal matrix A' with blocks A1 and A2.

The split-merge algorithm relies on the so-called separation property [79] in order to find the eigenvalues of ST matrices. Given an ST matrix A with nonzero subdiagonal elements, this technique uses the matrix A' produced by replacing some beta_i with 0, as shown in Figure 3.1(b). Let lambda'_1 <= lambda'_2 <= ... <= lambda'_n be the eigenvalues of the submatrices A1 and A2 taken together, and let lambda_1 < lambda_2 < ... < lambda_n be the eigenvalues of A. The separation property states that lambda'_{i-1} < lambda_i < lambda'_{i+1} and lambda_{i-1} < lambda'_i < lambda_{i+1} for all values of i.

In the split-merge algorithm, the separation property is used as follows: the eigenvalues of matrices A1 and A2 are found and then used as the initial approximations to the eigenvalues of matrix A. The process is applied recursively until matrices of size 2 x 2 are reached, for which eigenvalues may be found easily. Without loss of generality, assume that n, the order of the matrix, is a power of 2; specifically, let n = 2^k. The algorithm proceeds in a series of k stages. In the first stage, eigenvalues for each of the 2 x 2 submatrices are found using the quadratic formula. Results from pairs of neighboring submatrices are merged, sorted, and used in solving 4 x 4 submatrices in the second stage. The same procedure is repeated for the following stages: at stage i, eigenvalues are found for 2^{k-i} submatrices of size 2^i, using the results of stage i - 1 as initial approximations.

In order to find an individual eigenvalue from an initial approximation, the split-merge algorithm uses Laguerre's method [78] (pp. 263-266) for finding the zeros of a polynomial with real and simple zeros. This method has cubic convergence. Given an initial approximation x to a zero of the polynomial f(x), a better approximation is obtained by

    L_{\pm}(x) = x + \frac{n}{-\frac{f'(x)}{f(x)} \pm \sqrt{(n-1)\left[(n-1)\left(\frac{f'(x)}{f(x)}\right)^{2} - n\,\frac{f''(x)}{f(x)}\right]}},

where f, f', and f'' are calculated using three-term recurrence equations.
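A minimal sketch of one such update step is shown below. It assumes the quotients f'(x)/f(x) and f''(x)/f(x) have already been evaluated (in practice via the three-term recurrences and the underflow-safe scheme of Li and Zeng, which are not reproduced here), and the move_up flag stands in for the sign selection discussed next; the function name is illustrative only.

    #include <math.h>

    /* One Laguerre update L+/-(x) for a degree-n polynomial with real,
     * simple zeros.  q1 = f'(x)/f(x) and q2 = f''(x)/f(x) are assumed to
     * have been computed separately; move_up selects the +/- sign in the
     * denominator, i.e., whether the iterate moves toward a zero above
     * or below the current approximation. */
    double laguerre_step(double x, int n, double q1, double q2, int move_up)
    {
        double disc  = (n - 1) * ((n - 1) * q1 * q1 - n * q2);
        double root  = sqrt(disc > 0.0 ? disc : 0.0);  /* guard tiny negative values */
        double denom = -q1 + (move_up ? root : -root);

        return x + (double) n / denom;
    }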
Li and Zeng [77] developed an alternative scheme to calculate the quotients f'/f and f''/f that avoids underflow and overflow problems. The plus-or-minus sign in the L_{\pm}(x) expression denotes the fact that two different values are obtained depending on which operation is performed in the denominator. The appropriate value is chosen based on the position of the approximation with respect to the eigenvalue being calculated. Sturm sequence evaluation [80] can be used to determine whether the current approximation is smaller or larger than the actual eigenvalue. The split-merge algorithm, which is described in detail in [77], operates as shown in Figure 3.2.

Algorithm 1: Split-Merge
Input: Symmetric tridiagonal matrix A of order 2^k
Output: The eigenvalues of A
Procedure:
begin
  find eigenvalues for the 2^{k-1} 2 x 2 submatrices of A
  for i = 2 to k
    for j = 1 to 2^{k-i}
      combine submatrices 2j - 1 and 2j of order 2^{i-1} into one matrix of size 2^i
        (that is, merge their eigenvalues)
      for l = 1 to 2^i
        use Laguerre iteration to find lambda_l for the jth submatrix of order 2^i
      endfor
    endfor
  endfor
end

Figure 3.2. Split-merge algorithm.

The split-merge algorithm can be viewed as a "tree" of tasks, as shown in Figure 3.3. At the leaves of the tree, the eigenvalues for the 2 x 2 submatrices are found. These results are merge-sorted in a pairwise fashion, and in the next stage, the eigenvalues for the 4 x 4 submatrices are found. This process continues until the eigenvalues for the original matrix are produced at the root of the tree.

Figure 3.3. Abstract representation of the split-merge algorithm (a binary tree of tasks whose leaves are the 2 x 2 submatrices).

Although the split-merge algorithm can be logically viewed as a tree, the nodes in this tree clearly should not also represent processors in a parallel implementation. If this were the case, for example, a single node would be solely responsible for the last stage of the algorithm, severely reducing the parallelism of the algorithm. In parallelizing the split-merge algorithm, a key objective is to maximize the fraction of time that each processor is busy solving for eigenvalues in each of the k stages.

The algorithm was implemented initially on an nCUBE-2, a wormhole-routed parallel system with a hypercube topology. Formally, a hypercube (or n-cube) consists of 2^n nodes, each of which has a unique n-bit binary address. For each node v, let v also denote its n-bit binary address and let ||v|| represent the number of 1s in v. A channel c = (u, v) is present in an n-cube if and only if ||u XOR v|| = 1, where XOR is the bitwise exclusive-OR operation on binary numbers. Figure 3.4(a) depicts a 3-cube; notice that adjacent nodes are connected by two unidirectional channels in opposite directions. Figure 3.4(b) shows a more abstract representation of a 4-cube. On an nCUBE-2, a user is allowed to request a subcube of the hypercube on which to execute his or her program. For purposes of discussion, let 2^d be the number of processors being used for the execution of the program, and let 2^k be the order of the input matrix.

Figure 3.4. Different representations of hypercubes: (a) a 3-cube, with binary node addresses; (b) an abstract representation of a 4-cube.

3.2 Simple Parallel Implementation

In our initial parallel approach, the responsibility for solving for eigenvalues is divided evenly among processors; that is, at all stages, every processor is responsible for finding 2^{k-d} eigenvalues.
In the first k - d stages, no communication is necessary between processors, since each submatrix is small enough to be handled by a single processor. Specifically, each node i initially performs the first k - d stages independently, thereby solving for the 2^{k-d} eigenvalues of the ith (numbering from left to right) submatrix of order 2^{k-d}. In the last d stages, nodes must communicate their results to other nodes to be merged and sorted, and the appropriate sets of eigenvalues are returned to be used as input to the next stage. The processing elements (nodes) are divided into two categories: sinks and clients. Henceforth, the first k - d stages of the algorithm will be referred to as the local stages, and the last d stages will be referred to as the distributed stages.

Figure 3.5. Sinks and clients in a 3-cube (processor number versus stage, for stages k - 3 through k; sinks and clients are marked for each of the last three stages).

The initial parallel implementation on the hypercube is illustrated in Figure 3.5, as implemented on the nCUBE-2. The 2^d processors allocated for the execution of the program are numbered sequentially from 0 to 2^d - 1. Consider the case of a matrix of order 2^k being solved on a hypercube with 2^d nodes; at the beginning of the program, the original matrix is divided evenly among the participating PEs. In the example, this means that every node is working on a submatrix of order 2^{k-d}. In stages 1 through k - d, all nodes work independently of the other nodes. At the end of stage k - d, every node has found the eigenvalues of a submatrix of order 2^{k-d}. To proceed, it becomes necessary to merge the results obtained by different nodes. That is, at stage k - d + 1, processors need to start sharing data to continue the process. Every odd-numbered node, say i, sends its submatrix to its neighboring node i - 1. Node i - 1 uses a merge-sort to merge the two lists of eigenvalues, each of size 2^{k-d}, into a single list of size 2^{k-d+1}. The even-numbered nodes have become "sinks," as they are receiving the results from other nodes. The odd-numbered nodes have become "clients." Each sink node merges the eigenvalues it found in stage k - d with those received from its client. In the next stage, every sink node needs to find the eigenvalues of a matrix of size 2^{k-d+1}. The clients are idle at this moment, and it would be a waste of computational resources to let the sinks find the eigenvalues of the matrices of size 2^{k-d+1} alone without using the capacity of the clients. Therefore, each sink returns the upper half of the just-merged eigenvalues (which become the initial approximations for finding the eigenvalues of a larger matrix) to the client from which it had previously received data. The sink solves for the lower half of the eigenvalues. Once the clients have solved their part, they send the new results to their respective sinks. After both sink and client have solved their respective halves, and the sink has received the results from the client, every sink has the eigenvalues of a matrix of size 2^{k-d+1}.

To continue, it becomes necessary again to merge the results being kept by two sinks. In the next stage, every node with a processor id that is a multiple of 4 becomes a sink. Observe that the number of sinks decreases by half at each stage. In stage k - d + 2, each sink node i distributes the work evenly among itself and nodes i + 1, i + 2, and i + 3. Each of the four nodes solves for one-fourth of the eigenvalues of the submatrix of order 2^{k-d+2} that resulted from the previous merge. After its clients have finished, the sink collects the results and a merge takes place. The process continues in this manner until, in the last stage, node 0 becomes the only sink and all the rest of the nodes are clients. Node 0 distributes the work evenly among all nodes and collects the results. At this point node 0 has the eigenvalues of the original matrix and the problem has been solved. Notice that the number of eigenvalues calculated by each node remains constant through all stages.
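The assignment of roles in the distributed stages follows directly from the node numbering; a minimal sketch is given below (the function names are illustrative and are not part of the implementation).

    /* Role assignment in distributed stage j = 1, ..., d of the simple
     * parallel implementation.  Sinks are the nodes whose id is a multiple
     * of 2^j; every other node is a client of the sink obtained by clearing
     * its j low-order address bits. */
    int is_sink(int id, int j)
    {
        return (id % (1 << j)) == 0;
    }

    int sink_of(int id, int j)
    {
        return id & ~((1 << j) - 1);   /* clear the j low-order bits */
    }

    int clients_per_sink(int j)
    {
        return (1 << j) - 1;           /* 1, 3, 7, ... as the stages proceed */
    }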
This pattern of communication and computation is very well-suited to the hypercube topology. The sinks at any stage are immediate neighbors of the sinks of the previous stage. Each sink and its clients work in their own subcube, so their communication is disjoint from the communication of other sinks and clients. The nCUBE-2 supports broadcast operations within subcubes, which can be used by a sink to deliver results to its clients efficiently.

In spite of these advantages, the original implementation did not perform well. Performance results of this parallel algorithm indicated that speedup was limited to approximately 25 for 64 processors. In order to discover the reason behind this behavior, we used ParaGraph [81], a visualization tool from Oak Ridge National Laboratory, to create graphs of the execution of the programs. Trace files were generated by the eigenvalue program in a manner consistent with the ParaGraph format. In order to minimize any distortion of results due to I/O operations, records of the traces were maintained in main memory and copied to secondary storage only after execution of the program.

Figure 3.6 shows a trace of the execution of the algorithm for a 2048 x 2048 matrix on a 16-node subcube of an nCUBE-2. Shaded areas indicate when processors are busy. In this example, the algorithm requires 11 stages. Stages 1 through 7 execute without interprocessor communication and are represented in black at the far left of the figure. Stages 8 through 11 require cooperation (and hence communication) among nodes. As shown in Figure 3.6, processors are idle a large fraction of the time. In the figure, each unit on the x axis represents 0.15 seconds; the algorithm requires 65.9 seconds to complete.

Figure 3.6. Trace of initial parallel algorithm.

Further investigation revealed that the time required for processors to finish their share of eigenvalues varied greatly due to relatively high variance in the number of iterations needed to find eigenvalues. In the local stages of the algorithm, such variance in the time to solve for the eigenvalues does not imply any idle time on any processing element. In the later distributed stages of the algorithm, however, where communication between processors is necessary, this imbalance caused those processors that finished early to remain idle while others continued to work. Figure 3.7 shows the distribution of the number of iterations required by Laguerre's method to find eigenvalues in the last stages of the algorithm (stages 8 through 11) for the same matrix used in Figure 3.6. Although the majority of the eigenvalues require a single iteration, a significant number require more iterations, leading to the load imbalance. This unexpected phenomenon greatly reduced the efficiency of the algorithm, thereby limiting speedup.

Figure 3.7. Distribution of the number of iterations needed in the Laguerre routine (frequency versus number of iterations, for stages 8 through 11).
3.3 Use of Dynamic Load Balancing

The previous result led to a redesign of the algorithm in an attempt to achieve better load balancing among processors. The load balancing approach used in the eigenvalue study uses a simple client-server model, similar to those described in [82] as dynamic and centralized approaches. Many algorithms for load distribution have been proposed, and they vary along several dimensions. Load distribution can be static, dynamic, or adaptive. Static algorithms completely ignore the state of the nodes before a job is assigned; they assign jobs to nodes in a deterministic fashion. Dynamic algorithms take into account, in some way, the load of the nodes in the system before assigning a job. Adaptive algorithms are a particular class of dynamic algorithms that change their parameters according to the state of the system. Another dimension is the degree of centralization: algorithms can be centralized, hierarchical, fully decentralized, or a combination of those. Sender-initiated and receiver-initiated algorithms have been studied extensively. In a sender-initiated algorithm [83], an overloaded node looks for a receiver for some of its extra load. In a receiver-initiated algorithm, an underloaded node looks for an additional task.

This strategy is also related to the techniques employed in shared-memory parallel machines to schedule the n iterations of a loop among p processors when the iterations are independent and can be executed in any order [84] (DOALL loop scheduling). Lilja [84] reports in a recent survey four different techniques, summarized in the sketch below: Chunk scheduling [85], where the iterations are divided evenly among the available processors, and which is the approach used initially in our implementation; Guided Self-Scheduling (GSS) [86], where a chunk of c_i iterations is allocated to the processor making the ith request and c_i = ceil(R_i / p), where R_i is the number of iterations remaining, with R_1 = n; Factoring [87], where iterations are scheduled in batches of p chunks of the same size c_i = ceil(R_i / (2p)), and once a batch of p chunks has been dispatched, the value of R_i is updated as R_{i+1} = R_i - p * c_i; and Trapezoid Self-Scheduling [88], where the programmer provides values for an initial and a final chunk size and the dispenser of the iterations gradually decreases the chunk size from the initial value to the final chunk size as the loop progresses.
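A minimal sketch of the chunk-size rules for GSS and factoring, as given above (illustrative code, not taken from the implementation):

    /* Chunk-size rules for two self-scheduling disciplines described above.
     * R is the number of loop iterations still unscheduled and p is the
     * number of processors. */

    /* Guided Self-Scheduling: the i-th request receives ceil(R/p) iterations. */
    int gss_chunk(int R, int p)
    {
        return (R + p - 1) / p;
    }

    /* Factoring: a batch of p chunks, each of size ceil(R/(2p)); after the
     * batch is dispatched, the remaining count drops by p * chunk. */
    int factoring_chunk(int R, int p)
    {
        return (R + 2 * p - 1) / (2 * p);
    }

    int factoring_next_R(int R, int p)
    {
        return R - p * factoring_chunk(R, p);
    }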
Kumar et al. [89] also encountered the need to balance the load among the processors of an nCUBE when parallelizing search algorithms. The algorithms in which these techniques were used are characterized by the existence of a stack of solutions that need to be explored. The stack of unexplored alternatives can be partitioned and distributed among several processors. They explored several alternatives, which they broadly classify into source-initiated and server-initiated load balancing algorithms. In a source-initiated load balancing algorithm, a processor that runs out of work solicits more work from some other node. On receiving such a request, the other node splits its stack and "generates" work for the initiator. If the other node is idle, it sends a "reject" message to the initiator. If the initiator succeeds, it proceeds to work on the problem that it has received; otherwise, it tries another processor. Different strategies can be used to determine the node from which the receiver asks for more work. In Asynchronous Round Robin, each processor maintains an independent variable called target, whose initial value is set to (id_of_the_node + 1) modulo number_of_processors. When the processor runs out of work, it sends a request to the node indicated by target. The value of target is incremented after each request is sent. In Global Round Robin, there is a single global target variable, which is kept at processor 0. When a processor needs to request more work, it obtains the current value of target and then requests work from the node with that value. Processor 0 increments the value of target by 1 after every request. In Random Polling, the other node is selected at random. Their experiments on an nCUBE with 1024 nodes showed that Random Polling was the most scalable of the three approaches. For these approaches, a termination detection algorithm needs to be added for the case in which the algorithm fails to find a solution.

In a server-initiated load balancing algorithm, the generation of subtasks is independent of the requests for work from the idle processors. The generation of tasks can be done by a single node (single level) or it can be arranged hierarchically (multilevel) as an m-ary tree of a certain depth. The root processor generates "super subtasks," which are assigned to its successors in the tree. The successors in turn distribute them to successor processors on request. The leaf processors perform the actual work, and they request work from their parents. When the parent of a leaf runs out of work, the leaf is assigned to a different parent.

The general strategy that was used in our implementation works as follows. Within a given stage, each node is initially required to solve only a fraction of the eigenvalues that it would have been responsible for in the original algorithm; the set of eigenvalues assigned to each node is called its initial workload. One node serves as the coordinator and is responsible for managing allocation of the remaining eigenvalues to nodes that finish early. When a node completes its initial workload, it sends its results to the coordinator, which may dispense an additional set of initial eigenvalue approximations to that node, called a subsequent workload. This process repeats until all the eigenvalues have been found for the current stage. Node 0 collects the results from the other nodes and then sends the respective results back to the sinks, which are still responsible for merging and sorting the eigenvalues solved at each stage.

This approach was effective in reducing the load imbalance among the nodes cooperating within each stage. Figure 3.8 shows a trace of the execution of the modified algorithm for the same 2048 x 2048 matrix as was used in the trace in Figure 3.6. The nodes finish each stage at approximately the same time, resulting in much higher efficiency. In fact, the total time required in this example is less than half that of the original algorithm. The scale used in Figure 3.8 is the same as that used in Figure 3.6; the load balancing algorithm requires only 31.8 seconds to find all 2048 eigenvalues.

Our approach falls into the category of server-initiated, single-level load balancing in Kumar et al.'s taxonomy of load balancing schemes. The effort to "generate" a task is minimal: incrementing a counter. Our algorithm is not a search algorithm, and we are guaranteed that the algorithm will finish. With this approach, there is no need for a termination detection algorithm, and there is no need to add overhead on the workers for checking for incoming messages from other nodes.
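A minimal sketch of the coordinator's dispatch loop for one stage is shown below. The message helpers send_workload() and recv_result() are illustrative placeholders, not the actual nCUBE-2 communication calls, and the workload constants reflect the tuning for 2048 x 2048 matrices reported below.

    #define INITIAL_WORKLOAD    8
    #define SUBSEQUENT_WORKLOAD 6

    /* Placeholders for the real message-passing primitives. */
    extern void send_workload(int worker, int first, int count);
    extern int  recv_result(int *worker);   /* returns # of eigenvalues reported */

    /* Coordinator side of the dynamic load balancing scheme for one stage:
     * 'total' eigenvalues must be computed; each worker starts with an
     * initial workload and receives a subsequent workload each time it
     * reports back, until the stage is complete. */
    void coordinate_stage(int total, int nworkers)
    {
        int next = 0;       /* index of the first eigenvalue not yet assigned */
        int done = 0;       /* number of eigenvalues returned so far */
        int w, count;

        for (w = 0; w < nworkers && next < total; w++) {
            count = (total - next < INITIAL_WORKLOAD) ? total - next
                                                      : INITIAL_WORKLOAD;
            send_workload(w, next, count);
            next += count;
        }

        while (done < total) {
            done += recv_result(&w);        /* worker w reports its results */
            if (next < total) {
                count = (total - next < SUBSEQUENT_WORKLOAD) ? total - next
                                                             : SUBSEQUENT_WORKLOAD;
                send_workload(w, next, count);
                next += count;
            }
        }
    }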
Figure 3.8. Trace of load balancing algorithm.

The speedup of the algorithm is sensitive to the sizes of the initial and subsequent workloads. If the workloads are too small, then too much time is spent in communication between the clients and the coordinator. In addition, the coordinator may not be able to service all the outstanding messages immediately, delaying those nodes requesting additional work. On the other hand, if the workloads are too large, then the phenomenon of load imbalance appears again. For 2048 x 2048 matrices, the best overall performance was attained for an initial workload of 8 eigenvalues and subsequent workloads of 6 eigenvalues.

Figure 3.9(a) plots the average time to find all the eigenvalues in 10 random 2048 x 2048 matrices using different versions of our algorithm. Figure 3.9(b) displays the speedups calculated from the times in Figure 3.9(a). The average speedup for the load balancing algorithm on 64 processors was 48.3, with a maximum speedup of 53.4. This figure emphasizes the importance of load balancing. In particular, the curve for the improved version is still rising sharply at 64 processors. It is interesting to observe that this load balancing technique, which is essentially very similar to the loop distribution techniques employed in shared-memory machines, is quite effective on a distributed-memory machine.

Figure 3.9. Execution times and speedups for a sample of 10 matrices: (a) execution time and (b) speedup versus number of processors, for the original approach, the load balancing version, and the version with reduced communication costs.

One can consider load balancing a collective communication primitive in a broad sense, as it involves communication among all the participating nodes. It is relatively straightforward to define, as MPI has done, standards for calls to collective communication operations such as broadcast and reduce. It is not clear whether it would be feasible to design a "standard" call for a load balancing (loop distribution) operation such as the one described above. Yet, as this case demonstrates, the performance gain can be very significant. The designers of parallel programs use collective communication operations as tools or primitives in their work. Load balancing operations should somehow be included among those frequently used primitives, even if no single function is available to perform them.

It is reasonable to expect that the centralized nature of this load balancing approach might lead to congestion problems on larger numbers of processors. The coordinator node can become a "hot spot" [90], a congestion point, slowing down the entire computation. It should not be difficult to implement a hierarchical approach for large numbers of processors, in which a tree of coordinators would distribute the work among the workers. On the machines where our experiments were run, though, reasonable values for the workloads seemed to avoid the problem of congestion at the coordinator.

3.4 Reducing Communication Costs

The load balancing approach described above was effective in part because it used an efficient implementation of broadcast.
In distributed-memory systems, efficient communication is critical to performance. Figure 3.10 shows two traces of the load balancing algorithm using different implementations of broadcast. The shaded areas correspond to the time each node waits for the broadcast from the coordinator, node 0. The horizontal line at the bottom of each trace indicates the total execution time. In Figure 3.10(a), a naive implementation of broadcast is used, whereby a separate message is sent to each destination. In Figure 3.10(b), a more efficient tree-based broadcast method is used. Clearly, the tree-based primitive is much more efficient; however, a significant amount of the total execution time is still spent waiting for broadcasts. In the split-and-merge algorithm, the number of broadcasts increases with the size of the hypercube, and the time for each broadcast increases with the size of the matrix. The effect of using the naive approach is negligible with 8 or fewer nodes, but degrades performance in cubes containing more than 32 nodes.

The traditional tree-based hypercube broadcast algorithm is known as the spanning binomial tree (SBT) [91]. In the first step of the SBT algorithm, the source node sends the message to its neighbor whose address differs from its own in the lowest (alternatively, highest) bit position, that is, in the first dimension. Next, these two nodes send to their respective neighbors in the second dimension. Following that, all four nodes holding copies of the message forward it to their neighbors in the third dimension. This process continues until, in the last step, half of the nodes in the network forward the message to the other half through the highest dimension. This algorithm requires n message-passing steps to reach all nodes in an n-cube, regardless of the number of ports between processors and their routers [37].

We [37] have developed a new method to reduce broadcast time in wormhole-routed hypercube systems. The method, called the Double Tree (DT) algorithm, is designed to take advantage of the distance insensitivity of wormhole routing and the presence of multiple ports. The DT algorithm begins with the source node s sending the message to the node whose address is the bitwise complement of s; call it s-bar. Subsequently, nodes s and s-bar become the roots of partial spanning binomial trees. The tree rooted at s is called the forward tree, and the one rooted at s-bar is called the backward tree. The message is distributed along the branches of both trees in parallel, reducing the number of message-passing steps required to reach all nodes in an n-cube to ceil(n/2). Experiments on the nCUBE-2 show that the actual latency of a broadcast operation can be significantly reduced [37], thereby improving the performance of the split-merge algorithm.
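For reference, the SBT communication pattern itself is easy to express. The sketch below gives the dimension-by-dimension schedule for a source at node 0; the send/receive helpers are placeholders for the real primitives, and rotating the addresses by XOR with the source generalizes the pattern to any root.

    extern void send_msg(int dest, const void *buf, int len);
    extern void recv_msg(int src, void *buf, int len);

    /* Spanning-binomial-tree (SBT) broadcast schedule on an n-cube,
     * written for a source at node 0; for an arbitrary source s, replace
     * 'me' by me XOR s throughout.  In step d, every node that already
     * holds the message forwards it across dimension d. */
    void sbt_broadcast(int me, int ncube_dim, void *buf, int len)
    {
        int d;

        for (d = 0; d < ncube_dim; d++) {
            int partner = me ^ (1 << d);

            if (me < (1 << d))              /* this node already has the message */
                send_msg(partner, buf, len);
            else if (me < (1 << (d + 1)))   /* this node receives it in step d */
                recv_msg(partner, buf, len);
            /* nodes with higher addresses stay idle until a later step */
        }
    }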
The broadcast primitive available in the nCUBE programming environment takes advantage of certain characteristics of the architecture that are not available to regular programmers (DMA programming). To evaluate the performance gained by using the Double Tree algorithm in place of the regular SBT algorithm, both primitives were implemented at the user level, and the program was run with both broadcast alternatives to compare the performance. The results of the execution of the program on a random matrix of size 2048 are presented in Table 3.1. The results are slightly better with the Double Tree algorithm for large numbers of processors. These results were obtained with the latest release of the operating system of the nCUBE, whose optimizing compiler seems to be significantly better than the previous version.

Table 3.1. Execution times on the nCUBE-2 with different broadcast algorithms.

Figure 3.10. Traces of broadcasts: (a) broadcast with separate messages; (b) tree-based broadcast.

Besides using an efficient broadcast algorithm, other communication-related alterations were made to the algorithm in order to improve performance. For example, dynamic load balancing turned out to improve performance only when used in the distributed stages of the algorithm. Although the use of dynamic load balancing was tested in the local stages as well, increased communication overhead actually reduced overall performance. Further improvements were achieved by reducing the amount of data to be broadcast through the use of redundant parallel computing. As part of the preparation for the Laguerre iteration, the values of the entries in the main diagonal, the squares of the off-diagonals, and the current approximations to the eigenvalues are scaled. (The latest version of the algorithm no longer performs this scaling operation; we report on the previous version because it is interesting to observe that it is sometimes faster to perform redundant computation than to perform expensive communication.) The scaling was being done at the nodes in charge of the merging and sorting (the sinks), which sent the scaled values to the coordinator, which in turn broadcast them to the nodes. By performing the scaling of the main diagonal and the off-diagonals redundantly, but in parallel, at every processor, the total broadcast time was reduced. Combined with the use of dynamic load balancing in the distributed stages, these improvements resulted in an average speedup of 52.2 and a maximum speedup of 55.3 for a sample of 10 random matrices on a 64-node nCUBE-2. These results can be observed in Figure 3.9.

3.5 Performance Study

The preceding discussion was based on experiments performed using random matrices. To verify the robustness of the parallel split-merge algorithm, other types of input matrices, designed specifically to test the accuracy and speed of eigenvalue solvers, were used. A comparison of the sequential version of the split-and-merge algorithm with other sequential algorithms for finding eigenvalues has been conducted previously by Li and Zeng [77]. In that study, the other sequential algorithms included bisection/multisection (DSTEBZ in LAPACK), divide-and-conquer (TREEQL [92]), RFQR (root-free QR, DSTERF from LAPACK), and QR (DSTEQR from LAPACK). Split-and-merge achieved the best accuracy, while RFQR was the fastest algorithm.

In our study, the parallel version of split-and-merge was compared against a parallel version of bisection. Bisection is very well suited for parallelization: once the initial data has been broadcast to the participating nodes, each node can work on its part of the problem without any communication with other nodes, except for reporting the final results to the initiating node. QR methods are difficult to parallelize when only the eigenvalues are sought [93]. Ipsen and Jessup [94] report that parallel bisection is faster than divide-and-conquer, hence the decision to use bisection in the comparisons. Twelve kinds of input matrices were used in the experiments; they are described in Table 3.2.
The following conventions are used in the description of the matrices: alpha_i, i = 1, ..., n, represent the (main) diagonal entries, and beta_i, i = 1, ..., n - 1, denote the off-diagonal entries. The values of a and b were chosen to be 4 and 1, respectively, for all the matrices.

Table 3.2. Types of test matrices. Types 1 through 7 are given by explicit entries; types 8 through 12 are produced by the LAPACK test matrix generator [55] with a prescribed eigenvalue distribution.

Type 1: alpha_i = a, beta_i = b (Toeplitz matrix [b, a, b]); eigenvalues {a + 2b cos(k*pi/(n+1))}, k = 1, ..., n [95] (p. 137).
Type 2: alpha_1 = a - b, alpha_i = a for i = 2, ..., n - 1, alpha_n = a + b; beta_j = b for j = 1, ..., n - 1; eigenvalues known in closed form [95] (p. 138).
Type 3: entries and closed-form eigenvalues (indexed k = 1, ..., n/2) as given in [95] (p. 139).
Type 4: alpha_i = 0, beta_i = sqrt(i(n - i)); eigenvalues {-n + 2k - 1}, k = 1, ..., n [95] (p. 140).
Type 5: alpha_i = -[(2i - 1)(n - 1) - 2(i - 1)^2], beta_i = i(n - i); eigenvalues {-k(k - 1)}, k = 1, ..., n [95] (p. 141).
Type 6: Wilkinson matrices W_n^+: beta_i = 1, and the diagonal entries decrease linearly toward the middle of the matrix and then increase again; most eigenvalues occur in pairs consisting of two numerically indistinguishable values [96] (p. 308).
Type 7: entries generated randomly in [0, 1]; random eigenvalues (random matrices).
Type 8: LAPACK test matrix generator [55]; eigenvalues evenly distributed between the smallest and the largest eigenvalue.
Type 9: LAPACK test matrix generator [55]; eigenvalues geometrically distributed, {q^i}, i = 1, ..., n, for some q in (0, 1).
Type 10: LAPACK test matrix generator [55]; one eigenvalue equal to 1, the rest in (-epsilon, epsilon).
Type 11: LAPACK test matrix generator [55]; eigenvalues evenly distributed in the interval (0, 1], except for one very small eigenvalue.
Type 12: LAPACK test matrix generator [55]; eigenvalues evenly distributed in the interval [10^{-12} - epsilon, 10^{-12} + epsilon], except for one eigenvalue with value 1.

The experiments were performed on matrices of size 128, 256, 512, 1024, and 2048 for types 1 through 7, and of size 128, 256, and 512 for types 8 through 12. The following results were obtained on the nCUBE-2. The experiments were executed on subcubes of size 1, 8, 16, 32, and 64. The execution times for the bisection code and for split-and-merge can be observed in Table 3.3; the times are reported in seconds. Figure 3.11 highlights the execution times for matrices of size 2048 of types 1 through 6 on 8, 16, 32, and 64 processors on the nCUBE-2. Figure 3.12 plots the execution times for matrices of size 512 of types 7 through 12 on 8, 16, 32, and 64 processors.

Table 3.3. Execution times (in seconds) for matrices of types 1 through 12 on an nCUBE-2.

The sequential version of split-and-merge is significantly faster than the sequential version of bisection, as reported already by Li and Zeng [77]. This holds true for all types of matrices, although for matrices of type 6 the advantage of split-and-merge is more modest.
Figure 3.11. Execution times of parallel eigenvalue solvers on an nCUBE-2 (bisection versus split-merge; execution time in seconds versus number of PEs, for matrices with 2048 entries of types 1 through 6).

Figure 3.12. Execution times of parallel eigenvalue solvers on an nCUBE-2 (bisection versus split-merge; execution time in seconds versus number of PEs, for matrices with 512 entries of types 7 through 12).

On the nCUBE-2, the parallel version of split-and-merge is faster, in most cases, than the parallel version of bisection, even though the bisection algorithm has much lower communication requirements. Reducing communication costs, particularly broadcast, is critical to maintaining a performance advantage with larger numbers of processors. The result is that the parallel version of split-and-merge is faster than the parallel version of bisection for most of the matrices in this environment. There are some exceptions for small matrices where parallel bisection has lower execution time than split-and-merge: matrices with 128 or 256 entries executed on 32 or 64 processors. Observation reveals that the parallel version of split-and-merge is faster than the parallel version of bisection when there are at least 8 eigenvalues to be calculated by every PE. When the number of eigenvalues per PE is more than 8, the advantage of split-and-merge is more pronounced. This can be observed by comparing Figures 3.11 and 3.12 where, with 64 processors, the advantage of split-and-merge is clear for the larger matrices (order 1024 and higher).

3.6 Related Work

Several parallel algorithms have been developed to address this problem. The algorithm of choice in sequential computers has been QR, which finds simultaneously both eigenvalues and eigenvectors. A theoretical parallelization of that algorithm was reported in 1977 in [97]. More recently, other researchers have modified the QR algorithm and applied it on a variety of parallel machines and clusters of workstations [93]. That study focused on a modified version of QR for finding both eigenvalues and eigenvectors. Interestingly, as in the study described here, those researchers also found the benefit of performing redundant computations in order to reduce communication costs. The finding of the eigenvalues was performed redundantly while the calculation of the eigenvectors was done in parallel. The algorithm devised by Cuppen [98] was used as the basis for the implementation reported in [92] for shared-memory machines.
Gates [99] reports a variation on the same approach, also for a shared-memory machine. Cuppen's method has also been implemented on a hypercube [94]. A third group of implementations is based on bisection and multisection. Using Gerschgorin disks, an interval containing all eigenvalues is calculated. That interval is recursively divided into two (for bisection) or more (for multisection) subintervals until one of the following conditions holds: the interval contains no eigenvalue; the interval contains exactly one eigenvalue; or the interval is very small, that is, there are several very close eigenvalues. The number of eigenvalues in the interval can be determined by calculating the Sturm sequence at the extremes of the interval. The number of negative terms in the Sturm sequence evaluated at a particular point p is the number of eigenvalues that are smaller than p. Once it has been determined that the interval contains exactly one eigenvalue, there are different alternatives for finding the exact value of the eigenvalue. If the eigenvectors are desired, they can be found by using inverse iteration. If several eigenvalues are very close to each other, it might be necessary to perturb them before finding the eigenvectors. An implementation for shared-memory machines is reported in [100] and for hypercubes in [94]. Lu and Qiao [101] developed an algorithm that parallelizes the evaluation of the Sturm sequence to improve the bisection process. They implemented their algorithm on a shared-memory machine.

A fourth approach is based on homotopy methods [102, 103]. To find the eigenvalues of a given matrix A, another matrix D is chosen such that A(t) = (1 - t)D + tA is an irreducible tridiagonal matrix for t != 0. The eigenpairs of matrix D are known in advance. There are disjoint paths between the eigenvalues of matrix D and those of matrix A, and those paths can be followed in parallel. Clearly, this is an active area of research in parallel numerical algorithms. Research in this area is being pursued in the mathematics department at Michigan State University by Li and his group.

3.7 Summary

In this chapter, the implementation and performance evaluation of a parallel eigenvalue solver on an nCUBE-2 MPC have been described. The general structure of the split-merge algorithm for finding eigenvalues of symmetric tridiagonal matrices is well-suited to efficient parallelization. The algorithm uses Laguerre's iteration and exploits the separation property in order to create subtasks that can be solved independently.

The split-merge algorithm was implemented and studied on an nCUBE-2, a wormhole-routed hypercube. In the initial parallel version of the algorithm, the only required communication among processes occurs between stages of the algorithm; the number of stages is at most log2(n), where n is the order of the matrix. The inherent variance in the number of iterations in the split-merge algorithm justified more sophisticated approaches to load balancing. The communication used to implement dynamic load balancing is, in some sense, programmable. Experiments on a variety of input matrices confirmed that split-merge is significantly faster than bisection for large matrices (order 512 or larger) in the nCUBE environment.

CHAPTER 4

The Split-merge Algorithm on a Conventional Cluster

Clusters of workstations interconnected via Ethernet are very common across academic, research, and even industrial institutions.
With tools like PVM and P4, these clusters constitute an economical alternative to MPCs. How does the performance of an ordinary cluster of workstations compare with that of an MPC? Is it competitive? Should the programming be different? What factors affect the performance of an application in this environment? How are these factors different from those on an MPC?

In order to answer these questions, the split-merge algorithm was implemented and tested on a cluster of Sun Sparc-10 model 30 and model 40 workstations, interconnected by a typical Ethernet network. The workstations used for the experiments were also available for general use by students and faculty; that is, other users could freely access the workstations at any time. This environment is consistent with our goal of testing the hypothesis that general-purpose, shared workstations can provide competitive performance for scientific computing tasks.

The eigenvalue algorithm was implemented using two different programming environments, PVM and P4. PVM [42], a public-domain package from Oak Ridge National Laboratory, provides a software infrastructure for network-based heterogeneous concurrent computing. PVM-based applications are structured as a set of components, with one or more instances of each component. PVM daemons are created on each node in the system and in turn spawn the application instances. Primitives are provided for process management, communication, synchronization, and so on. Communication among daemons uses ordinary sockets. P4 [43], developed at Argonne National Laboratory, comprises a library of macros and subroutines that support monitors for shared-memory programming, message-passing primitives, and support for heterogeneous cluster computing. Since there were no significant differences in performance between the PVM and P4 implementations, only the PVM implementation is described here.

4.1 PVM Implementation

Porting the eigenvalue code from the nCUBE-2 to PVM (version 2.4 and later version 3.2) was relatively straightforward. Both environments are based on message passing, and their functionality is similar: sends are non-blocking, receives can be either blocking or non-blocking, and messages can be identified with tags indicating their type. Both environments support programming in C and Fortran and allow the exchange of messages between program components written in either language.

The nCUBE is a single machine, and the same format is used to represent the data uniformly across all nodes. The only time the format of the data may need to be converted is when it is sent from the front-end workstation to the nodes in the hypercube. PVM, on the other hand, is designed to be used in a heterogeneous environment. Therefore, on the nCUBE the system can be oblivious to the type of the data being passed in a message: the sending and receiving nodes have the responsibility of being consistent in their use of the data passed in the messages. In contrast, in PVM it might be necessary to convert from the format of one machine to a different format for another machine, and it is the responsibility of PVM to call XDR appropriately to perform the conversion.

In terms of allocating processes to processors, the two environments are quite different. In the nCUBE-2, full subcubes are allocated to each application; one instance of the program runs on each node of the allocated subcube.
Every node in the subcube is devoted exclusively to the execution of the program, and I/O operations can be performed from any node. In PVM, the hosts file contains the names of the workstations that are available for use. Once started, the program "enrolls" one or more instances of each component that is required to solve the problem. Those instances are allocated to the available workstations. It is important to note that the number of instances can be larger than the number of workstations available; that is, multiple processes can be assigned to the same workstation, even if that workstation has only a single processor. The usefulness of this strategy will be described later.

Executing parallel programs on a cluster of general-use workstations introduces randomness into the performance of the program from two sources: the load of the workstations and the load of the network. Both resources are shared among applications. In the nCUBE-2, each node is devoted exclusively to the application assigned to that node. Furthermore, because only full subcubes are assigned to programs, the hypercube topology and the underlying routing algorithm prevent communication conflicts among the messages of different programs. In other words, the communication network of the subcube is also dedicated exclusively to the execution of the application assigned to it. Hence, there is no contention among different applications for access to the network.

The programming model in PVM assumes that a master process instantiates several server processes. The master process can perform I/O operations without the intervention of PVM. The servers perform their I/O operations through the PVM daemon on the starting workstation. If one of the servers writes a message to the standard output, the message will be written to a log file on the workstation that started PVM. If the program stops execution prematurely, those messages may not appear in the log file.

The syntax of the send and receive operations on the nCUBE-2 requires pointers to the beginning of the area being sent or received and the number of bytes to transmit. In one single call, one specifies the sender or receiver, the address of the area, its size, and the type of the message. For the reasons mentioned previously, PVM takes a different approach. When sending a message, one uses a call to specify that a message will be sent, and then uses type-dependent calls to move data into the buffer that will be sent later. It is possible to send a message containing different types of data. Once the buffer is ready, another call actually sends the buffer, specifying the tag of the message. The process receiving the message specifies that it will receive a message with a certain tag (or, alternatively, with any tag) and then moves the data from the buffer to the appropriate variables by using type-dependent calls.
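A minimal sketch of this pack/send and receive/unpack sequence, using the PVM 3 interface, is shown below; the task id other_tid, the message tag, and the payload are illustrative, and the receiver's buffer is assumed to be large enough for the incoming data.

    #include "pvm3.h"

    #define WORK_TAG 10   /* illustrative message tag */

    /* Sender: pack an integer count and a block of doubles into the active
     * send buffer, then ship the buffer to task 'other_tid' with a tag. */
    void send_approximations(int other_tid, double *approx, int count)
    {
        pvm_initsend(PvmDataDefault);        /* XDR encoding for heterogeneity */
        pvm_pkint(&count, 1, 1);
        pvm_pkdouble(approx, count, 1);
        pvm_send(other_tid, WORK_TAG);
    }

    /* Receiver: block until a message with the tag arrives (from any task),
     * then unpack the fields in the same order in which they were packed. */
    int recv_approximations(double *approx)
    {
        int count;

        pvm_recv(-1, WORK_TAG);
        pvm_upkint(&count, 1, 1);
        pvm_upkdouble(approx, count, 1);
        return count;
    }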
4.2 Load Balancing in Local Stages

As described in Chapter 3, the nCUBE-2 version of the algorithm used dynamic load balancing to accommodate the variance in the Laguerre iteration routine. Traces of our cluster implementations demonstrated that the load balancing algorithm was also effective in the new environment, although larger workload sizes (the number of eigenvalues assigned by the coordinator following a request for additional work) resulted in better efficiency due to decreased communication overhead. The results on the cluster indicate that workloads of 16 eigenvalues produce the best results for random matrices of order 2048; on the nCUBE-2, the optimal workload size was 6 eigenvalues.

In the hypercube environment, load balancing was useful only in the distributed stages of the algorithm. However, traces of execution on the cluster indicated that the problem of load imbalance was also present in the local stages, that is, in those stages involving no interprocessor communication. This behavior can be attributed to a PVM process relinquishing the processor to another user application on a particular workstation. When such an event occurs, the execution of the entire program is delayed until that PVM process finishes its (local) share of the work. In order to alleviate this problem, load balancing was also applied to the last several local stages. The load balancing algorithm was the same as that used for the distributed stages, that is, one in which a single node acts as the coordinator. The same workload size is used in all stages. Figure 4.1 compares execution traces of the PVM implementation on a matrix of order 4096, with and without load balancing in the local stages. Both traces use the same scale. Figure 4.1(a) shows the execution without load balancing in the early stages, where node 0 happens to require much more time than the other nodes to solve its share of the eigenvalues. Figure 4.1(b) corresponds to an execution of the algorithm on the same matrix with load balancing implemented in the last 5 local stages; the advantage of this strategy is clear. To summarize, in a cluster of workstations, load balancing may be critical to parts of the application where it was not effective in an MPC implementation.

Figure 4.1. Execution traces of the parallel eigenvalue solver on a network of workstations: (a) no load balancing in local stages; (b) load balancing in local stages.

An alternative approach to load balancing is to create more worker processes than there are workstations, which is possible because PVM can map several processes to the same PE. In the eigenvalue algorithm, assigning two processes to each processor reduced the negative effect of processes becoming blocked while waiting for communication to complete; communication delays are masked by overlapping the communication of one process with computation in another. When one process is blocked, the presence of another process on the same PE increases the probability that useful work is being done. In the PVM implementation, this approach proved even more successful than load balancing in local stages, as can be observed in Figure 4.2. Figures 4.2(a) and 4.2(b), respectively, plot the execution times and corresponding speedups when 16 processes were created and assigned to the workstations. As shown, the dual-process approach achieved better speedup than load balancing in local stages. We did not try to use more than two processes per workstation.

Figure 4.2. Comparison of load balancing approaches for a matrix of order 4096: (a) execution times and (b) speedups, for load balancing in the local stages versus redundant (dual) processes.

4.3 Effects of the Cluster Environment

According to traces of program executions, the time that a worker processor has to wait between the moment it has sent its results to the coordinator and the time when it receives another piece of work is approximately 400 microseconds on the nCUBE-2. On the cluster, the same action requires approximately 5000 microseconds (in the best cases, when there is no contention for the network and the coordinator is ready to receive the request).
4.3 Effects of the Cluster Environment

According to traces of program executions, the time that a worker processor has to wait between the moment it has sent its results to the coordinator and the time when it receives another piece of work is approximately 400 microseconds on the nCUBE-2. On the cluster, the same action requires approximately 5000 microseconds (in the best cases, when there is no contention for the network and the coordinator is ready to receive the request). In addition to software overhead, the increased communication latency results from the context switch at the coordinator node.

Broadcast communication is particularly important to the load balancing implementation of the eigenvalue solver. Broadcast is used to distribute the eigenvalues found in the previous stage to all the worker processes. In the nCUBE-2, as described earlier, the tree-based system-supported broadcasting primitive and the DT algorithm [37] both provided a performance advantage. Both broadcast implementations take advantage of the hypercube topology, delivering a message to N nodes in O(log N) time. Broadcasting a message of 32 Kbytes takes approximately 32000 microseconds in a subcube with 4 nodes and 47400 microseconds in a subcube with 8 nodes.

Using PVM, it is also possible to specify that a message should be delivered to multiple processes. Internally, however, this function is implemented as multiple point-to-point, or unicast, messages, which requires time linear in the number of destinations. Cluster broadcast times are further affected by two factors. First, contention for the network among multiple applications can delay the sending of messages. Second, the nodes executing the algorithm were not always directly connected to the same Ethernet cable; that is, some nodes reside on different cables connected by bridges. For 32 Kbyte messages, each of the constituent unicast messages required approximately 30000 microseconds. The advantage of the nCUBE-2 in terms of broadcasting is clear. Improving the performance of the broadcast operation by using different alternatives will be discussed in Chapter 5.
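For reference, the PVM multicast call mentioned above takes an explicit list of destination task ids, which is why its cost grows linearly with the number of workers. The short sketch below shows how the coordinator might use it to distribute a stage's results; the tag and the routine name are assumptions, while pvm_mcast itself is the standard PVM 3 call.

    #include "pvm3.h"

    #define TAG_STAGE 30                   /* hypothetical message tag */

    /* Deliver one packed buffer to a list of worker task ids.  PVM 3
     * implements this internally as one unicast per destination. */
    void broadcast_eigenvalues(int *worker_tids, int nworkers,
                               double *values, int nvalues)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&nvalues, 1, 1);
        pvm_pkdouble(values, nvalues, 1);
        pvm_mcast(worker_tids, nworkers, TAG_STAGE);
    }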
The execution times of the split-merge algorithm in the cluster can vary from one execution to another, depending on the load on the workstations from other applications. Figure 4.3 shows the load of the workstations and the execution time of the eigenvalue program at different times of the day. To obtain the plot, eight Sparc-10 workstations were used to solve a matrix of order 4096. The program did not use load balancing in the early stages, and only one process was assigned to each workstation. The load statistics were obtained using the rup() system command, and the average load was recorded. The load averages have been scaled up by a factor of ten to provide better contrast. The figure clearly illustrates the correlation between the average load on the workstations and the execution time of the program. The "spike" at 4:30 a.m. is due to automatic daily system maintenance. Load balancing in the program can only partially compensate for the additional load on the workstations. When poor speedups occur, they can be attributed to the use of the workstations by other users and to the communication costs.

Figure 4.3. Effect of workstation load on execution times.

As discussed previously, the environments of the nCUBE-2 and the cluster are very different. The nCUBE-2 has a more efficient network and its PEs are not shared. The cluster, on the other hand, has faster PEs: the sequential version of the split-and-merge program runs approximately 9 times faster on a lightly loaded Sparc-10 than on an individual node of the nCUBE-2. Although the communication costs are substantially higher in the cluster environment, the greater computing power of the PEs results in good performance. Since the split-merge algorithm is not communication-intensive, it benefits substantially from this characteristic of the cluster.

Although the speedup figures are not always particularly high, when comparing cluster performance with that of the nCUBE-2, it is necessary to examine the absolute execution times and to consider the cost and flexibility of the systems. Even in a general-use cluster environment, the times on a cluster of 8 workstations are competitive with the time of the nCUBE-2 with 64 processors and better than those of smaller hypercubes. Finding the eigenvalues of a matrix of order 4096 required 32 seconds on an 8-node cluster; the nCUBE-2 requires 118 seconds on a 16-node subcube, 58 seconds with 32 processors, and 32 seconds with 64 processors. Even when the load on the workstations is high, the worst recorded execution time for this matrix, 89.425 seconds using 8 workstations, is still better than that of a 16-node subcube on the nCUBE-2. Again, a small cluster is very competitive.

In light of these results, and taking into consideration the cost element (given university discounts on equipment, 16 Sparc-10s cost approximately the same as a 16-node nCUBE-2), it can be argued that networks of workstations do provide a cost-effective alternative to MPCs for certain scientific applications, once the higher communication costs are accounted for. In this case, the use of dual processes per node was effective in hiding communication latency. Improving performance by reducing broadcast times is part of our ongoing research.

4.4 Performance Study

The same input matrices that were used to test the performance of the program on the nCUBE were used to test the performance on the cluster of workstations. The experiments on the cluster of workstations were performed on 1, 4, and 8 workstations. The execution times for the programs on matrices of types 1 through 12 can be observed in Table 4.1. The execution times are reported in seconds. Figures 4.4 and 4.5, respectively, plot the execution times for matrices of types 1 through 6 with 2048 entries and types 7 through 12 with 512 entries.

As previously discussed, a cluster of workstations interconnected through Ethernet has much higher communication latencies than the nCUBE-2. The experiments were generally conducted at times of little activity on the workstations, but as the workstations are in an open laboratory, it was not possible to guarantee exclusive access to the systems. The times reported in this section were collected with the PVM version on the cluster.

The results for the cluster of workstations are similar to those for the nCUBE-2. Again, the sequential version of the split-and-merge algorithm is significantly faster than the sequential version of bisection. A phenomenon similar to the one observed on the nCUBE-2 can be observed in the clusters as well: for very small matrices (128 entries), bisection is faster than split-and-merge with more PEs (8 for the cluster and 32 for the nCUBE-2).
The communication characteristics of the nCUBE-2, specifically, faster interprocessor communication and a faster broadcast primitive, make split-and-merge superior over a wide range of numbers of processors. The high communication overhead of the cluster reduces that advantage, although the use of dual processes is effective in improving performance.

Table 4.1. Execution times (in seconds) for matrices of types 1 through 12 on a cluster of workstations.

Figure 4.4. Execution times of parallel eigenvalue solvers (bisection and split-merge) on a cluster: (a)-(f) matrices with 2048 entries of types 1 through 6.

Figure 4.5. Execution times of parallel eigenvalue solvers (bisection and split-merge) on a cluster: (a)-(f) matrices with 512 entries of types 7 through 12.

On large matrices, the advantage of split-and-merge over bisection remains significant even on the cluster. For example, bisection requires 41.77 seconds to find the eigenvalues of a matrix with 2048 entries of type 1 on a cluster with 8 nodes, while split-and-merge requires only 12.93 seconds. Even if linear speedups could be obtained with bisection, 22 workstations would be required with bisection to obtain the same execution time as with split-and-merge on 8 workstations. On the nCUBE-2, for matrices of type 6, the advantage of split-and-merge over bisection was relatively small. That situation changes on the cluster, as the advantage of split-and-merge over bisection on matrices of type 6 on the cluster was very significant, as can be observed by comparing Figures 4.4(f) and 3.11(f).
Figure 4.5 shows that for matrices with 512 entries, the split-and-merge execution time begins to rise with 8 PEs compared to 4 PEs. The additional number of processes introduces communication in an additional stage in the execution of the algorithm, just as using more physical processing elements does in an nCUBE-2. When the experiments were repeated on 8 workstations for matrices of 128 entries with 8 logical processes, instead of 16, slightly shorter times were observed. This suggests that the load balancing techniques discussed previously might be more useful for large matrices than for relatively small matrices. In other words, doubling the number of processes on every workstation is a useful strategy only for large matrices.

It should also be noted that faster networks and better interfaces between the workstations and the networks are becoming available, and they will reduce the communication overhead of clusters. As communication latencies decrease, the advantage of split-and-merge will likely increase. Our continuing study of such factors is discussed in Chapter 5.

In examining absolute execution times, one can notice that, for matrices with 2048 entries, the execution times of the parallel version of split-and-merge on a cluster with 4 nodes are similar to the execution times of split-and-merge on a 32-node nCUBE-2, confirming the observations made previously for random matrices that the cluster implementations are competitive with those on an nCUBE-2. The execution times of the parallel version of bisection on 4 nodes of the cluster fall between the execution times of bisection with 16 and 32 nodes on the nCUBE-2. That is, for both the bisection and the split-and-merge algorithms, a very small cluster of Sparc-10 workstations represents a viable alternative to an nCUBE-2 with 32 nodes.

The reader may notice that the ratio of execution times between the sequential versions on an individual node of the nCUBE-2 and on a Sparc-10 is not uniform across the algorithms. For example, split-and-merge requires 520.8 seconds to find the eigenvalues of a matrix of type 7 with 2048 entries on an individual node of the nCUBE-2 and 63.233 seconds on a Sparc-10, resulting in a ratio of 8.24. For bisection, the times are 1708.97 and 317.79 respectively, yielding a ratio of 5.38. A possible explanation for this difference is that split-and-merge may take better advantage of the cache of the Sparc-10 than bisection. Recall that only the last stage of split-and-merge involves accesses to the entire matrix, whereas in bisection it is necessary to access the entire matrix at every iteration.

4.5 Summary

In this chapter, the implementation and performance evaluation of a parallel eigenvalue solver on a cluster of workstations connected via Ethernet have been described. Even though the communication costs are much higher in the cluster environment than on the MPC, the performance of a cluster for this particular algorithm is quite competitive. The presence of a full-fledged operating system on every workstation permits the use of two instances of the program per node, which helps to mask the higher communication costs. On the other hand, the performance is very sensitive to the presence of other users on the workstations. It proved beneficial to include load balancing in earlier stages in order to account for the load imbalance caused by other users.
CHAPTER 5

Performance on Switch-Based Clusters

The need for higher interconnection rates among workstations has led to the development of local area networks built from high-speed switches. These switches have been designed to serve other purposes besides cluster-based computing, but their higher speeds and capacities make them especially attractive for cluster-based computing. Two switch-based clusters of workstations are available in our department, and we were interested in examining the performance of the cluster version of the split-merge algorithm in these environments.

The workstations in the High Speed Networking and Performance (HSNP) laboratory are interconnected using Ethernet and a set of three ATM switches. It is possible to use the TCP/IP and UDP/IP protocols over the ATM connections, enabling the use of PVM. Another laboratory in the department, the Advanced Computing Systems (ACS) laboratory, is equipped with DEC Alpha 3000 workstations. The Alpha workstations are interconnected through regular Ethernet and through a GIGAswitch, a crossbar switch. The connections between the workstations and the GIGAswitch use FDDI. Again, it is possible to use TCP/IP and UDP/IP, and hence PVM, across the GIGAswitch.

Two forms of hardware-supported multicast are available in the HSNP laboratory. First, the operating system of a subset of the workstations in the HSNP laboratory has been recompiled to incorporate the IP-Multicast protocol [26]. Second, the ATM environment provides hardware support for multicasting from a given node to a set of destinations [41]. In addition, the local version of PVM has been extended to provide a broadcast operation based on recursive doubling [104], a software tree approach to multicast.

Experiments were conducted to evaluate the effects of the different environments, the different broadcast alternatives, and the enhancement to PVM. The experiments were conducted using matrices of type 7, as they seemed to be representative of the other types of matrices studied before. The results are discussed in the following sections.

5.1 Cluster Environments

The HSNP laboratory in our department includes an ATM testbed. Figure 5.1 depicts the laboratory configuration. The testbed equipment includes 12 Sparc-10 workstations, 4 Sparc-2 workstations, and miscellaneous networking equipment. The workstations are networked together via both conventional Ethernet as well as via the ATM network. The latter comprises three FORE Systems ASX-100 ATM switches. Each ASX-100 ATM switch provides up to 16 full-duplex ports, allowing for system growth. A Sparc RISC processor on each switch is dedicated to tasks such as connection management and traffic monitoring. A 1.2 Gbps time-division-multiplexed bus constitutes the switch fabric, enabling each I/O port to operate at the full channel rate. The switch fabric is shared among the 16 ports and the switch processor. Among other features, the bus-based switch fabric provides an efficient method of supporting multicast communication among the hosts.

Figure 5.1. ATM cluster testbed at Michigan State University.

Each workstation is equipped with an SBA-200 SBus adapter card, which is connected to a switch port by two unidirectional fiber optic links. The SBA-200 contains a dedicated Intel i960 RISC processor and features DMA with scatter-gather capabilities, as well as custom hardware support for segmentation and reassembly of ATM cells.
The device driver for the SBA-200 not only provides an interface to the standard TCP/IP protocol suite, but also supports the Application Program Interface (API), a set of socket-like system utilities that allow user-level programs to directly access ATM-specific operations, such as hardware multicast. The SBA-200 card and its driver support ATM Adaptation Layers (AAL) 3/4 and 5. AAL5 [105] is an international standard designed primarily to provide efficient data communications over ATM networks.

Another cluster-based parallel computing platform has recently been added to the ACS laboratory. The cluster includes six DEC Alpha 3000 workstations interconnected by a DEC GIGAswitch. The workstations are also interconnected using regular Ethernet. Each port on the GIGAswitch supports 200 Mbps full-duplex FDDI. An FDDI connection exists between every workstation and a corresponding port on the GIGAswitch. The GIGAswitch, configured internally as a crossbar, has a maximum aggregate bandwidth of 3.6 Gigabits/second. It allows multiple simultaneous connections. Figure 5.2 represents the ACS cluster configuration.

Figure 5.2. GIGAswitch cluster testbed at Michigan State University.

5.2 Experiments Using Original PVM

Every workstation in the HSNP laboratory has two IP addresses and corresponding names. One corresponds to the Ethernet connection and the other one to the ATM interface. When executing PVM, one specifies in a hosts file the names of the workstations that will form the "virtual machine". By including in the hosts file the workstation names that correspond to the ATM connections, PVM directs all traffic through the ATM interfaces instead of using the Ethernet connections.
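For example, a hosts file of the following form would build a four-workstation virtual machine whose traffic is carried over ATM. The host names shown are hypothetical stand-ins for the aliases bound to the ATM interfaces; replacing them with the Ethernet names of the same machines is the only change needed to repeat an experiment over Ethernet.

    hsnp01-atm
    hsnp02-atm
    hsnp03-atm
    hsnp04-atm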
An experiment was conducted using 8 workstations to calculate the eigenvalues of matrices of sizes ranging from 128 to 2048 entries to compare ATM with Ethernet. Figure 5.3 shows that, for small matrices, the difference in performance is negligible, but as the size of the matrix grows, the faster communications that the ATM environment provides result in a shorter execution time.

Figure 5.3. Comparison of the performance of Ethernet and ATM.

A similar situation, regarding connections and addresses, exists in the cluster of DEC Alpha workstations. That is, every workstation has two addresses, one corresponding to the connection to the GIGAswitch and another one to the regular Ethernet connection. As with the ATM switch, it is possible to use the TCP/IP and UDP/IP protocols across the GIGAswitch.

The microprocessor used in the Alpha workstations has better performance than the Sparcs on 64-bit floating point operations, and hence the speed of execution is much faster. Figure 5.4 shows the execution times on a matrix of type 7 of size 2048 on clusters of different sizes using regular Ethernet. It can be observed that in sequential mode, an Alpha is significantly faster than a Sparc-10. A cluster of 4 Sparcs has performance similar to that of a single Alpha.

Figure 5.4. Comparison of the performance of a cluster of DEC and a cluster of SUN workstations.

Experiments were performed on a cluster of 4 workstations with input matrices of different sizes using the Ethernet and the GIGAswitch networks. The procedure was similar to the one followed in the experiments on the ATM switch. Figure 5.5 presents the results of the comparison of the two networks. As in the case with the ATM and the Sparcs, the GIGAswitch provides a significant advantage for larger matrices. It should be noted that 4 is a very small number of workstations and that experiments with larger numbers of workstations would likely be more indicative of the scalability of the switch. Nevertheless, the results indicate the performance gain to be obtained by using an alternative to Ethernet.

Figure 5.5. Comparison of the performance of the program on Ethernet and the GIGAswitch.

5.3 Using IP-Multicast

In the late 1980s, extensions to the UDP/IP protocols were proposed and implemented to multicast a message from one machine to a set of destinations [26]. The programming interface is very similar to regular socket programming. The only differences are that it is necessary to set certain options on the socket and that one needs to specify the id of a group that the program joins. Any node that has joined the group can broadcast to the other members of the group, and it can also receive the broadcasts from any other member of the group. Several processes on the same workstation can join the broadcasting group.

This multicast primitive is not reliable. There are no guarantees of correct delivery as in TCP/IP. An experiment was conducted to test the reliability of IP-Multicast in which an array of 256 doubles (2 Kbytes) was broadcast 1000 times using IP-Multicast. The receiver compared the received values with the correct values, and no errors were detected over the 1000 tries. Given the high reliability of Ethernet, this result was to be expected. Nevertheless, production code would require that a reliability protocol be implemented above IP-Multicast.

It is possible to overrun the receiver when using IP-Multicast; that is, if two multicasts are executed over a short period of time, the second one might be lost because the receiver has not finished processing the first one. According to our experiments, the multicast messages could be of a size of up to 2 Kbytes (256 doubles). As a result, messages larger than 2 Kbytes needed to be split into pieces of 2 Kbytes, and the broadcaster had to wait for ACKs from the receivers before proceeding with the next piece.

The split-merge program was modified to use IP-Multicast instead of using the regular PVM mcast primitive, which sends the message from the sender to each destination sequentially. The original and modified versions of the program were executed on the six Sparc-10 workstations in the HSNP laboratory that support IP-Multicast. The results can be observed in Figure 5.6. The differences are minimal. Two factors explain the lack of improvement with IP-Multicast: first, the number of workstations involved is very small, so the broadcast operation of PVM is still relatively efficient; second, the need for ACKs from the receivers for larger matrices serializes the communication process.

As new protocols for the Internet and for cluster computing are explored, it would be beneficial for cluster computing users to have a reliable IP-Multicast primitive that allows large packets to be broadcast. While such a reliable multicast will require the use of either ACKs or NAKs, those will not be issued by the user, resulting in better performance.
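The socket-level interface described above amounts to joining a class D group and then sending UDP datagrams to that group address. The following sketch illustrates the calls involved; the group address, port, and function names are assumptions made for the example, the 2 Kbyte piece size matches the limit observed in our experiments, and no reliability or acknowledgment protocol is shown.

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define GROUP_ADDR "224.5.5.5"   /* hypothetical class D group address */
    #define GROUP_PORT 5150          /* hypothetical UDP port */

    /* Join the multicast group on a UDP socket; any member may then send
     * to the group and receive what the other members send. */
    int join_group(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in local;
        struct ip_mreq mreq;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(GROUP_PORT);
        bind(s, (struct sockaddr *)&local, sizeof(local));

        mreq.imr_multiaddr.s_addr = inet_addr(GROUP_ADDR);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   (char *)&mreq, sizeof(mreq));
        return s;
    }

    /* Send one 2 Kbyte piece (256 doubles) to every member of the group. */
    void multicast_piece(int s, const double piece[256])
    {
        struct sockaddr_in group;
        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_addr.s_addr = inet_addr(GROUP_ADDR);
        group.sin_port = htons(GROUP_PORT);
        sendto(s, (const char *)piece, 256 * sizeof(double), 0,
               (struct sockaddr *)&group, sizeof(group));
    }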
Figure 5.6. Comparison of the performance of the program using IP-Multicast and regular PVM bcast.

5.4 An Improved Broadcast Implementation in PVM

PVM has been modified [104] by adding a more efficient group broadcast operation. The broadcast operation is based on the U-cast algorithm [36]. From the programmer's point of view, one of the nodes creates a group of processes by calling a new routine and passing as a parameter the task ids of the processes that belong to the group. The routine returns a group id. When a node needs to broadcast, it issues a regular send statement but indicates the group id as the destination. The receivers perform a regular receive operation. It is feasible to use two processes per workstation.

Two sets of experiments were performed. Both compare the performance of a regular version of PVM versus the new enhanced version of PVM. One set of experiments was performed over regular Ethernet; the results can be observed in Figure 5.7. The other one was run using the ATM environment; the results can be observed in Figure 5.8. It can be observed that the new version of PVM results in slightly better performance in both environments for most of the sizes of the matrices. In these experiments, the performance gain obtained with the new version of PVM is superior to the original PVM, which uses separate addressing to implement multicast. The number of messages and acknowledgments at any node is at most logarithmic in the number of destinations, implying that this method should scale better.

The number of workstations used for these experiments is rather small. A larger number of workstations would probably show a more significant advantage for the new broadcasting alternatives. In the nCUBE-2, with small numbers of nodes, using a naive broadcast did not cause a significant difference in performance in comparison with a tree-based broadcast. The fact that even on a small number of workstations there is a difference in performance seems to indicate that the results on larger numbers of workstations should be similar to those of the nCUBE.

Figure 5.7. Comparison of the performance of regular PVM and the enhanced PVM over Ethernet.

It should be noted, as well, that only 4 broadcast operations take place for the matrices of sizes between 512 and 2048, and only 3 broadcasts take place for the matrices smaller than 512.

Figure 5.8. Comparison of the performance of regular PVM and the enhanced PVM over ATM.
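The logarithmic behavior of a tree-structured broadcast can be illustrated with the recursive-doubling scheme mentioned at the beginning of this chapter; the sketch below builds it from ordinary PVM point-to-point calls. It is only an illustration of the idea, not the enhanced PVM implementation of [104] nor the U-cast algorithm of [36]: the message tag and the rank-to-task-id mapping are assumptions.

    #include "pvm3.h"

    #define TAG_BCAST 40                   /* hypothetical message tag */

    /* Recursive-doubling (binomial tree) broadcast of n doubles.  `tids`
     * holds the task ids of the p group members, `me` is this process's
     * rank in that array, and rank 0 is the root.  The number of
     * communication rounds is about log2(p) instead of p - 1. */
    void tree_broadcast(int *tids, int p, int me, double *buf, int n)
    {
        int step;
        for (step = 1; step < p; step *= 2) {
            if (me < step) {                    /* already holds the data */
                int dest = me + step;
                if (dest < p) {
                    pvm_initsend(PvmDataDefault);
                    pvm_pkdouble(buf, n, 1);
                    pvm_send(tids[dest], TAG_BCAST);
                }
            } else if (me < 2 * step) {         /* receives in this round */
                pvm_recv(tids[me - step], TAG_BCAST);
                pvm_upkdouble(buf, n, 1);
            }
        }
    }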
The maximum size of a packet that can be sent using ATM multicast is 4096 bytes (512 doubles). As in IP-multicast, it is possible to “swamp” the receiver if two broadcasts occur very close to each other. Again as in IP-multicast, for large packets, it is necessary to send ACKs from the receivers to the broadcaster before more broadcasts are sent. Unlike IP-multicast, though, only one process per workstation can join a broadcast group with the same address, which precludes doubling the number of processes per workstation. Figure 5.9 presents the results of the test on a group of 8 workstations. One set of tests was run using regular PVM broadcast, while the other was run using ATM multicast. All other (non-multicast) communication was performed over regular Ethernet. The ATM multicast results in slightly better performance in matrices Of size Of up to 512, which corresponds to the maximum packet size that can be sent without acknowledgments from the receivers. Beyond the size Of 512, the regular PVM broadcast results in better performance as the gain in broadcast performance Obtained by using the ATM broadcast operation is dominated by the cost Of the transmission of the acknowledgments. Again, if the flow cOntrol mechanisms are incorporated into lower-level protocols, better performance for ATM may result. In this particular application, the fact that the broadcast Operation is not sym- metric, that is that only one member Of the group can broadcast, was not relevant. But, comparing the IP-multicast with the ATM-multicast user interface, one notices that the IP interface is more flexible as any node in the group can broadcast and two or more processes in a given workstation can join the group without requiring different addresses. Recently, a reliable implementation of multicast over ATM netwOrks has become available [41]. The set of routines that implement multicast includes daemons on both the sending and receiving sides that take care Of processing the acknowledgments. The 85 I I I I With ATM mcast "l— Without ATM mcast B— 1e+07 - - Time (#866) 5e+06 *- - 0 1 1 1 1 0 500 1000 1500 2000 Size of the input matrix Figure 5.9. Comparison of the performance of the program using unreliable ATM- Multicast and regular PVM bcast experiments were repeated with this multicast primitive. Removing the acknowledg- ments from the application process produces an improvement in performance that can be Observed in Figure 5.10. These results are particularly encouraging, given that only three broadcasts occur in the 8-processor case. 5.6 Summary The‘experiments described in this chapter compared the performance Of the program under different communication environments and the results of using different imple- mentations Of the broadcast Operation. The use of both switch-based environments, the ATM cluster and the GIGAswitch cluster, resulted in noticeable, albeit small, performance improvements over regular Ethernet. The standard deviations of the run times were smaller than the performance differences Observed. For instance, the standard deviation for the executions of the program on the cluster Of Sparcs 86, I I I I With ATM mcast +- Without ATM mcast B- 1e+07 Time (usec) 5e+06 I I l L 0 500 1000 1500 2000 Size of the input matrix Figure 5.10. Comparison Of the performance Of the program using reliable ATM- Multicast and regular PVM bcast using Ethernet was 184620 microseconds and the mean was 12.09 seconds. 
The same program, using the ATM cluster, had a standard deviation of 198902 microseconds and a mean of 11.21 seconds. The Alpha workstations are significantly faster than the Sparc-10s for this numerically intensive algorithm.

The different broadcasting alternatives that were explored produced minor performance improvements when the acknowledgments were handled at the user level, and an appreciable improvement when the acknowledgments were handled by the routines implementing multicast. In the case of the IP-Multicast-based broadcast and the unreliable AAL5 ATM broadcast, the need to send acknowledgments back from the destinations to the broadcaster seems to diminish the gain obtained by the more efficient broadcast operation. The reliable implementation of ATM multicast produced better performance that can be attributed to a more efficient handling of the acknowledgments. The enhanced implementation of the broadcast operation in PVM also produced performance gains. Two caveats should apply to these observations: the number of workstations used in the tests was relatively small, and the algorithm uses a small number of broadcast operations. When we compared the performance of the "naive" one-to-one broadcast against the performance of the tree-based broadcast on the nCUBE-2, the differences became noticeable only after more than 16 nodes were used. Other algorithms, like the tridiagonalization of a full matrix, use broadcast operations much more intensively and would likely benefit more from better broadcast implementations.

CHAPTER 6

A Model of Split-merge Performance

After performing the experiments, a model was developed to capture the most relevant aspects of the behavior of the program under different environments. A model allows us to better understand the observed behavior of the program in different environments, and to make some predictions about the behavior under different conditions. This model, given a set of values of a particular matrix and the parameters of the runtime environment, produces a prediction of the execution time of the program. The time spent in I/O operations is ignored in this model.

6.1 Parameters

Table 6.1 contains a short description of the parameters that were considered for this model. The left column contains the symbol used to denote the parameter in the equations.

    Parameter     Short Description
    n             Size of the input matrix
    k             ceil(log2(n))
    iter(i, j)    Number of iterations required by eigenvalue i at stage j
    sp            Speed of the microprocessor on a node (FLOPS)
    p             Number of nodes
    d             log2(p)
    c_i           Size of initial workload
    c_s           Size of subsequent workload
    comm          Communication speed (node to node)
    W()           Waiting time for additional work
    B()           Time taken by the broadcast operation
    G()           Time taken by the gather operation
    S()           Time taken by the scatter operation

Table 6.1. Parameters considered in the model

The distribution of the number of iterations of the input matrix is taken into account in the parameter iter(i, j). As can be observed in Figure 3.7, the particular distribution of the number of iterations required by a particular matrix is very unique and does not seem to fit easily into well-known statistical distributions. The distributions of other types of matrices were examined and were found to have similar shapes. The number of iterations is inherent to the particular matrix and the algorithm. It is independent of any particular implementation of the algorithm (except for considerations of precision).
The speed of the microprocessor is a difficult parameter to estimate, as it depends not only on the speed of the processor itself but on other factors such as the size of the cache and the memory access pattern of the program. For simplicity, it will be assumed that the number of processors is a power of 2. W(), B(), G() and S() are functions of other parameters. These functions will be discussed in the following sections. W() depends on the sizes of the initial and subsequent workloads, and on the speeds of the microprocessor and the network. B(), G() and S() depend on the particular implementations of broadcast, gather, and scatter respectively, as well as on comm, the speed of the node-to-node communication, and on p, the number of nodes.

6.2 Iteration Characteristics

Every iteration requires a pass through the values of the diagonal and off-diagonal entries of the submatrix whose eigenvalues are being calculated in the current stage. Approximately 8 additions and 8 multiplications are required for each value, roughly 16 floating point operations.

Although there is computation involved in finding the eigenvalues of the submatrices of size 2 x 2 and in the merging operations, the dominating term in computation time is the time spent in Laguerre's iterations. Thus, a rough estimate of the total number of floating point operations is given by:

\[
\sum_{i=2}^{k} \sum_{\ell=1}^{n} iter(\ell, i) \cdot 2^{i} \cdot 16 \qquad (6.1)
\]

The first summation corresponds to the stages in the algorithm, while the second corresponds to the work needed for all eigenvalues. The complexity of the algorithm is O(n^2 log_2(n)) time. If all other parameters are kept constant, as n grows, the computation cost dominates all other costs, since the complexity of the communications grows at a rate of at most O(n).

In a parallel machine with p = 2^d processors, the first k - d stages do not require communication. The workload is distributed statically. All nodes have to wait for the last one (the one with the maximum workload) to finish before proceeding with the distributed stages. Thus, the time to finish the first k - d stages is approximately:

\[
\max_{j=1,\dots,p} \left( \frac{\sum_{i=2}^{k-d} \sum_{\ell=1}^{n/p} iter(\ell, i) \cdot 2^{i} \cdot 16}{sp} \right) \qquad (6.2)
\]

where the inner summation ranges over the n/p eigenvalues assigned to node j.

All distributed stages involve a gathering at the coordinator from the sinks at that particular stage, followed by a broadcast from the coordinator to all workers. Recall that one node acts as the coordinator, so the number of workers is p - 1. Out of the n eigenvalues to be calculated, c_i * (p - 1) are assigned initially to the workers. This means that n - c_i * (p - 1) eigenvalues will be distributed later and that about ceil((n - c_i * (p - 1)) / c_s) requests for work will take place. Let lreq_j denote the number of requests for work from node j. An approximate value for the execution time of the distributed stages is then:

\[
\sum_{i=k-d+1}^{k} \left[ G() + \max_{j} \left( B() + \frac{\sum_{\ell} iter(\ell, i) \cdot 2^{i} \cdot 16}{sp} + lreq_j \cdot W() \right) + S() \right]
\]

where the inner summation ranges over the eigenvalues computed by worker j in stage i. The total estimated execution time is the sum of the estimates for the local and the distributed stages.
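As a concrete reading of equation (6.1), the small function below tallies the estimated floating point operations from a table of iteration counts; dividing the portion of this sum belonging to one node by sp gives the per-node times that enter the maximum in (6.2). The layout of the iter array (indexed first by eigenvalue, then by stage) is an assumption made only for this sketch.

    #include <math.h>

    /* Rough FLOP count of equation (6.1): every Laguerre iteration on a
     * submatrix of size 2^i costs about 16 * 2^i floating point operations.
     * iter[l][i] is assumed to hold iter(l, i) for eigenvalue l at stage i. */
    double total_flops(int n, int k, int **iter)
    {
        double sum = 0.0;
        int i, l;
        for (i = 2; i <= k; i++)
            for (l = 0; l < n; l++)
                sum += (double)iter[l][i] * ldexp(16.0, i);  /* 16 * 2^i */
        return sum;
    }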
6.3 Waiting Time

The time W() that a processor waits for more work to arrive is a stochastic variable. Unfortunately, it is not clear how to model the behavior of this variable. In a sense, this system could be described, approximately, with a queueing theory model, since there is one single server whose service times should be approximately uniform. The problem arises in modeling the distribution of the requests, as they are related to the distribution of the numbers of iterations, the stage in which the algorithm is, and the size of the workload.

The closest model that seems applicable is the "machine repair model", which has been used to model systems with a fixed number of customers and one single server [106]. This model assumes that the time between the requests for service from the customers has an exponential distribution with an average value of E[t] = 1/λ and that the service time also has an exponential distribution with an average of E[s] = 1/μ. This queueing system always reaches a steady state because the number of customers is finite. The actual system differs from this model in that the service time seems to be fixed (disregarding interference from other processes on the server) and the time between requests depends on the distribution of the required number of iterations.

The average time between requests depends on the amount of work that every node performs on a "chunk" and the speed of the microprocessor. The number of operations in turn depends on the size of the workload and the size of the matrices being solved. The average service time depends on the speed of the network (disregarding contention for the network), the speed of the microprocessor, and the overhead for protocol processing and context switching. The expression for the mean response time is:

\[
W = \frac{(p-1)\, E[s]}{\rho} - \frac{1}{\lambda} \qquad (6.3)
\]

where ρ, the server utilization, is calculated as:

\[
\rho = 1 - \left[ \sum_{m=0}^{p-1} \frac{(p-1)!}{(p-1-m)!} \left( \frac{\lambda}{\mu} \right)^{m} \right]^{-1} \qquad (6.4)
\]

The main result of this part of the model, which is confirmed by the trace files, is that the amount of time spent solving a "chunk" of eigenvalues by a worker should be significantly greater than the service time of the coordinator, to avoid queuing requests for work at the server. The mean response time is also affected by the number of workers. The larger the number of workers, the larger the ratio should be between the amount of time required to finish a "chunk" and the service time to avoid the queuing of requests. An example of this phenomenon is illustrated in the next chapter, in Section 7.3. In terms of the model, if the server utilization is very low, then W is asymptotic to the line E[s], the value for p - 1 = 1. As p grows toward infinity, W is asymptotic to the line pE[s] - 1/λ. The two asymptotic lines intersect at a value called by Kleinrock the system saturation point:

\[
SSP = \frac{E[s] + E[t]}{E[s]} \qquad (6.5)
\]

As was discussed in Chapter 3, there is a tradeoff between small workloads, which minimize the variance in the finishing times but increase the time waiting for more work, and large workloads, which minimize the time waiting for more work but increase the risk of load imbalance. The effect of the period waiting for more work can be diminished by doubling the number of processes per processor. On the other hand, doubling the number of processes increments the number of distributed stages by one, and hence the communication overhead.
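A small numerical sketch of equations (6.3)-(6.5), as reconstructed above, is given below. The request and service times used in main() are illustrative values only, not measurements reported in this dissertation; the function names are likewise arbitrary.

    #include <stdio.h>

    /* Machine-repair estimate of the mean waiting time W for work requests:
     * lambda = request rate of one worker (1/E[t]), mu = service rate (1/E[s]),
     * p = total number of nodes, so p - 1 workers act as customers. */
    double waiting_time(double lambda, double mu, int p)
    {
        int m, M = p - 1;
        double term = 1.0, p0_inv = 0.0, rho;
        for (m = 0; m <= M; m++) {
            p0_inv += term;                    /* (M!/(M-m)!) * (lambda/mu)^m */
            term *= (double)(M - m) * (lambda / mu);
        }
        rho = 1.0 - 1.0 / p0_inv;              /* server utilization, eq. (6.4) */
        return (double)M / (mu * rho) - 1.0 / lambda;  /* eq. (6.3) */
    }

    int main(void)
    {
        /* illustrative numbers only: 5 ms of work per chunk, 0.4 ms service */
        double Et = 0.005, Es = 0.0004;
        int p;
        for (p = 2; p <= 64; p *= 2)
            printf("p = %2d   W = %g s   (SSP = %.1f workers)\n",
                   p, waiting_time(1.0 / Et, 1.0 / Es, p), (Es + Et) / Es);
        return 0;
    }

Running such a sketch makes the saturation effect visible: once the number of workers passes the system saturation point, the waiting time grows roughly linearly with p.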
6.4 Broadcast Time

The broadcast function depends on the topology of the interconnection among the processing elements. On the nCUBE-2, the time complexity of the broadcasting function is O(n log_2(p)), and the broadcast primitive acts as a barrier synchronization: all receivers resume their activity at approximately the same time.

The performance of different broadcast implementations atop ATM is presently under study [41]. For large messages (such as the ones involved in this particular program), that work shows that the time complexity of PVM's mcast primitive is O(np), recursive doubling has a complexity of O(n log_2(p)), while hardware-based (AAL5) multicast has basically a cost that is independent of the number of processors and depends only on the size of the message being broadcast, O(n). The hardware-based multicast operation that they examined is reliable, and the acknowledgments from the receivers are combined in a tree to reduce the collection times. Because the acknowledgments are very small in comparison with the message being broadcast, the broadcast time dominates and the time required for receiving the acknowledgments can be ignored. The same researchers also compared the performance of several broadcast primitives on top of Ethernet: recursive doubling, iterative pvm_sends, and pvm_mcast. The complexity of all these operations is essentially linear, O(p), although there are some performance differences depending on the size of the messages.

The broadcast cost is not a significant factor in performance with a small number of processors, but as the number of processors grows, as illustrated by our experiments on the nCUBE, the cost of the broadcast operation can become dominant. As the number of workstations increases, the number of distributed stages increases as well, and the amount of computation time per node is bound to decrease. Therefore, if large numbers of workstations are to be used in a cluster, a better broadcast primitive is needed. The results of [41] indicate that an ATM switch can provide a significant advantage in the broadcast operation. Our experiments confirm that just using the ATM environment provides a performance advantage. On the other hand, it seems that if one uses Ethernet with a large number of workstations, communications (broadcast especially) are likely to become a bottleneck and affect the efficiency of the program.

Figure 6.1 plots the efficiency of using different numbers of workstations for a matrix of a given size with various broadcast implementations. The plot was produced with a simplified version of this model in which all factors, except the broadcast cost, were assumed to remain constant. As the number of processors increases, the amount of work per processor diminishes and the share of the broadcast time in the total time increases, diminishing the efficiency.

Figure 6.1. Efficiency as a function of the number of processors with different broadcasting functions (logarithmic, constant, and linear broadcast costs).

6.5 Gather and Scatter Times

The gather and scatter operations, described in Section 2.3, might also become bottlenecks with a large number of processors. Unlike the broadcast primitive, though, whose cost remains constant through all distributed stages, the cost of the gather and scatter operations changes from one distributed stage to the next. In the first distributed stage, half the processors are sinks, so the coordinator has to perform a gather and a scatter from and to p/2 - 1 sinks. The number of sinks decreases by half in each successive distributed stage, and in the last stage the coordinator gathers from only one sink. A result concerning efficiency similar to the one obtained for broadcast could be obtained regarding the impact of the gather and scatter operations. That is, given a matrix of a certain size, as the number of processors increases, the work per node decreases and the amount of time spent in the gather and scatter operations might become dominant as a fraction of the total execution time.
That is, given a matrix Of a certain size, as the number Of processors increases, the work per node 96. decreases and the amount Of time spent in the gather and scatter Operations might become dominant as a fraction of the total execution time. 6.6 Scalability How many processors can be used efficiently with this implementation of the algo- rithm? Different answers are Obtained for different environments. The rich topology of the hypercube makes it scalable in terms Of both broadcast and gather. Our experiments confirmed that up to 64 processors can be used effi— ciently. The centralized load balancing approach might become a bottleneck when the system saturation point is reached. Once that point is reached, a hierarchical load balancing approach would become attractive. Even though more processors would be devoted to the role of coordinators, the reduction in the service time at the coordinator would maintain a high efficiency. In the Ethernet environment, it might be feasible to design an efficient broadcast primitive that takes advantage Of the “broadcast” nature of the medium. With such a broadcast implementation, it would be feasible to increase (scale) the number Of processors. On the other hand, it is much more difficult to make gather and scatter operations scalable. The same is true for the load balancing approach. Eventually, the single medium becomes a bottleneck. The scalability Of Ethernet-based clusters Of workstations is limited. Switch-based clusters, in contrast, allow efficient implementations of all communi- cation primitives: Broadcast, gather, scatter and load balancing. By allowing simul- taneous communication among different pairs Of nodes, scalable tree-based broadcast, scatter and gather can be implemented. The solutions suggested for the nCUBE-2 regarding load balancing could also be applied in this environment. That is, Ethernet- based cluster computing is a cost-effective alternative for parallel computing but its 97 scalability is limited. Switch-based cluster computing Offers better performance and better scalability. 6.7 Summary Of Eigenvalue Study In the previous 4 chapters, the implementation and performance evaluation of a par- allel eigenvalue solver on both an MPC and a cluster Of workstations have been described. The general structure Of the split-merge algorithm for finding eigenval- ues in symmetric tridiagonal matrices is well-suited to efficient parallelization. The algorithm uses Laguerre’s iteration and exploits the separation property in order to create subtasks that can be solved independently. The split-merge algorithm was first implemented and studied on an nCUBE-2, a wormhole-routed hypercube. In the initial parallel version Of the algorithm, the only required communication among processes occurs only between stages of the algorithm; the number of stages is at most log2(n), where n is the order of the matrix. The in- herent variance in the number Of iterations in the split-merge algorithm justified more sophisticated approaches tO load balancing. The communication used to implement dynamic load balancing is, in some sense, programmable. This characteristic Of the communications in the algorithm gives it the potential to also perform well on a cluster of workstations. In order to test this hypothesis, the split-merge algorithm was implemented on a cluster of Sun Sparc-lO, models 30 and 40 workstations, interconnected by a regular Ethernet network and by an ATM switch. 
Some experiments were also performed on a cluster of DEC Alpha workstations interconnected by a regular Ethernet and by a GIGAswitch.

Experiments on a variety of input matrices confirmed that split-merge is significantly faster than bisection for large matrices (order 512 or larger) in both parallel environments and revealed the performance ratio between an nCUBE-2 and a cluster of workstations. Split-merge performed better than bisection for matrices of size 2048 in both environments and across the tested numbers of processors. It was observed that as the size of the matrices increased, the relative advantage of split-merge also increased. This trend seems to indicate that the advantage of split-merge should be even greater for matrices larger than the ones tested. On the other hand, beyond a certain number of processors, which depends on the particular environment, bisection performed better on small matrices (order 128). This indicates that it might be better to use bisection for small matrices when the available number of processors exceeds a certain threshold that depends on the particular environment.

The small number of workstations available did not allow the experiments to show clearly the advantage of the better broadcast options. But the model of the behavior of the system indicates that with larger numbers of processors, the broadcast, scatter, and gather operations might become bottlenecks, hence the need for faster and more efficient broadcast operations as well as better scatter and gather operations.

CHAPTER 7

A Parallel Singular Value Algorithm

The singular value decomposition (SVD) of a matrix is an important tool in numerical linear algebra, and it has applications in many scientific fields. The problem of finding the singular value decomposition (SVD) of an m x n real matrix A, with m >= n, can be stated as follows: find the values σ_1, ..., σ_n such that U^T A V = diag(σ_1, ..., σ_n), where U ∈ R^(m x m) and V ∈ R^(n x n) are orthogonal matrices.

The SVD can be used to detect matrices that are singular or numerically very close to singular. In some cases, it will even provide an approximate answer for an ill-conditioned system of linear equations [78]. SVD is also used very often to solve linear least squares problems [78]. Linear least squares problems arise in many different fields where, given a set of experimental measurements, one wants to find the parameters of the system. The SVD is also used in cartography to adjust field measurements to a representation in a map.

Solving for the SVD of a matrix is a computationally intensive task, and reducing its execution time will improve the performance of the applications that require it. An approach that is frequently used is to transform an m x n real matrix A into a bidiagonal matrix B, whose singular values are the same as those of A. Therefore, a fast algorithm to find the singular values of a bidiagonal matrix is an important part of the overall process of finding the singular values of a real matrix A.

Different methods have been used to solve the SVD problem of bidiagonal matrices. Given an n x n bidiagonal matrix B, its singular values can be found by computing the eigenvalues of the symmetric tridiagonal matrix B^T B, where B^T is the transpose of B, and taking the square roots of those eigenvalues.
The disadvantage of this approach is that the values calculated for the smallest singular values are not very accurate, because the values of the original entries of B are squared in the process of calculating B^T B, leading to roundoff errors. A more accurate method that can be used to find the small singular values is to create a symmetric tridiagonal matrix T of size 2n x 2n, with zeroes in the main diagonal and the entries of B in the off-diagonal. The positive eigenvalues of this matrix are identical to the singular values of B [107]. The drawback of this method is the larger size of the matrix T, which implies a greater computational effort.

Li et al. [108] recently proposed a new SVD algorithm that combines both methods in order to find all the singular values accurately and efficiently. A threshold is calculated based on the matrix B^T B (details of the calculation will be given in Section 7.1). The singular values below the threshold are computed by finding the eigenvalues of the matrix T, while those above the threshold are found by calculating the eigenvalues of the matrix B^T B.

In this chapter, we report the results of parallelizing Li's algorithm on an nCUBE-2 and across a cluster of workstations. Two versions of the new algorithm were tested. One uses the split-merge algorithm [77] for finding the eigenvalues of symmetric tridiagonal matrices. The other one uses the well-known bisection algorithm for the same problem.

7.1 A New Algorithm

Given an m x n real matrix A, with m >= n, the process of calculating its SVD can be described as: compute the values σ_1, ..., σ_n and the matrices U and V such that U^T A V = diag(σ_1, ..., σ_n), where U ∈ R^(m x m) and V ∈ R^(n x n) are orthogonal matrices. An approach that is frequently taken when finding the SVD of a matrix of size m x n is to reduce the matrix by orthogonal transformations to an n x n bidiagonal matrix B, as shown in Figure 7.1.

\[
B = \begin{pmatrix}
s_1 & e_1 &        &         \\
    & s_2 & e_2    &         \\
    &     & \ddots & e_{n-1} \\
    &     &        & s_n
\end{pmatrix}
\]

Figure 7.1. Bidiagonal matrix

It can be assumed without loss of generality that all of the s_i, i = 1, ..., n, and e_i, i = 1, ..., n-1, are nonzero. This assumption implies that all σ_i, i = 1, ..., n, are positive and distinct. That is, σ_1 > ... > σ_n > 0.

Two different approaches can be used to find the singular values. First, the singular values of B are the square roots of the eigenvalues of the symmetric tridiagonal matrix B^T B [109] (p. 320). Therefore, finding the eigenvalues of B^T B and then taking the square roots of them produces the singular values of B. The drawback of this approach is that the small singular values of B cannot be calculated accurately, because the entries have been squared. The second approach is based on finding the eigenvalues of T, another symmetric tridiagonal matrix, that is constructed from B as follows. Let

\[
T = \begin{pmatrix} 0 & B^T \\ B & 0 \end{pmatrix} \qquad (7.1)
\]

Then the positive eigenvalues of T are the singular values of B [107]. Let P be a permutation matrix which reorders the rows and columns of T in the order 1, n+1, 2, n+2, ..., n, 2n, as shown in Figure 7.2. The matrix $\tilde{T}$ is a symmetric tridiagonal matrix of order 2n x 2n. The matrices T and $\tilde{T}$ are similar; therefore they have the same eigenvalues, so the positive eigenvalues of $\tilde{T}$ are also the singular values of B.

\[
\tilde{T} = P^T T P = \begin{pmatrix}
0   & s_1 &     &     &        &     \\
s_1 & 0   & e_1 &     &        &     \\
    & e_1 & 0   & s_2 &        &     \\
    &     & s_2 & 0   & \ddots &     \\
    &     &     & \ddots & 0   & s_n \\
    &     &     &        & s_n & 0
\end{pmatrix}
\]

Figure 7.2. Symmetric tridiagonal matrix $\tilde{T}$
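Because $\tilde{T}$ has a zero main diagonal and the off-diagonal sequence s_1, e_1, s_2, e_2, ..., e_{n-1}, s_n, it can be stored directly in the two-array tridiagonal form expected by symmetric tridiagonal eigenvalue routines. The following small sketch (array names are arbitrary) assembles that form from the entries of B.

    /* Build the order-2n matrix of Figure 7.2 in tridiagonal form:
     * zero main diagonal and off-diagonal s1, e1, s2, e2, ..., e_{n-1}, s_n. */
    void build_t_tilde(int n, const double *s, const double *e,
                       double *diag, double *off)   /* diag[2n], off[2n-1] */
    {
        int i;
        for (i = 0; i < 2 * n; i++)
            diag[i] = 0.0;
        for (i = 0; i < n; i++) {
            off[2 * i] = s[i];                 /* s_{i+1} */
            if (i < n - 1)
                off[2 * i + 1] = e[i];         /* e_{i+1} */
        }
    }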
The following example illustrates the above transformations. Given the bidiagonal matrix B in Figure 7.3(a), one can easily construct the symmetric tridiagonal matrices B^T B, shown in Figure 7.3(b), and T, which can be seen in Figure 7.3(c). Notice that because the value of B[2,2] is so small, when one calculates B^T B, B^T B[2,2] becomes 1.0. It turns out that the singular values of the matrix B in this example are 1.4142 and 7.0710e-11, as calculated by the routine DBDSQR in LAPACK [55]. The eigenvalues of the matrix B^T B are, according to the routine DSTEQR of LAPACK, 0.0 and 2.0. The square root of 2.0, 1.4142, coincides with one of the singular values of B, but the other one (the square root of 0) does not. The positive eigenvalues of the matrix T, again calculated with DSTEQR, are identical to the singular values of B: 1.4142 and 7.0710e-11.

\[
B = \begin{pmatrix} 1.0 & 1.0 \\ 0 & 1.0\mathrm{e}{-}10 \end{pmatrix}
\qquad
B^T B = \begin{pmatrix} 1.0 & 1.0 \\ 1.0 & 1.0 \end{pmatrix}
\qquad
T = \begin{pmatrix}
0   & 1.0 &     &     \\
1.0 & 0   & 1.0 &     \\
    & 1.0 & 0   & 1.0\mathrm{e}{-}10 \\
    &     & 1.0\mathrm{e}{-}10 & 0
\end{pmatrix}
\]

Figure 7.3. Numerical example of a bidiagonal matrix and the associated matrices: (a) sample bidiagonal matrix B; (b) sample product B^T B; (c) sample matrix T.

It is possible to compute all the singular values of B with high relative accuracy if the eigenvalue algorithm used on T introduces small relative perturbations entrywise on T. The drawback of this second approach is that the size of the matrix doubles, and therefore the SVD is more expensive to calculate. It is faster to find the eigenvalues of B^T B because the matrix is smaller, but this method introduces errors in the smallest singular values. Li's algorithm uses a novel approach that combines both methods.

Let σ_i be an actual singular value, and let σ_i' be an approximation of σ_i. Let ε be the machine precision. Then, based on the assumption that the eigenvalue algorithm can achieve a relative accuracy of 3nε when finding the eigenvalues of T, Li et al. [108] set the following goal of relative accuracy for finding the singular values:

\[
\frac{|\sigma_i' - \sigma_i|}{\sigma_i} \le 3 n \varepsilon .
\]

It has been shown [108] that this degree of relative accuracy can be obtained by calculating a threshold. The threshold is calculated using the infinity norm and the size of the matrix B^T B. The threshold is:

\[
\frac{7 \, \| B^T B \|_\infty}{12 n}
\]

The largest singular values of B are found using the first approach, that is, by calculating the eigenvalues λ_i of B^T B in the interval [7‖B^T B‖_∞ / (12n), ‖B^T B‖_∞) and then setting σ_i = sqrt(λ_i). The smallest singular values of B are computed by finding the eigenvalues of T in the interval (0, sqrt(7‖B^T B‖_∞ / (12n))). An outline of Li's algorithm is given in Figure 7.4.

    Algorithm 1: Singular Values
    Input: Bidiagonal matrix B of order n
    Output: The singular values σ_i, i = 1, ..., n, of B
    Procedure:
    begin
        Calculate the matrix B^T B
        Calculate the infinity norm of B^T B, ‖B^T B‖_∞
        Calculate the threshold 7‖B^T B‖_∞ / (12n)
        Construct the matrix T
        Calculate l, the number of eigenvalues of T in the interval (0, sqrt(7‖B^T B‖_∞ / (12n)))
        Calculate the l eigenvalues λ_i, i = 1, ..., l, of T in that interval and set σ_i = λ_i
        Calculate the n - l eigenvalues λ_i, i = l+1, ..., n, of B^T B in the interval
            [7‖B^T B‖_∞ / (12n), ‖B^T B‖_∞) and set σ_i = sqrt(λ_i), i = l+1, ..., n
    end

Figure 7.4. Algorithm to calculate singular values

The threshold described previously was calculated using the best relative accuracy attainable with the split-merge algorithm, discussed in the previous chapters. The split-merge algorithm described in the previous chapter can be used to calculate the eigenvalues of symmetric tridiagonal matrices. The algorithm needed some modifications, described later in Section 7.3.

We wanted to compare the split-merge version of the singular value algorithm against another version that uses a different eigenvalue solver. As discussed previously, different versions of the bisection algorithm have been used in both sequential (DSTEBZ in LAPACK [55]) and parallel systems [94] in order to find the eigenvalues of symmetric tridiagonal matrices. The algorithm is based on the evaluation of Sturm's sequence [80] at a particular value. Sturm's sequence determines the number of eigenvalues that are smaller than the point where the sequence is evaluated. Given an initial interval that is known to contain the eigenvalue of interest, Sturm's sequence is evaluated in the middle of the interval to determine which part of the original interval contains the desired eigenvalue. The process is performed repeatedly, in the manner of a binary search, until a small enough interval has been obtained. A particular implementation of bisection starts by finding the interval containing all the eigenvalues using the Gerschgorin Theorem [96], which states that every eigenvalue λ_i of a complex matrix A lies inside a disk in the complex plane centered at a_ii and with radius Σ_{j≠i} |a_ij| [109]. The theorem applies to real matrices as well, but in this case the eigenvalues are inside an interval on the real line. Then each eigenvalue is obtained, independently and in parallel, by bisecting that interval with the same number of steps. This version of bisection is very appropriate for parallelization: once the initial data has been broadcast to the participating nodes, each node can work on its part of the problem without any communication with other nodes, except for reporting the final results to the initiating node.
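A minimal sketch of the Sturm-sequence count and of the resulting bisection loop is given below. It assumes the symmetric tridiagonal matrix is supplied as its diagonal and off-diagonal arrays, uses a relative interval-width tolerance of 3nε consistent with the accuracy goal above, and is an illustration of the technique rather than the dissertation's implementation; the function names are arbitrary.

    #include <math.h>
    #include <float.h>

    /* Number of eigenvalues of the symmetric tridiagonal matrix
     * (diagonal a[0..n-1], off-diagonal b[0..n-2]) that are smaller
     * than x, obtained from the signs of the Sturm sequence. */
    int sturm_count(int n, const double *a, const double *b, double x)
    {
        int i, count = 0;
        double d = a[0] - x;
        if (d < 0.0) count++;
        for (i = 1; i < n; i++) {
            if (d == 0.0) d = DBL_EPSILON;     /* avoid division by zero */
            d = (a[i] - x) - b[i-1] * b[i-1] / d;
            if (d < 0.0) count++;
        }
        return count;
    }

    /* Bisection for the k-th smallest eigenvalue (k = 1..n) inside [lo, hi]. */
    double bisect_eigenvalue(int n, const double *a, const double *b,
                             double lo, double hi, int k)
    {
        while (hi - lo > 3.0 * n * DBL_EPSILON * fabs(hi)) {
            double mid = 0.5 * (lo + hi);
            if (sturm_count(n, a, b, mid) >= k)
                hi = mid;                      /* k-th eigenvalue is at or below mid */
            else
                lo = mid;
        }
        return 0.5 * (lo + hi);
    }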
We wanted to compare the split-merge version of the singular value algorithm against another version that uses a different eigenvalue solver. As discussed previously, different versions of the bisection algorithm have been used in both sequential (DSTEBZ in LAPACK [55]) and parallel systems [94] to find the eigenvalues of symmetric tridiagonal matrices. The algorithm is based on the evaluation of Sturm's sequence [80] at a particular value. Sturm's sequence determines the number of eigenvalues that are smaller than the point where the sequence is evaluated. Given an initial interval that is known to contain the eigenvalue of interest, Sturm's sequence is evaluated at the middle of the interval to determine which half of the original interval contains the desired eigenvalue. The process is repeated in the manner of a binary search until a small enough interval has been obtained.

A particular implementation of bisection starts by finding an interval containing all the eigenvalues using the Gerschgorin theorem [96], which states that every eigenvalue λ_i of a complex matrix A lies inside a disk in the complex plane centered at a_{ii} with radius Σ_{j≠i} |a_{ij}| [109]. The theorem applies to real symmetric matrices as well, but in that case the eigenvalues lie inside an interval on the real line. Each eigenvalue is then obtained, independently and in parallel, by bisecting that interval with the same number of steps. This version of bisection is very appropriate for parallelization: once the initial data has been broadcast to the participating nodes, each node can work on its part of the problem without any communication with other nodes, except for reporting the final results to the initiating node.

Demmel, Heath, and van der Vorst [110] note that it is possible to use a different number of iterations for different eigenvalues, since tightly clustered eigenvalues require more iterations. However, this approach might in turn require load balancing to keep the work evenly distributed among the processors, and load balancing requires communication.

Our experiments in finding the singular values of matrices with very small singular values revealed that obtaining high relative accuracy indeed required a variable number of iterations for each singular value. Therefore we use a load balancing approach, similar to the one used in the parallel split-merge algorithm, in the implementation of the bisection-based singular value solver. One process serves as a coordinator, distributing small numbers of singular values to the rest of the processes to try to ensure that all processes finish at approximately the same time.

Let a and b be the lower and upper limits of the interval containing the singular value being calculated. In order to be consistent with the relative accuracy goal employed in the split-merge version of the algorithm, the stopping criterion used in the bisection implementation reported here was set as

\[ b - a \le 3n\epsilon\, b, \]

where n is the size of the matrix and ε is the machine precision.

7.2 MPC SVD Study

Two versions of the new singular value algorithm were implemented on the nCUBE-2. They differ in the method used to find the eigenvalues of the symmetric tridiagonal matrices. One uses the split-merge algorithm and the other uses the bisection-based routine described above.
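For concreteness, the core of such a bisection routine (a Sturm-sequence count plus interval halving with the relative stopping test (b - a) <= 3nεb) can be sketched as follows. This is a simplified, sequential sketch of ours; the function names are not those of the dissertation's implementation.

    import numpy as np

    def sturm_count(d, e, x):
        # Number of eigenvalues smaller than x of the symmetric tridiagonal matrix
        # with diagonal d[0..m-1] and off-diagonal e[0..m-2] (standard Sturm/LDL^T count).
        count = 0
        q = d[0] - x
        if q < 0.0:
            count += 1
        for i in range(1, len(d)):
            if q == 0.0:
                q = np.finfo(float).tiny      # guard against division by zero
            q = (d[i] - x) - e[i - 1] ** 2 / q
            if q < 0.0:
                count += 1
        return count

    def bisect_eigenvalue(d, e, k, lo, hi, n):
        # k-th smallest eigenvalue (k = 1, 2, ...), assumed to lie in [lo, hi] with hi > 0;
        # n is the order of the original matrix, used in the relative stopping test.
        eps = np.finfo(float).eps
        while (hi - lo) > 3.0 * n * eps * hi:
            mid = 0.5 * (lo + hi)
            if sturm_count(d, e, mid) >= k:
                hi = mid                      # at least k eigenvalues lie below mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    d = np.array([2.0, 2.0, 2.0]); e = np.array([1.0, 1.0])
    # Gerschgorin bounds: all eigenvalues lie in [min(d) - 2, max(d) + 2] = [0, 4].
    print(bisect_eigenvalue(d, e, k=1, lo=0.0, hi=4.0, n=3))   # about 0.5858

In the parallel version described above, each worker would apply such a kernel to the eigenvalue indices it receives from the coordinator.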
Both algorithms possess the ability to find only the eigenvalues that lie within a given interval, without requiring the calculation of all eigenvalues of the matrix. Recall that in the new algorithm, only the eigenvalues of B^T B and T that lie in certain ranges, defined by the threshold, are of interest. It is not necessary to calculate all the eigenvalues of both matrices. The parallel implementation of split-merge that was described earlier and reported in [111] calculated all the eigenvalues of the input matrix. Additional parameters were added to that implementation in order to define the interval of interest. For every submatrix in the "task tree" depicted in Figure 3.3, only the eigenvalues in the appropriate range are calculated. The range of eigenvalues of interest is determined by evaluating Sturm's sequence at the endpoints of the interval. Recall that Sturm's sequence returns the number of eigenvalues that are smaller than a certain value. The rest of the eigenvalues are set to values slightly above or below the limits of the interval, depending on their position with respect to the interval.

The load balancing task is more complicated in this version of split-merge, since the coordinator does not dispense additional eigenvalues to the workers from a continuous sequence of eigenvalues but from a set of discontinuous intervals. Consider Figure 7.5, where every solid line represents all the eigenvalues of a given matrix. In the original implementation, every eigenvalue of every submatrix is calculated; that is, the entire solid line of eigenvalues would be computed. In this version, however, we are interested only in those eigenvalues in a particular interval, and those eigenvalues are only part of the entire set; in Figure 7.5 they are represented by the dotted lines. As before, the eigenvalues calculated in the smaller matrices become initial approximations for computing the eigenvalues of the larger matrices.

The task of finding which eigenvalues are of interest in every submatrix was assigned to the "sink" nodes at every stage. After a "sink" has merged the results of two smaller submatrices, it finds the range of eigenvalues that needs to be calculated and sends this information to the coordinator. In the nCUBE implementation, this required sending an additional message to the coordinator from the "sinks" in the gather operation. A problem arose from the fact that two messages were being sent from every "sink" to the coordinator: the communication subsystem of the nCUBE was not able to handle more than 3 "sinks" (the case when 8 nodes were being used). Hence, it was necessary to "sequentialize" the messages from the sinks to the coordinator: every sink waits for a message from the coordinator before sending its submatrix and the range of eigenvalues to calculate.

Figure 7.5. Representation of the intervals of eigenvalues of interest

Since this algorithm uses the split-merge algorithm as a subroutine, the size of the workloads dispensed by the coordinator has a significant impact on the performance of the program, even more so in this SVD algorithm, since this version of split-merge finds only the eigenvalues in a particular range. For the purpose of comparison, the singular value algorithm was also implemented using the bisection-based eigenvalue algorithm described in Section 7.1. Both versions of the program were executed on input matrices of different sizes to examine the effect of the size of the matrix.
Figure 7.6 shows the execution times for the two algorithms on random matrices of sizes 2048 and 4096. It can be observed that the split-merge based version of the algorithm is faster than the bisection based one. The advantage holds even for 64 processors, although it is not as large when a large number of processors is used.

Figure 7.6. Comparison of the two versions of the SVD program on an nCUBE-2: execution time (in seconds) versus number of processors, for matrices of order (a) 2048 and (b) 4096.

7.3 Cluster Implementation

The program was also implemented on the cluster of Sparc-10s described in the previous chapter. The implementation on the cluster was very sensitive to the size of the workload. As discussed previously, if the workloads are too small, then the workers have to request more work frequently, each time incurring the overhead of a waiting period. There is again the risk of using workloads that take very little processing time, thus increasing the response time from the coordinator, which compounds the problem of the frequent requests. Larger workloads imply less waiting, but they increase the risk of load imbalance. Specifically, a processor may receive a set of eigenvalues that requires a large number of iterations, causing this processor to continue working after other processors have become idle.

This issue is illustrated in Figure 7.7, which shows the execution times of the singular value program on a particular random matrix of order 2048, executed on a cluster with 8 processors. The x axis represents the different sizes of the workloads dispensed by the coordinator, while the y axis represents the execution times.

As described in the previous chapter, in order to better understand the effects of load imbalance, the executions of the program were instrumented to generate trace files compatible with the ParaGraph visualization tool [81]. The traces in Figure 7.8 show the execution of the program with different sizes of workloads, on the same matrix used to produce Figure 7.7. All the traces use the same scale; every unit on the x axis represents 50 milliseconds, and the levels on the y axis represent the different processors. The shaded areas represent processor activity, the blank areas inactivity. Different shades correspond to the stages of the algorithm. Recall that node 0 works as a coordinator in the last d = log2(p) stages for every matrix, where p is the number of processes in the execution of the program.

Figure 7.7. Execution times (in seconds) as a function of the workload size.

Notice that the algorithm was implemented with two consecutive calls to the split-merge eigenvalue solver. The first call finds some of the eigenvalues of the matrix T, which has twice the size of the original matrix. The second call calculates the rest of the singular values by finding another subset of eigenvalues of the matrix B^T B. Hence, both calls are represented in the traces. The first twelve stages correspond to solving some of the eigenvalues of T, and the next eleven stages to the calculation of the eigenvalues of B^T B.
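The stage counts just quoted follow from the matrix orders if, as we read the algorithm, the split-merge recursion bottoms out at 2 x 2 submatrices, so that a matrix of order m is processed in log2(m) stages (this reading is an assumption on our part):

    import math

    n = 2048
    stages_T   = int(math.log2(2 * n))   # T has order 2n = 4096 -> 12 stages
    stages_BtB = int(math.log2(n))       # B^T B has order 2048  -> 11 stages
    print(stages_T, stages_BtB)          # 12 11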
In this particular matrix, only 117 of the singular values needed to be calculated using matrix T, and therefore the stages corresponding to B^T B take much longer.

Figure 7.8(a) corresponds to a workload size of 2 eigenvalues. Numerous "gaps" of inactivity can be observed in the traces of the workers, although the workers finish at approximately the same time in the distributed stages. A similar situation can be observed in Figure 7.8(b), which corresponds to a workload size of 4. The execution with a workload size of 20 is shown in Figure 7.8(c): the workers are busy most of the time, and the distributed stages finish at approximately the same time. The gaps in the traces represent idle time and, as expected, the smaller the gaps, the better the execution time.

These traces also confirm the discussion about the response time of the server in Chapter 6. A detailed observation of Figure 7.8(a) reveals that the gaps of inactivity are most prominent in stage 9. The size of the submatrices being solved in stage 9 is 512. For a workload size of only 2, the average time to solve such a workload in matrices of this size is around 7000 microseconds, only slightly above the 5000 microseconds of the service time. This is further compounded by having 7 workers. As a result, the response time from the server dominates, and the workers are kept idle most of the time. In the next stage, stage 10, the average time for solving a chunk increases to around 15000 microseconds, and as a result the response time diminishes. This results in a better utilization of the workers. It can be observed that stage 9 takes almost as long as stage 10, even though the amount of computational work in stage 10 is far larger than that in stage 9. Finally, in stage 11, the last stage, the average time to solve a workload increases again, to close to 30000 microseconds; the congestion at the coordinator diminishes, and the gaps of inactivity are smaller. Observe that when the size of the workload is 2, the time spent in stages 9 and 10 is significantly larger than with any other size. This can be attributed to the congestion at the server.

This phenomenon suggests using different workload sizes for the different matrices, and perhaps even at different stages. A small improvement was obtained by setting the workload size for the matrix T to 8 and the size for the matrix B^T B to 16: the execution time decreased to 13.7 seconds (from 14.3 seconds with sizes 20 and 20). It is possible to determine in advance how many singular values will be generated from the matrix T and how many will be generated from the matrix B^T B. In the current implementation the size of the workload is a user-determined parameter, but it would be possible to adjust the value at run time in order to divide the work more evenly among the processes.

Figure 7.8. Traces of the program with different workload sizes: (a) workload size 2, (b) workload size 4, (c) workload size 20.
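The congestion effect described for stage 9 can be approximated with a crude bound of our own (it is not the queueing model of Chapter 6): once the coordinator saturates, it can hand out at most one chunk per service time, so the fraction of time a worker can be kept busy is roughly the chunk compute time divided by the product of the number of workers and the service time.

    def worker_utilization(chunk_time_us, service_time_us, workers):
        # Crude upper bound on per-worker utilization when the coordinator is saturated.
        return min(1.0, chunk_time_us / (workers * service_time_us))

    for stage, chunk_us in [(9, 7000.0), (10, 15000.0), (11, 30000.0)]:
        u = worker_utilization(chunk_us, service_time_us=5000.0, workers=7)
        print(f"stage {stage}: approximate worker utilization {u:.0%}")
    # stage 9: 20%, stage 10: 43%, stage 11: 86% -- consistent with the gaps in Figure 7.8(a).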
An additional factor has to be considered: the number of singular values that are calculated from each of the matrices B^T B and T. This number depends on the particular matrix, not on the algorithm. If the number of singular values calculated from either of the two matrices is very small, then it might be necessary to use smaller chunks to guarantee that all processors are active. For example, assume that the singular values of a matrix of size 2048 are being calculated on 8 processors, and that 120 of the singular values are calculated by finding the eigenvalues of T while the remaining 1928 are found using B^T B. If one uses a workload size of 32, then in the distributed stages of calculating the singular values of T only 4 of the 7 workers will be active, diminishing the efficiency of the algorithm.

Notice that the number of broadcast operations for a given matrix is double that of the split-merge algorithm, precisely because the split-merge algorithm is called twice. That is, given a tridiagonal matrix of size n, if we use p processes to find its eigenvalues, then log2(p) broadcasts take place during the execution of the algorithm; given a bidiagonal matrix of the same size and the same number of processes, 2 log2(p) broadcasts take place. Thus, the performance of the parallel implementations of this algorithm is even more sensitive to the complexity of the broadcast operations than the split-merge implementations.

The experiments with the eigenvalue solver showed that using an ATM switch and an improved version of the broadcast routine in PVM could improve the performance of the application. Given that this program uses twice as many broadcasts with approximately the same amount of computation, we tested the program in both environments to see the effects on this program. Table 7.1 contains the execution times of the program on random input matrices of sizes 2048 and 4096. The results for both sets of tests were obtained on clusters of 7 workstations. The table shows that using an ATM network provides an advantage over using regular Ethernet. Using the improved, tree-based version of the multicast operation [41] also improves the performance of the program.

                 Ethernet                        ATM
Matrix Size      Reg. Bcast    Improved Bcast    Reg. Bcast    Improved Bcast
2048             9.85          8.24              8.49          7.28
4096             33.60         31.30             32.97         30.79

Table 7.1. Execution times (in seconds) on a cluster with different broadcast algorithms and different interconnecting networks

Figure 7.9. Execution times for random matrices of different sizes on a cluster of workstations: execution time (in seconds) versus number of processors for the bisection-based and split-merge-based versions, for matrices of order (a) 2048, (b) 4096, (c) 8192, and (d) 16384.

7.4 A Model of the Performance of the SVD Algorithm

The split-merge based implementation of the SVD algorithm behaves very similarly to the split-merge algorithm reported in the previous chapter. The main difference is that in a given call to the split-merge procedure, only part of the eigenvalues are required. This in turn can lead to executions of the program in which the workload size is such that some of the processes are idle in the distributed stages. The size of the matrix whose eigenvalues are being found also varies, since the matrix T has order 2n.
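The worker-starvation example above (120 singular values from T, chunks of 32, 7 workers) can be checked with a few lines. This small count is our own illustration and ignores the fact that chunks are dispensed over several stages; a chunk size of at most n_T/(p-1), rounded down, guarantees at least p-1 chunks, which is consistent with the workload-size adjustment used in the model below.

    import math

    n_T, c, p = 120, 32, 8                 # values from the example above
    chunks = math.ceil(n_T / c)            # 4 chunks in total
    active = min(p - 1, chunks)            # -> only 4 of the 7 workers ever receive work
    c_keep_busy = n_T // (p - 1)           # -> 17: a chunk size that keeps all 7 workers busy
    print(active, c_keep_busy)             # 4 17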
Let n_T be the number of singular values that are extracted from the matrix T and let n_{B^T B} be the number of singular values that are extracted from the matrix B^T B. Clearly, n_T + n_{B^T B} = n, and their values depend on the particular matrix.

Consider first the call to split-merge that finds the singular values using T. Because the order of T is 2n, the total number of stages is incremented by one. The number of distributed stages remains constant (it depends on the number of processes), so there is one additional local stage. The number of eigenvalues that are calculated in a given submatrix along the process varies. For example, assume that the matrix T is of order 128 and contains 20 singular values. In the last stage of the split-merge call for T we will be finding 20 singular values; in the previous stage, though, we will be finding eigenvalues of 2 submatrices of size 64 each, and it is not possible to know in advance how many will come from each: 10 might come from one and 10 from the other, or 1 from one and 19 from the other. Let n_{p,i} denote the number of eigenvalues calculated by processor p at stage i, where 0 <= n_{p,i} <= n_T. The expression for the local stages in the call for T is

\[ \max_{\text{all } p} \left( \frac{\sum_{i=1}^{k-d+1} \sum_{j=1}^{n_{p,i}} iter(j,i)\, 2^i \cdot 16}{s_p} \right). \qquad (7.2) \]

Previously we discussed the need to replace the value of the initial workload size c_i with a more appropriate value to avoid leaving some workers idle. If this situation arises, a reasonable option is to replace the user-supplied value of c_i with ⌊n_T/(p-1)⌋. Let c_i' be the value that will be used during the execution of the program:

\[ c_i' = \begin{cases} \lfloor n_T/(p-1) \rfloor & \text{if } c_i\,(p-1) > n_T, \\ c_i & \text{otherwise.} \end{cases} \]

The expression for the distributed stages in the call for T is

\[ \sum_{i=k-d+2}^{k+1} \left[\, G() + \max_{\text{all workers}} \left( B() + \frac{\sum_{j=1}^{n_{p,i}} iter(j,i)\, 2^i \cdot 16}{s_p} + req_i \cdot W() \right) + S() \,\right], \]

where G(), B(), S(), W(), s_p, and req_i are the gather, broadcast, scatter, response-time, processor-speed, and request-count terms of the model in the previous chapter. For the call to split-merge that finds the eigenvalues of B^T B, the expression for the local stages is

\[ \max_{\text{all } p} \left( \frac{\sum_{i=1}^{k-d} \sum_{j=1}^{n_{p,i}} iter(j,i)\, 2^i \cdot 16}{s_p} \right) \qquad (7.3) \]

and for the distributed stages

\[ \sum_{i=k-d+1}^{k} \left[\, G() + \max_{\text{all workers}} \left( B() + \frac{\sum_{j=1}^{n_{p,i}} iter(j,i)\, 2^i \cdot 16}{s_p} + req_i \cdot W() \right) + S() \,\right]. \]

7.5 Related Work

In the realm of sequential SVD algorithms, the current method of choice is to reduce the original matrix to bidiagonal form and then find its singular values using an algorithm based on QR with zero shift [112]. The QR algorithm has been effectively parallelized when both eigenvalues and eigenvectors are needed [93], but if only the eigenvalues are required, QR is difficult to parallelize efficiently; thus other algorithms have been proposed for finding singular values on parallel machines.

Jacobi methods operate directly on the original matrix, that is, there is no reduction of the original matrix to bidiagonal form. Bischof [113] presented an algorithm based on the two-sided Jacobi method. The matrix is partitioned in blocks among the processors, and the process proceeds in sweeps. In every sweep, every processor works on a group of 4 blocks, say blocks (i,i), (i,j), (j,j), and (j,i). A transformation is applied to the entire group to make it block diagonal, that is, the norms of the blocks (i,j) and (j,i) diminish. Once a group has been transformed, all processors exchange the updates done locally so that they can be applied globally. Eventually the matrix becomes block diagonal, and the SVD is then computed by finding the SVD of each diagonal block.
Bischof's implementation was done on the LCAP-1, an array of ten processors interconnected via global memory. Ewerbring and Luk [114] implemented Jacobi-based singular value solvers on a CM-2, a SIMD machine. Zhou and Brent [115] are implementing a version of the Jacobi algorithm on a CM-5 that orders the sweeps to take advantage of the architecture of that machine. Lee et al [116] also report an implementation of Jacobi's method on a CM-5.

Cuppen's divide-and-conquer algorithm [98] for finding the eigenvalues of symmetric tridiagonal matrices has been used to find SVDs as well. Arbenz [117] applied Cuppen's algorithm to the matrix T. The matrix T is not solved explicitly, though; instead two matrices, B_1 and B_2, are formed (both smaller than B). The singular values of each submatrix can be found in parallel and later recombined. This process is repeated recursively until the submatrices are small enough that it is more efficient to find their singular values with an EISPACK routine (svd). Arbenz reports an implementation of his algorithm on a sequential machine. Jessup and Sorensen [118] have also developed a parallel SVD algorithm based on divide-and-conquer. They apply the divide-and-conquer algorithm to BB^T and B^T B, but they do not form these products explicitly, in order to avoid numerical difficulties. Their program was implemented on a Sequent Symmetry and on an Alliant FX/8. Gu and Eisenstat [119] use an approach similar to that of Jessup and Sorensen for computing the singular values, but a different approach for calculating the singular vectors; they do not report performance results from an implementation on a parallel machine. The algorithms of Arbenz, Jessup and Sorensen, and Gu and Eisenstat calculate both singular values and singular vectors and work on bidiagonal matrices.

7.6 Summary of Singular Value Study

The algorithm presented in this chapter provides the user with the speed advantage of finding the eigenvalues of B^T B and the accuracy of calculating the eigenvalues of T. The analysis performed by Li et al [108] to derive the expression for the threshold that dictates which singular values to find in each matrix could be adapted to the precision characteristics of other eigenvalue solvers for symmetric tridiagonal matrices.

In the work reported in this chapter, two implementations were studied: one based on the split-merge algorithm and one based on a bisection algorithm. Both implementations used the same load-balancing technique: a central coordinator that distributes work among the rest of the nodes.

The performance of the split-merge based implementation of the algorithm is sensitive to the size of the workload. The fact that the number of singular values calculated from one of the matrices can be small may require the program to override the workload size provided by the user in order to guarantee that all processors are active. The number of gather, scatter, and broadcast operations is double that of the split-merge implementation, and hence the performance of the program is even more sensitive to the implementation of the broadcast and gather primitives.

The implementation based on the split-merge algorithm achieved better performance than the one based on bisection, even though the communication and synchronization requirements of the split-merge version are considerably higher. This was true in both environments, the nCUBE-2 and the cluster of workstations.
CHAPTER 8

Conclusions and Future Work

Parallel computing is becoming more widely used as the software and hardware technologies involved mature, offering better performance and ease of use than in the past. Significant research is being conducted on how to solve a variety of problems efficiently in parallel, including numerical scientific applications. Clusters of workstations are becoming a cost-effective platform for parallel computing, allowing many more organizations to explore and take advantage of parallel computing. At the same time, more powerful, scalable distributed-memory parallel systems are appearing.

The research described in this dissertation has produced the following contributions: a study of the effects of collective communication, in particular of different implementations of broadcast and multicast, with emphasis on their impact on scalability on both MPCs and clusters; a comparison of MPC and cluster environments, the effects of communication costs in each, and methods to accommodate them in algorithm design, such as load balancing and hiding communication costs; and case studies of the proposed techniques in new highly parallel numerical algorithms.

In both environments, clusters and MPCs, efficient implementations of collective communication operations can be used to improve performance, especially when large numbers of processors are used. Hence, efficient collective communication is required for good scalability; otherwise the increased cost of communication eventually defeats the gains made by adding processing elements.

The research described in this dissertation has highlighted the effects of interprocess communication costs on the design and implementation of parallel numerical algorithms. A cluster of workstations has proven to be an effective alternative to a particular MPC, in spite of the much higher communication costs in the cluster. The characteristics of the cluster led to new approaches to application structure, for instance, doubling the number of processes on each node to mask the effect of communication. In the particular case of Ethernet-based clusters, it is necessary to improve the implementation of the broadcast primitive in order to take advantage of the medium. For these particular algorithms, the gather, scatter, and load balancing operations pose an obstacle to scalability. As manufacturers offer new alternatives for interconnecting clusters, such as switch-based LANs, there will be opportunities to implement collective communication more efficiently than on a shared medium. Gather, scatter, and load balancing can be implemented more efficiently on a switch-based cluster, and hence the scalability of the algorithms, according to our model, should be significantly better.

This research has concentrated on broadcasting and multicasting, as well as load balancing, and partially on gather and scatter, but a number of other primitives are equally important to other algorithms, and their efficient implementation in clusters needs to be studied.

The alternative broadcast primitives that were explored in conjunction with PVM provided an initial approach to a more efficient implementation of broadcast in clusters. The implementation and study of the SVD algorithm provided a better understanding of communication and dynamic load balancing when the amount of work varies across the stages of the algorithm.
The experiments on the ATM and DEC clusters contrasted the performance of new-generation clusters with different characteristics. The model indicates that switch-based clusters should be scalable, thanks to the possibility of simultaneous communication among different pairs of nodes and, hence, better implementations of collective operations.

Load balancing proved to be very effective in improving the performance of the parallel versions of the algorithms. In these algorithms, where the memory requirements are modest, replicating certain data across all the nodes involved in order to allow load balancing can be cost-effective. The response time of the server depends on the ratio between the expected service time and the expected time between requests. The performance also depends on the size of the workloads: very small workloads can cause congestion at the server, while large workloads may cause load imbalance.

Further research can be done in several areas of the parallel implementation: exploring heuristics to assign the sizes of the workloads automatically; replacing the distributed sorting with sorting performed entirely at the coordinator, to avoid all gather and scatter operations except one gather; and using a hierarchy of coordinators, which might be necessary if larger numbers of nodes are to be used.

Both applications, the eigenvalue solver and the singular value solver, will benefit from research on better communication protocols for clusters of workstations. In particular, in the Ethernet environment, a reliable multicast protocol that allowed large messages would be very useful. On ATM-based clusters, the availability of efficient collective communication primitives at the user level will simplify the programmer's task and improve the performance of the program.

The insight gained from the parallelization in a distributed-memory environment should be useful in parallelizing the programs on a shared-memory multiprocessor or on a distributed-memory machine that offers virtual shared memory. The fact that the number of Laguerre iterations is variable will help a programmer in that environment to use the appropriate directives for the parallel compiler. The results obtained can also be used in the efficient implementation of other parallel algorithms.

BIBLIOGRAPHY

[1] G. Bell, "Ultracomputers - a teraflop before its time," Communications of the ACM, vol. 35, pp. 27-47, August 1992.

[2] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, pp. 62-76, February 1993.

[3] A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. Sunderam, "Graphical development tools for network-based concurrent supercomputing," in Proceedings of Supercomputing '91, IEEE CS Press, 1991.

[4] A. Agarwal, "Limits on interconnection network performance," IEEE Transactions on Parallel and Distributed Processing, vol. 2, pp. 398-412, October 1991.

[5] W. Dally, "Performance analysis of k-ary n-cube interconnection networks," IEEE Transactions on Computers, vol. 39, pp. 775-785, June 1990.

[6] C. E. Leiserson et al., "The network architecture of the Connection Machine CM-5," in Proceedings of SPAA '92, 1992.

[7] C. J. Glass and L. M. Ni, "The turn model for adaptive routing," in Proceedings of the 19th International Symposium on Computer Architecture, pp. 278-287, 1992.

[8] D. H. Linder and J. C.
Harden, “An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes,” IEEE Transactions on Computers, vol. 40, pp. 2—12, January 1991. [9] D. C. Grunwald and D. A. Reed, “Networks for parallel processors: Measure- ments and prognostications,” in Proceedings of the Third Conference on Hy- percube Conncurrent Computers and Applications, vol. 1, pp. 610—619, ACM, January 1988. [10] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed Computing, vol. 1, no. 3, pp. 187-196, 1986. [11] W. Dally, “Virtual-channel flow control,” IEEE Transactions on Parallel and Distributed Processing, vol. 3, pp. 194—205, March 1992. 125 126, [12] R. Duzett and R. Buck, “An overview of the nCUBE-3 supercomputer,” in Proc. Frontiers ’92: The 5th Symposium on the Frontiers of Massively Parallel Computation, October 1992. [13] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Flyer, “The message-driven processor: A mul- ticomputer processing node with efficient mechanisms,” IEEE Micro, pp. 23—39, April 1992. [14] NCUBE Company, NCUBE 6'4 00 ProceSsor Manual, 1990. [15] Intel, Supercomputing, Systems, and Division, Intel Touchstone Delta Systems Description. Advanced Information, Intel Corporation, 1991. [16] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Sizovic, C. S. Steele, and W. K. Su, “The architecture and programming Of the Ametek Series 2010 multicomputer,” in Proceedings of the Third Conference on Hypercube Concur- rent Computers and Applications, Volume I, (Padadena, CA), pp. 33-36, ACM, January 1988. [17] W. Myers, “Supercomputing 92 reaches down to the workstation,” IEEE Com- puter, vol. 26, January 1993. [18] H. T. Kung, R. Sansom, S. Schlick, P. Steenkiste, M. Arnould, F. Bitz, F. Chris- tianson, E. Cooper, O. Menzilcioglu, D. Ombres, and B. Zill, “Network-based multicomputers: An emerging parallel architecture,” in Proceedings of Super- computing ’.91, pp. 664—673, 1991. [19] W. Gropp and E. Lusk, “Users guide for the ANL IBM SP1 (draft),” Tech. Rep. AN L/ MCS-Tm-OO, Argonne National Laboratory, March 1994. [20] A. Hac and H. B. Mutlu, “Synchronous Optical network and broadband ISDN protocols,” IEEE Computer, vol. 22, November 1989. [21] E. Biagioni, E. Cooper, and R. Sansom, “Designing a practical ATM LAN,” IEEE Network, pp. 32-39, March 1993. [22] D. Banks and M. Prudence, “A high-performance network architecture for a PA-RISC workstation,” IEEE Journal on Selected Areas in Communications, vol. 11, pp. 191—202, February 1993. [23] C. Chang, R. Flower, J. Forecast, H. Gray, W. R. Hawe, K. K. Ramarishnan, A. P. Nadkarni, U. N. Shikarpur, and K. M. Wilde, “High-performance TCP/IP and UDP/IP networking in DEC OSF/l for Alpha AXP,” Digital Technical Journal, vol. 5, no. 1, 1993. [24] C. C. Huang and P. K. McKinley, “Efficient collective communication on ATM networks,” In preparation, 1993. - [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] 127, H. T. Kung, P. Steenkiste, M. Gubitoso, and M. Khaira, “Parallelizing a new class of large applications over high-speed networks,” in Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallle Program- ming (PPoPP), 1991. S. E. Deering and D. R. Cheriton, “Multicast routing in datagram internetwork and extended LAN 3,” ACM Transactions on Computer Systems, vol. 8, pp. 85— 110, May 1990. C.-T. Ho and M. T. 
Raghunath, “Efficient communication primitives on hyper- cubes,” Journal of concurrency: Practice and Experience, vol. 4, pp. 427-457, September 1992. H. Xu, P. K. McKinley, and L. M. Ni, “Efficient implementation Of barrier synchronization in wormhole-routed hypercube multicomputers,” Journal of Parallel and Distributed Computing, vol. 16, pp. 172-184, 1992. Y. Saad and M. H. Schultz, “Data communication in hypercubes,” Journal of Parallel and Distributed Computing, vol. 6, pp. 115—135, 1989. W. D. Hillis and G. L. Steele Jr, “Data parallel algorithms,” Communications of the ACM, vol. 29, December 1986. S. L. Johnsson and C. T. HO, “Optimum broadcasting and personalized commu- nication in hypercubes,” IEEE Transactions on Computers, vol. 38, pp. 1249— 1268, September 1989. C.-T. HO, “Optimal broadcasting on SIMD hypercubes without indirect ad- dressing capability,” Journal of Parallel and Ditributed Computing, vol. 13, no. 2, pp. 246-255, 1991. M. Barnett, R. Littlefield, D. G. Payne, and R. van de Geijn, “Efficient commu- nication primitives on mesh architectures with hardware routing,” in Proceed- ings of the sixth SIAM conference on Parallel Processing, pp. 943—948, 1993. M. Barnett, R. Littlefield, D. G. Payne, and R. van de Geijn, “Global com- bine on mesh architectures with wormhole routing,” in Proceedings of the 7th International Parallel Processing Symposium, pp. 156-162, 1993. Y. Lan, A. H. Esfahanian, and L. M. Ni, “Multicast in hypercube multipro- cessors,” Journal of Parallel and Distributed Computing, pp. 30—41, January 1990. P. K. McKinley, H. Xu, A. Esfahanian, and L. M. Ni, “Unicast-based multicast communications in wormhole—routed networks.” Accepted to appear in IEEE Transactions on Parallel and Distibuted Systems. P. K. McKinley and C. Trefftz, “Efficient broadcast in all-port wormhole—routed hypercubes,” in Proceedings of the 19.93 International Conference on Parallel Processing, vol. II, pp. 288-291, 1993. 128. [38] D. F. Robinson, D. L. Judd, P. K. McKinley, and B. H. C. Cheng, “Efficient collective data distribution in all-port wormhole-routed hypercubes,” in Pro- ceedings of Supercomputing ’93, pp. 792—801, Nov. 1993. [39] X. Lin, P. K. McKinley, and L. M. Ni, “Performance evaluation Of multicast wormhole routing in 2D-mesh multicomputers.” Accepted to appear in IEEE Transactions on Parallel and Distibuted Systems. [40] A. S. Tanenbaum, F. Kaashoek, and H. Bal, “Parallel programming using shared objects and broadcasting,” IEEE Computer, vol. 25, August 1992. [41] C. Huang, E. P. Kasten, and P. K. McKinley, “Design and implementation Of multicast Operations for ATM-Based high performance computing,” Tech. Rep. MSU-CPS-94-22, Michigan State University, March 1994. [42] V. Sunderam, “PVM: A framework for parallel distributed computing,” Con- currency: Practice and Experience, vol. 2, December 1990. [43] J. Boyle, R. Butler, T. Disz, B. Glickfield, E. Lusk, and R. Overbeek, Portable Programs for Parallel Processors. Holt, Rinehart and Winston, 1987. [44] P. K. McKinley, H. Xu, E. Kalns, and L. M. Ni, “Efficient communication ser- vices for scalable architectures,” in Proceedings of Supercomputing ’92, pp. 478— 487, 1992. [45] J. G. Mills, L. J. Clarke, and A. S. Trew, “CHIMP concepts,” Tech. Rep. EPCC-KTP-CHIMP-CON Version 1.2, Edinburgh Parallel Computing Cen- tre,The University of Edinburgh, April 1991. [46] A. Skjellum and A. P. 
Leung, “Zipcode, a portable multicomputer communi- cation library atop the reactive kernel,” in Proceedings of the Fifth Distibuted Memory Computing Conference, pp. 767—776, IEEE Press, April 1990. [47] J. Flower, A. Kolawa, and S. Bharadwaj, “The Express way to distributed processing,” Supercomputing Review, pp. 54-55, May 1991. [48] A. Kolawa and J. Flower, Software Engineering Productivity, ch. The Impact of Parallel Processing. Elsevier Science Publishers B.V.(North Holland), 1992. [49] G. Geist, M. Heath, B. Peyton, and P. Worley, “A machine-independent com- munication library,” in Proceedings of the fourth Conference on Hypercubes, Concurrent computers, and Applications, 1990. [50] N. Carriero and D. Galernter, “Linda in context,” Communications of the ACM, vol. 32, April 1989. [51] Message Passing Interface Forum, “The MP1 draft, message-passing standard,” in preparation, 1993. 129, [52] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler, Matrix Eigensystem Routines - EISPACK Guide. Lecture Notes in Computer Science, Berlin: Springer Verlag, 2nd ed., 1976. [53] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPA CK User’s Guide. Philadelphia, PA: SIAM, 1979. [54] J. Dongarra, “Performance of various computers using standard linear soft- ware,” tech. rep., Mathematical Sciences Section, Oak Ridge National Labora- tory, 1993. [55] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. D. Croz, A. Green- baum, S. Hammarling, A. McKenney, and D. Sorensen, LAPA CK User’s Guide. Philadelphia PA: SIAM, 1992. [56] J. J. Moré, D. C. Sorensen, B. S. Garvow, and K. E. Hillstrom, Sources and Development of Mathematical software, ch. The MIN PACK project, pp. 88—111. Englewood Cliffs NJ: Prentice Hall, 1984. [57] E. Piessens, E. deDoncker Kapenga, C. W. Uberhuber, and D. K. Kahaner, QUADPACK A Subroutine Package for Automatic Integration. New York: Springer Verlag, 1983. [58] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, “Basic linear algebra subpro- grams for FORTRAN usage,” ACM Trans. Math. Software, vol. 5, pp. 308-323, 1979. [59] J. Dongarra, J. Du-Croz, S. Hammarling, and R. Hanson, “An extended set of fortran basic linear algebra subprograms,” ACM Trans. Math. Software, vol. 14, pp. 1-17, 1988. [60] J. Dongarra, J. Du-Croz, S. Hammarling, and I. Duff, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Software, vol. 16, pp. 1—17, 1990. [61] J. ChOi, J. J. Dongarra, R. Pozo, and D. W. Walker, “ScaLAPACK: A scal- able linear algebra library for distributed memory concurrent computers,” in Proceedings, Fourth Symposium on the Frontiers of Massively Parallel Compu- tation, pp. 120—127, IEEE Computer Society Press, 1992. [62] E. Anderson, A. anoni, J. Dongarra, S. Moulton, S. Ostrouchov, B. Tourancheau, and R. van de Geijn, “LAPACK for distributed memory achi- tectures: progress report,” in Proceedings of the Fifth SIAM Conference on Parallel Processing, 1991. [63] J. Dongarra, R. van de Geijn, and R. Whaley, “Two dimensional basic linear algebra communication subprograms,” in Proceedings of the sixth SIAM con- ference on Parallel Processing, pp. 347—352, 1993. [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] 130 R. Arlauskas, “iPSC / 2 system: A second generation hypercube,” in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, pp. 38—42, ACM, January 1988. A. Beguelin, J. J. Dongarra, G. A. Geist, R. Manchek, and V. S. 
Sunderam, “A users’ guide to PVM parallel virtual machine,” Tech. Rep. TM-11826, Oak Ridge National Laboratory, July 1991. J. J. Dongarra, R. Pozo, and D. W. Walker, “ScaLAPACK++: An Object oriented linear algebra library for scalable systems,” in Proceedings of the 1993 Scalable Parallel Libraries Confer ence, pp. 216—223, Oct. 1993. l. D. Scherson and P. F. Corbett, “Communications overhead and the expected speedup of multidimensional mesh-connected parallel processors,” Journal of Parallel and Ditributed Computing, vol. 11, pp. 86-96, 1991. D. Gannon and J. V. Rosendale, “On the impact of communication complexity in the design Of parallel numerical algorithms,” IEEE Transactions on Comput- ers, vol. 33, pp. 1180—1194, December 1984. L. Brochard, “Efficiency of some parallel numerical algorithm on distributed systems,” Parallel Computing, vol. 12, pp. 21—44, 1989. S. L. Johnsson, “Language and compiler issues in scalable high performance,” Tech. Rep. 244, Thinking Machines, 1992. J. Dongarra and R. A. van de Geijn, “Lapack working note 30 - reduction to condensed form for the eigenvalue problem on distributed memory archi- tectures,” Tech. Rep. CS-91-130, University of Tennessee, 1991. To appear in Parallel Computing. G. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in AFIPS Conference Proceedings, pp. 483—485, Thompson Books, 1967. J. L. Gustafson, G. R. Montry, and R. E. Benner, “Development Of parallel methods for a 1024-processor hypercube,” SIAM Journal of Scientific and Sta- tistical Computing, vol. 9, pp. 609—638, July 1988. X. H. Sun and L. M. Ni, “Another view on parallel speedup,” in Proceedings of Supercomputing ’90, pp. 324-333, IEEE CS Press, 1990. P. H. Worley, “The effect Of time constraints on scaled speedup,” SIAM Journal of Scientific and Statistical Computing, vol. 11, pp. 838-858, September 1990. V. Kumar and V. N. Rao, “Parallel depth—first search, part ii: Analysis,” In- ternational Journal of Parallel Programming, vol. 16, no. 6, pp. 501-519, 1987. 131. [77] T. Y. Li and Z. Zeng, “Laguerre’s iteration in solving the symmetric tridiag- onal eigenproblem - revisited,” SIAM Journal of Scientific Computing, 1993. accepted to appear. [78] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numer- ical Recipes, The Art of Scientific Computing. Cambridge, UK: Cambridge University Press, 1986. [79] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins Uni- versity Press, 2nd. edition ed., 1990. [80] P. G. Ciarlet, Introduction to numerical linear algebra and optimisation, pp. 200—203. Cambridge University Press, 1989. [81] M. Heath and J. Etheridge, “Visualizing performance of parallel programs,” IEEE Software, vol. 8, September 1991. [82] N. G. Shivaratri, P. Krueger, and M. Singhal, “Load distributing for locally distributed systems,” IEEE Computer, vol. 25, pp. 33—44, December 1992. [83] D. L. Eager, E. D. Lazowska, and J. Zahorjan, “Adaptive load sharing in ho- mogeneous distributed systems,” IEEE Transactions on Software Engineering, vol. 12, pp. 662-675, May 1986. [84] D. J. Lilja, “Expliting the parallelism available in loops,” IEEE Computer, pp. 13—26, February 1994. [85] C. P. Kruskal and A. Weiss, “Allocating independent subtasks on parallel pro- cessors,” in Proceedings of the 1984 International Conference on Parallel Pro- cessing, pp. 236-240, 1984. [86] C. PolychronOpoulos and D. 
Kuck, “Guided ' self-scheduling: A practical scheduling scheme for parallel supercomputers,” IEEE Transactions on Com- puters, vol. 36, pp. 1425-1439, December 1987. [87] S. F. Hummel, E. Schonberg, and L. E. Flynn, “Factoring, a practical and robust method for scheduling parallel loops,” in Proceedings of Supercomputing 91, IEEE CS Press, 1991. [88] T. H. Tzen and L. M. Ni, “Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers,” IEEE Transactions on Parallel and Distributed Processing, vol. 4, pp. 87—98, January 1993. [89] V. Kumar and G. Y. Ananth, “Scalable load balancing techniques for parallel computers,” Tech. Rep. TR 91-55, University of Minnesota, November 1991. [90] P.-C. Yew, N.-F. Tzeng, and D. H. Lawrie, “Distributing hot-spot address- ing in large-scale multiprocessors,” IEEE Transactions on Computers, vol. 36, pp. 388—395, April 1987. [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] [102] 132. H. Sullivan and T. R. Bashkow, “A large scale, homogeneous, fully distributed parallel machine,” in Proceedings of the 4th Annual Symp. Comput. Architec- ture, pp. 105—124, March 1977. J. J. Dongarra and D. C. Sorensen, “A fully parallel algorithm for the symmetric eigenvalue problem,” SIAM Journal of Scientific and Statistical Computing, vol. 2, pp. 139-154, March 1987. P. Arbenz, K. Gates, and C. Sprenger, “A parallel implementation of the sym- metric tridiagonal QR algorithm,” in Proceedings of the Fourth Symposium on the Frontiers of Massively Paralllel Computation, IEEE CS Press, 1992. I. C. F. Ipsen and E. R. Jessup, “Solving the symmetric tridiagonal eigenvalue problem on the hypercube,” SIAM Journal of Scientific and Statistical Com- puting, vol. 11, pp. 203—229, March 1990. T. Gregory and D. L. Karney, A Collection of Matrices for Testing Compu- tational Algorithms. Huntington, New York: Robert E. Krieger Publishing Company, 1978. J. H. Wilkinson, The Algebraic Eigenvalue Problem. Oxford: Clarendon Press, 1965. A. Sameh and D. Kuck, “A parallel QR algorithm for symmetric tridiagonal matrices,” IEEE Transactions on Computers, no. C-26, pp. 81-91, 1977. J. J. M. Cuppen, “A divide and conquer method for the symmetric tridiagonal eigenproblem,” Numerische Mathematik, vol. 2, no. 36, pp. 177-195, 1981. K. Gates, “A rank-two divide and conquer method for the symmetric tridiagonal eigenproblem,” in Proceedings of the Fourth Symposium on the Frontiers of Massively Paralllel Computation, IEEE CS Press, 1992. S. S. LO, B. Philippe, and A. Sameh, “A multiprocessor algorithm for the symmetric tridiagonal eigenvalue problem,” SIAM Journal of Scientific and Statistical Computing, vol. 2, pp. 155—165, March 1987. M. Lu and X. Qiao, “Applying parallel computer systems to solve symmetric tridiagonal eigenvalue problems,” Parallel Computing, vol. 18, pp. 1301-1315, 1992. T.-Y. Li, H. Zhang, and X.-H. Sun, “Parallel homotopy algorithm for the sym- metric tridiagonal eigenvalue problem,” SIAM Journal of Scientific and Statis- tical Computing, vol. 12, pp. 469-487, May 1991. [103] T. Y. Li and N. H. Rhee, “Homotopy algorithm for symmetric eigenvalue prob- lems,” Numerische Mathematik, vol. 5, pp. 265-280, 1989. [104] E. P. Kasten and P. K. McKinley, “An efficient multicast extension to pvm,” Tech. Rep. MSU-CPS-94—21, Michigan State University, April 1994. 133. [105] CCITT, “Recommendation i.363,” 1993. [106] A. 0. Allen, Probability, Statisitcs and Queueing Theory with Computer Science Applications. New York, NY: Academic Press, 1978. [107] G. H. Golub and W. 
Kahan, “Calculating the singular values and pesudo- inverse Of a matrix,” SIAM Journal 0n Numerical Analysis Ser. B, vol. 2, no. 2, pp. 205-224, 1965. [108] T. Y. Li, N. H. Rhee, and Z. Zeng, “An efficient and accurate parallel algorithm for the singular value problem of bidiagonal matrices,” Submitted for publica- tion, 1993. [109] G. W. Stewart, Introduction to Matrix Computations. Academic Press, 1973. [110] J. W. Demmel, M. T. Heath, and H. A. van der Vorst, “Parallel numerical linear algebra,” Acta Numerica, pp. 111—197, 1993. [111] C. Trefftz, P. K. McKinley, T. Y. Li, and Z. Zeng, “A scalable eigenvalue solver for symmetric tridiagonal matrices,” in Proceedings of the sixth SIAM conference on Parallel Processing, pp. 602—609, 1993. [112] J. Demmel and W. Kahan, “Accurate singular values Of bidiagonal matrices,” SIAM Journal of Scientific and Statistical Computing, vol. 11, pp. 873-912, September 1990. [113] C. H. Bischof, “Computing the singular value decomposition on a distributed system Of vector processors,” Parallel Computing, vol. 11, pp. 171-186, 1989. [114] L. M. Ewerbring and F. T. Luk, “Computing the singular value decomposition on the Connection Machine,” IEEE Transactions on Computers, vol. 39, no. 9, pp. 152—155, 1990. [115] B. B. Zhou and R. P. Brent, “Parallel computation of the singular value de- composition on tree architectures,” in Proceedings of the 1993 International Conference on Parallel Processing, vol. III, pp. 128—131, 1993. [116] T. J. Lee, F. T. Luk, and D. L. Boley, “Singular value decomposition on a 2-d fat-tree architecture,” in Proceedings of the sixth SIAM conference on Parallel Processing, pp. 358-362, 1993. [117] P. Arbenz, “Divide-and-conquer algorithms for the computation of the SVD of bidiagonal matrices,” Tech. Rep. 170, ETH Ziirich, Institut fiir Wis- senschaftliches Rechnen, Switzerland, 1991. [118] E. R. Jessup and D. C. Sorensen, “A parallel algorithm for computing the singu- lar value decomposition Of a matrix,” Tech. Rep. ANL-MCS-TM-102, Argonne National Laboratory, July 1991. 134. [119] M. Gu and S. C. Eisenstat, “A divide-and-conquer algorithm for the bidiagonal SVD,” Tech. Rep. YALEU/DCS/RR-933, Yale University, December 1992.