This is to certify that the dissertation entitled "Modeling Collective Communication: Design and Performance Evaluation," presented by Yih-jia Tsai, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science at Michigan State University.

MODELING COLLECTIVE COMMUNICATION: DESIGN AND PERFORMANCE EVALUATION

By

Yih-jia Tsai

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1997

ABSTRACT

MODELING COLLECTIVE COMMUNICATION: DESIGN AND PERFORMANCE EVALUATION

By Yih-jia Tsai

As advances in VLSI technology have enabled rapid improvements in commodity microprocessors, the same technology also has changed communication architectures dramatically. A communication subsystem is an integrated combination of both hardware and software. Efficient communication primitives can be achieved only through the careful exploitation of the underlying hardware architecture. Collective communication operations, which involve more than two participants, arise frequently in scientific applications. The primary objective of this work is to explore the design and performance evaluation of collective communication operations. To this end, two models are proposed. The Extended Dominating Node (EDN) model is based on dominating sets in graph theory. The EDN model can be used to design collective operations in systems that support single-step multicast communication. We demonstrate the use of the EDN model on the design of collective communication algorithms for wormhole-routed massively parallel processors (MPPs). The second model, the τ-model, is a generalized communication cost model that facilitates the study and characterization of collective communication operations over a variety of networking technologies.
We show that the τ-model can reveal relationships among algorithms that are not easily discernible through traditional methods.

To my family

ACKNOWLEDGMENTS

First of all, I would like to thank my thesis advisor, Dr. Philip K. McKinley, for his initial encouragement and inspiration on the subject that later led to this dissertation. His constant advice and guidance on the subject matter have been invaluable. I also would like to express great appreciation to all the members of my committee: Dr. Lionel M. Ni, whose insights and first-class research program have always been an inspiring source for me; Dr. Abdol H. Esfahanian, for stimulating my thinking in an initial course on graph theory; and Dr. T. Y. Li, for his very useful discussions on the impact of communication performance on parallel scientific programs. The research projects that these faculty have conducted have served as wonderful examples to me. It is a privilege to be able to study and work in such an exciting environment. In addition, much appreciation is due to Dr. Donald E. Weinshank for providing me with initial support to work at Michigan State University, and to Dr. Edward N. Kashy for his support and discussions on the subject of teaching science in the university. Finally, I would like to express my gratitude to the members of the Communications Research Group for their input and feedback on my ongoing research.

TABLE OF CONTENTS

List of Tables ..... vii
List of Figures ..... viii

1 Introduction ..... 1
1.1 Motivations ..... 1
1.2 Thesis Statement ..... 3
1.3 Research Contributions ..... 3
1.4 Dissertation Organization ..... 4

2 Background and Related Work ..... 6
2.1 Introduction ..... 6
2.2 Parallel Computation Structures ..... 8
2.3 Communication Paradigms ..... 10
2.4 Collective Communication Operations ..... 12
2.5 Wormhole-Routed MPP Network Architectures ..... 16
2.6 The Trend Toward Clusters ..... 22
2.7 Computation/Communication Models ..... 25

3 Extended Dominating Nodes in Mesh Networks ..... 32
3.1 System Model ..... 33
3.2 Single-Level Dominating Node Broadcast ..... 35
3.2.1 Dominating Nodes in a 2D Mesh ..... 36
3.2.2 Multicast Algorithm ..... 37
3.2.3 Example ..... 39
3.2.4 Performance ..... 40
3.3 EDN Broadcast in 2D Mesh Networks ..... 41
3.3.1 Extended Dominating Sets ..... 41
3.3.2 Systematic Approach to Finding Multi-Level Extended Dominating Sets ..... 44
3.3.3 Theoretical Performance ..... 50
3.3.4 Analysis ..... 54
3.3.5 Simulation Study ..... 55
3.4 EDN Broadcast in 3D Mesh Networks ..... 60
3.4.1 Recursion in the XY Plane ..... 62
3.4.2 Recursion along the Z-axis ..... 63
3.4.3 Theoretical Performance ..... 64
3.4.4 Analysis ..... 65
3.4.5 Simulation Study ..... 67
3.5 Reduction and Gather Operations ..... 69
3.6 Matrix Transposition ..... 71
3.7 Conclusions ..... 74

4 EDN Broadcasting in Torus Networks ..... 75
4.1 System Model ..... 75
4.2 EDN Broadcast in 2D Torus Networks ..... 77
4.3 EDN Broadcast in 3D Torus Networks ..... 87
4.4 Performance Evaluation ..... 89
4.4.1 Number of Message-Passing Steps ..... 90
4.4.2 Timing Analysis ..... 91
4.4.3 Simulation Study ..... 95
4.5 Conclusions ..... 96

5 The τ-Model ..... 98
5.1 The τ-Model ..... 99
5.1.1 Generalized Latency Formula ..... 99
5.1.2 The τ-Plot: Comparing Networks ..... 100
5.1.3 Use of the τ-Plot to Compare Algorithms ..... 102
5.2 Characterization of Point-to-Point Communication ..... 104
5.3 Analysis of Wormhole Broadcast Algorithms ..... 108
5.4 Comparisons Using the τ-Model ..... 113
5.4.1 Overall Comparison of Algorithms ..... 113
5.4.2 Comparison of SC and FT ..... 116
5.4.3 Comparison of EDN and SC ..... 123
5.4.4 Comparison of EDN and FT ..... 124
5.5 Application of the τ-Model to NOWs ..... 127
5.5.1 Broadcast Operations ..... 128
5.5.2 All-to-All Broadcast ..... 131
5.5.3 Summary ..... 135

6 Conclusions and Future Research ..... 136

APPENDICES ..... 139
A Formulae for EDNs in 2^k × 2^k Meshes ..... 139

BIBLIOGRAPHY ..... 143

LIST OF TABLES

2.1 Comparison of some message-passing libraries with the MPI standard ..... 17
3.1 EDN steps for n = 4 × 2^k ..... 51
3.2 EDN steps for n = 5 × 2^k ..... 51
3.3 EDN steps for n = 6 × 2^k ..... 51
3.4 EDN steps for n = 7 × 2^k ..... 51
3.5 Theoretical EDN performance in 3D mesh networks ..... 65
3.6 EDN reduction steps for n = 4 × 2^k ..... 70
3.7 EDN reduction steps for n = 5 × 2^k ..... 70
5.1 Point-to-point performance on various platforms ..... 107
5.2 Analysis of the scatter-collect algorithm ..... 110
5.3 Analytical performance of the three broadcast algorithms ..... 113
5.4 Latency formulae for the SC and FT algorithms ..... 117
5.5 Latency formulae for the SC algorithm and the EDN algorithm ..... 123
5.6 Performance of the broadcast operation ..... 129
5.7 Performance of the all-to-all operation ..... 133

LIST OF FIGURES

2.1 Collective operations ..... 14
2.2 System architecture ..... 18
2.3 Wormhole routers ..... 19
2.4 One-port 2D router ..... 20
2.5 All-port 2D router ..... 20
3.1 Mesh topologies ..... 34
3.2 Examples of finding dominating nodes in a 5 × 5 mesh and a 4 × 4 mesh ..... 36
3.3 Example of SLD broadcasting in a 4 × 4 2D mesh
3.4 Examples of extended dominating sets in 8 × 4 2D meshes
3.5 Level-1 and level-2 dominating nodes in an 8 × 8 mesh
3.6 Two basic configurations for dominating nodes in a 4 × 4 mesh
3.7 Two configurations of synthetic dominating nodes in an 8 × 8 mesh
3.8 Level-2 dominating nodes sending to level-1 dominating nodes in 8 × 8 meshes
3.9 Level-3 dominating nodes sending to level-2 dominating nodes in a 16 × 16 mesh ..... 46
3.10 Startup message passing in 8 × 8 meshes ..... 48
3.11 Basic building blocks for meshes not based on powers of two ..... 49
3.12 RD multicast solution in a 6 × 6 mesh ..... 49
3.13 Comparison of EDN steps with RD and lower bound (2D mesh) ..... 52
3.14 Comparison of broadcast algorithms in 8 × 8 mesh (nCUBE-2 parameters) ..... 53
3.15 Comparison of broadcast algorithms in 32 × 32 mesh (nCUBE-2 parameters) ..... 57
3.16 Comparison of broadcast algorithms in 8 × 8 mesh (T3D parameters) ..... 58
3.17 Comparison of broadcast algorithms in 32 × 32 mesh (T3D parameters) ..... 58
3.18 Different EDNs in 4 × 4 meshes ..... 59
3.19 16-node level-1 EDNs in a 4 × 4 × 4 mesh ..... 60
3.20 Combining EDN configurations to cover an additional plane ..... 61
3.21 Level-2 EDNs in a 4 × 4 × 5 network ..... 61
3.22 Level-3 EDNs in an 8 × 8 × 5 network ..... 62
3.23 Example of concatenating networks in the Z direction ..... 63
3.24 Comparison of EDN steps with RD and lower bound (3D mesh) ..... 64
3.25 Comparison of EDN broadcast with RD broadcast in 3D meshes (simulation) ..... 68
3.26 First three steps of reduction in a 16 × 16 2D mesh ..... 69
3.27 Channel contention problem in the direct approach to matrix transposition ..... 71
3.28 EDN matrix transposition in a 16 × 16 mesh ..... 72
3.29 Comparison of EDN transpose with direct transpose ..... 73
4.1 Examples of 2D 4 × 4 torus networks ..... 76
4.2 Virtual channels in one dimension of a bidirectional torus ..... 77
4.3 Five dominating nodes in a 5 × 5 torus ..... 78
4.4 Two-step broadcast in a 4 × 4 torus ..... 78
4.5 Y-pattern and T-pattern ..... 79
4.6 First phase (two steps) of EDN broadcast in a 16 × 16 torus ..... 80
4.7 Main EDN broadcast routine for a 4^k × 4^k torus network ..... 81
4.8 Procedure T_pattern() ..... 81
4.9 F-pattern of message transmission ..... 82
4.10 EDN broadcast in an 8 × 8 torus, δ = 2 for both Y and T patterns, and δ = 1 for F-patterns ..... 83
4.11 Tilings using {Y, T} and {Y, T, F} patterns ..... 84
4.12 EDN configurations in 3D torus networks ..... 88
4.13 Broadcast performance in torus networks ..... 91
4.14 Comparison of EDN broadcast with U-torus for 4 × 4 and 32 × 32 torus networks ..... 96
4.15 Comparison of analytical and simulation results for a 32 × 32 torus ..... 96
5.1 Example of the τ-plot ..... 101
5.2 Comparing algorithms using the τ-plot ..... 104
5.3 Results from the ping-pong test ..... 106
5.4 τ-plot for point-to-point communication benchmarks ..... 108
5.5 Dimensional broadcast in an 8 × 8 mesh ..... 109
5.6 Scatter-collect broadcast operation in an 8 × 8 2D mesh ..... 110
5.7 Fibonacci Tree algorithm for k = 2 ..... 111
5.8 Fixed β with varying α ..... 114
5.9 Fixed α with varying β ..... 115
5.10 Plots of ζ(δ, k) and μ(δ, k) with respect to k ..... 120
5.11 Different L* values with respect to k ..... 121
5.12 The SC algorithm compared with the FT algorithm for different k values ..... 122
5.13 Different L* values for various τ ..... 124
5.14 The crossover message length with different t_0 values ..... 126
5.15 τ-plots for broadcast operations on different networking environments ..... 130
5.16 Results for the all-to-all operation ..... 132
5.17 Results of fitting the minimum value ..... 133
5.18 τ-plot of all-to-all operations ..... 134
A.1 Representation of the four highest-level EDNs ..... 140

Chapter 1

Introduction

That which has been, it is that which shall be; and that which has been done is that which shall be done; and there is nothing new under the sun.
Ecclesiastes 1:9

1.1 Motivations

Communication and network efficiency play a key role in the pursuit of high-performance scientific computation. Large-scale scientific applications are characterized by the demand for high computing power. Advancements in VLSI technology have increased raw computing power, while high-speed networks have offered an effective way to interconnect processors. Parallel computing provides one of the most economical ways to solve large-scale problems. As the goal of "performance at all costs" has given way to economic considerations, however, fine tuning of underlying hardware platforms and various software components is frequently used to improve the performance of parallel applications at reasonable cost.

Communication is recognized by many researchers as an area in which such fine tuning can result in large performance gains. Communication operations can be generally classified into two broad categories, point-to-point and collective, according to the number of parties involved. The performance of point-to-point communication depends heavily on the underlying network bandwidth, network interface, communication pattern, message size, channel contention, protocol processing, and the current system load. These factors also play an important role in determining the performance of collective communication. In addition, the adverse effects of resource contention and message serialization are more important to collective operations than to point-to-point operations, due to the presence of multiple messages in the network concurrently.

During the past decade, communication operations have been extensively studied in the context of parallel computing environments. The consolidation of the Message Passing Interface (MPI) standard [131] has spawned a wave of implementations of various message-passing libraries for different architectures. The importance and usefulness of collective operations is evidenced by extensive discussion on this topic within the MPI Forum.

1.2 Thesis Statement

Several features of network interfaces and network architectures are critical to the performance of collective communication operations. We contend that a performance cost model capable of accounting for these features can be used in the design of new algorithms for collective operations that outperform existing algorithms across a variety of parallel computing platforms. Moreover, such models can reveal relationships among algorithms and architectures that are not easily discernible through traditional methods.

1.3 Research Contributions

This research begins with the exploration of communication locality, and yields the Extended Dominating Node (EDN) model. This model has been used in the design of several collective operations in parallel computers [142, 143, 144, 145, 146, 147].
The results from these studies are used in the development of a general communication cost model, the τ-model, that covers collective operations and can accommodate different networking technologies. The contributions of this dissertation can be summarized as follows:

- The research has produced an abstract model for designing efficient communication operations for MPPs with dedicated wormhole-routed networks.

- Associated with the abstract model is a cost model to account for the network latency of communication operations. The cost model can be used to classify and characterize communication algorithms, and thus provide a unified view of all communication operations. Moreover, the complexity of communication algorithms can be categorized in a more meaningful way.

- The generalized cost model can serve as a basis to characterize the effects of different architectural supports for communication operations across various platforms, including both MPPs and workstation clusters. By characterizing communication operations, we can gain a better understanding of the expected performance of parallel programs.

1.4 Dissertation Organization

The remainder of this dissertation is organized into the following chapters. Chapter 2 provides background and related work for this research. Algorithms and ongoing research in collective communication primitives for both wormhole-routed MPPs and workstation clusters are reviewed, and several communication cost models used in the parallel processing community are discussed. The Extended Dominating Node model is described in Chapter 3. This model extends the dominating set concept in graph theory to support the design of collective communication primitives in wormhole-routed networks. Chapters 3 and 4 also demonstrate the use of the new model to design collective operations for two network topologies, the mesh and the torus, respectively. Chapter 5 first describes the τ-model in general and its associated τ-plot. Subsequently, we demonstrate the use of the τ-model to analyze communication operations in wormhole switching networks and network-of-workstations environments. Finally, Chapter 6 summarizes the dissertation.

Chapter 2

Background and Related Work

Where the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have 1,000 vacuum tubes and perhaps weigh just 1-1/2 tons.
Popular Mechanics, March 1949, p. 258

2.1 Introduction

The rapid advancements in VLSI technology and breakthroughs in the manufacturing industry have been major sources of innovation in parallel computation structures and networking architectures. From the early processor arrays, vector machines, and multiprocessors, the field of experimental parallel architectures has been very active. As man's quest for computing power to solve larger and finer-grained problems is unabated, so is the demand for economical solutions to satisfy this need. With the scaling of new technology, both in size and speed, and the vast market for general-purpose computing chips, microprocessor-based systems have emerged as the most viable candidates for providing enormous computing power at an acceptable cost. Users now have at their fingertips processing power that was unimaginable fifty years ago, when the first electronic computer was invented.
The design of fast, powerful, and economical single-chip microprocessors will ultimately be limited by physics, specifically the heat-transfer capability of silicon chips and the speed of light. One way to address these limitations is to interconnect multiple processors and program them to solve a problem cooperatively. Parallel processing is not a new idea. As early as the dawn of the computer, there were experiments connecting together these "electronic brains," and research projects in the 1960s and 1970s (Illiac-IV, NASA MPP, C.mmp, and many others) laid the foundation for parallel processing. However, it was Thinking Machines Corporation in the mid-1980s, with the Connection Machine, that symbolized the economic success of massive parallelism.

In recent years, however, the cost of silicon chip manufacturing equipment has risen so high that, without a commodity market, no such company can withstand the high risks and investment. The result is that future generations of microprocessors will be increasingly targeted at the mass market. Parallel computation architectures are already successfully competing with supercomputer architectures in the high-performance computing market, and will likely continue to gain ground as inexpensive and powerful microprocessors are equipped with such features. The introduction of the CM-5 in 1991, which adopted a commodity microprocessor (SPARC), marked the end of the first generation of massively parallel computers and a victory for off-the-shelf microprocessors over custom-made vector processors. Although the CM-5 used special-purpose hardware for communications, experimental microprocessor architectures are already incorporating features that can be used to better support communication operations [79]. With the widespread use of the Internet and network-navigating applications, future microprocessor architectures are likely to be even more communication-oriented. The mass market will be the final judge. These forces have already resulted in the new trend of joining commodity workstations with off-the-shelf communication networks to perform parallel computations. The economics and ubiquity of these so-called workstation clusters may cause us to rephrase the previous technical jargon from "massively parallel machines" to "parallel machines for the masses."

2.2 Parallel Computation Structures

In retrospect, the introduction of the Intel 4004 single-chip microprocessor [81] in 1972 opened a new era in computing. Since then, the pace of technological advancement has been so fast that single-chip microprocessors are competitive with vector machines. Such technological advancement has promised "to bring the power and flexibility of a dedicated general-purpose computer at low cost" [81]. In the early 1970s, Gordon Moore predicted that transistor counts would double every 18 months, a prediction since referred to as "Moore's law." Moore's prediction was based on his observation of tremendous potential advances in microprocessor manufacturing technology. These advances have been realized, and the implications have been even
The revolutionary development of the RISC architecture [115, 117] set the stage for more rapid performance improvement. Specifically, microprocessor performance has increased steadily at the rate of over 50 percent per year since the 19808 [70]. As mentioned earlier, parallel processing can be traced back to the origins of the computer itself: the EN IAC was equipped with multiple processing units for redundancy. Parallel computation architectures span a wide spectrum, from fine-grained bit-serial SIMD systems to multiprocessor vector machines. The Connection Machine marks one end of the spectrum, and the Cray X / MP marks the other end. Both architectures use custom-made processors, whose cost rises more rapidly than the computing power per processor. In the middle of this spectrum is the medium-grained multicomputer equipped with ofi-the-shelf microprocessors. Beginning with the Caltech Cosmic Cube project [129], several commercial mul- ticomputers soon entered the market [11, 109, 120]. Evidence of the success of commercial microprocessors in parallel processing is seen in the computation archi- tectures of the TMC CM-5, Intel TOUCHSTONE, Intel Paragon, Meiko CS-2, and IBM SP—1/SP-2. The trend toward using a commodity network is also clear. Ex- amples include Myrinet [27], Asynchronous Transfer Mode (ATM) [114], and Fast Ethernet [84]. The careful design and integration of the network architecture and the computation architecture is a major issue in parallel systems. As the communi- 10 cation architecture of workstations improves, it is very likely that the dividing line between multicomputers and workstation clusters will eventually vanish. We can expect that the widespread use of commodity high-speed networks, combined with high performance commodity microprocessors, will continue to make the worksta- tion cluster a popular parallel platform. 2.3 Communication Paradigms The programming model used in a parallel system defines the tradeoff between exposing the underlying hardware platform to the programmer and preserving ap- plication portability and extensibility. Parallel programs have been long regarded as dificult to develop and even more dificult to debug. One possible reason is that programmers usually have a hard time visualizing the interaction among concurrent entities. The classical dichotomy of parallel programming style has been shared memory versus explicit message passing. The shared memory paradigm can be supported in software, hardware, or both. Hardware-supported shared memory can be based on either a bus-based network or a multistage interconnection network. However, neither approach scales to a large number of nodes due to contention. Physically distributed, logically shared architectures must rely on underlying message-passing operations to move data among nodes. In one approach, the operating system supports data distribution and coherence protocols, and maintains a single virtual shared memory space for the user applications. This method is used in the MIT 1 1 Alewife [2, 3, 4], Stanford FLASH [68, 90], Princeton SHRIMP I, II [26], and Wis- consin Wind Tunnel [39]. In another approach, programming language constructs support the shared memory paradigm. The compiler generates communication stubs to support data distribution and sharing. Examples include Dataparallel C [67] and HPF [71]. Explicit message passing is often blamed for the difliculty to program and main- tain parallel applications. 
Yet most of today’s commercial multicomputers support this model, in part because it is more closely tied to the machine hardware and gives more control to the programmer. By exposing the communication hardware, this method relies on the programmer to prevent deadlocks in the parallel program. Of course, message passing can also serve as an underlying mechanism to support the shared memory programming paradigm. No matter which upper level programming paradigm is used, all communication operations deal with movement and manipulation of data. Communication oper- ations can be classified as either point-to-point or collective, depending on the number of parties involved. Numerous communications libraries have been devel- oped in the past ten years by commercial and research organizations. The recently published Message Passing Interface (MPI) standard [131] is the result of an efiort by both industry and the research community to improve the portability of parallel programs among diflerent platforms. MPI defines the interfaces and semantics of both point-to-point and collective operations. The software layer that implements these operations is typically packaged as a library. Calls to these routines can be placed in application code developed by the user or inserted by the compiler. The 12 performance of communication operations has direct impact on the performance of the applications that use them. Carefully designed communication operations can increase program portability and efliciency. Portability implies that applications can run on difierent platforms with little or no modification, and eficiency means that no matter which platform is used, the communication Operations can take advantage of the underlying architecture to achieve better performance. 2.4 Collective Communication Operations A communication operation may involve exactly two, or more than two, parties. When more than two parties are involved, the communication operation is called collective. As defined in the MPI standard, the participating entity in these com- munications operation is a process, and the processes involved in a collective op- eration are defined as a process group. The eficient implementation of collective communication primitives can enhance the performance of programs that use them. According to the direction of data movements, we can classify collective communi- cation operations as point-to—multipoint, multipoint-to-multipoint, and multipoint- to-point. A point-to—multipoint operation is used to distribute one or more data items to a set of receiving processes. In a broadcast operation, the same data item is delivered to every receiving process. Another point-to-multipoint operation is called scatter, in which a difierent data item is distributed to each group member. The data flow for broadcast and scatter operations are depicted in Figures 2.1(a) and 2.1(b), 13 respectively. If the same point-to-multipoint operation is performed by all members of a group, the result is a multipoint-to—multipoint collective operation. If all members perform a broadcast, then the resultant operation is called all- gather, or gossiping. When all members perform scatter operations simultaneously, it is called complete exchange, or all-to-all personalized communication. Depicted in Figure 2.1(c), and 2.1(d), respectively, are all-gather and all-to-all personalized communication operations. In addition to data movements, computations can be performed on the data. 
The global computation can be as simple as a logical operation or as complicated as a user-defined function. If the resultant data resides at a single node, then the operation is multipoint-to-point. Depending on the operation performed over the data set, several different collective operations can be defined. As shown in Figure 2.1(e), gather refers to pure data movement from all members to a single member; concatenation is the operation performed on the data. In contrast, combine means that in addition to data movement, a logical or arithmetic operation is performed on the data set; the combine operation is depicted in Figure 2.1(f).

Multipoint-to-multipoint operations also come in several flavors. If the result of a combine operation is made available to all members, then the operation is called global reduction; see Figure 2.1(g). A special case, in which the global operation synchronizes all participating members, is called barrier synchronization. Another special form of multipoint-to-multipoint operation is the prefix-scan, or simply scan, where the result at each node depends only on the contents of all its predecessors in rank order (Figure 2.1(h)).

[Figure 2.1: Collective operations. (a) broadcast; (b) scatter; (c) all-gather; (d) all-to-all personalized communication; (e) gather; (f) combine; (g) global reduction; (h) prefix scan.]

All the above collective operations are included in the Message Passing Interface (MPI) standard [104, 131], which defines a rich collection of both point-to-point and collective communication primitives. Key issues discussed in the MPI standard with regard to implementing these interfaces include sender buffering, receiver buffering, synchronization, data types, process groups, communication contexts, and virtual topologies. Most of these concepts are borrowed from existing message-passing libraries.

The importance of collective communication can be seen from the large number of pre-MPI communication libraries that include such operations. Vendors of parallel computers typically provide their own message-passing libraries that are highly optimized with regard to their own architectures. Notable examples include IBM's EUI (External User Interface) [12], Thinking Machines' CMMD [139, 149], Intel's NX [119], and nCube's Vertex [126]. In the parallel processing research community, PVM [59] is perhaps the most widely used system for parallel computing over networks of workstations. Other prominent libraries include the Argonne m4 descendants p4 [35, 36], PARMACS [37, 69], and TCGMSG, as well as PICL [60, 61] from Oak Ridge National Lab, Zipcode [130] from Caltech, Yale's Linda [6, 38], BLACS (Basic Linear Algebra Communication Subprograms), which is the communication library for ScaLAPACK [54], CHIMP [29] from the University of Edinburgh, and LAM [33, 34] from Ohio State University. NXLib, implemented by Lamberts et al. [91, 132, 133], is designed to emulate the Paragon message-passing environment on clusters of workstations.

All the above libraries implement some form of collective communication operations. From Table 2.1, we can see how certain libraries differ from the proposed MPI standard, although these libraries are now merging into the MPI mainstream. As indicated in Table 2.1, most of the pre-MPI libraries do not have very extensive communication interfaces compared to MPI.

[Table 2.1: Comparison of some message-passing libraries with the MPI standard.]
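For concreteness, the taxonomy above maps directly onto the MPI collective interface. The following fragment is our illustration, not part of the standard's text; the buffer sizes and datatypes are arbitrary, and the receive buffers of the gather-style calls must be large enough to hold one block per process.

```c
/* Sketch: the collective operations above, expressed as MPI-1 calls.
 * Illustrative only; counts are arbitrary, and recvbuf must hold
 * nprocs * n ints for the gather/all-gather/all-to-all calls. */
#include <mpi.h>

void collectives_demo(int *sendbuf, int *recvbuf, int n, int root)
{
    MPI_Comm comm = MPI_COMM_WORLD;  /* the process group */
    int sum, prefix;

    MPI_Bcast(sendbuf, n, MPI_INT, root, comm);               /* broadcast          */
    MPI_Scatter(sendbuf, n, MPI_INT,
                recvbuf, n, MPI_INT, root, comm);             /* scatter            */
    MPI_Allgather(sendbuf, n, MPI_INT,
                  recvbuf, n, MPI_INT, comm);                 /* all-gather         */
    MPI_Alltoall(sendbuf, n, MPI_INT,
                 recvbuf, n, MPI_INT, comm);                  /* complete exchange  */
    MPI_Gather(sendbuf, n, MPI_INT,
               recvbuf, n, MPI_INT, root, comm);              /* gather (concat.)   */
    MPI_Reduce(sendbuf, &sum, 1, MPI_INT, MPI_SUM, root, comm);  /* combine         */
    MPI_Allreduce(sendbuf, &sum, 1, MPI_INT, MPI_SUM, comm);  /* global reduction   */
    MPI_Scan(sendbuf, &prefix, 1, MPI_INT, MPI_SUM, comm);    /* prefix scan        */
    MPI_Barrier(comm);                                        /* barrier            */
}
```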
Moreover, the workstation cluster versions of most of these libraries are based on the point-to-point TCP/IP protocol suite. Clearly, careful implementation and fine tuning according to the system characteristics are necessary to achieve high performance, especially in high-speed networks, where the overhead of conventional protocol stacks is especially detrimental to performance. Researchers such as von Eicken et al. [152] have used the active message approach [153] to reduce the software overhead in ATM networks, and Pakin et al. [108] have developed a fast messaging layer for Myrinet [27], a high-speed local area network.

2.5 Wormhole-Routed MPP Network Architectures

Given the importance of collective communication, the design and analysis of such operations on various parallel architectures has been a very active research area for the past decade [101]. Most early studies concentrated on the hypercube topology. First-generation hypercubes, such as the Cosmic Cube, nCube/ten, and iPSC, all exhibited large communication latencies, on the order of milliseconds. This characteristic was due to the inefficiency of communication software and the use of store-and-forward switching, which requires intermediate nodes on the message path to receive incoming data and then forward it to the next node in the path. As a result, the communication latency was proportional to the number of communication links traversed by the message. Suppose the software setup overheads are t_s for sending a message and t_r for receiving a message, and that the transmission delay per node per unit message is t_n; then the overall delay for sending an L-unit message h hops across the network with store-and-forward switching is h(t_s + L*t_n + t_r).

With the advent of wormhole routing, in which a message is broken down into small units and pipelined through the network, the overall latency of sending an L-unit message h hops across the network is reduced to approximately t_s + h*t_n + L*t_n + t_r. The advantage of this technique is twofold. First, communication latency is relatively distance-insensitive, a characteristic that has been demonstrated in experimental studies [107]. Second, the routing automata can be implemented in specialized hardware, called a wormhole router, relieving the processor of the burden of message buffering and routing. As a result, communication latency is reduced to the microsecond range.
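To make the difference concrete, the following sketch evaluates both latency formulas side by side; the parameter values are arbitrary placeholders, not measurements from any of the machines above.

```c
/* Sketch: store-and-forward vs. wormhole latency for an L-unit message
 * crossing h hops.  Parameter values below are arbitrary placeholders. */
#include <stdio.h>

int main(void)
{
    double ts = 100.0;   /* send setup overhead (us)            */
    double tr = 100.0;   /* receive setup overhead (us)         */
    double tn = 1.0;     /* per-node transmission time per unit */
    double L  = 1024.0;  /* message length (units)              */

    for (int h = 1; h <= 16; h *= 2) {
        double saf = h * (ts + L * tn + tr);     /* every hop stores, then forwards */
        double wh  = ts + h * tn + L * tn + tr;  /* pipelined: distance adds only h*tn */
        printf("h=%2d  store-and-forward=%8.0f  wormhole=%8.0f\n", h, saf, wh);
    }
    return 0;
}
```

In the wormhole formula the hop count h multiplies only t_n, so for L much larger than h the latency is nearly independent of distance, which is the property exploited throughout the rest of this chapter.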
Due to this capability, a multi—port router can be important to the design and performance of collective communication operations, which often involve the transmission (reception) of multiple messages at a single node. Depicted in Figure 2.3(a) and (b) are difl'erent port models of a wormhole router. In Figure 2.4 is a detailed block diagram of the internal structure of a one-port 2D mesh wormhole router, which is composed of two 1D mesh routers. For comparison, Figure 2.5 shows the structure of an all-port 2D mesh wormhole router. The main element of this router is a crossbar. Wormhole routing is the predominant switching technique used in most sec- ond generation multicomputers. Some systems, exemplified by the nCube / 2, adopted the topology of their ancestors, namely, the hypercube. However, anal- ysis by Dally [45] on the performance of wormhole routing under the assumption of constant bisection width for difierent k-ary n-cube topologies, showed a clear cost / performance advantage for low-dimensional topologies, such as meshes and tori. Consequently, Intel produced the DELTA and Paragon systems, which use a 2D mesh, and Cray designed the T3D and T3E, which use a 3D torus topology. P0 Y- “I TY‘fM / h Po Y“ '1 TY‘MIV/ I M ‘ ,,./ //‘ ./ / / Ll //P°" fi—W/é/ Ports x- “‘4', x X- i ‘ \/;I/’/ x M I - n M ' — h ‘--— -——4 r * -— +_ __ 1' 1“...“ . w—w ,--I- -.—.-—-~ ~>l ,_ I——--,.+ '" ' 00‘ In = T can Y‘all ‘Iv+h Y-Ml IY+ h (shone—M20 muter (b) anall—ponzo router Figure 2.3: Wormhole routers. Over the last several years, a number of researchers have sought to exploit the 20 fl- _ X6 FAD": X0 ,_ . in, [:2] __ _ Lia" - 9‘“J:F§H [‘4‘ ’J- 9” ' ’ ' x-.. [ADJ x-... x’h [‘53] 3x3 x'au "*1 EC?“ ‘4 "r—rr ”“ ’Egl" “" Crossbar ’ 4'" y.“ I EL Ly,” p. my --...I_.. -~— -— h . L _ — is; he» .- w v-.. AD]. 0mm P0 __ r_ _— Po 7 I up] w I»-.. _._. A F_C hit—‘5‘ [<- "J'. P1“ TAD“ P1“ _._ _ ———. 'pc an we —3- “a, [49]. 3x3 .72, 1"”? - EC w—v Crossbar “ p2,, [10; p2,... ”LEG r— _le ‘ "’ Y‘h A—D . P0,. _. . FC [‘Z —— ?2- P3“ lg P301 AD: We. dooodo AD: Mus decode FC: Flow control FC: Roam Figure 2.4: One-port 2D router. Figure 2.5: All-port 2D router. distance—insensitivity of wormhole routing in the design of collective communica- tion algorithms. A large fraction of that research has concentrated on broadcast and multicast for wormhole-routed systems. One-port mesh multicast (U—mesh al- gorithm) is given in [103], and a one-port mesh broadcast algorithm is presented in [17]. The former is a tree-based algorithm that can be used in both hypercubes and meshes. The latter uses a scatter-collect approach, in which the original mes- sage is split into pieces that one distributed to all nodes and later collected by the receiving nodes. Multicast for a one-port torus is described in [124]; similar to the U-mesh algorithm in one-port meshes [103], the U-torus algorithm is tree based. In addition to the tree-based approach to multicast and broadcast, a difi'erent method is to partition the message into pieces and send them through different routes in a pipelined fashion. This approach has been explored for one-port hypercubes [73] and one-port meshes [156]. All the communication algorithms mentioned above assume a one-port network architecture. In addition, software unicast algorithms have been designed for multi- port systems. Communication algorithms based on the multiport assumption have 21 been studied quite extensively for hypercubes [21, 76, 85, 99]. 
Tree-based multiport hypercube multicast algorithms are presented in [76, 99, 123], and one-to-all personalized communication (scatter) is described in [85]. Broadcasting algorithms based on our EDN model for multiport meshes [147] and tori [146] comprise our preliminary studies and are described in detail in Chapters 3 and 4.

Another distinctive hardware feature that has been used to construct efficient communication operations is a path-like approach to multicast. This method relies on a special feature of the router, in which the message can be duplicated so that the node connected to the router can receive a copy while the message is progressing toward subsequent destinations. The entire message visits all the destination nodes in a pipelined fashion. This path-based approach was first explored in [95, 96, 97]. Later, this concept was extended from single-phase algorithms to multi-phase collective algorithms for meshes [148], hypercubes [74], and tori [122]. The EDN model can also be used in the design of multi-phase path-based algorithms.

While most of the above-mentioned communication operations focus on multicast and broadcast, other collective communication primitives have been developed specifically for wormhole networks. Examples include barrier synchronization [66, 110, 159] and global combine [151]. Several different approaches have been proposed for all-to-all exchange in meshes, including binary exchange, interleaved binary exchange, direct exchange [128], quadrant exchange [28], modified quadrant exchange [136], pairwise exchange, and indirect pairwise exchange [137, 138]. All these algorithms assume a one-port architecture. We [144] have used the EDN model to design global combine, gather, and matrix transposition algorithms for multiport 2D meshes; details are given in Chapter 3.

These extensive research efforts have shown that, to design efficient collective communication operations, the characteristics of the underlying network must be carefully taken into consideration. This is particularly true in the parallel processing community, where communication latency is the major concern. In early studies of collective communication, the store-and-forward routing assumption was predominant, although both one-port and all-port architectures were addressed. With the advent of wormhole-routed networks, the network interface characteristics play increasingly important roles in the design of communication algorithms. While a systematic approach to designing efficient algorithms is very useful, more importantly, a unified model for characterizing and parameterizing the network is needed in order to compare and predict the performance of different algorithms in various environments.

2.6 The Trend Toward Clusters

The platforms used for parallel computing have continued to change over the past decade. In fact, technology has evolved to a point where the communication architectures of MPPs and workstation clusters bear more similarity than difference. In a paper bearing a similar title to the first paper introducing the RISC architecture [115], Anderson et al. [8] pushed forward a new era of the network of workstations (NOW) architecture. Using a collection of workstations to perform parallel computations is not restricted to scientific computing.
Indeed, unprecedented computing power is possible when commercial microprocessors are coupled with high performance communication networks. However, how to harness such combined computing and communication capability is a challenging task. Past experiences with the paral- lel computers indicates that eficient parallel applications require not only careful layout the computation, but also the careful consideration of the underlying com- munication hardware. Given the economic and performance advantages of NOW platforms, many diflerent high-speed networking technologies have been investi- gated in this context. Examples of such networking technology are asynchronous transfer mode (ATM), fast Ethernet, and Myrinet [27]. In order to take advantage of such technologies, network interface sofiware must be streamlined. A number of research groups are investigating ways to reduce software overheads; examples in- clude Berkeley NOW Fast Communication [125], Berkeley Active Messages [98, 153], Illinois Fast Messages (FM) [108], and Cornell U-Net [20]. To date, most of these efiorts have focused on unicast communication, although collective communication is gaining increased attention. Of course, the computer networking research community has already showed great interest in the study and implementation of collective communication, ex- emplified by efforts in designing multicast algorithms for various protocol stacks. The introduction of IP multicast [51] implementations made possible many such algorithms. Due to the unreliability of the underlying network, reliability is sup- ported with a separate protocol. Example protocols proposed thus far include 24 ST-II [140] at the network layer; MTP [10], XTP [134, 157], TP++ [56], and VMTP [40] at the transport layer. The recently introduced Scalable Reliable Mul- ticast (SRM) [57] emphasizes the functionality at both the application and session levels. The ISIS [23, 24, 25] and Horus [121, 155] systems have implemented group communication from a difl'erent perspective; they are designed to preserve message ordering in the presence of faults in the network. Some communication packages designed to support parallel computing over con- ventional network technology are also based on the conventional Internet protocol stack. An important example is PVM [135]. The popularity of PVM has demon- strated that the use of ofi-the-shelf communication and computation components, including both software and hardware, is a viable approach to parallel process- ing. The implementation of MP1 over workstation environments, for example, MPICH [64] and CHIMP/MP1 [30], are natural extensions of this work. Several parallel processing projects have constructed communication systems based on IP multicast. This approach includes the Transis system [52] and the PCODE [31] library. All these packages support a subset of the collective operations described earlier. Concurrent with the development of efficient communication operations over traditional network technology, the use of high-speed networks such as ATM has also received attention from the parallel processing community. Point-to-point com- munication performance over ATM network with difi'erent software interface imple- mentations is studied in [94], and ATM point-to-point primitives have been used to improve the performance of PVM [160]. The exploration of low-level ATM mul- 25 ticast to support collective operations is presented in [77 , 78]. 
This research has shown the potential of directly applying ATM networking technology to parallel computing. Meanwhile, a large contingent is focusing on the implementation of traditional Internet protocol stack over ATM [9]. Finally, a number of studies have addressed how to parameterize collective op— erations for use in distributed environments. For cluster environments that are dedicated parallel machines, such as IBM SP systems, the CCL library [13] of- fers parameterized collective communication. In addition, Park et al. [112] have shown how to design multicast algorithms using a parameterized communication cost model that derives from the LogP model. Finally, considering a network envi- ronment spawned from the parallel processing community, the Myrinet LAN [62], a reliable multicast algorithm based on reliable point-to-point communication prim- itive and credit-based flow control mechanism is reported in [154]. 2.7 Computation / Communication Models In the design and performance assessment of parallel programs, an appropriate com- putation model is used to aid the understanding of their behavior. In this section, several popular computation models that are used in the design of eficient par- allel programs are briefly reviewed. In addition, emerging communication models, proposed by various researchers to characterize communication behavior more accu- rately, are discussed. The diflerence between modeling parallel algorithms and se- quential algorithms lies not only in the parallelism from multiple concurrent threads 26 of control, but more importantly, in the communication costs that be taken into account. Although the Turing machine is a theoretically interesting computation model that can eflectively simulate arbitrary algorithms, this model is seldom used in the design and analysis of parallel algorithms. Numerous research endeavors have sought to design a more practical model for analysis of parallel algorithms. We review some of these eflorts in the following. PRAM and variants. In the area of modeling and analyzing the complexity of parallel algorithms, the classical Parallel Random Access Machine (PRAM) model [58] has been widely used. This model assumes P concurrent processors sharing a unlimited sized memory, and closely matches the uniform memory access shared-memory machines. Several refinements to the model accounting for the han- dling of simultaneous access of the same data item by several processors. Examples include the EREW (exclusive read exclusive write) PRAM, the CREW (concurrent read exclusive write) PRAM, and the CRCW (concurrent read concurrent write) PRAM. Due to the simplicity of the model and the lack of a detailed cost model for communication, several variants of the PRAM model have been proposed in an attempt to better model the behavior of parallel systems [5, 42, 63]. The PRAM models assume that communication is implemented using shared memory or explicit message-passing. In the case of message-passing, the network topology is either not considered or simply assumed to be a fully-connected graph. The reason for this is twofold: assuming a specific topology invalidates the gener- ality of the model, and no “canonical” topology exists among real parallel architec- 27 tures. Moreover, even for small networks, the completely connected topology is far more dense than typical multicomputer topologies, such as meshes and hypercubes. 
Therefore, contention for network resources, such as link bandwidth and channel sharing, is not likely to be accurately accounted for by these abstract models.

Postal model. The Postal model [18], as opposed to the telephone model, attempts to define a communication model in terms of a single parameter λ, which is the ratio of the network latency to the time that the sender is engaged in preparing the message. By using the daily mail delivery system as an analogy to the communication network, this model can be used to analyze the performance of communication algorithms such as broadcast and multicast [19]. Specifically, if a node can send multiple messages before the first arrives at its destination, the structure of an optimal distribution tree is affected. Because of the long network latency and the relatively short message preparation overhead, which the model asserts by assuming λ ≥ 1, a complete graph for the underlying network is suitable for the analysis. However, for high-speed networks such as ATM LANs, as observed in [86], the software startup latency is on the order of 400 to 1000 μsec, while the switch latency is on the order of a few to tens of microseconds. This effectively makes the value of λ less than one, and therefore the optimal algorithm developed under this model may not be the best choice for such systems.
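The role of λ can be made concrete with the broadcast recurrence of Bar-Noy and Kipnis [18]: the number of nodes P(t) that can be informed within t time units satisfies P(t) = 1 for t < λ, and P(t) = P(t-1) + P(t-λ) otherwise (a sender may begin a new send each time unit, and each send completes λ units later). The sketch below, which assumes integer λ for simplicity, tabulates this growth; λ = 1 reduces to recursive doubling, while larger λ gives slower, Fibonacci-like growth.

```c
/* Sketch: nodes reachable by broadcast within t time units under the
 * postal model, for integer latency lambda.
 * P(t) = 1 for t < lambda;  P(t) = P(t-1) + P(t-lambda) otherwise. */
#include <stdio.h>

#define TMAX 20

int main(void)
{
    for (int lambda = 1; lambda <= 4; lambda++) {
        long p[TMAX + 1];
        for (int t = 0; t <= TMAX; t++)
            p[t] = (t < lambda) ? 1 : p[t - 1] + p[t - lambda];
        printf("lambda=%d: P(10)=%ld  P(20)=%ld\n", lambda, p[10], p[20]);
    }
    return 0;
}
```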
BSP model. The Bulk Synchronous Parallel (BSP) architecture model [150] was designed to achieve the goal of developing architecture-independent parallel algorithms. It is intended to bridge the programmer's perspective with the actual system architecture in order to achieve portable program design. The BSP architecture is a collection of processor/memory pairs, a point-to-point virtual network, and a mechanism to synchronize all or a subset of the processor/memory nodes. The computation is divided into sequences of parallel "supersteps," with each superstep separated by a global synchronization, before which all memory and network accesses should be completed. The cost model associated with the BSP model is expressed as a linear combination of the communication costs, synchronization costs, and computations within each superstep. The communication cost is measured by a communication primitive called the h-relation [1], which means that each processor has at most h messages to send to different processors and each processor is due to receive at most h messages. Unfortunately, the BSP model does not describe how communication occurs; that is, the communication pattern is not scheduled precisely.

C3 model. The C3 model [88] is an attempt to capture the cost of computation, communication patterns, and network congestion due to communication for coarse-grained parallel algorithms. Five system parameters are used as metrics for evaluating the target algorithms on a particular system: the number of processors p, the network latency h, the bisection width of the communication network b, the startup cost per message s, and the length of a message l. The latency h is modeled as the average distance between two processors. The overall communication cost is the sum of the send/receive cost, link congestion cost, and processor congestion cost. The send/receive cost depends on the underlying routing mechanism and the communication mode used to send/receive a message. The routing mechanism can be store-and-forward or wormhole routing, and the communication mode can be either blocking or non-blocking. This model is designed specifically to characterize parallel computer networks, wormhole networks in particular. Similar to the LogP model discussed below, this model mainly considers point-to-point communication; possible hardware-supported multicast capability is not included in the model.

LogP model. The LogP model is an effort to model parallel and distributed systems in which all participating nodes communicate through explicit point-to-point message passing. Associated with each message are three parameters to account for the cost of message passing: the overhead o, the time that the sender is busy preparing the message; the gap g, the total amount of time elapsed from the start of one message to the initiation of the next message; and the network latency L, the time that a message spends in the network. The model also assumes that the time the receiver is engaged in reception of the message is equal to the overhead o. As with the Postal model, the LogP model assumes a completely connected graph as the underlying network topology. In fact, the LogP model can be viewed as a generalization of the Postal model: by setting the parameters o = 0, g = 1, and λ = L, the LogP model is exactly the same as the Postal model [18]. Recent extensions to LogP enable the gap between packets to be much smaller than the gap between messages [7], and allow for a hierarchical memory model [93]. Although the LogP model was carefully designed to account for communication costs, with three of its four parameters used to describe such costs, there exist many situations the model does not adequately cover. Examples include network congestion and network-supported multicast.

Hockney's model. Some models have focused exclusively on communication. Hockney [72] recently proposed a communication model that identifies two metrics as the most important performance indicators for modeling point-to-point communication on massively parallel processors such as the Intel iPSC/860, Paragon, and Meiko CS-2. The startup time, t_0, is the latency to send a zero-length message, and r_∞ is the asymptotic bandwidth. The latency formula used to describe the elapsed time for sending an m-byte message through the network is of the form t_0 + m/r_∞. Despite its simplicity, this formula matches the measured point-to-point communication performance of many massively parallel architectures [53].
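Fitting Hockney's two parameters to ping-pong measurements is a simple linear regression, since t = t_0 + m/r_∞ is linear in m with intercept t_0 and slope 1/r_∞. The sketch below illustrates the fit; the sample points are made-up placeholders, not measurements from the machines cited above.

```c
/* Sketch: fit Hockney's latency model t = t0 + m / r_inf to ping-pong
 * measurements via least squares on t = a + b*m (so r_inf = 1/b).
 * The sample points below are made-up placeholders. */
#include <stdio.h>

int main(void)
{
    double m[] = { 8, 64, 512, 4096, 32768 };  /* message sizes (bytes) */
    double t[] = { 52, 55, 76, 240, 1680 };    /* one-way times (us)    */
    int n = 5;
    double sm = 0, st = 0, smm = 0, smt = 0;

    for (int i = 0; i < n; i++) {
        sm += m[i]; st += t[i]; smm += m[i] * m[i]; smt += m[i] * t[i];
    }
    double b = (n * smt - sm * st) / (n * smm - sm * sm);  /* slope     */
    double a = (st - b * sm) / n;                          /* intercept */
    printf("t0 = %.1f us, r_inf = %.2f bytes/us\n", a, 1.0 / b);
    return 0;
}
```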
The need for a general communication model. A universal parallel computing model that satisfies both the efficiency and universality requirements is difficult to construct, since these basic requirements often conflict. This difficulty has led to the separation of performance modeling and analysis from algorithm development: the models used to analyze the complexity of parallel algorithms are rarely used to design and develop communication algorithms. Therefore, in the design of efficient, portable parallel algorithms, the usual approach is to adopt some frequently used communication primitives and implement these primitive operations on a per-platform basis.

In order to design efficient communication primitives, an extensive communication cost model is needed. The model should be general enough to capture the characteristics of different communication platforms and capable of reflecting costs associated with resource conflicts. In the following chapter, we show how the proposed EDN model provides an intuitive model that is powerful enough to accommodate many communication operations on various platforms.

Chapter 3

Extended Dominating Nodes in Mesh Networks

There may be nothing new under the sun, but permutation of the old within complex systems can do wonders.
Stephen Jay Gould, 1977

As discussed in Chapter 2, development of efficient collective algorithms requires that the underlying hardware characteristics be taken into account. This approach has been followed in the application of the Postal model and the LogP model to the design of collective algorithms for various environments. However, an important feature that is not reflected in either of these models is the ability of one node to transmit a message to multiple destinations simultaneously.

In this chapter, we develop the Extended Dominating Node (EDN) model, which explicitly models this "single-step multicast" property. Such a capability may be realized in different ways, for example, broadcast on a shared medium (such as Ethernet) or path-based multicast in wormhole networks. In this chapter and the following chapter, we develop collective algorithms for wormhole-routed networks in which single-step multicast is possible due to a multi-port network interface architecture. In Chapter 5, we discuss the application of EDN to ATM networks, where single-step multicast is supported by multicast virtual channels.

3.1 System Model

Before we formally introduce the definition of Extended Domination and its application in wormhole networks, we discuss some major characteristics of the systems studied. This chapter addresses MPCs that are characterized by the five properties described below.

First, their topologies are low-dimensional (2D or 3D) meshes. Formally, an m × n 2D mesh consists of N = m × n nodes; each node has an associated integer coordinate pair (x, y), 0 ≤ x < n and 0 ≤ y < m. Two nodes with coordinates (x_i, y_i) and (x_j, y_j) are connected by a communication channel if and only if |x_i − x_j| + |y_i − y_j| = 1. A 3D mesh is defined similarly. Figure 3.1(a) shows a 3 × 3 2D mesh, while Figure 3.1(b) shows a 3 × 3 × 3 3D mesh.

Second, communication is handled by routers, one at each node. Several pairs of external channels connect each router to neighboring routers; the pattern in which the external channels are connected defines the network topology. Typically, a router can relay multiple messages simultaneously, provided that each incoming
message requires a unique outgoing channel. In addition, two messages may be transmitted simultaneously in opposite directions between neighboring routers. As discussed in an earlier chapter, a router is connected to the local processor/memory by internal channels, or ports. One or more of the ports is for input, and one or more is for output.

Figure 3.1: Mesh topologies. (a) 3 × 3 2D mesh; (b) 3 × 3 × 3 3D mesh.

Third, these systems use the wormhole routing switching strategy [48]. Wormhole routing has been adopted in many new-generation MPCs, including the MIT J-machine [47] (3D mesh), the nCUBE-2 [106] and nCUBE-3 [55] (hypercube), the Intel Paragon [82] (2D mesh), and the Cray T3D [87] (3D torus).

Fourth, these systems use the deterministic dimension-ordered routing algorithm [47], in which messages are routed through the dimensions of the mesh in a predetermined order. In Figure 3.1(b), for example, a message sent from node (0,0,0) to node (2,2,2) under dimension-ordered routing would follow the path: (0,0,0), (1,0,0), (2,0,0), (2,1,0), (2,2,0), (2,2,1), (2,2,2). (Since wormhole routing is assumed here, such a message would traverse only the routers at the five intermediate nodes, and would not be handled by the local processors at those nodes.) Although research in adaptive routing algorithms is very promising, most commercial systems whose topologies are n-dimensional meshes presently use some form of dimension-ordered routing.

The final distinguishing characteristic of these systems, mentioned earlier, is that each node possesses multiple internal channels, or ports.

3.2 Single-Level Dominating Node Broadcast

We begin with the case of broadcast in all-port 2D mesh networks. We first apply the concept of dominating sets in graphs to direct network topologies for MPCs. The MPC is modeled as a graph, with nodes represented as vertices and communication links represented as edges.

Definition 1. A set of dominating nodes D in a direct network is a subset of nodes such that every node in the network is either in D or is a neighbor of at least one node in D.

The first broadcasting approach studied in this project, called the single-level dominating (SLD) algorithm, comprises two phases. First, the message is multicast from the source node to a set of dominating nodes of the network; this phase may require several message-passing steps. In the second phase, the dominating nodes deliver the message to all other nodes in the network by forwarding the message directly to the appropriate neighbors. Once all the dominating nodes have received the message, this second phase can be completed in one additional message-passing step. The two primary issues that arise in using this approach are (i) how to find a set of dominating nodes with small cardinality, and (ii) how to efficiently send a multicast message to those dominating nodes.

3.2.1 Dominating Nodes in a 2D Mesh

The problem of finding a dominating set (or domination number) of minimal cardinality in a rectangular graph (or grid graph) has been shown to be NP-hard [83]. Since the objective of this project is to develop practical broadcast methods, however, any approximation algorithm that yields a dominating set will serve our purpose, as long as the number of nodes within that set is a relatively small fraction of the total number of nodes. Following the work of Cockayne [41], we can easily find such sets of dominating nodes for square meshes. This method is illustrated in Figure 3.2.
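Before walking through the construction, note that Definition 1 is straightforward to verify mechanically. The brute-force sketch below checks a candidate set against an m × n mesh; the example set is the one used later in Figure 3.3 for a 4 × 4 mesh.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Definition 1 check: every node must be in D or adjacent to a
     * member of D. Nodes are (x, y) pairs, 0 <= x < n, 0 <= y < m. */
    static bool dominates(int m, int n, const int (*D)[2], int k)
    {
        bool *covered = calloc((size_t)(m * n), sizeof *covered);
        const int dx[] = {0, 1, -1, 0, 0}, dy[] = {0, 0, 0, 1, -1};
        for (int i = 0; i < k; i++)
            for (int d = 0; d < 5; d++) {      /* the node itself + 4 neighbors */
                int x = D[i][0] + dx[d], y = D[i][1] + dy[d];
                if (x >= 0 && x < n && y >= 0 && y < m)
                    covered[y * n + x] = true;
            }
        bool ok = true;
        for (int i = 0; i < m * n; i++)
            ok = ok && covered[i];
        free(covered);
        return ok;
    }

    int main(void)
    {
        /* The four dominating nodes of the 4 x 4 mesh in Figure 3.3(a). */
        int D[][2] = {{1, 0}, {3, 1}, {0, 2}, {2, 3}};
        printf("dominates 4x4 mesh: %s\n", dominates(4, 4, D, 4) ? "yes" : "no");
        return 0;
    }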
We begin by replicating a basic 5-square pattern, illustrated in Figure 3.2(a), to form an "infinite" background pattern, represented in Figure 3.2(b). The shaded squares represent dominating nodes; every unshaded square is adjacent to exactly one shaded square. In order to find a set of dominating nodes for a square mesh of a given size, we move a square frame of that size over the background pattern, seeking a position in which all squares in the frame are covered minimally.

As illustrated in Figure 3.2(b), our search for a dominating set for a 5 × 5 mesh results in the placement of the 5-square frame as shown. The frame contains 5 dominating nodes; however, these are not sufficient to dominate all nodes that lie inside the 5 × 5 frame. As shown in Figure 3.2(c), we move two dominating nodes that lie outside the frame to its interior and adjust the positions of two of the dominating nodes that were already inside the frame, forming a set of 7 dominating nodes that covers all 25 nodes in the frame. Details of this method may be found in [41]. Also shown in Figure 3.2(b) is the optimal placement of a 4 × 4 frame. In this case, the 4 dominating nodes inside the frame represent a minimum dominating set. As shown in Figure 3.2(d), none of the nodes inside the frame is dominated by a node outside the frame, so no adjustment is necessary. In general, this approach yields a set of dominating nodes with cardinality of approximately one-fourth the number of nodes in the square mesh [41].

Figure 3.2: Examples of finding dominating nodes in a 5 × 5 mesh and a 4 × 4 mesh. (a) basic 5-square pattern used to construct the background pattern; (b) placing target squares on the background; (c) adjusting dominating nodes for the 5-square frame; (d) no adjustment necessary for the 4-square frame.

3.2.2 Multicast Algorithm

Once a set of dominating nodes has been found for a specific mesh, the first phase of the SLD broadcast algorithm can be executed, in which the source node delivers the message to all the dominating nodes in the mesh. Unicast-based multicast communication in wormhole-routed 2D mesh networks has been studied previously. For example, the U-mesh algorithm [103] was designed for one-port mesh networks, but can also be used on multi-port architectures.

In the U-mesh algorithm, the source node and the destination nodes are sorted lexicographically by their addresses into a list Φ. The source node successively divides Φ in half. If the source is in the lower half, then it sends a copy of the message to the first node in the upper half. In addition to the data, each message carries the addresses of the destinations for which the receiving node is responsible. The receiving node will be responsible for delivering the message to the other nodes in the upper half, using the same U-mesh algorithm in a distributed manner. If the source is in the upper half, then it sends a copy of the message to the last node in the lower half. At each step, the source deletes from Φ the receiving node and the nodes in the half not containing the source address. The source continues this procedure until Φ contains only its own address.
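The halving logic is compact enough to sketch directly. The code below shows only the source's side of the recursion (each receiver repeats the same loop on the sublist it is handed); addresses are modeled as sorted integers, and send_stub() stands in for a real unicast that would also carry the data and the sublist.

    #include <stdio.h>

    static void send_stub(int from, int to, const int *list, int lo, int hi)
    {
        printf("node %d -> node %d (now responsible for %d..%d)\n",
               from, to, list[lo], list[hi]);
    }

    /* phi[lo..hi] is the sorted address list, with the source at src_idx. */
    static void umesh(const int *phi, int lo, int hi, int src_idx)
    {
        while (lo < hi) {
            int mid = (lo + hi) / 2;          /* split Phi in half            */
            if (src_idx <= mid) {             /* source in lower half:        */
                send_stub(phi[src_idx], phi[mid + 1], phi, mid + 1, hi);
                hi = mid;                     /* keep the lower half          */
            } else {                          /* source in upper half:        */
                send_stub(phi[src_idx], phi[mid], phi, lo, mid);
                lo = mid + 1;                 /* keep the upper half          */
            }
        }
    }

    int main(void)
    {
        int phi[] = {0, 3, 5, 6, 9, 10, 12, 15};  /* sorted addresses     */
        umesh(phi, 0, 7, 2);                      /* source is address 5  */
        return 0;
    }

Running the sketch on an eight-node list produces the logarithmic number of sends claimed below.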
On one-port architectures, the U-mesh algorithm is optimal in terms of the number of message-passing steps, requiring ⌈log₂(m + 1)⌉ steps to reach m destinations. Moreover, the constituent unicast messages are guaranteed to be contention-free [103]. However, we emphasize that other multicast algorithms besides U-mesh could be used in the first phase of the SLD broadcast algorithm.

3.2.3 Example

Figure 3.3 illustrates the operation of the proposed SLD algorithm in a small, 4 × 4 mesh. Figure 3.3(a) shows the mesh itself. The four dominating nodes, (0,2), (1,0), (2,3), and (3,1), are shaded in the figure. In this example, node (2,0) is the source of the broadcast. Figure 3.3(b) shows the broadcast "tree" rooted at node (2,0). In step 1, the source node transmits simultaneously to dominating nodes (1,0) and (2,3), which subsequently forward the message to dominating nodes (0,2) and (3,1), respectively, in step 2. Although some of these messages pass through intermediate routers, the pipelining effect of wormhole routing and startup costs result in latency nearly identical to that of messages sent between neighboring nodes [107]. Since the network is assumed to be an all-port architecture, node (1,0) can also forward the message to node (1,1) in step 2, and node (2,3) can also forward the message to nodes (1,3) and (2,2) in step 2. The dominating nodes send the message to the remaining nodes in step 3.

Figure 3.3: Example of SLD broadcasting in a 4 × 4 2D mesh. (a) the 4 × 4 2D mesh; (b) broadcast based on a single-level dominating set.

3.2.4 Performance

In the SLD broadcast algorithm, the source node uses the U-mesh algorithm to multicast the message to the dominating nodes, which then complete the broadcast by passing the message to the appropriate neighboring nodes. Of course, the U-mesh algorithm itself can be used for broadcast, requiring ⌈log₂ N⌉ steps, where N is the number of nodes in the network. The SLD algorithm requires approximately ⌈log₂(N/4)⌉ = ⌈log₂ N⌉ − 2 message-passing steps to deliver the message to the dominating nodes. An additional step is required to complete the broadcast, bringing the total number of steps to ⌈log₂ N⌉ − 1. Although the SLD algorithm requires one less step than the U-mesh algorithm used alone, the advantage may not be worth the added complexity of the algorithm. This observation led us to investigate a different approach to broadcast that relies entirely on the concept of dominating nodes.

3.3 EDN Broadcast in 2D Mesh Networks

A natural extension to the previous approach is to find some subset of the dominating nodes that are capable of sending the message to the other dominating nodes in a single message-passing step and without channel contention. Once each of these nodes has a copy of the message, the broadcast can be completed in two additional steps.

3.3.1 Extended Dominating Sets

The method described in this section applies such a step recursively until a very small set of nodes, which will be sent the message directly from the source, remains to begin the broadcast process. This approach broadens the concept of node domination by accounting for the distance-insensitivity of wormhole routing. We formally define the concept of an extended dominating set as follows.

Definition 2. Consider an all-port wormhole-routed direct network with node set V, a routing algorithm R, a set D1, D1 ⊂ V, and a set D2, D2 ⊂ D1.
The set D2 is said to be an extended dominating set (EDS) of D1 if and only if there exists a set of edge-disjoint paths P under R such that, for every node v in D1 − D2, there exist a node x, x ∈ D2, and a path p, p ∈ P, from x to v.

Figure 3.4 illustrates four examples of extended domination in an 8 × 4 2D mesh network under XY routing. In each of the four figures, the 8 shaded nodes (two different shades are used) constitute a dominating set D1 of all nodes in the mesh, and 2 of those 8 nodes form an extended dominating set D2 of D1. On an all-port architecture and using XY routing, the nodes in D2 can transmit messages, as shown, to the other nodes in D1 in a single step without channel contention.

Figure 3.4: Examples of extended dominating sets in 8 × 4 2D meshes.

In order to model multiple steps of a broadcast algorithm, we need to formally define the concept of levels of extended dominating sets. Hence, an EDS of a set of nodes X in a direct network will also be referred to as a level-1 EDS of X. For completeness, a level-0 EDS of a set X is just X itself.

Definition 3. Consider an all-port wormhole-routed direct network with node set V, a routing algorithm R, a set X ⊂ V, and a set D_i ⊂ X. The set D_i is a level-i EDS of X if and only if D_i is a level-1 EDS of a level-(i−1) EDS of X.

Given Definitions 2 and 3, the problem of performing efficient broadcast from a source node s in a wormhole-routed direct network with node set V can be stated as follows: find a collection of multi-level EDSs, D_i, i = 0, ..., t, such that {s} is a level-t EDS of V and t is minimum among all such collections. Once such a collection of sets has been found, broadcast can be performed from node s in a minimum number of steps, while avoiding contention among the unicast messages transmitted in each step.

Although numerous methods could be used to perform this task, we sought a method that will not only achieve low latency, but will also be simple to implement and applicable to meshes of different shapes and sizes. Specifically, it is desirable that the pattern of dominating nodes at each level be "regular," so that the algorithm can be implemented easily and executed quickly. The generality of the approach is important because, while determining the sets of multi-level dominating nodes for a physical architecture need be carried out only once, in some systems a given application may be allocated only a subset of the nodes on the network. If these nodes are organized as a (square or rectangular) submesh, then the application would benefit from a library broadcast routine that accommodates a large variety of possible submesh shapes and sizes.

3.3.2 Systematic Approach to Finding Multi-Level Extended Dominating Sets

Initially, we describe the proposed method in terms of meshes of size 2^k × 2^k; later, we show how the approach can be applied in other square and rectangular meshes. An optimal four-node dominating set for a 4 × 4 mesh was shown in Figure 3.2(d). Suppose that we wish to find a dominating set for an 8 × 8 mesh. Figure 3.5(a) depicts a minimum dominating set, comprising 16 nodes, for such a mesh. Figure 3.5(b) shows another dominating set for the 8 × 8 mesh, constructed by simply combining four solutions to the 4 × 4 problem. For each mesh, a level-2 dominating set comprising four nodes can be used to reach the level-1 dominating nodes.
The advantage of the regular pattern in Figure 3.5(b) is that the number and pattern of messages transmitted by each level-2 dominating node are identical. That is to say, the four level-2 nodes can complete the broadcast in two steps using either pattern, but the actions are more regular for the pattern shown in Figure 3.5(b).

Figure 3.5: Level-1 and level-2 dominating nodes in an 8 × 8 mesh. (a) 4-node EDS; (b) 4-node regular EDS.

Using such regular patterns is advantageous in finding broadcast patterns for larger meshes. In fact, by treating 4 × 4 meshes and their dominating sets as building blocks, we can construct level-2 dominating sets for any mesh of size 2^k × 2^k for any k ≥ 2. The problem becomes how to recursively find higher-level dominating nodes. In order to develop a regular approach to this problem, the orientation of the 4 × 4 dominating nodes must be considered. There are essentially two different configurations for (non-extended) dominating nodes in a 4 × 4 mesh, as shown in Figure 3.6. We will refer to these patterns as A and B, respectively. Pattern B is the same as that shown earlier in Figure 3.2(d), and pattern A is the mirror image of pattern B across the vertical axis.

Figure 3.6: Two basic configurations for dominating nodes in a 4 × 4 mesh. (a) pattern A; (b) pattern B.

By combining four instances of patterns A and B in all possible ways, we can construct 16 different configurations for an 8 × 8 mesh. Among these 16 configurations, we have selected two to demonstrate the construction of higher-level dominating sets. As illustrated in Figure 3.7, pattern C places two A patterns on the left and two B patterns on the right, while pattern D places two A patterns on the right and two B patterns on the left. Patterns C and D are mirror images of one another, this time with respect to the horizontal axis.

Figure 3.7: Two configurations of synthetic dominating nodes in an 8 × 8 mesh.

It is straightforward to find four level-2 dominating nodes for 8 × 8 meshes with either pattern C or pattern D at level 1. Figure 3.8 illustrates four level-2 dominating nodes for each pattern. Under XY routing on an all-port architecture, the level-2 nodes can transmit messages to all of the level-1 nodes in a single message-passing step.

Figure 3.8: Level-2 dominating nodes sending to level-1 dominating nodes in 8 × 8 meshes.

Figure 3.9 depicts the message passing from level-3 dominating nodes to level-2 dominating nodes in a 16 × 16 mesh. Since the pattern is formed by placing two C patterns above two D patterns, the reader will notice that the right half of the mesh is a mirror image of the left half, and that the lower half is a mirror image of the top half. Again, under XY routing, the message passing between level-3 nodes and level-2 nodes can be completed in a single step.

This systematic approach can be applied to any mesh of size 2^k × 2^k. Essentially, the broadcast "tree" is constructed in a bottom-up manner. A dominating set of all nodes in the network is found and placed at level 1 of the tree; all non-dominating nodes are placed at level 0. A dominating set of the level-1 nodes is found and placed at level 2, and so on. This process continues until four nodes are placed at level (k − 1).
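The bottom-up construction fixes the size of every level in advance: a 2^k × 2^k mesh holds 4^k nodes, each dominating set is roughly one-fourth of the level below it, and the recursion stops with four nodes at level k − 1. The sketch below tabulates this arithmetic together with the resulting step count (up to two startup steps plus k − 1 level-to-level steps, per the analysis in Section 3.3.4); the actual node positions come from the mirrored patterns above.

    #include <stdio.h>

    int main(void)
    {
        for (int k = 2; k <= 5; k++) {
            long nodes = 1L << (2 * k);           /* N = 4^k           */
            printf("mesh %2d x %-2d:", 1 << k, 1 << k);
            for (long level = nodes; level >= 4; level /= 4)
                printf(" %ld", level);            /* level 0, 1, ...    */
            printf(" nodes per level; at most %d steps\n", (k - 1) + 2);
        }
        return 0;
    }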
The resulting algorithm, which we call the EDN (extended dominating nodes) algorithm, consists of two distinct phases. In the startup phase, the source node sends the message to the four dominating nodes at the highest level. In the second phase, the message is iteratively passed down the tree to lower-level dominating nodes until, after the last step, all nodes in the network have received a copy of the message.

Figure 3.9: Level-3 dominating nodes sending to level-2 dominating nodes in a 16 × 16 mesh.

The method used by the source node to deliver the message to the highest-level dominating nodes, as well as the number of message-passing steps required, depends on the position of the source relative to those nodes. Figure 3.10 illustrates two examples of the startup phase in an 8 × 8 mesh. In Figure 3.10(a), the source node sends the message to two of the level-2 dominating nodes in the first step, each of which forwards the message to another level-2 dominating node in the second step. The reader will notice that each of the level-2 dominating nodes that received the message in step 1 can also proceed to forward the message to two of the level-1 dominating nodes in step 2. Since the architecture is assumed to be all-port, an overlapping of steps is permitted, thus allowing some level-0 nodes to receive the message before the last step. In Figure 3.10(b), since all the level-2 dominating nodes are to the right of the source node, under XY routing the source node can transmit to only one of them in the first step. At the end of the second step, all level-2 dominating nodes (and some level-1 dominating nodes) will have received the message.

Figure 3.10: Startup message passing in 8 × 8 meshes. (a) two level-2 nodes reached in step 1; (b) one level-2 node reached in step 1.

As mentioned earlier, the multilevel dominating set approach to broadcast can be applied to square meshes whose widths are not powers of two, as well as to rectangular meshes. Of course, for mesh sizes other than powers of two, the basic building blocks differ from the 4 × 4 meshes shown in Figure 3.6. Figure 3.11 illustrates basic building blocks of size 5 × 5, 6 × 6, and 7 × 7, which were found using the method [41] illustrated in Figure 3.2. The performance of each is summarized in the next subsection.

Figure 3.11: Basic building blocks for meshes not based on powers of two. (a) 5 × 5 mesh; (b) 6 × 6 mesh; (c) 7 × 7 mesh.

In each of the cases described thus far, we have shown only one or a small number of basic building blocks. One disadvantage of having a fixed pattern for all broadcasts is that, in the presence of a large amount of broadcast traffic, system resource usage may become imbalanced. However, it turns out that the patterns presented here are not the only ones that suffice for broadcast. On the other hand, if broadcast were expected to be a relatively infrequent, but time-critical, operation, the proposed methods could be supported in hardware.

3.3.3 Theoretical Performance

As mentioned earlier, we evaluate the performance of EDN both from a graph-theoretical and from a systems perspective.
Since several other projects evaluate the performance of multi-port collective algorithms by counting message-passing steps [21, 75, 111, 113, 118], we first assess algorithms in this manner. As in these related projects, it is assumed that a k-port node can send (receive) k messages simultaneously; that is, serialization of startup latencies is not reflected. In order to predict the performance of the EDN algorithms on actual systems, however, we also conducted simulation studies. The simulations account for sequential startups, as well as other system conditions and parameters.

We have calculated the number of message-passing steps required for EDN broadcast in square meshes of width n = 4 × 2^k, 5 × 2^k, 6 × 2^k, and 7 × 2^k. The total number of steps consists of two parts, corresponding to the startup phase and the message passing among multiple levels of EDNs. Tables 3.1 through 3.4 give these values. The number of steps needed in the startup phase depends on the position of the source node relative to the highest-level EDNs.

Table 3.1: EDN steps for n = 4 × 2^k. Table 3.2: EDN steps for n = 5 × 2^k.
Table 3.3: EDN steps for n = 6 × 2^k. Table 3.4: EDN steps for n = 7 × 2^k.

In order to evaluate the EDN broadcast algorithm, we compare its performance to that of an algorithm based on recursive doubling (RD) and to a lower bound. The RD algorithm [102, 103] was designed for the general case of multicast in one-port n-dimensional meshes, and can also be used for the special case of broadcast. Figure 3.12 illustrates the operation of the RD algorithm on a multicast problem in a 6 × 6 2D mesh. The RD algorithm first sorts the source and destination addresses lexicographically into a list, denoted Φ. The source node successively divides Φ in half. If the source itself is in the lower half, then it sends a copy of the message to the first node in the upper half. That node will be responsible for delivering the message to the other nodes in the upper half, by invoking the same algorithm recursively. If the source is in the upper half, then it sends a copy of the message to the last node in the lower half. The source continues this procedure until Φ contains only its own address.

Figure 3.12: RD multicast solution in a 6 × 6 mesh. (a) multicast example in a 6 × 6 mesh; (b) U-mesh multicast tree solution.

The RD algorithm doubles the number of nodes that hold the message in each step, and therefore is optimal on one-port architectures, requiring ⌈log₂ N⌉ steps to complete a broadcast operation in an N-node network. Although the RD algorithm can be implemented on a multi-port architecture, it will often fail to take advantage of that architectural property, resulting in more message-passing steps than necessary. In fact, the number of steps is still ⌈log₂ N⌉.

A simple lower bound on the number of message-passing steps needed for broadcast in a three-port 2D mesh is computed as follows. If, in every time step, every node that already holds a copy of the message is capable of forwarding the message to three other nodes, then the number of nodes that have received the message after time step t is 4^t. Given a mesh of size n × n, this lower bound on the time complexity is derived to be T ≥ ⌈log₄(n²)⌉.
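The two step-count formulas just given are easy to tabulate. The sketch below compares ⌈log₂(n²)⌉ (RD) against ⌈log₄(n²)⌉ (the three-port lower bound) using exact integer arithmetic; it reproduces the gap visible in Figure 3.13, but says nothing about the EDN curves, which depend on the position of the source.

    #include <stdio.h>

    /* Smallest integer s with base^s >= x, i.e., ceil(log_base(x)). */
    static int ceil_log(long x, int base)
    {
        int steps = 0;
        for (long v = 1; v < x; v *= base)
            steps++;
        return steps;
    }

    int main(void)
    {
        printf("%8s %4s %12s\n", "width n", "RD", "lower bound");
        for (long n = 4; n <= 256; n *= 2)
            printf("%8ld %4d %12d\n", n,
                   ceil_log(n * n, 2), ceil_log(n * n, 4));
        return 0;
    }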
Figure 3.13 plots the number of steps needed to perform broadcast in n × n meshes, for 1 < n ≤ 256. The number of nodes in the mesh varies from 1 to 65,536. The number of steps required by the RD algorithm is equal to ⌈log₂(n²)⌉, regardless of the location of the source node. The number of steps required by the EDN algorithm depends on the location of the source node, but the maximum number of steps is at most one more than the minimum number. As shown in Figure 3.13, in many cases the minimum number of EDN steps matches the lower bound of ⌈log₄(n²)⌉, and in some cases even the maximum number of EDN steps achieves the lower bound.

Figure 3.13: Comparison of EDN steps with RD and lower bounds (2D mesh); the plot shows RD, EDN maximum, EDN minimum, the three-port lower bound, and the all-port lower bound versus mesh width n (number of nodes = n × n).

3.3.4 Analysis

In order to better understand the performance of these two algorithms, we develop an analytical model. Let α represent the message sending latency and γ the receiving latency. Let L be the message length, and let β denote the per-byte transmission time. If the message is long enough that we can ignore the number of hops traversed by each message, then we can approximate the latency of a single message by α + βL + γ.

In a 2^k × 2^k mesh, the RD broadcast algorithm consists of 2k message-passing steps. In each step, every node holding the message sends it to one other node. Therefore, the total time to execute the algorithm can be approximated by:

    t_RD = 2k(α + βL + γ).

In each message-passing step of the EDN broadcast algorithm, other than the first two steps, a parent node sends three messages to its children. Again ignoring the number of hops traversed by each message, the time for this operation is approximated by 3α + βL + γ. For a 2^k × 2^k mesh, k − 1 such steps are required to broadcast from the four highest-level EDNs, and two steps are needed to deliver the data from the source node to the four highest-level EDNs. Depending on the configuration (see Figure 3.10), one of the first two steps requires time α + βL + γ and the other requires time 2α + βL + γ. The total time for the EDN broadcast algorithm is approximated by:

    t_EDN = (k − 1)(3α + βL + γ) + 3α + 2βL + 2γ
          = 3kα + (k + 1)γ + (k + 1)βL.

Comparing these two approximations, we see that they differ in the coefficients of α, βL, and γ. If we assume that sending and receiving latencies are approximately equal (α ≈ γ), then the sum of the α and γ terms is (4k + 1)α for EDN and 4kα for RD. In the RD algorithm, the coefficient of L is 2kβ, while in the EDN algorithm it is (k + 1)β. Combining these two results, the EDN algorithm can be expected to perform better than the RD algorithm when the message length L is greater than α/((k − 1)β).

3.3.5 Simulation Study

The previous two sections evaluated the performance of the EDN algorithm in terms of the number of message-passing steps and an analysis based on sending and receiving overheads and per-byte transmission time. When implemented on an MPC, other factors, such as path length and channel contention, affect the latency of unicast-based collective operations. In order to compare the algorithms while accounting for such system characteristics, a simulation study was conducted.
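Before turning to the simulator, the approximations of Section 3.3.4 can be checked numerically. The sketch below evaluates t_RD, t_EDN, and the crossover length α/((k − 1)β) for a 32 × 32 mesh (k = 5), using nCUBE-2-like parameters (0.45 μs per byte and a 170 μs combined startup, split evenly between sending and receiving); the exact values are placeholders for whatever a target machine measures.

    #include <stdio.h>

    int main(void)
    {
        double alpha = 85.0, gamma = 85.0;  /* usec send/receive (assumed)  */
        double beta = 0.45;                 /* usec per byte (nCUBE-2-like) */
        int k = 5;                          /* 32 x 32 mesh: 2^5 x 2^5      */
        printf("EDN expected to win for L > %.1f bytes\n",
               alpha / ((k - 1) * beta));
        for (double L = 32; L <= 2048; L *= 2) {
            double t_rd  = 2 * k * (alpha + beta * L + gamma);
            double t_edn = 3 * k * alpha + (k + 1) * gamma + (k + 1) * beta * L;
            printf("L = %5.0f  RD = %8.1f  EDN = %8.1f usec\n", L, t_rd, t_edn);
        }
        return 0;
    }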
As part of an earlier project, we developed a simulation tool called MultiSim [100] for the study of large-scale multiprocessors. MultiSim is based on an event-driven simulation package, CSIM [127], which allows multiple pseudo-processes to execute in a quasi-parallel fashion and provides a convenient interface for writing modular simulation programs. MultiSim is designed to simulate wormhole-routed systems efficiently.

MultiSim was used to simulate broadcast operations in three-port meshes of different sizes. For each mesh size, up to 100 different source nodes were selected randomly. For a given broadcast operation, the average broadcast latency is the mean of the broadcast latencies over all the destinations. The maximum broadcast latency is the largest broadcast latency among all destinations, that is, the time between when the source sends the message and when the last destination receives it. Both values may be of interest, depending on how broadcast is used in a particular parallel application. Since EDN is assumed to be implemented in software, an intermediate node must fully receive the message before it forwards it to other node(s), and sending latency is incurred for each message transmitted. The computation time of a broadcast operation at an intermediate node is small (typically a table lookup determining the nodes to which to forward the message) and is assumed to be part of the startup latency.

We first studied the performance of EDN broadcast on a multi-port 2D mesh with the latency characteristics of the only commercial multi-port system, the nCUBE-2 hypercube. Specifically, the simulated channel rate is 0.45 microseconds per byte, and the combined (sending and receiving) startup latency is 170 microseconds. As with all other simulations presented here, sending latencies are simulated sequentially. For example, if a node is sending two messages in succession, the sending latencies do not overlap, even if the messages are transmitted on different ports and external channels. (The same is true of receiving latency, though this property does not affect the broadcast results.) Therefore, even though the system has a multi-port architecture, some staggering will occur among successive messages. This characteristic is consistent with our experience on the nCUBE-2 [99].

Figure 3.14: Comparison of broadcast algorithms in an 8 × 8 mesh (nCUBE-2 parameters). (a) maximum broadcast latency; (b) average broadcast latency.

Figure 3.14 compares the maximum and average broadcast latencies of the two broadcast algorithms (EDN and RD) in an 8 × 8 mesh, and Figure 3.15 compares the algorithms in a 32 × 32 mesh. The message size is varied from 32 bytes to 2048 bytes. The advantage of the EDN algorithm is significant, and the advantage is greater in the larger mesh. In fact, inspection of the plots reveals that the maximum broadcast latency of the EDN algorithm in a 32 × 32 mesh is approximately equal to that of the RD algorithm in an 8 × 8 mesh.

We also wanted to evaluate the EDN algorithm on a newer architecture, since the nCUBE-2 architecture is now several years old. In the second set of tests, the values of the latency parameters are consistent with the Cray T3D.
Specifically, the simulated channel rate is 300 MBytes/sec (0.0033 microseconds per byte, per hop) and the combined sending and receiving latency is 1.5 microseconds. Figures 3.16 and 3.17 compare the algorithms in 8 × 8 and 32 × 32 meshes, respectively. Again, the performance advantage of the EDN algorithm is significant.

Figure 3.15: Comparison of broadcast algorithms in a 32 × 32 mesh (nCUBE-2 parameters). (a) maximum broadcast latency; (b) average broadcast latency.

Figure 3.16: Comparison of broadcast algorithms in an 8 × 8 mesh (T3D parameters). (a) maximum broadcast latency; (b) average broadcast latency.

Figure 3.17: Comparison of broadcast algorithms in a 32 × 32 mesh (T3D parameters). (a) maximum broadcast latency; (b) average broadcast latency.

Although the network latency in wormhole routing is relatively insensitive to the distance between the source and destination nodes, the total path lengths of the messages involved in a collective operation indicate the amount of communication resources consumed by the operation. A collective operation that consumes few resources may be less likely to negatively affect other network traffic. The link usage of the EDN algorithm is lower than that of the RD algorithm, due to the divide-and-conquer nature of EDN. In the last step of EDN broadcast, for example, a full three-fourths of all the nodes receive the message from an immediate neighbor. In a 32 × 32 mesh, the average path length of a message in EDN is 1.86 hops, whereas the average for RD is 4.16 hops.

3.4 EDN Broadcast in 3D Mesh Networks

We illustrate the application of the EDN approach to 3D mesh networks by way of an extended example. We first consider EDN patterns in 2D meshes, upon which the 3D mesh solutions are based. In a 4 × 4 network, several different level-1 EDN configurations can be found, each of which can serve as a building-block pattern in larger networks. Figures 3.18(a) and (b) show four level-1 EDN configurations and their abstract representations, respectively. In the rest of this section, we demonstrate a systematic approach for constructing multi-level EDNs for 3D meshes by starting from these four configurations.

Figure 3.18: Different EDNs in 4 × 4 meshes.

We begin with a 4 × 4 × 4 3D mesh network. Using any permutation of the above four configurations, one configuration in each of the four planes, gives a 16-node level-1 EDS for the entire network. Figure 3.19 shows one such configuration (for clarity, not all connections in the z dimension are shown). The configuration used in Figure 3.19 possesses an interesting feature, as shown in Figure 3.20. Under XYZ routing, these 16 level-1 EDNs can together dominate an additional 4 × 4 plane. Therefore, the number of level-1 EDNs for a 4 × 4 × 5 network is the same as that of a 4 × 4 × 4 network.
Figure 3.19: 16-node level-1 EDNs in a 4 × 4 × 4 mesh.

Figure 3.20: Combining EDN configurations to cover an additional plane.

Next, we seek a set of level-2 EDNs that can reach the other level-1 EDNs in a single message-passing step and without channel contention. A set of four such nodes is shown in Figure 3.21(a). Also shown in the figure are the routes taken by the messages sent to the twelve other level-1 nodes; all messages follow XYZ paths. Given this configuration of the level-2 EDNs, any source node can deliver the message to them in two startup steps. Therefore, broadcast in a 4 × 4 × 5 network can be completed in four message-passing steps.

Figure 3.21: Level-2 EDNs in a 4 × 4 × 5 network. (a) communication between level-2 and level-1 EDNs in a 4 × 4 × 5 network; (b) representation of the level-2 basic unit.

3.4.1 Recursion in the XY Plane

We can represent the configuration of the four level-2 EDNs as shown in Figure 3.21(b). This particular configuration plays an important role in the recursive construction of multi-level EDNs in larger 3D mesh networks, and we refer to it as a basic unit. By combining four basic units as shown in Figure 3.22(a), we define the 16 level-2 EDNs of a larger network with four times the number of total nodes. By seeking a basic unit among these sixteen level-2 EDNs, as illustrated in Figure 3.22(b), four level-3 EDNs are defined. The paths followed by messages from level-3 EDNs to level-2 EDNs are shown in Figure 3.22(b). The twelve other nodes are reached while avoiding contention among the messages. As a basic unit, the orientation of the four level-3 EDNs is the same as the representation in Figure 3.21(b), with the pattern simply reversed in the Z dimension.

Figure 3.22: Level-3 EDNs in an 8 × 8 × 5 network. (a) 16 level-2 EDNs in an 8 × 8 × 5 network; (b) communication between level-3 EDNs and the remaining level-2 EDNs; (c) level-3 basic unit.

This method can be applied recursively to larger networks. In general, if the basic unit represents the highest-level EDNs in a network of size a × a × b, then four such basic units can be combined, and a new basic unit is found that represents the highest-level EDNs in a network of size 2a × 2a × b. Specifically, this basic unit can represent the highest-level EDNs in networks of size 4 × 4 × 4, 8 × 8 × 4, 16 × 16 × 4, ..., (4·2^k) × (4·2^k) × 4, and 4 × 4 × 5, 8 × 8 × 5, 16 × 16 × 5, ..., (4·2^k) × (4·2^k) × 5, where k = 0, 1, 2, and so on.

3.4.2 Recursion along the Z-axis

By positioning three basic units along the z-axis, as shown in Figure 3.23(a), we can create a network three times the original size. In Figure 3.23(b), we can see that 4 nodes in the middle section can serve as the highest-level EDNs in this 8 × 8 × 15 network. Again, the basic unit of the level-4 EDNs in this network is the same as that of the level-3 EDNs in the 4 × 4 × 5 network, as shown in Figure 3.22(c). This approach can be extended to larger networks. In general, if the basic unit represents the highest-level EDNs in an a × a × b network, then after the operation described above, the same basic unit can be used to represent the highest-level EDNs in an a × a × 3b network. Combined with the earlier result, this basic unit can represent the highest-level EDNs in networks of size (4·2^k) × (4·2^k) × (4·3^m) and (4·2^k) × (4·2^k) × (5·3^m), where k = 0, 1, 2, ... and m = 0, 1, 2, and so on. The techniques described here generalize to meshes of other shapes by using different basic units.
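The two recursions together determine exactly which network sizes are covered and, anticipating the step-count formula k + m + 4 derived in Section 3.4.3, how many message-passing steps each size needs. A small enumeration sketch:

    #include <stdio.h>

    int main(void)
    {
        for (int k = 0; k <= 2; k++) {
            for (int m = 0; m <= 2; m++) {
                int side = 4 << k;                 /* 4 * 2^k            */
                int z3 = 4, z5 = 5;                /* depths 4*3^m, 5*3^m */
                for (int i = 0; i < m; i++) {
                    z3 *= 3;
                    z5 *= 3;
                }
                printf("%2d x %2d x {%3d or %3d}: %d steps\n",
                       side, side, z3, z5, k + m + 4);
            }
        }
        return 0;
    }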
Figure 3.23: Example of concatenating networks in the Z direction. (a) combining three 8 × 8 × 5 networks along the z-axis to form an 8 × 8 × 15 network; (b) messages sent between level-4 EDNs and level-3 EDNs.

3.4.3 Theoretical Performance

The number of startup steps required to send the message from the source node to the four highest-level EDNs is two, regardless of the location of the source. This property follows from the orientation of the basic unit. The number of message-passing steps in networks of size (4×2^k) × (4×2^k) × (4×3^m) is (k + 2) + m + 2 = k + m + 4, where k ≥ 0 and m ≥ 0. The numbers of steps required in specific networks are listed in Table 3.5, in increasing order of the total number of nodes in the network.

As with the 2D case, we compare the number of message-passing steps of the EDN algorithm with that of the RD algorithm and a theoretical lower bound. For a 3D mesh with four output ports per node, a lower bound is ⌈log₅ N⌉. Figure 3.24 compares the maximum broadcast latency (in message-passing steps) of the EDN algorithm and the RD algorithm for 3D meshes of specific sizes up to 5000 nodes. The advantage of the EDN algorithm over RD is clear; the performance of the EDN algorithm is within a small number of steps of the lower bound.

Table 3.5: Theoretical EDN performance in 3D mesh networks

    Network size    Total nodes   Levels of EDNs   Total steps
    4 x 4 x 4             64            2               4
    4 x 4 x 5             80            2               4
    4 x 4 x 12           192            3               5
    4 x 4 x 15           240            3               5
    8 x 8 x 4            256            3               5
    8 x 8 x 5            320            3               5
    4 x 4 x 36           576            4               6
    4 x 4 x 45           720            4               6
    8 x 8 x 12           768            4               6
    8 x 8 x 15           960            4               6
    16 x 16 x 4         1024            4               6
    16 x 16 x 5         1280            4               6
    4 x 4 x 108         1728            5               7
    4 x 4 x 135         2160            5               7
    8 x 8 x 36          2304            5               7
    8 x 8 x 45          2880            5               7

Figure 3.24: Comparison of EDN steps with RD and the lower bound (3D mesh).

3.4.4 Analysis

Following the same notation as the previous analysis of 2D broadcast and combine, this section provides an analytical comparison of the performance of the two algorithms. In a (4×2^k) × (4×2^k) × (4×2^m) mesh, the RD broadcast algorithm consists of 2k + m + 6 message-passing steps, with each receiving node combining the incoming vector with its local data. Therefore, the total time to execute the algorithm can be approximated by:

    t_RD = (2k + m + 6)(α + βL + γ)
         = (2k + m + 6)α + (2k + m + 6)γ + (2k + m + 6)βL.

Although the number of message-passing steps of the EDN algorithm is the same for (4×2^k) × (4×2^k) × (4×2^m) and (4×2^k) × (4×2^k) × (4×3^m) meshes, it is more convenient in the case of the RD algorithm to analyze the (4×2^k) × (4×2^k) × (4×2^m) mesh. In the first two steps of the EDN broadcast algorithm, a total time of 3α + 2βL + 2γ is needed for the source to reach the four highest-level EDNs. Proceeding along the Z direction, another m message-passing steps are needed, and in each step two messages are sent out from each node; thus, a maximum time of m(2α + βL + γ) is required. The broadcast algorithm is then executed on each XY plane, where k + 2 steps are needed. In the first k + 1 steps, 3 messages are sent from each node holding the message, and in the last step, at most 4 messages are sent from an EDN.
The total time for the EDN broadcast algorithm is approximated by:

    t_EDN = (3α + 2βL + 2γ) + m(2α + βL + γ) + (k + 1)(3α + βL + γ) + (4α + βL + γ)
          = (3k + 2m + 10)α + (k + m + 4)γ + (k + m + 4)βL.

Again, if we assume that sending and receiving latencies are approximately equal (α ≈ γ), then combining these two approximations shows that the EDN algorithm is expected to perform better than the RD algorithm when the message length L is greater than (m + 3)α / ((k + 2)β).

3.4.5 Simulation Study

The MultiSim simulator was used to study the performance of EDN broadcast in 3D mesh networks under various system conditions. Two different mesh sizes were simulated. For each mesh size, different source nodes were selected randomly. The channel rate parameter is again consistent with the T3D, namely, 300 MBytes/sec. Different values of sending and receiving latencies, which together constitute the startup latency, were used. As in the other simulations, sending (and receiving) latencies are simulated serially.

Figure 3.25(a) compares the broadcast latency of the EDN algorithm and the RD algorithm for meshes of size 4 × 4 × 4 and 8 × 8 × 4. The message size is varied from 32 bytes to 2048 bytes. In this plot, the startup latency is 1.5 microseconds (0.75 for sending and 0.75 for receiving). The advantage of the EDN algorithm is approximately 30 percent in the larger mesh and 27 percent in the smaller mesh. One may conclude that for tasks such as distribution of initial data and replication of arrays, the EDN algorithm is preferable to a recursive doubling approach. Even though the latter will inadvertently take advantage of a multi-port architecture, the EDN approach achieves better performance due to increased parallelism among the constituent messages.

Figure 3.25: Comparison of EDN broadcast with RD broadcast in 3D meshes (simulation). (a) startup latency of 1.5 μsecs; (b) startup latency of 0.15 μsecs.

Reducing startup latency is a major goal of many ongoing research projects. For comparison, Figure 3.25(b) plots the broadcast latency of the EDN algorithm and the RD algorithm in the same meshes, except that the startup latency is decreased by an order of magnitude. Specifically, the sending latency is 0.075 microseconds, as is the receiving latency. Again, the message size is varied from 32 bytes to 2048 bytes. The broadcast times of both algorithms are reduced, but the relative advantage of the EDN algorithm increases (36 percent and 32 percent, respectively, in the large and small meshes). A lower startup latency increases the overlap among messages sent from the same node, and the EDN algorithm is better able to exploit this characteristic.

3.5 Reduction and Gather Operations

Besides broadcast, the EDN approach can be applied to collective operations such as reduction and gather [104].
In reduction, a commutative and associative operation, such as maximum, is performed on a set of data items that may be distributed across network nodes. In gather, one process receives a message from each of a group of processes. These operations, also known as global combine, have been studied for one-port meshes, for which at least ⌈log₂ N⌉ message-passing steps are needed to complete the operation [159, 15]. EDN-based algorithms are designed to reduce this time by taking advantage of all-port network interfaces.

Since the communication patterns of such operations are typically many-to-one, it may appear that an EDN broadcast tree, described above, could be used to implement such operations with the direction of message transmission reversed. The restriction of XY routing prevents this approach from achieving good performance, however, because it results in contention among the unicast messages. Specifically, the children of some parent nodes cannot all transmit to that parent at the same time, because the messages require a common channel. The solution lies in rotating the pattern of EDN nodes by 90 degrees clockwise or counterclockwise, as shown in Figure 3.26. In the first step, all level-0 nodes send the message to their neighbor that is an EDN. After three steps, the four level-3 EDNs have received the data from their children. Two more completion steps are required to deliver the data to the root node.

Figure 3.26: First three steps of reduction in a 16 × 16 2D mesh. (a) step 1, nearest neighbor; (b) step 2; (c) step 3.

Like broadcast, the total number of steps in EDN-based reduction consists of two parts: the steps in which messages are passed up the reduction tree, and the completion steps. Tables 3.6 and 3.7, respectively, give these values for n = 4 × 2^k and n = 5 × 2^k, where the number of network nodes is N = n × n. The values for n = 6 × 2^k and n = 7 × 2^k, though not given, are the same as for n = 5 × 2^k. In fact, although the number of steps in the reduction tree and in the completion process differ among the four mesh types, the totals are the same for all types.

Table 3.6: EDN reduction steps for n = 4 × 2^k. Table 3.7: EDN reduction steps for n = 5 × 2^k.

3.6 Matrix Transposition

We can also apply the EDN approach to the problem of transposing a square m × m matrix. We assume that the elements of the matrix are distributed evenly among a 2D mesh processor network of size N = n × n, with processor p(i,j) holding a submatrix block of size m/n × m/n, for 0 ≤ i, j ≤ n − 1. (In terms of High Performance Fortran [71], the matrix is decomposed in a (block, block) manner.) Suppose that b = m/n is an integer. In a direct implementation of transposition, depicted in Figure 3.27 for an 8 × 8 mesh, processor p(i,j) sends a message of length b² to processor p(j,i), for 0 ≤ i, j ≤ n − 1. Processors located on the major diagonal already hold the correct data, and do not send or receive any messages. Therefore, a total of n² − n unicast messages are transmitted. Although all the messages are theoretically transmitted in one step, under XY routing channel contention is a major drawback of this approach. As depicted in Figure 3.27, only two of the messages do not have to compete for at least one channel. In the EDN approach to transposition, although multiple message-passing steps are involved, contention among messages can be largely avoided, thereby reducing the latency of the operation.
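The contention claim is easy to reproduce. The sketch below routes every direct-schedule message p(i,j) → p(j,i) with XY routing on an 8 × 8 mesh and tallies the load on each directed channel; the coordinate convention (i as the x coordinate) is an illustrative assumption.

    #include <stdio.h>
    #include <string.h>

    #define N 8
    static int load[N][N][4];   /* per-node outgoing links: +x, -x, +y, -y */

    static void route_xy(int x1, int y1, int x2, int y2)
    {
        while (x1 != x2) {      /* travel in x first, then y (XY routing) */
            load[x1][y1][x2 > x1 ? 0 : 1]++;
            x1 += x2 > x1 ? 1 : -1;
        }
        while (y1 != y2) {
            load[x1][y1][y2 > y1 ? 2 : 3]++;
            y1 += y2 > y1 ? 1 : -1;
        }
    }

    int main(void)
    {
        memset(load, 0, sizeof load);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j)
                    route_xy(i, j, j, i);     /* p(i,j) -> p(j,i) */
        int max = 0;
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                for (int d = 0; d < 4; d++)
                    if (load[x][y][d] > max)
                        max = load[x][y][d];
        printf("messages: %d, worst channel shared by %d messages\n",
               N * N - N, max);
        return 0;
    }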
Figure 3.27: Channel contention problem in the direct approach to matrix transposition. (a) low to high; (b) high to low.

The basic EDN pattern used for transposition is shown in Figure 3.28(a) for the 16 × 16 mesh. In order to implement transposition, the mesh is "divided" into 4 × 4 blocks. In the first message-passing step, nodes within each 4 × 4 block send their data to the appropriate EDN in that block, as shown in Figure 3.28(b). In the second step, corresponding EDNs on the diagonals in separate blocks send messages as illustrated in Figure 3.28(c). The third and fourth steps are similar to the second and first, respectively, except that the direction of the messages is reversed. After four steps, the transpose operation is complete, and channel contention has been avoided.

Figure 3.28: EDN matrix transposition in a 16 × 16 mesh. (a) EDNs for transpose; (b) step 1; (c) step 2; (d) step 3.

We observe that the communication patterns in Figures 3.28(c) and (d) are identical to the communication pattern in Figure 3.28(b). That is, the "basic blocks" of this transpose pattern can be of size (4 × 4^j) × (4 × 4^j), for j ≥ 0. In general, transposition for a network of size 2^k × 2^k requires k message-passing steps. For networks of size 2^k × 2^k where k is even, this approach produces no channel contention in any communication step throughout the process of transposition. However, when we use the same approach for a network of size 2^k × 2^k where k is odd, channel contention is possible in one of the transposition steps. Some messages in the blocks along the major diagonal may begin transmission at the same time that other messages from off-diagonal blocks are being sent across the network to other off-diagonal blocks. The amount of contention, though, is small compared with the direct approach; far fewer messages are involved in channel contention than the n² − n − 2 contending messages of the direct approach.

In order to compare the transposition algorithms while accounting for such system characteristics, a simulation study was conducted using MultiSim [100]. Figures 3.29(a) and 3.29(b), respectively, compare the maximum and average transpose latencies of the EDN transpose algorithm and the direct approach for meshes of size 8 × 8, 16 × 16, and 32 × 32. The message size is varied from 128 bytes to 2048 bytes. The advantage of the EDN transpose algorithm is clear. In fact, the maximum latency of the EDN algorithm is less than one-third that of the direct approach in a 32 × 32 mesh, and the average transpose latency of the EDN algorithm is approximately one-half that of the direct approach. Thus, while the direct approach can theoretically be completed in a single message-passing step, the EDN approach, which involves k steps for a network of size 2^k × 2^k, can achieve better performance by reducing or avoiding contention among the constituent unicast messages.

Figure 3.29: Comparison of EDN transpose with direct transpose. (a) maximum latency; (b) average latency.
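To make the four-step schedule concrete, the sketch below traces one block's journey on a 16 × 16 mesh: gather to the block's EDN, the cross-diagonal exchange, and the reverse of the gather. The EDN offset inside each block, (EX, EY), is a placeholder; the actual positions are those of Figure 3.28(a).

    #include <stdio.h>

    #define B  4    /* block width */
    #define EX 1    /* assumed EDN offset inside a block (x) */
    #define EY 2    /* assumed EDN offset inside a block (y) */

    int main(void)
    {
        int x = 6, y = 13;                       /* source node (x, y)      */
        int bx = x / B, by = y / B;              /* its block coordinates   */
        int ex = bx * B + EX, ey = by * B + EY;  /* EDN of that block       */
        int tx = by * B + EX, ty = bx * B + EY;  /* EDN of transposed block */
        printf("step 1: (%d,%d) -> block EDN (%d,%d)\n", x, y, ex, ey);
        printf("steps 2-3: (%d,%d) -> (%d,%d)  (block (%d,%d) -> (%d,%d))\n",
               ex, ey, tx, ty, bx, by, by, bx);
        printf("step 4: (%d,%d) -> (%d,%d)  (the transposed position)\n",
               tx, ty, y, x);
        return 0;
    }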
3.7 Conclusions

In this chapter, a communication model has been described that uses the concept of dominating sets to efficiently implement broadcast in all-port wormhole-routed 2D and 3D mesh networks. The EDN algorithm takes advantage of the distance-insensitivity of wormhole routing, allowing a node to "dominate" another node even though the two nodes may not be neighbors in the network. In addition, the algorithm is designed to exploit the presence of multiple ports between each local processor and its router. Using the EDN algorithm, the number of message-passing steps was shown to be at most three steps above a theoretical lower bound in systems with up to 60,000 nodes. Simulation results confirm the advantages of the algorithm under typical system conditions.

Chapter 4

EDN Broadcasting in Torus Networks

In this chapter, we use the EDN model to develop broadcast algorithms for multi-port wormhole-routed torus networks. We first consider 2D networks, followed by 3D networks.

4.1 System Model

The systems under consideration in this chapter are similar to those discussed in Chapter 3. They use wormhole routing, dimension-ordered routing, and possess a multi-port architecture. However, the topologies of these systems are 2D or 3D tori. Formally, an n-dimensional torus has k_0 × k_1 × ⋯ × k_{n−2} × k_{n−1} nodes, with k_i nodes along each dimension i, where k_i ≥ 2 for 0 ≤ i ≤ n − 1. Each node x is identified by n coordinates, σ_{n−1}(x) σ_{n−2}(x) ... σ_0(x), where 0 ≤ σ_i(x) ≤ k_i − 1 for 0 ≤ i ≤ n − 1. Two nodes x and y are neighbors if and only if σ_i(x) = σ_i(y) for all i, 0 ≤ i ≤ n − 1, except one, j, where σ_j(x) ± 1 ≡ σ_j(y) (mod k_j). In this chapter, we consider only bidirectional tori, in which direct message transmission is possible in either direction between neighboring nodes.

Figures 4.1(a) and 4.1(b), respectively, show the physical links associated with a 2D unidirectional torus and a 2D bidirectional torus. Also shown in each figure are the paths taken by two example unicast messages, one from source node (0,0) to destination node (2,1), and another from source node (0,2) to destination node (3,1). The paths shown in Figure 4.1 result from dimension-ordered routing.

Figure 4.1: Examples of 2D 4 × 4 torus networks. (a) unidirectional torus; (b) bidirectional torus.

The second distinguishing property of these networks is that they use virtual channels [46] in order to prevent deadlock. Each virtual channel has its own flit buffer and control lines. For wormhole routing in bidirectional torus networks, three sets of (unidirectional) virtual channels are required: p-channels, l-channels, and h-channels [49]. The virtual channels along a single dimension, d, of a bidirectional torus with even width, k, are illustrated in Figure 4.2; the situation is similar when k is odd. We use the notation of Dally and Seitz [49]: c_{dαx} represents the virtual channel leaving node x in dimension d, in the virtual channel set α, where α ∈ {p, l, h}.

Figure 4.2: Virtual channels in one dimension of a bidirectional torus.

4.2 EDN Broadcast in 2D Torus Networks

An important step in the development of an EDN broadcast algorithm is the selection of the EDN configuration, that is, the pattern of EDNs that will be used.
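Since every pattern in this section is expressed in torus coordinates, the wraparound arithmetic is worth pinning down first. The sketch below tests the neighbor condition defined above for a 4 × 4 bidirectional torus; it is a direct transcription of the definition, not code from the dissertation.

    #include <stdio.h>
    #include <stdlib.h>

    #define DIMS 2
    static const int k[DIMS] = {4, 4};    /* radix per dimension */

    /* Neighbors agree in all coordinates except one, j, where they
     * differ by exactly 1 modulo k_j (wraparound links included). */
    static int is_neighbor(const int *x, const int *y)
    {
        int diff_dim = -1;
        for (int i = 0; i < DIMS; i++)
            if (x[i] != y[i]) {
                if (diff_dim >= 0)
                    return 0;             /* differ in two dimensions */
                diff_dim = i;
            }
        if (diff_dim < 0)
            return 0;                     /* same node */
        int d = abs(x[diff_dim] - y[diff_dim]);
        return d == 1 || d == k[diff_dim] - 1;
    }

    int main(void)
    {
        int a[DIMS] = {0, 0}, b[DIMS] = {3, 0}, c[DIMS] = {2, 1};
        printf("(0,0)-(3,0): %d  (wraparound link)\n", is_neighbor(a, b));
        printf("(0,0)-(2,1): %d\n", is_neighbor(a, c));
        return 0;
    }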
4.2 EDN Broadcast in 2D Torus Networks

An important step in the development of an EDN broadcast algorithm is the selection of the EDN configuration, that is, the pattern of EDNs that will be used. We might initially consider using the EDN configuration of a 5 x 5 torus, since it possesses the property that only five level-1 EDNs are required to dominate the other nodes, and that all five EDNs use their entire set of outgoing channels in doing so; see Figure 4.3. However, we observe that at least three message-passing steps are needed for broadcasting in a 5 x 5 torus under XY routing, since a single source node cannot reach the other 4 nodes in one step. We would rather base our algorithm on a more efficient "building block," such as a 4 x 4 torus. As shown in Figure 4.4, it is possible to perform broadcast in a 4 x 4 torus in two message-passing steps under XY routing. In the first step, the source delivers the message to three level-1 EDNs, labeled A, B, and C, which together with the source deliver the message to the remaining nodes in the second step. In this case, the source is a level-2 EDN of all the nodes in the network. There are actually many EDN configurations that could be used to implement two-step broadcast in a 4 x 4 torus.

Figure 4.3: Five dominating nodes in a 5 x 5 torus; all five level-1 EDNs use all of their outgoing channels.

The message-passing patterns shown in Figure 4.4 are particularly important because they can be used to design broadcast algorithms for larger networks as well. The pattern in Figure 4.4(a) is referred to as a Y-pattern, and the pattern in Figure 4.4(b) is referred to as a T-pattern.

(a) step 1: the source sends to three level-1 EDNs using the Y-pattern. (b) step 2: delivery to the remaining nodes by the T-pattern.
Figure 4.4: Two-step broadcast in a 4 x 4 torus.

Figure 4.5 shows an abstract representation of the two patterns, which are used to identify the EDNs at each level for a given network. The Y-pattern is used by a level-t EDN to send a message to three level-(t-1) EDNs. The T-pattern is used by four level-(t-1) EDNs to send a message to twelve level-(t-2) EDNs. In a torus, the paths may traverse wraparound channels. Associated with each instantiation of a Y or T pattern is a parameter delta, which represents the relative distance between the source and destination of each message. Also, since the torus is a symmetric topology, exactly the same patterns can be used regardless of the location of the source node.

(a) Y-pattern (b) T-pattern
Figure 4.5: Y-pattern and T-pattern.

Given a single level-t EDN holding a message, execution of a Y-pattern followed by a T-pattern results in 16 level-t, level-(t-1), and level-(t-2) EDNs holding the message. These 16 nodes form a 4 x 4 grid and are equidistant from one another in each dimension, that distance being delta. We can therefore define a broadcast operation in a 4 x 4 torus as {YT} with parameter delta = 1. Let us refer to the combined {YT} transmission pattern as a phase. For example, Figure 4.6 shows the first phase of an EDN broadcast operation in a 16 x 16 torus. The parameter delta for both steps is 16/4 = 4. A second {YT} phase, with delta = 1, is required to complete the broadcast.
We should emphasize that each {YT}-pattern in the second phase does not necessarily operate within a square 4 x 4 subnetwork, as illustrated in Figure 4.4. Rather, the patterns match those in Figure 4.5, with wraparound channels used as necessary. Figure 4.6 shows the three level-1 EDNs and the twelve level-0 EDNs that will be reached from node C in the second phase.

(a) step 1: Y-pattern with delta = 4. (b) step 2: T-pattern with delta = 4.
Figure 4.6: First phase (two steps) of EDN broadcast in a 16 x 16 torus.

Figure 4.7 gives the general algorithm for EDN broadcast in a 2D 4^k x 4^k torus network; the algorithm is invoked by all nodes in parallel. The parameter delta is calculated from the size of the network. If the node is not the source, then it must first receive the message, and may be required to complete the T step of a {YT} phase. Following this action, or if the node is the source itself, the node repeatedly sends messages according to the {YT} pattern, reducing the value of delta by a factor of 4 following each phase. Each message carries two variables identifying the message type and the value of the parameter delta. The message type can be one of Ya, Yb, Yc, or T, which dictates to the receiving node which role it must play in the subsequent message-passing step. Procedure T_pattern(type, delta), given in Figure 4.8, generates three messages corresponding to the type parameter.

Procedure EDN_bcast()   /* local address is (X, Y) */
  delta = 4^(k-1)
  if ( I_am_not_source )
    call msg_recv(type, delta);
    switch (type)
      case Ya: call T_pattern(Ya, delta); break;
      case Yb: call T_pattern(Yb, delta); break;
      case Yc: call T_pattern(Yc, delta); break;
      case T:  break;
    end switch
    delta = delta/4
  end if
  while ( delta >= 1 )
    send {Ya, delta} to node (X-delta, Y-delta);
    send {Yb, delta} to node (X, Y+2*delta);
    send {Yc, delta} to node (X+2*delta, Y);
    call T_pattern(Ys, delta);
    delta = delta/4;
  end while

Figure 4.7: Main EDN broadcast routine for a 4^k x 4^k torus network.

Procedure T_pattern(type, delta)   /* local address is (X, Y) */
  switch (type)
    case Ya:
      send {T, delta} to node (X+2*delta, Y);
      send {T, delta} to node (X, Y-delta);
      send {T, delta} to node (X-delta, Y);
      break
    case Yb:
      send {T, delta} to node (X-2*delta, Y);
      send {T, delta} to node (X, Y+delta);
      send {T, delta} to node (X+delta, Y);
      break
    case Yc or Ys:
      send {T, delta} to node (X-delta, Y+delta);
      send {T, delta} to node (X, Y+delta);
      send {T, delta} to node (X+delta, Y);
      break
  end switch

Figure 4.8: Procedure T_pattern().

In order to implement broadcast in a network of size (2 x 4^k) x (2 x 4^k), we need to define a new message-transmission pattern: the F-pattern, shown in Figure 4.9. As mentioned above, after executing a {YT} pattern of transmission from any source node, three levels of EDNs form a 4 x 4 grid with some internode spacing, call it delta_1. If each of these 16 EDNs executes an F-pattern transmission with parameter delta_2 = delta_1/2, the resulting 64 EDNs form an 8 x 8 grid with adjacent nodes spaced at distance delta_2.

The canonical example of F-pattern usage is broadcast in an 8 x 8 torus, illustrated in Figure 4.10. The entire broadcast procedure is denoted as {YTF}. The parameter delta for both the Y and T patterns is 8/4 = 2, while delta = 1 for the F-pattern. The operation of the F-pattern is illustrated in Figure 4.10(c), where it is used by each of the 16 EDNs holding the message to send to the remaining nodes in the network.
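Because the two procedures are purely local, their combined behavior can be checked by brute force. The following C sketch (the test size, source location, and verification harness are ours) replays the {YT} phases of Figures 4.7 and 4.8 on a 4^K x 4^K torus and confirms that every node other than the source receives the message exactly once in 2K steps:

#include <stdio.h>

#define K 2
#define N (1 << (2 * K))           /* side length 4^K */

static int recvcnt[N][N];           /* times each node received the message */
static int holder[N][N];            /* nodes currently holding the message  */

static int wrap(int a) { return ((a % N) + N) % N; }

/* Y-pattern offsets (in units of delta) for the Ya, Yb, Yc receivers. */
static const int yoff[3][2] = { {-1,-1}, {0,2}, {2,0} };
/* T-pattern offsets by role: 0 = Ya, 1 = Yb, 2 = Yc or source (Ys). */
static const int toff[3][3][2] = {
    { { 2, 0}, { 0,-1}, {-1, 0} },
    { {-2, 0}, { 0, 1}, { 1, 0} },
    { {-1, 1}, { 0, 1}, { 1, 0} },
};

static void deliver(int x, int y)
{
    recvcnt[wrap(x)][wrap(y)]++;
    holder[wrap(x)][wrap(y)] = 1;
}

int main(void)
{
    int sx = 3, sy = 5, d, x, y, i, r, steps = 0, ok = 1;
    holder[sx][sy] = 1;

    for (d = N / 4; d >= 1; d /= 4) {                /* one {YT} phase */
        static int hx[N*N], hy[N*N];                  /* senders this phase */
        static int rx[3*N*N], ry[3*N*N], rr[3*N*N];   /* Y receivers + role */
        int nh = 0, nr = 0;
        for (x = 0; x < N; x++)
            for (y = 0; y < N; y++)
                if (holder[x][y]) { hx[nh] = x; hy[nh] = y; nh++; }
        for (i = 0; i < nh; i++)                      /* step 1: Y-pattern */
            for (r = 0; r < 3; r++) {
                int tx = wrap(hx[i] + yoff[r][0] * d);
                int ty = wrap(hy[i] + yoff[r][1] * d);
                deliver(tx, ty);
                rx[nr] = tx; ry[nr] = ty; rr[nr] = r; nr++;
            }
        for (i = 0; i < nh; i++)                      /* step 2: T-pattern */
            for (r = 0; r < 3; r++)                   /* old holders act as Ys */
                deliver(hx[i] + toff[2][r][0] * d, hy[i] + toff[2][r][1] * d);
        for (i = 0; i < nr; i++)                      /* Y receivers use role */
            for (r = 0; r < 3; r++)
                deliver(rx[i] + toff[rr[i]][r][0] * d,
                        ry[i] + toff[rr[i]][r][1] * d);
        steps += 2;
    }
    for (x = 0; x < N; x++)
        for (y = 0; y < N; y++)
            if (recvcnt[x][y] != ((x == sx && y == sy) ? 0 : 1)) ok = 0;
    printf("%d x %d torus, %d steps: %s\n", N, N, steps,
           ok ? "exactly-once delivery" : "FAILED");
    return 0;
}

Extending the harness with the F-pattern of Figure 4.9 covers the (2 x 4^k) x (2 x 4^k) case in the same way.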
The algorithm in Figure 4.7 can be extended easily to handle networks of size (2 x 4^k) x (2 x 4^k) by defining an F-message type and by having all nodes that have not received an F message invoke an F-pattern as their last action.

Figure 4.9: F-pattern of message transmission.

Theorem 1 The EDN broadcast algorithm delivers a message exactly once to every node in a 4^k x 4^k 2D torus network in 2k message-passing steps, where k >= 1.

(a) step 1: the source node sends to three level-2 EDNs using the Y-pattern. (b) step 2: the source and level-2 EDNs use the T-pattern to send to twelve level-1 EDNs. (c) step 3: the source and all EDNs use the F-pattern to transmit to the remaining nodes.
Figure 4.10: EDN broadcast in an 8 x 8 torus; delta = 2 for both Y and T patterns, and delta = 1 for F-patterns.

Proof: The proof is by induction on k. For k = 1, the theorem holds by observation; Figure 4.4 shows the {YT} pattern used to implement two-step broadcast in a 4 x 4 torus. Assume that the result is true for k = l, l >= 1, that is, the EDN broadcast algorithm can reach every node in a 4^l x 4^l torus in 2l steps. Clearly, the same algorithm can be used to reach every fourth node (in each direction) in a 4^(l+1) x 4^(l+1) torus by starting EDN_bcast() with delta = 4^l instead of delta = 4^(l-1). An additional invocation of the {YT} pattern (two steps) can be used to reach the remaining nodes in the 4^(l+1) x 4^(l+1) torus. This can be seen by considering the pattern of nodes reached by the {YT} pattern, which is depicted in Figure 4.11(a). The source node is indicated by an open circle. As shown in Figure 4.11(c), this pattern can cover a plane, with the pattern repeating every four nodes in both directions. Therefore, after 2(l+1) total steps, every node in the 4^(l+1) x 4^(l+1) torus has received the message at least once. Finally, no node receives the message more than once, since in every step of the algorithm, every node holding the message sends it to 3 other nodes, so at most 4^(2(l+1)) nodes hold the message after 2(l+1) steps. (end of proof)

(a) nodes reached by the {YT} pattern (b) nodes reached by the {YTF} pattern (c) 4-node tiling using the {YT} pattern (d) 8-node tiling using the {YTF} pattern
Figure 4.11: Tilings using {YT} and {YTF} patterns.

The pattern of nodes reached by the {YTF} pattern is depicted in Figure 4.11(b). As shown in Figure 4.11(d), this pattern also can cover a plane, repeating every eight nodes in both directions. By a similar argument to that used in the proof of Theorem 1, we have the following corollary. Combining the two results yields Corollary 2.

Corollary 1 The EDN broadcast algorithm delivers a message exactly once to every node in a (2 x 4^k) x (2 x 4^k) 2D torus network in 2k + 1 message-passing steps, where k >= 1.

Corollary 2 The EDN broadcast algorithm requires d message-passing steps to deliver a message exactly once to every node in a 2^d x 2^d torus.

Due to the symmetry of the torus, the number of steps and the patterns involved in a broadcast operation are independent of the location of the source node. Here, we have considered only networks built from 4 x 4 networks. However, the EDN approach can be applied to torus networks of other shapes and sizes. Various patterns, applicable to both mesh and torus networks, are described in [143].
Besides the number of message-passing steps, an issue that must be addressed is channel contention among constituent messages. Two types of contention have been studied for collective communication in wormhole-routed direct networks [103]. Stepwise contention occurs when two or more messages sent in the same message-passing step contend for the same channel. In the absence of multiple virtual channels [46] for each unidirectional physical channel, one of the messages will be blocked until the other has relinquished the channel. The presence of virtual channels allows the messages to share the physical channel in a round-robin manner, though the resulting rate of progression of each will be inversely proportional to the number of messages sharing the channel. The second type of contention is depth contention, which occurs when messages sent in two different communication steps may actually be transmitted concurrently. This skewing of message-passing steps can be caused by delays at nodes, such as sending latency, receiving latency, and other computing tasks that delay the handling of messages [103].

Theorem 2 The EDN broadcast algorithm for a 4^k x 4^k 2D torus network is stepwise contention-free.

Proof: EDN broadcast for a 4^k x 4^k torus can be denoted as {YT}_{delta_1}{YT}_{delta_2}{YT}_{delta_3}...{YT}_{delta_k}, where delta_m = 4^(k-m), m = 1, 2, ..., k. To prove that stepwise contention does not occur, we need only show that no two messages transmitted in step {Y}_delta, for delta in {delta_i}, require the same channel, and that no two messages transmitted in step {T}_delta, for delta in {delta_i}, require the same channel. By definition, the three messages sent by a single node in {Y}_delta do not use a common channel. Consider step {Y}_{delta_i}, i > 1, which follows step {T}_{4 delta_i}. All nodes holding the message are located at distance 4 delta_i from one another. In the {Y}_{delta_i} step, each such node (x, y) sends messages to nodes (x - delta_i, y - delta_i), (x, y + 2 delta_i), and (x + 2 delta_i, y). Since all channels used by these messages are within distance 2 delta_i of node (x, y), none of these messages may contend with the messages sent by a node that is distance 4 delta_i from (x, y). As for step {T}_{delta_i}, i >= 1, inspection of the tiling in Figure 4.11(c) shows that the T-pattern messages in neighboring regions do not require any common channels. (end of proof)

Corollary 3 EDN broadcast in a 2D torus network of size (2 x 4^k) x (2 x 4^k) is stepwise contention-free.

Proof: The algorithm requires a total of 2k + 1 steps. The first 2k steps can be denoted as {YT}_{delta_1}{YT}_{delta_2}{YT}_{delta_3}...{YT}_{delta_k}, where delta_m = 2 x 4^(k-m), m = 1, 2, ..., k. The last step is an {F}_{delta=1} pattern. From the proof of Theorem 2, the first 2k steps are stepwise contention-free. After step {T}_{delta_k}, all nodes holding the message are located at distance 2 from one another in each dimension. The three destination nodes of the {F}_{delta=1} messages sent from node (x, y) are located at (x - 1, y + 1), (x, y + 1), and (x + 1, y). Therefore, no channel is traversed by more than one message in {F}_{delta=1}, completing the proof. (end of proof)

Besides stepwise contention, it can be shown that the EDN broadcast algorithm also avoids contention among messages sent in neighboring steps. Although depth contention is possible under certain conditions, simulation results presented in Section 4.4.3 indicate that the effect of such contention is likely to be small under conditions representative of commercial systems.

4.3 EDN Broadcast in 3D Torus Networks

We now describe the procedure for implementing EDN broadcast in 3D torus networks.
As shown in Figure 4.12(a), a source node located in position (x, y, z) in a network of size 4 x 4 x 7 can simultaneously send a message to six other nodes, one located in each of the XY planes other than its own. The six nodes are: (x, y, z+3), (x+1, y, z+2), (x, y, z-2), (x, y-1, z+4), (x-1, y, z-1), and (x, y-1, z+1). The broadcast process then proceeds in a divide-and-conquer manner, with each node holding the message performing a local broadcast within its XY plane using the 2D broadcast algorithm described earlier. Because the torus topology is symmetric, this method is applicable to any source node.

For networks in which the width of the Z dimension is greater than 7, additional strategies are needed. The proposed approach is to partition the network into at most 7 zones, each comprising contiguous XY planes, with the XY planes in the network assigned to the zones as evenly as possible. In the first message-passing step, the source node uses the message-passing pattern in Figure 4.12 to send a copy of the message to one node in each zone. Continuing in a recursive fashion, each sub-zone is further partitioned, with each receiving node acting as a new source node. Eventually, the sub-zone is a single XY plane, in which case the node holding the message delivers it to the other nodes in the plane according to the 2D torus broadcast algorithm described earlier. In order to avoid the use of the wraparound channels in all but the first message-passing step, subsequent steps may partition a zone into at most 6 sub-zones. Figure 4.12(b) shows an example in which a single node sends to one node in each of the other planes without using wraparound channels in the Z dimension.

(a) A basic pattern for a source node located at plane z = 0; this pattern utilizes two Z-direction wraparound links. (b) A basic pattern for a source node located at any plane with Z value not equal to zero; this pattern does not utilize any Z-direction wraparound links.
Figure 4.12: EDN configurations in 3D torus networks

Theorem 3 The EDN broadcast algorithm for 3D torus networks requires the following number of message-passing steps to deliver a message to every node in a 2^d x 2^d x z network, where d >= 2: If z = 1, then the algorithm requires d message-passing steps. If 2 <= z <= 7, then the algorithm requires d + 1 message-passing steps. If 7 x 6^m + 1 <= z <= 7 x 6^(m+1), where m >= 0, then the algorithm requires d + m + 2 message-passing steps.

Proof: In the following proof, let patterns (a) and (b) refer to Figures 4.12(a) and (b), respectively. If z = 1, then d steps are needed by Theorem 1 and Corollary 1.
If 2 <= z <= 7, pattern (a) is used in the first step to deliver the message to one node in each XY plane, after which these XY planes can perform their own 2D torus broadcasts in d steps, for a total of d + 1 steps. When 7 x 6^m + 1 <= z <= 7 x 6^(m+1), the (a) pattern is followed by m + 1 (b) patterns in order to distribute the message to one node in each XY plane. Including the d steps for broadcast in each plane, a total of d + m + 2 message-passing steps is needed to reach all nodes. (end of proof)

Since the messages in the two patterns used to reach different XY planes do not produce channel contention (see Figure 4.12), and the algorithm used within each plane has already been shown to be stepwise contention-free, we have the following corollary.

Corollary 4 EDN broadcast in a 3D torus network of size 2^d x 2^d x z, where d >= 2 and z >= 1, is stepwise contention-free.
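The step counts of Theorem 3 are easy to tabulate. The following C sketch (the function name and the test sizes are ours) does so:

#include <stdio.h>

/* Message-passing steps given by Theorem 3 for a 2^d x 2^d x z torus. */
static int edn3d_steps(int d, int z)
{
    int m = 0, bound = 7;             /* z <= 7: one (a)-pattern suffices */
    if (z == 1) return d;
    if (z <= 7) return d + 1;
    while (z > bound * 6)             /* find m with 7*6^m < z <= 7*6^(m+1) */
        { bound *= 6; m++; }
    return d + m + 2;
}

int main(void)
{
    printf("%d %d %d\n",
           edn3d_steps(5, 1),         /* 5: z = 1                        */
           edn3d_steps(5, 7),         /* 6: 2 <= z <= 7                  */
           edn3d_steps(5, 43));       /* 8: m = 1, since 7*6+1 <= 43     */
    return 0;
}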
4.4 Performance Evaluation

The performance of the EDN broadcast algorithm is evaluated in three ways: by comparing the number of message-passing steps to a lower bound, through timing analysis, and through simulation.

4.4.1 Number of Message-Passing Steps

A simple theoretical lower bound on the number of message-passing steps required for broadcast in an all-port n-dimensional torus is ceil(log_{2n+1}(N)), where N is the total number of nodes in the network. In many cases, however, this bound may not be achievable because of the limitations imposed by deterministic routing. Specifically, it is not always possible that every node can send to 2n new nodes in every step of the algorithm.

Figure 4.13(a) compares the number of message-passing steps of the EDN algorithm for torus networks (both 2D and 3D) of various sizes up to 3000 nodes. The number of steps achieves the theoretical lower bound in many cases and is at most one step away from the lower bound in other cases. Figure 4.13(b) compares the number of steps for networks with up to 20,000 nodes. Again, the EDN algorithm is close to, and in some cases equal to, the lower bound. As a reference point, both plots also show the number of steps needed by the U-torus algorithm [124]. The U-torus algorithm uses recursive doubling, in which the number of nodes holding the message is doubled in each step. Though optimal for one-port architectures, requiring ceil(log_2(N)) steps to complete a broadcast operation in an N-node network, the U-torus algorithm will often fail to take advantage of a multi-port architecture, resulting in more message-passing steps than necessary. In fact, the number of steps is still ceil(log_2(N)).

(a) medium-sized torus networks (b) large torus networks
Figure 4.13: Broadcast performance in torus networks.

4.4.2 Timing Analysis

Calculating the number of steps required is only one way to evaluate the performance of a broadcast algorithm. In order to better understand how the EDN algorithm may perform for specific values of system parameters, we analyze the time required when it is executed on an all-port torus. We use the following notation:

alpha represents the sending latency incurred for each message;
beta is the time required to transmit a flit on a channel connecting two neighboring routers;
gamma is the receiving latency of a message;
h denotes the number of hops, or distance, between the source and the destination of a message;
L is the length of the message, in flits.

We model the delay of a unicast message as the sum of the sending latency, the time it takes for the header flit to reach the destination node, the time for the message to be extracted from the network at the receiving node, and the receiving latency. That is, the unicast delay is alpha + beta*h + beta*L + gamma. When a node sends multiple messages in succession, we assume that their sending latencies are serialized, even though the other components may occur in parallel. Therefore, on a k-port architecture, the time needed for a node to transmit m (m <= k) copies of the same message simultaneously to m different destinations, each of which is reached through a disjoint set of external channels, can be expressed as m*alpha + h*beta + L*beta + gamma. This expression assumes that no channel contention occurs for any of these m outgoing messages and that the timing is measured from the beginning of the first message transmission until the last destination receives its message.

Using this formula as a basis, we can analyze the execution time of the EDN algorithm. In a Y-pattern transmission with parameter delta, for example, the last destination will receive its message at time 3*alpha + 2*delta*beta + L*beta + gamma. As we have seen, for a 4^k x 4^k torus, k >= 1, the message transmission pattern is {YT}^k, with each stage comprising two message-passing steps. At stage i, the value of the parameter delta_i associated with the Y and T patterns is 4^(k-i). Because sum_{i=1}^{k} delta_i = sum_{i=1}^{k} 4^(k-i) = (4^k - 1)/3, after summing the times required for all stages, the total execution time of the EDN algorithm, L_EDN, is computed as follows:

L_EDN = sum_{i=1}^{k} 2{3*alpha + gamma + beta(L + 2*delta_i)}
      = 6k*alpha + 2k*gamma + 2k*L*beta + (4(4^k - 1)/3)*beta.

In other words, for a 2^d x 2^d torus, L_EDN = 3d*alpha + d*gamma + d*L*beta + (4(2^d - 1)/3)*beta when d is even. Now let us consider a torus of size (2 x 4^k) x (2 x 4^k), in which the broadcast pattern is {YT}^k F. There are k stages, plus one extra message-passing step. For stage i, the parameter delta_i = 4^(k-i+1)/2. Because sum_{i=1}^{k} delta_i = sum_{i=1}^{k} 4^(k-i+1)/2 = 2(4^k - 1)/3, we compute L_EDN for this case to be:

L_EDN = sum_{i=1}^{k} 2{3*alpha + gamma + beta(L + 2*delta_i)} + 3*alpha + gamma + beta(L + 2)
      = 3(2k+1)*alpha + (2k+1)*gamma + (2k+1)*L*beta + ((8(4^k - 1) + 6)/3)*beta.

To summarize, the time needed for EDN broadcast in a 2^d x 2^d torus can be expressed as

L_EDN = 3d*alpha + d*gamma + d*L*beta + (4(2^d - 1)/3)*beta, when d is even;
L_EDN = 3d*alpha + d*gamma + d*L*beta + ((4 * 2^d - 2)/3)*beta, when d is odd.

Combining these two expressions, we obtain:

L_EDN = 3d*alpha + d*gamma + d*L*beta + (2(2^(d+1) - 2 + mod(d,2))/3)*beta.   (4.1)

For comparison, the execution time of the U-torus algorithm for an X x Y torus can be derived in much the same way (see [124] for details of the algorithm). Assuming that both X and Y are powers of two, in the X dimension, ceil(log_2(X)) message-passing steps are required to perform broadcast. At step i, the number of hops traversed by the message is X/2^i. The same argument can also be applied to the Y dimension. Because sum_{i=1}^{ceil(log_2(X))} X/2^i = X - 1, the total delay for broadcast using the U-torus algorithm, L_U-torus, can then be calculated as follows:

L_U-torus = sum_{i=1}^{ceil(log_2(X))} beta(X/2^i) + sum_{i=1}^{ceil(log_2(Y))} beta(Y/2^i)
            + ceil(log_2(X))(alpha + gamma + beta*L) + ceil(log_2(Y))(alpha + gamma + beta*L)
          = (alpha + beta*L + gamma)(ceil(log_2(X)) + ceil(log_2(Y))) + beta(X + Y - 2).

We rewrite the U-torus expression for a 2^d x 2^d torus as:

L_U-torus = 2d*alpha + 2d*gamma + 2d*L*beta + 2(2^d - 1)*beta.   (4.2)
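Equations 4.1 and 4.2 can be evaluated directly. The C sketch below does so for a 32 x 32 torus (d = 5); the parameter values are illustrative only, loosely matching the simulation settings of Section 4.4.3 (startup and receiving latencies of 100 microseconds, channel rate of 0.5 microseconds per byte):

#include <stdio.h>

static double l_edn(int d, double a, double b, double g, double L)
{
    double side = (double)(1 << d);
    /* Equation 4.1 */
    return 3*d*a + d*g + d*L*b + 2.0*(2*side - 2 + d % 2) / 3.0 * b;
}

static double l_utorus(int d, double a, double b, double g, double L)
{
    double side = (double)(1 << d);
    /* Equation 4.2 */
    return 2*d*a + 2*d*g + 2*d*L*b + 2.0*(side - 1) * b;
}

int main(void)
{
    double a = 100.0, g = 100.0, b = 0.5, L;    /* microseconds, us/byte */
    for (L = 128; L <= 2048; L *= 2)
        printf("L=%5.0f  EDN %8.1f  U-torus %8.1f\n",
               L, l_edn(5, a, b, g, L), l_utorus(5, a, b, g, L));
    return 0;
}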
In order to discover when the EDN algorithm will execute faster than the U-torus algorithm, we can set L_EDN <= L_U-torus, eliminate common terms, and compare the two sides of the inequality for different values of the parameters. Doing so for Equations 4.1 and 4.2 yields:

d*alpha <= d*gamma + d*L*beta + (2(2^d - 1 - mod(d,2))/3)*beta.   (4.3)

For d >= 1, the quantity 2(2^d - 1 - mod(d,2))/3 is always greater than or equal to zero. Therefore, we can see immediately from this relation that, in those situations where the sending latency is less than or equal to the sum of the receiving latency and the network latency (that is, alpha <= gamma + L*beta), the EDN algorithm will be faster than U-torus, assuming that beta > 0, L >= 0, and d >= 1. Moreover, if alpha < gamma, then EDN will complete sooner for any message size L.

4.4.3 Simulation Study

In order to account for specific system dynamics, a simulation study was conducted using MultiSim [100]. MultiSim was used to simulate broadcasts of various-sized messages in networks of different sizes. For a given message size, a large number of different source nodes were selected at random to perform the broadcast operation. The maximum delays were measured and then averaged over these samples. The simulated channel rate was 0.5 microseconds per byte, a relatively modest rate consistent with the nCUBE-2, a wormhole-routed multicomputer with which MultiSim has previously been validated.

Figure 4.14 compares the two algorithms across different message lengths (128 bytes to 2048 bytes) in a 4 x 4 torus and a 32 x 32 torus. In Figure 4.14(a), the sending latency and the receiving latency are both set to 0. In Figure 4.14(b), the sending latency and the receiving latency are both set to 100 microseconds. In both cases, the advantage of the EDN approach is significant. Although the U-torus algorithm will sometimes happen to take advantage of the all-port architecture, the EDN algorithm achieves better performance because it is designed specifically to exploit the all-port architecture.

The results of the simulation study have been compared with the previous analysis. One reason to do so is that the analysis did not account for depth contention, while the simulation would accurately reflect any message delays due to this phenomenon. Figure 4.15 plots simulation versus analytical results for a 32 x 32 torus; the plots agree almost exactly. Such results indicate that, while depth contention is possible in the EDN algorithm, the effect of this phenomenon may not be significant.

Figure 4.14: Comparison of EDN broadcast with U-torus for 4 x 4 and 32 x 32 torus networks.

4.5 Conclusions

As one of the most fundamental collective communication operations, broadcast is in high demand in parallel applications implemented on massively parallel computers. The EDN model is intended to formalize the concept of domination among nodes that may be separated by multiple channels.
In this chapter, an EDN-based algorithm has been described that uses the concept of dominating sets to efficiently implement broadcast in all-port wormhole-routed torus networks that use dimension-ordered routing. The performance advantage of the algorithm over recursive doubling has been evaluated through both analysis and simulation.

Figure 4.15: Comparison of EDN broadcast with U-torus for 32 x 32 torus networks.

Figure 4.16: Comparison of analytical and simulation results for a 32 x 32 torus.

Chapter 5

The tau-Model

As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality.
Albert Einstein

Prediction is difficult, especially of the future.
Niels Bohr

As demonstrated in previous chapters, communication algorithms designed to exploit certain characteristics of a network architecture can improve performance. In this chapter, we introduce the tau-model, a generalized communication cost model that can assist in the comparison of different algorithms as well as different networking technologies, and the associated tau-plot, which facilitates the visualization of performance comparisons. Next, we demonstrate the use of the tau-model in evaluating the performance of wormhole broadcast algorithms, including EDN broadcast. Finally, we discuss the application of the tau-model to an increasingly popular parallel platform, networks of workstations.

5.1 The tau-Model

In general, the point-to-point communication latency of a message can be represented as the sum of the message handling overhead and the network delay. Although this decomposition is represented in different ways in the literature, we will use the following form: T_s + T_n * L, where L is the message size, T_s is the sum of the software startup and receiving latencies, and T_n is the network latency per unit of message size. Empirical studies have shown that T_s and T_n are not necessarily constants, but for a given network, depend on a number of system parameters [107, 86].

5.1.1 Generalized Latency Formula

The first part of the tau-model is a generalized latency formula describing both point-to-point and collective communication operations. Instead of explicitly asserting a limited number of parameters in the cost model, the formula incorporates all the factors that may have any effect on performance. The tau-model uses the following formula to represent the latency of a broad range of communication operations:

T = T_s(a_0, a_1, a_2, ..., a_{r-1}) + T_n(a_0, a_1, a_2, ..., a_{r-1}) * f(L)   (5.1)

Although this formula appears rather complex, we can summarize it as the sum of a message-length-independent term and a message-length-dependent term. The parameters {a_i} are specific to the underlying hardware and software of the system. They typically include such items as the software startup latency, reception latency, network bandwidth, communication network distance, system load, degree of network sharing, communication group size, and so on [89, 43].
Their values depend on the system configuration and network conditions under which the communication algorithm is executed. The function f(L) satisfies the property that f(0) = 0, and it is usually a non-decreasing, continuous polynomial function. The function f(L) determines the L-complexity of the algorithm. If f(L) = L, which is the most common situation, then we say that the algorithm is of linear L-complexity.

5.1.2 The tau-Plot: Comparing Networks

The second part of the tau-model is a two-dimensional Cartesian plot that allows us to characterize and compare different performance curves more easily than with traditional methods. From the above generalized latency formula, we define the parameter tau to be:

tau = T_n(a_0, a_1, a_2, ..., a_{r-1}) / T_s(a_0, a_1, a_2, ..., a_{r-1}).   (5.2)

The ratio tau is the reciprocal of the metric m_{1/2}, which in Hockney's model is the message length required to achieve half the asymptotic bandwidth [72].

Plotting the performance of a communication algorithm according to its coordinates (tau, T_s) results in a tau-plot for that algorithm. The use of the tau-plot is illustrated in the following analysis. Shown in Figure 5.1(a) are three lines marked 1, 2, and 3. Suppose that they represent measurements of the same communication operation for three different networks. Their individual latency formulas are T_s1 + T_n1*L, T_s2 + T_n2*L, and T_s3 + T_n3*L, respectively. Let us assume that T_s1 = T_s2. In Figure 5.1(b), points 1 and 2 represent lines 1 and 2, respectively. They have the same T_s coordinate, but point 2 lies to the right of point 1, indicating that the slope of line 2 is greater than that of line 1 in the original latency plot. Line 3 has a different T_s value and a different T_n value as compared with line 1. However, in the tau-plot, they have the same tau coordinate, indicating that the total time required to finish the operation in network 3 is a constant multiple of that in network 1, a relationship that is not obvious from the original latency plot. We can use this relationship to predict the performance of a communication operation in network 1 from its performance in network 3.

(a) message length vs. latency plot (b) tau-plot
Figure 5.1: Example of the tau-plot.

This example illustrates the use of the tau-plot for linear algorithms, that is, where the algorithms discussed all exhibit linear L-complexity. In the following, we demonstrate the use of the tau-plot for the more general communication cost model.
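Concretely, a measured (T_s, T_n) pair maps to the tau-plot point (T_n/T_s, T_s). The small C sketch below performs this mapping for the two Ethernet/TCP fits reported later in Table 5.1 (the numeric values are taken from that table; the harness itself is ours):

#include <stdio.h>

int main(void)
{
    const char *sys[] = { "SS-20/Ethernet/TCP", "SGI Indy/Ethernet/TCP" };
    double Ts[] = { 260.85, 294.45 };   /* us       */
    double Tn[] = { 0.892,  0.889  };   /* us/byte  */
    int i;
    for (i = 0; i < 2; i++)             /* tau-plot coordinate (tau, Ts) */
        printf("%-22s tau = %.5f  Ts = %.2f\n",
               sys[i], Tn[i] / Ts[i], Ts[i]);
    return 0;
}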
5.1.3 Use of the tau-Plot to Compare Algorithms

The use of the tau-plot to compare the performance of different communication algorithms under the same network conditions is illustrated in the following example. Suppose that three communication algorithms 1, 2, and 3 are described by the latency formulas T1(L) = T_s1 + T_n1 * f1(L), T2(L) = T_s2 + T_n2 * f2(L), and T3(L) = T_s3 + T_n3 * f3(L), respectively. Let us further assume that these three algorithms have the same L-complexity, that is, f1 = f2 = f3 = f(L). Let tau_1, tau_2, and tau_3 be their respective tau-values; the tau-plot of these three algorithms is also represented by Figure 5.1(b).

Let us now derive the formal relationships among these three algorithms. The relationship between the performances of Algorithms 1 and 3 can be characterized as follows. Because tau_1 = tau_3 and f1 = f3, we have T_n1/T_s1 = T_n3/T_s3. The latency formula T1(L) can therefore be rewritten as T1(L) = (T_s1/T_s3) * T3(L). The relationship between Algorithms 1 and 2 can be calculated from the fact that T_s1 = T_s2 and tau_2 > tau_1. We can then derive the following equation:

T2 - T1 = (tau_2/tau_1 - 1) T_n1 f(L)   (5.3)
        = (1 - tau_1/tau_2) T_n2 f(L).   (5.4)

The relationship between Algorithms 2 and 3 can be derived directly by taking the difference, T3 - T2 = T_s3 - T_s2 + (T_n3 - T_n2) f(L). If T_n2 is not equal to T_n3, then a specific length L* can be defined as:

L* = f^{-1}((T_s2 - T_s3) / (T_n3 - T_n2))   (5.5)
   = f^{-1}((T_s2 - T_s3) / (tau_3 T_s3 - tau_2 T_s2)).   (5.6)

Thus, we know that:

T3 > T2, when L < L*;
T3 = T2, when L = L*;   (5.7)
T3 < T2, when L > L*.

Because T_s3 > T_s2, f(0) = 0 by definition, and f is an increasing function, if T_n2 > T_n3, then L* > 0, and if T_n2 < T_n3, then L* < 0. Since message size cannot be negative, if T_n2 < T_n3, then for all message sizes L > 0, T2 < T3.

To summarize, the performance of a particular communication operation under a specific communication architecture is identified as a single point in the tau-plot, shown as point x in Figure 5.2. Four regions, marked A, B, C, and D, can be identified by drawing two intersecting lines through point x, one line parallel to the horizontal axis and the other parallel to the vertical axis. Any other communication operation that falls in region A has better performance than point x, and point x outperforms any communication operation in region B. Algorithms in regions C and D can have better or worse performance than point x, depending on the message length. The crossover message length is calculated from Equation 5.7. In general, algorithms in region C favor long messages as compared to algorithm x, and those in region D favor short messages as compared to algorithm x.

Figure 5.2: Comparing algorithms using the tau-plot.
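For linear algorithms (f(L) = L), Equation 5.5 reduces to a simple ratio. The following C sketch computes L* and checks the regimes of Equation 5.7; the T_s and T_n values are made up, chosen only so that T_s3 > T_s2 and T_n2 > T_n3:

#include <stdio.h>

int main(void)
{
    /* Hypothetical algorithms 2 and 3 with Ts3 > Ts2 and Tn2 > Tn3. */
    double Ts2 = 100.0, Tn2 = 0.50, Ts3 = 400.0, Tn3 = 0.20;
    double Lstar = (Ts2 - Ts3) / (Tn3 - Tn2);   /* f^{-1} is the identity */
    double L;
    printf("L* = %.1f\n", Lstar);               /* 1000.0 */
    for (L = 500; L <= 1500; L += 500) {
        double T2 = Ts2 + Tn2 * L, T3 = Ts3 + Tn3 * L;
        printf("L=%5.0f  T2=%7.1f  T3=%7.1f  faster: %s\n",
               L, T2, T3, T2 < T3 ? "2" : (T2 > T3 ? "3" : "tie"));
    }
    return 0;
}

As Equation 5.7 predicts, algorithm 3 (higher startup, lower per-byte cost) wins only beyond the crossover length.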
5.2 Characterization of Point-to-Point Communication

As mentioned earlier, two kinds of comparisons can be made using the tau-model. The first is to compare different networks, which is made possible by identifying some communication operation that exhibits the same L-complexity in these networks. The second is a quantitative comparison of different implementations of the same operation in a given network. In the same network, the system and algorithmic parameters {a_i} are identical for all implementations of the operation. It is also true that different implementations of the same operation usually exhibit the same L-complexity in the same network architecture.

Point-to-point communication is one operation that exhibits the same L-complexity across different networking technologies. The traditional way to measure point-to-point communication latency is the so-called ping-pong benchmark, in which two nodes exchange messages repeatedly. To provide a simple illustration of the tau-model, we conducted two different ping-pong experiments using the same network architecture, but different computing platforms. The first experiment uses two SPARCstation-20 workstations running Solaris 2.5, and the second uses two SGI Indy workstations running IRIX 5.3. The SPARCstation-20 has 32 MBytes of main memory and a clock rate of 75 MHz, whereas the Indy runs at 180 MHz and has a main memory size of 64 MBytes. According to the SPEC benchmark, the SPARCstation-20 has a SPECint95 rating of 3.11 and a SPECfp95 rating of 3.10, while the Indy has a SPECint95 rating of 4.1 and a SPECfp95 rating of 4.4.

The results of 20,000 round-trip measurements were recorded for each specific message length, averaged, then divided by two to obtain the one-way latencies, which are plotted in Figure 5.3(a). These one-way latencies were then least-squares fitted to a straight line; the results (using a logarithmic scale) are shown in Figure 5.3(b). This method can be used to derive the values of T_s and T_n for point-to-point communication.

(a) Round-trip and one-way latencies. (b) Least-squares fits of one-way latencies: 260.85 + 0.8916*(message size) for the SS-20, 294.45 + 0.8887*(message size) for the SGI.
Figure 5.3: Results from the ping-pong test.

The ping-pong benchmark has been widely used to measure point-to-point performance, and many such results have been published in the literature [53, 86, 20]. We have used these results, and the method described above, to compute T_s, T_n, and tau values for various configurations. Table 5.1 lists the results, and Figure 5.4 presents a tau-plot of the data listed in the table.

The tau-plot reveals several relationships that may not be evident in typical latency vs. message size and latency vs. bandwidth plots [53]. First, the effect of reducing the software and protocol processing time is reflected by a shift to the lower right in the tau-plot. This characteristic is illustrated in dramatic fashion by the relative locations of the two points marked SS20/UNet/TCP and SS20/UNet/UDP in the lower-right corner as compared to the points marked SS20/ASX200/TCP and SS20/ASX200/UDP in the upper-left corner. Although both sets of points represent measurements on Sun SPARCstation-20 workstations and a FORE ASX-200 ATM switch, the Active Message based U-Net interface software is very effective in reducing T_s. Second, a similar shift occurs when using the same underlying physical networking technology, but faster processors. This property is exemplified in Figure 5.4 by the difference among the points SS10/Ethernet/TCP, SS20/Ethernet/TCP, and SGI/Ethernet/TCP. Third, a constant-factor increase or decrease in performance between networks for a particular communication operation is revealed in the tau-plot when their positions lie in the vicinity of a vertical line, which indicates that they have similar tau values. An example of this behavior in Figure 5.4 is the tau values of the Paragon system and the SS-20s connected by the FORE ASX-200 ATM switch.
Table 5.1: Point-to-point performance on various platforms.

system            | software            | T_s (us) | T_n (us/byte) | tau = T_n/T_s | source
T3D               | MAX 1.2.0.2 (SM)    | 3.0      | 0.00781       | 0.0026        | [53]
Paragon           | SUNMOS 1.6.2        | 25.0     | 0.00585       | 0.000234      | [53]
Delta             | NX 3.3.10           | 77.0     | 0.125         | 0.00162       | [53]
iPSC/860          | NX 3.3.2            | 65.0     | 0.333         | 0.00513       | [53]
iPSC/2            | NX 3.3.2            | 370.0    | 0.357         | 0.000965      | [53]
SP-1              | MPL                 | 270.0    | 0.143         | 0.000529      | [53]
SP-2              | MPI                 | 35.0     | 0.0286        | 0.000816      | [53]
KSR-1             | OSF 3.1.2.2         | 73.0     | 0.125         | 0.00171       | [53]
CS-2              | Solaris 2.3 (SM)    | 11.0     | 0.0250        | 0.00227       | [53]
nCUBE-2           | Vertex 2.0          | 154.0    | 0.589         | 0.00382       | [53]
nCUBE-1           | Vertex 2.3          | 384.0    | 2.5           | 0.00651       | [53]
Cenju-3           | Env. Rel. 1.5d (SM) | 34.0     | 0.04          | 0.00118       | [53]
SGI               | IRIX 6.1            | 10.0     | 0.0156        | 0.00156       | [53]
CM-5              | CMMD 2.0            | 95.0     | 0.111         | 0.00117       | [53]
SS-20/ASX200      | Solaris 2.4/UDP     | 716.0    | 0.16          | 0.000224      | [86]
SS-20/ASX200      | Solaris 2.4/TCP     | 817.0    | 0.13          | 0.000159      | [86]
SS-20/Myrinet     | Solaris 2.4/UDP     | 746.0    | 0.18          | 0.000241      | [86]
SS-20/Myrinet     | Solaris 2.4/TCP     | 807.0    | 0.152         | 0.000188      | [86]
SS-20/UNet        | SunOS 4.1.3/UDP     | 65.0     | 0.75          | 0.0115        | [20]
SS-20/UNet        | SunOS 4.1.3/TCP     | 75.0     | 0.85          | 0.0113        | [20]
SS-10/Ethernet    | Solaris 2.5/TCP     | 552.17   | 0.913         | 0.00165       | *
SS-20/Ethernet    | Solaris 2.5/TCP     | 260.85   | 0.892         | 0.00342       | *
SGI Indy/Ethernet | IRIX 5.3/TCP        | 294.45   | 0.889         | 0.00302       | *

* measured by the authors

The Intel Paragon system has a tau value of 0.00023, while the SS-20s have a tau value of 0.00022. The values of T_s for these points reveal that for the point-to-point one-way communication operation, the ATM-connected SS-20s exhibit latencies that are approximately 28 times the latency of the Intel Paragon system, across the entire measured message length range. Another example pair is the CM-5 and the Cenju-3, although in this case the ratio is much smaller (2.78).

Figure 5.4: tau-plot for point-to-point communication benchmarks.
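The fitting step described above is ordinary least squares on (L, latency) pairs. A minimal C sketch follows; the sample points are synthetic, generated from the SS-20/Ethernet fit 260.85 + 0.8916*L shown in Figure 5.3(b), so the fit simply recovers those coefficients:

#include <stdio.h>

int main(void)
{
    /* Message sizes matching the x-axis of Figure 5.3. */
    double L[] = { 4, 16, 64, 256, 1024, 4096, 16384 };
    double T[7], sL = 0, sT = 0, sLL = 0, sLT = 0, Tn, Ts;
    int i, n = 7;

    for (i = 0; i < n; i++)            /* synthetic data on the fitted line */
        T[i] = 260.85 + 0.8916 * L[i];
    for (i = 0; i < n; i++) {
        sL += L[i]; sT += T[i]; sLL += L[i]*L[i]; sLT += L[i]*T[i];
    }
    Tn = (n*sLT - sL*sT) / (n*sLL - sL*sL);  /* slope: us per byte */
    Ts = (sT - Tn*sL) / n;                   /* intercept: us      */
    printf("Ts = %.2f us, Tn = %.4f us/byte, tau = %.5f\n", Ts, Tn, Tn/Ts);
    return 0;
}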
5.3 Analysis of Wormhole Broadcast Algorithms

In this section, we analyze and compare four broadcast algorithms for 2D wormhole-routed mesh networks. In Section 5.4 we use the tau-model to compare them. The algorithms under investigation are Recursive Doubling (RD) [16], Scatter-Collect (SC) [17], Fibonacci Tree (FT) [50], and Extended Dominating Node (EDN) [147]. All these algorithms are based on point-to-point message passing. The following formula is used to describe the point-to-point communication latency in wormhole-routed networks. Suppose that a message traverses h hops before reaching its destination. Then the overall message latency is alpha + h*beta + beta*L + gamma, where alpha is the startup overhead, beta is the per-unit message transmission latency, and gamma is the reception overhead at the destination.

Probably the most well-known algorithm for multicast or broadcast is the Recursive Doubling algorithm. The source divides all participating nodes into two groups, the first containing itself, and sends the message to one node in the second group. The two nodes holding the message then act as source nodes for their respective groups and proceed with the broadcast process independently. When used as a broadcast algorithm for a wormhole-routed 2^n x 2^n square mesh network, one implementation is to begin the broadcast operation with the distribution of the message along the row of the source node. Next, all nodes in that row perform the same operation along their respective columns. Figure 5.5 illustrates the operation of this algorithm in an 8 x 8 2D mesh. This algorithm, referred to as Dimensional Broadcast [16], is a special case of the more general U-mesh [103] multicast algorithm. For a 2^n x 2^n mesh, a total of 2n steps is needed to complete the operation. Step i within a dimension requires time alpha + 2^(n-i)*beta + beta*L + gamma. Therefore, the overall time needed is 2n*alpha + 2*beta(2^n - 1) + 2n*gamma + 2n*beta*L.

(a) step 1 (b) step 2 (c) step 3 (d) step 4 (e) step 5 (f) step 6
Figure 5.5: Dimensional broadcast in an 8 x 8 mesh.

The Scatter-Collect algorithm [17] is based on the idea of message fragmentation. The source node partitions the message and performs a scatter operation across its row, with each node receiving a particular segment of the message. Next, each node holding part of the message performs a scatter in its column. In the third step, the nodes in each row form a logical ring and circulate segments around the ring until all nodes hold all segments in that row. Finally, a similar circulation in each column results in a copy of the entire message at every node. This algorithm avoids channel contention, and the pipelining of message segments offers good performance for broadcast of long messages [17]. Table 5.2 summarizes its performance for a 2^n x 2^n mesh network.

(a) scatter in row (b) scatter in columns (c) collect in rows (d) collect in columns
Figure 5.6: Scatter-Collect broadcast operation in an 8 x 8 2D mesh.

Table 5.2: Analysis of the Scatter-Collect algorithm

phase          | message-passing steps
row scatter    | n
column scatter | n
row collect    | 2^n - 1
column collect | 2^n - 1

(The per-phase times sum to the T_s and T_n entries for SC given in Table 5.3.)

The Fibonacci Tree algorithm, proposed in [50], uses both message fragmentation and message pipelining. One requirement of the algorithm is that the tree used to broadcast is free of depth contention [103]. As with the U-mesh algorithm, the tree is based on a lexicographic ordering of nodes, which prevents depth contention. This means that messages will not contend for the same channel, even if those messages are sent in different steps. The message is divided into k pieces and sent down the tree in a pipelined fashion. The structure of the tree depends on the parameter k and the number of nodes in the tree. Figure 5.7 illustrates the operation of the Fibonacci Tree algorithm for a partition parameter k = 2. The first part of the message is labeled A and the second part B; the parts are sent to the receiving nodes independently.

Figure 5.7: Fibonacci Tree algorithm for k = 2.

The relationship between the partition parameter k and the total number of nodes N that have received the entire message after t steps of message transmission can be described by the following recursion [50]:

N(t, k) = N(t - k, k) + N(t - 1, k)   for t >= k;
N(t, k) = 1   otherwise.

When implemented on a mesh network of size P = 2^n x 2^n, the total number of hops traversed by the segmented message is k times that of the RD algorithm, if the original message is divided into k packets, each of size no more than ceil(L/k). We define the number of steps in the Fibonacci Tree algorithm as t(P, k) for group size P and message segmentation factor k. A special case of the function t(P, k) occurs when k = 1 and P = 2^n x 2^n; specifically, t(2^n x 2^n, 1) = t_1 = 2n. We can approximate the latency of the Fibonacci Tree algorithm by adding the cost due to the distance traveled by the message, 2k*beta(2^n - 1), to the cost of t(P, k) steps of unicast messages, each costing alpha + beta*L/k + gamma. Therefore, the latency formula for the Fibonacci Tree algorithm is approximated by

L_FT = t(P, k)(alpha + gamma) + 2k*beta(2^n - 1) + t(P, k)*beta*L/k.

The final algorithm that we consider is EDN broadcast.
As described earlier, the EDN methodology is designed to take advantage of a multi-port architecture. We can compute its performance based on the analysis in Chapter 3. The approximate cost, without taking into account the distances traveled by the message, is 3n*alpha + (n+1)*gamma + (n+1)*beta*L. The cost induced by the distance can be approximated by halving the distance cost of the RD algorithm, as indicated by the simulation results in Chapter 3. Therefore, the overall cost of the EDN algorithm can be approximated by 3n*alpha + (n+1)*gamma + (2^n - 1)*beta + (n+1)*beta*L.

We summarize the performance of the four broadcast algorithms according to the tau-model in Table 5.3. The overall latency formula takes the form T_s(alpha, beta, gamma, n) + T_n(alpha, beta, gamma, n)*L; the T_s and T_n components are listed in the table.

Table 5.3: Analytical performance of the four broadcast algorithms

Algorithm | T_s                                                | T_n
RD        | 2n*alpha + 2*beta(2^n - 1) + 2n*gamma              | 2n*beta
SC        | 2(2^n - 1 + n)(alpha + gamma) + 6(2^n - 1)*beta    | 2(1 - 1/2^(2n))*beta
FT        | t(P, k)(alpha + gamma) + 2k*beta(2^n - 1)          | t(P, k)*beta/k
EDN       | 3n*alpha + (n+1)*gamma + (2^n - 1)*beta            | (n+1)*beta

5.4 Comparisons Using the tau-Model

In this section, we demonstrate the use of the tau-model to analyze the four broadcast algorithms based on wormhole switching technology. More specifically, we use a parameterized wormhole-routed square mesh network as the underlying networking technology to compare different implementations of broadcast.

5.4.1 Overall Comparison of Algorithms

Let us begin with two examples. We fix the size of the network by letting n = 5, and we assume that alpha = gamma. In the first example, we vary the value of alpha from 0.5 to 1.5 in increments of 0.1, while the value of beta is fixed at 0.01. This represents a range of tau values from 0.01 to 0.003. In the second example, we vary the value of beta from 0.001 to 0.2 in increments of 0.01, while we fix the value of alpha at 0.5. This results in a range of tau values from 0.001 to 0.2. Next, we calculate the T_s and T_n values and the corresponding tau values for each algorithm. From those different (tau, T_s) pairs, the tau-plot for these four algorithms is constructed.

Figure 5.8: Fixed beta with varying alpha.

As indicated in Figure 5.8, with increasing alpha (tau decreases), the SC algorithm has a more rapidly rising T_s component than the other three algorithms. From the closeness to the origin of the trace made by the SC algorithm, we know that it utilizes the largest aggregate network bandwidth, but it suffers severely from the increased T_s component as the underlying network parameter alpha increases. The fact that the SC algorithm moves toward the upper-left corner as alpha increases indicates that long message sizes are more suitable for this algorithm. The FT algorithm can utilize more aggregate network bandwidth as k increases; however, with large k values, it is more sensitive to increased T_s as the network parameter alpha increases. When compared with the SC algorithm, the FT algorithm is more suitable for smaller message sizes. The EDN algorithm utilizes less aggregate network bandwidth as alpha increases; its bandwidth is approximately the same as that of the FT algorithm with k = 4, and it is less susceptible to increases in T_s.
The RD algorithm is the lowest in network bandwidth utilization, and it has the smallest increase in T_s as alpha increases. However, it also has the largest tau values among the algorithms compared. To summarize, the SC algorithm is more suitable for long message sizes and networks with larger tau values. The EDN algorithm performs better for smaller message sizes, and the FT algorithm with smaller k values is suitable for smaller message sizes; compared with the EDN algorithm, however, the FT algorithm is faster for large message sizes.

Figure 5.9: Fixed alpha with varying beta.

Shown in Figure 5.9 is the tau-plot with increasing beta (tau increases). The traces of these algorithms indicate that the upper bound of the tau values, in decreasing order, is RD, EDN, FT(k=2), FT(k=4), FT(k=8), and SC as the network parameter beta increases. The SC algorithm performs faster than the FT(k=8) algorithm as the network parameter beta increases. From the traces made by FT(k=2), FT(k=4), and FT(k=8), we can tell that as k increases, the FT algorithm takes on characteristics similar to those of the SC algorithm, but its performance becomes worse than that of the SC algorithm as beta increases. When comparing the RD algorithm with the EDN algorithm, we know that for this alpha = gamma setting, the EDN algorithm is faster than the RD algorithm. The characteristics of the EDN algorithm are similar to those of the RD algorithm; compared with the FT and SC algorithms, both are suitable for short message sizes. In summary, as the network tau value increases, the SC algorithm is better for long message sizes and the EDN algorithm is better for short message sizes. The FT algorithm has characteristics similar to the SC algorithm, but for large k and large tau values the SC algorithm performs faster than the FT algorithm.

As demonstrated above, the comparison of broadcast algorithms through the use of the tau-plot provides qualitative information regarding the algorithms. In the following, detailed quantitative comparisons among the SC, FT, and EDN algorithms are made. This analysis is intended to give a thorough understanding of the behavior of these algorithms under any reasonable architectural assumptions.
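The traces in Figures 5.8 and 5.9 are simply the (tau, T_s) pairs computed from Table 5.3. The following C sketch regenerates one set of such points; it assumes that t(P, k) is the smallest t with N(t, k) >= P, a reading of the recursion in [50] that matches the value t(1024, 18) = 67 used in Section 5.4.2:

#include <stdio.h>

static int t_steps(int P, int k)           /* t(P,k) via the N(t,k) recursion */
{
    int N[4096], t;
    for (t = 0; t < k; t++) N[t] = 1;      /* N(t,k) = 1 for t < k */
    for (t = k; ; t++) {
        N[t] = N[t-1] + N[t-k];            /* N(t,k) = N(t-1,k) + N(t-k,k) */
        if (N[t] >= P) return t;
    }
}

static void point(const char *name, double Ts, double Tn)
{
    printf("%-10s Ts = %8.3f   tau = %.6f\n", name, Ts, Tn / Ts);
}

int main(void)
{
    int n = 5, P = 1 << (2 * n);           /* 2^n x 2^n mesh, 1024 nodes */
    int side = 1 << n, ks[] = {2, 4, 8}, i;
    double alpha = 0.5, gamma = 0.5, beta = 0.01;

    point("RD",  2*n*alpha + 2*beta*(side-1) + 2*n*gamma, 2*n*beta);
    point("SC",  2.0*(side-1+n)*(alpha+gamma) + 6.0*(side-1)*beta,
                 2.0*(1.0 - 1.0/P)*beta);
    point("EDN", 3*n*alpha + (n+1)*gamma + (side-1)*beta, (n+1)*beta);
    for (i = 0; i < 3; i++) {
        int k = ks[i], t = t_steps(P, k);
        char name[32];
        sprintf(name, "FT (k=%d)", k);
        point(name, t*(alpha+gamma) + 2.0*k*(side-1)*beta, t*beta/k);
    }
    return 0;
}

Sweeping alpha or beta as in the two examples above traces out the curves of the two figures.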
5.4.2 Comparison of SC and FT

From the latency formulae of the SC and the FT algorithms, as given in Table 5.4, it is very difficult to extract an overall picture of how one algorithm performs compared to the other under different parameter settings. With the aid of the tau-model, we can establish these conditions and compare the algorithms easily. The parameters involved in the latency are the mesh size P = 2^n x 2^n, the message size L, the segmentation count k for the FT algorithm, and the three unicast latency parameters alpha, beta, and gamma.

Table 5.4: The latency formulae for the SC and the FT algorithms.

   | T_s                                             | T_n                  | tau = T_n/T_s
SC | 2(2^n - 1 + n)(alpha + gamma) + 6(2^n - 1)*beta | 2(1 - 2^(-2n))*beta  | 2(1 - 2^(-2n))*beta / [2(2^n - 1 + n)(alpha + gamma) + 6(2^n - 1)*beta]
FT | t(P, k)(alpha + gamma) + 2k(2^n - 1)*beta       | t(P, k)*beta/k       | t(P, k)*beta / (k[t(P, k)(alpha + gamma) + 2k(2^n - 1)*beta])

From our earlier description of the tau-model, we know that when T_s[SC] < T_s[FT] and tau_SC < tau_FT, the SC algorithm performs better than the FT algorithm for all message sizes. The first condition leads to the inequality

tau_0 > (2(2^n - 1 + n) - t(P, k)) / (2(2^n - 1)(k - 3)),   (5.8)

provided that k - 3 > 0, where the parameter tau_0 is defined to be beta/(alpha + gamma). Similarly, we can derive from the second condition that

tau_0 > t(P, k){k(1 - 2^(-2n)) - 2^n + 1 - n} / ((2^n - 1){3t(P, k) - 2k^2(1 - 2^(-2n))})   when 3t(P, k) - 2k^2(1 - 2^(-2n)) > 0,   (5.9)

and

tau_0 < t(P, k){k(1 - 2^(-2n)) - 2^n + 1 - n} / ((2^n - 1){3t(P, k) - 2k^2(1 - 2^(-2n))})   when 3t(P, k) - 2k^2(1 - 2^(-2n)) < 0.   (5.10)

If, on the other hand, 3t(P, k) - 2k^2(1 - 2^(-2n)) = 0, then by replacing 2k^2(1 - 2^(-2n)) with 3t(P, k) in the inequality tau_SC < tau_FT, we can conclude that the inequality tau_SC < tau_FT holds. Therefore, satisfying the condition 3t(P, k) - 2k^2(1 - 2^(-2n)) = 0 together with Equation 5.8, or Equation 5.8 together with Equation 5.9 or 5.10, guarantees that the SC algorithm performs better than the FT algorithm for any given message length L.

In order to facilitate later derivations, we define three functions, omega(n, k), epsilon(n, k), and mu(n, k), which depend only on the parameters n and k, as follows:

omega(n, k)   = 3t(P, k) - 2k^2(1 - 2^(-2n)),
epsilon(n, k) = (2(2^n - 1 + n) - t(P, k)) / (2(2^n - 1)(k - 3)),
mu(n, k)      = t(P, k){k(1 - 2^(-2n)) - 2^n + 1 - n} / ((2^n - 1) omega(n, k)).

Because the parameter n is a constant when we consider a fixed-size network of size P = 2^n x 2^n, the only variable in the evaluation of these three functions is k. Once the mesh size is determined (n is fixed), we can compute a series of omega, epsilon, and mu values for varying k. By comparing tau_0 with epsilon(n, k), mu(n, k), and omega(n, k), we can determine when the SC algorithm is better than the FT algorithm.

Following a procedure similar to that outlined above, we know that when T_s[SC] > T_s[FT] and tau_SC > tau_FT, the FT algorithm is faster than the SC algorithm. The first inequality leads to tau_0 < epsilon(n, k) for k > 3, and the second inequality leads to tau_0 >= mu(n, k) when omega(n, k) <= 0.

In addition to the above two cases, according to the tau-model, there are cases in which one algorithm performs better for message sizes greater than a specific length. Specifically, when T_s[SC] < T_s[FT] and tau_SC > tau_FT, the SC algorithm performs better than the FT algorithm for any message size L > L*, where L* = (T_s[SC] - T_s[FT]) / (T_n[FT] - T_n[SC]). This leads to the result that when tau_0 < epsilon(n, k), and tau_0 >= mu(n, k) for omega(n, k) >= 0, the SC algorithm is better than the FT algorithm for message sizes greater than L*; otherwise, the FT algorithm is better.

By combining these results, we can compare the SC algorithm with the FT algorithm for any architectural and algorithmic parameters as follows; the sketch following this list evaluates these tests numerically.

- From the equation omega(n, k) = 0, we can solve for k. Suppose the solution is k* > 0. We know that when k = k*, the SC algorithm is better than the FT algorithm for any given network parameters alpha, beta, and gamma.
- For a given mesh network with parameter tau_0 > epsilon(n, k), the SC algorithm is better than the FT algorithm for any k > 3.
- Given a mesh network with parameter tau_0, if tau_0 <= epsilon(n, k) and the condition tau_0 >= mu(n, k) holds for omega(n, k) >= 0, the SC algorithm is better than the FT algorithm for any message size L such that L > L* = (T_s[SC] - T_s[FT]) / (T_n[FT] - T_n[SC]). For any L < L*, the FT algorithm is better than the SC algorithm.
- For a specified mesh network parameter tau_0, if tau_0 <= epsilon(n, k) and the condition tau_0 >= mu(n, k) holds for omega(n, k) <= 0, the FT algorithm is better than the SC algorithm for any k > 3.
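A minimal C sketch of these decision functions follows (it uses the same t(P, k) reading as before, i.e., the smallest t with N(t, k) >= P; the k range is ours):

#include <stdio.h>

static int t_steps(int P, int k)
{
    int N[4096], t;
    for (t = 0; t < k; t++) N[t] = 1;
    for (t = k; ; t++) {
        N[t] = N[t-1] + N[t-k];
        if (N[t] >= P) return t;
    }
}

int main(void)
{
    int n = 5, side = 1 << n, P = side * side, k;
    double q = 1.0 - 1.0 / (double)P;        /* 1 - 2^(-2n) */
    for (k = 4; k <= 24; k += 2) {
        double t = t_steps(P, k);
        double omega = 3.0*t - 2.0*k*k*q;
        double eps   = (2.0*(side-1+n) - t) / (2.0*(side-1)*(k-3));
        double mu    = t*(k*q - side + 1 - n) / ((side-1)*omega);
        printf("k=%2d  t=%3.0f  eps=%8.5f  mu=%9.5f  omega=%9.1f\n",
               k, t, eps, mu, omega);
    }
    return 0;
}

For n = 5, the epsilon values produced by this sketch fall below tau_0 = 0.003 only once k reaches about 20, which is the crossover discussed next.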
In order to illustrate the use of the above results, we select a mesh network with n = 5 (thus P = 2^5 x 2^5 = 1024 nodes), and assume that α = γ = 1.5 and β = 0.009, which are very similar to the T3D parameters [44]. This results in a τ_0 value of 0.003. The values of ε(5,k) and μ(5,k) for different values of k are shown in Figure 5.10.

[Figure 5.10: Plot of ε(5,k) (marked +) and μ(5,k) (marked x) with respect to k, together with the horizontal line τ_0 = 0.003.]

Because the three functions ε(n,k), μ(n,k), and ω(n,k) are independent of α, β, and γ, the resulting plots are identical for a given network of size 2^n x 2^n, for any α > 0, β > 0, and γ > 0. For a given τ_0, we can draw a horizontal line across the plot, since τ_0 is independent of k, as shown in Figure 5.10. From the previous results, when τ_0 > ε(n,k), the SC algorithm is better than the FT algorithm. As illustrated in Figure 5.10, following the line of constant τ_0 = 0.003, this condition holds only for k >= 20. For 3 < k < 20, the condition τ_0 < ε(n,k) holds, as well as τ_0 < μ(n,k) where ω(n,k) < 0 and τ_0 > μ(n,k) where ω(n,k) > 0. This indicates that for 3 < k < 20, the SC algorithm is better than the FT algorithm for any message size L > L* = (T_s[SC] - T_s[FT])/(T_n[FT] - T_n[SC]). The computed L* values for different k values are illustrated in Figure 5.11.

[Figure 5.11: Different L* values with respect to k.]

Figure 5.12(a) compares the two algorithms by latency vs. message length, as computed from the latency formulae with specified α, γ, and β values. For the FT algorithm, we select a set of k values, k = 16, 18, 19, 20, in the neighborhood of the crossover point from Figure 5.10. The results agree with those predicted by the analysis according to the τ-model. When we keep the τ_0 value fixed at 0.003 but change the parameters (α, β, γ) from (1.5, 0.009, 1.5) to (0.5, 0.003, 0.5), the result is illustrated in Figure 5.12(b). The crossover message length for k = 18 can be calculated from the equation L* = (T_s[SC] - T_s[FT])/(T_n[FT] - T_n[SC]). In the first case, when α = γ = 1.5, β = 0.009, and t(P,18) = 67, the crossover message size is

    L* = [2(2^5 - 1 + 5)(1.5 + 1.5) + 6(2^5 - 1)(0.009) - 67(1.5 + 1.5) - 2(2^5 - 1)(18)(0.009)]
         / [(67)(0.009)/18 - 2(1 - 1/2^{10})(0.009)]
       = 6.6/0.0155 ≈ 426.

In the second case, when α = γ = 0.5, β = 0.003, and t(P,18) remains 67, the crossover message size is

    L* = [2(2^5 - 1 + 5)(0.5 + 0.5) + 6(2^5 - 1)(0.003) - 67(0.5 + 0.5) - 2(2^5 - 1)(18)(0.003)]
         / [(67)(0.003)/18 - 2(1 - 1/2^{10})(0.003)]
       = 2.2/0.00517 ≈ 426.

[Figure 5.12: The SC algorithm compared with the FT algorithm with different k values: (a) α = γ = 1.5, β = 0.009; (b) α = γ = 0.5, β = 0.003.]

When we compare Figure 5.12(a) with Figure 5.12(b), the only difference is the scale of the Y-axis; the crossover message length remains as predicted.
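The same crossover can be checked numerically. A minimal sketch (Python), using the Table 5.4 formulae with t(1024, 18) = 67; the function name is ours:

```python
def L_star_sc_ft(n, k, t, alpha, beta, gamma):
    """Crossover length for SC vs. FT from the Table 5.4 formulae."""
    Ts_sc = 2 * (2 ** n - 1 + n) * (alpha + gamma) + 6 * (2 ** n - 1) * beta
    Tn_sc = 2 * (1 - 2 ** (-2 * n)) * beta
    Ts_ft = t * (alpha + gamma) + 2 * (2 ** n - 1) * k * beta
    Tn_ft = t * beta / k
    return (Ts_sc - Ts_ft) / (Tn_ft - Tn_sc)

print(L_star_sc_ft(5, 18, 67, 1.5, 0.009, 1.5))   # ~427 bytes
print(L_star_sc_ft(5, 18, 67, 0.5, 0.003, 0.5))   # ~427 bytes, identical
```

Both parameter sets yield the same value (roughly 427 bytes; the hand computation above rounds to about 426), because scaling α, β, and γ by a common factor with τ_0 fixed scales the numerator and denominator of L* identically.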
5.4.3 Comparison of EDN and SC

Next, we compare the EDN algorithm with the SC algorithm. The procedure is similar to the process illustrated in the previous discussion. Table 5.5 gives the latency formulae for both algorithms.

Table 5.5: Latency formulae for the SC algorithm and the EDN algorithm.

    algorithm  T_s                                    T_n                τ
    SC         2(2^n - 1 + n)(α + γ) + 6(2^n - 1)β    2(1 - 1/2^{2n})β   T_n[SC]/T_s[SC]
    EDN        3nα + (n + 1)γ + (2^n - 1)β            (n + 1)β           T_n[EDN]/T_s[EDN]

We can compute the difference between T_s[SC] and T_s[EDN] as

    T_s[SC] - T_s[EDN] = (2(2^n) - 2 - n)α + (2(2^n) - 3 + n)γ + 5(2^n - 1)β > 0 for n > 1.

Thus, the inequality T_s[SC] > T_s[EDN] holds for all n > 1. In addition, the difference between T_n[EDN] and T_n[SC] is

    T_n[EDN] - T_n[SC] = (n - 1 + 1/2^{2n-1})β > 0

for all n > 1 and β > 0. From these two conditions, we can conclude that the EDN algorithm is better than the SC algorithm for L < L*, where L* = (T_s[SC] - T_s[EDN])/(T_n[EDN] - T_n[SC]). This indicates that, independent of the underlying wormhole network parameters, the EDN algorithm is more suitable for short messages while the SC algorithm is better for long messages.

In Figure 5.13, the crossover message length is plotted against the network size. As the network size increases, the crossover size increases; and as the value of τ_0 increases, the crossover message size decreases. This trend indicates that the SC algorithm is more suitable for networks with large τ_0 values and large message sizes. However, the decrease in crossover message size becomes smaller as the value of τ_0 increases. Take n = 5 for example: the crossover message size is 14 Kbytes when τ_0 is 0.001 and 7000 bytes when τ_0 is 0.002, a decrease by a factor of 0.5. When τ_0 is 0.003, the crossover message size is 4200 bytes; compared with τ_0 = 0.002, the decrease factor is less than 0.5. In summary, we can conclude that the EDN algorithm is faster for short message sizes, and that the crossover message size increases as τ_0 decreases.

[Figure 5.13: Different L* values for various τ_0.]

5.4.4 Comparison of EDN and FT

Following the same procedure used in comparing the SC algorithm with the FT algorithm, we can derive the conditions that would guarantee that the EDN algorithm performs better than the FT algorithm for any message length: both inequalities T_s[EDN] < T_s[FT] and τ_EDN < τ_FT must be true. From the first inequality, we can derive

    τ_0 > [3n - t(P,k)] / [(2k - 1)(2^n - 1)].

Because t(P,k) > 3n for n = 5 and k > 3, the first inequality is valid for any τ_0 > 0, due to the fact that [3n - t(P,k)] / [(2k - 1)(2^n - 1)] < 0.

To facilitate the derivation of the second inequality, we define R(n,k) as 2k^2(n + 1) - t(P,k). From the calculation of t(P,k), we know that R(5,k) > 0 for any k > 2. The following inequality is valid for any k > 2:

    t(P,k){3nα + (n + 1)γ} - t(P,k) k(n + 1)(α + γ)
      < t(P,k){3nα + 3nγ} - t(P,k) k(n + 1)(α + γ)
      = t(P,k)(α + γ){3n - k(n + 1)} < 0.

Because (2^n - 1)β R(n,k) > 0 for all k, we know that (2^n - 1)β{2k^2(n + 1) - t(P,k)} > 0. Furthermore, the following inequality is valid for all k:

    (2^n - 1)β{2k^2(n + 1) - t(P,k)} > t(P,k){3nα + (n + 1)γ} - t(P,k) k(n + 1)(α + γ).

Rearranging the terms of the above inequality, we can conclude that τ_EDN > τ_FT. Therefore, we know that the inequalities T_s[EDN] < T_s[FT] and τ_EDN > τ_FT both hold for n = 5 and k > 2, so the EDN algorithm is faster than the FT algorithm for message sizes less than L*. Similar to the previous comparison, we can calculate L* for different τ_0 = β/(α + γ), as sketched below.
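A minimal sketch of this calculation (Python); as before, t(P,k) is supplied by the caller, and the illustrative parameter values are ours rather than from the dissertation:

```python
def L_star_edn_ft(n, k, t, alpha, beta, gamma):
    """Crossover length below which EDN beats FT (Table 5.4/5.5 formulae)."""
    Ts_edn = 3 * n * alpha + (n + 1) * gamma + (2 ** n - 1) * beta
    Tn_edn = (n + 1) * beta
    Ts_ft = t * (alpha + gamma) + 2 * (2 ** n - 1) * k * beta
    Tn_ft = t * beta / k
    # EDN has the smaller startup and the larger per-byte cost,
    # so EDN is faster exactly for L < L*.
    return (Ts_ft - Ts_edn) / (Tn_edn - Tn_ft)

# Example: n = 5, k = 18, t(1024, 18) = 67, tau_0 = 0.001
# (alpha = gamma = 0.5, beta = 0.001).
print(L_star_edn_ft(5, 18, 67, 0.5, 0.001, 0.5))   # ~2.5e4 bytes
```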
Figures 5.14(a) and (b) illustrate the crossover message lengths for different τ_0 values. One observation from both Figure 5.14(a) and Figure 5.14(b) is that as k increases, the crossover message length does not increase very rapidly. In addition, as the τ_0 value increases, the crossover message length decreases. This indicates that architectures with lower τ_0 values will favor the EDN algorithm, which agrees with the observation from the τ-plot outlined previously.

[Figure 5.14: The crossover message length (lower and upper bounds) with different τ_0 values: (a) τ_0 = 0.001; (b) τ_0 = 0.002.]

To summarize, the process of comparing the performance of different communication algorithms over various architectural and algorithmic parameters is complicated. In the above analysis, we compared three broadcast algorithms, SC, EDN, and FT, through the use of the τ-model for a particular network topology and size. At the beginning of this section, we used the τ-plot to facilitate the comparison visually. For a range of τ values, the τ-plot allows us to determine the qualitative characteristics of the algorithms under study easily. This is demonstrated by the comparison of the SC algorithm with the FT algorithm, and by the different characteristics exhibited by the EDN algorithm and the SC algorithm. The SC algorithm is faster for long message lengths and favors machines with large τ values as compared with the EDN algorithm. Detailed analysis confirms the observations from the τ-plot. Nevertheless, to determine the crossover message lengths quantitatively, computation from the τ-model is more appropriate. This approach can be applied to other network topologies and sizes, as well as to different networking technologies, as demonstrated in the next section.

5.5 Application of the τ-Model to NOWs

As networking technology has evolved into different branches, and as new protocols and algorithms have been proposed over the past few years at a very fast pace, the need for a unified communication cost model is self-evident. In this section, we present results of applying the τ-model to networks of workstations, an increasingly popular parallel computing platform. In addition to broadcast algorithms, we use the τ-model to study all-to-all broadcast algorithms. Specifically, we use recent results from the literature to point out which aspects of the problem are most important, and speculate on how to improve implementations.

5.5.1 Broadcast Operations

In the following, we demonstrate the use of the τ-model to study various published results for broadcast on different network-of-workstation platforms. This demonstration uses empirical data from several research results to create a global view of those data, which allows us to better understand the performance comparisons among the algorithms. Table 5.6 gives the empirical data under study. The MPI-CCL library [32] is implemented atop unreliable hardware-supported multicast on Ethernet; the two entries are for group sizes of 4 and 8, and represent the performance of an MPI implementation over the User-level Reliable Transport Protocol (URTP) on top of 10 Mbps Ethernet [32]. The SS10/ASX100 entries represent software implementations of broadcast operations through the FORE Systems ATM application interface, forming a spanning binomial tree configuration, with group sizes of 4 and 8 [78].
The data for broadcast on networks such as Myrinet, ATM, and fast Ethernet are taken from [14]. There are two different implementations: the first is based on the Panda system [22] and the second on Fast Messages (FM) [92, 108]. In both cases, the group size is fixed at 8. For the purpose of comparison, we also use data from implementations on parallel machines. The data for the IBM SP-2 are from Xu and Hwang's study [158], in which two different communication libraries, MPI and MPL, were compared on a 512-node SP-2. The data for the Intel Paragon are from measurements of the InterComm collective communication library, also on a 512-node system [105].

[Table 5.6: Performance of broadcast operation.]

As illustrated in Figure 5.15, comparisons of various groups of data can be made using a τ-plot. Comparisons of algorithms over different networking technologies are shown by the pairs (SPARC/100Base-T/Panda/8, SPARC/100Base-T/FM/8), (SPARC/Myrinet/Panda/8, SPARC/Myrinet/FM/8), and (SPARC/ASX200-WG/Panda/8, SPARC/ASX200-WG/FM/8). It is apparent that the points SPARC/100Base-T/Panda/8, SPARC/Myrinet/Panda/8, and SPARC/ASX200-WG/Panda/8 have relatively lower τ values than their counterparts SPARC/100Base-T/FM/8, SPARC/Myrinet/FM/8, and SPARC/ASX200-WG/FM/8. This tells us that the algorithms based on the Panda system are relatively more suitable for long messages than their counterparts based on Fast Messages. Based on the comparison of the points SPARC/100Base-T/Panda/8, SPARC/Myrinet/Panda/8 and SPARC/100Base-T/FM/8, SPARC/Myrinet/FM/8, we can observe that SPARC/Myrinet/FM/8 is better than SPARC/100Base-T/FM/8 for all message sizes, while SPARC/Myrinet/Panda/8 and SPARC/100Base-T/Panda/8 do not exhibit a similar relationship. This indicates that the algorithms based on Fast Messages have taken advantage of the networking technology (from fast Ethernet to Myrinet) while the algorithms based on the Panda system have not.

[Figure 5.15: τ-plots for broadcast operations on different networking environments; points include SS10/ASX100/8SBT, SS10/ASX100/4SBT, Paragon/InterComm/512, RS6000/MPI-CCL/8, RS6000/MPI-CCL/4, SP-2/MPI, SP-2/MPL, and the SPARC entries for 100Base-T, Myrinet, and ASX200-WG; axes: startup vs. τ.]

Comparing the points SS10/ASX100/8SBT, SPARC/ASX200-WG/Panda/8, and SPARC/ASX200-WG/FM/8, which are all of the same group size and all based on ATM networking technology, what makes SS10/ASX100/8SBT deviate from SPARC/ASX200-WG/Panda/8 and SPARC/ASX200-WG/FM/8 is the high startup cost appearing in SS10/ASX100/8SBT. This is partly due to the fact that SS10/ASX100/8SBT uses a point-to-point primitive to implement the multicast, while both SPARC/ASX200-WG/Panda/8 and SPARC/ASX200-WG/FM/8 are built on top of the underlying unreliable multicast mechanism. One more observation from Figure 5.15 is that SPARC/ASX200-WG/FM/8 is better than SP-2/MPI for all message sizes. This indicates that, for group size 8, platforms based on commodity networks such as ATM can have better performance than parallel machines such as the SP-2.

5.5.2 All-to-All Broadcast

For the all-to-all broadcast operation, we considered two different implementations, referred to as A and B, in a traditional Ethernet environment.
The approach adopted in algorithm A is quite similar to that used in MPICH [65]: every participating node posts (P - 1) non-blocking receive requests, and subsequently sends (P - 1) messages to their individual destinations. Due to the medium access control protocol of Ethernet, if every node starts the operation at the same time, the possibility of message collisions during the initial and subsequent transmissions is very high.

The approach adopted in version B is to arrange the P participating nodes in a linear list, with each node successively transmitting its (P - 1) messages to the other nodes. The order in which nodes transmit is determined by their positions in the list. When one node has finished sending its messages, the next node must be triggered to begin sending. This synchronization is accomplished by arranging the message transmission pattern so that the last message sent by a given node is destined for the node that will start next. As an example, first node 0 sends (P - 1) messages to nodes (P - 1), (P - 2), ..., 1. Next, node 1 sends (P - 1) messages to nodes 0, (P - 1), (P - 2), ..., 2. This process continues until node (P - 1) completes its message transmission; a short sketch of this schedule follows the timing analysis below.

A simple timing analysis of version B can be derived by assuming that the time needed to send a message of length L is α + βL, in which case the overall cost is P(P - 1)(α + βL). Therefore, one might assume that this approach would have much worse performance than version A. However, as indicated by the following experiments, the performance of this approach is in some sense quite competitive.
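A minimal sketch of the version-B transmission schedule (Python): node i sends its P - 1 messages to destinations in descending order modulo P, so that its last message is addressed to node i + 1, which then starts its own round. The helper name is ours.

```python
def schedule(P):
    """Return, for each node, the ordered list of its P - 1 destinations."""
    order = []
    for i in range(P):
        targets = [(i - j) % P for j in range(1, P)]
        # The last message triggers the next sender in the linear list.
        assert targets[-1] == (i + 1) % P
        order.append(targets)
    return order

# For P = 4: node 0 sends to 3, 2, 1; node 1 sends to 0, 3, 2; and so on.
print(schedule(4))
```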
[Figure 5.16: Results for the all-to-all operation: (a) version A; (b) version B; completion time vs. message size (4 to 16384 bytes).]

Figures 5.16(a) and (b) plot the results of executing all-to-all broadcast operations (1000 iterations for each message size) on 12 SUN SPARCstation-10 workstations interconnected by a shared Ethernet segment. The plots indicate that in this environment completion times vary for both versions, but the variation is much greater for version A, likely due to network contention. Computing the average across the 1000 runs shows that for message sizes less than 2048 bytes, version B is faster than version A, despite the fact that the minimum value of version A is less than that of version B, as illustrated in Figures 5.17(a) and (b). In order to apply the τ-model to the performance plots of these two algorithms, we used the minimum values, which are the most optimistic estimates of algorithm performance.

[Figure 5.17: Results of fitting the minimum values: (a) all-to-all version A; (b) all-to-all version B; each panel shows the per-iteration minimum, the average, and a fitted line of the form T_s + T_n (message size).]

Following a similar procedure, we measured the performance of these two all-to-all operations with group sizes of 4 and 8. Table 5.7 summarizes the resulting T_s and T_n values for these two algorithms with different group sizes.

Table 5.7: Performance of the all-to-all operation.

    machine         protocol/algorithm/size   T_s (µs)   T_n (µs/byte)   τ = T_n/T_s
    SS10/Ethernet   TCP/A/4                   3220.0     9.11            0.00283
    SS10/Ethernet   TCP/B/4                   5380.0     9.66            0.00180
    SS10/Ethernet   TCP/A/8                   6200.0     47.7            0.00769
    SS10/Ethernet   TCP/B/8                   14700.0    43.8            0.00298
    SS10/Ethernet   TCP/A/12                  12900.0    107.0           0.00830
    SS10/Ethernet   TCP/B/12                  32900.0    92.4            0.00281

From the data in Table 5.7, the τ-plot of the all-to-all operations is constructed.

[Figure 5.18: τ-plot of all-to-all operations.]

As shown in Figure 5.18, one very distinct difference between these two versions of the all-to-all operation is the direction in which the points move as we go from a small group size to a large group size. Version A tends to move to the right while version B tends to move up. Thus, the qualitative comparison of version A vs. version B is that, for the same group size, version A prefers long messages, and this trend grows with increasing group size. Such a comparison is not obvious when we look at latency-vs.-length or other similar performance plots. The second observation from the τ-plot concerns the effect of increased group size. As the group size changes from 4 to 8, the horizontal shift made by version A is greater than that of version B, which indicates that as the group size increases, the rate of growth in the latency plot of A increases more rapidly than that of version B; however, the vertical shift of version B is greater than that of A, thus overshadowing the gain.
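The τ column of Table 5.7 can be recomputed directly from the measured T_s and T_n values; a small sketch (Python), using the table's own data:

```python
# (T_s in microseconds, T_n in microseconds per byte), from Table 5.7.
table_5_7 = {
    "TCP/A/4":  (3220.0,  9.11),
    "TCP/B/4":  (5380.0,  9.66),
    "TCP/A/8":  (6200.0,  47.7),
    "TCP/B/8":  (14700.0, 43.8),
    "TCP/A/12": (12900.0, 107.0),
    "TCP/B/12": (32900.0, 92.4),
}

for name, (Ts, Tn) in table_5_7.items():
    print(f"{name}: tau = {Tn / Ts:.5f}")
# TCP/A/4 gives 0.00283 and TCP/B/4 gives 0.00180, as in the table.
# Larger groups move version A right (larger tau) and version B up (larger T_s).
```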
5.5.3 Summary

To summarize, we have demonstrated the use of the τ-model to compare communication algorithms under different system parametric assumptions in NOW environments. This study allows us to gain a better understanding of the behavior of any communication operation under different system parameters. It also illustrates that, for a specific implementation, the effect of a particular system parameter on its performance can be quantified. From this study, we know that as the number of parameters increases, the task of detailed comparison between different algorithms is not trivial. Different networking environments as well as different algorithm implementations further complicate the issue. However, through the use of the τ-plot, we can capture the most prominent features of an algorithm as compared with other algorithms. As seen from the study of communication operations in the network-of-workstations environment, the τ-model approach provides a common platform to facilitate the understanding of communication performance issues.

Chapter 6

Conclusions and Future Research

Generally speaking, the purpose of a model is to capture the salient features of the object under study with clarity and defining detail. Though many well-known parallel computation models have been developed in recent years, these models are not necessarily well suited to modeling communication. Since communication is a critical part of parallel applications, and since communication technologies and software interfaces are changing frequently, the need for an accurate communication model is immediate.

The research described in this dissertation has produced the following contributions: a general model for the design of efficient collective communication algorithms that exploits the characteristics of certain network hardware architectures; and the proper integration of a communication model and a cost model to support the design of communication primitives for various network environments. Moreover, this work improves the theoretical background used in characterizing the complexity of different communication operations and in comparing their implementations.

Our work in designing efficient communication primitives for wormhole-routed systems generated insight into the proper cost model for collective communication. Specifically, the use of the EDN model in developing communication primitives illustrates a conceptual model, associated with the usual cost model, that aids the design of efficient communication operations while taking into account specific architectural features. The extension of the EDN model to workstation cluster environments can be considered quite natural, since many communication media used in those platforms provide multi-destination message passing. Examples include Ethernet, Fast Ethernet, and ATM switches, all of which support hardware multicast. We have conducted a preliminary study in this area [141]; additional studies are deferred to future research.

As an increasing number and variety of network architectures have been introduced into the commodity market, and as the trend of using such components for parallel computing continues, the task of understanding and comparing communication algorithms is increasingly complicated. The second model developed in this dissertation, the τ-model, ties together the various aspects of performance comparison and allows the comparison of algorithms and networking technologies to be done both qualitatively and quantitatively.

The future generation of parallel computing platforms is growing into an area where diverse networking components and workstations play the major roles. Understanding and comparing the performance of different implementations of the same operation, as well as of different networking technologies, can assist in the refinement and selection of better platform/algorithm combinations. Further research can be conducted in the area of creating tools based on the τ-model. Such tools would facilitate a uniform representation of performance data for different communication operations and technologies as they become available.

APPENDICES

Appendix A

Formulae for EDNs in 2^k x 2^k Meshes

We use the D pattern in Figure 3.8(b) to show how to compute the coordinates of the EDNs in a 2D mesh of size 2^k x 2^k, k >= 3. As illustrated in Figure A.1(a), we denote the four highest-level EDNs of a 2^k x 2^k mesh as A(k) = (A_x(k), A_y(k)), B(k) = (B_x(k), B_y(k)), C(k) = (C_x(k), C_y(k)), and D(k) = (D_x(k), D_y(k)). The construction of the next-level EDNs is as shown in Figure A.1(b). Let the corresponding four highest-level EDNs in this 2^{k+1} x 2^{k+1} mesh be A(k + 1), B(k + 1), C(k + 1), and D(k + 1).

[Figure A.1: Representation of the four highest-level EDNs: (a) in a 2^k x 2^k mesh; (b) in a 2^{k+1} x 2^{k+1} mesh.]
Let the corresponding four highest level EDNs in this 2’”1 x 21+ 1 mesh be A(k + 1), B (k + 139 1”““’P“*y“” 1:] i [j C(k+l) D(k+l) [l I I '1 C(11) D(k) -' C“ I {.1 El I Mk) 30:) (a)the four highest level EDNs (b) the four highest level EDNs in a 2'“; 2'“I in a 2"; 2k mesh mesh Figure -1: Representation of four highest level EDNs 1), C(k + 1) and D(k + 1). The following relation holds for any k 2 3. 430=ruk—0 .Asm=£uk—0 B,(k) = 2" -1— 0,,(1: — 1) , B,(k) .—. 0,,(k — 1) 0,01) = 2k-1— 1 — 0,0: — 1) , 0,01) = 2k — 1 — 0,,(1 — 1) D,(k) = 2k - 1 — D,(k — 1) , 0,,(1) = 2'c — 1 — D,(k — 1) = C,(k) The case of the 8 x 8 mesh serves as a boundary condition for the above re- currence. From Figure 3.8(b), the coordinates for these four level-2 EDNs in 8 x 8 mesh are: A,(3) =1 , A,(3) = 0 B,(3) = 6 , 3,,(3) = 0 61(3) = 2 , 0,,(3) = 7 12(3) = 5 , 1),,(3): 7 141 Now we solve the above recurrence for C,(k). Let S (k) = 2" — 1. By expanding C,(k — 1) with C,(k — 2), C,.(k — 2) with C,(k — 3) and so on until 03(3), we arrive at the following formula for C, (k): 01:0,) Egfl-IP’ISM " j) — 02(3) when k is even Zk—3(-1)j_15(k - j) + Cx(3) when k is odd. i=1 Solving the summation and setting C,c (3) = 2, the following closed-form solution for C,(k) can be found: ”+5 _._ 03(3) = Z"??— when k is even 0:109) = ”—3-?! + 043) = 2"3—‘2 when k is odd. The same procedure can also be used to derive the following equation for C, (k): —2k+1+13 — C (3) = ——2k+1“8 when k is even 3' 3 y 3 y(k) 2_"+_;-_1§ + Cy(3) = w when k is odd. In a similar fashion, the formula for D,(k) can be written as: £3.12 — D,(3) = 3%:3 when k is even D309) = 2%;19 5, 193(3) = 21% when k is odd. Thus, we have the following formula to compute the coordinates of the four highest 142 level EDNs for a mesh of size 2" x 2": 4.00 4+ .Ay(k)=2’fi—5- 3,0.) = —— . 31(1) = 41(1) ) when k is even 04(1) J": .Cy(k>=2"*;-8 194(k) =2"*;-2 ,o,(1)_c,(1.) A400 =22“ .Ay(k)=2"-8 32.-(k) = w 1 B (k) =A (k) y ,, ) when k is odd. 02:05) = 2kT—2 1 011(k) = 24H; 5 19.00 =— ,D,(1.)=c,(1.) , By using the above formulae to calculate the highest-level EDNs for submeshes and by performing coordinate transformations to larger meshes, we can generate all levels of EDNs within a mesh. Using this information, any node in the mesh can compute in advance the list of nodes that it should forward to upon receipt of a broadcast message. Some care must be taken in the startup phase, since a highest level EDN may be required to forward the message to one or more other EDNs at the highest level. The message transmission pattern in this startup phase can be either dependent on the location of source node or fixed. BIBLIOGRAPHY Bibliography [1] [2] [3] [4] [5] [6] [7] [8] [9] M. Adler, J. W. Byers, and R. M. Karp, “Scheduling parallel communica- tion: The h-relation problem,” in Mathematical Foundations of Computer Science 1995, The 20th International Symposium (J. Wiedermann and P. Hajek, Eds.), pp. 1-20, Springer, August 28 — September 1 1995. Lec- ture Notes in Computer Science, Vol. 969. A. Agarwal, “Overview of the Alewife pro ject.” ALEWIFE SYSTEMS MEMO # 10, Laboratory for Computer Science, MIT, July 30 1990. A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung, “The MIT Alewife machine: Archi- tecture and performance,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 2—13, June 22-24 1995. A. Agarwal, D. Chaiken, G. D’Souza, K. Johnson, D. Kranz, J. Kubiatowicz, K. Jurihara, B.-H. 
BIBLIOGRAPHY

Bibliography

[1] M. Adler, J. W. Byers, and R. M. Karp, "Scheduling parallel communication: The h-relation problem," in Mathematical Foundations of Computer Science 1995, The 20th International Symposium (J. Wiedermann and P. Hajek, Eds.), pp. 1-20, Springer, August 28 - September 1, 1995. Lecture Notes in Computer Science, Vol. 969.

[2] A. Agarwal, "Overview of the Alewife project." ALEWIFE SYSTEMS MEMO #10, Laboratory for Computer Science, MIT, July 30, 1990.

[3] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung, "The MIT Alewife machine: Architecture and performance," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 2-13, June 22-24, 1995.

[4] A. Agarwal, D. Chaiken, G. D'Souza, K. Johnson, D. Kranz, J. Kubiatowicz, K. Jurihara, B.-H. Lim, G. Maa, D. Nussbaum, M. Parkin, and D. Yeung, "The MIT Alewife machine: A large-scale distributed-memory multiprocessor," in Proceedings of the Workshop on Scalable Shared-Memory Multiprocessors, June 1990.

[5] A. Aggarwal, A. K. Chandra, and M. Snir, "Communication complexity of PRAMs," Theoretical Computer Science, vol. 71, pp. 3-28, 1990.

[6] S. Ahuja, N. Carriero, and D. Gelernter, "Linda and friends," IEEE Computer, vol. 19, pp. 26-34, August 1986.

[7] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, "LogGP: incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation," in Proceedings of the 7th Annual Symposium on Parallel Algorithms and Architectures, pp. 95-105, July 1995.

[8] T. E. Anderson, D. E. Culler, and D. A. Patterson, "A case for NOW (networks of workstations)," IEEE Micro, vol. 15, pp. 54-64, February 1995.

[9] G. Armitage, "Support for multicast over UNI 3.0/3.1 based ATM networks." Internet-Draft, February 22, 1996. draft-ietf-ipatm-ipmc-12.txt.

[10] S. M. Armstrong, A. O. Freier, and K. A. Marzullo, "Multicast transport protocol." RFC 1301, February 1992.

[11] R. Asbury, S. G. Frison, and T. Roth, "Concurrent computers ideal for inherently parallel problems," Computer Design, pp. 99-107, September 1, 1985.

[12] V. Bala, J. Bruck, R. Bryant, R. Cypher, P. de Jong, P. Elustondo, D. Frye, A. Ho, C.-T. Ho, G. Irwin, S. Kipnis, R. Lawrence, and M. Snir, "The IBM external user interface for scalable parallel systems," Parallel Computing, vol. 20, pp. 445-462, April 1994.

[13] V. Bala, J. Bruck, R. Cypher, P. Elustondo, A. Ho, C.-T. Ho, S. Kipnis, and M. Snir, "CCL: a portable and tunable collective communications library for scalable parallel computers," IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. 154-164, February 1995.

[14] H. Bal, R. Hofman, and K. Verstoep, "A comparison of three high speed networks for parallel cluster computing," in Workshop on Communication and Architectural Support for Network-based Parallel Computing (CANPC'97), pp. 184-197, February 1997.

[15] M. Barnett, R. Littlefield, D. G. Payne, and R. van de Geijn, "Global combine on mesh architectures with wormhole routing," in Proceedings of the 7th International Parallel Processing Symposium (IPPS '93), pp. 156-162, April 13-16, 1993.

[16] M. Barnett, D. G. Payne, R. A. van de Geijn, and J. Watts, "Broadcasting on meshes with worm-hole routing," tech. rep., Department of Computer Science, University of Texas at Austin, November 2, 1993.

[17] M. Barnett, D. G. Payne, R. A. van de Geijn, and J. Watts, "Broadcasting on meshes with wormhole routing," Journal of Parallel and Distributed Computing, vol. 35, no. 2, pp. 112-122, 1996.

[18] A. Bar-Noy and S. Kipnis, "Designing broadcasting algorithms in the Postal model for message-passing systems," in Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 13-22, June 29 - July 1, 1992.

[19] A. Bar-Noy and S. Kipnis, "Designing broadcasting algorithms in the Postal model for message-passing systems," Mathematical Systems Theory, vol. 27, pp. 431-452, September/October 1994.

[20] A. Basu, V. Buch, W. Vogels, and T. von Eicken, "U-Net: a user-level network interface for parallel and distributed computing," in Proceedings of the 15th ACM Symposium on Operating Systems Principles, December 3-6, 1995.
[21] D. P. Bertsekas, C. Ozveren, G. D. Stamoulis, P. Tseng, and J. N. Tsitsiklis, "Optimal communication algorithms for hypercubes," Journal of Parallel and Distributed Computing, vol. 11, pp. 263-275, 1991.

[22] R. Bhoedjang, T. Rühl, R. Hofman, K. Langendoen, and H. Bal, "Panda: a portable platform to support parallel programming languages," in Symposium on Experiences with Distributed and Multiprocessor Systems IV, pp. 213-226, September 1993.

[23] K. P. Birman, "The process group approach to reliable distributed computing," Communications of the ACM, vol. 36, pp. 37-53, December 1993.

[24] K. P. Birman and T. A. Joseph, "Reliable communication in the presence of failures," ACM Transactions on Computer Systems, vol. 5, pp. 47-76, February 1987.

[25] K. P. Birman, A. Schiper, and P. Stephenson, "Lightweight causal and atomic group multicast," ACM Transactions on Computer Systems, vol. 9, pp. 272-314, August 1991.

[26] M. A. Blumrich, C. Dubnicki, E. W. Felten, K. Li, and M. R. Mesarina, "Virtual-memory-mapped network interfaces," IEEE Micro, vol. 15, pp. 21-28, February 1995.

[27] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, "Myrinet: A gigabit-per-second local area network," IEEE Micro, vol. 15, pp. 29-36, February 1995.

[28] S. H. Bokhari and H. Berryman, "Complete exchange on a circuit switched mesh," in Proceedings of the Scalable High Performance Computing Conference (SHPCC '92), pp. 300-306, April 26-29, 1992.

[29] R. A. A. Bruce, J. G. Mills, and A. G. Smith, "CHIMP version 2.0 design," Tech. Rep. EPCC-KTP-CHIMP-V2-DESIGN 1.2, Edinburgh Parallel Computing Center, February 24, 1994. Version 1.2.

[30] R. A. A. Bruce, J. G. Mills, and A. G. Smith, "CHIMP/MPI user guide," Tech. Rep. EPCC-KTP-CHIMP-V2-USER 1.2, Edinburgh Parallel Computing Center, June 1994. Version 1.2.

[31] J. Bruck, D. Dolev, C.-T. Ho, R. Orni, and R. Strong, "PCODE: an efficient and reliable collective communication protocol for unreliable broadcast domains," in Proceedings of the 9th International Parallel Processing Symposium, pp. 130-139, April 1995.

[32] J. Bruck, D. Dolev, C.-T. Ho, M.-C. Rosu, and R. Strong, "Efficient message passing interface (MPI) for parallel computing on clusters of workstations," in Proceedings of the 7th Annual Symposium on Parallel Algorithms and Architectures, July 1995.

[33] G. D. Burns, "The local area multicomputer," in Proceedings of the Fourth Conference on Hypercube Concurrent Computers and Applications, March 1989.

[34] G. D. Burns, R. B. Daoud, and J. R. Vaigl, "LAM: an open cluster environment for MPI," in Supercomputing '94, June 1994.

[35] R. M. Butler and E. L. Lusk, "User's guide to the p4 parallel programming system," Tech. Rep. ANL-92/17, Argonne National Laboratory, Mathematics and Computer Science Division, October 1992.

[36] R. M. Butler and E. L. Lusk, "Monitors, messages, and clusters: the p4 parallel programming system," Parallel Computing, vol. 20, pp. 547-564, April 1994.

[37] R. Calkin, R. Hempel, H.-C. Hoppe, and P. Wypior, "Portable programming with the PARMACS message-passing library," Parallel Computing, vol. 20, pp. 615-632, April 1994.

[38] N. J. Carriero, D. Gelernter, T. G. Mattson, and A. H. Sherman, "The Linda alternative to message-passing systems," Parallel Computing, vol. 20, pp. 633-656, April 1994.

[39] S. Chandra, J. R. Larus, and A. Rogers, "Where is time spent in message-passing and shared-memory programs?," in The Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), October 1994.

[40] D. Cheriton, "VMTP: Versatile message transaction protocol." RFC 1045, February 1988. Stanford University.

[41] E. J. Cockayne, E. O. Hare, S. T. Hedetniemi, and E. V. Wimer, "Bounds for the domination number of grid graphs," Congressus Numerantium, vol. 47, pp. 217-228, 1985.

[42] R. Cole and O. Zajicek, "The APRAM: incorporating asynchrony into the PRAM model," in Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pp. 169-178, June 18-21, 1989.

[43] M. Cosnard and D. Trystram, Parallel Algorithms and Architectures. International Thompson Press, 1995.

[44] Cray Research, Inc., CRAY T3D System Architecture Overview Manual, 1993.

[45] W. J. Dally, "Performance analysis of k-ary n-cube interconnection networks," IEEE Transactions on Computers, vol. 39, pp. 775-785, June 1990.
Cheriton, “VMTP: Versatile message transaction protocol.” RFC 1045, February 1988. Stanford University. [41] E. J. Cockayne, E. O. Hare, S. T. Hedetniemi, and E. V. Wimer, “Bounds for the domination number of grid graphs,” Congressus Numerantium, vol. 47, pp. 217-228, 1985. [42] R. Cole and O. Zajicek, “The APRAM: incorporating asynchrony into the PRAM model,” in Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pp. 169-178, June 18—21 1989. [43] M. Cosnard and D. Trystram, Parallel Algorithms and Architectures. In- ternational Thompson Press, 1995. [44] Gray Research, Inc., CRAY T3D System Architecture Overview Manual, 1993. [45] W. J. Dally, “Performance analysis of k-ary n—cube interconnection networks,” IEEE Transactions on Computers, vol. 39, pp. 775-785, June 1990. 147 [46] W. J. Dally, “Virtual-channel flow control,” IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 194—205, March 1992. [47] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Fyler, “The message-driven processor: A multicom- puter processing node with emcient mechanisms,” IEEE Micro, pp. 23-39, 1992. [48] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed Computing, vol. 1, no. 3, pp. 187-196, 1986. [49] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networ ,” IEEE Transactions on Computers, vol. C-36, pp. 547-553, May 1987. [50] L. De Coster, N. Dewulf, and C.-T. Ho, “Emcient multi-packet multicast al- gorithms on meshes with wormhole and dimension-ordered routing,” in Pro- ceedings of the 1995 International Conference on Parallel Processing, vol. III, pp. 137-141, August 1995. [51] S. Deering, “Host extensions for IP multicasting.” RFC 1112, August 1989. [52] D. Dolev and D. Malki, “The Transis approach to high availability cluster communication,” Communications of ACM, vol. 39, pp. 64—70, April 1996. [53] J. J. Dongarra and T. H. Dunigan, “Message-passing performance of various computers,” Tech. Rep. ORN L / TM-13006, Oak Ridge National Laboratory, February 1996. [54] J. J. Dongarra and D. W. Walker, “Software libraries for linear algebra com- putations on high performance computers,” SIAM Review, vol. 37, no. 2, pp. 151—180, 1995. [55] B. Duzett and R. Buck, “An overview of the nCUBE 3 supercomputer,” in Proceedings of the Fourth Symposium on the FRONTIERS ’92, pp. 458- 464, October 19-21 1992. [56] D. C. Feldmeier, “An overview of the TP++ transport protocol project,” in High Performance Networks—Frontiers and Experience (A. Tantawy, Ed.), ch. 8, pp. 157-176, Boston, MA: Kluwer Academic Publishers, 1993. [57] S. Floyd, V. Jacobson, S. McCanne, C.-G. Liu, and L. Zhang, “A reliable multicast framework for light—weight sessions and application level framing,” in Proceedings of the ACM SIGCOMM ’95, pp. 342-356, August 1995. [58] S. Fortune and J. Wyllie, “Parallelism in random access machines,” in Pro- ceedings of the 10‘“ Annual ACM Symposium on Theory of Computing, pp. 114-118, May 1-3 1978. 148 [59] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine. The MIT Press, 1994. [60] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley, “PICL: A portable instrumented communication library C Reference Manual,” Tech. Rep. ORNL/TM-11130, Oak Ridge National Laboratory, July 1990. [61] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. 
[61] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley, "A user's guide to PICL: A portable instrumented communication library," Tech. Rep. ORNL/TM-11616, Oak Ridge National Laboratory, August 1990.

[62] M. Gerla, P. Palnati, and S. Walton, "Multicasting protocols for high-speed wormhole-routing local area networks," in SIGCOMM '96, 1996.

[63] P. B. Gibbons, "A more practical PRAM model," in Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pp. 158-168, June 18-21, 1989.

[64] W. Gropp and E. Lusk, "User's guide for mpich, a portable implementation of MPI," Tech. Rep. ANL/MCS-TM-96-6, Argonne National Laboratory, 1996.

[65] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," tech. rep., Argonne National Laboratory, 1996.

[66] S. K. S. Gupta and D. K. Panda, "Barrier synchronization in distributed-memory multiprocessors using rendezvous primitives," in International Parallel Processing Symposium (IPPS '93), pp. 501-506, 1993.

[67] P. J. Hatcher and M. J. Quinn, Data-Parallel Programming on MIMD Computers. Cambridge, Massachusetts: The MIT Press, 1991.

[68] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta, "Integration of message passing and shared memory in the Stanford FLASH multiprocessor," in Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 38-50, October 1994.

[69] R. Hempel, H.-C. Hoppe, and A. Supalov, PARMACS 6.0 Library Interface Specification. Institute for Foundations of Information Technology, German National Research Center for Computer Science, December 17, 1992.

[70] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., second ed., 1996.

[71] High Performance Fortran Forum, "High Performance Fortran language specification" (version 1.0), May 1993.

[72] R. W. Hockney, "The communication challenge for MPP: Intel Paragon and Meiko CS-2," Parallel Computing, vol. 20, pp. 389-398, 1994.

[73] C.-T. Ho and S. L. Johnsson, "Distributed routing algorithms for broadcasting and personalized communication on hypercubes," in Proceedings of the 1986 International Conference on Parallel Processing, pp. 640-648, 1986.

[74] C.-T. Ho and M.-Y. Kao, "Optimal broadcast in all-port wormhole-routed hypercubes," in Proceedings of the 1994 International Conference on Parallel Processing, vol. III, pp. 167-171, August 1994.

[75] C.-T. Ho and M.-Y. Kao, "Optimal broadcast in all-port wormhole-routed hypercubes," IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. 200-204, February 1995.

[76] C.-T. Ho and M.-Y. Kao, "Optimal broadcast in all-port wormhole-routed hypercubes," IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. 200-204, February 1995.

[77] C. C. Huang, E. P. Kasten, and P. K. McKinley, "Design and implementation of multicast operations for ATM-based high performance computing," in Supercomputing '94, pp. 164-173, November 1994.

[78] C. C. Huang and P. K. McKinley, "Communication issues in parallel computing across ATM networks," IEEE Parallel & Distributed Technology, pp. 73-86, Winter 1994.

[79] H. H. J. Hum, K. B. Theobald, and G. R. Gao, "Building multithreaded architectures with off-the-shelf microprocessors," in Proceedings of the 8th International Symposium on Parallel Processing, pp. 288-294, April 1994.
[80] Inktomi Corporation, The INKTOMI Technology Behind HOTBOT: A White Paper, 1996.

[81] "Computer on a chip." New Products, IEEE Computer, p. 71, January/February 1972.

[82] Intel Corporation, Paragon XP/S Product Overview, 1991.

[83] D. S. Johnson, "The NP-completeness column: An ongoing guide," Journal of Algorithms, vol. 5, pp. 147-160, 1984.

[84] H. W. Johnson, Fast Ethernet: Dawn of a New Network. Prentice Hall, 1996.

[85] S. L. Johnsson and C.-T. Ho, "Optimum broadcasting and personalized communication in hypercubes," IEEE Transactions on Computers, vol. 38, pp. 1249-1268, September 1989.

[86] K. K. Keeton, T. E. Anderson, and D. A. Patterson, "LogP quantified: The case for low-overhead local area networks," in Hot Interconnects III: A Symposium on High Performance Interconnects, August 10-12, 1995.

[87] R. E. Kessler and J. L. Schwarzmeier, "CRAY T3D: A new dimension for Cray Research," in Proc. COMPCON, pp. 176-182, 1993.

[88] A. Khokhar, "C3: An architecture-independent model for coarse-grained machines," in Workshop on Models of Parallel Computation, January 27, 1995.

[89] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings, 1994.

[90] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, "The Stanford FLASH multiprocessor," in Proceedings of the 21st International Symposium on Computer Architecture, pp. 302-313, April 1994.

[91] S. Lamberts, G. Stellner, A. Bode, and T. Ludwig, "Paragon parallel programming on Sun workstations," in Sun User Group Proceedings, pp. 87-93, December 1993.

[92] M. Lauria and A. Chien, "MPI-FM: high performance MPI on workstation clusters," Journal of Parallel and Distributed Computing, February 1997.

[93] Z. Li, P. H. Mills, and J. H. Reif, "Models and resource metrics for parallel and distributed computation," in Proceedings of the 28th Annual Hawaii International Conference on System Sciences, January 3-6, 1995.

[94] M. Lin, J. Hsieh, D. H. C. Du, J. P. Thomas, and J. A. MacDonald, "Distributed network computing over local ATM networks," IEEE Journal on Selected Areas in Communications, vol. 13, no. 4, pp. 733-748, 1995.

[95] X. Lin, P. K. McKinley, and L. M. Ni, "Deadlock-free multicast wormhole routing in 2D mesh multicomputers," in 1991 International Conference on Parallel Processing, vol. I, pp. 435-442, 1991.

[96] X. Lin, P. K. McKinley, and L. M. Ni, "Deadlock-free multicast wormhole routing in 2-D mesh multicomputers," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 8, pp. 793-804, 1994.

[97] X. Lin and L. M. Ni, "Deadlock-free multicast wormhole routing in multicomputer networks," in Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 116-125, May 1991.

[98] R. P. Martin, "HPAM: An active message layer for a network of HP workstations," in Proceedings of Hot Interconnects II, August 1994.

[99] P. K. McKinley and C. Trefftz, "Efficient broadcast in all-port wormhole-routed hypercubes," in Proceedings of the 1993 International Conference on Parallel Processing, vol. II, pp. 288-291, August 1993.

[100] P. K. McKinley and C. Trefftz, "MultiSim: A tool for the study of large-scale multiprocessors," in Proceedings of the 1993 International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Networks (MASCOTS), pp. 57-62, January 1993.
[101] P. K. McKinley, Y. Tsai, and D. F. Robinson, "Collective communication in wormhole-routed massively parallel computers," IEEE Computer, vol. 28, pp. 39-50, December 1995.

[102] P. K. McKinley, H. Xu, A.-H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed networks," in Proceedings of the 1992 International Conference on Parallel Processing, vol. II, pp. 10-19, August 1992.

[103] P. K. McKinley, H. Xu, A.-H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed direct networks," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 1254-1265, December 1994.

[104] Message Passing Interface Forum, "MPI: A message-passing interface standard," International Journal of Supercomputing Applications, vol. 8, no. 3/4, 1994. Special Issue on MPI.

[105] P. Mitra, D. Payne, L. Shuler, R. van de Geijn, and J. Watts, "Fast collective communication libraries, please," Tech. Rep. TR-95-22, Department of Computer Sciences, The University of Texas, June 1995.

[106] NCUBE Company, NCUBE 6400 Processor Manual, 1990.

[107] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62-76, February 1993.

[108] S. Pakin, M. Lauria, and A. Chien, "High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet," in On-line Proceedings of Supercomputing '95 at http://www.supercomp.org/sc95/proceedings/, 1995.

[109] J. Palmer, "A VLSI parallel supercomputer," in Proceedings of the First Conference on Hypercube Multiprocessors (M. T. Heath, Ed.), August 26-27, 1985.
[121] R. V. Renesse, T. M. Hickey, and K. P. Birman, “Design and performance of Horus: A lightweight group communications system,” Tech. Rep. TR-94- 1442, Department of Computer Science, Cornell University, August 1994. [122] D. F. Robinson, Scalable Multicast Communication in Massively Parallel Computers. PhD thesis, Department of Computer Science, Michigan State University, 1994. 153 [123] D. F. Robinson, D. Judd, P. K. McKinley, and B. H. C. Cheng, “Eficient multicast in all-port wormhole-routed hypercubes,” Journal of Parallel and Distributed Computing, vol. 31, pp. 126-140, 1995. [124] D. F. Robinson, P. K. McKinley, and B. H. C. Cheng, “Optimal Multicast Communication in Torus Networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 6, pp. 1029—1042, October 1995. [125] S. H. Rodrigues, “Building a better byte stream,” Master’s thesis, University of California at Berkeley, 1996. [126] M. Schmidt-Voigt, “Eficient parallel communications with the nCUBE 2S processor,” Parallel Computing, vol. 20, pp. 509—530, April 1994. [127] H. D. Schwetman, “Csim: A C-based, process oriented simulation laguage,” Tech. Rep. PP-080-85, Microelectronics and Computer Technology Corpora- tion, 1985. [128] D. S. Scott, “Eficient all-to-all communication patterns in hypercube and mesh topologies,” in Proceedings of the 6th Conference on Distributed Memory Concurrent Computers, pp. 398-403, 1991. [129] C. L. Seitz, “The Cosmic cube,” Communications of the ACM, vol. 28, pp. 22-33, January 1985. [130] A. Skjellum, S. G. Smith, N. E. Doss, A. P. Leung, and M. Morari, “The design and evolution of Zipcode,” Parallel Computing, vol. 20, pp. 565-596, April 1994. [131] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, Eds., MPI: The Complete Reference. Cambridge, Massachusetts: The MIT Press, 1996. [132] G. Stellner, A. Bode, S. Lamberts, and T. Ludwig, “NXLib — a parallel pro- gramming environment for workstation clusters,” in PA RLE ’94 Parallel Ar- chitectures and Languages Europe, no. 817 in Lecture Notes in Computer Science, pp. 745-748, Springer, 1994. [133] G. Stellner, A. Bode, S. Lamberts, and T. Ludwig, “Developing application for multicomputer systems on workstations,” in H igh-Performance Comput- ing and Networking, International Conference and Exhibition, no. 797 in Lecture Notes in Computer Science, pp. 286-292, Springer, 1994. [134] W. T. Strayer, M. J. Lewis, and R. E. Cline, Jr., “XTP as a transport pro- tocol for distributed parallel processing,” in Proceedings of the USENIX Symposium on High-Speed Networking, pp. 91-102, August 1-3 1994. [135] V. Sunderam, “PVM: a framework for parallel distributed computing,” Con- currency: Practice and Experience, vol. 2, December 1990. 154 [136] S. S. Takkella and S. R. Seidel, “Broadcast and complete exchange algo- rithms for meshes,” in Proceedings of the Conference on Scalable High- Performance Computing (SHPCC ’94), pp. 422—428, May 23-25 1994. [137] R. Thakur and A. Choudhary, “All-to-all communication on meshes with wormhole routing,” in Proceedings of the 8th International Parallel Pro- cessing Symposium (IPPS ’94), pp. 561-565, April 1994. [138] R. Thakur, A. Choudhary, and G. Fox, “Complete exchange on a wormhole routed mesh,” in Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunica- tion Systems, pp. 131-135, January 31-February 2 1994. [139] Thinking Machines Corporation, CMMD and Active Messages, 1993. [140] C. 
[140] C. Topolcic, "Experimental internet stream protocol, version 2 (ST-II)." RFC 1190, October 1990.

[141] Y. Tsai, Y. Huang, and P. K. McKinley, "Performance evaluation of barrier synchronization in ATM networks," in Proceedings of the 5th International Conference on Computer Communications and Networks, pp. 144-151, October 16-18, 1996.

[142] Y. Tsai and P. K. McKinley, "Broadcast in all-port wormhole-routed 3D mesh networks using extended dominating sets," in International Conference on Parallel and Distributed Systems, pp. 120-127, December 19-23, 1994.

[143] Y. Tsai and P. K. McKinley, "A dominating set model for broadcast in all-port wormhole-routed 2D mesh networks," in Proceedings of the 8th ACM International Conference on Supercomputing, pp. 126-135, July 11-15, 1994.

[144] Y. Tsai and P. K. McKinley, "An extended dominating node approach to collective communication in wormhole-routed networks," in Proceedings of the Scalable High Performance Computing Conference, pp. 199-206, May 23-25, 1994.

[145] Y. Tsai and P. K. McKinley, "A broadcast algorithm for all-port wormhole-routed torus networks," in The Fifth Symposium on the Frontiers of Massively Parallel Computation, pp. 529-536, February 6-9, 1995.

[146] Y. Tsai and P. K. McKinley, "A broadcast algorithm for all-port wormhole-routed torus networks," IEEE Transactions on Parallel and Distributed Systems, vol. 7, pp. 876-885, August 1996.

[147] Y. Tsai and P. K. McKinley, "An extended dominating node approach to collective communication in wormhole-routed networks," IEEE Transactions on Parallel and Distributed Systems, vol. 8, pp. 41-58, January 1997.

[148] Y.-C. Tseng, D. K. Panda, and T.-H. Lai, "A trip-based multicasting model in wormhole-routed networks with virtual channels," IEEE Transactions on Parallel and Distributed Systems, vol. 7, pp. 138-150, February 1996.

[149] L. W. Tucker and A. Mainwaring, "CMMD: active messages on the CM-5," Parallel Computing, vol. 20, pp. 481-496, April 1994.

[150] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, pp. 103-111, August 1990.

[151] R. van de Geijn, "On global combine operations," Journal of Parallel and Distributed Computing, vol. 22, pp. 324-328, 1994.

[152] T. von Eicken, A. Basu, and V. Buch, "Low-latency communication over ATM networks using active messages," IEEE Micro, vol. 15, pp. 46-53, February 1995.

[153] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active messages: a mechanism for integrated communication and computation," in Proceedings of the 19th International Symposium on Computer Architecture, pp. 256-267, May 1992.

[154] K. Verstoep, K. Langendoen, and H. Bal, "Efficient reliable multicast on Myrinet," in Proceedings of the 1996 International Conference on Parallel Processing, pp. III-156 - III-165, August 12-16, 1996.

[155] R. van Renesse, K. P. Birman, and S. Maffeis, "Horus, a flexible group communication system," Communications of the ACM, vol. 39, pp. 76-83, April 1996.

[156] J. Watts and R. van de Geijn, "A pipelined broadcast for multidimensional meshes," Parallel Processing Letters, vol. 5, no. 2, pp. 281-292, 1995.

[157] XTP Forum, Xpress Transport Protocol Specification: XTP Version 4.0, March 1, 1995.

[158] Z. Xu and K. Hwang, "Modeling communication overhead: MPI and MPL performance on the IBM SP2," IEEE Parallel & Distributed Technology, pp. 9-23, Spring 1996.
[159] H. Xu, P. K. McKinley, and L. M. Ni, "Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers," Journal of Parallel and Distributed Computing, vol. 16, pp. 172-184, October 1992.

[160] H. Zhou and A. Geist, "Faster (ATM) message passing in PVM," in The 9th International Parallel Processing Symposium: Workshop on High-Speed Network Computing, April 1995.