“I, . .,. “ “aefiés . a .,. ”km. 3 §” .1 Wk. “ "- u:fi.:¢<3§,fi '“’ "”111 frat“. '- «‘34 :t" fad-it: .a . ,'~» 63.3» y. . .5 ' ' vsffi-‘g 7w. ",1 v.2, 31‘ . 7; . asaggfij .1” Kiwi- 7 “1‘7."- —‘ ‘ \- ‘W‘ “L Ari“ {a 1: . ». 1: K .\ 15¢ a ‘ Or a; . J» ,‘ '1; "I 1-. . $1.s§§!.3“v’~ v <' ?'~ 3%? 4‘ ‘ 'e \, .2. , r. Ly, w; .- ., . ~ fig; 1"?7fiau n ’(eju £22; ~ ~1‘3a'r' , 31.34%“1‘i3f55’ ’ . "3. 53-383, 4'9: “ . A i? 1'3: L331": g ' .. fiai'izmz s. ., 3,. ,4 ' I 3‘ ~ "413 ff - 5' " «Cu. ‘ 4' ‘ ”gm. " ' 4. . ' , A9,. ., w ’z’mmr “‘3"? ':"'t '1- ‘ 1. 3‘1 g 4"}: _ .V- r v ‘ 1:. ' \ ‘. > I. l +r r ' ‘15:?" 3 . L V «a: J , < , r. ' ‘ 2:7 ‘13.? 1:969! ill nARtES illiiilli‘llgxlliii 31 This is to certify that the dissertation entitled SCALABLE MULTICAST COMMUNICATION IN MASSIVELY PARALLEL COMPUTERS presented by David F. Robinson has been accepted towards fulfillment of the requirements for PhD degree in Computer Science fwf IM/[C/w/ML [Major professor j Date 913/73)! MSU is an Affirmative Action/Equal Opportunity Institution 0-12771 .v—~<_. _. fir - LIBRARY M'Chigan State University me .1 serum BOXtoumovombchockoutfiom y‘our mom. TO AVOlD mes Mum on or More a. disc. DATE DUE DATE DUE DATE DUE MSU It An Affirmative MW Opponunlty lmtltulon Wows-9.1 \_____——‘ SCALABLE MULTICAST COMMUNICATION IN MASSIVELY PARALLEL COMPUTERS By David F. Robinson A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Computer Science Department 1994 ABSTRACT SCALABLE MULTICAST COMMUNICATION IN MASSIVELY PARALLEL COMPUTERS By David F. Robinson Efficient communication has long been considered the key to achieving ever greater performance from parallel processing. Recently, much attention has been focused on multicast communication, in which a single source node delivers a message to a group of destination nodes. Such operations have become recognized as crucial to the performance of many parallel algorithms. Given the important role played by multi- cast communication in parallel processing, the research reported in this dissertation addresses the effect on multicast communication of three key aspects of communica- tion architecture in massively parallel computers, namely (1) port model; (2) virtual channels; and (3) intermediate message reception. This research shows that the per- formance of multicast communication can be significantly improved by considering these three architectural characteristics in the design of multicast operations. Research into the effect of the port model focuses on the problem of multicast in wormhole-routed hypercubes. The system model allows a processor to send and re- ceive data in all dimensions simultaneously. New theoretical results that characterize contention among messages in wormhole-routed hypercubes are developed and used to design new multicast routing algorithms. The algorithms are compared in terms of the number of steps required in each, their measured execution times when imple- mented on a relatively small-scale nCUBE—Z, and their simulated execution times on larger hypercubes. The results indicate that significant performance improvement is possible when the multicast algorithm actively identifies and uses multiple ports in parallel. Through the study of virtual channels, efficient algorithms are presented to imple- ment multicast communication in wormhole-routed torus networks. By exploiting the properties of the switching technology and the use of virtual channels, a minimum- time multicast algorithm is presented for n-dimensional torus networks that use de- terministic, dimension-ordered routing of unicast messages. In order to study the third characteristic of communication architecture, research is presented that focuses on torus networks in which intermediate nodes on a message path are able to receive a copy of a message while simultaneously routing the message to subsequent destinations. In developing new multicast algorithms for such networks, this research examines the effects of intermediate message reception on multicast communication. The results of a simulation study show that, through the efficient use of special routing hardware, the performance of multicast communication in torus networks with unidirectional communication links can be significantly improved. Copyright © by David F. Robinson 1994 ACKNOWLEDGMENTS It is with deep gratitude that I acknowledge my Ph.D. advisers, Dr. Betty H. C. Cheng and Dr. Philip K. McKinley. Because of their eagerness to help, whenever help was needed, my academic career has been extremely rewarding. Drs. McKinley and Cheng were in a large part responsible for many of the op- portunities available to me while at Michigan State University. Their knowledge and insight have been invaluable in defining the direction of my research. Through their advice regarding research topics and publication opportunities, and their subsequent help in my development, preparation, and refinement of technical papers, I have been able to pursue a very rewarding course of research. Their careful reading of numerous publication drafts, including earlier versions of this dissertation, has made my work more enjoyable and successful than I could have envisioned. Their guidance, advice, and encouragement, along with a genuine and selfless concern for my interests, have made all the difference during my graduate education at Michigan State University. I thank the members of my Ph.D. committee, Dr. Abdol H. Esfahanian and Dr. Marvin L. Tomber, for the very helpful and important work they have performed on my behalf. Finally, I thank those faculty members at Michigan State University, who through their genuine interest and enthusiasm for education, have taught me much about excellence in teaching. The lessons I have learned from them will be valuable to me throughout my career. TABLE OF CONTENTS LIST OF FIGURES 1 Introduction 1.1 Motivation ................................. 1.2 Thesis Statement ............................. 1.3 Research Contributions .......................... 1.4 Dissertation Organization ........................ 2 Background and Related Work 2.1 Multicast Communication ........................ 2.2 MPC Communication Architectures ................... 2.2.1 Network Topologies ........................ 2.2.2 Switching Strategy ........................ 2.2.3 Routing Algorithm ........................ 2.3 Port Model ................................ 2.4 Virtual Channels ............................. 2.5 Intermediate Message Reception ..................... 3 Unicast-Based Multicast in All-Port Hypercubes 3.1 Issues ................................... 3.2 Theoretical Foundations ......................... 3.2.1 Notation and Definitions ..................... 3.2.2 Useful Lemmas .......................... 3.2.3 Arc-Disjoint Paths ........................ 3.2.4 Avoiding Depth Contention ................... 3.3 Algorithms Based on Dimension-Ordered Chains ............ 3.4 An Algorithm Based on Cube-Ordered Chains ............. 3.5 Performance Evaluation ......................... 3.5.1 Stepwise Comparisons ...................... 3.5.2 Implementations on an nCUBE-2 ................ 3.5.3 Simulations of Larger Systems .................. 3.6 Conclusions ................................ 4 Unicast-Based Multicast in One-Port Torus Networks 4.1 System Model ............................... 4.2 Unicast Routing Algorithms ....................... vi viii Anhwv—‘I-I (0&00365 12 14 15 18 20 24 25 28 29 31 32 33 36 40 50 50 51 53 54 57 58 60 4.2.1 Unidirectional Torus Routing .................. 4.2.2 Bidirectional Torus Routing ................... 4.3 Contention in Wormhole-Routed Torus Networks ........... 4.4 Optimal Multicast Algorithm ...................... 4.5 Performance Evaluation ......................... 4.6 Conclusions ................................ 5 Path-Based Routing in Unidirectional Torus Networks 5.1 Issues ................................... 5.2 Hamiltonian Circuits ........................... 5.3 The Path Routing Function ....................... 5.3.1 Restrictive Routing ........................ 5.3.2 General Routing ......................... 5.4 Message Preparation ........................... 5.5 Correctness ................................ 5.6 Implementation Issues .......................... 5.6.1 Multi-Destination Message Format ............... 5.6.2 The Flit Forwarding Algorithm ................. 5.6.3 Boundary Identification ..................... 5.6.4 Implementation of the Path Routing Function ......... 5.7 Conclusions ................................ 6 Path-Based Multicast in Unidirectional Torus Networks 6.1 The S-Torus Multicast Algorithm .................... 6.2 The M-Torus Generalized Multicast Algorithm ............. 6.3 The Md-Torus Multicast Algorithm ................... 6.4 The Mu-Torus Multicast Algorithm ................... 6.5 Multi-Phase Implementation Issues ................... 6.6 Performance Evaluation ......................... 6.7 Conclusions ................................ 7 Conclusions BIBLIOGRAPHY vii 61 63 65 74 80 88 89 92 96 96 98 100 103 111 111 113 115 116 117 118 119 120 124 125 127 130 134 142 145 LIST OF FIGURES 2.1 MPC network topologies ......................... 10 2.2 Example message paths under dimension-ordered routing in a mesh . 16 2.3 Generic MPC node architecture [1] ................... 17 2.4 Channel dependencies in a 1D torus with single channels ....... 19 2.5 Router support for multicast communication .............. 21 2.6 A multicast operation using message replication ............ 22 3.1 An example of multicast in a 4-cube .................. 26 3.2 Unicast-based software multicast trees ................. 27 3.3 Conditions 2, 3, and 4 of Theorem 3.3 ................. 35 3.4 Multicast chain in a one-port 4-cube .................. 37 3.5 Generalized multicast algorithm ..................... 39 3.6 Simple Maxport and U-cube comparison ................ 39 3.7 Cube-ordered chains of dimension 4 ................... 41 3.8 The WeightedSort procedure ....................... 43 3.9 Examples of multicast communication .................. 45 3.10 Stepwise comparisons on a 6-cube .................... 51 3.11 Stepwise comparisons on a 10-cube ................... 52 3.12 Average delay comparisons on a 5-cube ................. 53 3.13 Maximum delay comparisons on a 5-cube ................ 54 3.14 Average delay comparisons on a lO-cube ................ 55 3.15 Maximum delay comparisons on a 10-cube ............... 56 4.1 Examples of 2D torus networks ..................... 59 4.2 Virtual channels in one dimension of a unidirectional torus ...... 62 4.3 Virtual channels in one dimension of a bidirectional torus ....... 63 4.4 An example of multicast in a 2D torus ................. 66 4.5 Unicast-based software multicast trees ................. 67 4.6 Channels used by P(u,v) and P(:c,y) in dimension d (Theorem 4.2) . 72 4.7 Channels used by P(u,v) and P($,y) in dimension d (Theorem 4.3) . 74 4.8 A minimum-time multicast ........................ 75 4.9 The U-torus algorithm for multicast ................... 77 4.10 Example multicast using the U-torus algorithm ............ 78 4.11 Possible relations between unicasts produced by U-torus ....... 78 4.12 Average communication steps (512 and 1024-node tori) ........ 82 4.13 Maximum communication steps (512 and 1024-node tori) ....... 83 viii 4.14 4.15 4.16 4.17 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 Average communication steps (4096-node tori) ............. 84 Maximum communication steps (4096-node tori) ............ 85 Effects of physical link sharing ...................... 86 Comparison of 2-way and 3-way physical link sharing ......... 87 A multicast operation in a 2D torus ................... 90 Port-induced communication deadlock in a one-port system ...... 92 Path-based multicast routing in a mesh using Hamiltonian Paths . . . 93 Hamiltonian circuit ’H in a 2D torus ................... 94 Hamiltonian circuit 71 in a 3D torus ................... 95 A (restrictive) multi-destination message in a 2D torus ........ 98 A multi-destination message in a 2D torus ............... 103 The message preparation algorithm for path-based routing ...... 104 Multi-destination message format .................... 112 The flit forwarding algorithm for path-based routing .......... 114 Implementation of the path routing function RUTH; .......... 116 The M-torus algorithm for multicast .................. 122 Compound message format ........................ 128 Multicast latency (4096-node 2D torus, high message startup latency) 136 Multicast latency (4096-node 2D torus, low message startup latency) 137 Multicast latency (4096-node 3D torus, high message startup latency) 138 Multicast latency (4096-node 3D torus, low message startup latency) 139 Multicast latency (4096-node 2D torus, 512 node multicast size) . . . 140 Total link usage (4096-node tori) .................... 141 ix CHAPTER 1 Introduction The recent trend in supercomputer design has been towards scalable parallel comput- ers, which are designed to offer corresponding gains in performance as the number of processors is increased. Many such systems, known as massively parallel computers (MPCs), are characterized by the distribution of memory among an ensemble of pro- cessing nodes. Each node has its own processor, local memory, and other supporting devices. MPCs are scalable because, as the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase. 1 .1 Motivation In parallel scientific computing, data must be redistributed periodically in such a way that all processors can be kept busy performing useful tasks. Because they do not physically share memory, nodes in MPCs must communicate by passing messages through a communications network. Some communication operations are point-to- point, that is, they involve only a single source and a single destination. Other operations are collective, in that they involve more than two nodes. Examples of collective communication include multicast, reduction, and barrier synchronization. Multicast communication, in which a single source node delivers a message to a group of destination nodes, is important to many MPC activities, including numeric algorithms, parallel simulation, and the implementation of data-parallel languages, such as High Performance Fortran. The performance of MPCs is thus highly de- pendent on the performance of the underlying multicast operations, which, in turn, depends on several characteristics of the MPC communication architecture. Three characteristics that affect the performance of multicast communication are network topology, message routing algorithm, and message switching strategy. The network topology defines the pattern of interconnection that exists between nodes; for example, the nodes of an MPC may be connected to form a two-dimensional (2D) mesh network. The routing algorithm determines the path taken by a message between the source and destination nodes, and the switching strategy determines how messages are transferred between adjacent nodes on the message path. Early multicomputers used store-and-forward switching, in which the time taken to transmit a message is proportional to the distance between the source and destination nodes. In contrast, many current MPCs employ wormhole routing, where messages are pipelined through the network. The characteristics of communication architecture noted above have been widely studied; these contributions are surveyed in Chapter 2. However, there are other system attributes that also have a large effect on the performance of multicast com- munication. One such attribute is the port model, which describes the number of (parallel) connections between a node processor and the communication network. In a one-port architecture, each processor is connected to the network by a single in- put / output channel pair, thereby effectively serializing all communication originating from, and destined for, that node. Some MPCs support a multi-port architecture, where processors are connected to the communication network by multiple pairs of input / output channels. In an all-port architecture, nodes are maximally connected to the network, so that simultaneous communication over all network links is possible. Another characteristic affecting multicast communication is the use of virtual channels. Some network topologies require virtual channels, in which adjacent nodes are connected by more than a single logical link in one or both directions. The redundant communication paths provided by virtual channels are needed by these topologies in order to provide deadlock-free communication. Multicast communication in MPCs is also influenced by intermediate reception capability. When a message is routed through an intermediate node on a path from the source node to the destination, the processor of the intermediate node is ordinarily unaffected by the message; in fact, message routing hardware is usually such that the processor of an intermediate node cannot access a passing message without interfering with its transmission. A system that allows a processor to simultaneously receive a message while it is being relayed through the node enroute to other destinations is said to have intermediate reception capabilities. 1 .2 Thesis Statement In this dissertation, we address three specific characteristics of MPC communication architecture, and how they relate to the performance of multicast communication. The thesis statement is: The performance of multicast communication in MPCS can be significantly improved by exploiting specific properties of the following three character- istics in the design of the multicast operation: (1 ) port model; {2) virtual channels; and (3) intermediate message reception. 1.3 Research Contributions This dissertation offers three specific contributions to the current research in the area of multicast communication on MPCs, as follows: 1. We develop a new multicast algorithm for all-port hypercubes. This new algo- rithm is shown to perform significantly better on all-port architectures than the best known algorithm, which is optimal for one-port architectures. 2. We develop an optimal multicast algorithm for one-port torus networks. This algorithm accounts for the presence of virtual channels in torus topologies and is compatible with networks having either unidirectional or bidirectional com- munication links. 3. We develop new multicast algorithms for torus networks with intermediate mes- sage reception capabilities. These algorithms allow multicast communication to be efficiently implemented either wholly or partly in hardware, with a resulting performance gain over corresponding software implementations. 1.4 Dissertation Organization The remainder of this dissertation is organized as follows. In Chapter 2, background is given for multicast communication, MPC communication architectures, and the influence of these architectures on the design of efficient multicast operations. Also in Chapter 2, the related research is reviewed, and differences from the research described in this dissertation are discussed. The three research contributions stated above are described in Chapters 3 through 6. Chapters 3 and 4 describe in detail the research in the area of efficient software-based multicast algorithms for MPCs. In Chapter 3, investigations into all-port hypercubes are presented, while in Chapter 4, we discuss our findings for one-port torus networks. The research presented in Chapters 5 and 6 addresses multicast methods for torus networks with intermediate reception capability and unidirectional communication links. In Chapter 5, a deadlock-free path—based routing method is described. In Chapter 6, this routing method is used as a basis for path-based multicast algorithms. Two classes of algorithms are presented: (1) single- phase, in which a single multi-destination message is used to perform the multicast operation, and (2) multi-phase, where a collection of multi-destination messages are generated during a sequence of communication phases. Finally, Chapter 7 presents the concluding remarks. CHAPTER 2 Background and Related Work This chapter presents background information needed to study multicast communi- cation in MPCs. Multicast communication operations, MPC communication archi- tectures, and the issues involved in the implementation of multicast operations are covered. The research of other investigators is reviewed, and differences from the research described in this dissertation are described. 2.1 Multicast Communication Point-to-point, or unicast, communication involves a single source node and a sin- gle destination node. In collective communication, also termed group communica- tion, more than two nodes are involved. Collective communication operations can be roughly decomposed into three categories: (1) those with a single source node and multiple destination nodes; (2) those with multiple source nodes and a single destination node; and (3) those with multiple source and multiple destination nodes. In a multicast operation, a source node must deliver copies of a single message to each node in the destination group. A special case of multicast is broadcast, in which the destination group contains every node in the network (except the source). Unfortunately, the terms broadcast and multicast are often used interchangeably in the literature. Throughout our work, we use broadcast to refer only to the case where the destination set includes all processors in the MPC; other cases are referred to as multicast. Multicast is a fundamental collective communication operation, and is important in many parallel numerical algorithms, including matrix multiplication [2], matrix transpose [3], tridiagonalization [4], eigenvalue computation [5], Gaussian elimination [6], and LU factorization [7]. Efficient implementation of multicast is also useful in many other aspects of parallel computing, including support for barrier synchronization [8], memory updates and invalidation in distributed shared-memory systems [9], and global notification of events in parallel simulation. The growing interest in the use of collective communication routines, including multicast, is evidenced by their inclusion in the Message Passing Interface (MP1) [10], an emerging standard for communication routines used by message-passing programs, and in many research and commercial communication libraries, including IBM’s Col- lective Communication Library (CCL) [11] and MSU’s ComPaSS project [6]. Besides message passing, multicast communication is also important in implementing data- parallel languages, such as High Performance Fortran [12], on distributed-memory systems. Much research has been performed in the area of multicast communication for parallel computers. Many multicast and broadcast algorithms have been developed under the assumption of store-and-forward architectures [13, 14, 15, 16, 17], and mul- ticast communication has been investigated in bus-based architectures [18]. Much of the recent work in this area has addressed software-based (unicast-based) multi- cast [19, 20, 21] and broadcast [22, 23, 24, 25, 26, 27, 28] communication in wormhole- routed MPCs. Some of the work has investigated the use of special routing hardware for the support of multicast and broadcast communication [29, 30, 31, 32, 33, 34, 35]. In much of the current research in the area of multicast communication, it is assumed that the group of destination nodes consists of all nodes in the MPC; that is to say, the operation is assumed to be broadcast rather than general multicast. However, we focus on multicast operations that allow the destination node group to be arbitrarily specified. Usually, such operations are more difficult to develop than those that place restrictions on the node groups. When designing a software- based multicast operation that will send a message to an arbitrarily specified group of destination nodes, only the processors of the source and destination nodes should be used in the operation. Furthermore, no a priori assumptions can be made regarding which nodes will and will not be included in the destination group; the operation must work for any and all destination groups. Because the set of nodes allocated to an application by the operating system fre- quently consists of only a subset of the MPC (rather than the entire MPC), multicast communication operations that allow for arbitrary specification of the node groups are often needed even when the application algorithm calls for an operation that accesses all nodes, or some regular subset of nodes. For example, when an application program executing on an allocated subset of the nodes of an MPC requests a broadcast to all nodes, the appropriate operation is actually a selective multicast to the allocated set of nodes. As further evidence of the need for flexible operations, the MPI standard [10] specifies that all collective communication operations are performed on node groups specified arbitrarily by the programmer. Multicast communication may be implemented in either hardware or software. However, most existing MPCs support only point—to—point, or unicast communication in hardware. In these environments, multicast communication must be implemented in software, typically in a communication library, by sending one or more unicast mes- sages; such implementations are called unicast-based [19]. For example, a multicast operation may be implemented using separate addressing, in which a separate copy of the message is sent directly from the source to every destination. An alternative is to use a multicast tree [19] of unicast messages. The tree can be considered as a sequence of message-passing steps. In the first step, the source node actually sends the message to only a subset of the destinations. In the next step, each node holding a copy of the message forwards it to some subset of the destinations that have not yet received it. The sequence of message—passing steps continues until all destinations have received the message. Using this approach, the time required for the operation can be greatly reduced [19]. In order to prevent interference with computation on nodes not directly involved in a multicast operation, the implementation should not affect any local processors other than those explicitly involved in the operation. That is to say, only source and destination node processors should be required to handle the message. 2.2 MPC Communication Architectures We now describe some of the important characteristics of MPC communication ar- chitecture that affect multicast communication. These characteristics are network topology, switching strategy, and routing strategy. Three additional architectural characteristics that affect multicast communication in MP0s are port model, virtual channels, and intermediate reception. These last three factors are central to the research presented in this dissertation, and are discussed in Sections 2.3, 2.4, and 2.5, respectively. 2.2.1 Network Topologies Two major categories of MPC network topology are direct networks and indirect networks. In a direct network, each node includes a processor and local memory as well as switching hardware, and is directly connected by physical communication links to some set of nodes, called neighboring nodes. In an indirect, or multistage network, 10 some nodes act only as switching elements and have no processing capabilities. Pro- cessing nodes are connected indirectly through switching nodes. For example, in a Fat Tree topology, such as the Thinking Machines CM-5 [36], the nodes are arranged as a tree, where only the leaf nodes are processing nodes; intermediate nodes in the tree are switching nodes used to deliver messages between processing nodes. We concen- trate only on direct networks, which are used in most MPC architectures. Network topologies of commercial and research direct network MPCs vary widely. Some of these topologies are illustrated in Figure 2.1. 4D hypercube Figure 2.1. MPC network topologies 11 An n-dimensional mesh of width k contains k" nodes. Each node of a mesh has an n-digit, radix-k address; each digit of the address specifies the coordinate of the node in the corresponding dimension. For example, in a 2—dimensional (2D) mesh, node addresses are the familiar 2D Cartesian coordinates. There exists a communication link between two nodes in a mesh if and only if their corresponding addresses are equal in every dimension except one, in which the address values differ by exactly one. Two nodes connected by a communication link are said to be adjacent, or neighboring, nodes. Examples of mesh architectures include the 2D Intel Paragon [37], the 2D Caltech Mosaic C [38], and the 3D MIT J-machine [39]. An n-dimensional torus is equivalent to an n-dimensional mesh in which each edge node is connected to the corresponding node on the opposite edge by a “wraparound” channel. Pairs of nodes on opposite edges of the network are made adjacent by these wraparound channels, thereby providing a shorter average path length than in a mesh network. Torus networks include the 2D Intel/CMU iWarp [40], the 3D Cray T3D [41], and the Torus Routing Chip [42], which can be used directly to construct 3D torus networks or cascaded to build tori of higher dimension. An n-dimensional hypercube, or n-cube, contains 2" nodes, each with an n-bit binary address. Two nodes at and y in a hypercube are adjacent if and only if their corresponding addresses differ in exactly one bit position. Commercial hypercubes include the nCUBE-2 [43] and nCUBE-3 [44]. An n-cube (hypercube) is a special case of an n-dimensional mesh or torus, where the width is 2. The general term k-ary n-cube has been used to refer to both mesh and torus networks of width k; thus, a binary n-cube is a hypercube [45]. Early systems that used store-and-forward switching often adopted a hypercube topology [46] because of the relatively dense interconnection network, which resulted in shorter message paths. However, in systems with wormhole routing, the distance between communicating nodes is less important; hence, the more easily constructed 12 lower-dimension meshes and tori are often chosen as topologies for current MPCs [45]. Torus networks are sometimes favored over meshes, because with random communi- cation, the communication links of a mesh are not evenly utilized, whereas a torus network achieves equal utilization of all links [41]. Also, because of the existence of wraparound channels, a torus with bidirectional communication links can provide shorter message paths on average than can a mesh, thus resulting in a more lightly loaded network. However, torus networks require the use of virtual channels in order to provide deadlock-free communication, whereas mesh and hypercube networks do not have this requirement. 2.2.2 Switching Strategy The predominant switching technique in MPCs is wormhole routing [42], in which a message is divided into a number of flits that are pipelined through the network. The header flit of a message proceeds on the path to the destination node, followed by the remaining flits. If the header encounters a needed communication channel that is not currently available (due to use by another message), the header, along with those trailing flits already in the network, are blocked in place. When the required channel becomes available, the header flit, followed by the subsequent flits of the message, continue towards the destination. Because of the way in which messages are blocked in place in a wormhole-routed network, only very small, fixed-size flit buffers are needed for each communication channel. These buffers can be as small as a single flit [43, 47], and are easily in- corporated into the routing hardware at each node [42]. The term network latency refers to the elapsed time after the head of a message has entered the network at the source until the tail of the packet emerges at the destination. For long messages, the pipelining effect of wormhole routing reduces the effect of path length on network latency [1]. The startup latency is the time required for the system to handle the 13 packet at both the source and destination nodes. For small messages, the startup latency often dominates unicast latency [19]. Therefore, in the absence of contention among messages for network resources, the latency of wormhole-routed messages is nearly distance-insensitive [1]. This behavior is in contrast to early store-and-forward systems [46], in which messages are transferred completely across each hop on the path before beginning travel on the next hop. Store-and-forward systems require adequate buffer space with each channel to accommodate the largest possible message. In addition, the amount of time required to deliver a message using store-and-forward switching is roughly proportional to the length of the message path. The distance-insensitive communication latencies and small buffer requirements associated with wormhole routing make this technique scalable, and thus well suited for use in MPCs. Wormhole routing has been adopted in the Symult 2010, the nCUBE-2 and nCUBE-3, Intel/DARPA’s Touchstone DELTA and the subsequent Intel Paragon, the MIT J-machine, Intel/CMU’s iWarp, the Caltech Mosaic C, the Transputer IMS T9000 family, the Torus Routing Chip, the TMC CM-5, and the Cray T3D. A survey of the issues related to wormhole routing can be found in the literature [1]. Under wormhole routing, a message must acquire exclusive use of all channels on the path on which it travels. This extended retention of communication resources makes wormhole routing susceptible to channel contention, in which two or more messages simultaneously require the same communication channel. When channel contention occurs, the message that first requested the common channel proceeds, while other messages requiring the same channel are blocked in place in the network until the required channel has been relinquished. When messages are blocked in place by channel contention, they continue to hold the channels they have already acquired, thereby increasing the possibility of further channel contention. Thus, when designing 14 multicast operations for wormhole—routed systems, it is important to avoid channel contention among the constituent messages of the operation. 2.2.3 Routing Algorithm Nodes in current MPCs are connected by topologies that provide multiple paths between a given source and destination node; in fact, in all existing MPC topologies, there is more than one shortest path between many of the source/ destination pairs. Thus, routing choices must be made when transmitting a message from source to destination. The way in which these routing choices are made is termed the routing algorithm. Under adaptive routing, information about current network conditions such as traffic and defective nodes or links is used to determine message routes. This ap- proach is in contrast to deterministic routing, in which the route between a given source and destination is unique and independent of current network conditions. Deterministic routing has also been termed oblivious routing. Although adaptive routing mechanisms for wormhole-routed networks have been the focus of recent research [48, 49, 50, 51, 52, 53, 54, 55], we study deterministic strategies because of their prevalence in both current and newly-announced architectures. In a deter- ministic routing method termed dimension-ordered routing, messages are routed first in the highest (lowest) dimension in which the source and destination nodes differ. Routing then proceeds on each required dimension, in descending (ascending) order of dimension, until the routing path reaches the destination. Routing in a particular dimension is always completed before routing in the next dimension begins. We assume, without loss of generality, that routing is performed in descending order of dimension. Sullivan and Brashkow [56] introduced E-cube routing, which is essentially dimension-ordered routing for hypercubes. In 2D grid-like architectures such as a 15 2D mesh or torus, dimension-ordered routing is also termed X Y-routing [1], since messages are routed first in the so called “X” dimension, and then in the “Y” di- mension. Similarly, dimension—ordered routing in 3D topologies is often referred to as XYZ-routing. Many current and research MPCs use both wormhole routing as a switching strategy and dimension-ordered routing as a routing strategy, including the Symult 2010, the nCUBE—2 and nCUBE-3, Intel/DARPA’s Touchstone DELTA and the subsequent Intel Paragon, the MIT J-machine, the Caltech Mosaic C, the Torus Routing Chip, and the Cray T3D. The choice of routing algorithm also has a large influence in the design of multicast operations, because it determines how the constituent messages must be scheduled in order to avoid channel conflict. Channel conflict is undesirable in wormhole-routed networks because of the resultant message blocking. For example, consider the two unicast messages in a 2D mesh: message U1 from source node (0, 3) to destination node (2,1), and message 112 from source node (3,3) to destination node (2,2). Although there are numerous minimum-length paths in a 2D mesh that would result in messages U1 and U2 traveling on arc-disjoint paths, the rules of dimension-ordered routing dictate the paths shown in Figure 2.2, which are unfortunately not arc-disjoint. Thus, in order to avoid channel contention in an MPC with dimension-ordered routing, a multicast operation must be implemented so as to produce messages that are pairwise either (1) arc-disjoint under the routing rules, or (2) temporally distinct. 2.3 Port Model In wormhole-routed MPCs, communication among nodes is handled by a separate router. As shown in Figure 2.3, several pairs of external channels connect the router to neighboring routers, and are used for communication between those routers. The pattern in which the external channels are connected defines the network topology. 16 0,0 1,0 2,0 3,0 EH ’1 35 [ n: ll 0,2 1" N 0,3 —' '--" 2,3— 3.3 W ‘-——> communication link D source —> message path [El destination Figure 2.2. Example message paths under dimension—ordered routing in a mesh Usually, the router can relay multiple messages simultaneously, provided that each incoming message requires a unique outgoing channel. A router is connected to the local processor/memory by one or more pairs of internal channels. One channel of each pair is for input, the other for output. The port model of a system refers to the number of internal channels at each node. If each node possesses exactly one pair of internal channels, then the result is a so-called “one-port communication architecture” [14]. A major consequence of a one-port architecture is that the local processor must transmit (receive) messages sequentially. Although additional pairs of internal channels will increase communication capacity, the one-port architecture is characteristic of many existing systems. Architectures with multiple ports reduce this bottleneck. In the case of an all-port system, every external channel has a corresponding internal channel, allowing the node to send to and receive on all external channels simultaneously. l7 Local Processor/Memory internal intemal input a o o o o 0 output channels channels -—> external . . external input 0 Router 0 output channels ' 0 channels —> —> —. —-> Figure 2.3. Generic MPC node architecture [1] Some researchers have considered the effects of an all-port or Multiple Link Avail- ability (MLA) architecture under the store-and-forward model. This work includes the study of broadcast communication in hypercube topologies [13, 14, 17]. Research into multicast communication on current wormhole-routed MPCs with multi-port architecture is just emerging, and includes work on broadcast in hypercubes [24, 25] and meshes [27]. Broadcast in all-port torus and mesh architectures is considered in [26], where a non-standard routing algorithm is assumed. Our work in this area, which is described in Chapter 3, differs from the previous research in the following ways. As pointed out above, broadcast in all-port systems has been studied previously. However, the more general multicast problem has not been addressed. McKinley, et al. [19] developed the U-cube and U—mesh multi- cast algorithms, which are optimal for arbitrary multicast operations in one-port wormhole-routed hypercubes and meshes, respectively. By generalizing the U-cube algorithm, we provide a framework for studying all-port multicast algorithms, and then use this framework to develop the W—sort multicast algorithm, which performs 18 better than U—cube in an all-port hypercube. We are the first to study multicast in all-port wormhole-routed MPCs. 2.4 Virtual Channels In topologies with natural channel routing cycles, such as a torus, if only one commu- nication channel is provided between each pair of neighboring nodes, then the network will not be deadlock free, since it would then be possible for a cycle of channel alloca- tion and demand to exist among two or more unicast messages. Dally and Seitz [57] show that a network is deadlock-free under deterministic routing if and only if there are no cycles in the channel dependency graph. A channel dependency graph is a directed graph in which each vertex represents a channel of the network; there is an are from channel c,- to channel c,- if and only if a message arriving on c,- might next be routed on cj. We consider the case of a 1D torus, which is a simple ring. If there is only one unidirectional channel between each pair of neighboring nodes, then the channel dependency graph will consist of a single ring of channels (see Figure 2.4), which shows that the associated network is not deadlock-free. Similar cycles appear along each dimension in the channel dependency graphs of larger-dimension tori whenever pairs of nodes are connected only by single, unidirectional channels. In order to enable deadlock-free routing in a torus network, multiple virtual chan- nels can be multiplexed onto each physical communication link. These virtual chan- nels share the bandwidth of the physical link, and provide multiple logical paths between neighboring nodes that are used by the routing algorithm to break cycles of channel dependency. The issues related to the use of virtual channels to avoid deadlock in gen- eral topologies with natural channel dependency cycles are discussed by Dally and 19 (a) l-dimension torus (ring) (b) channel dependency graph Figure 2.4. Channel dependencies in a 1D torus with single channels Seitz [57]. Virtual channels are used to avoid deadlock in current torus MPC ar- chitectures, including the Cray T3D [41], and the Torus Routing Chip [42]. Be- sides providing deadlock-free routing in torus networks, the multiple logical paths associated with virtual channels have been used to support adaptive routing algo— rithms [49, 50, 51, 53, 54]. Dally [58] examines the use of virtual channels to improve network throughput in MPCs. Our work in this area, which is described in Chapter 4, differs from the previous research in the following ways. As noted above, much research has focused on the use of virtual channels to provide deadlock-free and adaptive routing in MPCs. However, this work has generally included only point-to-point communication. Park, et al. [26] have studied broadcast in torus networks, but their work assumes a non-standard routing algorithm. The U-mesh multicast algorithm [19] is contention-free in one-port wormhole-routed mesh networks, but not in torus networks. In designing an optimal multicast algorithm for one-port wormhole-routed torus networks, we are the first to study unicast-based multicast on wormhole-routed torus architectures. 20 2.5 Intermediate Message Reception In systems that support only point-to-point communication in hardware, multicast operations must be implemented in software by using unicast-based techniques such as multicast trees. In order to improve multicast performance and reduce software over- head, enhancements to the network routers have been proposed. These enhancements include two additional router features, message replication and intermediate reception, intended to provide some level of hardware support for multicast operations. Message replication refers to the ability to duplicate incoming messages onto more than one outgoing channel, while intermediate reception is the ability to simultaneously deliver an incoming message to the local processor/ memory and to an outgoing channel. In a unicast-based multicast operation, a message is replicated by the processor at intermediate destination nodes, and these multiple copies are then transmitted to subsequent destination nodes. A seemingly natural extension of a multicast tree is message replication, a tree-based approach using hardware support, where the router is enhanced so that an incoming message can be simultaneously transmitted on two (or more) outgoing links. That is, each flit of a message entering the router can be transmitted by the router on multiple outgoing channels, effectively producing a tree-like message worm. Figure 2.5(a) illustrates a message that is being replicated by a router in a 2D mesh (or torus), while Figure 2.6 shows how message replication can be used to perform a multicast operation in a 2D mesh. In the example shown in Figure 2.6, the message is replicated by the routers at nodes (1,2), (2,2), (3,2), (4, 2), and (5, 2). Such message worms have headers on each branch of the resulting message tree. A difficulty with this approach is that when any branch of the tree encounters a required channel that is unavailable, the entire message tree must be blocked, render- ing all channels used by the tree unavailable for other communication. That is to say, 21 +Y external +Y external channels channels channels (ports) -X external -X external channels channels ‘— ‘— ‘ channels ‘ channels _Y external I —Y external l channels _-> channels message worm (a) message replication (b) intermediate reception Figure 2.5. Router support for multicast communication because of the pipelining effects of wormhole routing, when just one branch of the tree is blocked due to channel contention, progress on every branch of the tree is suspended. Even when such a tree is not blocked by channel contention, many communication channels are simultaneously held, and thus unavailable, during the operation. Mul- ticast operations based on message replication are thus highly susceptible to channel contention especially for large destination sets, and must be carefully designed in order to avoid deadlock. Because of these disadvantages, message replication has not been widely supported. Lin, et al. [30] present, as a comparison, a multicast method for 2D meshes based on message replication, but do not recommend its use. The nCUBE—2 hypercube [43] includes hardware support for message replication, which is used to implement tree-based broadcast and limited multicast in cases where the destinations form a subcube, but this mechanism is not deadlock-free. In order to avoid the disadvantages of the tree-like worms produced by mes- sage replication, and yet apply hardware support to the implementation of mul- ticast operations, the technique of intermediate reception (IR) has been pro- posed [30, 31, 32, 33, 34, 59]. A router possessing IR capability is able to copy the flits of a message to the memory of the local processor as the message passes through the router enroute to other destinations, as illustrated in Figure 2.5(b). In this way, a source node destination node other node Figure 2.6. A multicast operation using message replication message originating at a source node can be routed as a single worm through several destination nodes, depositing a copy of the message at each of the intermediate des- tinations as it passes through. Such communication methods are termed path-based, while the constituent messages are called multi-destination worms. Most of the literature dealing with router enhancements for multicast communi- cation is based on IR, and includes the following work. Lin, et al. [30] present a path-based multicast routing algorithm for 2D meshes based on Hamiltonian paths. Kim and Kim [31] propose methods to perform multicast communication in the sup- port of parallel-prefix computations on meshes, which are, in turn, incorporated into a proposed matrix multiplication algorithm. Tseng and King [32] describe broadcast and multicast methods for tori. Panda and Singal [33] present methods for broadcast and all-to-all (simultaneous) broadcast in mesh and torus networks. These methods are a hybrid of hardware and software methods, in that they use multi-destination worms, but also employ the processors at intermediate destination nodes to start new multi-destination worms. Ho and Kao [35] have proposed a multi-step path-based 23 broadcast method for hypercubes, in which all messages conform to E—cube routing rules. By adhering to dimension-ordered routing, deadlock is avoided; however, the routing rules are not sufficiently flexible to allow single-path broadcast operations. Panda and Prabhakaran [34] present a path-based multicast method that uses the underlying base routing algorithm, such as deterministic dimension-ordered routing or the adaptive turn model [50] routing. Our work in this area, which is described in Chapters 5 and 6, differs from the previous research in the following ways. The work proposed in [34] is not a multi- cast routing algorithm, but rather an algorithm for creating multi-destination worms that are consistent with existing unicast routing algorithms. Given the constraints of dimension-ordered routing, certain destination sets cause this method to behave extremely poorly for multicast (0(m) communication steps for m — 1 destinations, as compared to [log2 m] steps for existing software-based methods [19, 21]). Although a multicast routing algorithm for tori is proposed in [32], in order to avoid deadlock this method requires virtual cut-through routing, in which message-sized buffers are required for each channel. The methods of [33] are not deadlock-free in the pres- ence of network traffic in addition to the single collective communication operation. In contrast to the above work, we present deadlock-free path-based multicast rout- ing methods for wormhole-routed torus networks with unidirectional communication links. CHAPTER 3 Unicast-Based Multicast in All-Port Hypercubes In this chapter, the specific problem of efficient unicast-based multicast communi- cation for all-port wormhole-routed hypercubes is addressed. Formally, a hypercube (or n-cube) consists of 2" nodes, each of which has a unique n-bit binary address. For each node 2), let u also denote its n-bit binary address, and let M v [I represent the number of 1’s in v. A channel c = (u,v) is present in an n-cube if and only if [I u EB v [I = 1, where 63 is the bitwise exclusive-or operation on binary numbers. The hypercube topology has been used in multicomputer design for many years [46]. The nCUBE-2 [43] hypercube supports wormhole-routing, as does the recently announced nCUBE-3 [44]. This chapter is organized as follows. Section 3.1 describes the issues and prob- lems involved in supporting efficient multicast communication in all-port hypercube systems. Section 3.2 gives new theoretical results that provide the foundation for this work. Sections 3.3 and 3.4 present the new algorithms that have been designed to support multicast in all-port wormhole-routed hypercubes. Although the multicast problem has been studied previously for one-port architectures [19], the proposed 24 25 methods improve performance by exploiting the presence of multiple ports. Sec- tion 3.5 compares the new algorithms using analysis, simulation, and implementations on a 64-node nCUBE—2, which possesses an all-port architecture. Finally, a summary is given in Section 3.6. 3.1 Issues Although implemented in software, unicast—based multicast communication algo- rithms must exploit the underlying architecture in order to minimize their execution time. In a wormhole-routed system, the implementation should not only take ad- vantage of the distance-insensitivity of unicast latency, but must also avoid channel contention, that is, no two messages involved in the operation should simultaneously require the same channel. Avoiding channel contention depends on the underlying unicast routing algorithm of the MPC; hypercubes often adopt E—cube routing [56], in which messages are routed through dimensions in either ascending or descending order. Also, the implementation should affect no local processors other than those explicitly involved in the operation. For example, in a multicast operation, only source and destination processors should be required to handle the message. Finally, the implementation should account for the port model, which affects the rate at which nodes can send and receive messages. The following (small-scale) example illustrates the issues and difficulties involved in implementing efficient multicast communication in hypercubes. We consider the 4-cube in Figure 3.1, and suppose that a multicast message is to be sent from node 0000 to eight destinations {0001, 0011, 0101, 0111, 1011, 1100, 1110, 1111}. In this example and all subsequent examples, we assume that the E—cube routing algorithm resolves addresses from high order bits to low order bits. In the nCUBE-2, the 26 opposite resolution strategy is used, but this difference does not affect any of the results presented. a source node [E] destination node B other node Figure 3.1. An example of multicast in a 4-cube In early hypercube systems that used store—and-forward switching, the procedure shown in Figure 3.2(a) could be used to implement the multicast operation [60]. At step 1, the source sends the message to node 1000. At step 2, nodes 0000 and 1000 inform nodes 0100 and 1010, respectively. Continuing in this fashion, this implemen- tation requires 4 steps to reach all destinations. In this example, five of the nodes that are required to relay the message (0010, 0100, 0110, 1000, and 1010) are not destinations themselves. Using the same routing algorithm in a one-port wormhole- routed network also requires 4 steps, as shown in Figure 3.2(b). In this case, however, only the routers at two of the non-destination nodes (0010 and 0110) are involved in forwarding the message. The message may be passed from node 0000 to node 0011 in one step because it is pipelined through the router at node 0010 rather than being relayed by the local processor at that node. However, because the message must be replicated and forwarded on multiple outgoing channels at nodes 0100, 1000, and 1010, the local processors at those nodes must still handle the message. Figure 3.2(c) illustrates the result of using the U-cube algorithm [19] to solve the problem on a one—port wormhole-routed system. The U-cube algorithm, which was 27 (a) multicast tree based on store-and-forward switching (b) multicast tree using wormhole routing on a one-port architecture m (c) U—cube tree on a (d) U—cube tree on an (e) optimal multicast tree on ‘ one-port architecture all-port architecture all-port architecture 7 source destination “w"; intermediate node intermediate node message sent 1- node Ejnode l.........i (router used) 0 (processor 080d) ___.[il in step i Figure 3.2. Unicast-based software multicast trees designed specifically for one-port wormhole-routed architectures, will be discussed further in Section 3.3. Using this algorithm, the only local processors required to handle the message are those at destination nodes. Furthermore, on a one-port ar— chitecture, all messages are guaranteed to be contention-free [19]. Although common channels are used between the 0111-to-1011 path and the 0111-to—1100 path, these messages are sent sequentially, so contention does not occur. Since the U-cube algorithm was designed for one-port systems it makes no explicit attempt to take advantage of multiple ports between local processors and routers. That is to say, the U-cube algorithm does not actively seek out and use multiple ports in parallel. For example, if the algorithm were implemented on an all-port hypercube, it would still require four steps to complete the multicast in the above example, as illustrated in Figure 3.2(d). Some destinations are reached earlier than 28 in Figure 3.2(c) simply because the algorithm inadvertently uses multiple ports at some nodes simultaneously. Notice that three steps are required to reach destination node 1011, since that unicast message must traverse a channel (0111,1111) that lies along the path required to reach node 1100, thereby delaying its transmission. Figure 3.2(e) shows a multicast tree that accounts for both wormhole routing and an all-port architecture. The algorithm requires only two steps, no local processors other than the source and destinations are involved, and contention among constituent messages is avoided. This particular tree is based on the methods presented in this chapter. In the next section, we develop the theoretical results necessary to guarantee that our new algorithms, presented in Sections 3.3 and 3.4, are contention-free. Under the proposed system model, even the best known methods for broadcasting are heuristic [24], and since the multicast problem is a generalization of broadcast, it is at least as hard as broadcast with respect to computational complexity. We con- jecture that generating optimal multicast solutions for an all-port wormhole-routed hypercube is an NP-hard problem. The methods presented in this chapter are there- fore heuristic, and thus do not provide optimal solutions in every case (although the multicast tree shown in Figure 3.2(c) is, in fact, optimal for the given set of nodes). 3.2 Theoretical Foundations In this section, we present new theoretical results that will serve as a basis for sub- sequent algorithms. First, we formally define terms related to routing and subcubes. We then state and prove several theorems that are useful in determining that certain pairs of paths are guaranteed to be arc-disjoint (and hence, contention-free). Finally, we formally define contention in an all-port hypercube architecture, and prove a related theorem. 29 3.2.1 Notation and Definitions Bitwise exclusive-or is represented by the symbol 69; logical and and or are represented by /\ and V, respectively, and ii is used to represent the bitwise complement of v. The symbol ||v|| denotes the number of non-zero bits in v. We use N to represent the number of processors in the system. Since n represents the dimensionality of the hypercube, N = 2". The ifh bit of address v is denoted by 03(1)), 0 S i S n — 1, where 00(2)) represents the least-significant address bit; hence, address U can be written as on_1(v)o,,_2(v) . .. 00(v). For each node, v, the outgoing (and incoming) channels of node v are labeled 0 through it - 1, where channel (I connects node v = on_1(v)on_2(v) 00(2)) to node on_1(v) . .. od+1(v)od(v)od_1(v) ... 00(2)). We say that channel d is used to travel in dimension d. Dimension-ordered routing is a minimal deterministic routing algorithm in which every message traverses dimensions of the network in a strict monotonic order. Under dimension-ordered routing, each routing step brings the message one hop closer to the destination, along the highest (alternatively, lowest) dimension in which the current node and the destination node differ. E-cube routing is the hypercube-specific case of dimension-ordered routing. Definition 3.1 Given distinct nodes u and v in an n-dimensional hypercube, let i be the highest dimension such that o;(u) 7f o;(v). Under E-cube routing, a message sent from u to v will be routed first along dimension i to intermediate node w = on_1(v)o,,_2(v) . . .o,-+1(v)o,(v)o;_1(u) . ..oo(u), where o,-(w) = o,-(u). At node w, the same routing algorithm is invoked to determine the next intermediate node. The E—cube path from a source node u to a destination node u will be denoted P(u,v) = (u;w1;w2;...;wp;v), where the nodes w,-, 1 S i S p, are the nodes vis- ited on the path. We note that p + 1 = ”u 63 o”. In any shortest path from a destination u to a source v, a message will travel exactly once in each dimension 30 d such that od(u) 75 od(v). Traveling over these ”a EB v|| dimensions in any arbi- trary order will result in a shortest path between it and v. For example, the path from source node 0101 to destination node 1110 resulting from E-cube routing is P(0101,1110) = (0101; 1101; 1111; 1110). A unicast from node u to node v occurring at time step t is denoted (u,v, P(u, v),t). The following definition simplifies refer- ences to the initial channel in a dimension-ordered route, that is, the first dimension in which a message will travel. Definition 3.2 The symbol 6(u,v) represents the highest-ordered bit position in which u and v differ. Formally, 6(u,v) = max{i :0 S i S n —1 :o,(u) gé o,(v)}. If u = v, then 6(u,v) is undefined. In order to identify a subcube of the nodes of a hypercube, we may explicitly state some of the n address bits, and allow the other address bits to range over all possible values. In this chapter, we need to work only with subcubes in which the explicitly- stated address bits are the high-order bits, and the free-ranging address bits are the low-order bits. We refer to such subcubes as S-cubes. Definition 3.3 An S-cube S = (bn_1bn_2 bus) is defined by a dimensionality n5 6 {0, ..., n}, and a (n—n5)—bit mask (bn_1b,,_2 bn5>- Informally, 5' consists of those nodes whose address is of the form (bn_1b,,_2 . . . bus * * *), where the *’s represent arbitrary bit values. Formally, for any node v, v E S if and only if o,-(v)= b;,forn3 S i S n— 1. For example, the S-cube (01) in a 4-dimension hypercube contains the four nodes 0100, 0101, 0110, and 0111, while the S-cube (110) in a 6-dimension hypercube contains the eight nodes 110000, 110001, 110010, 110011, 110100, 110101, 110110, and 110111. 31 3.2.2 Useful Lemmas This section contains lemmas that will be used to facilitate the proof of subsequent theorems. These lemmas and their proofs are also useful in understanding later sections of the chapter. Lemma 3.1 Let P(u,v) = (u;w1;w2; . . .;wp;v) be any E—cube path (For clarity, let wo = u and wp+1 = v.), and let (w,,w,-+1) E P be any are in P(u,v). Let d be the dimension over which (w,,w,+1) travels, od(w,-) = od(w,-+1). Then the following conditions hold: 1. For allj E {1, ..., i} and for all k E {0, ..., d}, ok(wJ-) = ok(u) (Before traveling in dimension d, a message does not travel in dimensions less than d.) 2. For allj E {i+1, ..., p} and for all/c6 {d+1, ..., n—l}, ok(wJ-) =ok(v) (After traveling in dimension d, a message does not travel in dimensions greater than, or equal to, d.) 3. od(u) 7E od(v) (A message travels in dimension d only if the source and destination node ad- dresses differ in dimension d.) Proof: All three assertions follow directly from the behavior of dimension-ordered routing in hypercubes. Cl Lemma 3.2 For any three nodes u,v,x, and for any S-cube S, if u,x 6 S and u S v S x, then v E S'. (The node addresses within any S-cube are contiguous.) Proof: Let S be represented by (bn_1b,,_2 bus), let u, v and x be as specified, and assume that v ¢ 5. Then we have 0‘,~(u) = (7,-(x) = b,- for us S i S n — 1; and oj(v) 75 b, for some j, n5 S j S n -— 1. Let k be the value of the largest such j. Assume, without loss of generality, that ok(v) > bk. Then for k + l S i S n — 1, o,(v) = b,- = (7,-(x); and ok(v) > bi, = ok(x). It follows that v > x, a contradiction. 32 (If ok(v) < bk, then we conclude similarly that v < u, also a contradiction.) Cl 3.2.3 Arc-Disjoint Paths In implementing a unicast-based multicast algorithm, whenever the paths of two constituent unicast messages share an arc (channel), care must be taken to ensure that the paths do not attempt to use the shared are simultaneously, otherwise contention will arise. When two paths have no arc in common, of course, contention between these two particular paths is always avoided. Paths with no common arc are said to be arc-disjoint. Each of the following theorems state sufficient conditions on two paths such that any two paths meeting these conditions are arc-disjoint. Each theorem is stated for- mally. Where needed for clarity, theorems are stated informally within a parenthetical block of text. Theorem 3.1 Consider any two paths P(u,v) and P(u,y) originating from a com- mon source node a in a hypercube. If 6(u,v) 75 6(u,y), then P(u,v) and P(u,y) are arc-disjoint. {Paths leaving a common source on difierent channels are arc-disjoint.) Proof: Without loss of generality, assume that 6(u,v) > 6(u,y), and let d = 6(u,v). Now suppose that there is some node r 75 it contained in both paths: r E P(u,v) /\ r E P(u,y). Since r E P(u,v) and 6(u,v) = at, then by Lemma 3.1, od(r) # od(u). But since r E P(u,y) and 6(u,y) < d, then od(r) = od(u), which is a contradiction. So there cannot exist any node r 75 11 contained in both paths. Since paths P(u, v) and P(u, y) do not share any node (except the source node it), they are arc-disjoint. Cl 33 Theorem 3.2 Consider any two paths P(u,v) and P(x,y) in a hypercube. If there exists an S-cube S such that u,v E S /\ x,y ¢ 5', then P(u,v) and P(x,y) are arc-disjoint. {A path with source and destination within S-cube S is arc-disjoint from any path with source and destination outside S.) Proof: Suppose there is an arc (w,z) common to both paths. Let (1 be the di- mension in which w and z differ, that is, od(w) = 5.753, and a,(w) = o,-(z) for all i, 0 S i S n-l, wherei # d. Let my be the dimensionality of S; S = (bn_1b,,_2 bus). Case 1: d 2 n5. Since (w,z) E P(u,v), then od(w) = od(u) and 001(2) = od(v) (dimension-ordered routing). Since od(w) = W, then od(w) 74 04(2), hence, od(u) 75 od(v). But since u, v E S and d 2 us, we have od(u) = od(v), a contradiction. Case 2: d S n3. Since (w,z) E P(u,v) and d S us, then 2 E S if and only if v E S. Likewise, since (w,z) E P(x,y), z 6 S if and only if y E S. So v E S if and only if y E S. But v E S and y ¢ S, a contradiction. Thus, there is no are common to both paths. El 3.2.4 Avoiding Depth Contention As previously stated, any two unicasts having arc-disjoint paths are contention-free. One may suspect that two unicasts sent in different steps of a multicast algorithm would also be contention-free, whether or not they are arc-disjoint. However, unicast messages sent in diflerent steps may actually be transmitted concurrently depending on the value of startup latency, which includes the system call time at both the source and destination nodes. If startup latency is large, then the steps of a multicast tree may become staggered, causing unicasts in different steps to actually be sent simultaneously. This condition is possible in commercial systems, where the sending and receiving latencies may be much greater than the network latency of a message. 34 In order to study contention between messages sent in different steps, the definition of the reachable set is needed. Definition 3.4 [19] Given a multicast implementation, a node v is in the reachable set of a node u, denoted Ru, if and only if one of the following conditions holds: 1. v = u, or 2. There exists a unicast (x,v,P(x,v),t) in the implementation such that x E R1,. If the multicast implementation is considered to be a directed tree of unicast messages rooted at the source node do, then the reachable set of a node u is the set of nodes in the subtree rooted at node u. In Figure 3.2(e), for example, R1110 = {1110, 1011, 1100,1111}. Using this definition, the properties of an imple- mentation necessary to avoid contention between messages sent in different steps can be characterized. A multicast implementation is said to be depth contention-free if, re- gardless of overlap in message passing steps caused by startup latency, the constituent messages are contention-free. The following theorem gives sufficient conditions for a multicast implementation to be depth contention-free. Theorem 3.3 [19] A multicast implementation is depth contention-free if at least one of the following four conditions holds for every pair of unicasts (u,v, P(u,v),t) and (x,y, P(x,y),r) in the implementation, where t S 1'. 1. P(u,v) and P(x,y) are arc-disjoint. 3.xERU. 4. x E Ru. and the implementation contains the unicast (u,w,P(u,w),t + k), for some node w and positive integer k. 35 Proof: We need to show that contention does not arise between any pair of unicast messages in the implementation. We consider two arbitrary unicasts (u, v, P(u, v), t) and (x,y, P(x,y),r), with t S 7'. Condition 1. If the two paths of the messages, P(u, v) and P(x, y), are arc-disjoint, then the two unicasts are contention-free. Condition 2. If x = u, then we must consider two cases depending on whether or not destination nodes v and y are reached through the same outgoing channel from source node u. If 6(u,v) = 6(x, y), then t < 7' since u must send the messages sequentially. Figure 3.3(a) illustrates the situation. Node it sends the message to v before sending it to y. Even if r = t + l and the sending latency is 0, contention will not occur. If, on the other hand, 6(u, v) 75 6(x, y), then by Theorem 3.1, the two unicasts are arc-disjoint, and hence contention-free. Condition 3. If x E Ru, as shown in Figure 3.3(a), then the u—to-v unicast must be completed before the x-to-y unicast begins, so they are contention-free. Condition 4. As shown in Figure 3.3(c), node u sends the message to v prior to sending it to node w, which is either an ancestor of x or perhaps x itself. Clearly, node v will have received the message prior to node x, thus preventing contention. Cl It] u,x [T] [t] x y V [I] (a) u=x y (C)u—>v beforeu—> (‘3) x6 Rv W where w is an ancestor of it Figure 3.3. Conditions 2, 3, and 4 of Theorem 3.3 36 In the next two sections, we define several multicast algorithms for all-port hy- percubes. All the algorithms are depth-contention free. Their performance, which is compared in Section 3.5, depends largely on how well they take advantage of the presence of multiple ports. 3.3 Algorithms Based on Dimension-Ordered Chains The algorithms considered in this section are extensions of the U-cube algorithm [19], which was mentioned in Section 3.1. The new algorithms were constructed by modi- fying the U-cube algorithm so as to make better use of multiple ports between each node and its router. We begin with a brief review of the U-cube algorithm. This algorithm, designed for one-port architectures, produces multicast trees on such systems that are of minimum height and are guaranteed to be contention-free. The U-cube multicast algorithm relies on the binary relation “dimension order,” denoted _ 1, in which case U1 and U2 are the product of the same recursive invocation of the Maxport algorithm, and by the above argument, such an invocation of Maxport produces only contention-free unicasts. Having covered all possible cases, we now conclude that the unicasts of the W—sort algorithm are pairwise contention-free. [I] The W—sort algorithm places certain computational requirements on the source node processor. Recall that the W-sort algorithm requires that the source node (1) sort the list of destination nodes into a dimension-ordered chain, (2) invoke the WeightedSort algorithm to produce a cube-ordered chain, and (3) execute the Max- port algorithm. 49 Sorting the list of m destination nodes into a. dimension-ordered chain can be done in 0(m logo m) time. The worst-case computational complexity of the WeightedSort algorithm occurs when the input chain is split after the first element; that is, when the value of center is equal to first + 1 (Figure 3.8). The CubeCenter function can be implemented with a simple binary search, which executes in 0(log2 k) time on an input of size k. This approach gives a worst-case total of 0(m logo m) comparison operations for the WeightedSort algorithm; while the statement that permutes D produces a corresponding total of 0(m2) address copies. The resulting total worst- case computational complexity of W-sort is 0(m2). In many cases, the computational complexity of W-sort may not be important, particularly when a set of destination nodes remains constant over many multicast operations. In this case, WeightedSort can be executed once to produce the appropri- ate cube-ordered chain, and Maxport can then be executed repeatedly on this fixed chain. However, there may be cases in which a computational requirement less than 0(m2) would be advantageous. Since the Maxport algorithm has a computational complexity of only 0(m) at the local node, it would be useful to distribute, and thus parallelize, some of the work of the WeightedSort operation, thereby reducing the computational requirements placed on the source node by the W-sort algorithm. If, rather than using WeightedSort to permute the address sequence D, the source node merely identifies the appropriate destination nodes (and of course, transmits to these destination nodes the message along with a list of relinquished destina- tions), then the worst-case computational complexity of W-sort can be reduced to 0(m logo m) at the local node. By Definition 3.5, the address sequence to be relin- quished to any destination node will be contiguous in a cube-ordered chain; thus, no permutation of the addresses in D is required. 50 3.5 Performance Evaluation In order to understand the relative performance of the algorithms presented in Sec- tions 3.3 and 3.4, they have been compared in three ways on destination sets in which the nodes are randomly distributed throughout the hypercube. First, we com- pared their performance in terms of the maximum number of steps required to reach the destinations. Second, we compared the algorithms by implementing them on an nCUBE-2 and measuring the average and maximum delay, across destinations. Third, we simulated the performance of the algorithms using a simulation tool that has been validated against the nCUBE-2. Since we had access to a real system with only 64 nodes, only simulation allowed us to compare the algorithms on larger systems. Each destination set was produced by selecting, from all nodes in the system, the appropriate number of unique nodes under a uniform distribution model, using a random number generator. The source node was always assumed to be node 0. Due to the symmetry of the hypercube topology, there is a homomorphism between the multicasts originating at a particular source node, and those originating at any other source. 3.5.1 Stepwise Comparisons Figures 3.10 and 3.11 plot the averages, among random sets of destinations, of the maximum number of steps needed to multicast data in a 6-cube and a lO-cube, respectively. For each point in a curve, 100 destination sets were chosen randomly. In addition to reducing the number of steps, the new algorithms “smooth out” the staircase behavior of the U-cube algorithm. As shown in these plots, the W-sort algorithm performs significantly better than the other algorithms. This performance improvement is due to the actions of WeightedSort, which cause destination nodes 51 within more populated S-cubes to migrate toward the root of the multicast tree, thereby allowing the use of multiple ports to occur earlier in the multicast. It 4 Average ’5 Maximum 3 , U-cube -O— H Combine -B— Steps ‘ Maxport -x— 2 7’ W-sort +— T 1 '- —t 0 1 l I 1 1 n 8 16 24 32 40 48 56 64 Number of Destinations Figure 3.10. Stepwise comparisons on a 6-cube All of the curves converge at the highest data point, which represents the special case of broadcast. This behavior results from the degeneration of all of the algorithms to the same algorithm in the case of broadcast, where the set of nodes involved in the multicast consists of every node in the system. 3.5.2 Implementations on an nCUBE-2 Figures 3.12 and 3.13 plot the average and maximum, respectively, among destina- tions, of the measured delay between the sending of a 4096—byte multicast message and its receipt at the destination. For each point in a curve, 20 destination sets were chosen randomly in a 5-cube. These plots show that all the algorithms designed to 52 I I I I I I T 10 - :. -.- .. .731: I 8 _ '7 5 - J r” . Average 6 1.’ - Maximum U-cube *— Steps 4 5 Combine fi— ‘ Maxport -x— W-sort -+— 2 '- _ 0 1 1 1 1 1 1 1 128 256 384 512 640 768 896 1024 Number of Destinations Figure 3.11. Stepwise comparisons on a 10-cube take advantage of the all-port architecture offer some benefit over the U-cube algo- rithm. However, any advantage among Maxport, Combine, and W-sort, is unclear. Interestingly, Figure 3.12 shows that the average delay for U—cube is actually worse for multicast than for broadcast. This anomaly occurs because the algorithm sometimes transmits multiple messages along the same channel instead of taking advantage of multiple channels. In Figure 3.13, we see clearly the staircase behavior of U-cube. As predicted by the stepwise comparisons, the new algorithms tend to smooth the relative delays among various sized destination sets. We infer that the relatively similar results among the new algorithms, and in particular, the lack of a clear advantage for the W-sort algorithm, is due to the startup latency of the nCUBE—2, which prevents the machine from taking full advantage of the all-port architecture. In the nCUBE—2, the sending latency is about 107 psec and the receiving latency is about 80 psec. For small messages, these latencies dominate the network latency; the system behaves much like a one-port architecture, and there 53 6000 I I I I I I I 5000 - ‘3 H . - './ T ..r 4000 - ,la — /- Average //‘ Delay 3000 — / - (usec) / U-cube +- . Combine fl— 2000 — . -‘ / Maxport -x— W-sort +— 1000 - -] 0 1 1 1 1 1 1 1 0 4 8 12 16 20 24 28 32 Number of Destinations Figure 3.12. Average delay comparisons on a 5-cube is less difference between the various multicast algorithms. For larger messages, the network latency becomes more significant compared to startup latency. One would expect the performance advantage of the W-sort algorithm to improve for all sizes of messages if the startup latency were reduced. It is therefore worth noting that the recently announced nCUBE-3 is claimed to exhibit a startup latency of only 5 ,usec. 3.5.3 Simulations of Larger Systems In order to compare the algorithm for larger hypercubes, we relied on simulation. McKinley and Trefftz [61] have developed a CSIM-based simulation tool, called Mul- tiSim, which can be used to simulate large-scale multiprocessors. In particular, Multi- Sim uses novel methods to efficiently simulate wormhole-routed systems. In addition, the simulator has been validated against an nCUBE-2 hypercube multicomputer [61]. 54 10000 '- 8000 - Average Maximum 6000 — Delay Combine fi— (p.866) 4000 M axp ort W-sort +— 2000 -4 0 1 1 1 1 1 1 0 4 8 12 16 20 24 28 32 Number of Destinations Figure 3.13. Maximum delay comparisons on a 5-cube Figures 3.14 and 3.15 plot the average and maximum, respectively, among desti- nations, of the delay between the sending of a 4096-byte multicast message and its receipt at the destination. For each point in a curve, 100 destination sets were chosen randomly in a 10-cube. These plots show that all the algorithms designed to take advantage of the all-port architecture offer advantages over the U-cube algorithm. For the larger systems, the advantage of W-sort becomes more obvious in both the average and maximum cases. 3.6 Conclusions Efficient data distribution is critical to the performance of new generation super- computers that use massively parallel architectures. In this chapter, the problem of multicast in all-port wormhole-routed hypercubes has been addressed. It has been 55 20000 I I I I I I I 15000 '- Average Delay 10000 - (usec) U-cube -O— ' Combine -B— Maxport -x-— 5000 W-sort -+— - 0 l I l l l l l 0 128 256 384 512 640 768 896 1024 Number of Destinations Figure 3.14. Average delay comparisons on a 10-cube demonstrated why the U-cube multicast algorithm [19], which is optimal for one-port architectures, fails to take advantage of multiple ports when they are present in the system. New theoretical results regarding contention among messages in wormhole- routed hypercubes have been developed and used to design new multicast routing algorithms and to prove that these algorithms are contention-free. The algorithms were compared in terms of the number of steps required in each, their measured execution times when implemented on a relatively small-scale nCUBE-2, and their simulated execution times on larger hypercubes. The results indicate that significant performance improvement is possible when the multicast algorithm actively identifies and uses multiple ports in parallel. 56 30000 - I I I I I I I 25000 F- Average 20000 - Maximum Delay 15000 - (usec) U-cube +- 10000 Combine -B- _ Maxport -x— W-sort -+—— 5000 '1 0 1 1 1 1 1 1 1 0 128 256 384 512 640 768 896 1024 Number of Destinations Figure 3.15. Maximum delay comparisons on a 10-cube CHAPTER 4 Unicast-Based Multicast in One-Port Torus Networks In this chapter, we develop a unicast-based multicast algorithm for one-port, wormhole-routed n-dimensional torus networks. This algorithm achieves the lower bound of [log2 m] message-passing steps and avoids contention among the constituent unicast messages. Optimal multicast algorithms have previously been developed for meshes and hypercubes [19]; we generalize the earlier research to accommodate torus networks. The salient difference between the two topologies is the presence of wrap— around channels in tori, which affects the design of techniques for routing and switch- ing of messages through the network in order to avoid deadlock. These torus-specific properties must be considered in the design of tree-based multicast algorithms in order to minimize the number of message-passing steps while avoiding contention. The remainder of this chapter is organized as follows. Section 4.1 presents the system models under which we study multicast communication; we consider torus networks with both unidirectional and bidirectional communication links. In Sec- tion 4.2, we discuss the unicast routing algorithms for each architecture, upon which the multicast algorithm will be based. Section 4.3 develops theoretical results re- garding channel contention in wormhole-routed torus networks, while in Section 4.4, 57 58 we present an optimal unicast-based multicast algorithm for torus networks. In Sec- tion 4.5, we study the performance of the proposed multicast algorithm. Finally, a summary is presented in Section 4.6. 4.1 System Model Formally, an n-dimensional torus has ko x k1 >< - . - >< kn_2 X kn_1 nodes, with k,- nodes along each dimension i, where k,- 2 2 for 0 S i S n — 1. Each node x is identified by n coordinates, on_1(x)o,,_2(x) . . .oo(x), where 0 S o,(x) S k,-—1 for 0 S i S n—l. Two nodes x and y are neighbors if and only if o,(x) = o,(y) for all i, 0 S i S n — 1, except one, j, where oj(x) :l: 1 = oj(y) mod kj. In this chapter, we will assume for purposes of discussion that a torus is regular, that is, that k,- = kj for all 0 S i, j S n — 1, and we refer to the width, or arity, of the torus as simply k; however, all of the results we present are also applicable to non-regular tori. We consider the problem of multicast on two classes of torus networks: those with unidirectional links, and those with bidirectional links. We refer to these network types as unidirectional tori and bidirectional tori, respectively. In a unidirectional torus, neighboring nodes are connected by physical channels in one direction only. That is, if there is a channel from node r to node 3, denoted (r, s), then there is not a channel (s,r). Specifically, a physical channel (r,s) is present if and only if there is a dimension d, 0 S d S n — 1, such that od(r) + 1 = od(s) mod k, and o,(r) = o,-(s) whenever i 75 (I. Figure 4.1(a) shows the physical links associated with a 2D unidirectional torus. Also shown are the paths taken by two example unicast messages, one from source node (0, 0) to destination node (2, 1), and another from source node (0,2) to destination node (3,1). The paths shown result from a deterministic routing algorithm termed dimension-ordered routing [42]. In this ap- proach, messages are routed first in the highest (lowest) dimension in which the source 59 and destination nodes differ. Routing then proceeds on each required dimension, in descending (ascending) order of dimension, until the routing path reaches the desti- nation. Routing in a particular dimension is always completed before routing in the next dimension begins. Although adaptive routing mechanisms for wormhole-routed networks have been the focus of recent research [51, 53], in this chapter we focus on the deterministic dimension-ordered routing strategy, which is widely used due to its simplicity [1]. 0,3 1,3 2,3 ,3 0,3 1,3 2,3 3,3 _ , 9 0,2 3,2 0,2 1,2 2,2 3,2 l 0,1 3,1 0,1 3,1 0,0 1,0 2,0 13,0 0,0 1.0 2,0 3,0 > (a) unidirectional toms (b) bidirectional torus ——-> unidirectional commumcation link D source ‘—> bidirectional communication link _. unicast path E] destination Figure 4.1. Examples of 2D torus networks In a torus network with only unidirectional links, messages are often forced to travel a longer path than would otherwise be possible. Although message transmis- sion time may be nearly distance-insensitive in a wormhole-routed network, it is still 60 desirable to reduce path lengths whenever possible, since messages that travel on shorter paths use fewer channels, thereby reducing overall channel load, and hence decreasing the frequency of channel contention. To this end, we also consider the problem of multicast in bidirectional tori, in which direct message transmission is possible in either direction between neighboring nodes. Formally, a physical link (r, s) is present in a bidirectional torus if and only if there is a dimension d, 0 S d S n — 1, such that od(r) + 1 = od(s) mod k or od(r) — 1 = od(s) mod k, and o;(r) = o,(s) whenever i 75 d. The physical links associated with a 2D bidirectional torus are shown in Figure 4.1(b). As in the case of the unidirectional torus in Figure 4.1(a), paths for two unicast messages are shown. In the case of the path from source node (0, 2) to destination node (3, 1), bidirectional links provide a shorter path than do unidirectional links (2 links versus 6). When a message is routed through a bidirectional torus, there are two possible directions of travel for each dimension. We consider systems in which the direction associated with the shorter path is always taken. For networks with even arity, ties are possible (that is, a message may need to travel exactly k / 2 hops in a particular dimension). In the case of ties, we assume that the path not using a wraparound channel is selected. The neighboring nodes in a bidirectional torus may be connected by either single, bidirectional physical channels, or by pairs of unidirectional physical channels facing in opposing directions. 4.2 Unicast Routing Algorithms Daily and Seitz [57] observed that torus networks with single (unidirectional or bidi- rectional) channels between neighboring nodes exhibit cycles of channel dependency between unicast messages that can cause deadlock, and that these channel-dependence (‘1 (1' it D 61 cycles can be broken by multiplexing virtual channels on a single physical communi- cation channel. Each virtual channel has its own flit buffer and control [58]. In order to prevent message deadlock in a torus network, single channels between neighbor- ing nodes are replaced with multiple virtual channels, thus allowing the underlying routing algorithm to choose among these multiple virtual channels in such a way as to eliminate cycles of channel dependency. Virtual channels may be used in a variety of ways to eliminate deadlock, but how they are used has a significant effect on the design of efficient unicast-based multicast operations. We now describe two unicast routing algorithms that we will later consider in the context of their support of unicast-based multicast operations. The first routing algorithm is suitable for torus architectures with unidirectional links, while the second algorithm is applicable to systems with bidirectional links. 4.2.1 Unidirectional Torus Routing For unidirectional torus routing (UTR), there are two parallel sets of virtual channels, called p—channels and h-channels. The fundamental idea behind UTR is that, for each dimension in which a message travels, p—channels (‘ p’ for pre-wraparound) are used only by messages that will eventually use the wraparound channel in the current dimension; after using the wraparound channel, such messages use the h-channels (‘h’ for high-direction) for all remaining travel in the current dimension. Those messages that will not use the wraparound channel in a particular dimension use h—channels exclusively for travel in that dimension. Figure 4.2 illustrates the virtual channels within a single dimension, d, of a unidirectional torus. We use the notation of Dally and Seitz [57], where coax represents the virtual channel leaving node x in dimension d, in the virtual channel set a where 01 E {‘p’,‘h’}; that is, 0 indicates whether com, is a p-channel or an h-channel). 62 Cam thI Figure 4.2. Virtual channels in one dimension of a unidirectional torus Formally, for each dimension d, 0 S d S k — 1, and for each node x, let y be the node such that od(y) = od(x) + 1 mod k and o,(y) = o,(x) for all i 75 d. Then under UTR, 1. There is an h-channel, thx, from x to y whenever 0 S od(x) S k — 2, and 2. There is a p—channel, Cdpx, from x to y whenever I S od(x) S k - 1. As illustrated in Figure 4.2, there is no p-channel when od(x) = 0, since the p—channels are used only by messages that will eventually use the wraparound channel, and clearly, no message will first visit node 0 and later use the wraparound channel in the same dimension, since to do so would constitute a cycle in the path. Similarly, when od(x) = k — 1, there is no (wraparound) h-channel, since h—channels are used only by messages that have already traversed the wraparound channel (there is no need to use a wraparound channel twice in the same dimension) and by messages that will not use the wraparound channel. The UTR routing algorithm is described formally by the function Rum : N x N —+ C, which maps a ( current node, destination node) pair into the next channel of the routing path. In order to define RUTR(x, y), let d be the highest-ordered dimension in which x and y differ, and let A = od(y) — od(x). Then Cdpx if A < 0 RUTR(xal/) = (4-1) Cd)”; if A > 0 63 A message is routed from a source node to a destination node by applying the Rum function, first at the source, and then at each router through which the message travels, until the destination is reached. Implementing the Rom function in a router is straightforward. The UTR routes unicast messages along only shortest paths (under the constraints of unidirectional links) and is deadlock-free [57]. 4.2.2 Bidirectional Torus Routing In a torus network with bidirectional links, cycles of channel dependency exist in both directions in each dimension; as with unidirectional tori, multiple virtual channels are necessary to provide a deadlock-free routing algorithm. For bidirectional torus routing (BTR), there are three sets of virtual channels: p-channels, l—channels, and h-channels. Each individual virtual channel carries messages in one direction only. The p-channels route messages that will eventually use the wraparound channel in the same dimension. The l-channels (‘l’ for low-direction) and h—channels are used after the wraparound channel has been traversed; the h-channels are also used by messages that will not use the wraparound channel in the current dimension. The l—channels are directed towards lower-address neighboring nodes, while higher-address neighbors are reached through h-channels. The virtual channels along a single dimension, d, of a bidirectional torus with even width, k, are illustrated in Figure 4.3. The situation is similar when k is odd. can (in-2) can (in-n ‘41 um _ . cram-J | cant/2.1 , g-s jam-2: ,-'2‘-2 1.3-1 5‘ 1;“ film/2+1; i‘ canto-21 c c c cam/2oz) dun-n 4mm din/2+1) Figure 4.3. Virtual channels in one dimension of a bidirectional torus 64 Only one set of p-channels is needed, since a message that eventually uses the wraparound channel in a particular dimension will travel only away from the “center” of that dimension prior to using the wraparound channel; to do otherwise would result in non-minimum length routing. Formally, for each dimension d, 0 S d S k— 1, and for each node x, let y be the node such that od(y) = od(x)+1 mod k and o;(y) = 0';(x) for all i 75 d, and let w be the node such that od(w) = od(x) —1 mod k and o;(w) = 0',(x) for all i 75 d. Then under BTR, 1. There is an l-channel, Cdlx, from x to w whenever I S od(x) S k — l, 2. There is an h-channel, thx, from x to y whenever 0 S od(x) S k — 2, 3. There is a p—channel, cdpx, from x to y whenever [5%] +1 S od(x) S k — 1, and 4. There is a p—channel, cdpx, from x to w whenever 0 S od(x) S [3&1] — 1. As shown in Figure 4.3, some pairs of neighboring nodes require only two intercon- necting virtual channels, rather than three. For example, p-channels, which are used by a message prior to wraparound, are not needed near the center of a particular di- mension, since a message routed on a shortest path will never pass through the center of a dimension and also use a wraparound channel in that dimension. Wraparound channels are required only in the p-channel set, for reasons similar to those given for UTR. The BTR routing algorithm is described formally by the function 72ng : N x N —-1 C, which maps a (current node, destination node) pair into the next channel of the routing path. In order to define R3m(x, y), let (1 be the highest—ordered dimension in which x and y differ, and let A = od(y) — od(x). Then Cdpx if [A] > % RBTRUBU) = Cd]; if - g S A S -l (4.2) aaHISAsg 65 Since RBTR routes messages using a wraparound channel in each dimension, d, in which Iod(y) — od(x)| > k/ 2, then BTR always selects a shortest path between source and destination nodes. BTR is deadlock-free, since there are no cycles of channel dependency between unicast messages routed by Rom. 4.3 Contention in Wormhole-Routed Torus Net- works The unicast routing algorithm directly affects the design of multicast algorithms in wormhole-routed systems, because it determines how the constituent messages must be scheduled in order to avoid channel contention. Before a multicast operation begins, only the source node has a copy of the message. During the first message- passing step, the source can send the message to only one destination node on a one- port architecture. In each subsequent step, each node holding a copy of the message can send it to at most one new node. Therefore, the number of nodes that have a copy of the message can at most double during each step, leading to a lower bound of [log2 m] on the number of steps required to complete a multicast to m —1 destination nodes. In order for the actual time required by this recursive doubling procedure to be proportional to the number of message-passing steps, contention between unicast messages must be avoided. The following example illustrates the issues and difficulties involved in imple- menting efficient multicast communication in torus networks. We consider the 2D (5 x 5) torus in Figure 4.4, and suppose that a multicast message is to be sent from source node (4, 3) to six destinations. We define a multicast operation by a message, M, and a list of nodes, (1) = {xo,x1,x2,..., xm_1}, where xo is the source node and {x1,x2, . . . , xm_1} are the destination nodes, listed in arbitrary order. In this example, 0 = ((4,3), (0,0), (1,1), (2,1), (0,3), (1,3), (4,4)). D source node E] destination node D other node Figure 4.4. An example of multicast in a 2D torus Figure 4.5 illustrates the multicast trees associated with three different implemen- tations of the example multicast operation in a unidirectional torus (under UTR). The intermediate nodes and virtual channels (arcs) of each unicast are shown, as well as the communication step in which the unicast occurs. All arcs of a particular unicast share the same step label; because of the message pipelining associated with wormhole routing, a unicast message is considered to occur in a single step regardless of the number of hops on the associated path. In all three implementations, the local processors at only the source and the destination nodes are required to handle the message. Although the multicast implementation depicted in Figure 4.5(a) appears to com- plete the multicast in 3 steps, the unicasts from node (0, 3) to node (1,1), and from (4,3) to (1,3), both require the use of the h—channel from node (0,3) to node (1,3) during step 2. Since the unicasts are sent in the same step, there is said to be stepwise (’C 67 [3] [3] -... ...3 Evil (ii i! 653‘ 1.3 (a) stepwise contention (b) depth contention (c) contention-free source destination rm", intermediate node [Z] node 1:] node =. ..... 3 (IOuter used) [i] p -channel [i] h —channel 1' ' . ‘1 potential are contention ......... , used in stepi _ , used in stepi ~... Figure 4.5. Unicast-based software multicast trees contention between them, which will cause one of the unicasts to be blocked until the other has relinquished the shared arc. This contention will delay one of the unicasts by approximately one step, thereby causing a delay in the receipt of the message (at either nodes (2, 1) and (0, 0), or at node (1, 3), depending on which of the two unicasts was blocked) so that the multicast actually requires four steps to complete. In Figure 4.5(b), we see an alternate implementation of the same multicast opera- tion, but in this case without stepwise contention. However, under certain conditions, messages occurring in two different communication steps may actually be transmit- ted concurrently. This skewing of message-passing steps is due to the effects of the 68 various communication latencies. As described in Section 2.2.2, the startup latency is the overhead incurred in handling a message at the source and destination nodes. Startup latency is composed of sending latency, t5, and receiving latency, tn. The network latency of a message, t N, is dependent upon the message length. In Figure 4.5(b), the message from node (4, 3) to node (0, 3) during the first step is completely received by node (0, 3) at time ts + t N + t R. The message from node (0, 3) to node (1,1) during the second step thus enters the network at time 2ts + tN + t R. The three messages originating at the source node (4, 3) enter the network at time is, 2t5, and 3ts, respectively. Given the above facts, suppose that the communication latencies are such that t3 = tN + t R (for example, perhaps t5 = 2t N = 2tR). Then the message from node (0, 3) to node (1, 1) occurring in the second step, and the message from node (4,3) to node (1,3) occurring in the third step, both enter the network at time 3t5. Since these two messages each require the h—channel from node (0, 3) to node (1, 3), then the messages will exhibit channel contention under this scenario of communication latencies. Such channel contention occurring among messages in diflerent communication steps is termed depth contention. Figure 4.5(c) illustrates the use of techniques presented in this chapter, which pro- vide contention-free multicast in a torus using the optimal number of steps. Although the p-channel from node (4, 3) to node (0,3) is used by two unicasts (in steps 1 and 2, respectively), there can be no contention for this shared arc, since the first unicast is guaranteed to be complete before the second unicast begins. Before formally studying contention between messages, we present some no- tation. The path from a source node u to a destination node v result- ing from dimension-ordered routing in a torus network is denoted P(u, v) = (u; co; x1; CI; (to; co; . . . ;xq; cq; v), where c,- is the virtual channel used to travel from node x,- to node x,~+1, for 0 S i S q (where xo = u and xq+1 = v). 'Ihe 8H1] If’Cl 00. OCCII neiu' llCas coast 33191 COTFEE TheC "49 fl [Ty 69 The sequence of nodes visited on the path is (u;x1;x2; ;xq;v). For ex- ample, the path from source node (0,0) to destination node (2,1) in a unidi- rectional torus, depicted in Figure 4.1(a), is represented by P((0,0),(2,1)) = ((0, 0);Cih(o,0);(1,0);Cih(1,0);(2,0);00h(2,0);(2,1))- A unicast from node u to node v occurring at step t is denoted (u,v, P(u,v),t). To help understand the structure of a unicast-based multicast operation, the reach- able set [19] of a node, say node u, in a multicast implementation is defined to be the set of nodes in the multicast that receive the message, either directly or indirectly, through node v. If the multicast is viewed as a tree of unicast messages, then Ru is the set of nodes in the subtree rooted at u. As an example, in the multicast implementation shown in Figure 4.5(a), R(o,3) = {(0, 3), (1, 1), (2, 1), (0,0)}. Definition 4.1 A node v is in the reachable set of node u, denoted Ru, if and only if one of the following holds: 1. v = u; or 2. The implementation contains a unicast (w, v, P(w, v), t) such that w E R“. We now present several theorems regarding contention in wormhole-routed torus networks. These theorems are important in verifying that the proposed torus mul- ticast algorithm, presented in Section 4.4, produces only multicast operations whose constituent unicast messages are pairwise contention-free. Contention in wormhole-routed, n-dimensional mesh networks has been investi- gated in a previous work [19], in which the following theorem regarding contention in general wormhole-routed networks is presented. We use this theorem to develop corresponding results about contention in torus networks. Theorem 4.1 Given a multicast implementation, if at least one of the follow- ing four conditions holds for every pair of resultant unicasts (u,v,P(u,v),t) and (x,y,P(x,y),r), where t S r, then the multicast is depth contention-free. 70 1. x E R... 2. P(u,v) and P(x,y) are arc-disjoint. 3. x = u 4. x E Ru, and (u, w, P(u, w), t + i) is a product of the multicast, for some node w and positive integer i. In implementing a multicast algorithm, whenever the paths of two constituent unicast messages share an arc (virtual channel), care must be taken to ensure that the paths do not attempt to use the shared arc simultaneously. When two paths have no virtual channel in common, of course, contention between these two particular paths is always avoided. Paths with no common virtual channel are said to be arc-disjoint. We now develop two theorems (Theorems 4.2 and 4.3) that identify situations in a torus network, under UTR and BTR, respectively, in which pairs of unicast messages are arc-disjoint. We first give Lemma 4.1, which formalizes the basic notions of dimension-ordered routing and will be useful in later proofs. Lemma 4.1 Let P(u,v) = (u;co;x1;c1;x2;c2; ;xq;cq;v) be any dimension- ordered path (For clarity, let xo 2 u and xq+1 = v.), and let Cd”, 6 P be any arc in P(u,v). Then the following conditions hold: 1. For all j, 0 S j S i, and for all f, 0 S f S d — 1, Uf($j) = 0f(U). (Before traveling in dimension d, a message does not travel in dimensions lower than d.) 2. Forallj,i+1Squ,andforallf,d+1SfSn—l, Uf($j):0'f(v). (Upon traveling in dimension d, a message does not travel in dimensions higher than d.) 3. od(u) 75 od(v). (A message travels in dimension d only if the source and destination node ad- dresses differ in dimension d.) Proof: The lemma follows directly from the behavior of dimension-ordered routing. Cl 71 Definition 4.2 formally describes a dimension order, denoted d. Since v od(v). Figure 4.6 shows for each of these two cases the routes taken by paths P(u, v) and P(x, y) in dimension d, as prescribed by UTR (Expression 4.1). Although, in case 2, od(u) may, in fact, be farther to the right than shown in the fig- ure, this difference would only cause fewer channels to be used by P(u, v). As shown, paths P(u,v) and P(x, y) cannot share an arc in dimension d; thus, the assumption of the existence of such a shared arc must be false. (:1 h—channels h—channels o ---------- -o——>o- ------------- o—-Io ------------- o 0 Od(u) Od(v) Odor) O’d(y) k-I 0 Od(v) Od(u) Od(x) Gd(y) k—I Case 2: Od(u) > Od(v) Figure 4.6. Channels used by P(u,v) and P(x,y) in dimension d (Theorem 4.2) The next theorem is equivalent to Theorem 4.2, but is applicable to bidirectional, rather than unidirectional, torus networks. 73 Theorem 4.3 For any four nodes u,v,x,y, ifv <01 :1: <1 y then under BTR, paths P(u,v) and P(x,y) are arc-disjoint. Proof: The proof is by contradiction. Assume that there is an arc (r, 3) shared by paths P(u,v) and P(x,y), and let d be the dimension in which (r,s) travels. Then by Lemma 4.2, od(v) S od(x) < od(y). We must consider three possible cases with respect to the shared arc: either (r, s) is (1) a p-channel; (2) an l-channel; or (3) an h-channel. Case 1. Arc (r,s) is a p—channel: Since the same (pm-wraparound) p—channel, (r, s), is used in the path from od(u) to od(v), and in the path from od(x) to od(y), then the relationship between od(u) and od( v) must be the same as the relationship between od(x) and od(y); that is, either od(u) < od(v) and od(x) < od(y), or od(u) > od(v) and od(x) > od(y). Thus, since od(v) S od(x) < od(y) is known, then od(u) < od(v) S od(x) < od(y), as shown in Figure 4.7(a). But since paths P(u,v) and P(x,y) both use a p-channel in dimension d, then od(v) — od(u) > k/2 and od(y) -— od(x) > k/2, which cannot be possible when od(u) < od(v) S od(x) < od(y). Case 2. Arc (r,s) is an l—channel: Since arc (r,s) is an l—channel (low-direction) used in the path P(x,y), and since od(x) < od(y), then P(x,y) must first use a wraparound channel in dimension d before using are (r, s). It then follows that od(x) < od(y) S od(s) < od(r), and od(y) — od(x) > k/2, as shown in Figure 4.7(b). If od(u) > od(v), then path P(u, v) does not use a wraparound channel in dimension d, so od(v) S 04(3) < od(r) S od(u), but since od(y) — od(x) > k/2, then od(u) < od(y) (otherwise, od(u) — od(v) > k/2, in which case P(u,v) would use a wraparound channel in dimension d), from which follows that od(r) < od(y), a contradiction. On the other hand, if od(u) < od(v), we have od(u) < od(v) S od(x) < od(y), and since od(y) - od(x) > k/2, then od(v) — od(u) < k/2, so path P(u,v) uses only h~channels in dimension d. 74 p-channels . L p—channels | l‘ l I I l—channels l I l—channels o -------- O - - - - o- -------- o- --------- o- --------- o- -------- 0 Cd“) 0'40) Od(u) 04M 0d(x) 0,10) k-I (rehannels I | I—channels o ------ o- ------ o- -------------------------------- 0 Odo) Odor) 0,0) Odo) 0d") k-l (b) Case 2 h—channels o ---------- -o --------- -o-—o—o—Io- --------------- o 0 OdM Odor) Od(r) 04m 040') k-l (c) Case 3 Figure 4.7. Channels used by P(u, v) and P(x, y) in dimension d (Theorem 4.3) Case 3. Arc (r,s) is an h—channel: Since od(x) < 0.1(y), we know that od(x) S od(r) < 0,1(3) S od(y), as shown in Figure 4.7(c). If od(u) > od(v), then od(r) < od(s) S od(v), hence od(r) < od(x), a contradiction. If od(u) < od(v), then it follows that od(u) S od(r) < od(s) S od(v); and again, od(r) < od(x), a contradiction. Since each of the above cases leads to a contradiction, the assumption that paths P(u,v) and P(x,y) share an arc must be false. D 4.4 Optimal Multicast Algorithm In this section, we use the theorems of Section 4.3 to develop a unicast-based multicast algorithm for wormhole-routed tori that use either UTR or BTR routing. We show 75 that this algorithm produces pairwise contention-free unicast messages and completes the multicast in the minimum possible number of steps. The algorithm uses the recursive doubling procedure described earlier. This method can be viewed in many ways; one possible view is depicted in Figure 4.8, where we assume for simplicity that m, the number of nodes involved in the mul- ticast, is a power of 2. Figure 4.8 shows all unicasts occurring in the first three steps of an optimal multicast (labeled “[1],” “[2],” and “[3],” respectively), as well as two unicasts occurring in the final step (labeled “[log2 m]”). As shown, the source node, do, sends the message to destination node dm/2 during the first step; this step partitions the multicast problem of size m into two problems, each of size m / 2, with source nodes do and dm/g, respectively. This process continues recursively until all destination nodes have received the message. Figure 4.8. A minimum-time multicast The key to avoiding contention among the constituent messages is the ordering of the destinations. For example, the recursive doubling technique illustrated in Figure 4.8 has previously been applied to meshes; the resultant algorithm is called U- mesh [19]. In order to prevent contention, the U-mesh algorithm orders the destination nodes according to the dimension order relation, <11, which was discussed in the previous section. Such an ordered list is called a dimension-ordered chain [19], and is defined as follows. 76 Definition 4.3 A sequence of nodes {xo,x1,x2,..., xm_1} is a dimension-ordered chain if and only if all the elements are distinct and x, R—chain ———> unicast ----- > series of one or more unicasts Figure 4.11. Possible relations between unicasts produced by U—torus 79 We note from Definition 4.4 that an R-chain, Q = {x,,x,+1, . . . , xm_1,xo,x1, . . ., x,_1}, consists of two concatenated sub-chains, Q, = {x,, x,+1, . . . , xm_1}, and Q5 = {xo,xl, . . . , x,_1}. Since nodes u,v,x and y appear in the given order (u,v,x,y) in the R-chain, there are five possible points with respect to these four nodes where the partition between Q0 and Q5 might occur. These five subcases are enumerated as follows: i. u,v,x,yEQa ii. u,v,xEQa;y€Qb iii. u,vEQa;x,y€Qb iv. uEQa;v,x,yEQb v. u,v,x,yE Qb We consider any two nodes u and v in an R-chain, where it occurs before v in the R-chain. From Definitions 4.3 and 4.4, if u and v belong to the same sub-chain (that is, if either u,v E Q, or u,v 6 Q5), then u <- ‘ 3D, 512-node Bidir. Torus +— 1 _ Lower Bound — T 0 1 1 1 1 1 1 8 16 24 32 40 48 56 64 Multicast Size Figure 4.12. Average communication steps (512 and 1024-node tori) We also examined the effects of physical link sharing on larger torus networks. Figures 4.14 and 4.15 plot the average and maximum number of steps, among desti- nations, for 4096-node 2D (64 x 64) and 3D (16 x 16 x 16) torus networks, with both unidirectional and bidirectional links. For these larger configurations, multicast set sizes from 64 through 512 nodes were considered. As illustrated in Figures 4.12 and 4.14, the effect of virtual channel multiplexing on the average number of steps is small. In all cases, the number of steps is close to the theoretical lower bound for systems in which virtual channels do not share physical link bandwidth. In Figures 4.13 and 4.15, we see that the effect on the 83 8 7 6 Max 5 Steps 4 ‘ 3 2D, 1024-node Unidir. Torus -o— '- 2D, 1024-node Bidir. Torus B— 2 " 3D, 512-node Unidir. Torus -)(— " 3D, 512-node Bidir. Torus +— 1 " Lower Bound — ‘ 0 1 1 1 1 1 1 8 16 24 32 40 48 56 64 Multicast Size Figure 4.13. Maximum communication steps (512 and 1024-node tori) maximum number of steps is greater than for the average case, but even this effect is limited. It is noted that, for a given cost, systems using fewer physical links will have individual links with greater capacity than systems in which each virtual channel is supported by a separate physical link. Thus, the modest effects of physical link sharing are likely to be more than offset by the greater capacity of each link resulting from virtual channel multiplexing. To put the effects of physical link sharing into perspective, Figure 4.16 compares the worst-case observed performance of U-torus with the number of steps required when the source node performs the multicast operation. The worst observed case occurred with a 2D, 4096-node, unidirectional torus, as shown in Figure 4.15. As Figure 4.16 demonstrates, the difference between the performance of the U-torus algorithm, with and without the practical consideration of link sharing, is extremely small compared to the performance increase over a multicast operation performed using separate addressing, in which the source directly sends the message to every destination. 84 12 I I I I I I 10 h- _. 8 1— Avg Steps - 2D, 4096—node Unidir. Torus -o— 4 _ 2D, 4096—node Bidir. Torus B— _ 3D, 4096-node Unidir. Torus *— 3D, 4096-node Bidir. Torus +— 2 '- Lower Bound —- - 0 l I 1 I l l 64 128 192 256 320 384 448 512 Multicast Size Figure 4.14. Average communication steps (4096-node tori) In the case of a bidirectional torus, one might also consider designing a network in which all three virtual channels between a pair of neighboring nodes are multiplexed onto a single, bidirectional physical link. For example, in Figure 4.3, virtual channels Caho, cop], and cm might all be multiplexed onto a single physical link. This 3-way multiplexing would have two additional consequences beyond the 2-way multiplexing that we have already considered. First, two messages that travel between the same pair of nodes, but in opposite directions, would now share the bandwidth of a single physical link. Second, it would now be possible for three messages to simultaneously share a physical link. We examined the additional effects of the first item (2—way mul- tiplexing between opposite-direction messages) on the U-torus algorithm and found an increase in the average and maximum number of steps of not more than 3 and 7 percent, respectively, over the values presented in the previous plots; in many cases, the effect was much smaller. In order to better understand the effect of three unicast messages simultaneously sharing a physical link, we studied the frequency of such occurrences. Figure 4.17 85 2D, 4096—node Unidir. Torus +— 4 a 2D, 4096-node Bidir. Torus B— .. 3D, 4096-node Unidir. Torus -x—— 3D, 4096—node Bidir. Torus -+— 2 - Lower Bound —- - 0 L l L J l l 64 128 192 256 320 384 448 512 Multicast Size Figure 4.15. Maximum communication steps (4096—node tori) compares, for 4096-node 2D and 3D bidirectional torus networks, the number of unicasts produced by the U-torus algorithm that are involved in the following two types of physical link sharing. Recall that the total number of unicasts in a multicast operation equals the number of destinations. Type 1. Unicasts that share a physical link with another unicast in the same step, given the assumption that only virtual channels traveling in the same direction are multiplexed together. Type 2. Unicasts that share a physical link with two other unicast in the same step, given the assumption that all virtual channels between a pair of adjacent nodes are multiplexed onto a single physical link. We chose the above comparison because we know the effects of the first type of link sharing. As shown in Figure 4.17, the frequency of 3-way link sharing is very small compared to 2-way sharing. Since we have demonstrated that the effect of 2-way link sharing is not large, we conclude that the effect of 3-way link sharing is likely to be inconsequential. 86 500 .. I I I I I I _ 400 - - Max 300 ' 1 Steps 200 _ “ Separate addressing -o— U-Torus (worst observed case) -0— 100 7 Lower Bound A— I 1 0‘ Q A QA—Q, A. A A 64 128 192 256 320 384 448 512 Multicast Size Figure 4.16. Effects of physical link sharing Thus, the U-torus algorithm performs well in a variety of environments: those with either unidirectional or bidirectional communication links; and those with vir- tual channels implemented with either independent physical links, or by multiplexing either two or three (in the case of bidirectional tori) virtual channels onto a single physical link. 4.6 Conclusions This chapter has presented an efficient algorithm for multicast communication on wormhole-routed torus networks. The U-torus algorithm applies to unidirectional and bidirectional tori of any dimension. The algorithm produces multicast trees in which the constituent unicast messages do not contend for the same channels, regardless of message length or startup latency. Moreover, the number of message- passing steps required to multicast data to m — 1 destinations is [log2 m], which is optimal for one-port architectures. The results of a simulation study showed that the praC‘ on d‘ 40 n I 35 ' r 30 - 2D, 4096—node (type 1) -o— 3D, 4096—node (type 1) 9% 25 .— 2D, 4096-node (type 2) B— _ 3D, 4096—node (type 2) +— 1 _l .1 Unicasts 20 ' - Sharing Physical 15 7 " Links 10 _ ( 5 - .. 0 .____r_.:_-:: MEI—_‘q—E‘u—i _ 64 I28 192 256 320 384 448 512 Multicast Size Figure 4.17. Comparison of 2-way and 3-way physical link sharing practical consideration of physical link sharing by constituent messages transmitted on different virtual channels has little effect on performance. ca CHAPTER 5 Path-Based Routing in Unidirectional Torus Networks We now turn to research in the area of path-based message routing. In this chapter, we develop a general path-based message routing mechanism for wormhole—routed uni- directional torus networks with intermediate reception (IR) capability. As described in Section 2.5, a router with IR capability is able not only to route incoming messages onto outgoing channels without processor intervention, but can also simultaneously deliver a copy of a passing message to the local processor/ memory. The path—based routing mechanism presented in this chapter will be used in Chapter 6 as a basis for a family of path-based multicast algorithms for unidirectional torus networks with IR capability. The organization of this chapter is as follows: The major issues associated with path-based routing in unidirectional torus networks are discussed in Section 5.1. In Section 5.2, Hamiltonian Circuits in torus networks are presented as a means by which deadlock-free path-based routing can be achieved. The path routing function for multi-destination messages is described in Section 5.3, and in Section 5.4, a mes- sage preparation algorithm is described. This algorithm is used to order a list of destination nodes in such a way that when the path routing function is used to route 88 89 from the source node through each destination node in the prescribed order, the re- sulting multi-destination message is deadlock-free in combination with any collection of such messages in the network. The correctness of the proposed message routing technique is verified in Section 5.5. The implementation issues associated with this path-based routing method are addressed in Section 5.6. Finally, conclusions are given in Section 5.7. 5.1 Issues Multi-destination messages are subject to the same deadlock considerations as we have described for unicast messages. Since the progress of an entire multi-destination worm depends on the concurrent availability of all channels used by that worm, the routing rules must be applied to the worm as a whole, and not just individually to the segments of the worm that exist between destination nodes. In an attempt to provide deadlock-free path-based routing, multi-destination worms may be subjected to the same routing rules that are known to be deadlock-free for unicast routing. However, in a communication operation involving an arbitrary destination set, such as multicast, multi-destination worms that are forced to adhere to existing deadlock-free unicast routing rules are often ineffective. Figure 5.1 illustrates a multicast problem in a 2D (6 x 6) torus, where the source node (3, 2) is to deliver a message to 9 destination nodes. In this example, any single message following XY routing rules can reach at most two destination nodes. For instance, XY routing dictates that the path from source node (3,2) to destination node (4, 3) is ((3, 2), (4, 2), (4, 3)). Upon reaching node (4, 3), this path has not passed through any additional destination nodes and has already traveled in the Y direction, so node (4,5) is the only additional destination node that can be reached. In this case, the routing rules prevent a message enroute to any destination node from passing 90 through a large number of the other destinations. The routing rules used for path- based routing must therefore be liberal enough to allow flexibility in message routing, so that many intermediate destination nodes can be visited by a single message. However, these routing rules must still avoid deadlock. ——> communication link Figure 5.1. A multicast operation in a 2D torus The way in which a multi-destination message is prepared for transmission can affect the channel dependencies created by the message, and hence, the deadlock properties of the system. For example, the order in which the destinations of a multi-destination worm are visited affects the channel dependencies produced by the message. In general, the more flexible routing rules needed for path-based routing introduce additional channel dependencies over those that exist for unicast routing. In order to 91 prevent deadlock in systems with deterministic routing rules, the channel dependency graph must be acyclic [57]. In order to avoid deadlock, the routing rules need only account for those messages that can actually be produced by the supported commu- nication operations. For example, if a multi-destination message visiting destination nodes u, v, and w, in that order, will never be produced, then such a message need not be considered when analyzing the deadlock properties of a system. Therefore, it is the combination of the message preparation algorithm and the path routing function that must be considered when designing a deadlock—free system. Throughout this chapter and the next, we assume that the system model is a regular unidirectional torus of width k and dimension n. The reader is referred to Section 4.1 for a definition of this system model. In addition, node routers are assumed to have IR capabilities, as described in Section 2.5. In systems with one-port architectures, the single path from a router to the local node may itself induce communication deadlock when two or more multi-destination messages are issued concurrently. Figure 5.2 illustrates a scenario in which two multi- destination worms are each attempting to deliver a message to nodes u and v. How- ever, each message is holding the single input port at one node, while waiting to use the input port at the other node, resulting in communication deadlock. So that such port-induced deadlock does not occur, we assume a system with an all-port architecture. For any node u, and any dimension (1, let u“ be the node adjacent to u in dimension d. Formally, u“ = on_1(u)o,,_2(u) od+1(u)[od(u) + 1 mod k]od_1(u) oo(u). (5.1) Up; COT Ilia nod IOU iota lhEr 92 Node a Node v Local Processor Local Processor and and Memory Memory 1 ll II it ‘ Message 0 J Message b L Router Router ———~ lntemal p011 —> Message I Deadlock point Figure 5.2. Port-induced communication deadlock in a one-port system 5.2 Hamiltonian Circuits One way in which deadlock can be avoided in path-based routing is by ensuring that there are no cycles of channel dependency allowed by the routing algorithm. Ensuring an absence of channel dependence cycles, and thus freedom from deadlock, can sometimes be accomplished by means of a Hamiltonian Path (HP). An HP in a network is a path that traverses communication links, visiting each node exactly once, while a Hamiltonian Circuit (HC) is an HP whose last node is adjacent to the first node (thus completing the circuit). An example of an HP in a 2D mesh is shown in Figure 5.3. The numbers near the upper-left corner of each node define a total ordering of the nodes. This ordering corresponds to an HP through the network. In general, a network may contain many HPs. If the routing algorithm can be designed so that messages always visit nodes (including the destination nodes and the intermediate nodes at which only the router is used) in the same order as those nodes appear on a particular HP, then a total ordering will be induced on the use of the communication channels (resources), thereby ensuring deadlock-free communication. 93 Figure 5.3 shows two message worms generated by a path-based multicast algo- rithm designed for mesh networks [30]. These message worms each travel in an order corresponding to the HP and together deliver the message, via IR, to the destination nodes. Since the communication channels that connect nodes in a forward direction with respect to the HP are disjoint from those that connect nodes in a reverse di- rection, the use of both message worms that travel forward, and those that travel backward, on the HP does not produce communication deadlocks. 35 34 33 ,32 31 30 0,5 1,5 2,5 3,5 5,5 , -25 26 27 28 29 Eel—Eil—EA 3’4 4’4 23 22 21 20 19 18 E13 1,3 2,3 3,3 4,3 5,3 12 13 14___ 1 16 17 [:12 1,2 2,2 3,2 4,2 5,2] 11 10 9 8 7 6 [21,1 1,1 [1:] 3,1 4,1 [E] o 1 2 3 4 5 0,0 2,0 3,0 4,0 5,0 El El (_, source node destination node other node Figure 5.3. Path-based multicast routing in a mesh using Hamiltonian Paths The path-based routing method for meshes, described above, is based on the existence of an HP with the following characteristic: it is possible to route a message from an arbitrary source node to an arbitrary destination such that the nodes visited on the route (including the source and destination) conform to the ordering of nodes 94 defined by the HP. In a torus network with unidirectional communication links, an HP with the above characteristic clearly does not exist. Therefore, in order to provide deadlock-free path-based routing in this topology, we cannot rely on an HP alone. Figure 5.4 shows an HC in a 2D (6 x 6) unidirectional torus. The numbers near the upper-left corner of each node correspond to one particular HC; this HC begins at node (0, 0), continues through the network, visiting every node. Although there is generally more than one HC in a unidirectional torus, we use only a particular HC such the one illustrated in Figure 5.4. We refer to this special HC in a particular torus, T, as HT, or simply 'H when it is clear which torus network is indicated. Informally, H begins at node 0 (node 0 is the node whose address value is 0 in every dimension); at each node u on ’H, the next node is the neighbor, u“, of u that minimizes d under the constraint that u“ does not already precede u on ’H. Also shown in Figure 5.4 are boundaries, which are communication links that travel backwards in ’H. Boundaries will be used when defining a path-based routing function for torus networks. — ] boundary —> communication link Figure 5.4. Hamiltonian circuit 'H in a 2D torus 95 Figure 5.5 shows the node orderings defined by ’H in a 3D (4 x 4 x 4) torus where, for clarity, the communication channels in dimension 2 (inter-plane) are not shown. In addition to the boundaries shown explicitly in Figure 5.5, every channel from plane 3 to plane 0 is also a boundary. 1,0,3 1,1,3 8 J1: s 29 1,1,2 “”5 1,1,1 21 Hi 31 I 0 1.1.0” + O P1109 0 plane 1 ."‘ 9 O O 62 51 52 —L- 57 ] 41 46 35 3,0,3 3,1,3 3,2,3 —% 3,3,3 2,0,3 2,1,3 '1 l— T 1 61 50 I 55 56 45 34 3,0,2 3,1,2 3,2,2 3,3,2 2,0,2 2,1,2 ‘ 60 49 I 54 59 3,0,1 3,1,1 3,2,1 +3.11 f 48J— 53 58 ~ . 47 m 3.1. 3.2.0 13.3.0 7.0.0 —4 plane 3 plane 2 L c: I did — I boundary —> communication link Figure 5.5. Hamiltonian circuit ’H in a 3D torus We define notation (1(a) to be the ordinal position, or label, of a node u in torus T as determined by 7717. Again, when it is clear which torus network is under 96 consideration, we use notation [(11). For example, in Figure 5.4, [(0, 0) = 0, [(0, 1) = 1, ((1,0) = 7, and so forth. We define ’H formally by defining the ordinal position, [(11), of an arbitrary node, 71. For a 1D torus, which is simply a ring, trivially, ((u) = oo(u). For a 2D torus, as shown in Figure 5.4, we have ((u) = [(oo(u) + 01(u)) mod k] + k [01(u)] for n = 2 For the general case of a k-ary n-dimensional torus, ’H is defined by Equation 5.2: [(u) = ":1 [ki ((35 Uj(U)) mod k)] for n 2 1 (5.2) i=0 j=i In Definition 5.1, node labels are used to formally define a boundary in a unidi- rectional torus. Definition 5.1 Ifu and v are two neighboring nodes {that is, if 11‘ = v for some 0 S i S n — 1), then channel (u,v) is a boundary if and only if€(u) > €(v). 5.3 The Path Routing Function In order to develop the path routing function, we first describe a path-based routing method for a restricted class of multi-destination messages in which the source node precedes, on 71, every destination node. We then extend this method to include arbitrary multi-destination messages. 5.3.1 Restrictive Routing We consider a multi-destination message in which the destination nodes have been arranged in ascending order according to their position on ’H. Such a message is 97 defined by the node sequence Q = {uo, 7.11, no, . . . , um_1}, where no is the source node, {u1, no, . . . , um_1} are the destination nodes, and [(115) < ((uj) for 0 S i < j S m—l. We can describe a multi-destination message path that begins at the source node no and visits each destination node in order. At each step, the path routing function is applied to the current node and the next destination node, in order to determine the node adjacent to the current node to which the message is next routed. In order to provide minimal paths between destination nodes, we must restrict travel to those dimensions in which the addresses of the current node and the next destination node differ; these are called useful dimensions. Given the above, we can define the path routing function (for this restrictive case) as follows: messages are always routed in the lowest useful dimension that does not cross a boundary. More formally, at the current node, u, the message is routed to the neighboring node, 11“, such that 1. The addresses of node u and the next destination node differ in dimension d; 2. Channel (u,u“) is not a boundary; and 3. The value of d is minimal, given the above two restrictions. For example, Figure 5.6 illustrates the application of this routing algorithm to the (restrictive) multi-destination message Q = {(3,2)23, (4,3)”, (4,5)”, (5,1)”, (5, 4)33}, where the first element, (3, 2), is the source node, and the destination nodes are ordered according to their labels (which have been placed as superscripts in the above list). Since the multi-destination message path described above visits all nodes in an order that is consistent with the total ordering established by ’H (and does not utilize the cyclic property of ’H), then all cycles of channel dependency are prevented, thereby ensuring deadlock-free routing. 98 _ bo —* communication link I undary —> message path El El El source node destination node other node Figure 5.6. A (restrictive) multi-destination message in a 2D torus 5.3.2 General Routing The restrictive path-based routing described above does not apply to messages in which one or more destination nodes precede the source node on ’H. In order to provide a general path-based routing method that is deadlock-free, we extend the above restrictive routing mechanism through the use of virtual communication chan- nels. It is known that deadlock-free deterministic routing cannot be achieved in a wormhole-routed torus network without the use of multiple virtual channel sets, even in the simplified case of unicast routing [57]. Thus, at least two virtual channel sets are required in order to provide deadlock-free deterministic path-based (or unicast) routing. We will, in fact, describe a path-based routing that requires only this lower bound of two virtual channel sets. 99 For unidirectional torus path-based routing (UTPR), there are two virtual channels sets, called p-channels and h-channels. Briefly, the p-channels (‘ p’ for pre-boundary) are used by messages only prior to crossing a boundary; after crossing a boundary, messages use the h-channels (‘h’ for high-channel) for all remaining travel. Those mes- sages that will not cross a boundary use p—channels exclusively. Each non-boundary physical communication link in the network has multiplexed onto it a p—channel and an h—channel. Only h-channels are required at boundary links. Formally, under UTPR, for each dimension d, 0 S d S k — 1, and for each node u, 1. There is a p-channel, cop“, from u to it“ whenever (u, u“) is not a boundary, and 2. There is an h—channel, thu, from u to u“. The rule for routing from node '11 to node v has been described above for the case where [(u) < [(v). We must now consider the case where [(u) > €(v). To route from a node u to a node v, where either [(u) < [(v) or [(u) > [(v), we use the following generalized routing rule: travel occurs in the lowest useful dimension that does not cross a boundary. If every useful dimension crosses a boundary, then travel occurs in the highest useful dimension. By using p—channels prior to crossing a boundary, and h-channels thereafter, cycles of dependency among virtual channels, and thus deadlock, are impossible. We show in Section 5.5 that an arbitrary multi-destination message can be routed so that at most one boundary is crossed. Limiting a message to a single boundary is important, since with only two virtual channel sets, allowing messages to cycle through the network multiple times could cause deadlock, even if these messages are restricted to the cyclic order given by 'H. The path routing function for UTPR is described formally by the function RUTH; : N x {‘p’,‘h’} x N —> C, which maps a (current node, incoming virtual channel set, destination node) triple into the next channel of the routing path. Let A(u, v) be the 100 set of useful dimensions for routing from u to v. That is, A(u,v) = {i I 0 S i S n — 1 and (7,-(u) 74 o,(v)} Let A;(u, v) to be the useful dimensions that are not boundaries at 11. Thus, Af(u,v) = {i E A(u,v) | (u,v') is not a boundary} We define RUTPR, based on A and A}, as follows. RUTPRWflW) = Cufida Where min(A;(u,v)), if Af(u,v) # 0 max(A(u, v)), otherwise (5.3) fl ‘p’, if a = ‘p’ and (u,vd) is not a boundary — ‘h’, otherwise. A multi-destination message is routed from a source node to a destination node (or between successive destination nodes) by applying the RUTH; function, first at the source, and then at each router through which the message travels, until the destination is reached. When a message enters the network, it is initially routed over a p-channel. As specified by Equation 5.3, the message continues to be routed over p-channels unless a boundary is crossed. After crossing a boundary, a message is routed on h-channels. 5.4 Message Preparation The path routing function, realized in hardware within the node router, determines the route taken by a message between subsequent destination nodes (and between 101 the source node and the first destination node). The message preparation algorithm, implemented in software within the source node processor, arranges the list of des- tination nodes in the message header, thereby determining the order in which the destination nodes are reached. The path routing function and message preparation algorithm must therefore be carefully designed in tandem, so that cycles of channel dependency, and thus deadlock, are avoided. The path routing function RUTPR defined in Equation 5.3 describes how messages can be routed between pairs of nodes; that is, between the source node and the first destination node, and between pairs of successive destination nodes of a multi- destination message. One way to maintain the deadlock-free properties of RUTH; is to force the multi-destination message to visit the destination nodes in an order corresponding to H. In addition, given the constraints of only two virtual channel sets, the message must be limited to one full cycle of H. In order to meet the above requirements, we introduce the notion of an H~cycle. We then show how path-based routing can be applied to an arbitrary source and list of destination nodes that have been arranged into an H—cycle. We first define an H-chain, which is simply a sequence of nodes whose order is consistent with the linear (as opposed to cyclic) ordering established by H. Definition 5.2 A sequence of nodes {uo, 111,112, . . . , um_1} is an H-chain if and only if all elements are distinct, and [(1a) < €(u,+1) for 0 S i < m — 1. An H-chain does not utilize the cyclic properties of H —— the labels of the elements of an H-chain are strictly increasing. An H-cycle, on the other hand, is a sequence of nodes whose ordering is consistent with the cyclic ordering associated with H, and is defined as an end-around rotation of an H-chain. Definition 5.3 IfQ = {uo,u1,u2, . . . , um_1} is an H-chain and u, is an element of Q, then {u,,u,+1, . . ., um_.1,uo,u1, . . . , 213-1} is an H-cycle with respect to us. 102 For any H-chain Q = {uo, til, 11;, . . . , um_1}, Q is an H-cycle with respect to no; that is, any H-chain is also an H-cycle with respect to the first element. As an example of the construction of an H-cycle, we consider the multicast problem in a 2D (6 x 6) torus depicted in Figure 5.1, where the source node (3,2) is to deliver a message to 9 destination nodes using a single multi-destination message. The problem is thus defined by the sequence ‘1’ = {(3,2)23,(0,5)5,(4,5)27,(3,4)19,(5,4)33,(4,3)25,(1,2)9,(2, 1)15,(5,1)3°,(1,0)7} where the first element of Q is the source node, and the destination nodes are listed in arbitrary order. For convenience, the source node is underlined, and the label, [(u), of each node, 21, is added as a superscript to the node address. (For reference, the node labels of a 2D (6 x 6) torus are illustrated in Figure 5.4.) First, Q is sorted according to labels of the node addresses to obtain the H—chain ¢I : {(015)5’ (11 (3)7? (11 2)91(211)15’(314)193(312)233(413)251(41 5)27’(5’1)301(514)33} Next, the H-chain Q’ is rotated so that the source node, (3,2), appears at the head of the list, resulting in the following H-cycle <1":{(3,2)23,(4,3)25,(4,5)27,(5,1)3°,(5,4)33,(0,5)5,(1,0)7,(1,2)9,(2,1)15,(3,4)19} Finally, a single message can be routed, starting at the source node (3, 2), and then to each destination node in Q”, in turn, according to the path routing function RUTPR (Equation 5.3). Figure 5.7 illustrates the path of this multi-destination message as it is routed to the successive elements of the H-cycle Q”. Figure 5.8 gives the message preparation algorithm, which is executed by the source node processor in order to build the multi-destination message header. 103 —’ communication link —> message path —[ boundary El El :1 source node destination node other node Figure 5.7. A multi-destination message in a 2D torus 5.5 Correctness In order to implement the above path-based message routing, the source node applies the message preparation algorithm to the source and destination node addresses. The resultant H-cycle is then used as the destination address list of a multi-destination message that will be routed, in turn, to each destination node, according to the path routing function RUTPR. Theorem 5.1 shows that this method provides an efficient, deadlock-free, path-based routing mechanism for unidirectional torus networks of ar- bitrary dimension. Theorem 5.1 IfQ = {uo,u1,u2, ,um_1} is an H-cycle, then a multi-destination message, routed according to path routing function RUTPR, beginning at source node 104 The Message Preparation Algorithm Input: A sequence of nodes Q = {uo, ul, uo, , um_1}, where no is the source node. Output: Message header M. Procedure: 1. Sort Q according to an ascending ordering of the node labels [(ug) of each node a,- 2. Rotate Q so that no is again the first element of Q 3. M = Q '— {110} Figure 5.8. The message preparation algorithm for path-based routing no and routed through destination nodes 111,112, ,um_1, in that order, has the following properties. 1. All paths between successive destination nodes are minimal. 2. The network is deadlock-free under any and all combinations of such multi- destination messages. 3. The message uses all distinct physical channels. In order to prove Theorem 5.1, we first present a series of lemmas. As a notational convenience throughout the following lemmas, we define Z,- to be the coefficient of the i” term of Equation 5.2; that is, for any node v and for 0 S i S n — 1, [,(u) = (r: oj(u)) mod k (5.4) Thus, we can write Equation 5.2 as (00 = "i [k‘ 2.0)] (5.5) i=0 105 Lemma 5.1 Let u be any node in a unidirectional torus, and let d be any dimension, 0 S d S n — 1. Then [(11) > €(ud) if and only if€d(u) > €d(u“). Proof: We consider the maximum amount that the value of ((u) is effected by the first d — 1 terms of Equation 5.2. Because of the modulo-k arithmetic, the value of €,-(u) can change by at most k — 1 as the value of u changes; thus, the corresponding aggregate change to the value of [(21) due to terms 0 through d — 1 of Equation 5.2 is bounded above by E[kf(k—1)]= kd — 1. Since k“ is the minimum amount by which a change in the d“ term of Equation 5.2 can effect the value of ((u), and since terms d + 1 through 72 — 1 are not effected by the value of od(u), then the lemma is valid. Cl Lemma 5.2 Let u and v be any two distinct nodes in a unidirectional torus and let d be the greatest integer such that od(u) 75 od(v). If channel (u,u“) is a boundary, then [(u) > [(v). Proof: By Definition 5.1, €(u) > [(u“). It then follows from Lemma 5.1 that €d(u) > €d(u“), and from the rules of modulo arithmetic, 2,1(11) = k — 1. From the premise of the lemma, od(u) 76 od(v) and o,-(u) = oj(v) for j > d, hence, td(v) 75 k — 1 Therefore, due to the modulo-k arithmetic, €d(v) < k -— 1. Finally, again by Lemma 5.1, [(u) > €(v). El Lemma 5.3 Let u and v be any two distinct nodes in a unidirectional torus. If on_1(u) :,£ on_1(v) then on_1(u) > 0,,__1(v) if and only if ((u) > €(v). Proof: From Equation 5.4, €n_1(u) = on__1(u), so it suffices to show that [n-1(u) > €n_1(v) if and only if [(u) > €(v). By the same reasons given in the 106 proof of Lemma 5.1, the value of 2?;02[kf€,-(u)] is bounded by 71-2 . 0 g 2 [k*t,-(u)] g k"‘1 — 1 i=0 Therefore, in-1(u) > (n-1(v) only if ((u) > [(v), and similarly, [n-1(u) < fn_1(v) only if [(u) < €(v). Since (”-1 74 [n_1(v) because on_1(u) ¢ on_1(v), then the lemma is proved. El Lemma 5.4 Let u and v be any two distinct nodes in a unidirectional torus. If€(u) < ((v) then there exists a dimension d such that od(u) 75 od(v) and [(u) < [(u“) S [(v). If [(u) > [(v) then there exists a dimension (I such that od(u) 75 od(v) and either [(11) < ((ud) or ((ud) S €(v). Proof: Part I. [(u) < ((v): The proof is by induction on n, the dimensionality of the torus. If n = 1, the torus is a ring, therefore 6(a) = oo(u) = u; likewise, [(v) = v. The assertion is then trivially true. We now assume that the lemma is true for n — 1 (although this assumption will only be needed in Case 1, below). Case 1. 0,,_1(u) = 0,,_1(v): In this case, all routing occurs in sub-tori of dimension n—l. If 0,,_1(u) = 0, then the Hamiltonian Circuit H within this sub-tori is equivalent to H7 within a torus, T, of dimension n — 1. Otherwise, by the symmetry of the torus network, the sub-tori is isomorphic to T. Thus, by induction, there is a dimension d such that od(u) = od(v) and ((u) < [(ud) S [(v). Case 2. 0,,_1(u) 75 on_1(v): By Lemma 5.2, channel (u,u"’1) is not a boundary, and by Lemma 5.3, on_1(u) < on-1(v). Case 2a. on_1(v)-o,,_1(u) > 1: Then on_1(u"‘1) < 0,,_1(v), and from Lemma 5.3, [(un‘l) < [(v). Since channel (u,u"'1) is not a boundary, [(u) < [(un’l). Case 2b. on_1 (v) —o,,_1(u) = 1: This case is divided into two sub-cases, as follows. 107 Case 2b.i. There exists a dimension d < n— 1 such that od(u) 75 od(v) and channel (u,vd) is not a boundary: Then 0,,_1(ud) = on_1(u) < 0,,_1(v), so from Lemma 5.3, ((u“) < [(v). Since channel (u,u“) is not a boundary, [(u) < [(u“). Case 2b.ii. There is no dimension d < n — 1 such that od(u) 75 od(v) and channel (u,u“) is not a boundary: In this case we have that for all d < n — 1 such that od(u) 74 od(v), channel (u,u“) is a boundary, and hence, [(u) > [(u“); therefore, we must show that ((un'l) S [(v). Let d be any dimension such that d < n — 1 and od(u) 7E od(v). Since channel (u,vd) is a boundary, then by Lemma 5.1, €d(u) = k — 1. Because channel (u,u”‘1) is not a boundary, then from Lemma 5.3, on_1(u"‘l) > 0,,_1(u), so on_1(u"'l) = on_1(u) + 1. It then follows, from Equation 5.5, that €d(u“’1) = 0. Since od(u) = od(u"‘1) for all d < n— 1, we have that td(u"’1) = 0 for all d < n—l such that od(u"‘1) 79 od(v). Therefore, from Equation 5.5, [(un‘l) S [(v). Part II. ((11) > [(v): If there is a dimension (1 such that od(u) 75 od(v) and channel (u,vd) is not a boundary, then trivially, [(u) < ((u“). Otherwise, we know that for all d such that ad(u) 75 od(v), channel (u,u“) is a boundary, and we must show that for at least one such dimension, [(u“) S ((v). For reasons similar to those given in Part I of this proof (Case 2b.ii), if we choose d as the greatest integer such that od(u) 314 od(v), then [6(11“) = 0 for all e < d such that a,(ud) aé 0,,(0), and thus, and) g [(0). :1 Lemma 5.5 Let u and v be any two distinct nodes in a unidirectional torus and let w be the first node (after node u) on the path from u to v, as determined by the path routing function RUTPR. If [(u) < €(v) then [(u) < [(7.0) S €(v). If€(u) > ((v) then either [(11) < €(w) or €(w) S [(v). 108 Proof: Let cups 2 RUTpR(u,a,v) be the first channel on the path from u to 2). Then w = if. Although the value of the virtual channel set, a, is not specified above, the dimension, 6, produced by the RUTPR function does not depend on 0 (Equation 5.3). Part I. [(u) < [(22): By Lemma 5.4, there exists a useful dimension, d, such that (u,ud) is not a boundary and [(ud) S [(22). From Definition 5.1 and Lemma 5.1 follows that if i and j are two dimensions such that i < j and neither channel (a, u‘) nor channel (u,uj) is a boundary, then €(u‘) < ((uj). Since ’RUTPR always selects the minimum dimension among useful non-boundary dimensions, then 3(a) < [(112) S [(12). Part II. [(u) > €(v): If channel (u, w) is a boundary then all channels in useful dimensions are boundaries, since RUTPR always selects a non-boundary channel if one exists in a useful dimension. Thus, by Lemma 5.4, there exists a useful dimension, d, such that [(ud) S €(v). Since [(22) < 3(a), then (u,ud) is a boundary. From Definition 5.1 and Lemma 5.1 follows that if i and j are two dimensions such that i > j and both channel (u,u‘) and channel (u,uj) are boundaries, then ((u‘) < €(uj). Since RUTPR always selects the maximum dimension among useful boundary dimensions, then ((w) S [(22). If, on the other hand, channel (u,w) is not a boundary, then by Definition 5.1, ((w) > 3(a). [:1 Lemma 5.6 Let u and v be any two distinct nodes in a unidirectional torus. If ((u) < ((1)) then the path from u to v, as determined by the path routing function RUTPR, does not contain a boundary. If [(u) > [(v) then the path contains exactly one boundary. 109 Proof: Let wo,w1, wg, . . . ,wq be the sequence of nodes on the path from u to v, where we 2 u and w, = 2). Part I. [(u) < 3(1)): By Lemma 5.5, [(2120) < €(w1) S €(wq), and by extension, [(wo) < [(wl) < 3(w2) < < [(wq). Hence, the path contains no boundaries. Part II. ((u) > [(v): Since ((wo) > [(wq), then there exists an integer i (0 S i S q— 1) such that €(wg) > €(w,-+1). Choosing the value of i as the least such in- teger gives [(wo) < [(wl) < < ((21);) > €(w;+1). By Lemma 5.5, €(w;+1) S ((wq), and by extension, €(w;+1) < €(w;+2) < - - - < [(wq). Hence, the path contains exactly one boundary, channel (wi, w,-+1). D Proof of Theorem 5.1 With the use of the above lemmas, the three assertions stated in the theorem are now proved, in turn. Assertion 1 (minimal paths): Since the path routing function RUTPR selects only useful dimensions, then all paths are minimal. Assertion 2 (deadlock-free): In order to show that the network is deadlock-free, we first define a total ordering on the virtual channels of the network, and then show that all messages reserve virtual channels in an order that is consistent with this total ordering. For any channel cuad, define A(cuad) as follows. 0.€(u).d if a = ‘p’ /\(Cuad) = 1.!(u).d if a = ‘h’ A lexicographical ordering of the three-part labels, /\(c), of each channel, c, defines a total ordering of the virtual channels. All p—channels precede all h-channels in this ordering; among the same virtual channels set, the ordering is determined by the 110 label, (3(a), of the source node, u. Finally, channels of the same virtual channel set and source node are ordered by the dimension, d, in which they travel. We consider two cases with respect to the ’H-cycle, (P. Case 1. Q is an ’H-chain: By Definition 5.1, ((21,) < €(u;+1) for 0 S i < m — 1, and by Lemma 5.6, the path from node ac to node um-1 does not contain a boundary; thus, from Equation 5.3, the path contains only p-channels. Virtual channels are therefore used by the path from node no to node um_1 in an order that is consistent with the total ordering described above. Case 2. is not an ’H—chain: Since (I) is an ”H-cycle, and is not an ’H-chain, then there exists an integer, q, 1 S q S m — 1, such that [(uo) > [(um_1) and €(u0) < ((ul) < < ((219-1) > €(uq) < [(uqH) < < [(um-1) By applying Lemma 5.6 to each pair of consecutive nodes in (I), the path from node an to node um_1 contains exactly one boundary. Thus, by Equation 5.3, the path uses p-channels prior to the boundary and h-channels thereafter, and from Definition 5.1, both the sequence of p-channels and the sequence of h—channels on the path visit nodes in an ascending order of node labels, [(u). Thus, the order of all virtual channels on the path from node no to node um_1 is consistent with the total ordering described above. Since all messages reserve virtual channels in an order that is consistent with the total ordering of virtual channels defined above by /\, then cycles of channel dependency, and thus deadlock, cannot occur. Assertion 3 (distinct physical channels): From the above proof of Assertion 2, it is clear that the path from node uo to node um._1 does not visit any node more than once, therefore, it cannot contain a multiple occurrence of a physical channel. Cl 111 Because unicast communication is also essential, it should be supported efficiently and in a way that is compatible with other network communication such as multi- destination messages. Since a unicast message is a special case of a multi-destination message in which there is only one destination node, it follows that minimal, deadlock- free unicast routing is also provided by the above routing mechanism. Furthermore, all combinations of multi-destination and unicast messages can coexist without pos- sibility of network deadlock. 5.6 Implementation Issues Because the path routing function must be implemented in hardware, it must be simple. Thus, the path routing function must be designed so that the output channel on which an incoming message is to be forwarded can be quickly determined by exam- ining only the first flit of the message header (which indicates the next destination), and perhaps also taking into account the channel on which the message is arriving. Also, the router must be able to quickly decide if the current node is a destination of the incoming message, and if so, whether there are additional destinations to which the message must be forwarded. 5.6.1 Multi-Destination Message Format With unicast wormhole routing, the header flit of a message contains the address of the destination node (absolute addressing), or perhaps some indication of the direction and distance from the current node to the destination node (relative addressing). In path-based routing for messages with arbitrary destination sets, the message worm can be prefixed with a list of destination node addresses. The order of these addresses corresponds to the order in which the destination nodes will be visited. Once the message header reaches the first destination node in the list, the router at that node 112 removes its own address from the head of the list and forwards the remainder of the message worm towards the next destination node, while simultaneously copying the message contents to the local host memory. The router at the last destination node copies the message to the local host memory, but does not forward the message. In this way, the address of the next destination node, used by the router at each node to determine the next node in the path of a message, is always at the head of the message worm. Thus, as the head of the message advances through the network, each router at which the message arrives need only examine the first flit of the message in order to identify the outgoing channel over which the message will be routed. In the case of intermediate destination nodes, the router first identifies that the first flit of the message is the local address, and after discarding this flit, considers the next flit, which represents the address of the next destination, to determine message routing. Upon recognizing that the message is addressed to the local node, the router also prepares to copy the flits of the message body to the local host. In order to identify the last destination address in the message header, and con- sequently, the beginning of the message body, the address of the last destination node can be duplicated in the message header. The occurrence of two consecutive and identical destination addresses then signifies the end of the destination address list. The message format corresponding to the ’H-cycle Q = {uo,u1,u2, . . . , um_1}, which represents a multi-destination message from source node no to destination nodes u1,u2, . . . , um_1, is depicted in Figure 5.9. "2H Hum-I head tail “I u,,..; message body Figure 5.9. Multi-destination message format 113 This format represents the original message, as constructed by the host at the source node, uo. As the message progresses through the network, the header becomes shorter as the routers at each destination node remove their respective addresses from the head of the message. In order to reduce the message header length, destination encoding schemes have been proposed that avoid the need to list each destination address separately. For example, Kim and Kim [31] describe the use of address masks, which allow “don’t- care” bits in the address field in order to match a group of destination addresses with a single entry in the message header. While such address encoding methods can be useful in reducing the length of the message header in cases where the destination nodes form a regular pattern, they add complexity to the routing hardware, and fail to offer an advantage with arbitrary destination sets. 5.6.2 The Flit Forwarding Algorithm Figure 5.10 shows the flit forwarding algorithm implemented in a router in order to support the proposed path-based routing. The recv function reads the next incoming flit, while the send function transmits a flit over the specified outgoing channel. In cases where channel contention occurs so that a flit cannot be sent over the specified channel, the send function can be considered to block until the channel is available. The send-local function copies a flit to the local host memory. The flit forwarding algorithm, as shown in Figure 5.10, is invoked for each incoming message; thus, there may be concurrent invocations of the algorithm if there are messages arriving concurrently on different incoming channels. The above discussion assumes the use of absolute addressing. In the case of relative addressing, rather than duplicating the address of the last destination node, the list of destination addresses can simply be appended with a zero offset (which, conceptually, is equivalent to duplicating the last address, since an offset of zero from 114 The Multi—Destination Flit Forwarding Algorithm Input: Local node u and incoming virtual channel cwad. Output: Forward incoming message to outgoing channel and /or local host, as appropriate. Procedure: flitl = recv (cwad) ; if flitl = u then // u is a destination isDest = true ; flit2 = recv (Cwad) ; if flit2 = flitl then // u is the last destination isLastDest = true else isLastDest = false ; nextDest = flit2 endif else // u is not a destination isDest = false ; isLastDest = false ; nextDest = flitl endif if not isLastDest then c = RUTPR (u,a,nextDest) ; // set output channel send (c, nextDest) // send first flit (address of next dest.) endif if isDest then while flit2 # flitl do // skip to message body flitl = flit2 ; flit2 = recv (cwad) ; send (c, flit2) endwhile endif while not end-of-message do flitl = recv (cwad) ; if isDest then send-local (flitl) ; if not isLastDest then send (c, flitl) ; endwhile Figure 5.10. The flit forwarding algorithm for path-based routing 115 the last destination is, indeed, the last destination). In conjunction with this change, very minor adjustments to the flit forwarding algorithm are also required in order to support relative addressing. 5.6.3 Boundary Identification In order for the router at a node, u, to implement the path routing function RUTH; given in Equation 5.3, the outgoing channels at node at that are boundaries must be identified. Of course, one way for node a to accomplish this task is to compute the address of each neighbor of u using Equation 5.1, and to then compute the labels of these neighboring nodes using Equation 5.2. The boundary channels can then be identified by comparing the label of u with those of the neighboring nodes, as described in Definition 5.1. There is, however, a much simpler method of determining which of the routing dimensions correspond to boundary channels, as provided for by Lemma 5.7. Lemma 5.7 For any node u, and any dimension d (0 S d S n — 1), channel (u,ud) is a boundary if and only if. (”if oj(u)) mod k = k —1 j=d Proof: The lemma follows directly from Lemma 5.1. II] The configuration of boundaries in a network is completely static; thus, the bound- ary dimensions for each node can be identified once during system startup, and need never be recomputed. Alternatively, the identification of boundaries can be made a part of the node’s static configuration, much the same as the local address of a node is part of that node’s static configuration. 116 5.6.4 Implementation of the Path Routing Function The formal definition of the path routing function RUTPR, given by Equation 5.3, is intended to convey the concept of how the routing path of a message is deter- mined; it does not, however, represent an efficient implementation of the function. An algorithmic view of the function RUTPR, more suitable for implementation in a router, is given in Figure 5.11. The predicate boundary simply indicates whether an outgoing channel on the specified dimension is a boundary. For messages originating at the local node, u (as opposed to those that arrive on network channels), the input parameter a is set initially to ‘p’. The Path Routing Emotion Input: Local node u, next destination node v, and virtual channel set of incoming message a 6 {‘p’,‘h’}. Output: Outgoing virtual channel set and dimension. Procedure: fori=0ton—ldo if o,(u) 75 (5(1)) then // useful dimension? if not boundary(i) then return (a,i) // lowest non-boundary dimension else d = i endif endif endfor return (‘h’, d) // highest boundary dimension Figure 5.11. Implementation of the path routing function RUTPR 117 In actuality, when the dimensionality, n, of a torus is small, as is the case with all current and proposed machines, the RUTH; path routing function can be easily implemented with a simple combinational logic circuit. 5.7 Conclusions In this chapter, the area of path-based message routing in unidirectional torus net- works with IR capabilities has been studied. An efficient, deadlock-free path-based routing method was presented. This method is deadlock-free under all combinations of network traffic, provides minimal routing paths between subsequent destination nodes, and requires only the lower bound of two virtual channel sets. In the following chapter, this routing mechanism is used in the development of path-based multicast algorithms for torus networks with unidirectional communication links. CHAPTER 6 Path-Based Multicast in Unidirectional Torus Networks In this chapter, the path—based routing mechanism described in Chapter 5 is used as a basis for a family of efficient, deadlock-free path-based multicast algorithms for torus networks with unidirectional communication links. In order to perform a multicast operation using multi-destination messages, one or more communication steps may be used. Methods that reach all destination nodes in one communication step are termed single-phase, while those that require more than one step are called multi-phase. During the first phase of a multi-phase multicast, the source node sends a single message to a subset of the destination nodes. During subsequent phases, some (perhaps all) of the nodes that have already received the message each send a multi-destination message to a distinct subset of the nodes that have not yet received the message. This process continues until the message has reached every destination node. The remainder of this chapter is organized as follows: In Section 6.1 a single-phase multicast algorithm is presented. This algorithm follows directly from the path-based routing method presented in Chapter 5. A generalized multi-phase multicast algo- rithm is described in Section 6.2. This algorithm constitutes a family of path-based 118 119 multicast algorithms. In Section 6.3, a specific instance of this generalized multicast algorithm is presented. By incorporating the topology of the torus network when partitioning the multicast destinations into individual multi-destination messages, the algorithm completes a multicast operation in n phases, where n is the number of dimensions in the network. Another instance of the generalized multi-phase multicast algorithm is described in Section 6.4. This algorithm partitions the destination nodes so that all messages produced by a multicast operation are addressed to a nearly equal number of destination nodes. Implementation issues specific to multi-phase multicast algorithms are addressed in Section 6.5. In Section 6.6, the results of a simulation study are presented. This study compares the performance of the multicast algorithms developed in this chapter, as well as a unicast-based multicast method based on the same routing mechanism. Conclusions are presented in Section 6.7. 6.1 The S-Torus Multicast Algorithm The path-based routing method described in Chapter 5 provides a mechanism whereby any node can send a single multi-destination message to an arbitrary set of destination nodes within the network. When a single message is used in this way to perform a complete multicast operation, we call this process the S—torus multicast algorithm (‘3’ for single phase). There are several advantages to implementing multicast communication with a single multi—destination message. The operation requires only one communication step. Hence, for long messages, full advantage is taken of the communication pipelin- ing of wormhole routing. Also, the only processor involved in the operation is that of the source node. Each destination node receives the message via the node router, without the need to use the local processor to relay the message to other destination nodes. 120 However, the S-torus algorithm also suffers from some disadvantages. For example, a single message used to reach all nodes in the network, as in the case of a broadcast operation, will have a total path length of N — 1. Such extremely large path lengths are likely to result in poor performance in large networks due not only to the length of the path traveled by the message flits from the source node to the last destination node, but also due indirectly to the network congestion caused by the many communication channels reserved by the message for the duration of the operation. In addition, a single—phase approach to multicast does not exploit the communication parallelism that is possible when concurrent messages are used to complete the operation. 6.2 The M-Torus Generalized Multicast Algo- rithm Due to the limitations of the single-phase S-torus algorithm, we also consider multi- phase multicast algorithms. We describe a generalized multi-phase multicast algo- rithm that combines multi-destination messages in order to form a coverage of the destination nodes. Since this algorithm uses only messages that have been shown to result in a deadlock-free network, then the algorithm is also deadlock-free. We assume that a multicast operation is described by an H-cycle, Q, where the first element of Q is the destination node. We define a linear partitioning of an 'H-cycle, which will be used as a basis for the generalized multi-phase multicast algorithm. Definition 6.1 A linear partitioning of an ’H-cycle Q = {uo, ul, U2, ..., um_1} is a set of non-empty sub-sequences Q = {Q0, Q1, Q2, ..., Q,._1} such that r 2 2 and Q = Q0 || Q1 || Q2 || || Q,_1, where the symbol ‘|| ’ represents concatenation of lists. Each element, Q,-, off) is written as Q.- = {11,-,0, um, ..., u,,(m‘._1)}. 121 Lemma 6.1 If!) 2 {Q0, Q1, Q2, ..., Q,_1} is a linear partitioning of an ’H-cycle Q, then 1. Each sub-sequence Q,- is an H-cycle, 0 S i S r — 1. 2. At least r — 1 of the r sub-sequences are ’H-chains. The remaining sub-sequence is either an ’H-chain or an ’H-cycle. Proof: The lemma follows directly from Definitions 5.2, 5.3, and 6.1. D Given an H-cycle Q that describes a multicast operation, a linear partitioning of Q can be used to define the first phase of a multi-phase path-based multicast operation. In this first phase, the source node sends a multi-destination message according to the path-based routing method described in Chapter 5. The destination nodes of this message comprise the first elements of each of the partitions of Q (except that the source node, which is always the first element of the first partition, is, of course, not included as a destination). After this first phase is complete, the multicast problem has, in effect, been partitioned into a set of smaller multicast problems, each corresponding to one of the partitions of Q. The source node of each of these new multicasts is the first element, and the destination nodes are the remaining elements, of the respective partition. The M-torus multicast algorithm is shown in Figure 6.1. This algorithm imple- ments a multi-phase path-based multicast operation on an input sequence that has been arranged as an 'H-cycle. As shown in Theorem 6.1, the resulting multicast operations are deadlock-free, while the constituent multi-destination messages are contention-free. Theorem 6.1 The M-torus algorithm applied to an ’H-cycle Q = {uo, ul, U2, , um..1} results in a contention-free multicast from source node no to destination nodes u1,u2, ,um_1. Furthermore, the linear partitioning need not be consistent across 122 The M-Torus Multicast Algorithm Input: Message, M, and H-cycle Q = {uo, ul, uz, ..., um_1}, where no is the local address. Output: Performs a multi-phase, path-based multicast to destination nodes u], U2, . . . , um_1. Procedure: if |Q| > 1 then 1. Let Q = {Q0, Q1, Q2, , Q,..1} be a linear partitioning of Q. 2. Send a multi-destination message to the sequence of destination nodes {u1,0, um, ..., u(,_1),o}. 3. Each node um (O S i S n — l) invokes the M-torus algorithm, recursively, with the input ’H-cycle set to (Pg. endif Figure 6.1. The M-torus algorithm for multicast the distributed invocations of the algorithm, nor over the successive invocations at any single node. Proof: From definition 6.1, each partition is written as Q,- = {u,,o, um, . . . , u,-,(m‘.-1)}. For each partition Q5, range(Q,-) is defined as follows. {u |€(u,-,o) S [(u) S [(u,,(m-1))} If T; IS an ’H-chain range(Q.-) = {u | [(um) S [(u) or [(u) S [(u,,(m-1))} if Q,- is not an ’H-chain From Lemma 5.6 follows that the above sets are all disjoint, that is, range(Q.-) 0 range (Q) = (0 whenever i 75 j. We now consider the multi-destination messages generated within a partition, say partition Q.-. The first such message is generated by node um. Since the source and destination node list of each such message is ordered according to a sub-sequence (partition) of the ’H-cycle (Pg, then by Lemma 5.6, every node visited by the message 123 is an element of range (Q;). We therefore conclude that, once the original multicast operation has been partitioned into r sub-problems, each corresponding to an ’H-cycle, the messages generated by these sub-problems do not visit common nodes. That is to say, there is no contention between any two sub-problems. Since the message generated in Step 2 of the algorithm (Figure 6.1) precedes all other messages of the multicast, then it cannot contend with any other message produced by the operation. Recursively, each multicast sub-problem corresponding to one of the partitions, Q,-, is itself contention-free. Therefore, the entire multicast operation is contention-free. Since no assumptions have been made in this proof regarding the nature of any of the partitionings performed in Step 1 of the algorithm, except that they are linear partitionings, then the second assertion of the theorem is also valid. C1 The M-torus algorithm actually represents a family of multi-phase multicast al- gorithms, whose specific instances are determined by the partitioning method used in Step 1. For example, if the input ’H-cycle is partitioned into sub-sequences whose lengths are all one, then the M—torus algorithm will produce a single-phase multi— cast equivalent to the S-torus algorithm. On the other hand, if the input H-cycle is always partitioned into exactly two subsequences, then all messages generated by the algorithm will have only a single destination, thus resulting in a unicast—based implementation. Between these two extremes are a multitude of partitioning schemes, each resulting in a different version of the M-torus algorithm. We will examine two particular partitionings; namely (1) dimensional partitioning and (2) uniform partitioning. 124 6.3 The Md-Torus Multicast Algorithm By examining Statement 2 of the M-torus algorithm (Figure 6.1) and the definition of a linear partitioning (Definition 6.1), it can be seen that the multi-destination messages generated by the M-torus algorithm must reach destination nodes that may be widely distributed over the input H-cycle. Since higher-dimension channels traverse a greater span of ’H than do channels of lower dimension (we refer to Equation 5.2 and, for example, Figures 5.4 and 5.5), we consider partitionings that result in messages that cross higher-dimension channels between subsequent destination nodes. In this way, the path length of the constituent messages of the M-torus algorithm can be controlled. A dimensional partitioning is such a method. A dimensional partitioning of order (I, in effect, partitions the nodes of an ’H—cycle into their respective sub-tori of dimension d. For example, in a 3D torus, as shown in Figure 5.5, each element of a dimensional partitioning of order 2 corresponds to a plane of the network. Each plane of a 3D torus is, itself, a 2D torus. In a similar manner, a dimensional partitioning of order 1 partitions a network into 1D tori, or rings, corresponding to the columns of nodes in Figures 5.4 and 5.5. Definition 6.2 A linear partitioning Q of an 'H-cycle Q is a dimensional partitioning of order (1 if and only if 1. For each sub-sequence Q0 6 fl, and for any two elements u,v 6 Q0, (7,-(u) = 05(v), for d S i S n — 1; and 2. For any two sub-sequences Qme E Q, where a at b, and any two elements u E Q, and v 6 Q5, there exists some integer i, d S i S n — 1, such that a;(u) # o;(v). By using dimensional partitionings, we create a specific instance of the M-torus algorithm, as follows (please refer to Figure 6.1). During the first communication phase, the partition, Q, is a dimensional partitioning of order n — 1. During the second phase, a dimensional partitioning of order n — 2 is used, and so forth, until the 125 nth and last phase, where Q is a dimensional partitioning of order 0, which is simply a partitioning of Q into individual nodes. This combination of the M-torus algorithm with dimensional partitioning is referred to as the Md-torus multicast algorithm. As an example, in a 3D torus, during the first phase of the Md-torus algorithm, the destinations are partitioned into their respective 2D planes. A multi-destination message is sent by the source node to a single destination in each plane (among those planes that contain destination nodes). During the second phase, destinations reached during the first phase, as well as the original source node, each partition their respective 2D plane into 1D rings and send a multi-destination message that reaches one destination node in each of these rings. Finally, during the third phase, there is exactly one destination node in each 1D ring (the columns in Figure 5.5) that has received the message; each of these destinations sends a multi-destination message to cover the remaining destinations in the respective 1D ring. 6.4 The Mu-Torus Multicast Algorithm The Md-torus algorithm partitions the destination nodes based only on the structure of the underlying torus network, without regard to the actual destinations of a par- ticular multicast operation (except that partitions corresponding to sub-tori without destination nodes are not created). In some cases, it is useful to consider the structure of the set of destination nodes when partitioning the associated ’H-cycle. For example, limiting the number of destination nodes reached by any single message reduces the length of the message header (which contains the list of destination nodes addresses) and tends to also reduce the message path length. A method that allows the number of destinations per message to be controlled is uniform partitioning, in which the ’H-cycle is divided into a specified number of 126 sub-sequences whose sizes are as nearly equal as possible. Uniform partitioning is defined as follows. Definition 6.3 A linear partitioning Q of an ’H-cycle Q is a uniform partitioning of size r if and only if 1. [9| = r; and 2. For each partition Q, E Q, either |Qa| = [m/r] or |Qa| = [m/rj, where m = |Q| is the size of the multicast operation. When the M-torus algorithm employs uniform partitioning, we term the result the Mu-torus multicast algorithm. The Mu-torus algorithm is parameterized by the num- ber of partitions, r, which corresponds to one more than the number of destinations of each constituent message. During the last phase of the algorithm the multicast size, m, may be less than r. In this case, the ’H-cycle is partitioned into m single-node partitions, resulting in a final message that is sent to m — 1 destinations rather than r -— 1. Since the number of destinations of each message produced by the Mu-torus algo- rithm is r — 1 (except, perhaps, during the last phase), then the aggregate number of destinations reached will grow by a factor of r during each phase, leading to the following result: The Mu-torus algorithm, with partitioning parameter r, applied to a multicast operation of size m, requires flog, m] phases to complete the multicast. In addition to the two specific methods covered above, many other linear parti- tionings are possible. As an example, one possible partitioning could combine sub-tori produced by dimensional partitioning whenever two or more adjacent sub-tori contain relatively few destination nodes, and could split single sub-tori that contain an over- abundance of destination nodes. This hybrid of dimensional and uniform partitioning would balance the purely network view imposed by dimensional partitioning with the destination—set View of uniform partitioning. 127 6.5 Multi-Phase Implementation Issues In the M-torus multicast algorithm, the original source node sends a multicast message to a set of intermediate destination nodes, which, in turn, send the message on to other destination nodes. Depending on the number of phases used to perform the operation, these new destination nodes may also be required to relay the message, and so forth. In some way, each intermediate destination node must be informed of the sequence of nodes to which it will send the message. The following are three mechanisms by which the dissemination of the multicast structure can be accomplished: (1) If a particular source node is to perform repeated multicasts to the same group of destination nodes, the necessary information can be distributed once at the time the communication group is formed. During subsequent multicasts to that group, the group identifier (GID) is attached to the message body so that each destination node can refer to the local information now associated with the group [6]; (2) The 'H-cycle for which an intermediate destination node is subsequently responsible can be included in the multicast message body; and (3) The required information can be incorporated into the message header so that the router relays to each intermediate destination node the appropriate ”H-cycle. For succinct discussion, we introduce the following notation. Let Q" be the ’H-cycle that is originally used as input when invoking the M-torus algorithm at node u. The ’H-cycle Qu thus contains those destination nodes for which node u is responsible; that is, the destination nodes that receive the multicast message, either directly or indirectly, through the processor at node ii. For example, if u is the original source node, then Q“ represents the entire multicast operation. As a practical matter, we note that when distributing the information in Q“ to node u, the first element, which is the address of node u, need not be included. The 128 notation Q“ represents the value of Q“, after the address of node u has been removed; that is, Q" = Q“ — {a}. To support the third method, in which the message header contains information regarding the multicast structure, the header format shown in Figure 6.2 can be used. head u ,_1 l message body tail Figure 6.2. Compound message format Figure 6.2 depicts a message that will be transmitted from source node no to des- tination nodes ul, U2, . . . , um_1, where node uo has partitioned the multicast problem represented by ’H-cycle Q into r sub—sequences Q0, Q1, Q2, .. . , Q,_1, and where u.- is the first element of Q.- (0 S i S r — 1). We term this structure a compound message format. Each destination node address, 11;, in the header of a compound message is followed by a count, w,- = |Q“" |, of the number of destination nodes for which u,- will in later phases be responsible, as well as the list, Q‘“, of those node addresses. As in the standard multi-destination message header described in Section 5.6, the address of the last destination node, u,-1, is duplicated to signify the end of the header. In this example, node uo may be the original source of the multicast operation, or it may be an intermediate destination node that has received the message and is now responsible for the destination nodes contained in Q. In either case, the actions taken by node no will be the same. 129 To process a compound message header, in addition to the activities required for a basic multi-destination message as shown in Figure 5.10, the router must perform two additional tasks: (1) When forwarding the message header to the next destination node, the included H-cycles must also be forwarded; and (2) When the local node is a destination of the message, the ’H-cycle immediately following the local node address must be forwarded to the local host. After a compound message has been processed by the router at a node that is a destination of the message, the local host will have received, in addition to the message body, the ’H-cycle needed as input when invoking the M-torus algorithm at the local node. In the implementation of a system supporting path-based multicast, the choice between the above three methods of distributing the multicast structure depends on several factors. The first method, in which the structure information is distributed initially upon creation of a group of communicating nodes, is efficient only in cases where the same source and destinations will be involved in repeated multicasts. This method also requires that each destination node maintain a local group table in which information is stored for each group to which the node belongs. If the multicast structure is added to the message body, as in the second method described above, one difficulty is that every destination node of a multi-destination message must receive the information intended for all of the other destination nodes. Without additional router support, the same message body must be delivered to each of the destination nodes of the message; therefore, there is no way to distinguish the structural information that is needed by a particular destination. Instead, the local host at each destination node must locate and use the correct ’H-cycle from the sequence of ’H-cycles received in the message body. An advantage of this method is that no additional hardware support is needed beyond that required for single-phase multicast. 130 Minor enhancements to the routing hardware, however, are required to support the compound message format. As stated above, the router must forward the correct ’H-cycle, included in the message header, to the local host of a destination node; and must also forward all other ’H-cycles to subsequent destination nodes. In return for this additional router capability each destination node will receive only the 'H-cycle it requires in order to perform its portion of the multicast. The compound message format does add overhead in the case of simple multi-destination messages, which are used during the last phase of a multi-phase operation, and for single-phase methods, which may also be used on systems supporting multi-phase operations. When using the compound message format for a simple multi-destination message in which no multicast structure is being distributed, the ’H-cycle for each destination node is null; however, in order for the router to process the message header, the associated ’H-cycle size counts, all zero, must be included. Thus, the compound message format incurs an overhead of one additional header flit for each destination node when used for a simple (non—compound) message. To efficiently support both compound and non-compound multi-destination mes- sage formats, it is possible to incorporate a message type indicator into the message header. This indicator would distinguish between the two types of multi-destination message formats, and could even identify a unicast message, so that the most efficient message format could be used for each type of message. However, in order to interpret the message type information and adjust the router functionality accordingly, even more extensive hardware support is required. 6.6 Performance Evaluation In order to better understand the performance of the multicast algorithms presented in this chapter, a simulation study has been conducted in which these algorithms, 131 as well as a unicast-based multicast algorithm, are examined. The system model for the simulation is the same as assumed throughout this chapter, that is, a k-ary n-dimension torus with unidirectional communication links and all-port architecture. In order to evaluate the performance of the multicast methods through simulation, specific time values were chosen for the following message latencies. The software overhead at the message source and destination node are represented respectively by the message send latency, rs, and the message receive latency, T3. The combined send and receive latencies are referred to as the message startup latency. The time required for a message in the network to advance one flit is represented by the per-flit network latency, Tn. All simulations were performed for a 4096-node torus. Both 2D (16 x 16 x 16) and 3D (64 x 64) topologies were examined, as well as various message lengths, multicast sizes, and message latencies. To provide example cases for the simulation, each mul- ticast set was produced by selecting, from all nodes in the system, the appropriate number of unique nodes under a uniform distribution model, using a random number generator. Statistics for each configuration are averaged over 400 trials. Five specific multicast algorithms are simulated: the single-phase S-torus algo- rithm, the Md-torus algorithm, two instances of the Mu-torus algorithm, and for comparison, a unicast-based multicast algorithm. To examine the effect of the par- titioning parameter, r, on the Mu-torus algorithm, two such parameter values have been chosen; they are r = 8 and r = 64. We refer to the corresponding algorithms as Mu-torus(8) and Mu-torus(64), respectively. As a comparison to the path-based methods presented in this chapter, an effi- cient unicast-based multicast algorithm is also simulated. This algorithm uses the same minimum-path routing function, RUTPR, as do the path-based methods. The unicast-based algorithm is essentially the Mu-torus algorithm with parameter value r = 2. To complete a multicast of size m, the algorithm therefore requires [logz ml 132 phases, or communication steps, which has been shown in Chapter 4 to be optimal for unicast—based methods under the one-port model and is also the best known result for multicast in all-port torus architectures. Thus, the simulated unicast-based method is efficient and serves as a useful comparison to the studied path-based methods. Figure 6.3 shows the performance of the various multicast methods on a 2D (64 x 64) torus with message latency values of rs = 95usec, 73 = 75psec, and Tn = 0.5,usec. These values represent the relatively high message startup latencies associated with many of the current wormhole-routed computers. Both average and maximum multicast latencies are shown; the former being the elapsed time from the initiation of the multicast operation at the source node to the reception of the message at each destination processor, averaged over the destination nodes; and the later being the maximum time over all destinations of the multicast message. Results are shown for message lengths of 8, 512, and 16384 flits. With this configuration, the results show clearly that all of the path-based methods perform better than the unicast-based operation, except for very small messages. Even for small messages, the Mu-torus(8) algorithm performs much better than the unicast-based method. In general, as the message length is increased, methods that use fewer communication phases tend to perform better. This behavior is due to the pipelining of wormhole routing. Among path-based algorithms, the method providing the best performance de- pends greatly on the message length. The S-torus algorithm performs very well for large messages, whereas the Mu-torus(8) algorithm is better with short and medium message lengths. For the results shown in Figure 6.4, lower message startup latencies were used to simulate the architecture of state-of-the—art MPCs. The respective latency values are rs = 10psec, T3 = 8usec, and as before, Tn = 0.5psec. Again, message sizes of 8, 512, and 16384 Hits were simulated. Under this configuration, the unicast-based 133 method exhibits performance that is nearly identical to the Mu-torus(8) algorithm for small messages, but still performs poorly, compared to the path-based methods, for medium and large messages. Again, with large messages, the S-torus algorithm shows the best performance. Figures 6.5 and 6.6 show the results of simulating a 4096-node 3D (16 x 16 x 16) torus; the configurations are otherwise identical to those associated with Figures 6.3 and 6.4. Except for small messages, the performance of all algorithms is very similar to the corresponding 2D case. Compared to the unicast-based and Mu-torus(8) al- gorithms, the S-torus, Md-torus, and Mu-torus(64) algorithms produce relatively few communication phases, and consequently, tend to generate individual messages that are addressed to a larger number of destination nodes. The performance of these latter three algorithms increases on the 3D torus, as compared to the 2D torus, due to the more dense interconnection network and resulting shorter path lengths of the 3D topology. In Figure 6.7, the multicast latencies are plotted against the message length for a multicast size of 512 nodes. The results of using both the high and low message startup latencies described above are presented. The results show that the amount by which the algorithm is effected by an increase in the message size is directly related to the number of communication phases used by the algorithm. The S-torus algorithm, with only one phase, is least effected by the message length, while at the other extreme, the unicast-based method, requiring I'log2 512] = 9 phases, is most effected. An important attribute of any communication operation is the resultant amount of network congestion. To investigate the amount of network traffic produced by the above multicast methods, the total number of link visits was recorded for each sim- ulated multicast operation. Each link visit represents the use of one communication link by one message. Multiplying the number of link visits involved in a multicast 134 operation by the message length and by the per-flit network latency, Tn, produces a value equivalent to the summation, over all links in the network, of the total time during which the link is being used by the operation and is therefore unavailable. Figure 6.8 depicts the resultant link usage for both 2D and 3D 4096-node torus networks, for various multicast sizes. As shown, the path-based algorithms require the use of fewer communication links than does the unicast-based method, especially with the 2D topology. 6.7 Conclusions In this chapter, the area of multicast algorithms for unidirectional torus networks with IR capabilities has been studied. Specifically, several efficient path-based multicast algorithms were presented. The S—torus multicast algorithm uses a single multi- destination message to perform an arbitrary multicast operation. The S-torus algo- rithm was extended to the M—torus algorithm, a generalized multi-phase multicast algorithm, in which a combination of multi-destination messages is used to perform a multicast in one or more communication phases. Two specific instances of the M- torus algorithm, the Md-torus and Mu-torus multicast algorithms, were presented. These algorithms produce contention-free multicast operations and are deadlock-free under all combinations of network traffic. As a way to better gauge the real performance of the techniques presented in this chapter, a simulation study of the proposed multicast algorithms was conducted. The results of this study show that the path-based multicast algorithms presented in this chapter, together with the path-based routing method presented in Chapter 5, offer significant performance gains over unicast-based multicast techniques. By using the proposed algorithms, unidirectional torus systems with IR capability are able 135 to perform efficient unicast and path-based multicast operations within a variety of environments. Avg. Latency (msec) AVS- 13‘ch (“1500) Avg. Latency (msec) 136 3 Unicast-based -6---- Unicast-based -8---- 2-5 ‘ S-Torus ....m ' 2.5 - S-Torus ~+~ .l Md-Torus ....... ’9‘ Md-Torus ....... g 2 _ Mu-Torus(8) -+-— q a 2 L MuwTofii's'(8) -+-- . Mu-Torus(64) "n"- v -Torus(64) "x": ................... a >\ ........... 1.5 ~ J E 1.5 ~ 1;: 4g: ----------------- h . “...;“fi- 5 ........................... c .3 2"”. 1 __ 2'3' ... _. ”)9“ g, 1 __ g _4 2,; - 2 fi____‘/‘ 0.5 ~ WM 0.5 » l 0 J L L I 0 I I I L 0 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (a) Message length: 8 flits 6 I I I l 6 1 I T I Unicast-based -a---- Unicast-based -3-.- 5 - S-Torus ...”-.. q 5 _ S-Torus .--...“ _j Md-Torus -----* g Md-Toms --* ------- a 4 Mu-Torus(8) —+— . E 4 _ Mu-Torus(3 .... ................. . Mu-Torus(64) -* - ”a v Mu-g‘gms -*--- ............. >. .. -.-.av ------ o -" 3 h I,” ........ .. § 3 r- ,F' 3 2 ~ ‘3' 4 :5 2 - ____—————-—:X" 2 l - a 1 - ~ 0 L I i I O 4 J_ I l 0 128 256 512 1024 0 128 256 512 1024 Multicast Size Multicast Size (b) Message length: 512 flits 1m I I I 1m T j T T Unicast-based -B--- Unicast-based -B-- 80 _ S-Torus 4..-- g 80 _ S-Torus % ------------------- -0_ Md—Torus --‘ ---------- a g Md- Jaguar- e“ """" Mu-Torus(8}’ 1t:- ---------- a M- "5(8) —+—— 60 _ Mu-Toxnsf64 -*--- q I 60 _ PMu-oru rus(64) -*--- q 1}," E ‘3" .. 0 . j 40 . . g . t / 20 ” fl/i‘ 2 20 " (.~p-——-u——--—--—a- --------- 4" N) f... ..a» — r , .’—v e 6 c (...; e e A; O 41 L J I 0 I L I I 0 128 256 512 1024 0 128 256 512 1024 Multicast Size Multicast Size (c) Message length: 16384 flits Figure 6.3. Multicast latency (4096-node 2D torus, high message startup latency) 137 3 I I I I 3 I I T I Unicast-based -€+--- Unicast-based 43;" 2.5 - S-Torus 4a.... 4 2.5 .. S-Torus .....M... . A Md-Torus -‘--- ’3? Md-Torus ..-.... S 2 _ Mu-Torus(8) -°-- _ a 2 P Mu-nggstg} * c j 5 Mu—Torus(64) -*--- v Murfo'rus(64) "x"- >s >5 8 1.5 ~ 4 ‘=’ 1.5 - .1" 4 3 3 I, '3 A A 3 ,vix—-':rt?-‘-‘-’-‘-7 ----- x """""""" 2 ab 1 .. x". ..................... v v. § 1 _ .1, ‘1‘..- q < /"/ .._ 2.: “n: 2 ‘ 0.5 - {zi‘fi’m ‘ 0.5 ' _ 3‘ a H— 5L '61 M- W 0 I I I I O I L I I 0 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (a) Message length: 8 flits 6 T I fir 6 f I I r Unicast-based -B~--- Unicast-based -B--'- 5 ) S-Torus ~0- - 5 - S-Torus .....- - A Md-Torus ....... f3? Md-Torus "h" § 4 Mu-Torus(8) -4— j a 4 _ Mu-Torus(8) +— . é Mu—Torus(64) -*--- v Mu-Torus(64) -*--- >5 >5 E 3 b :5: 3 b .43 ............................. a" .3 2 _ .3 2 F B’,,.:;9,:::::::ZZ .... e e °° :6 a" ...: ---: 3: 2 “firs. l P l - /‘—‘——/ - 0 I I L I O L L L I O 128 256 512 1024 0 128 256 512 1024 Multicast Size Multicast Size (b) Message length: 512 flits I“) T I I I la) I I I T Unicast-based -a--- Unicast-based -&--- 80 _ S-Torus «.... . 80 _ S-Torus ...m. ,,,,,,,,,, a "‘ Md-Torus -" """" a A Md-Toru§. gen '''''''''' l § Mu-Toms(8;a_j;: .............. ' 2 Mu-Ionrsm) *— 5 _ Mu-Torus(64 -*--- ‘ a _ F—‘MuTorus(64) -..--- _ 5' 5‘ .F G p" C [3' 3 ,I’ 8 .3 r- B - 3 40 - _, ab 3 / > P v <20-//""f/: 220-... ... a. is ”gr-0"" — i I, I L A I’ e e e e r--¢ v v v 0 I I I I o I I LL I 0 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (c) Message length: 16384 flits Figure 6.4. Multicast latency (4096-node 2D torus, low message startup latency) 138 Unicast-based -G--- Unicast-based -a--- 2-5 ’ S-Torus ”’W“ ‘ 2.5 +- S-Torus ww- . "‘ Md-Torus ....... g Md—Toms ....... § 2 _ Mu-Torus(8) -+- . a 2 Mu—Torus(8) -+— ................... o_ E Mu-Torus(64) -*--- v a a g 1.5 r- « g .3 1 _1 ... “ s 2 2 0.5 - 0 I I I I O I I I I O 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (a) Message length: 8 flits 6 I I I I 6 I I I I Unicast-based -a-~- Unicast-based -B---- 5 - S-Torus ”.-..-. - 5 l- S-Torus ....-- - g Md-Torus ...... g Md-Torus ...... m 4 Mu-Torus(8) -+-— q E 4 _ Mu-Torus(8) ‘17 """"""""""""""" a. 5 Mu—Torus(64) -*--- v Mu-Torus(64’f"'~*--- >5 ................ ‘9 >~. ,IG" 2 3 ,_ ........ a— ........... fl 8 3 _ PM _ 8 I’D", 8 8" .3 ,2" .3 - M” oh 2 - 8' .. :5 2 '- > <1 ................ 2 1 .. .3;- ..-- .. l r r:"""" 1 0 I I I I O I I 4 I O 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (b) Message length: 512 flits 1m I I I I 100 I I fi I Unicast-based -8--‘- Unicast—based -8---- 8O _ S-Torus ..--.. _ 80 __ S-Torus -..... .............. a4 "‘ Md-Torus -*---- § Md-Togusr A“ """" § Mu-Toms(8) -f:-_-_ """""""""""""""" a a M - ’S(8) -+-— E Mu-Torys(64?'-':N--- . v _ Mu; orus(64) -*-- . 5‘ .I‘IB’.’ 5‘ .’. g I)?” g a 3 40 . t3 . 3 40 .. . a» s > < 20 - 2 20 ~ 0 I I I L O I J I A O 128 256 512 1024 0 128 256 512 1024 Multicast Size Multicast Size (c) Message length: 16384 flits Figure 6.5. Multicast latency (4096-node 3D torus, high message startup latency) 139 3 l I I 3 I I I I Unicast-based -B---- Unicast-based -B--'- 2-5 S-Torus ”"'“"“ ‘ 2.5 - S-Torus ~9~ ~~~~~ - "" Md-Torus ...... g Md-Torus ....... § 2 Mu-Torus(8) -+—— ‘ a 2 _ Mu-Torus(8) -+- .- 5 Mu-Torus(64) -*--- v Mu-Torus(64) -*--- 3 1'5 ‘ § 1.5 - ~ - .3 3 ab 1 ................................................... .- g 1 ' ‘ < I.” ..O’ 2 ‘1’". 0.5 f/V‘I' ‘ .5 a 0 +1 "4 O 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (a) Message length: 8 flits 6 I I f I 6 I I I Unicast-based -B--- Unicast-based -B--~- 5 '- S-TOI'US “......M - S .- S-Torus ”... ..... d ’8 Md-Torus ---‘-- ’3): Md-Torus ....... 3 4 _ Mu—Torus(8) —°-— - a 4 Mu-Torus(8) -*-— ‘ é Mu-Torus(64) -*--- v Mu—Torus(64) -*--- 5‘ 3 £5 3 b d 93 3 _ ......................... B- .3 a ........................... —a 3 ,9 ............. a.-. 4, 61) 2 ” _,a- """"""" ‘ g 2 _ ,B” * s ' > a" ‘3 ,..»--" < B" ........................... no 2 39-“ _.-» l _ ........................... . 1 _ ‘H . 5:. ... _- 0 I I I I 0 I I I I O 128 256 512 1024 0 128 256 512 1024 Multicast Size Multicast Size (b) Message length: 512 flits 100 I I I 100 l I I I Unicast-based -8---- Unicast-based -B'--- 80 _ S-Torus '4- ------ ‘ 80 _ S-Torus WW ,,,,,,,, 43‘ " Md-Torus --*--- ’9‘ Md-Toru§3_:;+..- ----------- § Mu-Torus(8) -*-—_ .................. a g Mu-Iomsw) —*— 5 _ Mu-Tom§.(64)?"¥":- . v _ Mu-‘Torus(64) -*--- . > 60 ’ """ >~ 60 F' O -’a o I' 3 4o - u‘ - 3 4o - — .2. :5 > < 20 - 2 2o - x % .1: 3 6 6 O I I I I 0 I I I I 0 128 256 512 1024 O 128 256 512 1024 Multicast Size Multicast Size (c) Message length: 16384 fiits Figure 6.6. Multicast latency (4096-node 3D torus, low message startup latency) Avg' Latency (msec) Avg. Latency (msec) 140 8 , . . . 7 l Unicast-based —e.-.- ‘ S-Torus --...... 6 ' Md-Toms ....... 5 MU'TOI'US(8) _¢—— Mai Mu-Torus(64) -a--- 4 , . 3 2 1 0 8 128 246 512 1024 Message Length (flits) (a) High message 8 , t I , 7 l' Unicast-based -a--.- . S-Torus «4...... 6 ’ Md-Torus ....... . 5 MU'TomS(8) _+_ Mu-Torus(64) u»--- l 4 . ‘ 3 2 1 O 8 128246 512 1024 Message Length (flits) 0’) Low message F 1 ‘ I L UniCaSt'baScd -e.-._ _, S-TOI'US "......" ’2’ ” Md-Torus . Mu-Torus(g) __ 1”,, Mu-Torus(64) ”:73?” g C ." ,1 Max. Latency (msec) I 8 128 246 512 1024 Message Length (flits) startup latency 8 f , . T 7 ' Unicast-based -a.-.- q A S-Torus mom... § 6 ,- MCI-TONS ......A. _, E 5 Mu-T0rus(8) —+—— « E Mu—Torus(64) -..--- “/2 £3 4 .. 3! 3 E 2 1 O 8 128 246 512 1024 Message Length (flits) startup latency Figure 6.7. Multicast latency (4096-node 2D torus, 512 node multicast size) Link Visits E E § 12000 i 3’ 0 I I I I Unicast-based -8---- - S-Torus ”Ow Md-Torus "5;, Mu-Torus(8)r"—- Mu-Toms164) «m I .I. l" ’. .I‘ " '0 ,l .’ I _ _-—-——- ---—o-------— u- ....... ....... ........ 5 12 Multicast Size (a) 2D torus 0 128 256 Figure 6.8. 1024 141 Link Visits 12000 10000 8000 I I I I I Unicast-based -3-.— S-Torus W0 Md-Torus ...... Mu-Torus(8) -+— Mu-Torus(64) -*--- 0 128 256 512 Multicast Size (b) 3D torus Total link usage (4096-node tori) CHAPTER 7 Conclusions Efficient multicast communication is critical to the performance of new generation supercomputers that use massively parallel architectures. In this dissertation, we have presented research that focuses on three aspects of parallel communication ar- chitecture affecting multicast communication, namely (1) port model; (2) virtual communication channels; and (3) intermediate message reception. We have shown that the performance of multicast operations in current wormhole-routed parallel computers can be significantly improved by accounting for these three characteristics in the design of the operations. In Chapter 3, we investigated the effects of all-port architectures on the per- formance of multicast communication in wormhole-routed hypercubes. It has been demonstrated why the U-cube multicast algorithm [19], which is optimal for one-port architectures, fails to take advantage of multiple ports when they are present in the system. New theoretical results regarding contention among messages in wormhole— routed hypercubes have been developed and used to design new multicast routing algorithms and to prove that these algorithms are contention-free. The algorithms were compared in terms of the number of steps required in each, their measured execution times when implemented on a relatively small-scale nCUBE-Z, and their simulated execution times on larger hypercubes. The results indicate that significant 142 143 performance improvement is possible when the multicast algorithm actively identifies and uses multiple ports in parallel. In Chapter 4, we considered multicast communication in wormhole-routed torus networks, which require the use of virtual channels in order to provide deadlock- free message routing. The proposed U—torus algorithm applies to unidirectional and bidirectional tori of any dimension. The algorithm produces multicast trees in which the constituent unicast messages do not contend for the same channels, regardless of message length or startup latency. Moreover, the number of message passing steps required to multicast data to m — 1 destinations is [log2 m] , which is optimal for one- port architectures. A simulation study was conducted which showed that the practical consideration of sharing of physical links by constituent messages transmitted on different virtual channels had little effect on performance. In Chapters 5 and 6, we addressed architectures in which intermediate nodes on a message path are able to receive a copy of a message while simultaneously routing the message to subsequent destinations. In particular, in Chapter 5, by focusing on the torus topology, we investigated the interaction of intermediate reception capability with the use of virtual communication channels. We developed a routing method for unidirectional torus networks that not only supports efficient multi-destination messages, but also unicast communication. This routing method is deadlock-free for all combinations of multi-destination and unicast messages. In Chapter 6, this routing method was used to develop a family of path-based multicast algorithms. These algorithms are contention-free, and were shown through simulation to perform well in a wide variety of situations. The research presented in this dissertation makes three primary contributions to the field of parallel computing: (1) the application of multi-port architectures to improve the performance of multicast communication in wormhole-routed parallel computers; (2) the use of virtual channels to provide efficient software-based multicast 144 communication in wormhole-routed torus networks; and (3) the use of intermediate message reception to implement efficient path-based multicast communication in uni- directional wormhole-routed torus networks. BIBLIOGRAPHY BIBLIOGRAPHY [1] L. M. Ni and P. K. McKinley, “A survey of wormhole routing techniques in direct networks,” IEEE Computer, vol. 26, pp. 62—76, Feb. 1993. [2] J. Choi, J. J. Dongarra, and D. W. Walker, “PUMMA: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers,” Tech. Rep. ORNL/TM-12252, Oak Ridge National Laboratory, Aug. 1993. [3] J. Choi, J. J. Dongarra, and D. W. Walker, “Parallel matrix transpose algorithms on distributed memory concurrent computers,” Tech. Rep. ORNL/TM-12309, Oak Ridge National Laboratory, Oct. 1993. [4] J. Dongarra and R. A. van de Geijn, “Reduction to condensed form for the eigenvalue problem on distributed memory architectures,” Parallel Computing, vol. 18, pp. 973—982, 1992. [5] C. Trefftz, P. K. McKinley, T. Y. Li, and Z. Zeng, “A scalable eigenvalue solver for symmetric tridiagonal matrices,” in Proceedings of the Sixth SIAM Conference on Parallel Processing, pp. 602—609, 1993. [6] P. K. McKinley, H. Xu, E. Kalns, and L. M. Ni, “ComPaSS: Efficient communi- cation services for scalable architectures,” in Proceedings of Supercomputing ’92, pp. 478—487, Nov. 1992. [7] J. Choi, J. J. Dongarra, R. P020, and D. W. Walker, “ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers,” in Pro- ceedings, Fourth Symposium on the Frontiers of Massively Parallel Computation, pp. 120—127, IEEE Computer Society Press, 1992. [8] H. Xu, P. K. McKinley, and L. M. Ni, “Efficient implementation of barrier syn- chronization in wormhole-routed hypercube multicomputers,” Journal of Parallel and Distributed Computing, vol. 16, pp. 172—184, 1992. [9] K. Li and R. Schaefer, “A hypercube shared virtual memory,” in Proc. of the 1989 International Conference on Parallel Processing, vol. I, pp. 125-132, Aug. 1989. [10] Message Passing Interface Forum, “Document for standard message-passing in- terface,” Tech. Rep. CS-93-214, University of Tennessee, Nov. 1993. 145 [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] 146 J. Bruck, R. Cypher, P. Elustondo, A. Ho, and C.-T. Ho, “A proposal for common group structures in a collective communication library,” tech. rep., IBM Research Division, Almaden Research Center, San Jose, California. High Performance Fortran Forum, “Draft High Performance Fortran language specification.” (version 1.0), May 1993. S. L. Johnsson and C.-T. Ho, “Algorithms for matrix transposition on boolean n- cube configured ensemble architectures,” SIAM Journal of Matrix Analysis and Applications, vol. 9, pp. 419—454, July 1988. S. L. Johnsson and C.-T. Ho, “Optimum broadcasting and personalized commu- nication in hypercubes,” IEEE Transactions on Computers, vol. C-38, pp. 1249— 1268, Sept. 1989. D. W. Wall, Mechanisms for Broadcast and Selective Broadcast. PhD thesis, De- partment of Electrical Engineering and Computer Science, Stanford University, Stanford, California, 1980. A. Edelman, “Optimal matrix transposition and bit reversal on hypercubes: All- to-all personalized communication,” Journal of Parallel and Distributed Comput- ing, vol. 15, no. 11, pp. 328—331, 1991. D. P. Bertsekas, C. Ozveren, G. D. Stamoulis, and J. N. Tsitsiklis, “Optimal communication algorithms for hypercubes,” Journal of Parallel and Distributed Computing, vol. 15, no. 11, pp. 263—175, 1991. P. K. McKinley and J. W. S. Liu, “Multicast tree construction in bus-based networks,” Communications of the ACM, vol. 33, pp. 29—42, Jan. 1990. P. K. McKinley, H. Xu, A.-H. Esfahanian, and L. M. Ni, “Unicast-based multicast communication in wormhole-routed networks,” in Proc. of the 19.92 International Conference on Parallel Processing, vol. II, pp. 10—19, Aug. 1992. D. F. Robinson, D. L. Judd, P. K. McKinley, and B. H. C. Cheng, “Efficient collective data distribution in all-port wormhole-routed hypercubes,” in Proceed- ings of Supercomputing ’93, pp. 792—801, Nov. 1993. Accepted to appear in the Journal of Parallel and Distributed Computing. D. F. Robinson, P. K. McKinley, and B. H. C. Cheng, “Optimal multicast com- munication in wormhole-routed torus networks,” in Proc. of the 1.9.94 Interna- tional Conference on Parallel Processing, Aug. 1994. (accepted to appear). M. Barnett, D. G. Payne, and R. van de Geijn, “Optimal minimum spanning tree broadcasting in mesh-connected architectures.” Technical Report. M. Barnett, D. Payne, and R. A. van de Geijn, “Broadcasting on meshes with worm-hole routing,” Tech. Rep. TR—93-24, The University of Texas at Austin, Nov. 1993. 147 [24] C.-T. Ho and M.-Y. Kao, “Optimal broadcast in all—port wormhole-routed hy- percubes,” in Proc. of the 1994 International Conference on Parallel Processing, Aug. 1994. [25] P. K. McKinley and C. Trefftz, “Efficient broadcast in multi-port wormhole routed hypercubes,” in Proc. of the 1993 International Conference on Parallel Processing, Aug. 1993. [26] J.-Y. L. Park, S.-K. Lee, and H.-A. Choi, “Broadcasting in mesh and torus networks with circuit-switched communication.” Technical Report, 1993. [27] Y. Tsai and P. K. McKinley, “A dominating set model for broadcast in all-port wormhole-routed 2D mesh networks,” Tech. Rep. MSU-CPS-ACS-23, Michigan State University, Department of Computer Science, East Lansing, Michigan, Sept.1993. [28] C.-T. Ho and M. T. Raghunath, “Efficient communication primitives on hyper- cubes,” Tech. Rep. RJ 7932 (72915), IBM Almaden Research Center, Jan. 1991. [29] Y. Lan, L. M. Ni, and A.-H. Esfahanian, “Distributed multi-destination rout- ing in hypercube multiprocessors,” in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, pp. 631—639, Jan. 1988. [30] X. Lin, P. K. McKinley, and L. M. Ni, “Performance evaluation of multicast wormhole routing in 2D-mesh multicomputers,” in International Conference on Parallel Processing, vol. 1, Architecture, pp. 435—442, Aug. 1991. [31] D. Kim and S.-H. Kim, “0(log n) numerical algorithms on a mesh with wormhole routing,” tech. rep., Pohang Institute of Science and Technology (POSTECH), Pohang, Korea, 1993. [32] J .-S. Tseng and C.-T. King, “Efficient routing algorithms for torus networks.” Technical Report, 1993. [33] D. K. Panda and S. Singal, “Broadcasting in k-ary n-cube wormhole routed networks using path-based routing,” Tech. Rep. TR36., Ohio State University, Sept.1993. [34] D. K. Panda and P. Prabhakaran, “Multicasting using multidestination-worms conforming to base routing schemes,” Tech. Rep. TR37, Ohio State University, Sept.1993. [35] C.-T. Ho and M.-Y. Kao, “Optimal broadcast on hypercubes with wormhole and E-cube routings,” in International Conference on Parallel and Distributed Systems, pp. 694—697, Dec. 1993. [36] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Gan- mukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, 148 M. C. Wong, S.-W. Yang, and R. Zak, “The network architecture of the connec- tion machine CM-5,” in Proc. 4th ACM Symposium on Parallel Algorithms and Architectures, pp. 272—285, June 1992. [37] Intel Corporation, Paragon XP/S Product Overview, 1991. [38] C. L. Seitz and W.-.K Su, “A family of routing and communication chips based on the Mosaic,” in Proceedings of the University of Washington Symposium on Integrated Systems, 1993. [39] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Fyler, “The message-driven processor: A multicom- puter processing node with efficient mechanisms,” IEEE Micro, pp. 23—39, Apr. 1992. [40] S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb, “iWarp: An integrated solution to high-speed parallel computing,” in Proceedings of Supercomputing ’88, pp. 330—339, Nov. 1988. [41] R. E. Kessler and J. L. Schwarzmeier, “CRAY T3D: A new dimension for Cray research,” in Proc. COMPCON, pp. 176—182, Feb. 1993. [42] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed Computing, vol. 1, no. 3, pp. 187—196, 1986. [43] NCUBE Company, NCUBE 6400 Processor Manual, 1990. [44] B. Duzett and R. Buck, “An overview of the nCUBE 3 supercomputer,” in Proc. Frontiers ’92: The 5th Symposium on the Frontiers of Massively Parallel Computation, pp. 458—464, Oct. 1992. [45] W. J. Dally, “Performance analysis of k-ary n-cube interconnection networks,” IEEE Transactions on Computers, vol. 39, pp. 775-785, June 1990. [46] C. L. Seitz, “The cosmic cube,” Communications of the ACM, vol. 28, pp. 22—33, Jan. 1985. [47] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W.-K. Su, “The architecture and programming of the Ametek Series 2010 multicomputer,” in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, vol. I, (Pasadena, CA), pp. 33—36, ACM, Jan. 1988. [48] J. Duato, “On the design of deadlock-free adaptive routing algorithms for multi— computers: Theoretical aspects,” in Proc. Parallel Architectures and Languages Europe Conference (PARLE), pp. 234-243, 1991. [49] D. H. Linder and J. C. Harden, “An adaptive and fault tolerant wormhole routing strategy for k-ary n-cubes,” IEEE Transactions on Computers, vol. 40, pp. 2—12, Jan. 1991. 149 [50] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” in Proc. of the 19th Annual International Symposium on Computer Architecture, pp. 278—287, May 1992. [51] P. Berman, L. Gravano, J. Sanz, and G. D. Pifarre, “Adaptive deadlock- and livelock-free routing with all minimal paths in torus networks,” in Proc. 4th ACM Symposium on Parallel Algorithms and Architectures, pp. 3—12, June 1992. [52] Z. Liu and H. Wu, “Performance evaluations of adaptive wormhole routing in 3D mesh networks,” in Proc. 26th Annual Simulation Symposium, 1993. [53] Z. Liu, J. Duato, and L.-E. Thorelli, “Grouping virtual channels for deadlock- free adaptive wormhole routing,” in Proc. Parallel Architectures and Languages Europe Conference (PARLE), pp. 254—265, June 1993. [54] L. Schwiebert and D. N. Jayasimha, “Optimal fully adaptive wormhole routing for meshes,” in Proceedings of Supercomputing ’93, pp. 782—791, Nov. 1993. [55] P. T. Gaughan and S. Yalamanchili, “A family of fault-tolerant routing protocols for direct multiprocessor networks.” Technical Report. [56] H. Sullivan and T. Brashkow, “A large scale homogeneous machine,” Proceedings 4th Annual Symposium in Computer Architecture, pp. 105—124, 1977. [57] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE Transactions on Computers, vol. C-36, pp. 547— 553, May 1987. [58] W. J. Dally, “Virtual channel flow control,” IEEE Transactions on Computers, vol. 3, pp. 194—205, Mar. 1992. [59] J. Duato, “A new theory of deadlock-free adaptive multicast routing in wormhole networks,” in Proc. of the 1993 IEEE Symposium on Parallel and Distributed Processing, pp. 64—71, Dec. 1993. [60] Y. Lan, A.-H. Esfahanian, and L. M. Ni, “Multicast in hypercube multiproces- sors,” Journal of Parallel and Distributed Computing, pp. 30—41, Jan. 1990. [61] P. K. McKinley and C. Trefftz, “MultiSim: A tool for the study of large-scale multiprocessors,” in Proc. 1993 International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Networks (MASCOTS 93), pp. 57—62, Jan. 1993. "‘[[[ilillllllllES