MULTICASTING IN MULTISTAGE INTERCONNECTION NETWORKS

This is to certify that the dissertation entitled MULTICASTING IN MULTISTAGE INTERCONNECTION NETWORKS, presented by Chi-Ming Chiang, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science. Major professor, 1995.

MULTICASTING IN MULTISTAGE INTERCONNECTION NETWORKS

By

Chi-Ming Chiang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1995

ABSTRACT

MULTICASTING IN MULTISTAGE INTERCONNECTION NETWORKS

By Chi-Ming Chiang

Multicast communication, also known as multi-point communication, refers to the delivery of a message from a single source node to a number of destination nodes. It is a frequently used communication pattern in distributed-memory parallel computers and computer networks. Multistage interconnection networks (MINs) have resurged as a popular class of interconnection architecture for constructing scalable parallel computers and high-speed network switches. While efficient implementation of multicast communication is critical to the performance of message-based scalable parallel computers and switch-based high-speed networks, little research has been devoted to supporting multicast in MINs.

Unlike unicast communication, the size of a header in a multicast message depends on the number of destinations, the distribution of destinations, and the multi-address encoding/decoding scheme. This research suggests and compares six different multi-address encoding/decoding schemes to shorten the header, which is pure overhead to the system. Each of them has its own advantages and disadvantages; an appropriate choice of the multi-address encoding scheme depends on the destination pattern and is detailed in this research.

Several efficient multicast algorithms, both hardware and software implementations, for unidirectional wormhole-switched MINs are proposed in this research. The hardware implementation offers better performance than the software approach. Tree-based hardware approaches for wormhole-switched MINs, namely multi-head worms, require special mechanisms to avoid potential deadlocks when there are multiple multicasts. As shown in this research, hardware approaches to support multicast should be considered in the design of high-performance networks. In systems which do not support hardware multicast, multicast must be implemented atop existing unicast communication. This research proposes an efficient unicast-based multicast (or software multicast) algorithm for such systems.

While banyan MINs are limited to a unique routing path between any source and destination pair, an extra-stage MIN can provide extra routing paths. Extra routing paths can reduce the message transmission blocking probability and allow additional flexibility in selecting a routing path. An algorithm to find a traffic-optimal multicast tree in such networks within polynomial time is proposed.

Many new ideas and new algorithms are proposed to support
efficient multicast communication in wormhole-switched MINs. Performance evaluation and comparison of the different approaches are conducted through extensive simulation experiments. Research results obtained from this work will be extremely useful to parallel computer and network switch designers who wish to support multicast in their designs.

© Copyright 1995 by Chi-Ming Chiang. All Rights Reserved.

To my parents

ACKNOWLEDGMENTS

I would like to take this opportunity to express my appreciation to several persons, without whom this dissertation could not have been completed. My achievements, great or little, were possible through their participation.

I will always be indebted to my advisor, Lionel M. Ni. He has been my mentor, my colleague, and my friend. His very positive influence on my personal and technical development will carry forward into my future endeavors.

I am very grateful to the other members of my dissertation committee: Herman D. Hughes, Abdol Esfahanian, and Raoul D. LePage, for their valuable comments, help, encouragement, and friendship. I would also like to thank all my colleagues and friends who made my stay at Michigan State University enjoyable.

A person cannot accomplish anything without the help and understanding of family members. I thank my parents, brother, and sisters for their continuous encouragement, support, patience, and love. I appreciate my host family, the Ows, for their help, love, and friendship throughout the course of my master's and doctoral work. I proudly share this accomplishment with them all.

Last, but not least, my very special thanks go to my wife Ya-Ping and her family for sustaining me with their everlasting love and understanding.

TABLE OF CONTENTS

LIST OF TABLES x
LIST OF FIGURES xi

1 Introduction 1
1.1 Wormhole Switching 4
1.2 Multistage Interconnection Networks 5
1.3 Motivation and Problem Definition 5
1.4 Performance Metrics 13
1.5 Objectives and Thesis Outline 14

2 Multi-Address Encoding 18
2.1 Header Encoding Design Considerations 19
2.2 Multi-Address Encoding Schemes 21
2.2.1 All-Destination Encoding 21
2.2.2 Bit String Encoding 22
2.2.3 Multiple Region Broadcast 23
2.2.4 Multiple Region Stride 25
2.2.5 Multiple Region Mask 27
2.2.6 Multiple Region Bit String 28
2.3 Multi-Address Decoding 29
2.3.1 All Destination Decoding 31
2.3.2 Bit String Decoding 33
2.3.3 Multiple Region Broadcast Decoding 34
2.3.4 Multiple Region Stride Decoding 35
2.3.5 Multiple Region Mask Decoding 37
2.3.6 Multiple Region Bit String Decoding 37
2.4 Summary 39

3 Multistage Interconnection Networks 40
3.1 Switches 41
3.2 Network Topology 42
3.3 Node Architecture 45
3.4 Decoding in Multistage Interconnection Networks 45
3.4.1 All Destination Decoding 49
3.4.2 Bit String Decoding 51
3.4.3 Multiple Region Broadcast Decoding 53
3.4.4 Multiple Region Stride Decoding 54
3.4.5 Multiple Region Mask Decoding 56
3.4.6 Multiple Region Bit String Decoding 57
3.5 Performance of Multi-Address Encoding and Decoding on Multistage Interconnection Networks 59
3.6 Summary 64

4 Hardware Multicast in Wormhole-Switched MINs 66
4.1 Implementation of Multi-head Worms 67
4.2 The Synchronous Multi-head Worm 73
4.3 Multi-address Encoding 78
4.4 Performance and Comparison 82
4.5 Summary 88

5 Network Partitionability and Traffic Localization 89
5.1 Influence of Traffic Localization 90
5.2 Definition of Different Cubes 94
5.3 Contention-Free and Channel-Balanced Partition 96
5.4 Non-Contention-Free Partition 102
5.5 Modification of Baseline and Butterfly Networks 105
5.5.1 Modification of a Baseline Network 106
5.5.2 Modification of a Butterfly Network 108
5.6 Performance Evaluation 111
5.7 Summary 115

6 Extra-Stage Multistage Interconnection Networks 117
6.1 MINs with Extra Stages 119
6.1.1 Interstage Connection Patterns 119
6.2 Structural Equivalence of Different Delta Networks 121
6.2.1 Design of Extra-Stage MINs 121
6.2.2 Design Choices 124
6.3 Multicast in Extra-Stage MINs 126
6.3.1 Alternate Routing Styles in Extra-Stage MINs 127
6.3.2 Distributed Routing and Multicast Implementation 128
6.3.3 Number of Multicast Trees: Upper Bound 130
6.3.4 Multicast Tree Selection Criteria 131
6.3.5 Traffic-Optimal Multicast Trees 132
6.4 Multicast Algorithm 134
6.4.1 Latest Branch Multicast Algorithm 135
6.4.2 Optimality Proof and Complexity 137
6.4.3 Other Sub-Optimal Multicast Heuristics 138
6.5 Performance 142
6.5.1 Simulation Description 142
6.5.2 Dimension Patterns and Output Parameters 143
6.5.3 Plots and Observations 144
6.6 Summary 147

7 Software-Based Multicast 149
7.1 Issues in Multicast Communication 149
7.2 Multicast Algorithm 153
7.3 Non-Optimal Multicast in Baseline and Butterfly Networks 157
7.4 The C-min Algorithm 159
7.5 Performance Evaluation 163
7.6 Summary 167

8 Related Work 169
8.1 MIN-Based ATM Switches 171
8.2 Hardware Multicast 177
8.2.1 Path-Based Multicast 177
8.2.2 Trip-Based Multicast 179
8.2.3 A Multidestination Worm Conforming to Base Routing Schemes 180
8.2.4 Synchronous Receiver-Initiated Multicast 181
8.3 Software Multicast 182
8.4 Summary 185

9 Conclusions and Future Work 187
9.1 Research Contributions 187
9.2 Directions for Future Research 190

BIBLIOGRAPHY 192

LIST OF TABLES

3.1 Number of flits in a header based on various multi-address encoding schemes. A header consists of a counter and addresses.
6.1 Patterns of this study.

LIST OF FIGURES

1.1 Broadcast 100 integers from process 0 to every process in the group.
1.2 The framework.
1.3 Path-based hardware multicast.
1.4 Tree-based hardware multicast.
1.5 An asynchronous multicast tree.
1.6 A synchronous multicast tree.
2.1 The message header format of six address encoding schemes.
2.2 Single region broadcast and single region stride, where the source node is 0 and the destination sets are {4, 5, 6} and {1, 3, 5}, respectively.
2.3 Multicast to the same row and column in (a) 2D mesh and (b) linear array architectures.
2.4 (a) A generic switch/router with four input/output ports, where the message is forwarded via both output ports 1 and 3. (b) The receiving of a duplicated message when the header is not handled properly and alternate routing paths are allowed.
2.5 All destination decoding algorithm.
2.6 An example of all-destination decoding.
2.7 Buffered bit string decoding algorithm.
2.8 An example of buffered bit string decoding.
2.9 An example of hierarchical bit string decoding.
2.10 Multiple region broadcast decoding algorithm.
2.11 An example of multiple region broadcast decoding.
2.12 An example of multiple region stride decoding.
2.13 An example of multiple region mask decoding.
2.14 Multiple region bit string decoding algorithm.
2.15 An example of multiple region bit string decoding.
3.1 A generic MIN structure with N = k^n input/output ports and n stages.
3.2 Four 16-node MINs built with 2 x 2 switches.
3.3 A 16 x 16 cube network built with 2 x 2 switches.
3.4 A duplicated receiving when there are alternative paths, in an extra-stage MIN, between source and destination nodes.
3.5 All address decoding algorithm.
3.6 An example of all-destination decoding.
3.7 The algorithm of bit string decoding schemes for MINs.
3.8 An example of buffered bit string decoding.
3.9 An example of hierarchical bit string decoding.
3.10 Multiple region broadcast decoding algorithm.
3.11 An example of multiple region broadcast decoding.
3.12 An example of multiple region stride decoding.
3.13 Define a subcube in a MIN.
3.14 An example of multiple region mask decoding.
3.15 Multiple region bit string decoding algorithm for MINs.
3.16 An example of multiple region bit string decoding.
3.17 The length of header and reduction rate on different destination patterns vs. different multi-address encoding/decoding schemes.
4.1 An example of deadlock for a path-based multicast on a 16-node multistage cube network.
4.2 An example of establishing a wormhole-switched multicast tree, where the shaded box represents the header flit and the white box stands for data flits.
4.3 Establishment of an asynchronous multi-head worm with network contention.
4.4 An example of a deadlock situation with two multi-head worms, where the bold line represents the multi-head worm Ta and the dotted line stands for Tb.
4.5 Establishment of a synchronous multi-head worm with network contention.
4.6 Two multi-head worms requesting common channels to reach their destination sets may cause deadlock or starvation.
4.7 A deadlock example with the all-destination encoding/decoding scheme and 3 synchronous multi-head worms, where the destination information carried by a flit is numbered within the box.
4.8 Comparison among hardware and software implementations.
4.9 Latency comparison of different switch sizes and different numbers of destinations on a cube network.
4.10 Comparison on a butterfly network with different-sized switches and different numbers of destinations.
4.11 Latency comparison for different numbers of nodes in each cluster.
4.12 Comparison between the region broadcast and pseudo all-destination encoding schemes.
5.1 Average latency and associated system utilization of a blocking multicast message on a 64-node cube network.
5.2 Three different processor clusters.
5.3 A 16-node cube network based on 4 x 4 switches partitioned into four contention-free and channel-balanced binary cube clusters.
5.4 An 8-node cube network and an 8-node omega network partitioned into three contention-free and channel-balanced binary cube clusters.
5.5 A 16-node baseline network partitioned into different binary clusters.
5.6 A 16-node butterfly network partitioned into different binary clusters.
5.7 An example of partitioning a 16-node modified baseline network into contention-free and channel-balanced disjoint k-ary cubes.
5.8 An example of partitioning a modified butterfly network into contention-free and channel-balanced disjoint k-ary cubes.
5.9 Average latency of a blocking multicast message with various intertask contention.
5.10 Average throughput in a cube network with average 64-flit messages.
5.11 Average latency in a cube network for non-blocking messages of average 64-flit length.
5.12 Average latency in a cube network for non-blocking messages with multiple sources.
6.1 A generic MIN structure with N = 2^n input/output ports and n stages, where each 2 x 2 switch has four connectivity choices.
6.2 Structural equivalence between (a) PS-MIN(16) and CU-MIN(16) and (b) PS-ESMIN(16:5) and CU-ESMIN(16:1,3,2,1,0).
6.3 An example of the inadequacy of the traditional routing approach (source ⊕ destination): a route from source node 1 to destination node 2 does not lead to the right destination.
6.4 The four optimal multicast trees in an ESMIN(8:1,2,1,2,0).
6.5 The eight optimal multicast trees in an ESMIN(8:1,1,2,2,0).
6.6 Latest-branch multicast algorithm illustration: a) CU-ESMIN(16:1,2,3,3,1,1,0); b) construction of the CU-ESMIN1 by seeking the rightmost occurrence of each dimension and marking the remaining stages as 'horizontal' (horizontal stages merely carry forward the data); c) example multicast in the CU-ESMIN1.
6.7 An example of blocking in multicast tree construction by the RB algorithm.
6.8 The number of channels used in the first branch algorithm.
6.9 The number of channels used in the latest branch algorithm.
6.10 The number of channels used in the random mapping algorithm.
6.11 The probability of blocking in the random branch algorithm.
6.12 The number of channels used in three non-blocking algorithms.
7.1 An example of deadlock in single region broadcast.
7.2 Unicast-based software multicast trees.
7.3 The C-min algorithm.
7.4 An example of a software multicast tree implemented by the C-min algorithm.
7.5 The latency in blocking multicasts.
7.6 The latency of the C-min algorithm on various networks and various-sized switches.
7.7 The latency of various algorithms on a cube network constructed with 4 x 4 switches.
7.8 The throughput of various algorithms on cube networks with 4 x 4 switches.
8.1 Switch fabric.
8.2 A 16 x 16 copy network.
8.3 An example of synchronous region broadcast.
8.4 An example of a path-based multicast in a 2-D mesh. The lower pair of numbers is the absolute address of a node and the upper number is the relative address of the same node based on a Hamiltonian path.
8.5 An example of a trip-based multicast.
8.6 Examples of a multidestination worm conforming to base routing schemes.
8.7 An example of a U-min multicast.
8.8 An example of two multicasts based on the U-min and C-min algorithms.

CHAPTER 1

Introduction

Efficient data communication among processor nodes is critical to the performance of message-based scalable parallel computers (SPCs). Generally, data communication can be classified into point-to-point communication and collective communication [1]. While point-to-point communication deals with the basic send and receive operations between two nodes, collective communication deals with communication that involves a group of processes which form a process group. Multiple applications may space-share an SPC in such a way that processor nodes in the system are partitioned into several disjoint subsets, namely processor clusters or clusters, each of which is dedicated to a distinct application. Multiple processes may time-share a processor node, and multiple process groups may exist in a cluster. Each process group includes a subset of all processes in the cluster. As a result, a system-level multicast service, in which the same message is delivered from a source node to an arbitrary number of destination nodes, is fundamental in supporting collective communication primitives including application-level broadcast, reduction, and barrier synchronization [2].

Multicast communication is also demanded in high-speed networks, such as ATM switches, for various network applications [3, 4] including multimedia applications (e.g., [5]).

For example, the format of the group broadcast in MPI is specified as follows:

MPI_Bcast(buffer, count, datatype, root, comm)

The parameters buffer, count, datatype, root, and comm represent the data buffer, the number of data items, the data type, the source process, and the communicator, respectively.
The communicator is used to distinguish different process groups, since a process can join more than one process group. The message is delivered to every process which owns the specified communicator. Figure 1.1 gives an example in which source process 0 forwards an array of 100 integers to all processes in the process group specified by comm. Note that this example shows only a code fragment; some of the variables, such as comm, must have been assigned appropriate values before this code executes.

MPI_Comm comm;
int array[100];
int root = 0;
/* ... */
MPI_Bcast(array, 100, MPI_INT, root, comm);

Figure 1.1: Broadcast 100 integers from process 0 to every process in the group.

As indicated in [6], many applications require multicast communication. As the scale of an application increases, communication becomes a bottleneck and degrades the overall performance [7].

The relationship between different applications, high-level parallel languages, communication packages, the ATM adaptation layer, and networks is shown in Figure 1.2. The Message Passing Interface (MPI) explicitly supports application-level broadcast as well as multicast [1], while High Performance Fortran (HPF) supports these functions implicitly [8]. Both of them rely on the system to provide efficient multicast communication. The interconnection network for these applications can be either a special-purpose network or an asynchronous transfer mode (ATM) network.

Teleconferencing, multimedia, and interactive television are some newly emerging and natural applications for multicast communication. As long as there are more than two parties communicating in a teleconference, multicast communication is required. Although the source may send multiple copies of the message individually (one for each party), the scalability of such an approach is limited by the bandwidth, and the latency among different parties may become intolerable. Efficient multicast support becomes a critical issue for the success of these applications.

Figure 1.2: The framework. (Application domain: parallel processing, multimedia, teleconferencing. Interface: High Performance Fortran, Message Passing Interface, ATM adaptation layer. Network: software multicast atop hardware unicast, and hardware multicast.)

As shown in Figure 1.2, teleconferencing, multimedia, and interactive TV are basically based on the ATM network. The current knowledge of multicast support in ATM switches is still limited. For example, both the Fore Systems ATM switch and the SynOptics ATMX switch support multicast via high-speed buses [9, 10].

The objective of this work is to support efficient multicast communication for all applications on top of a MIN. Such support is certainly network-topology dependent; in this thesis, we concentrate our effort on wormhole-switched unidirectional MINs.

1.1 Wormhole Switching

Switching methods greatly affect network latency. Wormhole switching [11], which has been adopted in almost all existing direct networks, is also being used in multistage networks. In wormhole switching, a message consists of a sequence of flow control digits, or flits. The header flit(s) of the message governs the route, and the remaining flits follow in a pipelined fashion. If the header flit(s) encounters a busy channel, the other flits hold and wait in place. This property makes wormhole switching susceptible to deadlock. As a result, deadlock avoidance is a critical issue in wormhole-switched networks.
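To make the flit-level pipeline concrete, the following sketch models one wormhole channel as a single-flit buffer and moves a flit forward only when the downstream buffer is free. It is purely illustrative (the types and function names are ours, not from any particular router design), but it captures the hold-and-wait behavior just described: a blocked header leaves its trailing flits stalled in buffers along the entire path.

#include <stdbool.h>

enum flit_kind { HEADER, DATA, TAIL };

struct flit {
    enum flit_kind kind;
    int payload;        /* a destination address in HEADER flits, data otherwise */
};

/* One wormhole channel: the minimum buffer capacity of one flit. */
struct channel {
    struct flit buf;
    bool full;
};

/* Try to move the flit in channel 'in' to channel 'out'.  The flit advances
 * only if the downstream buffer is free; otherwise it simply stays where it
 * is.  Because every trailing flit waits the same way, a blocked header
 * keeps the whole worm in place, holding all the channels it already
 * occupies -- the source of the deadlock problem discussed above. */
static bool advance(struct channel *in, struct channel *out)
{
    if (!in->full || out->full)
        return false;   /* nothing to move, or downstream busy: stall */
    out->buf = in->buf;
    out->full = true;
    in->full = false;   /* the freed buffer can now accept the next flit */
    return true;
}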
Because of its low network latency and the small amount of dedicated buffer space required at each node, wormhole switching has become the most promising switching technology and will be the only switching method considered in this thesis.

A message is divided into small packets, called cells, in an ATM network. The propagation of a cell within a switch is similar to wormhole switching [12]. Thus, the proposed model could be applied to ATM switching with minor modifications.

1.2 Multistage Interconnection Networks

Multistage interconnection networks (MINs) are a popular class of interconnection architecture for constructing SPCs, such as the BBN GP-1000 [13] and TC-2000 [14], IBM SP-1 [15] and SP-2, TMC CM-5 [16], Meiko CS-2 [17], and NEC Cenju-3 [18], as well as high-speed networks, such as the SynOptics ATMX [19] and DEC GIGAswitch [20]. In such systems, processor nodes are interconnected through a MIN, a class of indirect networks. Each processor node has its own processor(s), local memory, and other supporting devices. As the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system scale up as well.

MINs can be further classified as unidirectional MINs and bidirectional MINs [21]. As indicated in [21], bidirectional MINs are essentially fat trees, such as those in the TMC CM-5, IBM SP-1, and Meiko CS-2. This thesis concentrates on unidirectional MINs, such as the NEC Cenju-3 [18] and the BBN TC-2000 [14].

1.3 Motivation and Problem Definition

Since the multicast communication service is a primitive basis for increasing the performance of parallel processing applications and of newly emerging applications, several research efforts have addressed multicast in direct networks, such as the 2-D mesh and the hypercube, but very little work exists for unidirectional wormhole-switched MINs. Basically, there are two different approaches for providing multicast services: hardware implementation and software implementation. No matter what topology a network employs, a multicast message header must carry all destination addresses so that routers can make appropriate routing decisions. The header information is an overhead to the system and should be minimized in order to reduce communication latency and to increase effective network bandwidth.

Multi-Address Encoding and Decoding

Multiple address (multi-address) encoding and decoding is overlooked or ignored by most researchers. The all-destination encoding scheme, which puts all destination addresses in the header, is the most intuitive approach and the one used by most researchers when they mention this issue [22]. Basically, such a scheme is fine when the number of destinations is small; however, as the number of destinations increases, so does the header overhead. A few other works have used the single region broadcast or the single region mask, which restrict multicast communication to a single contiguous region (e.g., the NEC Cenju-3) or to a single subcube (e.g., the nCUBE-2), respectively. Reducing the multi-address overhead and eliminating such restrictions is the first challenge of this work. A good multi-address encoding scheme should consider how to minimize the message header length overhead, how to reduce the header processing time (or routing decision time), and how to support cut-through switching.

Hardware Approach

There are two different hardware approaches to support multicast: the path-based and the tree-based.
As shown in Figures 1.3 and 1.4, a path-based multicast is basically a copy-and-forward technique, while a tree-based multicast is a replicate-and-forward technique. In a path-based approach, the router replicates an incoming flit and forwards one copy to the local processor if that processor is one of the destinations, while sending the other copy to the outgoing channel, as shown in Figure 1.3. As shown in this figure, a path is established from the source node through all destinations in a certain order. Every destination is visited once and only once. The path-based approach is more suitable for a direct network, since the router is connected to the processor directly. In a unidirectional MIN, the distance between any source and destination pair is constant. The most serious problem is that deadlocks become possible due to a self-blocking property.

Figure 1.3: Path-based hardware multicast.

The router in a tree-based network may replicate the incoming flits and then forward them to multiple outgoing ports, as shown in Figure 1.4. Intuitively, such an approach may cause deadlock if there are multiple multicast trees. Some restrictions must be applied to avoid deadlock when multiple multicast trees are allowed simultaneously. Such restrictions depend on the underlying network topology.

Figure 1.4: Tree-based hardware multicast.

Because multiple branches exist in a tree-based multicast, two different approaches may be used when some branches are blocked. A switch in the asynchronous method forwards its flits independently, as shown in Figure 1.5. As long as all associated switches in the next stage are ready to receive the next flit, the switch forwards the flit residing in its buffer. For example, flit 1 in the top switch at stage s3 is forwarded to stage s4 since the two associated switches at stage s4 are ready. Flit 2 in the top switch at stage s2 is blocked since not all three associated switches are ready. A buffer is left empty in a switch if the next flit is blocked while the previous flit has been forwarded. Such an empty buffer is named a wormhole bubble and is marked by B in Figure 1.5.

Figure 1.5: An asynchronous multicast tree.

On the contrary, the synchronous multicast requires the headers of all branches to be forwarded simultaneously. While any branch is blocked, all branches hold their current status, as shown in Figure 1.6. In this figure, the fourth switch at stage s4 is used by some other communication. The third switch at stage s3 is therefore unable to forward flit 1. Thus, it reports the condition back to the source node, and the source node forces every branch to stop forwarding. A wired-AND or wired-OR connection may be used to implement the control circuit to avoid the long latency caused by this condition checking. Obviously, such a switch would be more complicated than the switch used in an asynchronous multi-head worm. It offers a potential solution for deadlock-free multicasts and is adopted by the nCUBE-2 [23]. Both a synchronous multi-head worm and an asynchronous multi-head worm have their own problems.

Figure 1.6: A synchronous multicast tree.
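The contrast between the two disciplines can be made concrete in a few lines of code. The sketch below is our own illustration (the branch structure and function names are invented for exposition, not taken from the nCUBE-2 or any other machine): an asynchronous switch advances each branch whenever that branch's downstream buffer is ready, while a synchronous worm advances only when every branch can advance.

#include <stdbool.h>

struct branch {
    bool has_flit;          /* a flit is waiting in this branch's buffer  */
    bool downstream_ready;  /* the next-stage switch can accept the flit  */
};

/* Asynchronous discipline: every branch that can move does move.  Branches
 * that are blocked fall behind, leaving wormhole bubbles (empty buffers)
 * behind the branches that advanced. */
void async_step(struct branch b[], int n)
{
    for (int i = 0; i < n; i++)
        if (b[i].has_flit && b[i].downstream_ready)
            b[i].has_flit = false;      /* flit forwarded to the next stage */
}

/* Synchronous discipline: the flit advances on all branches or on none, so
 * all heads of the worm stay at the same depth; a single blocked branch
 * stalls the entire tree, which is what the source-controlled stop in
 * Figure 1.6 achieves. */
void sync_step(struct branch b[], int n)
{
    for (int i = 0; i < n; i++)
        if (b[i].has_flit && !b[i].downstream_ready)
            return;                     /* one blocked branch stalls them all */
    for (int i = 0; i < n; i++)
        b[i].has_flit = false;          /* all heads advance together */
}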
Avoiding deadlock in an asynchronous multi-head worm is very difficult, if not impossible. Restricting the network to one multicast at a time is a possible solution, but then a high-speed bus may offer the same performance. Enlarging the buffer within each switch is another solution, provided the buffer is large enough; however, such a solution turns the network into a virtual cut-through network instead of a wormhole-switched network. Avoiding deadlock for multiple synchronous multi-head worms seems to be easier than for asynchronous ones, but it is not as easy as one might think. For example, the nCUBE-2 has a synchronous multi-head worm but is forced to disable its hardware multicast due to potential deadlock. The NEC Cenju-3 is another system which supports hardware multicast. Nonetheless, deadlock is still possible when multiple multicasts are allowed, due to its asynchronous multi-head worm. Consequently, providing hardware multicast capability without potential deadlock and/or starvation is one of the objectives of this work.

Software Approach

In systems which do not support hardware multicast, we have to resort to software approaches to support efficient multicast, referred to as software multicast or unicast-based multicast. Traditional approaches use separate addressing, in which the source node sends the message to one destination at a time. As the number of destinations increases, separate addressing may require excessive time because many systems allow a local processor to send only one or a few messages at a time. An alternative approach is a multicast tree,¹ in which the source sends the message to a subset of the destinations. Each recipient of the message forwards it to some subset of the destinations that have not yet received it. Which type of multicast tree to use depends on the switching strategy and the unicast routing algorithm. An efficient multicast tree involves no local processors other than the source and destination processors, exploits the distance-insensitivity of wormhole switching, and is of minimum height, specifically, height ⌈log2(m + 1)⌉ for m destination nodes. Another key requirement is that there be no channel contention among the constituent messages of the multicast. That is, the unicast messages involved should not simultaneously require the same channel.

¹The software multicast tree is different from hardware multicast trees, which are based on cut-through switching. In a software multicast tree, each intermediate node has to receive the complete message before forwarding it to other nodes.

Multicast Communication in MINs

Multicast communication has been extensively studied for distributed-memory multicomputers based on direct network architectures, but little research has been done on indirect networks, especially MIN-class topologies. Several characteristics distinguish a MIN from other network architectures, such as the unique path length between any (source, destination) pair and O(N log N) hardware complexity. Since the multicast approaches proposed for direct networks are not suitable for MINs, this thesis concentrates on multicast communication in MIN-based networks.

A regular MIN with N inputs and log_k N stages, where k is the number of input/output ports of a switch, is a unique-path network. With such a property, deadlock becomes possible if multicast trees share more than two channels.
The multicast capability in current systems is provided either by the unicast-based (software-based) approach or by a hardware approach with excessive hardware and some limitations imposed to avoid deadlock. Both approaches suffer from long latency, inefficiency, or cost. Even worse, some of them are not deadlock-free. A general multicast algorithm that avoids these drawbacks and guarantees a deadlock-free system is the major focus of this thesis.

A regular MIN is a unique-path network; hence, the multicast tree is also unique unless some variant of the MIN topology is considered. One such design extension, among a few others, is to consider a MIN with extra stages. The idea is to attach a few extra stages to the regular n-stage MIN. The extra stages bring forth flexibility in multicast paths, but also difficulty: finding all traffic-optimal multicast trees in an ESMIN becomes an NP problem.

1.4 Performance Metrics

An important metric used to evaluate a network is its communication latency, which is the sum of three component values: start-up latency, network latency, and blocking time [24]. The start-up latency refers to the time required for message framing/unframing, memory/buffer copying, validation, and so on, at both the source and destination nodes. The start-up latency is mainly dependent on the design of system software within the nodes and the interface between nodes and routers. Start-up latency can be further classified into sending latency, the software latency at the source node, and receiving latency, the software latency at the destination node. The network latency equals the elapsed time after the head of a message has entered the network at the source until the tail of the packet emerges from the network at the destination. Given a source and destination node, the start-up and network latencies are static values, frequently used to characterize contention-free networks. The blocking time includes all possible delays encountered during the lifetime of a message. These delays are mainly due to conflicts over the use of shared resources, such as busy channels and filled buffers. Blocking time reflects the dynamic behavior of the network due to the passing of multiple messages and may be high if the network traffic is heavy or unevenly distributed. In this dissertation, multicast latency is used to measure multicast performance. The multicast latency refers to the time interval from when the source processor begins to send the first copy of the message until the last destination processor has received the message.

1.5 Objectives and Thesis Outline

The main objective of this research is to establish efficient multicast communication in unidirectional wormhole-switched MINs. To support efficient multicast communication, the header overhead of the multicast message should be minimized and external network contention among different applications should be eliminated. These objectives and the thesis outline are discussed in the following.

In Chapter 2, we discuss how to shorten the header overhead by investigating different multi-address encoding and decoding schemes. Unlike a unicast message, the header of a multicast message must carry information about multiple destinations. Because of that, the length of a header varies with the number of destinations, the distribution of destinations, and the multi-address encoding/decoding scheme.
The design considerations for such a header, as well as six multi-address encoding and decoding schemes, are addressed in that chapter for general network topologies. As the scale of networks increases and the demand for multicast communication grows, the overhead of the message header becomes critical, which implies that multi-address encoding becomes critical. Although the proposed multi-address encoding and decoding schemes can be applied to networks with different switching techniques, such as circuit switching, store-and-forward switching, and cell relay, the emphasis of this thesis is on the wormhole-switched technique. Depending on the characteristics of a network topology, these multi-address encoding and decoding schemes may be further optimized.

In Chapter 3, we briefly review the different interconnection patterns as well as the associated network topologies of MINs. Furthermore, the optimization of the multi-address encoding schemes and the associated decoding algorithm in a switch for each encoding scheme in MINs are also addressed in that chapter. A performance evaluation of the different multi-address encoding and decoding schemes is given based on simulations.

In Chapter 4, two different tree-based hardware multi-head worm implementations are studied: the asynchronous and the synchronous. The asynchronous multi-head worm allows each branch to forward independently, while the synchronous multi-head worm insists that all branches forward synchronously. Unfortunately, neither implementation is deadlock-free unless certain rules are applied. Hence, current hardware implementations either exhibit undesirable properties or are restricted in their use. A deadlock-free and starvation-free hardware implementation of multiple synchronous multi-head worms is presented. The difficulty of implementing deadlock-free multiple asynchronous multi-head worms is also shown in that chapter.

Unlike a unicast message, whose header has a fixed size, the size of a multicast message header depends on the number of destinations, the distribution of destinations, and the encoding scheme. Since a single flit may not be able to carry all destination information, the branches of a multi-head worm may extend different distances toward different destinations. Such uneven latency may cause deadlock of synchronous multi-head worms. We address this issue and propose a pseudo multi-address encoding scheme to eliminate the potential deadlock.

In Chapter 5, the network partitionability of different delta-class networks is examined and performance results of various allocation schemes are given. By exploiting the locality of processor allocation, we show that the known topologically equivalent delta-class MINs have different capabilities in supporting hardware and software multicasts. Since internal contention is unavoidable unless only one multicast proceeds at a time, we concentrate on eliminating external contention. A binary cube network partitioning scheme is studied and shown to be external-contention-free in both cube and omega MINs. On the contrary, we also show that the baseline and butterfly MINs may not be partitionable into contention-free and channel-balanced subcubes. Two connection patterns are studied to modify the baseline and butterfly MINs. Both modified baseline and butterfly MINs have the same partitionability as the cube and omega MINs.

In Chapter 6, we consider the multicast problem for various classes of MIN topologies.
Since the multicast problem for a regular MIN has a unique solution and does not offer any flexibility, we consider a design extension, namely the extra-stage MIN (ESMIN), as our focus. The ESMIN multicast problem is formulated and an optimality criterion is defined. A lower bound on the number of multicast trees is estimated, and we show that the total number of traffic-optimal multicast trees may itself be exponential. However, a traffic-optimal multicast tree can be generated in polynomial time, and we propose an algorithm to this end. The performance of this algorithm, with respect to three other proposed heuristics, is shown using simulation.

In Chapter 7, an efficient software-based multicast algorithm is presented and analyzed for those existing message-based SPCs which do not have hardware multicast support. One way to implement multicast in such systems is separate addressing. An alternative approach is the multicast tree. The focus here is on such multicast tree implementations, also known as unicast-based multicast implementations. By exploiting the locality of processor allocation, a minimum-time unicast-based multicast algorithm is proposed for some classes of MINs which support one-port communication.

In Chapter 8, we give an overview of related work. Since much research has been devoted to multicast on different network topologies and different switching techniques, we concentrate on the work related to multicast in MIN-based ATM switches, as well as hardware multicast and software multicast in wormhole-switched networks.

The concluding remarks, as well as some possible future research directions, are discussed in Chapter 9.

CHAPTER 2

Multi-Address Encoding

No matter what topology a network employs, a message header for multicast communication must carry the destination set information needed for routers/switches to make appropriate routing decisions. In some networks, such a header is needed in a sender-initiated control message in order to establish a multicast circuit, while in other networks such a header is used in all data messages. Nevertheless, the header information is an overhead to the system and should be minimized in order to reduce communication latency and to increase effective network bandwidth.

Basically, a multicast message header carries multiple destination address (multi-address) information. A destination address can be either a physical address or a relative address. A relative address is usually used to represent the location relative to the source address. Each address may be further composed of several components, one per dimension; for example, in a 2D mesh network, each address has two components, one for each dimension. In this work, we do not further distinguish different types of addresses. Unless otherwise specified, an address refers to all details of the address.

2.1 Header Encoding Design Considerations

Each multicast message header must have a number of flits to carry the necessary routing information, where each flit is a flow control unit. A flit could hold an address or a region (to be discussed later) depending on the encoding scheme. The total number of flits, or header length, usually depends not only on the number of destinations but also on the selected multi-address encoding scheme. The function of a switch is to take a message from an input channel, make a routing decision, and forward (possibly replicating) the message to one or more output channels.
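As a rough sketch of this replicate-and-forward behavior (our own illustration; Route() stands for whatever topology-specific routing function the switch implements, and the structures are invented for exposition), a switch might process one address-carrying header flit as follows.

#include <stdbool.h>

#define NUM_PORTS   4
#define MAX_HEADER 16

struct output_port {
    bool enabled;               /* this port carries a copy of the message */
    int  header[MAX_HEADER];    /* outgoing header under construction      */
    int  len;
};

/* Topology-specific routing function: maps a destination address to the
 * output port that reaches it.  Assumed given; its definition depends on
 * the network. */
extern int Route(int addr);

/* Process one destination address from an incoming multicast header:
 * select the port that reaches it, enable that port (the switch replicates
 * the message whenever several ports end up enabled), and append the
 * address to that port's outgoing header. */
void process_header_flit(struct output_port ports[], int addr)
{
    int p = Route(addr);
    ports[p].enabled = true;
    ports[p].header[ports[p].len++] = addr;
}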
A good multi-address encoding scheme should not only shorten the message header length, but also minimize communication latency and ease routing decisions. The following header design issues are considered.

The length of a message header can be either fixed or variable in terms of the number of destinations. The length of a variable-length header is dynamically adjusted by switches as the message header is processed. For a fixed-length header, the length may be a function of the location of the switch in the network.

In cut-through switching, should each switch buffer the whole message header before making the routing decision? For a large message header, this implies a large buffer to hold the whole header and a longer communication latency. In wormhole routing, a message is divided into a number of flits, and the minimum capacity of each buffer is one flit. Since a message header may be composed of many flits, it is desirable that the routing decision in each router be made as soon as possible to minimize the buffer requirement. Ideally, a message header can be processed on the fly, on a flit-by-flit basis.

For a variable-length header, the number of destination addresses (or regions, to be discussed later) is another design parameter. It may be impractical and inefficient to use a counter to indicate the number of destinations (or regions), because the counter flit is usually placed at the beginning of a message header. Since the counter value may have to be changed by a switch when the destination address set is split into different subsets, a counter prohibits the processing of message headers on the fly. An alternative approach is to have an end-of-header (EOH) flit to indicate the end of an address header. Some known hardware and software techniques, such as code violation and address stuffing, may be used to implement the EOH flit.

Both tree-based multicast, in which a router is able to replicate an input message through multiple output channels, and path-based multicast, in which a message traverses a certain path and picks up the destination nodes along that path, may be considered to implement multicast communication [25]. Usually, path-based multicast is used in direct networks, such as multi-dimensional meshes, and tree-based multicast is used in indirect (switch-based) networks, such as MINs. In both approaches, the message header may have to be modified by intermediate routers/switches.

2.2 Multi-Address Encoding Schemes

Six multi-address encoding schemes are described in this section. The message header format of these schemes is shown in Figure 2.1.

(a) All-destination encoding:            addr1 | addr2 | ... | addrm | EOH
(b) Bit string encoding:                 one bit per node from node b to node e
(c) Multiple region broadcast encoding:  b1:e1 | b2:e2 | ... | bk:ek | EOH
(d) Multiple region stride encoding:     b1:e1:s1 | b2:e2:s2 | ... | bk:ek:sk | EOH
(e) Multiple region mask encoding:       b1:e1:m1 | b2:e2:m2 | ... | bk:ek:mk | EOH
(f) Multiple region bit string encoding: b1:e1:T1 | b2:e2:T2 | ... | bk:ek:Tk | EOH

Figure 2.1: The message header format of six address encoding schemes.

2.2.1 All-Destination Encoding

This is an intuitive method used in [26], in which all destination addresses are carried by the header, as shown in Figure 2.1(a). Assuming that all addresses are sorted in ascending order, the EOH flit can be all 0's. If the only destination address is all 0's, a second all-0's flit indicates the EOH.
If the routing requires that the addresses be arranged in a certain order, as in path-based multicast [25], the EOH flit may be represented by replicating the last address. However, for tree-based multicast, it is easier to generate the same EOH flit for all replicated outgoing messages. In this case, the EOH flit could be an unused address, or other methods may be used. Note that if an EOH flit spans several addresses, the flit buffer must be large enough to hold the entire EOH flit. Clearly, all-destination encoding is good for a small number of irregular addresses, as its header length is proportional to the number of addresses.

2.2.2 Bit String Encoding

The major drawback of the all-destination encoding scheme is its significant header overhead when the number of destinations is large. One way to limit the size of a header is to have a bit string to indicate the destinations, where each bit corresponds to a node in the range between node b and node e, as shown in Figure 2.1(b). Since the number of nodes in a system is predefined, there is no need for an EOH field. In some network topologies, such as MINs, the bit string length can be a function of the number of reachable nodes from a given switch, which is still independent of the number of destinations. Apparently, the bit string encoding scheme is inefficient when the system is large and the number of destinations is small. However, it is flexible in handling a large number of irregular destination addresses.

Usually, it is extremely difficult for a switch to make the routing decision on the fly based on the incoming bit string information and to produce the bit string information for each output port. Thus, a switch usually has to buffer the entire bit string in order to make the routing decision and to generate the output bit strings. This is named the buffered bit string scheme. Although the buffered bit string encoding scheme allows distributed routing, each switch requires a large flit buffer and the communication latency is also increased.

To eliminate the large flit buffer, a hierarchical bit string may be used. This is a source routing method in which the source node has to determine the complete multicast tree information. In this scheme, the input message header has 1 + k flits for a switch with k output ports. The first flit has k bits corresponding to the k output ports. A "1" in a bit position indicates that the corresponding output port should forward a copy of the message. The remaining k even-sized flits carry the bit string information for each output port. The bit string information is recursively defined, because each bit string becomes an input to the next switch. Note that if these bit string flits are not even-sized or number fewer than k, it will be very difficult to determine the delimiter between two adjacent flits. Obviously, the hierarchical bit string encoding scheme can perform routing on the fly and requires only a small flit buffer in each switch. However, the header length is much longer than with the buffered bit string method.

2.2.3 Multiple Region Broadcast

In order to enforce communication locality and minimize communication interference among processors from different process groups, processors belonging to the same process group are usually allocated in a contiguous region, if possible. Thus the multicast addresses can be confined within a region, and each region is specified by two fields, (b:e), the beginning and ending addresses of the region, respectively.
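To illustrate how a destination set collapses into such regions, the following sketch (our own illustration, assuming one-dimensional node addresses, a sorted destination list, and sufficient capacity in out[]) merges consecutive addresses into the minimal list of (b:e) regions that a region-based header would carry.

struct region { int b, e; };    /* beginning and ending addresses of a region */

/* Compress a sorted list of destination addresses into maximal runs of
 * consecutive addresses, one (b:e) region per run.  For the destination
 * set {4, 5, 6} this yields the single region (4:6); for {1, 3, 5} it
 * degenerates into three one-node regions, which is why region broadcast
 * suits contiguous processor allocations.  Returns the number of regions
 * written to out[]. */
int encode_regions(const int dest[], int ndest, struct region out[])
{
    int k = 0;
    for (int i = 0; i < ndest; i++) {
        if (k > 0 && dest[i] == out[k - 1].e + 1) {
            out[k - 1].e = dest[i];     /* extend the current run */
        } else {
            out[k].b = dest[i];         /* start a new region     */
            out[k].e = dest[i];
            k++;
        }
    }
    return k;
}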
This (b:e) form is referred to as region broadcast and is used in some ATM switches and in the NEC Cenju-3 [18].

(a) Single region broadcast: begin | end. (b) Single region stride: begin | end | stride. (c) An example of single region broadcast. (d) An example of single region stride.

Figure 2.2: Single region broadcast and single region stride, where the source node is 0 and the destination sets are {4, 5, 6} and {1, 3, 5}, respectively.

Figure 2.2(a) shows the specification of a single region broadcast, where the beginning address must be no greater than the ending address. An example is given in Figure 2.2(c). When a message enters a switch, the switch buffers the complete region flit and then directs the message to the corresponding output port(s). The header is revised at every switch where the message is replicated. For example, the incoming header at the shaded switch is (4:6), and the outgoing headers for the upper and lower ports are (4:5) and (6:6), respectively.

However, depending on the processor allocation scheme and application program characteristics, not all process groups can have all their nodes in a contiguous region. The single region broadcast thus cannot always achieve the multicast in a single multicast communication. Two approaches may be used. One method is to send multiple single region broadcast messages in sequence to different disjoint regions. Another method is to add extra hardware (e.g., the copy network used in [27, 28]) and to introduce the concept of dummy addresses.

Here, we generalize this scheme to multiple regions by allowing multiple region specifications in the header. The multiple region broadcast (see Figure 2.1(c) for its header format) forwards the message to every node covered by each region. The EOH flit can be specified as a region containing an address with all 1's followed by an address with all 0's (i.e., an address violation). In fact, any pair of addresses can be an EOH as long as the second address is smaller than the first address.

Figure 2.3(a) illustrates an example of multiple region broadcast. Consider 64 nodes organized as a 2D array. Some 2D matrix applications may require the source node (3,2) to send a message to all nodes in the same column and in the same row. If the source node can also be a destination, the header for node (3,2) is (3:3,0:7; 0:7,2:2). Otherwise, the header will be (3:3,0:1; 3:3,3:7; 0:2,2:2; 4:7,2:2).

2.2.4 Multiple Region Stride

In some applications, a source node may wish to send a message to all odd-numbered destinations or to all even-numbered destinations. In other words, the destination addresses have a constant distance between two adjacent addresses. Thus, the addresses (or flit) can be specified by three parameters, (b:e:s): the beginning address,
Figure 2.2(c) shows the specification of a single stride—region. Consider a stride— region, (bzezs), where (b S e) and s is the stride value. The message is forwarded to node {dld = b+i X s,i = 0,1,..., [(6 —- b)/s]}. Figure 2.2(d) shows an example in which destinations are {1,3,5} and the source node is 0. The header can be either (1:5:2) or (1:6:2). The header is also revised when the message is replicated. For example, the incoming header is revised to (1:312) and (5:5:2) for the outgoing messages to upper and lower ports of the shadow switch, respectively. Similarly, this encoding scheme can be generalized to multiple region stride. The header format is shown in Figure 2.1(d). To be consistent with the stride-region flit specification, the EOH flit may contain three fields. Like multiple region broadcast, an address with all 1’s followed by an address with all 0’s may be used to indicate 27 the EOH, where the third stride field is irrelevant. Consider the same multicast example illustrated in Figure 2.3(b). If the processors are organized in a linear order, such as the NEC Cenju-3 and the IBM SP-l, all nodes in the same row are not in a contiguous region. The multiple region stride encodes these destinations as (2:58z8; 24:31:1) if the source node is also a destination; otherwise, the header is (2:18:8; 34:58:8; 24:25:1; 27:31:1). 2.2.5 Multiple Region Mask Another regular pattern of destination addresses that typically occurs in k—ary n- cube networks is subcube. Let (l,,_1...dldo represent a node in a k-ary n-cube, where 0 S d,- S k. — 1 (note that this definition can be easily extended to a more general cube network in which the radix of each dimension may be different). Any subcube can be described by a binary mask, m, with an address, 6. The binary mask m defines the size and dimensions of the subcube, i.e., the number of ones in m and the location of each dimension with m,- = 1. The address 6 defines which subcube among those subcubes, i.e., bfs where m,- = 0. This approach is used in the nCUBE-2 [23]. In a more general case, we define a subset of a subcube with three fields: (bzezm), where b,- = 6,- when m.- = 0 and for those i’s with m,- = 1, b,- and 6,- specify the lower and upper bounds of the subcube, respectively. The multiple region mask shown in Fig. 2.1(c) extends this approach to allow that a message be destined for multiple subsets of subcubes. Consider a binary 4—cube with eight destinations {0100, 0101, 0110, 0111, 1100, 28 1101, 1110, 1111}. The multiple region mask encodes the header to (0100: 1111:1011), which is a complete 3—cube. When the destinations are {0110, 0111, 1100, 1101, 1110}, a subset of the 3-cube is specified as (01101111021011). 2.2.6 Multiple Region Bit String Both multiple region broadcast and multiple region stride are suitable for destinations that can be divided into a number of clusters and with a regular address pattern within each cluster. If the destinations are not in any regular shape, the header encoded by previous encoding schemes may be too large (e.g., many disjoint regions). The multiple region bit string encoding scheme shown in Figure 2.1(f) is proposed to reduce the header overhead under such a situation. In general, a region bit-string flit is specified as (b, e,T), where b and e (b S 6) indicate the beginning and ending addresses, and T is a binary bit string. Each bit in T corresponds to a node within b and e. A node is a destination if the corresponding bit is 1. Both 6 and e are destinations and T has (e — b + 1) bits]. 
2.2.6 Multiple Region Bit String

Both multiple region broadcast and multiple region stride are suitable for destinations that can be divided into a number of clusters with a regular address pattern within each cluster. If the destinations do not form any regular shape, the header encoded by the previous encoding schemes may be too large (e.g., many disjoint regions). The multiple region bit string encoding scheme, shown in Figure 2.1(f), is proposed to reduce the header overhead in such a situation. In general, a region bit-string flit is specified as (b, e, T), where b and e (b ≤ e) indicate the beginning and ending addresses, and T is a binary bit string. Each bit in T corresponds to a node within b and e. A node is a destination if the corresponding bit is 1. Both b and e are destinations, and T has (e − b + 1) bits¹.

Consider a system with 16 nodes and the destinations {0, 1, 3, 4, 6, 12, 13, 15}. With multiple region broadcast, this requires five regions. With multiple region bit string, one possible specification is (0:6:1101101; 12:15:1101). Apparently, in multiple region bit string, the length of each region is not fixed and is determined by the values of b and e. Depending on the flit buffer capacity, the system has to limit the maximum value of e − b. The EOH flit can contain two fields: a field with all 1's followed by one with all 0's.

¹Since the first and last bits are always 1 in T, these two bits may be removed from T.

Although the multiple region bit string is more flexible and can handle addresses with irregular patterns, it may be difficult to determine the most efficient number of regions, and the switch design may be more complicated.

2.3 Multi-Address Decoding

Consider a generic network switch/router with k input ports and k output ports, as shown in Figure 2.4(a) with k = 4. The interconnection of routers defines the network topology. Each router may or may not have a local processor, depending on the system architecture. When a multicast message arrives at a router via an input port, the router examines the header and may enable a number of output ports depending on the routing strategy. The message is replicated and forwarded to all enabled output ports. A critical issue is how to define the new header information for each enabled output port. If not handled properly, the same message may be received repeatedly by the same processor, as shown in Figure 2.4(b).

Figure 2.4: (a) A generic switch/router with four input/output ports, where the message is forwarded via both output ports 1 and 3. (b) The receiving of a duplicated message when the header is not handled properly and alternate routing paths are allowed. (Da: data; H: header; D: destination; Sr: source; Sp: split; DD: destination with duplicated messages.)

In distributed routing, if a router is able to process the input message header on the fly, it will buffer one header flit, make the routing decision, and deposit the new message header flit information to a selected output port. When an address or a region is forwarded to its associated output port, the other enabled ports will be idle, since there is no address or region to be forwarded to these output ports. Such an idle slot in the timing is referred to as a wormhole bubble. Note that a wormhole bubble does not necessarily imply an empty or a null flit. With a self-timed design, a wormhole bubble implies a timing delay to the next flit. However, for ease of explanation, the wormhole bubble is represented as an individual flit, B, in the following figures.

When a router replicates a message, the message header will be revised and an output message header is generated for each enabled output port (or each replicated message). Figure 2.4(a) shows an example of replicating a message into two messages in which the headers H1 and H2 are different. Usually, the destination address set is divided into a number of disjoint address subsets, one for each replicated message. This approach can avoid multiple receptions of the same message, which is especially important if there are many alternate paths to a destination. The drawback of having disjoint subsets is the decreased flexibility in forming the header format. However, if the routing path is unique, those destination subsets may overlap.
Since a router usually makes its routing decision based on the first field of a region flit, it may ignore any destination which is not reachable from its output port. This approach is more flexible in forming the header format.

For a single region which contains a number of addresses, the region may be split into several regions by a router. Each region is directed to an enabled output port. It is possible that two or more disjoint regions are directed to the same output port. Thus, the number of regions in the new header may be larger than in the incoming one. Therefore, placing the number of regions in the header may not be feasible.

For each of the six multi-address encoding schemes, this section will describe some generic address decoding algorithms that may be performed by a router. Given a header flit, say F, the routing function Route(F) will determine the output port number to be selected to forward the message. Furthermore, we assume that if the processor, if any, associated with a router is one of the destinations, a copy of the message will be sent to the processor. An output port is enabled if a copy of the message is to be forwarded through the port. At the end of message transmission, all output ports will be disabled. These behaviors will not be further described in the following algorithms.

2.3.1 All Destination Decoding

For each incoming flit, d, Figure 2.5 shows the corresponding decoding algorithm. When sending information through an enabled port, it is implied that the corresponding output channel is available; otherwise, the message transmission will be pending until the port is available. Avoiding deadlock is another critical design issue. The solution is dependent on the network topology and routing strategy, which is beyond the scope of this chapter.

Algorithm: all-destination decoding
Input: An address d.
Output: Append d to the header of the selected output port.
Procedure:
begin
  if d = EOH then send EOH to all enabled ports; exit;
  j := Route(d);
  send d to port j;
end

Figure 2.5: All destination decoding algorithm.

For example, suppose the addresses of a message entering a router are {2, 3, 5, 7, 15, B, B}, as shown in Figure 2.6, where B indicates a wormhole bubble and E represents the EOH. Let the reachable nodes from port i be node i × 4 to node i × 4 + 3, where 0 ≤ i ≤ 3. We further assume that there are 16 nodes in the system. The router enables port 0 and forwards address 2 when address 2 is decoded. Address 3 is forwarded to port 0. When address 5 is decoded, port 1 is enabled and address 5 is forwarded to port 1. There is a wormhole bubble on port 0. The process is repeated until the header is split completely, as shown in the figure. Port 2 is not enabled since there is no destination reachable via this port.

Figure 2.6: An example of all-destination decoding.
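The algorithm of Figure 2.5 and the routing decisions of the Figure 2.6 example can be reproduced with a few lines of Python. The route function below is the hypothetical one from the example (port i reaches nodes 4i through 4i + 3); for simplicity, the sketch collects the split headers directly and does not model wormhole bubbles or flit timing.

EOH = 'E'

def route(d):
    # hypothetical routing function from the example: port i reaches 4i .. 4i+3
    return d // 4

def all_destination_decode(header):
    # Split an all-destination header across the enabled output ports.
    out = {}
    for d in header:
        if d == EOH:
            for port in out:
                out[port].append(EOH)
            break
        out.setdefault(route(d), []).append(d)
    return out

print(all_destination_decode([2, 3, 5, 7, 15, EOH]))
# {0: [2, 3, 'E'], 1: [5, 7, 'E'], 3: [15, 'E']}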
2.3.2 Bit String Decoding

Buffered Bit String. A router stores the entire bit string and then detects each bit which has a 1 to enable the corresponding output port. If there is more than one output port that can reach the destination, the router selects exactly one port. The length of each bit string depends on the corresponding port and the number of nodes reachable from that port. The router also knows the reachable node for a given bit position. The buffered bit string decoding algorithm is given in Figure 2.7, while an example based on the previous example is shown in Figure 2.8.

Algorithm: buffered bit string decoding
Input: Binary bit string, T.
Output: Bit string, D_j, for each enabled output port j.
Procedure:
begin
  for every bit, t_i, in bit string T do
    (j, k) := Route(i);  /* route through port j, bit position k */
    d_{j,k} := t_i;
  endfor;
  enable every port j with D_j ≠ 0;
end

Figure 2.7: Buffered bit string decoding algorithm.

Figure 2.8: An example of buffered bit string decoding.

Hierarchical Bit String Decoding. In hierarchical bit string, the first field (k bits) of an incoming bit string indicates the enabling or disabling of the k output ports of a k-port router. The following bits are divided into k fields, one for each output port. The i-th field is forwarded to the associated router in the next stage if port i is enabled, for 0 ≤ i ≤ k − 1. Otherwise, it is eliminated at the router. The decoding algorithm is too simple to be described here, and the corresponding example is shown in Figure 2.9.

Figure 2.9: An example of hierarchical bit string decoding.

2.3.3 Multiple Region Broadcast Decoding

As shown in Figure 2.1(c), the header in multiple region broadcast contains several regions. All nodes covered by the regions should receive a copy of the message. Thus, the router may have to divide a region into several sub-regions, and each sub-region is directed through an appropriate output port. The algorithm for decoding each region is given in Figure 2.10.

Algorithm: multiple region broadcast
Input: A region (b, e).
Output: Send each sub-region to an output port.
Procedure:
begin
  if (b, e) = EOH, send EOH to all enabled ports; exit;
  while (b ≤ e) do
    (j, e') := Route(b, e);
    send region (b, e') to port j;
    b := e' + 1;
  end while;
end

Figure 2.10: Multiple region broadcast decoding algorithm.

In this decoding algorithm, the router checks the beginning address of a region and searches for the first address that cannot be reached through the same output port. A sub-region ending at the last reachable address, e', is then identified for that output port. The process repeats until all sub-regions have been identified. For example, consider an incoming header of (2:5; 7:8), shown in Figure 2.11. The router splits the first region into (2:3) and (4:5) for ports 0 and 1, and the second region into (7:7) and (8:8) for ports 1 and 2, respectively.

Figure 2.11: An example of multiple region broadcast decoding.

2.3.4 Multiple Region Stride Decoding

The method to handle multiple region stride is similar to that of multiple region broadcast. A router divides the region into several sub-regions, where each sub-region is directed to a single output port. As indicated in Figure 2.1(d), there is an additional field, the stride, in each region to indicate the distance between two adjacent nodes. The decoding algorithm is similar to Figure 2.10, except that (b, e) and (b, e') are replaced by (b, e, s) and (b, e', s), respectively. Note that the stride value is never changed.

Consider the example in Figure 2.12. The incoming header is {(1:5:2), (6:10:1)}, which indicates that nodes 1, 3, 5, 6, 7, 8, 9, and 10 are destinations. The router splits the first region into (1:3:2) and (5:5:2) for ports 0 and 1, and the second region into (6:7:1) and (8:10:1) for ports 1 and 2, respectively.

Figure 2.12: An example of multiple region stride decoding.
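The splitting loop of Figure 2.10 is compact enough to state directly in code. In the Python sketch below, route_region assumes, purely for illustration, that each output port reaches a contiguous block of four addresses; the Route function is left topology-dependent in this work. The stride variant of Figure 2.12 differs only in carrying the unchanged stride field along with each sub-region.

PORT_SPAN = 4   # assumed: port j reaches the contiguous nodes 4j .. 4j+3

def route_region(b, e):
    # Return (port, e') with e' the last address in [b, e] reachable
    # through the same output port as b.
    j = b // PORT_SPAN
    return j, min(e, (j + 1) * PORT_SPAN - 1)

def region_broadcast_decode(regions):
    # Split each (b, e) region across output ports (Figure 2.10).
    out = {}
    for b, e in regions:
        while b <= e:
            j, e2 = route_region(b, e)
            out.setdefault(j, []).append((b, e2))
            b = e2 + 1
    return out

# The Figure 2.11 example, incoming header (2:5; 7:8):
print(region_broadcast_decode([(2, 5), (7, 8)]))
# {0: [(2, 3)], 1: [(4, 5), (7, 7)], 2: [(8, 8)]}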
2.3.5 Multiple Region Mask Decoding

The method to handle multiple region mask is similar to that of multiple region broadcast and multiple region stride. A region is divided by the router into several sub-regions, where each sub-region is directed to an associated output port. Not only the beginning and ending addresses but also the mask will be changed by the router to avoid duplicated receiving. An example on a binary 4-cube is given in Figure 2.13. The input header specifies six destinations, {0000, 0001, 0010, 0011, 0100, 0101}. The router forwards one copy of the message to port 0 with the header (0000:0011:0011), which defines a 2-cube, and another copy to port 1 with the header (0100:0101:0001), which defines a 1-cube.

Figure 2.13: An example of multiple region mask decoding.

2.3.6 Multiple Region Bit String Decoding

As shown in Figure 2.1(f), there are three fields in a region. The decoding algorithm is also similar to that of multiple region broadcast except for the representation of regions. The tricky part is the partitioning of a region into sub-regions. When a single output port can reach several non-contiguous regions, the router can mark those bits that correspond to unreachable nodes to 0 instead of splitting the region. The corresponding algorithm is shown in Figure 2.14.

Algorithm: multiple region bit string
Input: A region (b, e, T).
Output: Send each sub-region to an output port.
Procedure:
begin
  if (b, e, T) = EOH, send EOH to all enabled ports; exit;
  while (b ≤ e) and (T ≠ 0) do
    (j, e', T') := Route(b, e, T);
    send region (b, e', T') to port j;
    t_i := 0 if t_i is covered by T';
    b := the node corresponding to the first non-zero t_i;
  end while;
end

Figure 2.14: Multiple region bit string decoding algorithm.

Let an incoming header be (2:7:110101), which indicates that nodes 2, 3, 5, and 7 are destinations, as shown in Figure 2.15. The router enables port 0, since bits 0 and 1 in the input bit string are 1's; it enables port 1, since bits 3 and 5 are 1's.

Figure 2.15: An example of multiple region bit string decoding.
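The same split can be written for bit-string regions. The sketch below (ours, illustrative, with the same contiguous-port assumption as before) partitions one (b, e, T) region, with T given as a list of bits in which T[i] describes node b + i.

PORT_SPAN = 4   # assumed: port j reaches the contiguous nodes 4j .. 4j+3

def bit_string_decode(b, e, T):
    # Split one (b, e, T) region across output ports (Figure 2.14).
    out = {}
    i = 0
    while i <= e - b:
        if T[i] == 0:
            i += 1                                  # skip non-destinations
            continue
        j = (b + i) // PORT_SPAN                    # port of the next destination
        last = min(e, (j + 1) * PORT_SPAN - 1)      # last node this port reaches
        sub = T[i:last - b + 1]
        while sub and sub[-1] == 0:                 # a region must end on a destination
            sub.pop()
        out[j] = (b + i, b + i + len(sub) - 1, sub)
        i = last - b + 1
    return out

# The Figure 2.15 example, (2:7:110101), i.e., destinations {2, 3, 5, 7}:
print(bit_string_decode(2, 7, [1, 1, 0, 1, 0, 1]))
# {0: (2, 3, [1, 1]), 1: (5, 7, [1, 0, 1])}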
2.4 Summary

As the scale of networks gets larger and the demand for multicast communication gets higher, the overhead of the message header becomes critical, which implies that multi-address encoding is becoming critical. Several multi-address encoding and decoding schemes have been investigated and explored in this chapter. Although the emphasis is on wormhole routing networks, the proposed schemes can be applied to networks with different switching techniques. Some other network-topology-dependent encoding schemes are possible.

As indicated earlier, different encoding schemes have their own advantages and disadvantages. The choice of an appropriate encoding scheme depends on many factors, such as the network topology, network size, routing strategy, processor allocation strategy, switching technique, and the frequent multicast communication patterns. A complicated system may be able to simultaneously support different encoding schemes. In this case, another indication flit is needed to indicate which encoding scheme is used in the associated multicast message.

In this chapter, we only address the basic concepts of the various multi-address encoding and decoding schemes. When a scheme is implemented in a network, an additional level of header optimization is possible, depending on the corresponding network characteristics. In the next chapter, we illustrate how to implement each encoding/decoding scheme in a MIN and how to further optimize the header overhead.

CHAPTER 3

Multistage Interconnection Networks

Multistage interconnection networks (MINs) are a popular class of interconnection architecture for constructing scalable parallel computers (SPCs), such as the BBN TC-2000, IBM SP-1, TMC CM-5, Meiko CS-2, and NEC Cenju-3, as well as high speed networks, such as the SynOptics ATMX and DEC GIGAswitch. In such systems, processor nodes are interconnected through a MIN, a class of indirect networks. Each processor node has its own processor(s), local memory, and other supporting devices. As the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system scale up as well.

MINs can be further classified as unidirectional MINs and bidirectional MINs [21]. As indicated in [21], bidirectional MINs are essentially fat trees, such as the TMC CM-5, IBM SP-1, and Meiko CS-2. We concentrate our work on unidirectional MINs, such as the NEC Cenju-3 [18] and BBN TC-2000 [14].

The general topological structure of a MIN can be employed in a number of varying configurations. For the sake of consistency and clarity, Figure 3.1 shows a generic MIN structure that is of interest to us. The N-node MIN with N = k^n input ports and N output ports has n stages of k × k switches. The n stages are placed horizontally apart, while within each stage there are N/k switches vertically stacked.

Figure 3.1: A generic MIN structure with N = k^n input/output ports and n stages.

3.1 Switches

Switches are the basic building blocks of MINs. A k × k switch is a crossbar network with k inputs and k outputs. If each input port is allowed to connect to more than one output port, the switch is able to support a more complicated function called one-to-many or multicast communication. Unfortunately, such a design faces a critical problem: deadlock. As long as two multicast trees share two common channels, deadlock is possible. Limitations or extra hardware are needed to avoid the potential deadlock. The software approach provides an alternative for deadlock-free multicast. To achieve efficient unicast-based multicast on unidirectional MINs is the target of this study. Several network topologies are studied and specified in the following sections.

3.2 Network Topology

An N-port MIN built with k × k switches can be represented as

C_n(N) G_{n-1}(N/k) C_{n-1}(N) ... C_1(N) G_0(N/k) C_0(N)

where G_i refers to the i-th stage, C_i refers to the i-th connection, and N = k^n. There are n stages. Each stage G_i consists of N/k identical k × k switches and is thus denoted as G_i(N/k).
Each connection C_i connects the N right-hand side ports at stage G_i to the N left-hand side ports at stage G_{i-1}, and is thus denoted as C_i(N). A connection pattern C_i defines the topology of the one-to-one correspondence between G_i ports and G_{i-1} ports, also known as a permutation. There are many ways to interconnect adjacent stages.

Banyan networks are a class of MINs with the property that there is a unique path between any pair of source and destination [29]. An N-node (N = k^n) Delta network is a subclass of banyan networks, constructed from identical k × k switches in n stages, where each stage contains N/k switches. A unique property of the Delta network is its self-routing property [30]. Many of the known MINs, such as the Omega, flip, cube, butterfly, and baseline, belong to the class of Delta networks [30] and have been shown to be topologically and functionally equivalent [31]. A good survey of those MINs can be found in [32].

This work considers three major interconnection patterns between adjacent stages: baseline, butterfly, and perfect shuffle, which are formally defined below.

Definition 1 The i-th k-ary baseline permutation δ_i^k, for 0 ≤ i ≤ n − 1, is defined by δ_i^k(x_{n-1} ... x_{i+1} x_i x_{i-1} ... x_1 x_0) = x_{n-1} ... x_{i+1} x_0 x_i x_{i-1} ... x_1, where 0 ≤ x_i ≤ k − 1.

Definition 2 The i-th k-ary butterfly permutation β_i^k, for 0 ≤ i ≤ n − 1, is defined by β_i^k(x_{n-1} ... x_{i+1} x_i x_{i-1} ... x_1 x_0) = x_{n-1} ... x_{i+1} x_0 x_{i-1} ... x_1 x_i, where 0 ≤ x_i ≤ k − 1.

Definition 3 The perfect k-shuffle connection σ^k is defined by σ^k(x_{n-1} x_{n-2} ... x_1 x_0) = x_{n-2} x_{n-3} ... x_1 x_0 x_{n-1}, where 0 ≤ x_i ≤ k − 1.

Four topologically equivalent MINs are considered in this thesis: baseline, butterfly, cube, and omega. Since all of them belong to the class of Delta networks, the self-routing property allows the routing decision to be determined by the destination address. For a k × k switch, there are k output ports. If the value of the corresponding routing tag is i (0 ≤ i ≤ k − 1), the corresponding packet will be forwarded via port i. For an n-stage MIN, the routing tag is T = t_{n-1} t_{n-2} ... t_1 t_0, where t_i controls the switch at stage G_i.

Baseline MINs. In a baseline MIN [31], connection pattern C_i is described by the i-th baseline permutation δ_i^k, which is defined in Definition 1. The i-th baseline permutation puts the 0th digit into the i-th position and shifts the digits x_i through x_1 one position to the right. The perfect shuffle connection, σ, is selected as connection pattern C_n. For a given destination d_{n-1} d_{n-2} ... d_0, the routing tag is formed by having t_i = d_i for 0 ≤ i ≤ n − 1.

Butterfly MINs. In a butterfly MIN, connection pattern C_{n-i} is described by the i-th butterfly permutation β_i^k, for 0 ≤ i ≤ n − 1. As indicated in Definition 2, the i-th butterfly permutation interchanges the 0th digit and the i-th digit of the index. β_0^k is selected as connection pattern C_0. For a given destination d_{n-1} d_{n-2} ... d_0, the routing tag is formed by having t_i = d_{n-i} for 1 ≤ i ≤ n − 1 and t_0 = d_0.

Cube MINs. In a cube MIN (or multistage cube network [32]), connection pattern C_i is described by the i-th butterfly permutation β_i^k, for 0 ≤ i ≤ n − 1. C_n is selected to be σ. For a given destination d_{n-1} d_{n-2} ... d_0, the routing tag is formed by having t_i = d_i for 0 ≤ i ≤ n − 1.

Omega Network. An omega network [33] is defined by C_i = σ for 1 ≤ i ≤ n, with C_0 being the identity connection. For a given destination d_{n-1} d_{n-2} ... d_0, the routing tag is formed by having t_i = d_i for 0 ≤ i ≤ n − 1.
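Definitions 1 through 3 can be checked mechanically. The following Python sketch (ours, for illustration only) renders the three permutations on little-endian digit lists.

def digits(x, k, n):
    # little-endian base-k digits x_0 .. x_{n-1} of x
    return [(x // k**j) % k for j in range(n)]

def undigits(d, k):
    return sum(v * k**j for j, v in enumerate(d))

def baseline(x, i, k, n):
    # Definition 1: digit 0 moves to position i; digits 1 .. i shift right
    d = digits(x, k, n)
    return undigits(d[1:i+1] + [d[0]] + d[i+1:], k)

def butterfly(x, i, k, n):
    # Definition 2: digits 0 and i are interchanged
    d = digits(x, k, n)
    d[0], d[i] = d[i], d[0]
    return undigits(d, k)

def shuffle(x, k, n):
    # Definition 3: the top digit x_{n-1} rotates down to digit position 0
    d = digits(x, k, n)
    return undigits([d[n-1]] + d[:n-1], k)

# Checks in a 16-node binary network (k = 2, n = 4):
print(format(shuffle(0b0100, 2, 4), '04b'))        # 0100 -> 1000
print(format(butterfly(0b0100, 2, 2, 4), '04b'))   # 0100 -> 0001
print(format(baseline(0b0110, 2, 2, 4), '04b'))    # 0110 -> 0011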
Figure 3.2 shows the connection patterns of these MINs, which have been extensively studied in the past and have been adopted in many research prototype parallel computers, such as the Illinois Cedar [34], the Purdue PASM [35], the IBM RP3 [36], and the NYU Ultracomputer [37]. Some commercial parallel computers have also adopted such networks, such as the BBN GP-1000 (k = 4), TC-2000 (k = 8) [14], Monarch (k = 8) [38], and the NEC Cenju-3 (k = 4) [18]. Both the GP-1000 and TC-2000 use circuit switching¹. The NEC Cenju-3 adopts wormhole switching.

Figure 3.2: Four 16-node MINs built with 2 × 2 switches: (a) multistage baseline network, (b) multistage butterfly network, (c) multistage cube network, (d) omega network.

3.3 Node Architecture

All nodes in the system are interconnected via a unidirectional MIN. In this work, it is assumed that there is exactly one pair of input channel and output channel connecting a node to the network, resulting in the so-called "one-port communication architecture". This assumption, which is consistent with many existing multistage network systems, implies that the local processor must transmit (receive) messages in sequence. For ease of explanation, all output channels of the nodes are connected to the left-hand side of the network, and all input channels of the nodes are connected to the right-hand side of the network (i.e., there is a wraparound connection on the network).

3.4 Decoding in Multistage Interconnection Networks

This section considers multi-address decoding on multistage cube networks, of which the interconnection pattern is the butterfly permutation.

¹With circuit switching, reply messages, such as acknowledgments, can be sent back via the same path.
The reachable nodes from a switch are dependent on the stage and the number of output ports. In general, a switch, 50‘, can reach 10"“ nodes, where k is the switch size. For example, switches 5'10 and 5'20 in Figure 3.3 can reach 4 and 8 nodes. respectively. Because of space limitation for figures, I: is assumed to be ’2 in the following study unless otherwise specified. When there are alternative paths between the source and destination nodes, such as extra stage MINS, the switch must handle the header carefully to avoid duplicated receiving. Figure 3.4 gives an example of such a situation where source node 0 e11- codes its destination by multiple region stride (2:10:22; 12133), which indicates the destinations are nodes 1, ‘2, 4, 6, 7, 8, 10, and 13. The shadow switch routes the first stride to upper port and the second stride to lower port. The following switches route each stride to its associated destinations. Hence, nodes 4 and 10 receive two copies of the message. Note that this situation only happens when alternative paths exist. In Delta networks, the switch may forward the same address to all enabled output ['in 1”) C4 G3 C3 C7 C2 G] C] Co Co 0000 00 01 :0: 530 520 510 500 :3? S31 521 511 501 0011 l— —3 3:3? 532 L\\ X IF 522 512 x 502 :2 0110 533 523 f] \—l 513 503 —_i.'.'§.'." 01 l l —*._'I.‘ :83? S34 524 514 504 :Igrf: we \ f X 41-... 535 525 515 505 101 l _/ —‘l.lv‘ “3? S36 A 526 516 506 :1: “10 537 J \— 527 J 517 507 #2114." n1163>—————~ __{p§ Figure 3.3: A 16 X 16 cube network built with ‘2 x ‘2 switches. 0 \« ——19 I ‘- \ "'- -"..1 _~.‘ 2 ‘- £2} 3 ——-3 4 --<35 5 ——-5 6 1236. 7 -£i> 8 2282‘» 9 9 10 in? 11 ——-n 12 -——12 13 --fl3¥ 14 ~——14 15 -——15 iii-.3 destination ‘11-...3' destination received duplicated message Figure 3.4: A duplicated receiving when there are alternative paths, in an extra stage MIN, between source and destination nodes. 49 ports since this address is only reachable by one of these output ports. An unreachable address is used to indicate a wormhole bubble and is represented as a shadow square in the following figures. The destination subsets may be overlapped because the routing path is unique. Such an approach provides more flexibility in forming the header. The reachable nodes of a switch are in ascending order according to its output ports. In other words, a single region will not be split to multiple regions for any single output port. Hence, a counter, counter, is used to specify the number of regions in a multicast message rather than EOH to reduce the length of header for such a network. For each of the six n'iulti—address encoding schemes. this section will describe some address decoding algorithms that may be performed at a switch in a MIN. Given a header flit, say F, the routing function [fault/(F) will determine the output port number to be selected to forward the message. If F is not reachable by the current switch, this function returns -1. Furthermore, we assume that if the processor, if any, associated with a. switch is one of the destinations, a copy of the message will be sent to the processor. An output port is enabled if a copy of the message is to be forwarded through the port. At the end of message transmission, all output ports will be disabled. These behaviors will not be further described in the following algorithms. 3.4.1 All Destination Decoding Figure 3.5 gives the decoding algorithm for each incoming flit, d, which is an address. 
In Delta networks, the switch may forward the same address to all enabled output ports, since this address is reachable through only one of these output ports. An unreachable address is used to indicate a wormhole bubble and is represented as a shadow square in the following figures. The destination subsets may overlap because the routing path is unique. Such an approach provides more flexibility in forming the header. The reachable nodes of a switch are in ascending order according to its output ports. In other words, a single region will not be split into multiple regions for any single output port. Hence, a counter, counter, is used to specify the number of regions in a multicast message rather than an EOH, to reduce the length of the header for such a network.

For each of the six multi-address encoding schemes, this section will describe some address decoding algorithms that may be performed at a switch in a MIN. Given a header flit, say F, the routing function Route(F) will determine the output port number to be selected to forward the message. If F is not reachable by the current switch, this function returns −1. Furthermore, we assume that if the processor, if any, associated with a switch is one of the destinations, a copy of the message will be sent to the processor. An output port is enabled if a copy of the message is to be forwarded through the port. At the end of message transmission, all output ports will be disabled. These behaviors will not be further described in the following algorithms.

3.4.1 All Destination Decoding

Figure 3.5 gives the decoding algorithm for each incoming flit, d, which is an address. The associated output channel must be available when sending information through an enabled port; otherwise, the message transmission will be pending until the port is available. Avoiding deadlock is another critical design issue. The solution is dependent on the routing strategy and is beyond the scope of this chapter.

Algorithm: all-destination decoding
Input: An address d.
Output: Append d to the header of all enabled output ports.
Procedure:
begin
  if d = EOH then send EOH to all enabled ports; exit;
  /* EOH is used since a counter would not decrease the length of the header in this algorithm. */
  j := Route(d);
  if (j ≥ 0) then enable port j;
  send d to all enabled ports;
  /* d is a useful address to the message directed to port j and is a useless address (a bubble) to the other enabled ports. */
end

Figure 3.5: All destination decoding algorithm.

For example, let the source node be 0 and the destinations be {2, 3, 5} in an 8 × 8 MIN. The modification of the header at each switch is shown in Figure 3.6, where E indicates the EOH. Switch S20 receives the header (2,3,5,E). It enables its upper output port when it detects address 2, and it enables its lower output port when it detects address 5. The new headers for the upper port and lower port are (2,3,5,E) and (5,E), respectively. Address 5 in the first header is a wormhole bubble and is marked as a shadow square in the figure to indicate that it is not reachable by the first message.

Figure 3.6: An example of all-destination decoding.

3.4.2 Bit String Decoding

Buffered Bit String Decoding

In this scheme, a switch stores the entire bit string and then detects each bit which has a 1 to enable the corresponding output port. The length of each bit string depends on the location of the switch. A switch at stage G_i can reach only k^{i+1} destinations. Therefore, the bit string length is k^{i+1}, where the first k^i bits correspond to the nodes reachable from the first output port, the second k^i bits indicate the second output port, and so on. Let the input bit string be T = t_0 t_1 ... t_{k^{i+1}-1}, where t_j is a single bit indicating the j-th reachable node of the switch. The buffered bit string decoding algorithm is given in Figure 3.7. Figure 3.8 gives an example where the source node is 0 and the destinations are {2, 3, 5} in an 8 × 8 MIN.

Algorithm: buffered bit string decoding for a switch in stage G_i
Input: Binary bit string, T.
Output: Bit string, D_ℓ, for each enabled output port ℓ.
Procedure:
begin
  for ℓ = 0 to k − 1 do
    (E_ℓ, D_ℓ) := Route(next k^i bits);
    /* Here Route returns 1 to E_ℓ if the OR of these k^i bits is 1; otherwise E_ℓ = 0. D_ℓ is a string containing these k^i bits only. */
  endfor;
  for all E_ℓ = 1, where 0 ≤ ℓ ≤ k − 1, do
    enable port ℓ; send D_ℓ to port ℓ;
  endfor;
end

Figure 3.7: The bit string decoding algorithm for a MIN.

Figure 3.8: An example of buffered bit string decoding.

Hierarchical Bit String Decoding

The first field (k bits) of an incoming bit string indicates the enabling or disabling of the k output ports of a k-port switch in the hierarchical bit string encoding/decoding scheme. As in the previous chapter, the following bits are divided into k fields, one for each output port; the i-th field is forwarded to the associated switch in the next stage if port i is enabled, and is eliminated otherwise. The corresponding example is shown in Figure 3.9.

Figure 3.9: An example of hierarchical bit string decoding.
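Returning to the buffered variant, the stage-dependent bit-string lengths make the algorithm of Figure 3.7 easy to express. The sketch below (ours, illustrative, for k = 2) splits the bit string at a stage-G_i switch into the per-port blocks.

K = 2   # 2 x 2 switches

def buffered_bit_string_decode(T, i):
    # T has K**(i+1) bits at a stage-G_i switch; the j-th block of K**i
    # bits belongs to output port j (Figure 3.7).
    block = K ** i
    out = {}
    for j in range(K):
        D = T[j * block:(j + 1) * block]
        if any(D):                 # enable port j only if some bit is 1
            out[j] = D
    return out

# Stage G_2 of an 8-node MIN with destinations {2, 3, 5}:
print(buffered_bit_string_decode([0, 0, 1, 1, 0, 1, 0, 0], 2))
# {0: [0, 0, 1, 1], 1: [0, 1, 0, 0]}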
3.4.3 Multiple Region Broadcast Decoding

The header in multiple region broadcast contains several regions, as given in Figure 2.1(c). All nodes covered by the regions should receive a copy of the message. Thus, the switch may have to divide a region into several sub-regions, and each sub-region is directed through an appropriate output port. A region counter, counter, indicating the number of regions in the header is used in this scheme to reduce the length of the header. A switch saves the counter when it is enabled by the incoming header. It decreases the counter by one when it processes one region, puts the counter in an outgoing header when the corresponding port is newly enabled, and disables the decoding procedure when the counter becomes 0. The algorithm to decode a region in a multiple region broadcast header is given in Figure 3.10, where the processing of counter is omitted.

Algorithm: multiple region broadcast
Input: A region with the beginning address, b, and the ending address, e.
Output: Enable the associated output ports and revise the header.
Procedure:
begin
  while (b ≤ e) do
    (j, e') := Route(b, e);
    send region (b, e') to port j;
    b := e' + 1;
  end while;
end

Figure 3.10: Multiple region broadcast decoding algorithm.

As given in the algorithm, the switch checks the beginning address of a region and searches for the first address that cannot be reached through the same output port. A sub-region ending at the last reachable address, e', is then identified for that output port. The process repeats until all sub-regions have been identified. For example, consider source node 0 sending a message to nodes 1, 3, 4, 5, and 6, as shown in Figure 3.11. Switch S20 directs the first region to port 0 and splits the second region (3:6) into (3:3) and (4:6) for ports 0 and 1, respectively. The region counter sent to port 1 is updated to 1. This process is repeated at each switch, and the message is directed to all destinations as shown in the figure.

Figure 3.11: An example of multiple region broadcast decoding.

3.4.4 Multiple Region Stride Decoding

The decoding scheme to handle multiple region stride is similar to that of multiple region broadcast. A switch divides each region into several sub-regions, where each sub-region is directed to a single output port. As indicated in Figure 2.1(d), there is an additional field, the stride, in each region to indicate the distance between two adjacent nodes. The decoding algorithm is similar to Figure 3.10, except that it replaces (b, e) and (b, e') by (b, e, s) and (b, e', s), respectively. Note that the stride value is never changed.

Consider the example in Figure 3.12, where source node 0 sends a message to all odd nodes. Switch S20 splits the region (1:7:2) into (1:3:2) and (5:7:2) for ports 0 and 1, respectively. This process is repeated until the message reaches all destinations, as shown in the figure.

Figure 3.12: An example of multiple region stride decoding.
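Since the reachable nodes of each output port form an ascending contiguous range, a stride region splits at a single boundary address. The illustrative Python sketch below reproduces the S20 split of Figure 3.12; the boundary parameter is the highest node reachable through the lower-numbered port.

def stride_split(b, e, s, boundary):
    # Split stride region (b:e:s) at 'boundary': addresses <= boundary stay
    # on one port, the rest go to the other; the stride value never changes.
    if e <= boundary or b > boundary:
        return [(b, e, s)]
    e1 = b + ((boundary - b) // s) * s      # last address on the low side
    return [(b, e1, s), (e1 + s, e, s)]

# Switch S20 of an 8-node cube MIN: port 0 reaches nodes 0..3
print(stride_split(1, 7, 2, 3))   # [(1, 3, 2), (5, 7, 2)]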
3.4.5 Multiple Region Mask Decoding

Since the cube MIN is a cube network, the multiple region mask may be applied to it, where each stage represents a dimension. A region mask (b, m) can define a complete subcube, and a region mask (b, e, m) can specify a subset of a subcube in such a network. Figure 3.13 shows an example of defining a subcube in a MIN. The subcube contains nodes {4, 5, 12, 13} (or {0100, 0101, 1100, 1101}), which can be specified either as (x10x:1001) or as (0100:1101:1001). If node 1101 is not a destination, the first scheme requires two regions, but the second one can specify the destinations by (0100:1100:1001).

Figure 3.13: Defining a subcube in a MIN.

The method to handle the region mask is similar to that of multiple region broadcast and multiple region stride. A region is divided by the switch into several sub-regions, where each sub-region is directed to an associated output port. Not only the beginning and ending addresses but also the mask will be changed by the switch to avoid duplicated receiving. An example handling the region mask (001:101:101) is given in Figure 3.14.

Figure 3.14: An example of multiple region mask decoding.

3.4.6 Multiple Region Bit String Decoding

There are three fields in a region, as shown in Figure 2.1(f). The decoding algorithm is also similar to that of multiple region broadcast except for the representation of regions. Unlike the multiple region bit string decoding specified in the previous chapter, it is not necessary to mark off an unreachable bit to avoid duplicated receiving, since each destination is reachable by a unique path. The corresponding algorithm to decode one region bit string, which is repeated to decode the whole header, is shown in Figure 3.15.

Algorithm: multiple region bit string
Input: A region (b, e, T).
Output: Send each sub-region to an output port.
Procedure:
begin
  if (counter ≤ 0) then exit;
  while (b ≤ e) and (T ≠ 0) do
    (j, e', T') := Route(b, e, T);
    send region (b, e', T') to port j;
    b := the node corresponding to the first non-zero t_i after node e';
  end while;
  decrease counter by one;
end

Figure 3.15: Multiple region bit string decoding algorithm for a MIN.

Figure 3.16 gives an example of multiple region bit string decoding. Source node 0 sends a message to nodes 2, 3, and 5. The header incoming to switch S20 is (2, 5, 1101), where 1101 is the bit string. S20 splits this region into two regions, (2, 3, 11) and (5, 5, 1), for port 0 and port 1, respectively. Both regions are forwarded by switches S10 and S12. The first region is further split by S01 to direct the message to nodes 2 and 3. The second region is forwarded to node 5 by switch S02.

Figure 3.16: An example of multiple region bit string decoding.

3.5 Performance of Multi-Address Encoding and Decoding on Multistage Interconnection Networks

In this section, we evaluate the performance of the proposed schemes on different address patterns. Such performance depends on both the number of destinations and the distribution of destinations. Here, a 256 × 256 cube MIN is considered. Since an address requires 8 bits, a flit is assumed to contain 8 bits. A region may require 2 or 3 flits. The destination sets of the test patterns are specified as follows:

Pattern 1: {1:12:1, 32:50:2, 60:75:1, 90:117:3}.
Pattern 2: {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 64, 72, 79, 85, 90, 94, 97, 99}.
Pattern 3: node 1 to node 48.
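These pattern specifications can be expanded programmatically, which also verifies the destination counts quoted in the analysis below (the sketch is ours, for illustration):

def expand(spec):
    # Expand a destination-set specification given as (b, e, s) stride
    # triples or explicit addresses.
    nodes = set()
    for item in spec:
        if isinstance(item, tuple):
            b, e, s = item
            nodes |= set(range(b, e + 1, s))
        else:
            nodes.add(item)
    return nodes

pattern1 = expand([(1, 12, 1), (32, 50, 2), (60, 75, 1), (90, 117, 3)])
pattern2 = expand([1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 64, 72,
                   79, 85, 90, 94, 97, 99])
pattern3 = expand([(1, 48, 1)])
print(len(pattern1), len(pattern2), len(pattern3))   # 48 18 48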
Pattern 1 contains 48 nodes. The all-destination scheme requires 48 flits to carry all the addresses. The buffered bit string and hierarchical bit string schemes require 256 and 510 bits, or 32 and 64 flits, respectively. Twenty-eight of these destinations are located in two contiguous regions and the others are located separately; therefore, the multiple region broadcast requires 22 regions. The multiple region stride outperforms the others since it can encode these destinations with 4 regions. The multiple region mask encodes the first 38 destinations into 3 regions, (00000001:00001100:00001111), (00100000:00110010:00111110), and (00111100:01001011:01111111). The Hamming distance between adjacent members of the last 10 destinations is no longer 1; therefore, the multiple region mask is unable to group them into a single region mask and encodes them into 10 regions. Hence, it has 13 regions. Since only the distance between nodes 12 and 32 is larger than 16 (the length of two addresses), the multiple region bit string encodes these destinations into 2 regions, where the first region contains nodes 1 to 12 and the second region consists of all the other destinations.

There are 18 destinations in the second pattern. Therefore, the all-destination scheme needs 18 flits to carry the addresses. Since there is no contiguous region or regular stride, the multiple region broadcast takes 18 regions, where each region has one destination, and the multiple region stride requires 9 regions, where each region contains two destinations. These destinations form four 1-cubes, (1,3), (64,72), (90,94), and (97,99); therefore, the multiple region mask has 14 regions. The region bit string encodes all the destinations into a single region, since the distance between any two neighboring nodes is less than the length of two addresses. It takes 15 flits: one for the beginning address, one for the ending address, and the other thirteen for the bit string. Pattern 3 is easily encoded and is omitted here. The lengths of the headers of these schemes for the three patterns are summarized in Table 3.1.

Table 3.1: Number of flits in a header based on various multi-address encoding schemes. A header consists of a counter and addresses.

  Encoding scheme              Pattern 1   Pattern 2   Pattern 3
  all-destination                  49          19          49
  buffered bit string              32          32          32
  hierarchical bit string          64          64          64
  multiple region broadcast        45          37           3
  multiple region stride           13          28           4
  multiple region mask             40          43           4
  multiple region bit string       19          14           8

A naive way to implement multicast is to send multiple unicast messages, one for each destination. This approach has been implemented in much communication software and is named separate addressing [39]. This approach has a fixed header length of one flit for each destination, and no counter flit is needed. However, separate addressing requires the source node to send m × (ℓ + 1) flits, where ℓ is the length of the multicast message and m is the number of destinations. The number of flits sent by the source node for patterns 1, 2, and 3, based on different message sizes and the different encoding schemes, is given in Figure 3.17(a), (c), and (e), respectively.

To compare the proposed schemes with separate addressing, the reduction rate, R_red, is defined as one minus the ratio of the number of flits sent in a proposed scheme to the number of flits sent using separate addressing:

  R_red = 1 − (number of flits sent in a multi-address encoding scheme) / (number of flits sent in separate addressing)

Hence, a larger reduction rate, R_red, indicates a better scheme.
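The flit counts behind R_red can be sketched as follows, assuming, as in the text, that separate addressing sends m copies of an ℓ-flit message with a one-flit header each, while a multi-address scheme injects the message once behind its longer header:

def flits_separate(m, msg_len):
    # Separate addressing: the source sends the message once per destination,
    # each copy carrying a one-flit header.
    return m * (msg_len + 1)

def reduction_rate(header_flits, m, msg_len):
    sent = header_flits + msg_len        # one copy with a multi-address header
    return 1 - sent / flits_separate(m, msg_len)

# Pattern 1 (48 destinations), multiple region stride header of 13 flits,
# 64-flit message:
print(round(reduction_rate(13, 48, 64), 3))   # 0.975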
The reduction rates for patterns 1, 2, and 3 are shown in Figure 3.17(b), (d), and (f), respectively.

Figure 3.17: The length of header and reduction rate on different destination patterns vs. different multi-address encoding/decoding schemes. (a), (c), (e): flits sent by the source node for patterns 1, 2, and 3; (b), (d), (f): reduction rate for patterns 1, 2, and 3.

The performance of the separate addressing scheme depends on both the number of destinations and the length of the message. In order to complete a multicast, it needs to transmit the message m times, where m is the number of destinations. However, unless the message is short and the number of destinations is small, its performance is worse than that of any proposed multi-address encoding/decoding scheme. The performance gap between separate addressing and the proposed schemes increases as the message size increases and/or the number of destinations increases. The performances of the various multi-address encoding schemes become closer to one another when the message size increases, which implies that the data portion of the message comes to dominate.

It is easy to observe that there is no single multi-address encoding/decoding scheme that can outperform all others on all destination patterns. For example, the multiple region stride has a much shorter header than the others if all destinations have a constant stride. In this case, the worst multi-address scheme is the hierarchical bit string. However, the hierarchical bit string becomes better than all-destination when, for example, there are more than 64 destinations, and better than multiple region broadcast when there are more than 32 regions.

It is also easy to observe that the length of the header in most multi-address encoding/decoding schemes depends not only on the number of destinations but also on the distribution of destinations. The all-destination encoding scheme depends on the number of destinations only. However, its performance becomes worse as the number of destinations increases. On the other hand, the buffered bit string and hierarchical bit string depend on the number of nodes in the network but not on the number of destinations.

The multiple region stride is a superset of multiple region broadcast. Hence, the number of regions in the multiple region stride is always less than or equal to the number of regions in the multiple region broadcast. Since a stride region takes 3 flits while a broadcast region takes 2, the multiple region stride has a shorter header when its number of regions is less than 2/3 of that in the multiple region broadcast. The multiple region mask is also a superset of multiple region broadcast, since a single contiguous region in a cube MIN can always form a subset of a subcube. However, it is not a superset of the multiple region stride encoding scheme.

3.6 Summary

In this chapter, we have given a brief review of delta class MINs, followed by optimized multi-address encoding and decoding schemes on the multistage cube network.
Since the baseline, butterfly, cube, and omega MINs are topologically equivalent [31], the optimized multi-address encoding and decoding schemes may be applied to the baseline, butterfly, and omega MINs directly.

The hardware multicast tree injects fewer flits into the network, which implies a decrease in both the network load and the communication latency. The reduction rate of the network traffic becomes larger as the size of the message increases. A multi-address encoding and decoding scheme may outperform the other schemes for some traffic patterns. However, the performance gap decreases as the size of the message increases. Hence, a system may choose a single multi-address encoding and decoding scheme if supporting all multi-address schemes is expensive.

Since a message destined for multiple destinations will reduce the network traffic and communication latency, the implementation of a hardware multicast tree becomes the target of our study and is discussed in the next chapter.

CHAPTER 4

Hardware Multicast in Wormhole-Switched MINs

The hardware multicast implementation offers significantly better performance than the software approach; however, it also suffers from potential deadlock due to multiple multicasts. Two hardware multicast approaches have been introduced: the path-based and the tree-based. We will show that tree-based hardware multicast, the multi-head worm, is more suitable for a MIN than path-based hardware multicast, which is more suitable for direct networks in avoiding deadlock.

In order to provide efficient deadlock-free hardware multicast, two different tree-based multi-head worm implementations, asynchronous and synchronous, are studied. The asynchronous multi-head worm allows each branch to forward independently, while the synchronous multi-head worm insists on having all branches forward synchronously. Unfortunately, neither implementation is deadlock-free unless certain rules are applied. Hence, current hardware implementations of multicast either exhibit some undesirable properties or are restricted in their use. To implement deadlock-free multiple asynchronous multi-head worms is very difficult, if not impossible. A deadlock-free and starvation-free hardware implementation supporting multiple synchronous multi-head worms is presented in this chapter.

Unlike a unicast message, whose header has a fixed size, the size of a multicast message header depends on the number of destinations, the distribution of destinations, and the encoding scheme [40, 41]. Since a single flit may not be able to carry all the destination information, a multi-head worm may have different branch distances toward different destinations. The difference in latency among multiple branches may cause deadlock in synchronous multi-head worms. We address this issue and propose a pseudo multi-address encoding scheme to eliminate the potential deadlock.

4.1 Implementation of Multi-head Worms

The path-based hardware multicast, which was shown to easily avoid deadlock in direct networks [25], can easily cause deadlock in banyan networks due to self blocking. Figure 4.1 gives an example of such a situation, where the source is node 0 and the destinations are nodes 4, 8, and 9 on a 16-node multistage cube network. We further assume that there is one buffer for each outgoing link, shown as a blank square. A router replicates an incoming flit and then forwards one copy to the processor and the other copy to the network.
Three paths are required to complete the path-based multicast: from node 0 to node 4, from node 4 to node 8, and from node 8 to node 9. However, the latter two paths share the channels between switches S24 and S14 as well as between switches S14 and S04, marked as a shadowed area. As indicated in the figure, the first two paths, shown as bold lines, are established completely. When the first flit heads from switch S30 to switch S24, it is blocked, since the upper buffer in switch S30 is used to store flit 5 unless the message is very short. Thus, the path forms a cycle due to self blocking and results in a deadlock. Due to the unique-path routing property of banyan networks, it is impossible to avoid this deadlock unless extra stages are added or there are multiple virtual channels per physical channel.

Figure 4.1: An example of deadlock for a path-based multicast on a 16-node multistage cube network.

Setting the potential deadlock aside, hardware multicast provides several advantages over the software implementation, such as sharing of common resources, higher throughput, lower latency, and less network traffic. Hence, some systems restrict themselves to one multicast at a time (e.g., the control network in the TMC CM-5), which results in low network utilization and longer delay. An implementation that eliminates such a restriction and guarantees a deadlock-free network is the objective of this chapter.

We concentrate our work on tree-based hardware multicast. Establishing a multicast tree on a banyan MIN is straightforward due to the unique path property; the multicast tree is also unique. In addition to unicast, a switch must be capable of replicating the incoming flit(s) and forwarding them to the associated output port(s). Since every outgoing port can be either used (enabled) or unused (disabled), there are (2^k − 1) possible output port combinations on a k × k switch for a multi-head worm. When a header is received, the switch enables the associated outgoing port(s) and forwards the header to all enabled port(s). All following flits are forwarded to all enabled outgoing ports directly. The buffers and channels are released after the tail flit passes through the switch.

As an example, consider establishing a multicast tree initiated by node 2 and heading for nodes {3, 4, 5, 6, 7} on an 8-node cube network with 2 × 2 switches; snapshots of the establishment are shown in Figure 4.2. When switch S22¹ receives the header from its upper incoming port, it replicates and forwards the header to both buffers, since destination 3 requires the upper channel and the other destinations need the lower channel (see Figure 4.2(a)).

¹In the following context, S_{ij} and S_{i,j} indicate the switch at stage G_i and row j. S_{i,j} is used when there is potential confusion between i and j.
Deadlock is possible if the network allows simultaneous transmission of multiple multicasts. Depending on the coordination mechanism of multiple branches in a tree (or multiple heads in a multi-head worm), two different multi—head worm implemen- tations are described below, which provide different degrees of difficulty in handling deadlock—free multiple multicasts. The headers of different branches in an asynchronous multi—head worm are for- warded independently. When any branch is blocked, other branches may still be forwarded if the required outgoing channel is available. However, every flit on the blocked path will remain in the current switch. The buffers of the forwarding branches may become empty if the subsequent flits are blocked. The same example as the previous one is given in Figure 4.3 except that there exists a unicast from node 7 to node 5. The first cycle is similar to Figure 4.2(a), and the second cycle of multicast tree establishment is shown in Figure 4.3(a). The branch toward nodes 4 and 5 is blocked at switch 312 since the buffer to node 5 is used 71 by the unicast. All following flits on the branch are stopped. The first data flit toward node 3 is forwarding since it is not on the blocked path, but the one toward nodes 6 and 7 is stopped since it shares the same channel used by the blocked branch. In the first cycle after the blocking, the header flit is forwarded to switches 30, and 303. The data flit located at the upper buffer in switch 322 forwards to switch 310 while the one at the lower buffer remains in the current switch. All subsequent flits are kept at the source node since one buffer in 322 is unavailable as shown in Figure 4.3(b). The next cycle is shown in Figure 4.3(c). The paths to nodes 3, 6 and 7 are established but not the paths to nodes 4 and 5. Nodes 3, 6, and 7 receive the header flit; in the subsequent cycle, node 3 will receive the first data flit but nodes 6 and 7 will not. After that, node of these nodes will receive any flit until the unicast is done and the paths to nodes 4 and 5 are established. (a) The second cycle (b) The third cycle (c) The fourth cycle C3 G2 C; G, C, Co C0 C3 02 C; G, C, Go C0 C3 02 C2 G, C, G, CO l—'l l—l O‘UI-bbJN—‘O \lChUI-BUJN-‘O \IO‘LIIJiLfiN—‘C \lO‘MbWN—‘O \lO‘UlbbJN—‘O 7 flit used by unicast E wormhole bubble El header flit [3 first data flit Figure 4.3: Establishment of an asynchronous multi-head worm with network con~ tention. The empty buffers in a multicast tree are referred to as wormhole bubbles since they contain no data and are unusable to other transmission. Wormhole bubbles may cause deadlock. Figure 4.4 gives an example of such a situation. Node 0 initiates 72 a multi-head worm, To, which is destined for nodes {‘2, 4, 6, 8, 10, 1‘2, 14} when a unicast from node 4 to node 1 is on the network. Hence, the branches to nodes {‘2, 4, 6, 8} are blocked at switch 530, and the other branches are completely connected to the associated destinations. While To is waiting for the completion of the unicast, node 6 starts a new multi-head worm, T6, which is destined for nodes {5, 7, 13, 15}. The branches heading to nodes 5 and 7 are established, while the branches to nodes 13 and 15 are stopped at switch 326 by To. This situation is shown in the same figure. When the unicast is done, To will grab additional channels to reach node 2 but will wait for T6 to release switch 812. Hence, a deadlock is formed. C4 G3 C3 62 C2 G] Cl Go C0 _ ._ [_— O l— l— [‘9 1 L IT L—\ [— Lr—Z l— L F r L . r— :0 . 1 . 
L €513 I" E wormhole bubble buffer by unicast header flit II] first data flit second data flit Figure 4.4: An example of a deadlock situation with two multi-head worms, where the bold line represents the multi—head worm To and the dotted line stands for T6. It is easy to observe that establishing a deadlock-free asynchronous multi—head worm is very difficult, if not impossible, without any restriction and extra hardware. This is due to the fact that each branch forwards independently, and a switch can 73 not forward the next flit until all the previous flits are forwarded. 4.2 The Synchronous Multi-head Worm Unlike the asynchronous multi-head worm, the synchronous worm has a strong coor- dination of different branches. All headers must be forwarded synchronously. W hen- ever one branch is blocked, every branch is blocked. Since there is no path on which headers of different branches can exchange status, the information is sent backward to the source node. The source node will then push the whole multicast tree one flit forward if all requested buffers are free. Such coordination must be efficiently implemented in hardware to minimize the latency. A dedicated wire is established to the source node from a switch as soon as the switch is obtained by the multi-head worm. Wires from different branches are connected together like a reduction tree. The connection from different branches can be based on wired—OR (or wired—AND) logic. Thus, the signal can be immediately sent to the source, and the signal from the source can be immediately sent to all branches. A similar design was used in the nCUBE—‘Z to support hardware broadcast within a subcube [123] and in the Cray T3D in their synchronization network design [42]. Figure 4.5 gives the same example as shown in Figure 4.3 but for a synchronous multi-head worm. Since all request buffers are available, the header flit is forwarded to switch 322 and then switches 310 and 813, as illustrated in Figure 4.5(a) and (b), respectively, However, the lower buffer in switch .512 is used by another transmission. Switch 312 gets a negative signal which is passed backward to the source. The source 74 informs every branch to stay in the current state, as shown in Figure 4.5(b). (a) The first cycle (b) The second and following cycles C, G, C, G, C, G, C0 C, G, C, G, C, G, CO O‘UthUJN—‘C \JONMhUJN—‘O \lGUI-fiUJN—‘O \l flit used by unicast El header flit El first data flit Figure 4.5: Establishment of a synchronous multi—head worm with network con- tention. Due to a strong coordination of headers, it may mislead us into concluding that the multiple synchronous multi-head worms are deadlock-free. Let us consider an instance as shown in Figure 4.6 where the source nodes ‘2 and 8 initiate multi-head worms, T2 and T3, to all odd nodes and all even nodes, respectively. C4 G3 C3 Gz C2 G] C, Co Co ®D——W-—--cW---s liq—0 —‘ 1" x” ‘—l on ' L ‘e [— L ‘,/— |_ 1_——2 o- I I— \/ r‘ F I“ 3 o. I L\\ l— \ _ .7 L_4 g: , JYZH H/L 5:: 0. F F F—7 (DU- lttf Kip- - - l_,——8 3: l: LJ L "f ,_>4 8 C-minéBaseline) -<3---- {33 SA Baseline) ---x-- -' C-min (Butterfly --u-- SA (Butterfly v V V v v I L I l l l 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Load Figure 4.8: Comparison among hardware and software implementations. A common channel to reach different destinations can be shared in a multicast tree; therefore, we expect that the performances on different networks will be similar to each other. 
Figure 4.9 gives the latency for different sized switches on various networks when there are 8 nodes in each cluster. It not only confirms our expectation but also shows that the performance is independent of the switch size. The latter claim becomes false when the number of nodes in a cluster is smaller than the size of the switches on the butterfly network: external interference among different clusters occurs. Figure 4.10 confirms this observation. With 4 nodes in each cluster on a butterfly network with 8 × 8 switches, every two multicasts share one channel in one of the connections, which forces the latency to increase rapidly. Such a situation will also occur on the baseline network when the size of the network becomes larger [43].

Figure 4.9: Latency comparison of different switch sizes and different numbers of destinations on a cube network.

In the following evaluation, 4 × 4 switches are used as the basic network unit unless otherwise specified. In the hardware implementation, we expect the latency to be independent of the number of destinations when the network load is light. This is confirmed in Figure 4.11, which gives the latency on various networks with a different number of nodes in each cluster when the region broadcast is used to encode the header. Surprisingly, the latency decreases as the number of nodes increases. This is due to the utilization of the outgoing channels used by a multi-head worm: there are (m − 1) channels used while a cluster contains m nodes. As the number of nodes decreases, the utilization decreases, which implies that the latency increases.

Figure 4.10: Comparison on a butterfly network with different sized switches and a different number of destinations.

Figure 4.11: Latency comparison of the different number of nodes in each cluster.

While the pseudo all-destination encoding is more flexible than the region broadcast, its latency increases faster than that of the region broadcast, as shown in Figure 4.12. The region broadcast has smaller latency as the load becomes higher and the number of nodes becomes greater. There are several reasons for this phenomenon. First, the routing cycle of the pseudo all-destination is linear in the number of destinations plus the number of stages, while the region broadcast is linear in the number of stages but not in the number of destinations. Second, the region broadcast has a lower blocking rate since it occupies each stage in an all-or-none fashion; the pseudo all-destination may block more channels while waiting for the blocked channels to be released.

Figure 4.12: Comparison between the region broadcast and pseudo all-destination encoding schemes.
4.5 Summary

In this chapter, a deadlock-free and starvation-free hardware multicast implementation is proposed for a unidirectional wormhole-switched network. A pseudo address scheme is presented to prevent deadlock while multi-address encoding/decoding is considered. Based on the performance, the hardware multicast is highly recommended and worth the investment, even though a switch supporting the synchronous multi-head worm is more complicated. Although the pseudo all-destination encoding has higher flexibility, a single region header is highly suggested; all branches can then be forwarded in the same stage, and the latency is shorter than with other approaches. In other words, processors should be allocated based on certain relations, such as cube, stride, and contiguous region, if possible. The multistage cube and omega networks are intertask contention-free as long as the processors are allocated in a cube (as shown in Chapter 5), which implies that the cube and omega networks are a better choice than others for constructing unidirectional MINs.

CHAPTER 5

Network Partitionability and Traffic Localization

Since the distance between any two nodes on a delta class MIN is fixed, different processor allocations will not affect the communication overhead in terms of path length. Instead, such overhead depends on the degree of channel contention and partitionability. Channel contention occurs whenever two or more transmissions are directed to the same output port or are competing for the same channel in a switch at any stage. If these transmissions are from different applications, it is called intertask contention (external contention); otherwise, it is called intratask contention (internal contention).

To avoid intertask contention caused by common destinations or channels, an SPC usually allocates an exclusive subset of processors, called a processor cluster or cluster, to each individual application or task. Intratask contention for a unicast communication can be minimized by flexible resource allocation that takes into account mapping and reconfiguration [44]. For multicast or broadcast communication in a cluster, the intratask contention is unavoidable unless at most one multicast communication is allowed at a time. Hence, we concentrate on the network partitionability to eliminate the intertask contention in this chapter. Before discussing the methodology of partitioning delta class MINs, we would like to show the influence of intertask contention first.

5.1 Influence of Traffic Localization

In addition to the computation of a parallel task, communication is required to achieve certain functions, such as exchanging information, barrier synchronization, etc. An application alternates between communication and computation phases. Let T_comm stand for the total communication time and T_comp represent the total computation time. The total execution time, T, of an application is equal to T_comp + T_comm.

For the sake of simplicity, let the average network contention rate be c. Since the network is not available to a communication during contention, the actual communication time T'_comm can be estimated by

    T'_comm = T_comm / (1 - c)                                   (5.1)

where T_comm is the communication time without network contention. Hence, the total execution time of an application can be estimated by the following equation.

    T = T_comp + T_comm / (1 - c)                                (5.2)
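A minimal worked example of Equations 5.1 and 5.2 follows (a sketch; the timing values are assumed for illustration only and are not measurements from the dissertation).

def comm_time(t_comm, c):
    """Equation 5.1: T'_comm = T_comm / (1 - c)."""
    return t_comm / (1.0 - c)

def exec_time(t_comp, t_comm, c):
    """Equation 5.2: T = T_comp + T_comm / (1 - c)."""
    return t_comp + comm_time(t_comm, c)

if __name__ == "__main__":
    t_comp, t_comm = 700.0, 300.0                    # assumed workload times
    for c in (0.0, 0.2, 0.5):
        print(c, exec_time(t_comp, t_comm, c))       # 1000.0, 1075.0, 1300.0
    # Only the contention rate c varies here, so localizing traffic
    # (reducing c) directly shortens the execution time.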
Since the processing power and communication capacity are fixed after a machine is built, T_comp and T_comm are close to constant values for an application with the same computing and communication requirements. The network contention rate is the only variable affecting the total execution time based on the simple estimation in Equation 5.2. Hence, decreasing the intertask and/or intratask contention will reduce the network contention rate as well as the execution time. In other words, enforcing traffic localization will improve the performance of a system.

To examine the influence of traffic localization, we eliminate the intratask contention by limiting the number of multicast communications to at most one in each task at any time. Let each task require 8 nodes and let there be 8 tasks on a 64-node cube network. Three different processor allocation approaches are applied. The first approach is to allocate processors randomly; the influence of the intertask contention is not predictable. The second approach allocates all but one of the processor nodes into a cube; this out-of-cube node will introduce some intertask contention, but not as much as the previous approach. The last approach enforces traffic localization by allocating a complete binary cube to each application. Note that the binary cube will be formally defined and proved to be external contention-free for the cube network in the following sections. Unlike previous simulations, which considered the load of a network, we concentrate on the ratio of communication time to execution time, which is formally defined below.

    R = T_comm / T = T_comm / (T_comm + T_comp)

Like the simulation in Chapter 4, the length of a message is uniformly distributed between 32 and 96 flits, with an average of 64 flits. The time unit is a single cycle to forward a single flit one step toward the destination. Such a cycle depends on the clock of the network and can be just tens or hundreds of nanoseconds. For example, the cycle is 40 nanoseconds on a 25 MHz network.

A communication phase may be overlapped with a computation phase to reduce the communication overhead. The source node may issue the second communication command before the previous one is completed. Such multicast communication is called non-blocking multicast. To avoid intratask contention caused by non-blocking multicast, we concentrate on a single blocking multicast per cluster. In blocking multicast, the source node will not continue to the computation phase until the communication phase is completed. The latency of a message under the different processor allocation approaches is given in Figure 5.1(a).

Figure 5.1: Average latency and associated system utilization of a blocking multicast message on a 64-node cube network: (a) blocking multicast, (b) system utilization and speedup.
Intuitively, increasing the communication ratio R would increase the contention rate c since there is more traffic on the network. Hence, the latency gets longer as the communication ratio gets higher. The latency of both the first and the second allocation approaches given in Figure 5.1(a) verifies this observation. However, the latency of cube allocation is not affected by the communication ratio, as shown in the same figure. This is not contrary to our observation or the estimation in Equation 5.1. As specified earlier, this approach enforces traffic localization and eliminates the intertask contention. Since the intratask contention is avoided by single blocking multicast, this result confirms the estimation and the assumption that cube allocation is indeed an intertask contention-free allocation approach.

To further clarify the influence of traffic localization, the corresponding system utilization and speedup are shown in Figure 5.1(b). The system utilization ρ and speedup S_p are defined as follows:

    ρ = T_comp(1) / (n (T_comm(n) + T_comp(n)))

    S_p = T_comp(1) / (T_comm(n) + T_comp(n))

where T_comp(n) and T_comm(n) are the total computation time and the total communication time of a node when n nodes are used for a task, respectively. As shown in this figure, the intertask contention degrades the performance rapidly. For example, the system utilization degrades to 28% in random allocation and to 55% in the one out-of-cube allocation approach, while the utilization is about 69% in cube allocation when the communication ratio is 0.3. Note that the optimal utilization is 70% when R is 0.3.

Since multiple multicast communications may share the same destination in the same cluster, the intratask contention is unavoidable. We concentrate on providing an intertask contention-free network partitionability of delta class MINs in this chapter. All contention in the following context refers to intertask contention unless otherwise specified. Further evaluation of the influence of traffic localization, or network partitionability, will be given in Section 5.6.

5.2 Definition of Different Cubes

A cube, which provides an intertask contention-free partitionability on some delta class MINs, is formally defined as follows:

Definition 4 In a MIN with N = k^n nodes, a k-ary m-cube (cube) consists of k^m nodes which have the same n - m radix-k digits (fixed variables) in their node addresses, where these same digits can be in any n - m locations of the n possible locations. Two cubes are disjoint if they have different fixed variables and one is not a subset of the other.

Definition 5 A k-ary m-cube is referred to as a base k-ary m-cube (base cube) if these n - m digits are in the most significant n - m locations (or the remaining m digits are in the least significant m locations) of their node addresses.

A base cube is a special case of a cube. Consider a system with N = 4^4 nodes. The cluster (12XX) has 16 nodes ranging from (1200) to (1233) and is a base 4-ary 2-cube. The cluster (2X1X) has 16 nodes ranging from (2010) to (2313) and is a 4-ary 2-cube but not a base cube. Note that an X represents a digit based on k, or 0 <= X < k, in the following context of this chapter.

Most research is based on 2 × 2 switches, while larger switches provide better performance in the real world.
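A small sketch of Definitions 4 and 5 follows, assuming the X-pattern notation of the example above (the function names are illustrative, not from the dissertation).

from itertools import product

def cube_members(pattern, k):
    """All node addresses (as digit tuples) matching a pattern such as 12XX."""
    free = [i for i, ch in enumerate(pattern) if ch == "X"]
    members = []
    for combo in product(range(k), repeat=len(free)):
        digits = [int(ch) if ch != "X" else None for ch in pattern]
        for pos, val in zip(free, combo):
            digits[pos] = val
        members.append(tuple(digits))
    return members

def is_base_cube(pattern):
    """Base cube: all free digits occupy the least significant locations."""
    m = pattern.count("X")
    return pattern.endswith("X" * m)

if __name__ == "__main__":
    print(len(cube_members("12XX", 4)), is_base_cube("12XX"))   # 16 True
    print(len(cube_members("2X1X", 4)), is_base_cube("2X1X"))   # 16 False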
Allocating a complete k-ary m-cube to an application which requires only part of the cube may reduce the overall utilization of a system. For example, an 8-ary 2-cube would be allocated to a 16-node application on a network with 8 × 8 switches. To reduce such a penalty, we also consider the allocation of binary cubes on a network with k × k switches. The formal definitions of both binary m-cube and base binary m-cube on a MIN with k × k switches are given below.

Definition 6 In a MIN with N = k^n nodes, where k = 2^p, a binary m-cube consists of 2^m nodes which have the same (p × n - m) bits (fixed variables) in their node addresses (based on 2), where these same bits can be in any (p × n - m) locations of the (p × n) possible locations. Two binary cubes are disjoint if they have different fixed variables and one is not a subset of the other.

Definition 7 A binary m-cube is referred to as a base binary m-cube if these (n × p - m) bits are in the most significant (n × p - m) locations (or the remaining m bits are in the least significant m locations) of their node addresses (based on 2).

Based on these definitions, a k-ary m-cube, where k = 2^p, is a binary (p × m)-cube. An example of different cubes is given in Figure 5.2. There are three cubes: 0xxx, 10xx, and 11x1. The second cluster, 10xx or 2X, is a base 4-ary 1-cube, or a base binary 2-cube. The first one is a base binary 3-cube and the third one is a binary 1-cube but not a base binary 1-cube. The latter two are not 4-ary cubes. Note that in the following context of this chapter, an x stands for a digit based on 2, or 0 <= x < 2.

Figure 5.2: Three different processor clusters, 0xxx, 10xx (or 2X), and 11x1.

5.3 Contention-Free and Channel-Balanced Partition

In addition to guaranteeing contention-free network partitioning, it is important that the number of communication channels between two adjacent stages is the same as the number of nodes in the corresponding cluster. Thus, if a cluster has c nodes, the number of channels (or channel pairs) allocated to the cluster should be c between any two adjacent stages; this is referred to as channel-balanced allocation. If the number of channels allocated is less than c, it implies the possibility of channel congestion within that cluster. If the number of channels allocated is greater than c, it implies that other clusters may be allocated fewer channels or that the channels have to be shared with other clusters. Before showing this property on different delta networks, two basic lemmas will be proven. The first one is related to channel-balanced allocation and the second one concerns the external contention-free allocation. Both lemmas are given below.

Lemma 6 A k-ary cube is channel-balanced if s_i is replaced by d_i for all 0 <= i <= n-1 while the address pattern is changed from s_{n-1}...s_1s_0 to d_{n-1}...d_1d_0.

Proof: We shall prove that the number of channels used by an m-cube cluster is always k^m between any two adjacent stages. For a k-ary m-cube, there are n - m fixed variables and m free variables in the definition of the cluster. For communication within each cluster, given that s_i is replaced by d_i for all 0 <= i <= n-1, all those n - m fixed variables and the number of free variables m are unchanged for a given k-ary m-cube cluster. Thus, the number of channels between any two adjacent stages of an m-cube cluster remains k^m. □
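The counting argument can be checked mechanically for k = 2. The sketch below (an illustration, not the dissertation's proof apparatus) labels the channel exiting each stage with the transition d_{n-1}...d_{i+1} s_{i-1}...s_0 d_i used in the proof of Lemma 8 below, and confirms that a binary m-cube cluster occupies exactly 2^m channels at every stage.

from itertools import product

def members(pattern):
    """Expand a binary x-pattern such as 1x0x into its node addresses."""
    free = [i for i, ch in enumerate(pattern) if ch in "xX"]
    out = []
    for bits in product((0, 1), repeat=len(free)):
        digits = list(pattern)
        for pos, b in zip(free, bits):
            digits[pos] = str(b)
        out.append(int("".join(digits), 2))
    return out

def channel_after_stage(s, d, i):
    """Label of the channel exiting stage G_i for source s, destination d."""
    top = (d >> (i + 1)) << (i + 1)          # d_{n-1} ... d_{i+1}
    mid = (s & ((1 << i) - 1)) << 1          # s_{i-1} ... s_0
    return top | mid | ((d >> i) & 1)        # ... d_i in the last position

if __name__ == "__main__":
    n, cluster = 4, members("1x0x")          # m = 2 free bits -> 4 nodes
    for i in range(n):
        labels = {channel_after_stage(s, d, i)
                  for s in cluster for d in cluster}
        print(i, len(labels))                # 4 at every stage: channel-balanced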
Lemma 7 Disjoint cubes are contention-free if s_i is replaced by d_i for all 0 <= i <= n-1 while the address pattern is changed from s_{n-1}...s_1s_0 to d_{n-1}...d_1d_0.

Proof: We will show that the channels from disjoint k-ary cubes (clusters) are always distinct. Suppose that a channel is shared by two different clusters. The channel address is thus the same for both clusters. This implies that two different clusters have the same fixed variables, a contradiction. □

With these two lemmas, we show in the following lemma that a cube network can be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Lemma 8 A cube network with N = k^n nodes can be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Proof: Based on Lemmas 6 and 7, we shall prove that s_i is always replaced by d_i when the address pattern is changed from s_{n-1}s_{n-2}...s_1s_0 to d_{n-1}d_{n-2}...d_1d_0. Consider a source and destination pair: s_{n-1}...s_0 and d_{n-1}...d_0. The channels before entering and after exiting stage n-1 (G_{n-1}) are s_{n-2}...s_0s_{n-1} and s_{n-2}...s_0d_{n-1}, respectively. The former is due to the perfect k-shuffle connection and the latter is due to destination tag routing with t_{n-1} = d_{n-1}. For stage i (G_i, 0 <= i <= n-2), the channels entering and exiting G_i are d_{n-1}...d_{i+1}s_{i-1}...s_1s_0s_i and d_{n-1}...d_{i+1}s_{i-1}...s_1s_0d_i, respectively. The former is due to the butterfly connection β_i^k and the latter is due to destination tag routing with t_i = d_i. As the address pattern is changed from s_{n-1}...s_0 to d_{n-1}...d_0, s_i is always replaced by d_i for all 0 <= i <= n-1. □

In a more general case, when k is a power of 2, the restriction of k-ary cubes can be relaxed to binary cubes. From the proof of Lemma 8, we can immediately obtain the following result:

Theorem 1 A cube network with N = k^n nodes, where k = 2^p for some p, can be partitioned into contention-free and channel-balanced disjoint "binary" cubes.

Proof: We shall prove that the number of channels used by a binary m-cube cluster is always 2^m between any two adjacent stages and that the channels from different clusters (binary cubes) are always distinct. Let k be 2^p where p >= 2. Consider a source and destination pair: s_{n-1}...s_0 and d_{n-1}...d_0, or s_{n-1,p-1}s_{n-1,p-2}...s_{n-1,0}...s_{1,0}s_{0,p-1}s_{0,p-2}...s_{0,0} and d_{n-1,p-1}d_{n-1,p-2}...d_{n-1,0}...d_{1,0}d_{0,p-1}d_{0,p-2}...d_{0,0}, respectively. The channels before entering and after exiting stage n-1 (G_{n-1}) are s_{n-2}...s_0s_{n-1} and s_{n-2}...s_0d_{n-1}, respectively. The former is due to the perfect k-shuffle connection and the latter is due to destination tag routing with t_{n-1,j} = d_{n-1,j}, where 0 <= j <= p-1. For stage i (G_i, 0 <= i <= n-2), the channels entering and exiting G_i are d_{n-1}...d_{i+1}s_{i-1}...s_1s_0s_i and d_{n-1}...d_{i+1}s_{i-1}...s_1s_0d_i, respectively. The former is due to the butterfly connection β_i^k and the latter is due to destination tag routing with t_{i,j} = d_{i,j}. As the address pattern is changed from s_{n-1}...s_0 to d_{n-1}...d_0, s_{i,j} is always replaced by d_{i,j} for all 0 <= i <= n-1 and 0 <= j <= p-1.

Note that for a binary m-cube, there are (n × p - m) fixed variables and m free variables in the definition of the cluster. For communication within each cluster, this implies that s_{i,j} = d_{i,j} for all those i and j of fixed variables, and the number of free variables m is unchanged for a given binary m-cube cluster. Thus, the number of channels between any two adjacent stages of an m-cube cluster remains 2^m. Lemma 7 implies that the channels from different clusters are always distinct. Hence, different clusters are contention-free. □
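A companion sketch for Lemma 7 and Theorem 1, under the same assumed binary channel labeling as the sketch above, checks that the disjoint clusters of the Figure 5.3 partition below never share a channel at any stage.

from itertools import product

def members(pattern):
    free = [i for i, ch in enumerate(pattern) if ch in "xX"]
    out = []
    for bits in product((0, 1), repeat=len(free)):
        digits = list(pattern)
        for pos, b in zip(free, bits):
            digits[pos] = str(b)
        out.append(int("".join(digits), 2))
    return out

def channel_after_stage(s, d, i):
    top = (d >> (i + 1)) << (i + 1)          # d_{n-1} ... d_{i+1}
    mid = (s & ((1 << i) - 1)) << 1          # s_{i-1} ... s_0
    return top | mid | ((d >> i) & 1)        # ... d_i

def stage_channels(cluster, i):
    return {channel_after_stage(s, d, i) for s in cluster for d in cluster}

if __name__ == "__main__":
    n = 4
    clusters = ["x0xx", "01xx", "11x0", "11x1"]   # the partition of Figure 5.3
    for i in range(n):
        sets = [stage_channels(members(p), i) for p in clusters]
        assert all(a.isdisjoint(b)
                   for j, a in enumerate(sets) for b in sets[j + 1:])
    print("no two clusters ever share a channel")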
An example partitioning a 16-node cube network with 4 × 4 switches (k = 4) into four contention-free and channel-balanced binary cube clusters is shown in Figure 5.3. The clusters x0xx, 01xx, 11x0, and 11x1 consist of 8, 4, 2, and 2 nodes, respectively. As shown in the figure, all four clusters are channel-balanced and contention-free.

Figure 5.3: A 16-node cube network based on 4 × 4 switches is partitioned into four contention-free and channel-balanced binary cube clusters, x0xx, 01xx, 11x0, and 11x1.

Lemma 8 and Theorem 1 apply to the omega network directly. Hence, we have Lemma 9 and Theorem 2 as follows.

Lemma 9 An omega network with N = k^n nodes can be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Theorem 2 An omega network with N = k^n nodes, where k = 2^p for some p, can be partitioned into contention-free and channel-balanced disjoint "binary" cubes.

The proof is similar to that of Lemma 8. Figure 5.4(a) shows the partitioning of an 8-node cube network into the contention-free and channel-balanced binary cube clusters xx0, 0x1, and 1x1, while Figure 5.4(b) shows the same partitioning on an 8-node omega network.

Figure 5.4: An 8-node cube network and an 8-node omega network are partitioned into three contention-free and channel-balanced binary cube clusters: (a) multistage cube network, (b) multistage omega network.

Unlike the cube network or omega network, not all delta networks can be partitioned into contention-free and channel-balanced clusters. The following section shows that neither the baseline network nor the butterfly network possesses these properties.

5.4 Non-Contention-Free Partition

Although baseline and butterfly networks are topologically equivalent to omega and cube networks, intertask contention exists on both of them. This section gives the description and formal proof of this property.

Lemma 10 A baseline network with N = k^n nodes may not be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Proof: Consider a source and destination pair: s_{n-1}...s_0 and d_{n-1}...d_0. The channels before entering and after exiting stage n-1 (G_{n-1}, the leftmost stage) are s_{n-2}...s_0s_{n-1} and s_{n-2}...s_0d_{n-1}, respectively. The former is due to the perfect k-shuffle connection and the latter is due to destination tag routing with t_{n-1} = d_{n-1}. For stage i (G_i, 0 <= i <= n-2), the channels entering and exiting G_i are d_{n-1}...d_{i+1}s_{n-2}s_{n-3}...s_{n-i-2} and d_{n-1}...d_{i+1}s_{n-2}...s_{n-i-1}d_i, respectively. The former is due to the baseline connection δ_i^k and the latter is due to destination tag routing with t_i = d_i. As the address pattern is changed from s_{n-1}...s_0 to d_{n-1}...d_0, s_{n-i-2} is always replaced by d_i for 0 <= i <= n-2, and s_{n-1} is replaced by d_{n-1}. Thus, a free variable may be replaced by a fixed variable, implying that the number of channels is reduced at that stage. If a fixed variable is replaced by a free variable, this implies that more channels are used by that cluster and the channels may be shared with other clusters. □
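The halving and sharing effects in the proof of Lemma 10 can be observed with the same style of sketch, now using the baseline transition d_{n-1}...d_{i+1} s_{n-2}...s_{n-i-1} d_i (an illustration under the stated labeling assumption; the cluster counts it prints match the Figure 5.5 examples below).

from itertools import product

def members(pattern):
    free = [i for i, ch in enumerate(pattern) if ch in "xX"]
    out = []
    for bits in product((0, 1), repeat=len(free)):
        digits = list(pattern)
        for pos, b in zip(free, bits):
            digits[pos] = str(b)
        out.append(int("".join(digits), 2))
    return out

def baseline_channel(s, d, i, n):
    top = (d >> (i + 1)) << (i + 1)                   # d_{n-1} ... d_{i+1}
    mid = ((s >> (n - i - 1)) & ((1 << i) - 1)) << 1  # s_{n-2} ... s_{n-i-1}
    return top | mid | ((d >> i) & 1)                 # ... d_i

if __name__ == "__main__":
    n = 4
    for pattern in ("x0xx", "xxx0"):
        nodes = members(pattern)
        counts = [len({baseline_channel(s, d, i, n)
                       for s in nodes for d in nodes})
                  for i in range(n - 1, -1, -1)]
        # x0xx -> [8, 4, 4, 8]: channels halved at two connections;
        # xxx0 -> [8, 16, 16, 8]: channels doubled, hence shared with xxx1
        print(pattern, counts)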
Two examples of partitioning a 16-node baseline network into different binary cube clusters are given in Figure 5.5. There are three contention-free clusters: x0xx, 010x, and 11xx. In all three clusters, the number of channels is reduced to half at connections C2 and C1, as shown in Figure 5.5(a). All 16 channels at connections C2 and C1 are shared by the two clusters xxx0 and xxx1 in Figure 5.5(b).

Figure 5.5: A 16-node baseline network is partitioned into different binary clusters: (a) reduced channels (x0xx, 010x, 11xx), (b) shared channels (xxx0, xxx1).

Similarly, we have the following lemma for a butterfly network.

Lemma 11 A butterfly network with N = k^n nodes may not be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Proof: Consider a source and destination pair: s_{n-1}...s_0 and d_{n-1}...d_0. As the address pattern is changed from s_{n-1}...s_0 to d_{n-1}...d_0, s_j is always replaced by d_{j+1} for 0 <= j <= n-2, and s_{n-1} is replaced by d_0. Thus, a free variable may be replaced by a fixed variable, implying that the number of channels is reduced at that stage. If a fixed variable is replaced by a free variable, this implies that more channels are used by that cluster and that the channels may be shared with other clusters. □

Figure 5.6 demonstrates the partitioning of a 16-node butterfly network into different binary cube clusters. There are three contention-free clusters: x0xx, 010x, and 11xx. In all three clusters, the number of channels is reduced to half between stages G2 and G1, as shown in Figure 5.6(a). In Figure 5.6(b), there are two 8-node clusters, xxx0 and xxx1. Both clusters share those 16 channels in connections C3, C2, and C1.

Figure 5.6: A 16-node butterfly network is partitioned into different binary clusters: (a) reduced channels (x0xx, 010x, 11xx), (b) shared channels (xxx0, xxx1).

5.5 Modification of Baseline and Butterfly Networks

As discussed in the previous sections, the cube and omega networks provide contention-free and channel-balanced network partitionability, but the baseline and butterfly networks do not. Due to the advantage of network partitionability, the cube and omega networks are highly recommended for constructing new networks. For a current system, such as the NEC Cenju-3, it may not be cost effective to update the network. Hence, providing contention-free and channel-balanced network partitionability for both baseline and butterfly networks with minimum modification becomes the goal of this section.

Since the interconnection patterns between adjacent stages are fixed after a MIN is built, we concentrate on modifying the connections between the source nodes and the network, C_n and C_0. To be consistent with most delta networks, we choose to modify the leftmost interconnection, C_n, and keep the rightmost interconnection, C_0, as an identity connection, which is a straight connection. Two modified interconnection patterns will be used, the reversed perfect k-shuffle and the digit-reversal; the former is for the butterfly network and the latter is for the baseline network. Furthermore, we will show that both modified baseline and butterfly networks can be partitioned into contention-free and channel-balanced k-ary cubes.
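As a preview, the two patterns just named can be sketched as permutations of a radix-k digit string (the symbols follow Definitions 8 and 9 below; the Python names are illustrative assumptions).

def digit_reverse(digits):
    """gamma: x_{n-1} x_{n-2} ... x_1 x_0  ->  x_0 x_1 ... x_{n-2} x_{n-1}."""
    return digits[::-1]

def reversed_perfect_shuffle(digits):
    """sigma: x_{n-1} x_{n-2} ... x_1 x_0  ->  x_0 x_{n-1} x_{n-2} ... x_1."""
    return digits[-1:] + digits[:-1]

if __name__ == "__main__":
    addr = [3, 1, 0, 2]                       # x_3 x_2 x_1 x_0, radix k = 4
    print(digit_reverse(addr))                # [2, 0, 1, 3]
    print(reversed_perfect_shuffle(addr))     # [2, 3, 1, 0]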
Before giving the definitions of these interconnection patterns, let us carefully examine the address transition on both baseline and butterfly networks in the following subsections.

5.5.1 Modification of a Baseline Network

Consider a source and destination pair: s_{n-1}s_{n-2}...s_0 and d_{n-1}d_{n-2}...d_0. As shown in the proof of Lemma 10, s_j is always replaced by d_{n-j-2} for 0 <= j <= n-2, and s_{n-1} is replaced by d_{n-1}, as the address pattern is changed from s_{n-1}s_{n-2}...s_0 to d_{n-1}d_{n-2}...d_0. The replacement of s_{n-1} by d_{n-1} is caused by the perfect k-shuffle in the leftmost interconnection. If an identity connection is used in C_n, s_j will be replaced by d_{n-j-1} for 0 <= j <= n-1. In order to have s_j be replaced by d_j, we shall reverse the order of the source address in the leftmost interconnection. Hence, the digit-reversing pattern, γ, is introduced to achieve this purpose and is formally defined below.

Definition 8 The digit-reversing connection γ^k is defined by

    γ^k(x_{n-1}x_{n-2}...x_2x_1x_0) = x_0x_1x_2...x_{n-2}x_{n-1}

where 0 <= x_i <= k - 1.

When a baseline network connects to the source nodes based on the digit-reversing connection, it can be partitioned into contention-free and channel-balanced disjoint k-ary cubes. The formal description and proof are given in Lemma 12.

Lemma 12 A baseline network with the digit-reversing pattern as C_n can be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Proof: Consider a source and destination pair: s_{n-1}s_{n-2}...s_1s_0 and d_{n-1}d_{n-2}...d_1d_0. The channels before entering and after exiting stage n-1 (G_{n-1}) are s_0s_1...s_{n-2}s_{n-1} and s_0s_1...s_{n-2}d_{n-1}, respectively. The former is due to the digit-reversing connection and the latter is due to destination tag routing with t_{n-1} = d_{n-1}. For stage i (G_i, 0 <= i <= n-2), the channel entering G_i is d_{n-1}...d_{i+1}s_0s_1...s_{i-1}s_i due to the baseline connection δ_i^k, and the channel exiting G_i is d_{n-1}...d_{i+1}s_0s_1...s_{i-1}d_i because of destination tag routing with t_i = d_i. As the address pattern is changed from s_{n-1}...s_1s_0 to d_{n-1}...d_1d_0, s_i is always replaced by d_i for all 0 <= i <= n-1. Hence, such a network can be partitioned into channel-balanced and contention-free k-ary cubes based on Lemmas 6 and 7. □

Figure 5.7 demonstrates the partitionability of a modified baseline network. For comparison purposes, the partitions in Figure 5.7 are identical to those in Figure 5.5 in the previous section. The three clusters x0xx, 010x, and 11xx, which are contention-free but whose channel counts are reduced to half in connections C2 and C1 of a baseline network as shown in Figure 5.5(a), become channel-balanced in a modified baseline network as shown in Figure 5.7(a). The channel sharing of clusters xxx0 and xxx1 on a baseline network, as given in Figure 5.5(b), is eliminated in a modified baseline network as shown in Figure 5.7(b).

Figure 5.7: An example partitioning a 16-node modified baseline network into contention-free and channel-balanced disjoint k-ary cubes: (a) three contention-free and channel-balanced clusters, (b) two contention-free and channel-balanced clusters.

Like the cube network, when k is a power of 2, the restriction of k-ary cubes can be relaxed to binary cubes. From the proof of Lemma 12, we can immediately obtain the following result:
Theorem 3 A modified baseline network with N = k^n nodes, where k = 2^p for some p, can be partitioned into contention-free and channel-balanced disjoint "binary" cubes.

5.5.2 Modification of a Butterfly Network

To examine the address transition, we consider a source and destination pair: s_{n-1}s_{n-2}...s_0 and d_{n-1}d_{n-2}...d_0. While the address pattern is changed from s_{n-1}...s_1s_0 to d_{n-1}...d_1d_0, s_j is always replaced by d_{j+1} for 0 <= j <= n-2, and s_{n-1} is replaced by d_0, as shown in the proof of Lemma 11. Based on Lemmas 6 and 7, s_j shall be replaced by d_j for 0 <= j <= n-1. Therefore, the address pattern entering stage G_{n-1} shall be s_0s_{n-1}...s_1 instead of s_{n-1}...s_1s_0. To achieve this, a reversed perfect k-shuffle is used for connection C_n. With such a modification, the address pattern entering stage G_{n-1} becomes s_0s_{n-1}...s_1. When the last digit is replaced in stage j, s_j is always replaced by d_j for all 0 <= j <= n-1. The definition of the reversed perfect k-shuffle and the formal description as well as the proof of this property are given below.

Definition 9 The reversed perfect k-shuffle connection σ^k is defined by

    σ^k(x_{n-1}x_{n-2}...x_1x_0) = x_0x_{n-1}x_{n-2}...x_2x_1

where 0 <= x_i <= k - 1.

Lemma 13 A butterfly network with the reversed perfect k-shuffle as the leftmost connection, C_n = σ^k, can be partitioned into contention-free and channel-balanced disjoint k-ary cubes.

Proof: Consider a source and destination pair: s_{n-1}s_{n-2}...s_1s_0 and d_{n-1}d_{n-2}...d_1d_0. The channels before entering and after exiting stage n-1 (G_{n-1}) are s_0s_{n-1}...s_2s_1 and s_0s_{n-1}...s_2d_1, respectively. The former is due to the reversed perfect k-shuffle connection and the latter is due to destination tag routing with t_{n-1} = d_1. For stage n-i (G_{n-i}, 1 <= i <= n-1), the channel entering G_{n-i} is s_0s_{n-1}...s_{i+1}d_{i-1}...d_1s_i due to the butterfly connection β_i^k, and the channel exiting G_{n-i} is s_0s_{n-1}...s_{i+1}d_{i-1}...d_1d_i because of destination tag routing with t_{n-i} = d_i. Similarly, the channels entering and exiting the rightmost stage G_0 are d_{n-1}d_{n-2}...d_1s_0 and d_{n-1}d_{n-2}...d_1d_0, respectively; the latter is due to the destination routing tag t_0 = d_0. As the address pattern is changed from s_{n-1}...s_1s_0 to d_{n-1}...d_1d_0, s_i is always replaced by d_i for all 0 <= i <= n-1. Hence, the number of channels between any adjacent stages of a k-ary m-cube remains k^m, and different clusters are contention-free based on Lemmas 6 and 7. □

The examples which partition a 16-node butterfly network into different numbers of clusters in Figure 5.6 are revisited on a modified butterfly network. As shown in Figure 5.8(a), the three contention-free cube clusters x0xx, 010x, and 11xx, whose channel counts are reduced to half in connection C2 on a 16-node butterfly network, become contention-free and channel-balanced clusters on a 16-node modified butterfly network. The two clusters xxx0 and xxx1, which share channels in Figure 5.6(b), become contention-free as shown in Figure 5.8(b).

Like the modified baseline network, the restriction of k-ary cubes can be relaxed to binary cubes when k is a power of 2. We can immediately obtain the following result from the proof of Lemma 13.
Theorem 4 A modified butterfly network with N = k^n nodes, where k = 2^p for some p, can be partitioned into contention-free and channel-balanced disjoint "binary" cubes.

Figure 5.8: An example partitioning a modified butterfly network into contention-free and channel-balanced disjoint k-ary cubes: (a) three contention-free and channel-balanced clusters, (b) two contention-free and channel-balanced clusters.

5.6 Performance Evaluation

As discussed earlier, enforcing traffic localization will reduce the intertask contention and decrease the communication latency. In this section, we examine the influence of traffic localization in detail. The performance evaluation is based on a simulator which emulates a 64-node cube network with 4 × 4 switches. The length of a message is uniformly distributed between 32 and 96 flits. Note that based on Lemmas 8, 9, 12, and 13, as well as Theorems 1, 2, 3, and 4, the results also apply to the omega, modified baseline, and modified butterfly networks.

The latency under the different degrees of intertask contention introduced by the different allocation approaches is given in Figure 5.9. The latency of a complete cube confirms that binary cube partitioning is intertask contention-free. Such latency depends on the length of a message and the number of stages, but not on the number of destinations or the communication ratio.

Figure 5.9: Average latency of a blocking multicast message with various intertask contention.

Intuitively, as the number of clusters increases, the potential intertask contention increases. The worst case occurs when each cluster blocks every other cluster and is itself blocked by another cluster. Hence, the average latency becomes close to (m × T_comm) when the communication ratio R gets larger. This observation is confirmed by the random allocation of 8-node and 16-node tasks in Figure 5.9. However, when the number of nodes in one cluster is less than the number of clusters in the network, a cluster will not be able to block all other clusters. Hence, the latency decreases. This estimation is confirmed by the 4-node clusters in Figure 5.9.

The network throughput is given in Figure 5.10 for reference. Here, the throughput is defined as the average number of flits passing an outgoing channel per time unit. For a binary m-cube, the throughput is equal to R × (2^m - 1)/2^m since there are (2^m - 1) destinations. For example, when the communication ratio is 0.8, the expected throughput of a binary 2-cube is 0.6 (0.8 × 3/4). As expected, the throughput of binary cube allocation is close to the expected throughput; the difference is due to the header overhead. The throughput of the other two approaches becomes worse as either the communication ratio or the latency increases.

Figure 5.10: Average throughput in a cube network with average 64-flit messages.
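The expected-throughput expression above reduces to a one-line model (a sketch; the function name is illustrative, and this is not the simulator itself).

def expected_throughput(R, m):
    """Flits per outgoing channel per cycle: R * (2**m - 1) / 2**m."""
    return R * (2 ** m - 1) / 2 ** m

if __name__ == "__main__":
    print(expected_throughput(0.8, 2))   # 0.6, matching the example above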
In non-blocking multicast, the source node continues to the computation phase as soon as the multicast communication request is issued to the router. The computation time and communication time are overlapped to reduce the communication overhead. Hence, intratask contention becomes possible, since a communication request may be issued before the previous one is done. The influence of such intratask contention can be observed from the cube allocation in Figure 5.11, which gives the average latency of a non-blocking multicast communication based on the various allocation approaches. As the communication ratio becomes larger, the latency increases. The system becomes unstable when R approaches 0.5.

Figure 5.11: Average latency in a cube network of non-blocking messages with average 64-flit length.

The latency gets worse when intertask contention exists. As shown in this figure, the system becomes unstable when R is close to 0.1 if 4-node or 8-node tasks are randomly allocated. The performance of randomly allocated 16-node tasks is better than the previous two since it has lower intertask contention.

Figure 5.12 gives the average latency of non-blocking multiple multicast communications. Here, 8-node tasks are allocated into a 64-node cube network. Every node may initiate its own non-blocking multicast communication. To avoid network saturation, the network load is used as the x-axis. The load is defined as the average number of flits injected into the network times the number of destinations. In other words, the load is the average number of flits expected to pass through an outgoing port per time unit. The relation of load and communication ratio R for a binary m-cube is given below:

    load = (2^m - 1) × R

Hence, R = load/7 in Figure 5.12. As expected, the complete cube still has the best performance. However, the system reaches saturation when the communication ratio becomes 0.1.

Figure 5.12: Average latency in a cube network of non-blocking messages with multiple sources.

5.7 Summary

Increasing the network contention increases the communication latency. Such network contention includes both the intertask contention (external contention) and the intratask contention (internal contention). However, the intratask contention is unavoidable when multiple multicast communications are allowed. Hence, we proposed an allocation scheme, the binary cube scheme, to eliminate the intertask contention in this chapter.
The binary cube partition algorithm provides an intertask contention-free environment on both the cube and omega networks, but not on the baseline or butterfly networks. Two interconnection patterns were introduced to reconfigure the baseline and butterfly networks. With such modifications, both baseline and butterfly networks can be partitioned into contention-free and channel-balanced binary clusters.

Due to the unique path property of delta class MINs, the path of a multicast communication is also unique. Network partitionability and traffic localization reduce the contention and thereby increase the performance. In the next chapter, we will discuss establishing a multicast tree on an extension of the MIN.

CHAPTER 6

Extra Stage Multistage Interconnection Networks

In this chapter, we show that, unlike the delta class MINs, the extra-stage MIN (ESMIN) structure opens up alternate routing strategies with varying tradeoffs. We assume one such routing approach in our work. Then, we estimate the number of multicast trees that can be generated for a given multicast problem (i.e., source and destination list) and an extra-stage design. We discuss the design implications of the "extra stages" with regard to the multicast attributes, and develop an optimality criterion for multicast trees in an ESMIN. Enumeration of all the instances of the optimum-traffic tree requires exponential complexity; however, a single instance of the optimum-traffic tree can be generated in polynomial time. A multicast tree generation algorithm is proposed toward this end.

Figure 6.1(a) gives an extra stage MIN structure constructed from 2 × 2 switches. It has N = 2^n input ports and N output ports. Unlike the n stages in a delta class MIN, there are h stages, where h >= n. Similar to delta MINs, the h stages are placed horizontally apart, while within each stage there are N/2 switches vertically stacked. Each 2 × 2 switch can connect in one of four possible ways (Figure 6.1(b)): straight, crossed, upper broadcast, and lower broadcast.

Figure 6.1: A generic MIN structure with N = 2^n input/output ports and h stages (h = n + m), where each 2 × 2 switch has four connectivity choices.

We consider a multicast communication from a source node S to k destination nodes, D_i, 1 <= i <= k, for k <= N. Such a problem is usually characterized using two metrics: time and traffic [22]. Traffic is a measure of the total number of messages generated and passed from one node to another in the system. Time measures the maximum number of hops taken by the message to reach any destination. Since several inter-node message communications may take place simultaneously, time does not measure the total number of messages. Our objective is to develop a multicast routing approach for the general class of ESMINs as shown in Figure 6.1.
The actual connection patterns between the intermediate stages may vary, and the proposed multicast approach is expected to be generic to the connection patterns given in Chapter 3, e.g., the perfect-shuffle or the butterfly connection patterns.

6.1 MINs with Extra Stages

Figure 6.1 shows a generic MIN with N input ports and N output ports, where N = 2^n. It has h stages, G_{h-1} to G_0. As shown in Figure 6.1, each stage, say G_i, has N/2 2 × 2 switches. The connection between two adjacent stages, G_i and G_{i-1}, denoted as C_i, defines the connection pattern for N links. Thus, an h-stage MIN can be represented as C_h G_{h-1} C_{h-1} G_{h-2} ... G_0 C_0. In this chapter, MIN(N) refers to a multistage interconnection network with N = 2^n input/output ports and h = n stages. An extra-stage MIN with N = 2^n input/output ports and h = n + m stages (i.e., m extra stages) is represented as ESMIN(N:h).

6.1.1 Interstage Connection Patterns

A connection pattern C_i defines how those N links should be connected between the N outputs from stage G_i and the N inputs to stage G_{i-1}. Different connection patterns give different characteristics and topological properties. Note that the connection pattern between a consecutive pair of stages, including C_h and C_0, can vary, and our study is not restricted to any specific connection pattern. However, the connection pattern must be a variant of the hypercube-derived networks, such as the binary cube, perfect shuffle, and butterfly. The links are labeled from 0 to N - 1 at C_i. With N = 2^n ports, let X = x_{n-1}x_{n-2}...x_0 be an arbitrary port number, 0 <= X <= N - 1. As specified earlier, the butterfly connection β_p interchanges the zeroth and pth bits of the index. The value of p is referred to as the butterfly dimension, or dimension for short if there is no confusion in the context. Note that β_(0) defines a straight one-to-one connection and is also called the identity connection, I. The perfect shuffle connection, π, is defined by

    π(x_{n-1}x_{n-2}...x_1x_0) = x_{n-2}...x_1x_0x_{n-1}

when N = 2^n. An N-port cube MIN (denoted as CU-MIN(N)) is defined as C_h = π and C_i = β_(i) for h - 1 = n - 1 >= i >= 0. An omega network is a MIN(N) with C_i = π for h = n >= i >= 1 and C_0 = I [33]. Such a network is denoted as PS-MIN(N).

6.2 Structural Equivalence of Different Delta Networks

The upper part of Figure 6.2(a) shows a PS-MIN(16); the corresponding lower part shows a CU-MIN(16). Note that if we have C_h = π in the CU-MIN, the CU-MIN is equivalent to the multistage cube network [32]. With destination-tag routing, the leftmost connection, C_h, makes no difference between CU-MINs and multistage cube networks. Thus, we assume C_h = I for both MIN and ESMIN in this chapter.

The PS-MIN and CU-MIN are structurally equivalent, as shown in Figure 6.2(a) with N = 16. The PS-MIN involves perfect shuffle (which left-shifts the bit-id) and exchange (introducing a 0 or 1 bit at the least significant bit) operations. A bare exchange operation corresponds to a dimension-0 toggle of the CU-MIN. An exchange followed by a number of perfect shuffles essentially emulates the role of an intermediate stage dimension toggle of the CU-MIN. A formal proof in this regard can be carried out, but this property is either well known or can be derived easily from existing results [31]. Thus, PS-MIN and CU-MIN can be labeled as of equivalent capability. However, we will show that they differ in their flexibility in extra stage attachments (see Section 6.2.2).

6.2.1 Design of Extra-Stage MINs

The regular MIN structure can be augmented in a number of ways. First, the destination nodes can be connected back to the source nodes using cyclical wrap-around links; this class of networks has become popular for high-speed networks [45].

Figure 6.2: Structural equivalence between (a) PS-MIN(16) and CU-MIN(16) and (b) PS-ESMIN(16:5) and CU-ESMIN(16:1,3,2,1,0).
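A small sketch of the two connection patterns defined above, acting on an n-bit port index (illustrative code; the function names are assumptions, not from the dissertation):

def butterfly(x, p):
    """beta_p: exchange bit 0 and bit p of x (beta_0 is the identity I)."""
    if ((x >> p) & 1) != (x & 1):
        x ^= (1 << p) | 1
    return x

def perfect_shuffle(x, n):
    """pi: x_{n-1} x_{n-2} ... x_0 -> x_{n-2} ... x_0 x_{n-1} (rotate left)."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

if __name__ == "__main__":
    n = 4
    print(perfect_shuffle(0b1001, n) == 0b0011)   # True: left rotation
    print(butterfly(0b1000, 3) == 0b0001)         # True: bits 3 and 0 swapped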
6.2.1 Design of Extra-Stage MINS The regular MIN structure can be augmented in a number of ways. First, the destina- tion nodes can be connected back to the source nodes using cyclical wrap around links. This class of networks has become popular for high-speed networks [45]. Another de- _1 Pefgt‘Shuft'Jlflle r__ _I:thra 8.2516 Perfeit1 Shufflre_1\:IINr a O O a O ‘— 3 wear :- Wyatt—is; E W \tzm t, 4' : A; 4% t, t, t, _— : Atwgyzoftof H :1 VI»)! ((0')), C :3 V )[fifl’ltytv : :jjiilwiil‘ it”)? : = mt ._.. :15 law 2 :1 C :I 6 C: “fie. as L _ ‘_ 7 L_, LIL .1, 1.: C4 03 C3 02 C2 GI C1 00 C0 C5 04 C4 G3 C3 02 C2 G] C] Go C0 1.1? ‘1 Hrma "2 :. VIEW 4 :2” ‘1 \VIIEW“‘ E : 'w a: H1 — H: swam-I- ... : lim- 2 1; n —i [W g .3. ._: :: im- ? r; _2 .m- 7 fl: :l J!" fly. 6 [_f— 6 Wiltm‘va‘ ‘: : AIM.- : _3‘ A Z A..- : _J JL. .1, ,fi‘ — 7 ._. E a. ._Z c, G3 C3 02 C2 0, C, Co Co C5 G4 C4 G3 C3 C2 C2 G, c, Go C0 Structual-Equivalent Cube MIN Structual-Equivalent Extra Stage Cube MIN Figure 6.2: Structural equivalence between (a) PS-MIN(16) and CU-MIN(16) and (b) PS—ESMIN(16:5) and CU-ESMIN(16:1,3,2,1,0) 1'23 sign extension, in the context of optical passive star based firmware connections, is to allow dynamic on-line dimension selection in the CU-MIN networks. The multicast problem in such designs equates to an optimum ordering of the CU-MIN dimensions. In this chapter we consider a third type of design extension to the MIN structure, where a few extra stages are added beyond the n stages of the regular MIN. The extra stages bring forth various advantages and flexibilities, one of which namely the multicast traffic reduction opportunity, is the focus of this chapter. Figure 6.2(b) shows two ESMIN designs, the upper part using perfect shuffle connectivity (PS- ESMIN) and the lower part using butterfly connectivity (CU-ESMIN). We note that the ESMINs can offer significant advantages at the expense of limited design overheads. First, we discuss the overheads and then list out some of the advantages. The only design overhead in an ESMIN is the hardware cost of the extra stages. Packing density of VLSI technology allows a large number of switch components to be put in a single chip. Thus, a number of extra stage switches and their intercon- nections may be expected to be packed in a single chip. Note that the number of pin connections (i.e., input and output nodes) to the chip remains identical regardless of the number of stages. The number of hops between a source node and a destination node increases in the ESMIN design. A regular MIN with n stages involves n hops, while an ESMIN with h. = n + m stages would require m extra hops. This additional hops delay may indeed be an overhead with store-and-forward routing. However, with recent cut through switching or similar approaches (e.g., wormhole routing, ATM cell flow), this delay 1124 becomes insignificant. Note that in cut through switching, the overall delay becomes largely insensitive to the number of hops [‘24], and therefore, no significant routing overhead is expected with ESMINS. ESMINs can offer a wide range of advantages which basically stem from the num— ber of alternate paths that are generated due to the extra stages. These alternate paths have been explored in the context of hot—spot reductions, fault—tolerant MINS, and congestion controlled traffic flow. Detailed discussion along this line is beyond the scope of this chapter. In this chapter, we exploit a similar advantage of the “alternate paths” in the context of generating multicast trees. 
6.2.2 Design Choices In principle, an ESMIN can be designed with any connection pattern between its stages. An (h. = n + m)-stage ESMIN can have the first n stages of the butterfly patterns and the remaining 772. stages of the shuffle pattern, or vice versa. For the sake of simplicity, we assume that one single connection pattern (e.g., either shuffle or butterfly) is followed over the (n + 7n) stages. Thus, among the butterfly and PS patterns, we could either have a PS-ESMIN(.’V:n + m.) or a CU-ESMIN(N:n + 7n). Two sample (4+1)-stage ESMINS are shown in Figure 6.2(b). Between a PS-ESMIN and a CU-ESMIN, we show that the latter has an added flexibility. Specifically, a CU-ESMIN can have any dimension at any of the ‘extra’ stages. However, the PS-ESMIN must retain a periodicity among its effective dimen- sions. Figure 6.2(a) shows a PS—MIN(16) and its equivalent CU—MIN(16). The cor~ 1‘25 respondence between the shuffle-exchange stages and butterfly dimensions are shown using double arrowed lines. Now, an extra stage is added to the 4-stage shuffle in Fig- ure 62(1)). The effect of the last stage is shown using an equivalent CU—ESMIN(16:5). Normally, if an extra stage is added to a CU-MIN, the extra stage can be assigned to anyone of the n —1 dimensions except dimension 0 (i.e, 5(0) : 1). However, ifextra stages are added to a PS—MIN, the effective repeat dimension occurs cyclically, i.e., a fixed dimension. In other words, for a CU-ESIV’IIN(8:3+‘2), suppose its first three dimensions are 2,1,0. The other two dimensions can be any one of the two nonzero dimensions. For example, the 3+2-stage CU-ESMIN dimensions can be (2,2,2,1,0), (1,2,,2,l,0), (1.1.2.1.0), etc. Thus, for an N-port and h—stage CU-ESlV‘IIN, it is represented as CU- ESMIN(N:C;,_1, (3,4, . . . , CU) in order to show the butterfly dimension at each stage. Note that Co must be rim); otherwise. not all destinations can be reached. On the con- trary, a 3+2—stage PS-ESMIN would always be equivalent to Cl_l—ESI\'IIN(8:‘2,12,1,0). The periodic sequence (2,1) and fragments thereof would always be the ordering. Hence, we observe that for ESMle, the CU-ESMINS offer an added degree of flexibility over PS-ESMINS, because the ertra stages in PS—MIN are apparently re- stricted in their dimensional roles. However, if similar extra stages are added to a CU-MIN, the extra stages can assume any dimension - which is an added degree of flexibility. From Figure 6.2 and the discussion above, we observe that a FS-ESMIN must include a periodicity in its effective dimensions. Thus, a PS-ESMIN(16:1‘2) can be considered equivalent to the CU-ESMIN(16:2,1,3,‘2,l,3,‘2,1,3,‘2,1,0). However, the 122- 1‘26 stage CU—ESMIN(16:12) could have had other dimension choices. In other words, the 1‘2—stage PS-ESMIN(16:12) is necessarily an instance of the 112-stage CU-ESMIN. While the latter could have any dimension pattern, a particular pattern of the latter equates that of the former. In this chapter we consider the generic CU—ESMIN for most illustrations and proof techniques. Since the PS—ESMIN is a particular instance of CU—ESMIN, the PS— ESMIN shares an identical set of properties. Similarly, in the performance section we report the simulation results for various dimension patterns of the CU—ESMIN. One of these dimension patterns. ”we have so chosen, that it fits the periodicity requirement of the PS-ES'A’IIN. In this way, we also report the simulation results for the PS-ESMIN (refer to “Pattern I” in Section 6.5). 
6.3 Multicast in Extra-Stage MIN Unlike regular MIN, routing in ESMIN is no longer a trivial issue. A number of alternate routing styles may be adopted with varying tradeoffs. In this chapter we adopt one of these alternatives. For an ESMIN, the number of alternate multicast trees grows exponentially with the number of extra stages. Among these exponential numbers of multicast tree alternatives, an equally large (in order notation) number of multicast trees can be considered as traffic-optimal. A traffic-optimal multicast tree is a multicast tree with a minimum number of channels used. Therefore, enumeration of all traffic-optimum multicast trees requires an exponential complexity. However, though there are exponential number of traffic optimal multicast trees, 127 a single traffic-optimum multicast tree can be derived in polynomial time. The next section develops an algorithm which can yield a traffic-optimum multicast tree in polynomial time. 6.3.1 Alternate Routing Styles in Extra-Stage MINS Routing in a regular MIN is a trivial issue (since path unique) and requires little attention. However, routing in ESMIN no longer remains so simplistic. An example of conventional routing (source ‘2‘ destination) is given in Figure 6.3. In this figure, part (a) shows the XOR routing on MIN(8) and part (b) shows the XOR on CU- ESMIN(8:PS’,2,1,‘2,I,0). in which a source node 1 to destination node '2 routing is required. The XOR routing tag is .r.r011, where .r can be either 0 or 1, since the first two stages are extra, but the conventional routing (source LE destination based) leads the message to destination 4, 5, 6, or 7 instead of ‘2 as shown in part (b). This is clearly a malfunctioning of the traditional routing (source Ha destination), and requires further modifications. We have identified a fair number of alternatives, among which two routing styles (namely, the “destination routing” (DR), and exclusive—OR with LSB fixing (XOR—l—LSB) can provide successful routings in ESMIN. In the remainder of this chapter, the DR approach (pseudo—code given in Section 6.3.2 below) is used. 1The XOR works on MIN only if the rightmost connection is a perfect shuffle pattern and the destination routing has no such constraint. Figure 6.3: An example of the inadequacy of the traditional routing approach (source 83 destination): A source node 1 to destination node 2 route does not lead to the right destination. 6.3.2 Distributed Routing and Multicast Implementation A message multicast (which includes routing, as a special case) in an ESMIN is initiated by the source node. The message sent by a source node contains a routing tag set and data . Let a be the number of routing tags for a message. It is k in the source node and will gradually decrease as the message progresses through the stages, i.e., in every split, towards the destination. Let t,.,- be the routing tag for the destination 0,, 0 S i S k-l and 0 S J S h—l and t, be the vector {tm-IO S J S h—l}. Note that ti'j bears the following interpretation to the intermediate stage switches: 0 go up (port 0) 1 go down (port 1) I. go either up or down at random 3 go straight to the next switch in the same row The source node generates the t,,, tag values using the following destination- address based multicast algorithm (set k = 1 in this to obtain the DR algorithm). Procedure Destination-address-based Multicast I. Lettg’jZCB,OSjSh—l,0SiSk—I 2 Let {6,=g|C9_1:;’3J andC,7éfljfor0Si n on a PS—ESMIN, c, = [g] + .r, where .r = 1 for i Z n — (h mod n) or a? = 0 otherwise. 
Note that such closed-form expressions cannot be derived for CU-ESMINs, due to the dimension assignment flexibility at the extra stages.

Let us consider a particular destination D_j among the k destinations (0 <= j <= k-1). Based on the destination D_j, a particular dimension i may be a 1 or a 0. We list these two cases in the following:

1: At the latest stage that corresponds to dimension i, the route must go down. At all other stages of dimension i, the route may go either way, i.e., up or down. This leads to 2^{c_i - 1} possible paths.

0: At the latest stage that corresponds to dimension i, the route must go up. At all other stages of dimension i, the route may go either way. This also leads to 2^{c_i - 1} possible paths.

Therefore, for dimension i, the route has a 2^{c_i - 1}-fold flexibility. Hence, over all n dimensions (regardless of whether they are '1' or '0'), a maximum of prod_{i=0}^{n-1} 2^{c_i - 1} (= 2^{h-n}) routes can be formed. For k destinations, an upper bound on the number of multicast trees is (2^{h-n})^k.

6.3.4 Multicast Tree Selection Criteria

An optimality criterion for multicast tree selection is proposed, with primary focus on the 'traffic' and 'time' metrics. The 'time' metric, for an ESMIN(h), would always equal h. Thus, there is no opportunity for 'time' reduction (besides the fact that with cut-through switching the 'time' metric, or hop count, largely loses its significance). Therefore, we equate traffic minimization with the optimality criterion of multicast tree selection. This is stated as follows: given an ESMIN of n + m stages, a source node S and k destinations (D_i, 0 <= i <= k-1), generate a multicast tree that minimizes the total number of tree edges.

6.3.5 Traffic-Optimal Multicast Trees

The number of possible multicast trees grows exponentially with extra stages, and so does the number of traffic-optimal multicast trees. To verify that there are an exponential number of traffic-optimal multicast trees, Lemma 14 and Theorem 5 are given as follows:

Lemma 14 There are 2^m alternative paths for every (source, destination) pair of nodes in an ESMIN with m extra stages.

Proof: This property has also been proven in [46], and a brief description is given as follows. Let the routing code, with destination routing, be x_0 x_1 x_2 ... x_{h-1}, where each x_i is 0 or 1 for a 2 x 2 switch. To reach a given destination, the bit at the latest stage of each dimension i must match the corresponding destination bit, 0 <= i <= log2 N - 1. Hence, log2 N bits are decided by the destination node but not the other m bits, so there are 2^m possible routing codes that reach the destination. In other words, there are 2^m alternative paths for any (source, destination) pair.

Theorem 5 There are at least 2^m optimal multicast trees (OMTs) for any (source, destination-set) pair on an ESMIN with m extra stages.

Proof: From Lemma 14, there are 2^m alternative paths from the source to the first destination node. By the latest branch algorithm (steps 2 to 4), we can construct one OMT from every one of these paths. Hence, there are at least 2^m OMTs for any (source, destination-set) pair on an ESMIN with m extra stages.

The exact number of OMTs depends on the destination set and on the location of the latest stages of the different dimensions. For example, let the second stage be the latest stage for dimension j, 1 <= j <= n-1, and let there be two destinations x and y. Furthermore, let x and y differ in the j-th bit.
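Lemma 14's counting argument can be verified by brute force on small networks. The sketch below is our own illustrative code (it reuses the stage-dimension list notation from above): the bit at the latest stage of each dimension is forced to the destination bit, and the remaining m bits are free.

def alternative_codes(stage_dims, dest):
    """Enumerate the h-bit destination-routing codes reaching dest:
    bits at the latest stage of each dimension are forced, the other
    m bits are free, so exactly 2**m codes survive (Lemma 14)."""
    h = len(stage_dims)
    latest = {d: j for j, d in enumerate(stage_dims)}   # rightmost wins
    forced = {j: (dest >> d) & 1 for d, j in latest.items()}
    survivors = []
    for code in range(1 << h):
        bits = [(code >> j) & 1 for j in range(h)]
        if all(bits[j] == b for j, b in forced.items()):
            survivors.append(bits)
    return survivors

# Dimension pattern (2,1,2,1,0) has m = 2 extra stages: 2**2 codes
print(len(alternative_codes([2, 1, 2, 1, 0], dest=2)))   # 4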
To reach both x and y, the path must branch at the second stage. After stage 2, there are m-1 extra stages, i.e., 2^{m-1} alternative paths to each of the two destinations. Moreover, these two alternative path sets are independent of each other. Hence, there are 2^{2(m-1)} combinations after the second stage. The first stage also offers two alternative paths to reach the second stage. Hence, there are 2 x 2^{2(m-1)}, or 2^{2m-1}, OMTs.

An example on a CU-ESMIN(8:5) is given in Figures 6.4 and 6.5. The source is 0 and the destination set is {1, 3}. Figure 6.4 gives a CU-ESMIN(8:1,2,1,2,0); the number of OMTs is 2^m, which is 4, as shown in the figure. In Figure 6.5, the dimension pattern is (1,1,2,2,0). Destinations 1 and 3 differ at bit 1, so the multicast tree splits at the second stage. After this stage there is one extra stage, which implies two alternative paths for every path; hence there are 4 combinations. Since there are two alternative paths at the first stage, there are 8 (= 2^{2x2-1}) OMTs.

Figure 6.5: The eight optimal multicast trees in an ESMIN(8:1,1,2,2,0).

6.4 Multicast Algorithm

This section describes a multicast tree generation algorithm and shows that it can generate a traffic-optimal multicast tree in polynomial time. The algorithm applies generically to PS-ESMINs as well as to CU-ESMINs, and in our discussion we interchangeably use illustrations from both types of ESMINs. Note that the PS-ESMIN is merely a specific instance of the CU-ESMIN, and assertions for the latter imply the same for the former.

We also propose three other multicast heuristics that are non-optimal. These heuristics are compared against the optimal algorithm in the simulation performance section.

6.4.1 Latest Branch Multicast Algorithm

The key idea of this algorithm is to branch the message as late as possible. Formally, consider a CU-ESMIN(N:h). The algorithm is specified as follows: when the destination id has a 0 (1) at dimension j, go up (down) at the latest (rightmost) dimension-j stage, where 0 <= j <= n-1; at all other stages, go straight to the next switch in the same row. The latest branch multicast algorithm (LB) can be implemented in the following way (a code sketch follows Figure 6.6 below):

Step 1: For each c_i > 1 (0 <= i <= n-1), mark the non-latest stages of dimension i as 'null' stages. For every stage marked 'null,' eliminate the cross links and keep only the links that connect within a switch row.

Step 2: For every 'null' stage (also called 'horizontal'), the message is simply forwarded over the straight link to the switch in the same row at the next stage. For every non-null stage, i.e., the latest stage of each dimension, go up if the corresponding destination bit is 0; otherwise, go down.

Let the ESMIN be represented by enumerating its dimensions, and let a stage marked 'null' be labeled accordingly. An example LB multicast in a CU-ESMIN(16:1,2,3,3,1,1,0) is shown in Figure 6.6.

Figure 6.6: Latest-branch multicast algorithm illustration: (a) CU-ESMIN(16:1,2,3,3,1,1,0); (b) construction of the CU-ESMIN1 by seeking the rightmost occurrence of each dimension and marking the remaining stages as 'horizontal' (horizontal stages merely carry the data forward); (c) example multicast in the CU-ESMIN1.
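The two steps reduce to a few lines. The following Python sketch is our own illustration (the names are ours); it marks non-latest stages as horizontal and branches on the destination bit at each latest stage:

def lb_actions(stage_dims, dest):
    """Latest-branch actions for one destination: 'null' (horizontal)
    stages go straight; the latest stage of each dimension goes up on
    a 0 destination bit and down on a 1 (tag values 3, 0 and 1)."""
    latest = {d: j for j, d in enumerate(stage_dims)}   # rightmost wins
    actions = []
    for j, d in enumerate(stage_dims):
        if latest[d] != j:
            actions.append('straight')                  # a 'null' stage
        else:
            actions.append('up' if (dest >> d) & 1 == 0 else 'down')
    return actions

# CU-ESMIN(16:1,2,3,3,1,1,0) reduces to (null,2,null,3,null,1,0):
print(lb_actions([1, 2, 3, 3, 1, 1, 0], dest=5))
# ['straight', 'down', 'straight', 'up', 'straight', 'up', 'down']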
The LB algorithm marks the non-latest positions of each dimension as 'null', i.e., it converts the CU-ESMIN(16:1,2,3,3,1,1,0) to (null,2,null,3,null,1,0), as shown in Figure 6.6b. The CU-ESMIN is then merely a regular MIN (with a few extra 'straight' stages) in which multicast can be done trivially, e.g., Figure 6.6c.

Pseudo Code

A general LB algorithm, which can generate any one of the exponentially many optimal multicast trees, is expressed as follows:

Procedure latest branch multicast
1. Let t_{0,j} be 0 or 1 at random, 0 <= j <= h-1
   /* This gives one of the 2^m possible paths from the source node to the first destination */
2. Let t_{i,j} = t_{0,j}, 1 <= i <= k-1 and 0 <= j <= h-1
3. For each dimension d, 0 <= d <= n-1, find the latest (rightmost) stage g_d with C_{g_d} = d
4. Let t_{i,g_d} = bit d of D_i, 0 <= i <= k-1 and 0 <= d <= n-1

A restricted LB algorithm, which generates the unique multicast tree discussed in the previous part of this section and which is referred to simply as LB in the following text, is obtained by revising the first step to

1'. Let t_{0,j} = 3, 0 <= j <= h-1

so that every non-latest stage forwards the message straight.

6.4.2 Optimality and Complexity

The optimality of LB follows by contradiction: a multicast tree that branches earlier than the latest stages carries multiple copies across more stages and therefore uses more channels, not fewer. This contradicts the assumption that CMT(S, D, C') has fewer channels than CMT(S, D, C*). Therefore, CMT(S, D, C*) is an optimal multicast tree.

Complexity: The first, second, and last steps of the LB algorithm are repeated h, h x (k-1), and h x k times, respectively. In other words, the complexity is O(h x k) for these three steps. Finding the rightmost i-th dimensional stage, 0 <= i <= n-1, takes at most h steps, so the complexity of the third step is also O(h x k). Hence, the complexity of the LB algorithm is O(h x k), which is indeed polynomial.

6.4.3 Other Sub-Optimal Multicast Heuristics

The LB multicast algorithm is traffic-optimal, so its performance is identical to that of exhaustively generated optimal solutions. Hence, a question arises as to the basis of comparison for the proposed LB algorithm. Our approach is to compare the LB algorithm against what current practice might be. In the absence of our optimal multicast algorithm, and since there is no prior work on ESMIN multicast, current applications would use either a random dimension (RD) selection approach or a first-available (i.e., greedy) approach to dimension selection. These lead to three heuristics, First-Branch (FB), Random Branch (RB), and Random Mapping (RM), which may not always generate traffic-optimal multicast trees. Their performance is compared with the LB algorithm in the next section.

First-Branch

Instead of selecting the latest stage for each dimension, the FB approach selects the earliest stage for effecting each dimension. All non-earliest dimension stages are masked off, and messages are simply forwarded to the next stage within their respective rows. In other words, for each dimension i that has a 0 in the destination id, the FB approach goes up at the leftmost stage occurrence of dimension i; at all other, non-leftmost occurrences of dimension i, the message is forwarded straight. Likewise, if the destination node has a 1 bit at dimension i, the FB approach goes down at the leftmost stage occurrence of dimension i, and the message is forwarded straight at all other occurrences. The tag generation mechanism for the FB approach is described below (a sketch contrasting the heuristics' stage choices follows the procedure).

Procedure first branch multicast
1. Let t_{i,j} = 3, 0 <= i <= k-1 and 0 <= j <= h-1
   /* '3' indicates "go straight to the next switch in the same row" */
2. For each dimension d, 0 <= d <= n-1, let e_d be the earliest (leftmost) stage with C_{e_d} = d; let t_{i,e_d} = bit d of D_i, 0 <= i <= k-1
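The strategies differ only in which occurrence of each dimension carries the branch. The sketch below is our own illustration; RB and RM are collapsed here into a single random-occurrence rule purely for contrast, since their precise definitions differ in detail:

import random

def pick_branch_stage(stage_dims, d, rule):
    """Stage that branches for dimension d under each heuristic:
    'LB' = rightmost occurrence (optimal), 'FB' = leftmost,
    'RANDOM' = an arbitrary occurrence (a stand-in for RB/RM)."""
    occ = [j for j, dim in enumerate(stage_dims) if dim == d]
    if rule == 'LB':
        return occ[-1]
    if rule == 'FB':
        return occ[0]
    return random.choice(occ)

dims = [1, 2, 3, 3, 1, 1, 0]
print([pick_branch_stage(dims, 1, r) for r in ('LB', 'FB', 'RANDOM')])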
Figure 6.9: The number of channels used in the latest branch algorithm (patterns 1 through 6, plotted against the number of destinations).

Figure 6.10: The number of channels used in the random mapping algorithm (patterns 1 through 6, plotted against the number of destinations).

As shown in Figure 6.8, patterns 2 and 3 have similar performance. Patterns 1, 4, 5, and 6 also perform closely under the FB heuristic. Patterns 2 and 3 use fewer channels because the other patterns split the message within the first 5 stages. For the LB, patterns 1 and 4, as well as patterns 2 and 3, have the same performance because of the similarity of these patterns. For patterns 1 and 4, the LB uses only 6 channels in the first 5 stages: 5 to connect the stages and one to connect the source node to the first stage. At the rightmost 5 stages, pattern 1 is the same as pattern 4; therefore, they have the same T metric. The even distribution of destinations gives patterns 2 and 3 similar performance. However, this result does not consider the locality of processing nodes; depending on locality, pattern 3 and pattern 4 may perform entirely differently. Under the RM, the performance of patterns (1,4) is like that of patterns (2,3). Patterns 5 and 6 differ slightly due to the location of the extra stages. In general, the later the message splits, the fewer channels are used.

Figure 6.11 shows the B metric for the RB heuristic. The blocking rate is over 20% when there are more than 10 destinations, and over 40% when there are 20 or more destinations, for every pattern. This result suggests that the RB algorithm should not be used even when the number of destinations is small; otherwise, an intelligent switch, able to merge two messages with different routing tags but identical data, is needed to improve the blocking rate. The T metric for the RB heuristic is not reported, since the RB heuristic does not always lead to a successful multicast. T, and also B, increase with an increasing number of destinations in all cases, as is to be expected. However, due to the channel saturation effect, the rate of increase gradually declines as the number of destinations grows.

Figure 6.11: The probability of blocking in the random branch algorithm.

Figure 6.12 compares the T metric for the optimal LB algorithm and the FB and RM heuristics. It is easy to observe that LB leads to the least traffic, RM the next least, and FB the most, which follows from the fact that LB is optimal and RM is an intermediate generalization between LB and FB. Not all patterns are reported in Figure 6.12, since some of the patterns lead to overlapping results.
At this moment we are not reporting an inter-pattern traffic variation analysis; however, we believe that the traffic generated is strongly correlated with the dimension pattern.

Figure 6.12: The number of channels used in the three non-blocking algorithms (LB, FB, and RM).

6.6 Summary

In this chapter, we discussed the establishment of a multicast tree in the ESMIN, which is an extended design of the regular MIN. The ESMIN provides several advantages, such as flexible routing paths and fault tolerance. The ESMIN multicast problem was formulated and an optimality criterion defined. An upper bound on the number of multicast trees was estimated, and we showed that the total number of traffic-optimal multicast trees may itself be exponential. However, a traffic-optimal multicast tree can be generated in polynomial time, and the proposed latest branch multicast algorithm achieves this. When multiple multicast communications occur simultaneously, deadlock becomes possible, as shown in Chapter 4. Hence, some systems rely on unicast-based multicast, or software-based multicast. How to support an efficient software-based multicast on such systems is discussed in the next chapter.

CHAPTER 7

Software-based Multicast

Most existing message-based SPCs support only unicast communication, that is, single-destination message passing, in hardware. In this environment, multicast must be implemented in software by sending unicast messages. One way to implement multicast in such systems is separate addressing, in which a separate copy of the message is sent from the source to every destination. As the number of destinations increases, separate addressing may require excessive time. An alternative approach is to use a multicast tree, in which the source sends the message to only a subset of the destinations, and each recipient forwards it to some subset of the destinations that have not yet received it. The focus of this chapter is on such multicast tree implementations, also known as unicast-based multicast implementations.

7.1 Issues in Multicast Communication

Although hardware implementations of multicast communication would intuitively offer better performance than software implementations, many such implementations either exhibit undesirable properties or are restricted in their use. The NEC Cenju-3 claims to provide restricted multicast in hardware, with the limitation that all destination addresses must be consecutive. This restricted multicast is called single region broadcast. Unfortunately, single region broadcast is not deadlock-free unless only one node multicasts at a time. A deadlock example is illustrated in Figure 7.1, where a shaded switch indicates that a message is blocked by the other in that switch. In the initial instance, given in Figure 7.1(a), nodes 0 and 5 initiate multicasts while two unicasts, node 1 to node 5 and node 4 to node 2, are in transit in the network. Hence, both source nodes grab partial sets of destinations. After both unicasts complete, as shown in Figure 7.1(b), each source node holds half of the nodes and waits for the other half to become available, and deadlock occurs.
Figure 7.1: An example of deadlock in single region broadcast: (a) initial instance; (b) final instance.

Although some deadlock-free multicast algorithms have been proposed [25, 47, 48, 49], most existing wormhole-switched SPCs support only unicast communication in hardware. In these environments, all communication operations must be implemented in software by sending one or more unicast messages. One main issue is developing an efficient multicast tree. Which type of multicast tree to use depends on the underlying switching strategy and unicast routing algorithm. An efficient multicast tree should involve no local processors other than the source and destination processors, should exploit the distance-insensitivity of wormhole switching, and should maintain a minimum height, specifically height k = ceil(log2(m)) for m - 1 destination nodes. Another desirable property is that there be no channel contention among the constituent messages of the multicast; that is, the unicast messages involved should not simultaneously require the same channel. How to achieve these goals depends on the network topology, switching strategy, and unicast routing algorithm of the SPC. The issues and difficulties involved in implementing efficient multicast communication in unidirectional MINs are illustrated in the following small-scale example.

Consider an 8-node multistage cube network built with 2 x 2 switches, and suppose a multicast message is sent from source 100 to all other nodes, {000, 001, 010, 011, 101, 110, 111}. Figure 7.2(a) shows a binary multicast tree. At step 1, the source sends the message to node 000. At step 2, nodes 100 and 000 inform nodes 110 and 010, respectively. Continuing in this fashion, this implementation requires 4 steps to reach all destinations. Taking advantage of the distance insensitivity of wormhole routing, the duration of each step should be approximately equal to the duration of a single unicast transmission of the message. In other words, it is correct to assume that each step requires unit time as long as there is no channel contention among the messages transmitted within each step. For this reason, the multicast latency in Figure 7.2(a) is 4 time steps.

In Figure 7.2(b), the shape of the tree is rearranged so that the number of steps needed to complete the tree is reduced to 3. However, closer inspection reveals that the message sent from node 100 to node 001 and the message sent from node 000 to node 010 in step 2 use two common channels. The contention for those channels would force one message to block the other from using the channel. Consequently, these two unicasts cannot take place during a single time step. As a result, the multicast latency in Figure 7.2(b) is actually larger than 3 time steps.

Figure 7.2: Unicast-based software multicast trees: (a) a binary multicast tree; (b) potential channel contention in step 2; (c) potential depth contention; (d) a contention-free multicast tree.
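Channel contention of the kind in Figure 7.2(b) can be checked mechanically. The following sketch is our own illustration and assumes an omega-style shuffle-exchange MIN for concreteness (the multistage cube network of the figure behaves analogously under a different link-labeling rule):

def omega_path(src, dst, n):
    """Links a worm from src to dst crosses in an n-stage omega
    network on 2**n nodes: the link entered after stage i is the
    n-bit window mixing the low source bits with the high i+1
    destination bits (the classic shuffle-exchange property)."""
    mask = (1 << n) - 1
    return {(i, ((src << (i + 1)) | (dst >> (n - 1 - i))) & mask)
            for i in range(n)}

def stepwise_contention(unicasts, n):
    """True if any two same-step unicasts share a channel."""
    used = set()
    for src, dst in unicasts:
        path = omega_path(src, dst, n)
        if used & path:
            return True
        used |= path
    return False

# Step 2 of Figure 7.2(b): 100 -> 001 and 000 -> 010 collide
print(stepwise_contention([(0b100, 0b001), (0b000, 0b010)], 3))  # True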
This situation is rectified in Figure 7.2(c), where the messages sent within a particular time step do not contend for common channels. Contention among messages sent in different steps may still arise, however, if the message length is small and the sending latency is large. For ease of explanation, we model the sending latency as 2t and the receiving latency as t, and the length of the multicast message is chosen so that its network latency is t. As shown in Figure 7.2(c), the message transmission from node 100 to node 110 and the message transmission from node 000 to node 010 take place concurrently during the time period between 6t and 7t, and contention occurs in the shadowed areas. The multicast tree in Figure 7.2(d), which is based on the methods presented in the next section, is contention-free regardless of message length or startup latency. In the next section, we will also show that such a contention-free multicast does not always exist.

7.2 Multicast Algorithm

This section defines an optimal multicast as well as an algorithm to implement it in certain network topologies. We also prove that for some other network topologies an optimal multicast tree may not exist, even if the source and destination nodes form a cube.

The theoretical model for unicast-based multicast communication has been addressed in [50]. The underlying topology of the network is represented by a directed graph G(V, E) with vertex set V(G) and arc set E(G). A vertex u in V(G) represents a switch box or a processor node. An arc (u, v) in E(G) represents the unidirectional channel from switch box u to switch box v, from processor node u to switch box v, or from switch box u to processor node v. An alternating sequence u e1 v1 e2 ... v_{k-1} e_k v of distinct vertices and distinct arcs, starting with vertex u and ending with vertex v, is called a directed path, or path for short.

A unicast operation can be defined as an ordered quadruple (u, v, P(u, v), t), where u and v are the source and destination vertices respectively, P(u, v) is a given path in G over which the message will traverse, and t is the step at which the unicast takes place. Each unicast is assumed to take a unit of time whose duration is independent of the path length. As a result, a unicast that begins at time step t should terminate at time step t + 1, provided that no channel it encounters is blocked by other messages. This assumption is consistent with the wormhole switching strategy, which diminishes the effect of path length on communication latency.

In a one-port architecture, in which a processor node may send (receive) only one message at a time, two unicasts (u, v, P(u, v), t) and (x, y, P(x, y), tau), with t = tau, are called feasible if the vertices u, v, x, and y are all distinct processors. A set of unicasts U_t = {(u1, v1, P(u1, v1), t), (u2, v2, P(u2, v2), t), ..., (u_k, v_k, P(u_k, v_k), t)}, whose members are pairwise feasible, is called a feasible unicast set. A multicast group can be represented by a set M = {d0, d1, d2, ..., d_{m-1}}, where vertex d0 represents the source and vertices d1, d2, ..., d_{m-1} represent the destinations. An implementation I(M) of a unicast-based multicast request M is a sequence of feasible unicast sets U1, U2, ..., U_k satisfying the following conditions:

1. For each j, 1 <= j <= k, if (u, v, P(u, v), j) is in U_j, then both u and v belong to M.

2. The set U1 = {(d0, u, P(d0, u), 1)}, where u = d_i for some i, 1 <= i <= m - 1.
3. For every unicast (u, v, P(u, v), t) in U_t, 1 < t <= k, there must be a set U_tau with tau < t which has (w, u, P(w, u), tau) as a member.

4. For every destination d_i, 1 <= i <= m - 1, there exists one and only one integer t such that 1 <= t <= k and (w, d_i, P(w, d_i), t) appears in U_t for some vertex w.

The first condition guarantees that only the destination processors of the given message are involved in the implementation. The second condition states that the first step of the implementation involves a single unicast from the source to one of the destinations. The third condition ensures that a destination processor has received the message before it may forward it to another destination processor. Finally, the fourth condition guarantees that every destination processor receives the message exactly once.

Condition 3 above also implies that the total number of destinations holding the message can at most double during each step. Therefore, ceil(log2 m) is the greatest lower bound on the number of steps required by an implementation. An implementation requiring exactly ceil(log2 m) steps is referred to as a minimum-step implementation.

Definition 10 Two feasible unicast operations (u, v, P(u, v), t) and (x, y, P(x, y), t) are called stepwise contention-free if P(u, v) and P(x, y) are arc-disjoint. An implementation is called stepwise contention-free if the elements in each unicast set U_t are pairwise stepwise contention-free.

Stepwise contention-freeness guarantees that the minimum multicast latency can be achieved by a minimum-step implementation when the message size is large and the startup latency is negligible.

Definition 11 A vertex v is in the reachable set R_u of a vertex u if and only if there exists a t, 1 <= t <= k, such that either (u, v, P(u, v), t) is in U_t or (w, v, P(w, v), t) is in U_t for some node w in R_u.

The reachable set of a vertex u contains those vertices in M to which the message is sent after having been handled by vertex u. If the implementation I(M) is viewed as a tree of unicast messages, then the reachable set of a vertex u is the set of vertices in the subtree rooted at u. For example, in Figure 7.2(b), R_000 = {010, 101, 110}. Using this definition, the characteristic of an implementation necessary to avoid contention between messages sent in different steps, called depth contention, can be formally defined as follows.

Definition 12 An implementation I(M) is depth contention-free if and only if P(u, v) and P(x, y) are arc-disjoint for any two unicasts (u, v, P(u, v), t) and (x, y, P(x, y), tau) in I(M) such that u != x, u is not in R_y, and x is not in R_u.

Depth contention covers all possible types of channel contention among concurrently transmitted messages, without consideration of message size or startup latency. Since each processor has a one-port communication architecture, node u has to send messages to any two nodes, v and y, sequentially. Thus, the message transmitted from u to v and the message transmitted from u to y will never contend for a common channel. In addition, no node in R_u can begin to send or receive a copy of the message before u completes receiving its copy of the message. This implies that the message transmitted from u to v and the message transmitted from x to y will never contend for a common channel if x is in R_u or u is in R_y. An implementation must be stepwise contention-free if it is depth contention-free.
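Definitions 10 and 12 translate directly into a checker. The following is a minimal sketch of our own (the data layout is an assumption: each unicast is a tuple (u, v, path, t) with path a list of arcs):

from itertools import combinations

def reachable(impl, u):
    """R_u: the vertices in the subtree of the unicast tree rooted at u."""
    kids = {}
    for s, d, path, t in impl:
        kids.setdefault(s, []).append(d)
    out, stack = set(), [u]
    while stack:
        for d in kids.get(stack.pop(), []):
            if d not in out:
                out.add(d)
                stack.append(d)
    return out

def depth_contention_free(impl):
    """Definition 12: any two unicasts not sharing a sender and not
    ordered by the message flow must use arc-disjoint paths."""
    for (u, v, p1, t1), (x, y, p2, t2) in combinations(impl, 2):
        if u == x:
            continue                    # one-port: sent sequentially
        if x in reachable(impl, u) or u in reachable(impl, y):
            continue                    # ordered by the message flow
        if set(p1) & set(p2):
            return False                # a shared arc: depth contention
    return True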
Depth contention-freeness guarantees that the minimum multicast latency can be achieved by a minimum-step implementation when the message size is small and the unicast communication latency is dominated by startup latency. An implementation that attains the minimum multicast latency is said to execute in minimum time. Such an implementation is also called an optimal multicast in what follows.

7.3 Non-Optimal Multicast in Baseline and Butterfly Networks

This section shows that an optimal multicast may not exist for either baseline or butterfly networks. The following lemmas prove this.

Lemma 15 In an N-node baseline network, where N = k^n and n >= 3 if k >= 4 or n >= 5 if k = 2, an optimal unicast-based multicast may not exist when the source and destination nodes form a cube.

Proof: We show that a stepwise contention-free multicast need not exist. As shown in Lemma 10, the channel used in connection C_i is d_{n-1}...d_{i+1} s_{n-2}...s_{n-i-1} d_i. Therefore, this channel is shared by the source nodes x_{n-1} s_{n-2}...s_{n-i-1} x_{n-i-2}...x_0 and the destination nodes d_{n-1}...d_i y_{i-1}...y_0, where the x's and y's are variables ranging from 0 to k - 1. Since there must be at least four nodes in the cube to have two node-disjoint unicast operations, n - i - 1 >= 2 and i >= 2 for k = 2, or n - i - 1 > 0 and i > 0 for k >= 4. Solving these inequalities, we get n >= 5 when k = 2, or n >= 3 when k >= 4. Let d_j = s_j for t <= j <= n - 1, where t is the minimum of {i, n - i - 1}. These nodes form a cube C with address code d_{n-1}...d_t x_{t-1}...x_0. This implies that P(s1, d1) and P(s2, d2) are not arc-disjoint for {s1, d1, s2, d2} in C; thus, the unicast operations (s1, d1, P(s1, d1), t) and (s2, d2, P(s2, d2), t) are not stepwise contention-free. Therefore, a minimum-time unicast-based multicast may not exist in the baseline network.

For example, let the cube be 000XX in a 32-node baseline network constructed from 2 x 2 switches. As shown in Lemma 10, channels 00XX0 are used in C4, 000X0 in C3, 00000 in C2, 0000X in C1, and 000XX in C0. In connection C2, all nodes share the same channel to reach any other node. This implies that no two node-disjoint unicast operations are stepwise contention-free in this cube.

Lemma 16 In an N-node butterfly network, an optimal unicast-based multicast may not exist when the source and destination nodes form a cube.

The proof of this lemma is similar to that of Lemma 15. Since an optimal multicast does not always exist in baseline and butterfly networks, we concentrate our efforts on cube and omega networks. However, the algorithm proposed below may apply directly to modified baseline and butterfly networks, since they have the same network partitionability as cube and omega networks.

7.4 The C-min Algorithm

The C-min (Corresponding MIN) algorithm is proposed not only to achieve an optimal unicast-based multicast but also to reduce the probability of contention among different unicast/multicast communications. The basic idea of the C-min algorithm is to divide the lexicography-ordered chain [51] into two even halves, the upper chain and the lower chain. The source node then delivers the message to the corresponding position in the other half. This process is executed recursively until every node has received a copy of the message. The C-min algorithm is formally specified in Figure 7.3.
In Figure 7.3, the C-min algorithm first sorts the source and destination addresses in lexicographic order, producing what is known as the lexicography-ordered chain, denoted Phi. The source successively divides Phi in half. If the source node is in the lower or upper half, it sends a copy of the message to the corresponding node in the upper or lower half, respectively. A corresponding node is the node that has the same position as the source node when Phi is divided into two halves. That destination will then serve as the source for the upper (lower) half if it is located in the upper (lower) half, using the C-min algorithm recursively. A source node continues this procedure until Phi contains only its own address. This divide-and-conquer property makes the C-min algorithm successful in attaining a minimum-step multicast implementation.

Algorithm: C-min Algorithm
Inputs: Phi: lexicography-ordered chain {d_l, d_{l+1}, ..., d_r} of the source and destination addresses
        d_s: the address of the source node
Procedure:
  while l < r do
    c = floor((l + r)/2) + 1;          /* first index of the upper half */
    if s < c then                      /* send to the upper cube */
      i = s + (c - l);
      D = {d_c, d_{c+1}, ..., d_r};
      r = c - 1;
    else                               /* send to the lower cube */
      i = s - (c - l);
      D = {d_l, d_{l+1}, ..., d_{c-1}};
      l = c;
    endif
    Send a message to node d_i with the address field D;
  endwhile

Figure 7.3: The C-min algorithm.

Figure 7.4 shows how to obtain the optimal multicast implementation of Figure 7.2(d) using the C-min algorithm. The source, node 100, begins with the lexicography-ordered chain Phi = {000, 001, 010, 011, 100, 101, 110, 111}. As shown in Figure 7.4, the source node first sends a copy of the message to node 000, the node with the corresponding position in the lower half of Phi. The lower half of Phi is then deleted, so the nodes remaining in Phi are {100, 101, 110, 111}. Since source node 100 is the first node in the new Phi, it sends a copy to the first node, node 110, of the new upper half. Finally, source 100 sends a copy of the message to node 101. Each of the receiving nodes is likewise responsible for delivering the message to the nodes in its subtree using the same algorithm. As shown in the figure, this multicast implementation requires 3 steps. Note that the associated diagram of this implementation on a multistage cube network was shown in Figure 7.2(d).

Figure 7.4: An example of a software multicast tree implemented by the C-min algorithm.

To show that the C-min algorithm achieves minimum-time multicast as well as minimum contention probability, we will show that all transmissions are located in disjoint s-cubes, formally defined as follows:

Definition 13 An s-cube is a smallest k-ary cube in a k-ary m-cube.

An s-cube, which is a k-ary 1-cube, contains only k nodes, whose addresses vary in a single position. From Lemmas 8 and 9, which imply that two disjoint s-cubes have no communication interference, we obtain the following lemma.

Lemma 17 Communications in disjoint s-cubes are contention-free in cube and omega networks.

Another important property of an s-cube is that communications within an s-cube are contention-free as long as those communications are node-disjoint. When k = 2, an s-cube contains only 2 nodes, which implies that there is at most one communication; hence the communication is trivially stepwise and depth contention-free. In the following, we consider the case where k >= 4 and k is a power of 2.
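Figure 7.3 transcribes directly into a recursion. The sketch below is our own rendering; it returns the (sender, receiver, step) edges of the full tree and assumes a power-of-two chain length, as in the examples of this chapter:

def cmin_tree(chain, s, step=1):
    """All (sender, receiver, step) edges of a C-min multicast tree.
    chain: lexicography-ordered node list; s: index of the source."""
    if len(chain) == 1:
        return []
    h = len(chain) // 2
    lower, upper = chain[:h], chain[h:]
    if s < h:                                   # source in the lower half
        edges = [(chain[s], upper[s], step)]    # corresponding node
        edges += cmin_tree(lower, s, step + 1)  # source keeps its half
        edges += cmin_tree(upper, s, step + 1)  # receiver serves the other
    else:
        edges = [(chain[s], lower[s - h], step)]
        edges += cmin_tree(upper, s - h, step + 1)
        edges += cmin_tree(lower, s - h, step + 1)
    return edges

chain = ['000', '001', '010', '011', '100', '101', '110', '111']
for edge in sorted(cmin_tree(chain, chain.index('100')), key=lambda e: e[2]):
    print(edge)   # 7 edges in ceil(log2 8) = 3 steps, as in Figure 7.4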
Lemma 18 Node-disjoint communications within an s-cube are contention-free in cube and omega networks.

Proof: Assume that there is contention between two node-disjoint unicast operations (s1, d1, P(s1, d1), t) and (s2, d2, P(s2, d2), t), where {s1, s2, d1, d2} belong to an s-cube with address code a_{n-1}...a_{j+1} x a_{j-1}...a_0. Based on destination routing, the unicast operation (s, d, P(s, d), t) uses the channels a_{n-1}...x...a_{i+1} a_{i-1}...a_0 a_i when j > i, a_{n-1}...a_{i+1} a_{i-1}...x...a_0 a_i when j < i, and a_{n-1}...a_{i+1} a_{i-1}...a_0 x when j = i. Since the x persists through these address changes, two unicast operations sharing a common channel would imply that either x_{s1} = x_{s2} or x_{d1} = x_{d2}, a contradiction.

The proof for omega networks is similar and is omitted. Note that as long as the source nodes are distinct and the destination nodes are distinct, the unicast transmissions within an s-cube are stepwise contention-free. To show that the C-min algorithm achieves optimal multicast, we also need to show that multicast in an s-cube is depth contention-free.

Lemma 19 The multicast tree generated by the C-min algorithm in an s-cube is depth contention-free.

Proof: Let (s1, d1) and (s2, d2) be two source-destination pairs in different steps with s1 != s2. Assume these two pairs use a common channel. From the C-min algorithm, we have d1 != d2. From Lemma 18, s1 must equal s2 in order to share a common channel, a contradiction.

Since the C-min algorithm divides the k-ary m-cube into k disjoint k-ary (m-1)-cubes every log2 k steps, we have the following theorem from Lemmas 17, 18, and 19.

Theorem 7 The implementation constituting a C-min tree is stepwise contention-free, depth contention-free, and a minimum-time implementation.

7.5 Performance Evaluation

Our performance evaluation is based on a simulator which simulates different unidirectional MINs, including cube, baseline (which is used in the NEC Cenju-3), butterfly, and omega networks. The message length is uniformly distributed between 32 and 96 flits. The number of nodes is 64, and different sizes of switches constituting the network are considered in the simulation. Latency is reported with a 95% confidence interval. Multiple multicasts, where each multicast is in a separate cube, are allowed simultaneously, since the unicast-based multicast is deadlock-free. Based on previous discussions, it is easy to see that the cube and omega networks should have identical performance. Our simulation results confirmed this observation; therefore, only cube networks are considered in the following.

Figure 7.5: The latency of blocking multicasts (baseline, butterfly, and cube networks; C-min, SA, and U-min algorithms; plotted against the number of multicasts).

Figure 7.5 gives the latency when multiple blocking multicast communications are initiated simultaneously. Every cluster contains 16 nodes, but only a limited number of nodes can initiate multicast communications. By carefully examining the networks, one finds that the number of channels in connection C1 of the butterfly network is reduced from 16 to 4.
This implies that the latency on the butterfly will be worse than on the other two networks. This is confirmed by Figure 7.5. Although there are 16 channels between any two stages of the baseline network, there are potential channel contentions when node-disjoint communications are conducted within an s-cube. For example, sources 0 and 2 compete for channel C0 at connection C1 when they forward messages to destinations 1 and 3, respectively. Since such unicast patterns are the basis of the C-min algorithm, the performance of C-min on a baseline network is degraded.

For a cube network, the C-min can fully utilize the network; hence, it has the shortest latency on the cube network, as shown in Figure 7.5. Unlike on the cube network, the U-min has a smaller latency than the C-min algorithm on the baseline and butterfly networks, due to batch-like message forwarding at the middle nodes. The performance of separate addressing on the cube and baseline networks is similar, since separate addressing can fully use all 16 channels.

Figure 7.6 shows the latency on the various networks for non-blocking multicast communications with 16 nodes in each cluster. The system load is defined as the average number of flits expected on an output port per time unit; this definition is identical to the definition of load in Chapters 4 and 5.

Unlike the hardware multicast case, where performance was largely network-independent, it is easy to observe from Figure 7.6 that the network topology is a very important factor. The cube network always has the lowest latency at all traffic loads compared with the other two network topologies, because of network partitionability and traffic localization. The size of the switches is another important factor. Larger switches provide lower latency in cube and baseline networks, but not in butterfly networks; this contrary result in the butterfly is due to the identity connection in the leftmost stage. Note that 64-node cube and baseline networks with 8 x 8 switches are identical.

Figure 7.6: The latency of the C-min algorithm on various networks with various sized switches (2 x 2, 4 x 4, and 8 x 8).

The comparison among three unicast-based multicast algorithms, C-min, U-min, and separate addressing (SA), is given in Figure 7.7. Three processor cluster configurations are considered: 16 clusters with 4 nodes per cluster, 4 clusters with 16 nodes per cluster, and a single 64-node cluster. Note that the U-min algorithm was designed for bidirectional MINs [51], not for unidirectional MINs; its performance is given here only for comparison purposes. In a lightly loaded environment, both the C-min and U-min algorithms provide lower latency. Since the U-min always sends messages to the middle nodes first, the queues at these nodes saturate quickly; the latency shows this effect even when the number of nodes in a cluster is small. Separate addressing resembles a batch process; therefore, it should have a higher throughput in a heavily loaded environment, which also implies a lower latency. This is confirmed in Figure 7.7.

Figure 7.7: The latency of various algorithms on a cube network constructed with 4 x 4 switches.

Figure 7.8 gives the throughput of the different networks with the different multicast algorithms. The results also verify the previously mentioned phenomena. Due to saturation in the U-min algorithm, its throughput decreases rapidly as the number of nodes in a cluster increases. The peak throughput of the C-min is better than that of separate addressing over certain ranges, since the C-min evenly distributes its traffic.

Figure 7.8: The throughput of various algorithms on cube networks with 4 x 4 switches.

7.6 Summary

In this chapter, our original intention was to propose an optimal unicast-based multicast algorithm for unidirectional MINs. The first finding of this work is that known topologically equivalent delta-class networks have quite different capabilities in supporting multicast. We have shown that cube and omega networks can always support optimal software multicast, while baseline and butterfly networks may not. This suggests that, among MINs, either the cube or the omega network is the better choice for constructing unidirectional MINs for SPCs. The proposed C-min algorithm provides a minimum unicast-based multicast for such networks even when there are multiple multicasts. Although the modified baseline and butterfly networks are not discussed in this chapter, the C-min algorithm may apply to both networks directly.

CHAPTER 8

Related Work

As discussed earlier, multicast communication is a frequently used communication pattern. Some low-level languages based on message-passing interfaces or libraries, such as MPI and PVM, explicitly define a number of collective communication operations, many of which can benefit from efficient multicast communication [1, 52]. In high-level languages such as HPF, the multicast communication pattern is specified implicitly, for example when aligning a low-dimensional array to a high-dimensional array [8]. Multicast communication is also in demand in high-speed networks, such as various ATM switches and the DEC GIGAswitch, for various applications [3, 4, 5]. Multicasting has been studied for distributed systems, in the forms of reliable multicast (e.g., [53]) and ordered multicast (e.g., [54]), and for wide area networks, such as the MBONE [55] and Internet multicast [56, 57].

Multicast communication has been extensively studied for distributed-memory multicomputers based on direct network architectures, such as mesh and hypercube topologies. Hardware multicast support for hypercube machines based on virtual cut-through switching was proposed in [58, 22], while a prototype VLSI multicast router was designed in [26]. Performance evaluation of various wormhole multicast routing algorithms was reported in [59]. Since then, many researchers have contributed to this important area (e.g., [47, 60, 48, 61, 49, 62]). A theoretical study of multicast communication for 2D mesh and hypercube topologies can be found in [63]. Some parallel computer vendors have also tried to support multicast communication directly in their machines.
The nCUBE-2 was the first hypercube machine supporting a restricted multicast, in which each multicast is actually a broadcast within a subcube. However, as pointed out in [59], deadlock is possible in the nCUBE-2 when there are two or more simultaneous multicasts within a subcube. Consequently, nCUBE decided to disable their hardware multicast support and replaced it with software multicast. The Cray T3D has dedicated hardware to support a special case of multicast, namely barrier synchronization [42]. As pointed out in [64], barrier synchronization can be efficiently implemented by utilizing underlying multicast support.

Multicast communication for MIN-based parallel architectures was studied in [32]. However, that multicast support is restricted to a multistage cube network and to destination nodes forming a subcube. The TMC CM-5 has a dedicated control network to support multicast communication [65]; the multicast is restricted to one message at a time and must form a subtree. The NEC Cenju-3 also supports multicast [18]. The multicast destinations must have consecutive addresses (not necessarily a power of 2). However, deadlock is still possible if there are two or more simultaneous multicast communications, as shown in this thesis. Subsequently, NEC decided to disable their hardware multicast support. Note that both the CM-5 and the Cenju-3 support wormhole switching.

Multicasting has also been studied in MIN-based ATM switch design. Liew proposed a multicast algorithm for ATM switches using the Clos network in [66, 67] and gave a performance evaluation in [68]. The multicast support in [27] involves three segments: a copy network, a distribution network, and a routing network. In both cases, an excessive number of MIN stages is needed.

In this chapter, we concentrate on work related to multicast communication in wormhole-switched networks, covering ATM-based, hardware, and software approaches.

8.1 MIN-based ATM Switches

ATM (Asynchronous Transfer Mode) switches, which were originally designed for broadband ISDNs, have emerged as a promising platform to support high-performance computing, multimedia, and teleconferencing in local area networks. Supporting multicast communication within an ATM switch has been studied by many researchers, and many ATM switch designs are based on MINs [67, 69, 12]. A notable design is by J. Turner [27], which concatenates three MINs in an ATM switch: the copy network replicates the multicast message, the distribution network randomizes the traffic, and the routing network delivers the individual messages to the corresponding destinations, as shown in Figure 8.1. Each network is a 64 x 64 multistage baseline network constructed with 2 x 2 switches.1

1In such a network, switch S_{ij} can reach 2^{i+1} outgoing ports, or 2^i outgoing ports through a single link, where 0 <= i < 6 and 0 <= j < 32.

Figure 8.1: Switch fabric: copy network, distribution network, and routing network (CP: control processor; PP: port processor; BGT: broadcast and group translator).
If the number of reachable nodes via a single outgoing link is larger than or equal to the number of copies required, m, the cell is forwarded to either the upper outgoing link or the lower outgoing link arbitrarily. Otherwise, the cell is replicated and forwarded to both outgoing links. The number of copies required are revised to mu and mg for the upper outgoing link and the lower outgoing link, respectively. The value of rm, and The are obtained based on the following rules: 1. lfc is even, 771,, = [(m +1)/2] and me = [in/2]. 2. If c is odd, mu 2 [in/2] and "lg = [(m + 1)/2]. A small version of a 16 x 16 copy network is shown in Figure 8.2. Considering an 2In ATM, a message is divided into small segments called cells. Each cell consists of 53 bytes. 173 C4 G3 C3 Gz C2 G] C] Go C0 [2 — __ 530 S20 s00 _ ‘l l __, S31 2 2 S11 ,\__ ‘ 3 l l .. _.S32 522 m 1 2 _ 2 I Data B, 5, 37 — ——S33 5 523 513 3 +— _J S34 524 514 504 __ _. S35 525 515 S05 __ __ S36 526 516 506 __ S37 527 S17 507..— Figure 8.2: A 16 X 16 copy network. incoming cell with a header (B, 5, 37) which indicates that it is a broadcast message, the number of copies required 771 is 5, and the broadcast channel identity c is 37. When switch 333 receives this cell, it forwards the cell to only the upper outgoing link since the number of reachable ports via. each outgoing link is 8, which is greater than the number of copies required, 5. Note that the upper outgoing link is arbitrarily chosen by the switch for this cell. For the next broadcast cell, this switch may choose the lower outgoing link. Such an arbitrary scheme is used to randomize the traffic for the routing network. Since switch 321 has fewer reachable ports from either outgoing link, it replicates the cell and forwards them to both the upper and lower outgoing links, while the number of copies required is modified to ‘2 and 3, respectively. The modification of m is based on the second rule specified earlier since 0 is an odd number. Such process 174 is repeated in each switch which receives the cell. The value 772 associated with each forwarding and replicating switch is indicated beside the outgoing link in the figure. Eventually, an exactly number of 772. copies will reach m distinct outgoing ports. After the copy network, the real addresses of destinations are taped to packets by table lookup, which is handled by the BGT (broadcast and group translator). A switch in the distributed network ignores the destination address of an incoming cell and forwards the incoming cell to one of its two outgoing links alternately. In other words, a switch forwards an incoming cell to the upper (lower) outgoing link if it delivers the previous incoming cell to the lower (upper) outgoing link no matter what the destination of the incoming cell is. When an outgoing link or both outgoing links are blocked, the switch sends the cell to the first available outgoing link. The purpose of such a process is to randomize the traffic since most researches have shown that uniform traffic has a better performance than other traffic model. The actual routing is performed in the routing network. In general, such network requires 3n stages and can handle one multicast message at a time for N = '2" nodes. The work by Lee ['28] demonstrates that it is possible to deliver multiple multicasts in a synchronous manner which involves a concentrator, a multistage cube network, and a point-to—point network. 
The concentrator synchronizes the distinct multicast messages and arranges them in consecutive locations. The latter one is achieved by an adder which counts the total number of copies required by all multicast messages in front of each node which initiates a multicast message at this time slot. The routing address of the I” multicast message after the adder becomes (bi, 1).-+771, — 1) where b, is the value from the counter and m,- is the number of copies required by the ith multicast 175 message. Note that when the b,- + m,- — 1 is greater than the number of outgoing ports in the copy network, all multicast message after and including the i“ one are blocked and will be re—initiated in the next time slot. The middle MIN replicates each of the multicast message and forwards them to associated outgoing ports ~— all ports between port b, and port (I),- + m,- —- 1) for the i” multicast message. The last point- to—point network delivers the individual messages to associated destinations. An example is given in Figure 8.3. The source nodes are 4, 7, and 10 and the associated destination sets are {5, 8}, {0, 2, 3, 6}, and {1, 11, 13, 14}, respectively. Since there is no multicast message in front of node 4, the header after the adder is (0, l). The counter is ‘2 for the second multicast message initiated by node 7. Hence, the header of the second multicast message becomes (2, 5'). Similarly, the header for the third multicast message (initiated by node 10) becomes (6, 9). The concentrator puts these three multicast messages into the first three input ports of the copy network. The copy network replicates and forwards the first, the second, and the third messages to output ports 0 to 1, ‘2 to 5, and 6 to 10, respectively. After the copy network, the real destination is tagged to each message. The point-to—point switch then delivers these messages to the actual destinations. This design requires an excessive number of stages, and all multicasts must be synchronized. When the second field of a header for a multicast message is larger than the number of outgoing ports in the copy network, all subsequent multicast messages as well as itself are blocked. Thus, the priority is implicitly given to upper nodes or lower nodes and starvation becomes possible. In order to avoid the implicit priority, Chen et (11. extended Lee’s work by propos— 176 point-to-point concentrator Copy network switch 9: ”E i -- ::. :93 .__ I I a ."\ ‘_—_‘ 5 . _ . .: Ema»: 8:.“ .. I I """ i221, :‘9 <33: «tiara-iii»... ,. -- 7g), 11 i F s: hlylniglp-I 7s 13‘:_ L_. ._J ._, — 621-> 61 60 59 58 57 -—> 56 0,7 <1” 1,7 2,7 <-- 3,7 <--1 4,7 La 57 <--- 6.7 7,7 21:1-1-1121= r V: ' ’ ' ' ' ' 48 .. 0,6 \‘I i 675 - '4 upper __ _:_A_ shortest path v . 32 ‘ 0’4 upperregion g: rrrrrrrrr fr"- 3] '- lower reg1on 0.3 : ’1 V ; 33 . —§ lower y (:1 shortest path 15‘ - 0,1 - 3 A source v . 0% O destination Figure 8.4: An example of a path—based multicast in a ‘2-D mesh. The lower pair of numbers is the absolute address of a. node and the upper number is the relative address of the same node based on a Hamiltonian path. Figure 8.4 gives an example of path—based multicast in a ‘2-D 8 x 8 mesh. The source node is node (3.4), or 35 in relative address, and the destinations are nodes (5,0), (1,1), (3,1), (6.2), (2,3), (5,5), (2,6) and (6,7), or 5, 14, 12, ‘22, ‘29, 47, 50, and 57 in relative addresses, respectively. 
The source node divides the destination set into two disjoint subsets: one subset contains all destination nodes whose relative addresses are smaller than the source node and the other consists of all destination nodes of which relative addresses are larger than the source node. These two subsets are sorted into descending and ascending orders, respectively. As shown in the figure, source node 35 sends one copy of the message followed the upper shortest path to 179 reach nodes {47, 50, 57} and the other copy followed the lower shortest path to reach nodes {‘29, ‘22, 14, 1‘2, 5}, sequentially. Nodes 38 to 41, 46 to 49, and 55 to 56 are skipped by the upper shortest path and nodes 30 to 33, ‘23 to ‘24, 15 to 16, and 6 to 9 are skipped by the lower path since such skips will not cause deadlock. It. was shown in [25] that in the intuitive tree-based multicast it is difficult to avoid deadlock in direct networks when there are multiple multicasts. In this thesis we have shown that the pal/z-buscd nmlticast suffers a potential deadlock in MINS due to self blocking, which is difficult to avoid. The tree—based multicast becomes the natural choice for MINS. 8.2.2 Trip-Based Multicast Tseng and Panda extended the study in path-based multicast and proposed a skirt- bascd trip multicasting in wormhole-switched networks [72]. A trip in any graph is defined to be a path which visits each node of the graph at least once. The Hamiltonian path, which visits each node once and only once, is a special case of a trip. The deadlock is avoided by using virtual channels, high-channel and low—channel, and by imposing certain restrictions. After constructing a spanning tree of a. multicast, let the tree root be at r. A skirt—based trip is defined as the path that starts from r’s leftmost descendant, ends with r’s rightmost descendant, and wraps around the boundary of T by visiting each node one by one. Figure 8.5 gives an example of a skirt—based trip multicast. There are 10 nodes in the network and their connections are shown in part (a). Part (b) 180 gives the corresponding spanning tree with skirt. Hence, the trip is {p4, p1, p5, p1, P0, P2, P61 P99 P65 P2, P0, P3, 79 P311923}- ipg \ l C p9 (a) An example system graph (b) a spanning tree with skirt Figure 8.5: An example of a trip-based multicast. The all-destination encoding/decoding scheme is used in their work implicitly. Like the path-based approach, this work concentrates in direct networks such as hypercube or meshes but not in MINS. A trip-based multicast faces the same difficulty as the path-based multicast in MIN. 8.2.3 A Multidestination Worm Conforming to Base Routing Schemes To co—exist with unicast routing algorithms such as e—cube, planar adaptive and fully adaptive routing schemes, a multicast approach conforming to base routing schemes, named as multidestination worm, was proposed in [73]. A multicast message is divided into several multidestination worms. Each multidestination worm consists of one or more destinations. All destinations in a multidestination worm must be located in a base routing path. Two examples are given in Figure 8.6 where e-cube routing is 181 used. The first one, shown in Figure 8.6(a), is the same as the example given in (7,7) (7,7) QT 1 as “17+ Y 2 8 $5 5—_ l4 . source ¥ 5 4 ‘1’ X 3 O destination (0,0) (0,0) (21) (b) Figure 8.6: Examples of a multidestination worm conforming to base routing schemes. Figure 8.4. 
Since every destination in Figure 8.6(a) is located on a different base routing path, eight multidestination worms are needed to finish the multicast. This is the worst case, in which the scheme degenerates to separate addressing. Part (b) of the figure shows a different example, where the source is node (3,4) and the destinations are nodes (1,5), (1,6), (2,1), (2,3), (5,0), (5,4), (6,2), and (6,7). Five multidestination worms suffice. The first two consist of a single destination each. The third, fourth, and fifth multidestination worms carry the multicast message for destinations {(5,0), (5,4)}, {(2,1), (2,3)}, and {(1,5), (1,6)}, respectively.

8.2.4 Synchronous Receiver-Initiated Multicast

Instead of sender-initiated multicast, Al-Hajery and Batcher proposed a synchronous receiver-initiated multicast [74, 75]. In this approach, all nodes intending to receive messages must inform their associated senders, and nodes without communication requests have to send a dummy message to themselves. Otherwise, the underlying sorting network will fail to deliver messages to the correct destinations. Because of the synchronization requirement, this approach is only suitable for short messages. Beyond that, the design has two major disadvantages. First, the setup time in every time slot may cause longer latency. Second, the sorting network requires more switches than a Banyan network.

8.3 Software Multicast

Optimal software multicast for wormhole-switched meshes and hypercubes was first proposed in [39]. The first optimal software multicast for switch-based networks, namely bidirectional MINs, was given in [51]. The proposed U-min algorithm guarantees contention-free multicast communication for different message sizes and different start-up latencies. The U-min algorithm was implemented on a 64-node IBM SP-1, where its performance is superior to the Chameleon broadcast [76] and MPI-F broadcast [77] operations and is consistently about 30% better than the Chameleon application-level broadcast. The U-min algorithm can also be easily adapted to the Meiko CS-2 [17].

As shown in [51], the first step of the U-min algorithm is to sort the destinations in lexicographic order. It then divides the destination nodes into two halves, with the first half containing one node fewer than the second half. The source node forwards the message to the closest node in the other half. The last two steps are repeated until all destinations have received the message (a code sketch of this halving schedule is given below). An example of source node 2 sending a message to all other nodes in an 8-node MIN is given in Figure 8.7. Source node 2 first sends the message to node 4, then to node 1, and finally to node 3, as shown in the figure. After receiving the message, node 4 forwards a copy to node 6 and then to node 5.

Figure 8.7: An example of a U-min multicast.

The U-min algorithm provides an efficient software-based multicast. However, it suffers from longer queues at the middle nodes when multiple multicasts are allowed simultaneously. Since U-min forwards the message toward the center nodes from all source nodes, those center nodes build up much longer queues than the other nodes. Hence, the performance degrades rapidly as the multicast load increases. The C-min algorithm avoids such centralized traffic by evenly distributing the senders and the receivers. For example, let two multicasts be initiated simultaneously in a binary cube of m nodes.³ Under the U-min algorithm, these two multicasts will block each other in some steps unless one is initiated from the leftmost node and the other from the rightmost node.

³Since a k-ary p-cube may be interpreted as a binary (k × p)-cube, the following property also holds for all k-ary p-cubes.
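The recursive halving at the heart of U-min is easy to express. The sketch below is a generic schedule in that spirit (the names are ours, and the exact "closest node" choice and tie-breaking of [51] may pair the final step differently from the lexicographic convention used here):

```python
def halving_schedule(source, dests):
    """Generic recursive-halving multicast schedule in the spirit of
    U-min.  Each node holding the message splits its remaining sorted
    destination segment, hands the far half to that half's first node,
    and keeps the near half.  Returns one list of (sender, receiver)
    pairs per communication step."""
    work = [(source, sorted(dests))]   # (holder, segment still to cover)
    steps = []
    while any(seg for _, seg in work):
        sends, nxt = [], []
        for holder, seg in work:
            if not seg:
                continue
            mid = len(seg) // 2        # first half gets the smaller part
            near, far = seg[:mid], seg[mid:]
            relay, rest = far[0], far[1:]
            sends.append((holder, relay))
            nxt += [(holder, near), (relay, rest)]
        steps.append(sends)
        work = nxt
    return steps

# Source 2 multicasting to every other node of an 8-node MIN finishes in
# three steps: [[(2, 4)], [(2, 1), (4, 6)], [(2, 0), (1, 3), (4, 5), (6, 7)]].
# Figure 8.7 shows the same step count with a different pairing at the end.
print(halving_schedule(2, [0, 1, 3, 4, 5, 6, 7]))
```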
Under the U-min algorithm, the blocking probability $p_a$ is therefore

$$p_a \;=\; 1 - \frac{1}{C(m,2)} \;=\; \frac{(m-2)(m+1)}{m(m-1)}.$$

Based on the C-min algorithm, there is no destination contention when these two multicasts are initiated by an odd node and an even node. Therefore, the blocking probability $p_c$ is

$$p_c \;=\; 1 - \frac{C(m/2,1) \times C(m/2,1)}{C(m,2)} \;=\; \frac{m-2}{2(m-1)}.$$

Let us use a binary 3-cube ($m = 8$ nodes) as an example. The blocking probabilities are $p_a = 27/28 \approx 0.9643$ and $p_c = 6/14 \approx 0.4286$. Furthermore, let the source nodes be 0 and 5. The steps of both the U-min and the C-min algorithms are given in Figure 8.8. As shown in part (a), node 4, which receives the message from source node 0 in the first step, and source node 5 both send their messages to node 6 in the second step. In the third step, node 6 needs to send both messages to node 7. Such node contention degrades the performance of the U-min algorithm when multiple multicast communications are allowed simultaneously. The corresponding multicast steps of the C-min algorithm are given in Figure 8.8(b); all three steps are contention-free.

Unlike in hypercubes, meshes, or bidirectional MINs, a contention-free multicast is very complicated and is not always possible for an unrestricted source and destination list in a unidirectional MIN, such as the NEC Cenju-3. In this work, we present a minimum-step software multicast for such an environment. It has been proven to be intratask contention-free and to offer better performance than other approaches when the system is not heavily loaded.

Figure 8.8: An example of two multicasts based on the U-min and C-min algorithms. (a) U-min multicast; (b) C-min multicast.

8.4 Summary

In this chapter, we reviewed several related works and presented the differences between our work and theirs. Since the cell size is fixed at 53 bytes and cells are forwarded synchronously in ATM, supporting efficient multicast in an ATM network is easier than in the model considered here.

Several hardware approaches have been proposed for different network topologies. Although they impose various restrictions, none of them serves unidirectional MINs properly. For example, Turner's design allows only one multicast at a time, which is relatively expensive. The nCUBE-2 and NEC Cenju-3 are both forced to disable their hardware support due to the deadlock issue.

For those systems without hardware multicast capability, efficient software-based multicast becomes important to applications. The U-min and U-mesh algorithms provide an efficient solution for turn-around MINs and meshes, respectively, but they fall behind in unidirectional MINs when multiple multicasts are allowed simultaneously. We have proposed both a hardware implementation and a software approach to reduce the latency of multicast communication in unidirectional MINs. Considering practical intertask contention, and allowing multiple multicasts with asynchronous message initiation, distinguish our work from others. The contributions of our work, as well as some possible future research directions, are given in the next chapter.

CHAPTER 9

Conclusions and Future Work

The increasing requirement for multicast communication in various applications necessitates the development of efficient multicast support. This thesis proposes a set of efficient multicast communication techniques, including multi-address header optimization, hardware implementations, and software approaches for unidirectional MINs.
In this chapter, we summarize the salient contributions made by this research and present interesting avenues for possible future research.

9.1 Research Contributions

The run-time overhead of communication in various applications can significantly limit the amount of system parallelism that can be exploited. As more nodes are added, the communication cost of a system increases. Supporting efficient and scalable multicast communication therefore becomes critical to the success of various applications, such as parallel programs, teleconferencing, etc.

Unlike a unicast message, a multicast message needs to carry all destination information. However, such header encoding and decoding is overlooked by most researchers. As shown in this thesis, all-destination encoding is the most intuitive approach and is adopted by most works. A few other works use single region broadcast or single region mask encoding. Such approaches restrict the destinations of a multicast communication to a contiguous region or to a complete cube, respectively. We have proposed and studied several multi-address encoding and decoding schemes to eliminate such restrictions, to allow flexible destination distributions, and to minimize the header overhead.

Although supporting deadlock-free hardware multicast is difficult, we have shown that a hardware multicast implementation has much better performance than the software approach. To avoid deadlock, we have developed a synchronous multi-head worm. We have also shown that it is better than other approaches in several respects. The NEC Cenju-3 uses an asynchronous multi-head worm whose switches are less complicated than those of the synchronous multi-head worm; however, its restriction to single region broadcast is not enough to avoid deadlock. The nCUBE-2 takes an approach similar to the synchronous multi-head worm and requires that the destinations form a subcube. It still fails to prevent deadlock. Both restrictions are caused by the multi-address encoding and decoding schemes. We have eliminated these restrictions by proposing flexible multi-address encoding and decoding schemes.

Starvation is another potential problem in the synchronous multi-head worm. We have studied several priority schemes and shown that one of them can provide starvation-free multicast communication. When multi-address encoding and decoding is considered together with the synchronous multi-head worm, deadlock becomes possible due to the distance between different headers. A pseudo multi-address encoding scheme is proposed to guarantee a deadlock-free network.

Although processor allocation has been studied extensively, none of that research has addressed the corresponding issue when multicast communication is involved. In our work, a network partitioning that enforces traffic localization and eliminates intertask contention has been developed. As expected, a system with a systematic network partition outperforms those without such partitionability. Although delta-class MINs are topologically equivalent, not every one of them has such partitionability. We have shown that both the multistage cube and omega networks have this property and offer better performance. In addition to showing that the baseline and butterfly networks lack this property, we have given minimum modification schemes to reconfigure both networks. An existing system with either network topology can offer the same network partitionability as the cube and omega networks after such modification.
Many MIN variants provide different features. For example, the extra stage MIN provides routing flexibility and fault tolerance. The multiple butterfly network gives multiple paths for any pair of source and destination nodes. The turn-around MIN gives fewer hops for local traffic. Among these variants, we have chosen the extra stage MIN (ESMIN) to extend our work. We have shown that finding all traffic-optimal multicast trees in an ESMIN is an NP-hard problem, but finding one of them is not. We have developed a polynomial-time algorithm to find such a traffic-optimal multicast tree.

Many existing systems support only unicast communication. We have developed an algorithm to implement efficient software-based multicast for those systems built with unidirectional MINs. The proposed C-min algorithm outperforms separate addressing, as expected. However, when the system is heavily loaded, separate addressing has better performance due to its batch-like multicast.

9.2 Directions for Future Research

In this thesis, we have concentrated our effort on wormhole-switched unidirectional MINs. The performance models and experimental results presented in this work establish the foundation for future study but need to be extended in several ways.

The multi-address encoding and decoding schemes can be easily extended and optimized for networks with different switching techniques or network topologies. We have shown such optimization on unidirectional MINs. More research can be done on other network topologies to evaluate the performance and cost tradeoffs among different encoding schemes and to design efficient and deadlock-free multicast routing algorithms.

The synchronous multi-head worm may be extended to different network topologies, especially to bidirectional MINs, dilated MINs, and virtual-channel MINs. Bidirectional MINs have been adopted in the IBM SP1/SP2 and Meiko CS-2. Although optimal software multicast has been studied for bidirectional MINs, a hardware implementation has not been addressed. As shown in this thesis, the performance of a hardware multicast implementation is superior to that of the optimal software approaches in a unidirectional MIN. Hence, a hardware multicast implementation on these networks should provide better performance. How to establish a multi-head worm on a bidirectional MIN is not obvious, since the number of hops between different source and destination pairs may differ. Without careful multi-address encoding and routing, such distance differences may cause deadlock.

The dilated MINs provide extra physical links between adjacent stages, and the virtual-channel MINs offer channel sharing among different communications. Since both are unidirectional MINs, the multicast algorithms proposed in this research may be applied to them directly. However, how to fully utilize both kinds of MINs and how to minimize the communication latency become challenging issues. Furthermore, finding the optimal number of dilated links and the optimal number of virtual channels under different traffic models poses additional challenges.

Finally, a particularly challenging direction is to extend the multi-head worm to ATM switches. The cell size in ATM is fixed at 53 bytes, of which five are header and 48 are data. Although there is no deadlock issue, there are some other potential problems, such as out-of-order cell handling, lost data, etc. There are several challenges. The first one is to establish efficiently the virtual channel and virtual path for a multicast.
The second one is to replicate cells and deliver them to the output ports synchronously. Currently, virtual channel and virtual path establishment is receiver initiated. This approach is natural for some applications, such as teleconferencing, but not for parallel programs. Embedding active messages becomes another challenging issue.

BIBLIOGRAPHY

[1] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," tech. rep., University of Tennessee, Mar. 1994.

[2] H. Xu, E. T. Kalns, P. K. McKinley, and L. M. Ni, "ComPaSS: A communication package for scalable software design," Journal of Parallel and Distributed Computing, vol. 22, pp. 449–461, Sept. 1994.

[3] H. T. Kung, "Gigabit local area networks: A systems perspective," IEEE Communications Magazine, pp. 79–89, Apr. 1992.

[4] J. Hui, "Switching integrated broadband services by Sort-Banyan networks," Proceedings of the IEEE, pp. 145–154, Feb. 1991.

[5] V. Kompella, J. Pasquale, and G. Polyzos, "Multicasting for multimedia applications," in Proceedings of IEEE INFOCOM'92, Mar. 1992.

[6] High Performance Fortran Forum, "HPF-2 Scope of Activities and Motivating Applications," Nov. 1994.

[7] C.-M. Chiang, Q. Du, M. W. Mutka, and R. Sass, "An empirical study of scalable domain decomposition methods for a 2-d parabolic equation solver," in Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, (Norfolk, Virginia), pp. 687–690, SIAM, Mar. 1993.

[8] Northeast Parallel Architectures Center at Syracuse University, "HPF/Fortran-D Benchmarking Suite (release 2.01)," 1992. (Available in the public domain from Syracuse University).

[9] Fore Systems, Inc., ForeRunner ASX-100 ATM Switch Architecture Manual, 1992.

[10] Adaptive Corporation, Network Equipment Technologies, Inc., ATMX Broadband Switch, Release 1.2, 1993.

[11] W. J. Dally and C. L. Seitz, "The torus routing chip," Journal of Distributed Computing, vol. 1, no. 3, pp. 187–196, 1986.

[12] J. S. Turner, "Design of local ATM networks." Tutorial Notes at INFOCOM'92, 1992.

[13] BBN Advanced Computers Inc., Cambridge, Massachusetts, Inside the GP1000, 1989.

[14] BBN Advanced Computers Inc., Cambridge, Massachusetts, Inside the TC2000 Computer, 1990.

[15] W. Gropp, E. Lusk, and S. Pieper, "Users Guide for the ANL IBM SP-1 DRAFT," Tech. Rep. ANL/MCS-TM-00, Argonne National Laboratory, Feb. 1994.

[16] Thinking Machines Corporation, Cambridge, MA, The Connection Machine CM-5 Technical Summary, October 1991.

[17] Meiko Limited, Waltham, MA, Computing Surface: CS-2 Communications Networks, 1993.

[18] N. Koike, "NEC Cenju-3: A microprocessor-based parallel computer," in Proc. of the 8th International Parallel Processing Symposium, pp. 396–401, Apr. 1994.

[19] J. R. Cox, Jr., M. E. Gaddis, and J. S. Turner, "Project Zeus," IEEE Network, vol. 7, pp. 20–30, Mar. 1993.

[20] R. J. Souza, P. G. Krishnakumar, C. M. Ozveren, R. J. Simcoe, B. A. Spinney, R. E. Thomas, and R. J. Walsh, "The GIGAswitch system: A high-performance packet switching platform," Digital Technical Journal, vol. 6, Jan. 1994.

[21] L. M. Ni, Y. Gui, and S. Moore, "Performance evaluation of switch-based wormhole networks," in Proc. of the 1995 International Conference on Parallel Processing, vol. 1, Aug. 1995. (Accepted to appear; also available as Technical Report MSU-CPS-ACS-96, Dept. of Computer Science, Michigan State University, July 1994).

[22] Y. Lan, A. H. Esfahanian, and L. M.
Ni, "Multicast in hypercube multiprocessors," Journal of Parallel and Distributed Computing, pp. 30–41, Jan. 1990.

[23] NCUBE Company, NCUBE 6400 Processor Manual, 1990.

[24] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62–76, Feb. 1993.

[25] X. Lin and L. M. Ni, "Deadlock-free multicast wormhole routing in multicomputer networks," in Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 116–125, May 1991.

[26] Y. Lan, L. M. Ni, and A. H. Esfahanian, "A VLSI router design for hypercube multiprocessors," Integration: The VLSI Journal, vol. 7, pp. 103–125, 1989.

[27] J. S. Turner, "Design of a broadcast packet switching network," IEEE Transactions on Communications, vol. 36, pp. 734–743, June 1988.

[28] T. T. Lee, "Nonblocking copy networks for multicast packet switching," IEEE Journal on Selected Areas in Communications, vol. 6, Dec. 1988.

[29] G. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessing systems," in Proc. of the First International Symposium on Computer Architecture, pp. 21–28, 1973.

[30] J. H. Patel, "Performance of processor-memory interconnections for multiprocessors," IEEE Transactions on Computers, vol. C-30, pp. 771–780, Oct. 1981.

[31] C. L. Wu and T.-Y. Feng, "On a class of multistage interconnection networks," IEEE Transactions on Computers, vol. C-29, pp. 694–702, Aug. 1980.

[32] H. J. Siegel, W. G. Nation, C. P. Kruskal, and L. M. Napolitano, Jr., "Using the multistage cube network topology in parallel supercomputers," Proceedings of the IEEE, vol. 77, pp. 1932–1953, Dec. 1989.

[33] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Transactions on Computers, vol. C-24, pp. 1145–1155, Dec. 1975.

[34] D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh, "Parallel supercomputing today and the Cedar approach," Science, vol. 231, pp. 967–974, Feb. 1986.

[35] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., and S. D. Smith, "PASM: A partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Transactions on Computers, vol. C-30, pp. 934–947, Dec. 1981.

[36] G. F. Pfister et al., "An introduction to the IBM research parallel processor prototype (RP3)," in Experimental Parallel Computing Architectures (J. J. Dongarra, Ed.), pp. 123–140, Elsevier Science Publishers B.V., Amsterdam, 1987.

[37] A. Gottlieb, "An overview of the NYU ultracomputer project," in Experimental Parallel Computing Architectures (J. Dongarra, Ed.), pp. 25–95, North Holland, 1987.

[38] R. D. Rettberg, W. R. Crowther, P. P. Carvey, and R. S. Tomlinson, "The Monarch parallel processor hardware design," IEEE Computer, vol. 23, pp. 18–30, Apr. 1990.

[39] P. K. McKinley, H. Xu, A. H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed networks," in Proceedings of the 1992 International Conference on Parallel Processing, vol. II, pp. 10–19, Aug. 1992.

[40] C.-M. Chiang and L. M. Ni, "Multi-address encoding for multicast," in Proc. of the First International Workshop on Parallel Computer Routing and Communication (PCRCW'94) (K. Bolding and L. Snyder, Eds.), pp. 146–160, Springer-Verlag, May 1994.

[41] C.-M. Chiang and L. M. Ni, "Encoding and decoding of address information in multicast messages," in Proceedings of the 1994 International Computer Symposium, (HsinChu, Taiwan), pp. 1092–1097, Dec. 1994.
[42] Cray Research, Inc., Chippewa Falls, Wisconsin, CRAY T3D System Architecture Overview, 1993.

[43] C.-M. Chiang and L. M. Ni, "Network partitioning and unicast-based multicast on multistage networks," Tech. Rep. MSU-CPS-ACS-102, Dept. of Computer Science, Michigan State University, East Lansing, Michigan, Mar. 1995.

[44] W. Lin and C. L. Wu, "A distributed resource management mechanism for a partitionable multiprocessor system," IEEE Transactions on Computers, vol. 37, pp. 201–210, Feb. 1988.

[45] M. G. Hluchyj and M. J. Karol, "ShuffleNet: An application of generalized perfect shuffles to multihop lightwave networks," Journal of Lightwave Technology, vol. 9, 1991.

[46] C. Y. Chin and K. Hwang, "Packet switching networks for multiprocessors and data flow computers," IEEE Transactions on Computers, vol. C-33, pp. 991–1003, Nov. 1984.

[47] D. K. Panda, S. Singal, and P. Prabhakaran, "Multidestination message passing mechanism conforming to base wormhole routing scheme," in Proc. of the First International Workshop on Parallel Computer Routing and Communication (PCRCW'94) (K. Bolding and L. Snyder, Eds.), pp. 131–145, Springer-Verlag, May 1994.

[48] R. V. Boppana and S. Chalasani, "On multicast wormhole routing in multicomputer networks," in Proc. of the Sixth IEEE Symposium on Parallel and Distributed Processing, pp. 722–729, Oct. 1994.

[49] Y. Lan, "Adaptive fault-tolerant multicast in hypercube multicomputers," Journal of Parallel and Distributed Computing, vol. 23, Oct. 1994.

[50] P. K. McKinley, H. Xu, A. H. Esfahanian, and L. M. Ni, "Unicast-based multicast communication in wormhole-routed networks," IEEE Transactions on Parallel and Distributed Systems, vol. 5, pp. 1252–1265, Dec. 1994.

[51] H. Xu, Y. Gui, and L. M. Ni, "Optimal software multicast in wormhole-routed multistage networks," in Proceedings of Supercomputing '94, pp. 703–712, Nov. 1994.

[52] V. S. Sunderam, "PVM: A framework for parallel distributed computing," Concurrency: Practice and Experience, vol. 2(4), pp. 315–339, Dec. 1990.

[53] B. Rajagopalan, "Reliability and scaling issues in multicast communication," in Proceedings of ACM SIGCOMM, pp. 188–198, Aug. 1992.

[54] H. Garcia-Molina and A. M. Spauster, "Message ordering in a multicast environment," in Proceedings of the 9th International Conference on Distributed Computing Systems, pp. 354–361, June 1989.

[55] H. Eriksson, "MBONE: The multicast backbone," Communications of the ACM, vol. 37, pp. 54–60, Aug. 1994.

[56] D. R. Cheriton and S. E. Deering, "Host groups: A multicast extension to the Internet protocol," Tech. Rep. RFC-966, SRI Network Information Center, Dec. 1985.

[57] S. E. Deering and D. R. Cheriton, "Multicast routing in datagram internetworks and extended LANs," ACM Transactions on Computer Systems, vol. 8, pp. 85–110, May 1990.

[58] Y. Lan, A. H. Esfahanian, and L. M. Ni, "Distributed multi-destination routing in hypercube multiprocessors," in Proceedings of the Third Conference on Hypercube Computers and Concurrent Applications, pp. 631–639, Jan. 1988.

[59] X. Lin, P. McKinley, and L. M. Ni, "Performance evaluation of multicast wormhole routing in 2D-mesh multicomputers," in Proc. of the 1991 International Conference on Parallel Processing, vol. I, pp. 435–442, Aug. 1991.

[60] D. K. Panda, "Fast synchronization in wormhole k-ary n-cube networks with multidestination worms," in Proc. of the First High Performance Computer Architecture Symposium, Jan. 1995. (Accepted to appear).

[61] S. Bhattacharya, G. Elsesser, W.-T. Tsai, and D.-Z.
Du, "Multicasting in generalized multistage interconnection networks," Journal of Parallel and Distributed Computing, vol. 22, July 1994.

[62] J. Wu and Z. Li, "A multidestination routing scheme for hypercube multiprocessors," in Proc. of the 1991 International Conference on Parallel Processing, vol. III, pp. 290–291, Aug. 1991.

[63] X. Lin and L. M. Ni, "Multicast communication in multicomputer networks," IEEE Transactions on Parallel and Distributed Systems, pp. 1105–1117, Oct. 1993.

[64] H. Xu, P. K. McKinley, and L. M. Ni, "Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers," Journal of Parallel and Distributed Computing, vol. 16, pp. 172–184, October 1992.

[65] C. E. Leiserson et al., "The network architecture of the Connection Machine CM-5," in Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, (San Diego, CA), pp. 272–285, Association for Computing Machinery, 1992.

[66] S. C. Liew, "Multicast routing algorithms for 3-stage CLOS ATM switching networks," in Proceedings of the 1991 Globecom, pp. 1619–1625, 1991.

[67] S. C. Liew, "Multicast routing in 3-stage Clos ATM switching networks," IEEE Transactions on Communications, vol. 42, pp. 1380–1390, February/March/April 1994.

[68] S. C. Liew, "Performance of various input-buffered and output-buffered ATM switch design principles under bursty traffic: Simulation study," IEEE Transactions on Communications, vol. 42, pp. 1371–1379, February/March/April 1994.

[69] J. Y. Hui, Switching and Traffic Theory for Integrated Broadband Networks. Norwell, Mass.: Kluwer Academic Pub., 1990.

[70] X. Chen, J. F. Hayes, and M. K. Mehmet-Ali, "Performance comparison of two input access methods for a multicast switch," IEEE Transactions on Communications, pp. 2174–2178, May 1994.

[71] X. Lin, P. K. McKinley, and L. M. Ni, "Deadlock-free multicast wormhole routing in 2-D mesh multicomputers," IEEE Transactions on Parallel and Distributed Systems, pp. 793–804, Aug. 1994.

[72] Y.-C. Tseng and D. K. Panda, "Trip-based multicasting in wormhole-routed networks," Tech. Rep. OSU-CISRC-1/93-TR3, Department of Computer and Information Science, The Ohio State University, 1993.

[73] D. K. Panda and P. Prabhakaran, "Multicasting using multidestination-worms conforming to base routing schemes," Tech. Rep. 37, Department of Computer and Information Science, Ohio State University, Sept. 1993.

[74] M. Z. Al-Hajery and K. E. Batcher, "Multicast bitonic network," in Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, (Dallas, Texas), pp. 320–326, Dec. 1993.

[75] M. Z. Al-Hajery and K. E. Batcher, "Low cost complexity of a general multicast network," in Proceedings of the 1994 International Parallel Processing Symposium, pp. 23–29, Apr. 1994.

[76] W. Gropp and B. Smith, "Users manual for the Chameleon parallel programming tools," Tech. Rep. ANL-93/23, Argonne National Laboratory, June 1993.

[77] H. Franke, "MPI-F: An MPI implementation for IBM SP-1," Feb. 1994. Available by anonymous ftp from info.mcs.anl.gov.