F3?" ‘ x’Fi' lCHlGAN STATE IIIIIIIIIIIIILII I IIIIIIIIIIIIIIIIIIIIIIIIIIIIII 24108 This is to certify that the dissertation entitled DESIGNING MAXIMALLY ADAPTIVE ALGORITHMS FOR WORMHOLE ROUTING: THE TURN MODEL presented by CHRI STOPHER JAMES GLASS has been accepted towards fulfillment of the requirements for PHD degree in COMPUTER SCIENCE gm Major professor Date M MS U i: an Affirmative Action/Equal Opportunity Institution 0-12771 LJE’SR‘ARY M'Chlgan State ntverslty * PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or betoro due due. DATE DUE DATE DUE DATE DUE MSU to An Affirmative Action/Equal Opportunity Institution oMMmS-ot DESIGNING MAXIMALLY ADAPTIVE ALGORITHMS FOR WORMHOLE ROUTING: THE TURN MODEL By Christopher James Glass A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Computer Science 1992 {7% a 5/59 ABSTRACT DESIGNING MAXIMALLY ADAPTIVE ALGORITHMS FOR WORMHOLE ROUTING: THE TURN MODEL By Christopher James Glass One obstacle to using wormhole routing as the switching technique in a direct net- work is designing a good algorithm for routing message packets through the network. A good routing algorithm should provide low communication latency, high network throughput, and ease of implementation in VLSI. Contributing to these objectives are such factors as deadlock freedom, adaptive routing, nonminimal routing, live- lock freedom, and fault tolerance. Previous models for designing routing algorithms have been either ad hoc or based on adding virtual channels to the networks. The algorithms produced by these models prevent deadlock, but the cost is either non- adaptiveness in routing or extra hardware to support virtual channels. To solve the design problem more completely, this dissertation presents a model for systematically designing routing algorithms that are deadlock free and maximally adaptive for a net- work, as well as minimal or nonminimal, livelock free, and fault tolerant. The model is not based on adding virtual channels to a network, but can be applied to networks that have virtual channels. Instead, the model is based on analyzing the directions in which packets can turn in a network and the cycles that the turns can form. This turn model is applied to n-dimensional meshes and k-ary n-cubes in the dissertation, but it can be applied to any direct network that has directions associated with its channels. Simulations of the routing algorithms produced for meshes and hypercubes show that they can perform better than previous routing algorithms. To my Wife, Fran iv ACKNOWLED GEMENTS Many persons have made this degree and dissertation possible. Among them, I especially wish to thank Dr. Carl V. Page for recruiting me and for lengthy discussions of artificial intelligence; Dr. Abdol Esfahanian, Dr. Dorian Feldman, and Dr. Wen-ling Hsu for making my first quarter back in school challenging and exciting; Dr. Lionel M. Ni for the stimulating class “Multiprocessors and Parallel Processing,” which inspired this dissertation, and for being an understanding advisor and shrewd researcher; Dr. Richard Enbody for pointing out practical considerations in the research; and Dr. Philip McKinley for helping develop the simulator. Finally, I wish to thank my wife, Fran, whose encouragement, sacrifices, and added. support have made my education possible. ' TABLE OF CONTENTS LIST OF TABLES LIST OF FIGURES 1 Introduction 1.1 Direct Networks and Multicomputers .................. 1.2 Network Topologies ............................ 1.3 Switching Techniques ........................... 1.4 Routing Algorithms ............................ 1.5 Problem Statement and Objective .................... 1.6 Dissertation Organization ........................ 2 Previous Models 2.1 Dimension Ordering ........................... 2.2 The Virtual Channel Model ....................... 2.3 The Virtual Network Model ....................... 2.4 The Dimension Reversal Model ..................... 2.5 The Wraparound Model ......................... 3 The Turn Model 3.1 Turns in my Networks ........................... 3.2 Turns in Double-my Networks ...................... 3.3 Turns in Double-y Networks ....................... 3.4 Systematic Design of Routing Algorithms ................ 4 The Simulator 4.1 Direct Networks .............................. 4.2 Channel Selection Policies . . . . . . . . . .............. 4.2.1 Input Selection Policies ...................... 4.2.2 Output Selection Policies ..................... 4.3 Traffic Patterns .............................. vi ix mqmtnwwo-t H G 9 10 12 13 15 18 19 21 22 22 26 27 27 27 29 32 4.4 Simulator Software ............................ 4.4.1 Structure ............................. 4.4.2 Inputs ............................... 4.4.3 Outputs .............................. 4.4.4 Running the Simulator ...................... Partially Adaptive Routing 5.1 - Two-Dimensional Meshes ......................... 5.1.1 The West-First Routing Algorithm ............... 5.1.2 The North-Last Routing Algorithm ............... 5.1.3 The Negative-First Routing Algorithm ............. 5.1.4 Degrees of Adaptiveness ..................... 5.2 Three-Dimensional Meshes ........................ 5.2.1 Degrees of Adaptiveness ..................... 5.3 n-Dimensional Meshes .......................... 5.3.1 Degrees of Adaptiveness ..................... 5.4 Ic-ary n-cubes ............................... 5.5 Hypercubes ................................ 5.5.1 Degrees of Adaptiveness ..................... 5.6 Simulation Experiments ......................... 5.6.1 Input Selection Policies ...................... 5.6.2 Output Selection Policies ..................... 5.6.3 Traffic Patterns .......................... 5.6.4 Spreading Traffic More Evenly .................. Fully Adaptive Routing 6.1 Two-Dimensional Meshes ......................... 6.2 Simulation Experiments ......................... 6.3 n-Dimensional Meshes .......................... 6.4 Three-Dimensional Meshes ........................ Fault-Tolerant Routing 7.1 A Model of Fault-Tolerant Routing ................... 7.2 Two-Dimensional Meshes ......................... 7.3 Simulation Experiments ......................... Summary and Conclusions 8.1 Implementation Considerations ..................... 8.2 Future Work ................................ vii 33 34 34 35 36 37 37 39 41 43 43 47 48 51 52 53 55 56 57 58 61 65 72 78 78 87 93 98 100 100 101 104 109 111 1 l4 BIBLIOGRAPHY viii 3.1 5.1 6.1 8.1 LIST OF TABLES Comparison of models for designing routing algorithms ......... 25 Example of a path allowed by the p—cube routing algorithm. ..... 58 The degree of adaptiveness for maximally fully adaptive routing algo- rithms .................................... 98 The directional phases of some routing algorithms ............ 113 ix 1.1 3.1 3.2 3.3 3.4 3.5 4.1 4.2 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 LIST OF FIGURES A4x4mesh ................................ A wormhole deadlock involving four routers and four packets in a two- dimensional mesh .............................. A router in an 33/ network and the turns allowed by some routing algorithms. ................................ The two simple cycles and one of the complex cycles formed by the turns in an :ry network ........ ' ................... A router in a double-2y network and the turns allowed by the double- zy routing algorithm ...... - ...................... A router in a double-y network and the turns allowed by the double-y routing algorithm .............................. Channel utilization in a 10 x 10 mesh for uniform traffic and the my routing algorithm .............................. Turns can increase contention ....................... The possible turns and simple cycles in a two-dimensional mesh. . . . The four turns allowed by the $3; routing algorithm ........... Six turns that complete the cycles and allow deadlock. ........ The west-first routing algorithm for two-dimensional meshes ...... Numbering of the channels leaving each node (3,31) for the west-first routing algorithm .............................. Numbering of a 4 x 4 mesh for the west-first routing algorithm. For each input channel, the output channels allowed by the west-first routing algorithm have lower numbers. ................. The north-last routing algorithm for two-dimensional meshes. The negative-first routing algorithm for two-dimensional meshes. . . . The possible turns and simple cycles in a three-dimensional mesh. . . The turns allowed by three routing algorithms for three-dimensional meshes. ................ i .................. 19 20 20 21 22 31 32 38 38 39 40 41 42 42 45 47 5.12 The channels used during the two phases of the negative-first routing algorithm for 4-ary 2-cubes. ....................... 54 5.13 The minimal p-cube routing algorithm for hypercubes. ........ 56 5.14 The nonminimal p-cube routing algorithm for hypercubes. ...... 57 5.15 Comparison of input selection policies for the xy routing algorithm. . 59 5.16 Comparison of input selection policies for the west-first routing algorithm. 59 5.17 Comparison of input selection policies for the north-last routing algo- rithm. ................................... 60 5.18 Comparison of input selection policies for the negative-first routing algorithm. ................................. 60 5.19 Comparison of input selection policies for the west-first routing algorithm. 62 5.20 Comparison of input selection policies for the north-last routing algo- rithm. ................................... 62 5.21 Comparison of input selection policies for the negative-first routing algorithm. ................................. 63 5.22 Comparison of output selection policies for the west-first routing algo- rithm. ................................... 64 5.23 Comparison of output selection policies for the north-last routing al- gorithm. .................................. 64 5.24 Comparison of output selection policies for the negative-first routing algorithm. ................................. 65 5.25 Comparison of routing algorithms for uniform traffic in a 10 x 10 mesh. 67 5.26 Comparison of routing algorithms for uniform traffic in a 16 x 16 mesh. 67 5.27 Comparison of routing algorithms for uniform traffic in a 10 X 10 x 10 mesh ....................... . .............. 68 5.28 Comparison of routing algorithms for matrix-transpose traffic in a 10 x 10 mesh. . , ................................ 68 5.29 Comparison of routing algorithms for matrix«transpose traffic in a 16 x 16 mesh ................................... 69 5.30 Comparison of routing algorithms for matrix-transpose traffic in a bi- nary 8-cube ................................. 69 5.31 Comparison of routing algorithms for matrix-rotate traffic in a 10 x 10 mesh ..................................... 70 5.32 Comparison of routing algorithms for merge-south traffic in a 10 x 10 mesh ..................................... 70 5.33 Comparison of routing algorithms for address-rotate traffic in a binary 8-cube .................................... 71 xi 5.34 Comparison of routing algorithms for reverse-flip traffic in a binary 8-cube .................................... 5.35 The directions in which packets in the two halves of a two-dimensional 71 mesh are routed by the east-west routing algorithm during its two phases. 75 5.36 Comparison of routing algorithms for uniform traffic in a 10 x 10 mesh. 76 5.37 Comparison of routing algorithms for matrix-transpose traffic in a 10 x 10 mesh ................................... 5.38 Comparison of routing algorithms for matrix-rotate traffic in a 10 x 10 mesh ..................................... 5.39 Comparison of routing algorithms for merge-south traffic in a 10 x 10 mesh ..................................... 6.1 Eight simple cycles formed by the sixteen turns in a double-y network. 6.2 A router in a double-y network and the turns allowed by some routing algorithms. ................................ 6.3 The potential cycles created by allowing O-degree turns when travelling . north. ................................... 6.4 Numbering of the channels leaving each node (2:, y) fer the mad-y rout- ing algorithm ................................ 6.5 For each input channel, the output channels allowed by the mad-y routing algorithm have higher numbers .................. 6.6 A router for a two-dimensional mesh with double eastward and north- ward channels and the possible turns ................... 6.7 Comparison of nonadaptive routing algorithms for uniform traffic in a 16 x 16 mesh with double y channels ................... 6.8 Comparison of routing algorithms for uniform traffic in a 16 x 16 mesh with double y channels ........................... 6.9 Comparison of routing algorithms for center-reflection traffic in a 16 x 16 mesh with double y channels ...................... 6.10 Comparison of routing algorithms for matrix-transpose traffic in a 16 x 16 mesh with double y channels ...................... 6.11 Comparison of routing algorithms for bit-reversal traffic in a 16 x 16 mesh with double 3; channels. ...................... 6.12 Comparison of routing algorithms for matrix-multiply traffic in a 16 x 16 mesh with double y channels ...................... 6.13 The directions of the virtual channels in the two pairs of classes for fully adaptive routing in a two-dimensional mesh. ........... xii 76 77 77 79 80 81 84 85 86 88 89 90 90 91 91 94 6.14 The directions of the virtual channels in the four pairs of classes for 7.1 7.2 7.3 7.4 7.5 7.6 7.7 fully adaptive routing in a three-dimensional mesh. .......... 99 Examples of the paths allowed by the one-fault-tolerant routing algo- rithm for two-dimensional meshes ..................... 102 Numbering of a 4 x 4 mesh for the one-fault-tolerant routing algorithm. 105 Comparison of the minimal and nonminimal west-first routing algo- rithms for uniform traffic. ........................ 106 Comparison of the minimal and nonminimal north-last routing algo- rithms for uniform traffic. ........................ 106 Comparison of the minimal and nonminimal negative-first routing al- gorithms for uniform traffic. ....................... 107 Comparison of different amounts of misrouting in the negative-first routing algorithm for uniform traffic. .................. 107 Comparison of the minimal and nonminimal negative-first routing al- gorithms for matrix-transpose traffic. .................. 108 xiii CHAPTER 1 Introduction Sequential electronic computers have been steadily decreasing in size, power consump- tion, and cost, and steadily increasing in speed and reliability, since their creation a half century ago. These improvements are largely a result of the development of transistors and ever better techniques for miniaturizing and integrating transistors. Miniaturization will, however, soon reach a physical limit. Current techniques can fabricate semiconductors with features smaller than a micron, but features cannot become smaller than atoms. Therefore, further miniaturization is limited to a few more orders of magnitude. Because signal propagation rates are limited to the speed of light, the limit on miniaturization is also a limit on the rate at which sequential electronic computers can process information. For electronic computers to continue improving in performance will require a change from sequential architectures to mas- sively parallel architectures. Massively parallel architectures enable thousands of pro- cessors to divide a task into subtasks and perform the subtasks concurrently, speeding computation and permitting greater fault tolerance and reliability. 1.1 Direct Networks and Multicomputers The first consideration in building a computer with numerous processors, memory banks, input/output devices, and other components is the interconnection of the components. The method that currently scales the best to thousands of components is the direct network [1, 2, 3]. A direct network has channels connecting each node (that is, group of components) directly to only a few other nodes, its neighbors. Neighboring nodes communicate with each other by sending messages across the channels. A node communicates with a node that is not its neighbor by sending a message to one of its neighbors for forwarding. To handle the complexities of routing messages through the direct network, each node often has a router. The router controls both local channels, which connect it to local devices, and network channels, which connect it to neighboring routers. One promising, massively parallel architecture based on the direct network is the multicomputer [4]. Each node of a multicomputer has at least one processor and memory bank. The processors execute separate instruction streams, and all memory is distributed. The lack of centralized resources makes multicomputers highly scalable and a promising architecture for continued improvements in computer performance. 1.2 Network Topologies Which nodes in a direct network are neighbors is defined by the topology of the net— work. The most popular topologies for direct networks are n-dimensional meshes and k-ary n-cubes, particularly low-dimensional meshes and hypercubes. A two- dimensional mesh is used in the Intel Touchstone DELTA [5], the Intel Paragon, and the Symult 2010 [6]; a two-dimensional torus is used in the Fujitsu AP1000 [7]; a three-dimensional mesh is used in the MIT J-machine [8]; and a hypercube is used in the nCUBE—2 [9]. Formally, an n-dimensional mesh has k0 x k1 x - - - x lend nodes, k.- nodes along each dimension i, where k,- 2 2. A 4 x 4 mesh is shown in Figure 1.1. Each node X in an n-dimensional mesh is identified by n coordinates, (2:0, 3:1, . . . , and), where 0 S 1:.- S k; - 1 for each dimension i. Two nodes X and Y are neighbors if and only if 2:.- = y,- for all i except one, j, where y, = 2:; i 1. Thus, nodes have from n to 2n neighbors, depending on their location in the mesh. In a k-ary n-cube [10], all nodes have the same number of neighbors. The definition of a k-ary n-cube differs from that of an n-dimensional mesh in that the 165,8 are all equal to k and in that two nodes X and Y are neighbors if and only if 2:.- = y,- for all i except one, j, where y,- = (2:,- :l: 1) mod Is. The change to modular arithmetic in the definition adds wraparound channels to the k-ary n-cube, giving it symmetry. As a result, each node in a k-ary n-cube has n neighbors if k = 2 and 2n neighbors if k > 2. Another topology with symmetry is a hypercube, which is a special case of both n-dimensional meshes and k-ary n-cubes. A hypercube is an n-dimensional mesh in which k,- = 2 for all i, or a 2—ary n-cube. [3 1-3 ___. 2e 33. nib—M. 1.2 ‘_‘——'., 2.2 3.2 L. .11.. .l , .l_. rm 1,1 , 2,1 3,1 [11 1,0 2,0 ‘— 3,0 Figure 1.1. A 4 x 4 mesh. Meshes and lc-ary n-cubes are popular, in part, because their regular topolo- gies simplify the routing of messages. In general, low-dimensional meshes are pre- ferred because they have low, fixed node degrees, which makes them more scalable than high-dimensional meshes and k-ary n-cubes. Low-dimensional meshes also have fewer channels and higher channel bandwidth per bisection density and have lower contention and blocking latencies, which result in lower communication latencies and higher hot-spot throughputs [10]. In addition, their function better matches their form, two and three topological dimensions being easier to implement in the three physical dimensions. High-dimensional meshes and k-ary n-cubes, on the other hand, have lower diameters, which shortens path lengths. High-dimensional topologies also have more paths between pairs of nodes, which permits more fault tolerance. Finally, the symmetry of lc-ary n-cubes can make it easier to spread message traffic more evenly. 1.3 Switching Techniques In the latest generation of multicomputers, the most popular technique for switching messages along network channels is wormhole routing [11]. It is used in the Intel Touchstone DELTA [5], the Intel Paragon, the Symult 2010 [6], the Fujitsu AP1000 [7], the MIT J-machine [8], and the nCUBE-2 [9]. Wormhole routing switches a message along the channels of a direct network by first dividing the message into packets and the packets into flow control digits or flits. It then routes the fiits in each packet through the network in a pipeline fashion. Header flits contain all of the routing information for a packet and lead each packet through the network. When the header flits reach a router that has no suitable output channel available, all of the flits in the packet wait where they are for a suitable channel to become available. One of the attractions of wormhole routing is that each router requires only enough buffer space to store a few flits for each channel. The store-and-forward and vir- tual cut-through switching techniques [12, 13] require enough buffer space to store an entire packet for each channel. Another attraction of wormhole routing is that its communication latencies are low. In the absence of contention, the latencies for store-and-forward are proportional to the product of the packet length and the dis- tance to travel [14, 10]. The latencies for wormhole routing, virtual cut-through, and circuit switching, on the other hand, are proportional to the sum of the packet length and the distance to travel. Wormhole routing also has some advantages over circuit switching. Channel reservation and release are an integral part of wormhole routing, but are separate phases in circuit switching. Also, wormhole routing permits routers to replicate flits and output them over multiple channels, making multicast and broadcast communications possible [15, 9]. 1 .4 Routing Algorithms The sequence of channels a message packet traverses on its way from its source node to its destination node is determined by the routing algorithm that the routers of the direct network execute. A good routing algorithm should have three features: low communication latency, high network throughput, and ease of implementation in VLSI. In other words, a routing algorithm should provide packet transmission that is high in quality, quantity, and practicality. Factors contributing to ease of implemen- tation are little hardware for channels, buffers, switches, and control logic. Factors contributing to low latency and high throughput are freedom from deadlock, freedom from indefinite postponement, freedom from livelock, spreading packet traffic evenly, routing packets adaptively, routing packets along short paths, and fault tolerance. Many of these terms require some definition (partly because they have been used in different ways by different authors). Deadlock occurs when a packet waits for an event that cannot happen. For example, a packet may wait for a network resource to be released by a packet that is, in turn, waiting for the first packet to release some re- source. In wormhole routing, such a circular wait condition causes deadlock, because a packet holds resources while waiting and excludes other packets from acquiring the held resources. Indefinite postponement is similar to deadlock, but occurs when a packet waits for an event that can happen but never does. For example, a packet may wait forever to acquire a network resource for which other packets are always competing successfully. Indefinite postponement is only possible when the allocation of network resources is fair. Both deadlock and indefinite postponement can stop a packet from being injected into a network or from moving once it is in the network. In contrast, livelock does not stop the movement of a packet, but rather stops its progress toward the destination. Livelock occurs when the routing of a packet never leads it to its destination. Livelock is possible only when routing is adaptive (not along a predetermined path) and nonminimal (away from, or equidistant to, the des- tination at times). Minimal routing, on the other hand, restricts packets to shortest paths. Although minimal routing may initially sound more promising, nonminimal routing provides better fault tolerance. Of the factors contributing to good routing algorithms, the most important is freedom from deadlock. Deadlock can keep many or all packets from reaching their destinations and occurs readily unless a routing algorithm contains preventive mea- sures. Indefinite postponement and livelock are less likely to occur and are generally easier to prevent. Adaptiveness is also an important factor, because it contributes to several of the other factors [16]. It provides alternative paths for packets that encounter unfair channel allocation, faulty hardware, or hot spots in traffic patterns. The adaptiveness of a routing algorithm falls into one of three categories, depending on the number of nodes the algorithm can choose from when forwarding packets, or the number of shortest paths the algorithm can route packets along. A nonadaptive algorithm has a choice of only one node for each packet to be forwarded. Thus, a nonadaptive algorithm routes a packet from its source node to its destination node along a path that is predetermined. A partially adaptive algorithm [17] can choose from multiple nodes for some packets, but cannot choose from every node leading toward the destination node for every packet. Thus, a partially adaptive algorithm cannot route packets along every shortest path in a network topology. A fully adap- tiue algorithm can choose from every node leading toward the destination node for every packet. Thus, a fully adaptive algorithm can route packets along every shortest path in a network topology. Closely related to the categories of nonadaptive, partially adaptive, and fully adaptive routing algorithms is the category of maximally adaptive routing algorithms. A maximally adaptive algorithm can choose from as many chan- nels leading toward the destination node as possible while preventing deadlock. The focus is on channels, instead of nodes. Depending on the number and arrangement of the channels in a direct network, the maximally adaptive algorithms for the network may be nonadaptive, partially adaptive, or fully adaptive. 1.5 Problem Statement and Objective One obstacle to using wormhole routing as the switching technique in a direct network is designing a good algorithm for routing message packets through the network. As described in the preceding section, a good algorithm provides low communication latency, high network throughput, and ease of implementation in VLSI. Further, low latency and high throughput are largely the result of deadlock prevention and adaptiveness in the algorithm. Accordingly, the objective of this dissertation is to develop a model for systematically designing routing algorithms that are deadlock free and maximally adaptive for typical direct networks. 1.6 Dissertation Organization This dissertation examines the design of wormhole routing algorithms for direct net- works. Chapter 2 reviews previous models for designing routing algorithms and some of the resulting algorithms. Chapter 3 presents a new model, one based on analyzing the directions in which packets can turn in a network and the cycles the turns can form. Chapter 4 describes the simulator used to study the routing algorithms pro- duced by the different models. Chapters 5 and 6 apply the turn model to networks without and with virtual channels. Chapter 7 applies the turn model to the problem of fault-tolerant routing. Finally, Chapter 8 concludes the dissertation and proposes future work. CHAPTER 2 Previous Models Previous models for designing wormhole routing algorithms for direct networks are either ad hoc or based on adding extra channels to the networks. The algorithms produced by these models prevent deadlock, but the cost is either nonadaptiveness in routing or extra hardware for channels. ' 2.1 Dimension Ordering The primary ad hoc model is dimension ordering, in which the dimensions of a direct network are ordered and message packets are routed along the dimensions in that order. Applied to a two-dimensional mesh with the minimum number of channels (an my network), this model produces the my routing algorithm [6, 5]: route a packet first along the m dimension (dimension 0) and then along the y dimension (dimen- sion 1). This routing algorithm is the most popular for my networks. Applied to hypercubes, dimension ordering produces the e-cube routing algorithm [18]: route a packet first along the lowest dimension and then along higher and higher dimensions, until the destination node is reached. This routing algorithm is the most popular for hypercubes. Dimension ordering ensures that the my and e-cube algorithms prevent deadlock, 10 by denying the circular wait condition. A packet that has just reserved a channel along one dimension will try to reserve another channel along only the same dimension or a higher dimension. Thus, there is no way for a cycle of waiting packets to be formed in which a packet holding a channel is waiting for a higher-dimension channel and a packet holding a channel along that higher dimension is waiting for a lower-dimension channel. Also, packets waiting to reserve channels along the same dimension cannot form a cycle: a packet holding a channel in one direction along the dimension will try to reserve another channel in only the same direction (because routing is minimal), and the channels in a single direction do not form a cycle (because meshes do not have wraparound channels, as do k-ary n-cubes). Although dimension ordering prevents deadlock, it has some major drawbacks. First, it is applicable only to network topologies without wraparound channels. More critically, it ensures nonadaptiveness: for each pair of source and destination nodes, it routes all packets along the same path. Finally, dimension ordering does not allow nonminimal routing, which is important for fault tolerance and increased adaptive- 11888. 2.2 The Virtual Channel Model Subsequent models for designing routing algorithms have attempted to overcome the drawbacks associated with dimension ordering. All of these models are based on the addition of virtual channels [19, 20] to direct networks. Adding special buffers and control logic allows a physical channel that would normally be used by only one packet at a time to be shared by multiple packets. Thus, one physical channel becomes multiple virtual channels. The virtual channels can be used both to prevent deadlock and to provide an arbitrary degree of adaptiveness. Dally and Seitz [19] propose a model to address the problem of topologies with 11 wraparound channels. Their virtual channel model is based on breaking all of the cycles in the channel dependency graph for a direct network and routing algorithm, which they prove prevents deadlock. The channel dependency graph is a directed graph with a vertex for each channel in the network and an arc from one vertex to another if the algorithm can route packets directly from that channel to the other. The steps of the model are as follows. 1. Find a nonadaptive routing algorithm for the direct network, without regard for whether it is deadlock free or not. 2. Construct the channel dependency graph for the network and algorithm. 3. Where there are cycles in the graph, divide the physical channels into virtual channels, and modify the algorithm to route a packet along any of the virtual channels. 4. Construct the channel dependency graph for the new network and algorithm. 5. Eliminate enough arcs in the graph to break all cycles, and modify the algorithm to prohibit the corresponding transitions from one channel to another. To demonstrate the versatility of their model, Dally and Seitz apply it to three net- work topologies: k-ary n-cubes, cube-connected cycles, and shuffle-exchange net- works. Although the virtual channel model can be applied to any topology, including those with wraparound channels, it still has several major drawbacks. First, some of the steps are too indefinite. How does one find the initial routing algorithm in Step 1? Into how many virtual channels does one divide the physical channels in Step 3? How does one choose arcs to eliminate in Step 5, especially when the channel dependency graph constructed in Step 4 is very complex? The result is that the model must be applied iteratively, using trial and error. Another drawback is that the model tends 12 to add more virtual channels to a network than necessary, increasing its cost. Finally, and most critically, the model produces routing algorithms that are nonadaptive and minimal, just as dimension ordering does. 2.3 The Virtual Network Model The first to propose a model for designing adaptive routing algorithms are Yantchev and Jesshope [21, 22]. Their virtual network model produces a fully adaptive algo- rithm for any network topology. The steps of the model are the following. 1. Classify the pairs of source and destination nodes in the topology according to the directions of the channels in the shortest paths connecting them. In an extreme case, there is a separate class for each pair of source and destination nodes. 2. For each class, create an independent, cycle-free virtual network to route its messages minimally and with full adaptiveness. The virtual network contains a virtual channel for each channel of the topology that is in any of the shortest paths of the class. 3. If partially adaptive routing is acceptable, eliminate some virtual networks and allow messages to transfer from some virtual networks to some others, as long as doing so permits full connectivity and does not introduce cycles. 4. Overlay the virtual networks onto the physical network, dividing physical chan— nels into virtual channels where more than one virtual channel is mapped onto the same physical channel. Applied to two-dimensional meshes, the virtual network model classifies source- destination pairs according to whether they route messages northeast, southeast, 13 southwest, or northwest and creates a separate virtual network to route each class with full adaptiveness. For example, for the northeast class, the model creates a north or east virtual channel for each north or east physical channel. Overlaying the four virtual networks results in two virtual channels for each physical channel, and a two— dimensional mesh called a double-my network. The corresponding double-my routing algorithm routes a northwest packet along the first set of north and west channels, a northeast packet along the second set of north and east channels, a southeast packet along the first set of south and east channels, and a southwest packet along the second set of south and west channels. If it is desirable to reduce the number of virtual channels per physical channel to one (that is, leave the physical channels unmodified), the southeast and northwest virtual networks can be eliminated. The resulting 2-plane routing algorithm routes packets southwest as far as necessary and then northeast as far as necessary. The draWback is that it is only partially adaptive. It routes packets southeast and northwest nonadaptively. The virtual network model has the advantages both of being applicable to any network topology and of producing adaptive routing algorithms. In addition, the steps are more definite than those in the virtual channel model. Like the previous models, however, the virtual network model produces algorithms only for minimal routing, which limits its usefulness for fault tolerance. The virtual network model also tends to add more virtual channels than necessary to achieve full or partial adaptiveness. Finally, after adding the virtual channels, the model produces a. routing algorithm that is not maximally adaptive. 2.4 The Dimension Reversal Model To design routing algorithms that are both adaptive and nonminimal, Dally and Aoki [16] propose a model based on dimension reversals. Each message packet keeps 14 track of the number of times, p, it reverses dimensions, that is, is routed from a higher dimension to a lower dimension. The model divides each physical channel into r + 1 virtual channels, numbered 0 to r, where r is large enough to give the desired degree of adaptiveness. Based on this division, the model allows either of two routing algorithms. The static routing algorithm routes packets that have made p dimension reversals only along virtual channels numbered p. If p < r, routing is along any such virtual channel; and if p = r, routing is according to dimension ordering. Thus, a packet is routed adaptively until it makes r dimension reversals, and r is the maximum number of dimension reversals a packet can make. The dynamic routing algorithm does not limit the number of dimension reversals. It routes a packet along any virtual channels with numbers less than r until it reaches a router where all suitable output channels are being used by packets with equal or lower values of p. It then routes the packet along virtual channels numbered r, using dimension ordering. Because these two algorithms allow adaptive, nonminimal routing, livelock is a potential problem. To prevent livelock, Dally and Aoki propose limiting either the number of times each packet is misrouted (routed away from the destination or equidistant to it) or the ratio of misroutings to routings toward the destination. The dynamic routing algorithm has the unique ability to route a packet along almost any of the virtual channels leaving a router. This ability significantly increases adaptiveness; and in simulations conducted by Dally and Aoki, the dynamic algorithm outperforms the static one. The simulations also show, however, that the dynamic routing algorithm requires throttling to limit the number of new message packets entering the direct network when traffic is heavy. Although the dimension reversal model provides a solution to the problem of de- signing nonminimal routing algorithms, it has several drawbacks. The dimension reversal model depends on dimension ordering and, like dimension ordering, is appli- cable only to meshes and other topologies that do not contain wraparound channels. 15 Also, the static and dynamic algorithms route packets along the virtual channels numbered r nonadaptively. This nonadaptiveness severely reduces the fault tolerance of the direct network. Next, the algorithms depend on the packets transmitting extra information, p, in the header flits. This extra information increases the buffer space required in each router and increases communication latency. Finally, like previous models, the dimension reversal model depends on adding virtual channels and does not produce routing algorithms that are as adaptive as possible for the number of virtual channels added. 2.5 The Wraparound Model Linder and Harden [23] propose a model similar to the dimension reversal model of Dally and Aoki, except that it is based on keeping track of the number times a packet travels over wraparound channels and that it needs no subnetwork of virtual channels to nonadaptively route packets as a back-up when contention is high. The steps of the model are as follows. 1. For each combination of directions in dimensions 1 to n — 1 of an n-dimensional network, create an independent virtual network. 2. If the topology has wraparound channels, give each virtual network at least n + 1 levels. Otherwise, give each virtual network one level. 3. For each level in each virtual network, create virtual channels in the directions associated with that virtual network and create virtual channels in both direc- tions of dimension 0. Where there are wraparound channels for a level, connect them from that level to the next lowest level. 4. Overlay the virtual networks onto the physical network. 16 The routing algorithm that goes with the resulting physical network injects a message packet into the top level of any virtual network that has a path to the destination. It then routes the packet along any of the many paths to the destination. These paths include all shortest paths and many longer ones. Each time a packet traverses a wraparound channel, it drops down a level. At high levels, almost any output channel leads to the destination, but at low levels, the routing algorithm must be more careful that the packet has at least as many levels below it as wraparound channels it must still traverse. The level the packet is on can be determined from the virtual channel it just traversed, and the minimum number of wraparound channels required to reach the destination can be determined from the addresses of the current and destination nodes. Therefore, the number of wraparound channels traversed by the packet does not have to be transmitted in the header flits, which is an improvement over the dimension reversal model. As in the virtual network model, the independent virtual networks prevent deadlock among packets travelling in different directions. In addition, the multiple levels in each virtual network prevent deadlock among packets traversing wraparound channels. Linder and Harden apply their wraparound model to n-dimensional meshes and k-ary n-cubes. Applied to two-dimensional meshes, it produces a minimal and fully adaptive routing algorithm that requires two virtual channels per physical channel along only the y dimension. Such a two—dimensional mesh is called a double-y network. The corresponding algorithm is called the double-y routing algorithm. It routes an eastbound packet along the east physical channels and one set of y virtual channels and routes a westbound packet along the west physical channels and the other set of y virtual channels. Dally [24] proposes the same network and algorithm. The wraparound model has advantages beyond those of the dimension reversal model and only some of the drawbacks. Like the dimension reversal model, it produces fully adaptive and nonminimal routing algorithms. In addition, it is applicable to 17 topologies with wraparound channels and is more fault tolerant, especially when the virtual networks contain more levels than the minimum required by Step 2. The drawbacks the wraparound model shares with the dimension reversal model are that it often depends on adding many virtual channels and does not produce routing algorithms that are as adaptive as possible for the number of virtual channels added. CHAPTER 3 The Turn Model Deadlock in wormhole routing is caused by message packets waiting on each other in a cycle. One way that deadlock can occur in a two-dimensional mesh is shown in Figure 3.1. Four packets travelling in different directions try to turn left and wind up in a circular wait. If only one of the packets had not turned, deadlock would have been avoided. This possibility suggests that a routing algorithm might prevent deadlock in a direct network by prohibiting certain turns. The routing algorithm would have to prohibit at least one turn in each of the many possible cycles in the network, however. At the same time, the algorithm would have to leave a path between every pair of nodes. Finally, the algorithm should not prohibit more turns than necessary; otherwise, its adaptiveness would be reduced. The deadlock freedom and adaptiveness of: most of the routing algorithms de— scribed in the preceding chapter can be explained in terms of the turns prohibited and allowed. The next few sections present explanations for several of the routing al- gorithms applicable to the different two-dimensional meshes defined in the preceding chapter. 18 19 \. D :11: buffer input selection circuit pocket 4 pocket 2 ~ packet progre;;ion .......... ....... - [:1 packet awaiting resource Figure 3.1. A wormhole deadlock involving four routers and four packets in a two— dimensional mesh. 3.1 Turns in my Networks Figure 3.2(a) shows a router in an my network, and Figure 3.2(b) shows the turns the my routing algorithm allows (solid lines) and prohibits (dashed lines) at each router. The possible turns from one direction to another form two simple cycles and several complex cycles, as illustrated in Figure 3.3. The my algorithm is deadlock free because it prohibits at least one turn in each of the simple cycles and each of the complex cycles. The algorithm is nonadaptive because the turns it allows cannot be combined into sequences; the heads of the allowed turns do not match the tails of any of the other allowed turns. The 2-plane routing algorithm is also applicable to my networks, allowing the turns shown in Figure 3.2(c). It, too, is deadlock free because it prohibits at least one turn in each of the cycles. It, however, allows more turns than the my algorithm; and some of the turns can be combined into sequences, permitting partial adaptiveness in routing. 20 (north) ”I {'7 F‘"? 4,, _t can: .... (west) -X +X (east) (1)) xy algorithm (nonadaptive) _Y I'M? F‘ as (south) I—‘J L._.I (c) 2— lane al orithm (a) a router in an xy network (pg-flan, Edaptive) Figure 3.2. A router in an my network and the turns allowed by some routing algo- rithms. ""7 ._l I“? V—II L._II_._IL.__I (a) (b) (c) Figure 3.3. The two simple cycles and one of the complex cycles formed by the turns in an my network. 21 3.2 Turns in Double-my Networks Figure 3.4(a) shows a router in a double-my network, and Figure 3.4(b) shows the turns the double-my routing algorithm allows and prohibits at each router. Which of the two virtual channels in each direction appears in a turn is important in determining cycles and adaptiveness. The possible turns from one set of virtual channels to another form eight simple cycles. The double-my algorithm is deadlock free because it prohibits at least one turn in each of the simple cycles, as well as each of the complex cycles. The algorithm is fully adaptive because the turns it allows can be combined into a variety of sequences in each of the four directions packets need to be routed: southwest, southeast, northwest, and northeast. _ ‘53-» 3.....‘41 ..q. '1' I... <4 +Y1 +Y2 " '1 ‘fl'e It it f v . =i= -X1 +‘ ""’+x1 e T <-I-I ‘—|- a 1".» 8,32 41* -II-> mm»; gun-<4. .X2 C-II- ”+X2 # I; $ * H It I s s . I -Y1 -Y2 "‘°“'*' 't'r't" arms-g. ;-Il- 1.0 ‘ 6.0 4.2 4.0 2.2 2.0 also I 70 8,0 9 ’ u**—’-—__. 1.2. 2.2 3.2 I. , so , 3.0 w 1.0 6.1 4.1 4.1 2.1 2.1 0.110.: 7L0 0 9,0 [M‘ 14—4—32: 3.! _ 5,0 ‘ 3,0 } 1,0 .1 so 6.2 4.0 4.2 2.0 2.2 “Jo: ._z.o__‘,._u.2_~ ,._2.o__ ”Tl. so ,2. 1.0 3.0 Figure 5.6. Numbering of a 4 x 4 mesh for the west-first routing algorithm. 2m-2-2x,n-2oy 2m-2-2x,y 2m-1-2x,0 :;:’_:ZLn-3-2x,0 +1m-3-2xfi 2m-2-2x,y-12m-2-2x,y-1 (a) input from west (h) input from north 2m-2-2x,n-2-y 2m-2-2x,n-2-y 2m-2+x,0 33'1”» a 2m-3-2x,0 e—1 w ‘_. L]— 2m-3-2x,0 2m-2-2x,y-l 2m-2-2x,n-l-y (c) input from east (d) input from south Figure 5.7. For each input channel, the output channels allowed by the west-first routing algorithm have lower numbers. 43 suggests the north-last routing algorithm: route a packet first adaptively west, south, and east, and then north. Examples of the paths allowed by the north-last algorithm are shown in Figure 5.8(b). Theorem 5.2 The north-last routing algorithm is deadlock free. Proof: The proof is similar to that for Theorem 5.1. Rotate Figures 5.5 and 5.6 counterclockwise 90 degrees, and reverse the directions of the channels. The fig- ures now show that the north—last algorithm routes every packet along channels with strictly increasing numbers. [3 5.1.3 The Negative-First Routing Algorithm Figure 5.9(a) shows a third way to prohibit two turns in a two-dimensional mesh. The prohibited turns are the two from a positive direction to a negative direction. Therefore, to travel in a negative direction, a packet must start out in a negative direction. This requirement suggests the negative-first routing algorithm: route a packet first adaptively west and south, and then adaptively east and north. As de- scribed in Chapter 2, Yantchev and Jesshope propose a similar algorithm [22], but only for minimal routing. The negative-first algorithm can be either minimal or non- minimal, the nonminimal version being more adaptive and fault tolerant. Examples of the paths allowed by the negative-first algorithm are shown in Figure 5.9(b). Theorem 5.3 The negative-first routing algorithm is deadlock free. The proof is a special case of the one given for n-dimensional meshes in Theorem 5.5. 5.1.4 Degrees of Adaptiveness How adaptive are these partially adaptive routing algorithms? Let Saharan," be the number of shortest paths an algorithm allows from source node (3,, s”) to destination 44 (b) Examples of the paths allowed in an 8 x 8 mesh. Figure 5.8. The north-last routing algorithm for two-dimensional meshes. ‘IC. L._I r"... t_ I_-. (a) The six turns allowed. IUIIIIII (b) Examples of the paths allowed in an 8 x 8 mesh. Figure 5.9. The negative-first routing algorithm for two-dimensional meshes. 46 node (d,, d”). Also, let f be a fully adaptive algorithm, p be one of the three partially adaptive algorithms, Am = Id, — 3,], and Ay = Id, — 3,]. Then, S = (Am + Ay)! ’ .mmw [Am-+482)! - AmlAy! ‘f d3 2 31' Sweet—first = 1 otherwise (Am+Ay]! - AmlAy! 1f du S 3:: Snorth—last = 1 otherwise %fi%ifl¢gnmm4$%) Snegatiue-first = I 01‘ (d3 2 83 and dy 2 8y) [ 1 otherwise For minimal routing algorithms, the larger Satgwgthm is, the more adaptive the algorithm is. For the three partially adaptive routing algorithms, however, S,p = 1 for at least half of the source-destination pairs. Another measure of the degree to which a minimal routing algorithm is adaptive is the ratio Sugars“... / S f. Sp/S; ranges from 0 to l, but averaged across all source-destination pairs, Sp/Sf > 1/2, which indicates a significant degree of adaptiveness. Finally, note that S, = 1 for a source-destination pair does not imply that algorithm p is nonadaptive for that pair. The partially adaptive algorithms all permit nonminimal routing. In the case of the negative— first algorithm, nonminimal routing means that routing can be adaptive even when d“r _<_ 3, and (1,, Z 3,, or when d, 2 s, and d, S 3,. For an illustration of this added 47 adaptiveness, see the bottom path in Figure 5.9(b). 5.2 Three-Dimensional Meshes For three-dimensional meshes, the terminology used in the definition of n-dimensional meshes can be simplified. Dimensions 0, 1, and 2 become m, y, and z. The direc- tions —m, +m, —y, +y, —z, and +2 become west, east, south, north, down, and up respectively. There are 24 possible 90-degree turns in a three—dimensional mesh, four for each of the six directions. These turns form six simple cycles, two in each of the three planes, as shown in Figure 5.10. At least one turn must be prohibited in each cycle in order to prevent deadlock. Of the 4096 different ways to prohibit these six turns, 176 prevent deadlock and nine are unique if symmetries are taken into account. (2) U I" L K V Figure 5.10. The possible turns and simple cycles in a three-dimensional mesh. Each of the nine combinations suggests a partially adaptive routing algorithm. The first three combinations suggest algorithms that route a packet first west and 48 then adaptively south, down, east, north, and up with two restrictions on the turns. The second three combinations suggest algorithms that route a packet first adaptively west, south, down, east, and north with two restrictions and then up. The last three combinations, Figure 5.11, suggest simpler algorithms. The seventh combination suggests the west-south-first routing algorithm: route a packet first adaptively west and south and then adaptively down, east, north, and up. The eighth combination suggests the north-up-last routing algorithm: route a packet first adaptively west, south, down, and east and then adaptively north and up. The ninth combination suggests the negative-first routing algorithm: route a packet first adaptively west, south, and down and then adaptively east, north, and up. All nine of the algorithms are deadlock free. The individual proofs are omitted here. 5.2.1 Degrees of Adaptiveness The algorithms for three-dimensional meshes are more likely to be able to route any particular packet adaptively than are the algorithms for two—dimensional meshes. Let Smash... be the number of shortest paths from source node (3,, 3,, 3,) to destination node (d,, d”, d,). Also, let f be a fully adaptive algorithm, p be one of the last three partially adaptive algorithms, Am = Id, — s,I, Ay = Id, — syI, and A2 = Id, — 3,]. Then, S __ (Am + Ag + A2)! I - AmlAylAz! M if (d, 2 s, and d, 2 3:) AmlAylAz! W if(d.2s.and 4.5.9.) Sweat-south- first = I ‘W if(d,gs,and age.) I LL-Liifz’ if (d. s s: and <12 2 3,) 49 [.7 [3K 045 257 (a) the west-south-f'n'st routing algorithm .--— "1"" I“ fit 1:.- L] D V [gt/.5 (b) the north-up-last routing algorithm I:._l LJKJKJZJAJ‘ Figure 5.11. The turns allowed by three routing algorithms for three-dimensional meshes. 50 1. W 1f(d,$s, and d,$s,) W if(d.3s.and d. 2s.) M if (d, 2 .9, and d, 2 s.) AmlAz! Snorth-up-last = W “((13- Z 3:.- and d: S 3:) V L—J—Leggymf! if (d, g s, and d, g s, and d. g s,) 01' (dz-23:: and (#28, and d,2s,) W if(d=Ss.andd._<_s,andd.>s,) or (d, 2 s, and dy Z s, and d, < s.) Snegative— first = I Egg-)- if(d,Ss, and d,>s, and (1.33,) or (d, 2 s, and d, <3" and d, 2 3,) %# if(d,>s,andd,,Ssv andd,Ss,) or (d, <3, and d, _>_sv and d, _>_ 3,) S, = l for few source-destination pairs, especially when the lengths of the dimen- sions are large. Therefore, minimal routing is adaptive for most source-destination pairs. Averaged across all source-destination pairs, Sp/S; > 1/4. This average is low compared to the average of over 1/2 for two-dimensional meshes. As in the case of two-dimensional meshes, nonminimal routing can increase the degree of adap- tiveness. Overall, a three-dimensional mesh with the same number of nodes as a two-dimensional mesh permits a lower average degree of adaptiveness relative to fully adaptive routing, but distributes the adaptiveness more uniformly across the source- destination pairs. 51 5.3 n-Dimensional Meshes In an n-dimensional mesh, each node has up to 2n input channels and 2n output channels. For each of the 2n directions, there are 2n — 2 possible 90-degree turns to a different direction, making 4n(n — 1) total turns. These turns form two simple cycles in each of the n(n — 1)/2 planes, making n(n — 1) total cycles of four turns. Theorem 5.4 The minimum number of turns that must be prohibited to prevent deadlock in an n-dimensional mesh is n(n - l), or a quarter of the possible turns. Proof: The 4n(n - 1) turns in an n-dimensional mesh form n(n — 1) cycles of four turns each. At least one of the four turns in each cycle must be prohibited in order to prevent these cycles. Therefore, prohibiting a quarter of the turns is necessary to prevent deadlock. El Of the many routing algorithms formed by prohibiting one turn per cycle, three are noteworthy for their simplicity. They are analogs of the west-first, north—last, and negative-first routing algorithms for two-dimensional meshes and the west-south-first, north-up-last, and negative-first routing algorithms for three-dimensional meshes. The analog of the west-first and west-south-first routing algorithms is the all-but- one-negative-first routing algorithm: route a packet first adaptively in the negative directions of all but one dimension (n — 1) and then adaptively in the other di- rections. The analog of the north-last and north-up-last routing algorithms is the all-but-one-positive-last routing algorithm: route a packet first adaptively in the neg- ative directions and the positive direction of one dimension (0) and then adaptively in the other directions. The analog of the negative-first routing algorithms is also called the negative-first routing algorithm: route a packet first adaptively in the negative directions and then adaptively in the positive directions. All of these algorithms are deadlock free, but the only proof presented here is for the negative-first algorithm. 52 Theorem 5.5 The negative-first routing algorithm for n-dimensional meshes is dead- lock free. Proof: Let K be the sum of the k; for an n-dimensional mesh, and let X be the sum of the m,- for any node (mo,m1, . . . ,m,_2,m,,_1). Number each channel leaving a node in a positive direction K — n + X and each channel leaving a node in a negative direction K - n — X. Then, if a packet enters a node when travelling in a negative direction, it enters along a channel numbered K — n - X — 1, which is less than both K - n — X and K — n + X, the numbers of the channels leaving the node in the negative and positive directions. If a packet enters a node when travelling in a positive direction, it enters along a channel numbered K — n + X — l, which is less than K — n + X, the number of the channels leaving the node in the positive directions. Therefore, the negative-first algorithm routes every packet along channels with strictly increasing numbers and is deadlock free. 0 The negative-first algorithm is deadlock free as a result of prohibiting just one turn per cycle, the turn from a positive direction to a negative direction. Therefore, pro- hibiting some quarter of the turns is sufficient to prevent deadlock in an n-dimensional mesh. Combining this observation with Theorem 5.4, gives the following theorem, which supports the claim that the turn model produces maximally adaptive routing algorithms. Theorem 5.6 Prohibiting some quarter of the 4n(n — 1) turns in an n-dimensional mesh is necessary and sufiicient to prevent deadlock. 5.3.1 Degrees of Adaptiveness As the number of dimensions increases, the minimal, partially adaptive algorithms are more likely to be able to route message packets adaptively. S, = 1 less often, especially 53 when the k,- are large. But averaged across all source-destination pairs, Sp/Sf > 1/2""1 , indicating that the degree of adaptiveness relative to fully adaptive algorithms decreases as n increases. Again, adaptiveness can be increased by nonminimal routing. 5.4 k-ary n-cubes The partially adaptive routing algorithms for meshes can be extended to use the wraparound channels of k-ary n—cubes. One way to extend them is to allow a packet to be routed along a wraparound channel only on its first hop. To prove that the routing algorithm is still deadlock free, number the mesh channels as before and assign the wraparound channels a number that is either greater than or less than those of the mesh channels, depending on whether packets are routed along channels in strictly decreasing or increasing order. The negative-first routing algorithm can also be extended to k-ary n-cubes in another way: classify each wraparound channel as negative or positive according to how it routes packets relative to the mesh channels and then apply the negative-first algorithm. For example, a node at the east edge of the mesh channels has two channels to the west: a mesh channel to the node immediately to its west and a wraparound channel to a node at. the west edge of the mesh channels. Either channel to the west can be used during the negative phase of routing a packet. The negative and positive phases in a 4-ary 2-cube are illustrated in Figure 5.12. The proof of deadlock freedom is, again, a simple modification of the proof for meshes. All of the preceding routing algorithms are strictly nonminimal. For k-ary n- cubes with k > 4, it is impossible to design deadlock-free routing algorithms that are minimal, without adding extra channels. To see that it is impossible, imagine a k-ary n-cube and an arbitrary k-ary l-cube embedded in it. The nodes in the k-ary 1-cube all lie along one dimension. Next, imagine that each of the nodes sends a message to (a) negative phase (b) positive phase Figure 5.12. The channels used during the two phases of the negative-first routing algorithm for 4-ary 2-cubes. the node that is two hops away in the positive direction of the dimension. If routing is minimal and k > 4, each of these messages is routed two hops in the positive direction. These messages will deadlock if each reserves its first channel before any of them reserve the second channel. The deadlock is a result of a cycle that does not involve turns. To design a routing algorithm that is deadlock free, minimal, and fully adaptive, Linder and Harden [23] add virtual channels to a k-ary n-cube. They add, however, enough virtual channels to partition the k-ary n-cube into 2““1 virtual networks with n + 1 levels per virtual network and (n + 1)k“ channels per level. That is an average of (n + 1)22n-2 11 virtual channels per physical channel. Dividing the physical channels in a k-ary n-cube into fewer virtual channels, how ever, is sufficient for designing routing algorithms that are deadlock free, minimal, and partially adaptive. A suitable design method is inspired by the one Dally and Seitz use for unidirectional k-ary n-cubes [19]. Dally and Seitz divide each physical channel into two virtual channels and connect the virtual channels in each k-ary l- 55 cube embedded in the k-ary n-cube into a spiral that circles the k-ary n-cube nearly twice. Unfolding the spirals converts the unidirectional k-ary n-cube into a unidirec- tional, n-dimensional mesh with 2k — 1 nodes along each dimension. Each node of the k-ary n-cube appears in the mesh up to 2" times. Dimension-order routing in the mesh prevents deadlock. Routing algorithms that are deadlock free, minimal, and partially adaptive for a bidirectional k-ary n-cube can be designed by unfolding the k-ary n-cube into a bidirectional, n—dimensional mesh with 2k — 1 nodes along each dimension. The unfolding requires an average of 2k—1 "" 2(——. I virtual channels per physical channel. Applying the turn model to the mesh produces routing algorithms that are deadlock free, minimal, and partially adaptive. The routing algorithms to use in the k-ary n-cube are the ones that correspond to routing a packet to the nearest copy of its destination node in the mesh. 5.5 Hypercubes Hypercubes are a special case of both n-dimensional meshes and k-ary n-cubes. Con- sequently, the partially adaptive routing algorithms for hypercubes are special cases of the algorithms for n-dimensional meshes and k-ary n-cubes. Proofs that the routing algorithms are deadlock free are corollaries of the proofs for the more general cases. The special case of the negative-first algorithm has a particularly compact expres- sion, the p-cube routing algorithm. Let S be the binary address of the source node for a packet, C be the binary address of the node the header flits currently occupy, and D be the binary address of the destination node. The p-cube routing algorithm has two phases. In the case of minimal routing, the first phase routes the packet along 56 any dimension, i, for which c,- = 1 and d,- = 0. When there is no such dimension, the second phase routes the packet along any dimension, i, for which c.- = 0 and d,- = 1. These steps are easily computed using bitwise logic operations, as shown in Figure 5.13. If nonminimal routing is desired, because of its increased adaptiveness and fault tolerance, the first phase can also route the packet along any dimension, i, for which a,- = l and d.- = 1. Then, the steps can be computed as shown in Fig- ure 5.14. In both of these algorithms, the only input transmitted in the header flits is D. C is a unique constant for each router, and p depends on which input buffer the header flits occupy in the router. Konstantinidou proposes an algorithm similar to p-cube [28], but only for minimal routing. Algorithm: Minimal p-cube routing algorithm. Input: Current address, C, and destination address, D. Procedure: ‘ I 1. If C = D, route the packet to the local processor and exit. 2R=CAD. 3. IfR= 0,thenR= CAD. 4. Route the packet along any available channel, i, for which r,- = 1. Figure 5.13. The minimal p-cube routing algorithm for hypercubes. 5.5.1 Degrees of Adaptiveness Let IXI represent the number of 1’s in the binary number X. The number of shortest paths from S to D, S,-,.,b,, equals hllhol, where h1=I(S A D)| and ho=|(S A D)I. For a fully adaptive routing algorithm, f, S; = h!, where h = h + ho = |(S ® D)|, the Hamming distance between S and D. The other measure of the degree of adaptiveness 57 Algorithm: Nonminimal p-cube routing algorithm. Input: Current address, C; destination address, D; and whether the last hop was in a positive direction, p. Procedure: 1. If C = D, route the packet to the local processor and exit. 2. pr= 1, thenR= CAD. 3. ElseifC AD=0,thenR=C V (CAD). 4. Else R = C. 5. Route the packet along any available channel i, for which r.- = l. Figure 5.14. The nonminimal p-cube routing algorithm for hypercubes. is Sp_cub¢/Sf, which equals 1/(,:). Overall, the p-cube routing algorithm offers a choice of many shortest paths, especially when compared to the nonadaptive e-cube routing algorithm. The following table illustrates this adaptiveness, for a binary 10-cube in which the node 1011010100 sends a packet to the node 0010111001. In this example, h = 6, ho = 3, h = 3, and 5,4,5, = 36. The table shows only one of the 36 possible shortest paths. For each node transmitting the packet, however, the table shows the number of choices of output channel allowed by the minimal p-cube routing algorithm. The number of choices in parentheses indicates the additional choices available with nonminimal routing. 5.6 Simulation Experiments Numerous simulations have been conducted to study the partially adaptive routing algorithms produced by the turn model and the nonadaptive routing algorithms pro- duced by dimension ordering for different direct networks, different input and output 58 Table 5.1. Example of a path allowed by the p—cube routing algorithm. I address I choices I dimension taken I comment I 1011010100 3(+2) 2 source 1011010000 2(+2) 9 phase 1 0011010000 1(+2) 6 phase 1 0010010000 3 5 phase 2 0010110000 2 0 phase 2 0010110001 1 3 phase 2 0010111001 destination selection policies, and different traffic patterns. 5.6.1 Input Selection Policies The first simulations compare the random, no—turn, local FCF S, global FCFS, and distance-travelled input selection policies. Each policy is simulated in a 10 x 10 mesh for the my, west-first, north-last, and negative-first routing algorithms. In all cases, the output selection policy is zigzag, the routing is minimal, and the message traffic is uniform. The results for each routing algorithm are fairly consistent (Figures 5.15, 5.16, 5.17, and 5.18). At low throughputs, the input selection policies perform the same. At high throughputs, the distance-travelled policy produces the lowest latencies, the global FCFS policy produces the second lowest latencies, and the random, no-turn, and local FCFS policies produce the highest latencies. Compared to the random, no- turn, and local FCFS policies, the distance-travelled policy increases the maximum sustainable throughput by 15%. The most surprising of these results is that the local FCFS and no—turn policies do not perform much better than the random policy. The random, no—turn, and local FCFS policies are the three that do not require additional routing information. 59 40 l I I l 35 '- random 9- -< no-turn +- 30 ' local FCFS 5- - global FCFS -><— 25 I- distance-travelled A— - Average latency 20 .. (pace) 15 q 10 . 5 ' 4 0 ' l 1 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.15. Comparison of input selection policies for the my routing algorithm. 40 I l l T 35 '- random 9- . no-turn -I— 30 *- local FCFS 5- q global FCFS -)(—- 25 distance-travelled A— .. Average latency 20 -I (usec) 15 . 10 -I 5 I 4 0 1 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.16. Comparison of input selection policies for the west-first routing algo- rithm. 60 40 I I T I I 35 '- random 9- - no-turn +— 30 _ local FCFS 5— d global FCFS ->(— 25 distance-travelled A— J Average 3 latency 20 .. ( 86¢) It 15 - 10 ... 5 ' .1 0 . r 0 5 10 I5 20 25 30 Percent of maximum theoretical throughput Figure 5.17. Comparison of input selection policies for the north-last routing algo- rithm. 40 I 1 I I 35 F random 0- - no-turn -l— 30 *- local FCFS '0'- - global FCFS -x— 25 distance-travelled A— -I Average latency 20 .. (wee) 15 .t 10 q 5 .. 0 l L l l 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.18. Comparison of input selection policies for the negative-first routing algorithm. 61 That they perform the worst may not be a coincidence and suggests that improving performance by changing input selection policies requires changing to a policy that [is based on additional information carried in the header flits. The other main result is that the distance-travelled policy outperforms the FCFS policies at high throughputs. This result suggests that reducing contention is more important than maintaining fairness. In the simulations of the distance-travelled policy, indefinite postponement, though theoretically possible, is not a problem. The next simulations compare the distance-travelled, least-adaptive, and distance- least input selection policies. These three policies differ only in whether they empha- size distance or adaptiveness and in how they decide ties. Each policy is simulated in a 10 x 10 mesh for the west-first, north-last, and negative—first routing algorithms. Again, the output selection policy is zigzag, the routing is minimal, and the message traffic is uniform. The results for each routing algorithm are very consistent (Figures 5.19, 5.20, and 5.21). At all throughputs, the input selection policies have nearly the same laten- cies. These results suggest that augmenting a successful policy, such as the distance- travelled policy, to give priority to the least adaptive packet is of marginal value, at least for two-dimensional meshes. For higher-dimensional meshes, the adaptiveness of packets has a wider range of values, and the added discriminating power of the least-adaptive and distance-least policies may make them more useful. 5.6.2 Output Selection Policies The next simulations compare the zigzag, my, and no-turn output selection policies. Each policy is simulated in a 10 x 10 mesh for the west-first, north-last, and negative- first routing algorithms. In all cases, the input selection policy is distancetravelled, the routing is minimal, and the message traffic is uniform. Figures 5.22, 5.23, and 5.24 show the results. For each of the routing algorithms, 62 40 I l l I T 35 I- distance-travelled A— - least-adaptive +- 30 " distance-least -0— - Average '- latency 20 .. (wee) l5 - 10 - 5 'I 0 . l l l l l 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.19. Comparison of input selection policies for the west-first routing algo- rithm. 40 T r I l l 35 ' distancetravelled A— - least-adaptive +— 30 I' distance-least -0— - Average2.5 _ latency 20 q ([1880) 15 - 10 - 5 4 0 l l J l l 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.20. Comparison of input selection policies for the north-last routing algo- rithm. 63 40 I M I I I 35 P distance-travelled A— d least-adaptive +— 30 " distance-least '0— d 25 - Average latency 20 .- (pace) 15 - 10 d 5 .- 0 7 1 1 1 1 1 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.21. Comparison of input selection policies for the negative—first routing algorithm. the no-turn policy has the lowest latencies and the zigzag policy has the highest latencies. Depending on the routing algorithm, the my policy is comparable to either the no-turn or zigzag policy. Relative to the zigzag policy, the no—turn policy increases the maximum sustainable throughput by an average of 5%. The relative performance of the three output selection policies appears to be due to two factors: spreading message traffic evenly and avoiding turns. The zigzag policy tends to concentrate uniform traffic in the center of the mesh, increasing contention and latencies. Compensating for this tendency somewhat is the adaptiveness of the routing algorithms. The my policy attempts to spread the uniform traffic more evenly. When contention is high, however, the my policy, like the other policies, chooses the first available channel, ignoring which channel is preferred. This behavior tends to route more packets toward the center of the mesh. The better performance of the my policy relative to the zigzag policy in general suggests that spreading traffic 64 40 t l l r' I 35 - zigzag O— .. my D— 30 ' no-turn A— .. 25 Average latency 20 (#880) 15 10 J l l l 7 m 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.22. Comparison of output selection policies for the west-first routing algo- rithm. ' 40 I l l 1 I 35 - zigzag O— , - my Q— ‘ 30 ' no-turn -A- .. 25 Average latency 20 (am) 15 10 l o 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.23. Comparison of output selection policies for the north-last routing algo- rithm. 65 40 T I I l I 35 - zigzag 0— cut my D— 30 '- no-turn A— q 25 Average latency 20 (#896) 15 10 I l l l J 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 5.24. Comparison of output selection policies for the negative-first routing algorithm. evenly is more important than maintaining maximum adaptiveness. The superior performance of the no—turn policy suggests that turns are indeed costly and that avoiding unnecessary turns is even more important than spreading traffic evenly. 5.6.3 Traffic Patterns The next simulations compare the partially adaptive routing algorithms produced by the turn model with the nonadaptive routing algorithms produced by dimension ordering for different direct networks and different patterns of message traffic. The four networks studied are a 10 x 10 mesh, a 16 x 16 mesh, a 10 x 10 x 10 mesh, and a binary 8-cube. The six traffic patterns studied are uniform, matrix—transpose, matrix-rotate, merge-south, reverse-flip, and address-rotate. In Figures 5.25, 5.27, 5.28, 5.31, and 5.32, the input selection policy is distance-travelled and the output selection policy is no-turn. In Figures 5.26, 5.29, 5.30, 5.33, and 5.34, the input 66 selection policy is local FCFS and the output selection policy is my. In all cases, routing is minimal. For each traffic pattern, a different routing algorithm has the lowest latencies at the high throughputs. For uniform traffic, it is the my algorithm (Figures 5.25, 5.26, and 5.27); for matrix-transpose traffic, the negative-first algorithm (Figures 5.28, 5.29, and 5.30); for matrix-rotate, merge-south, and address-rotate traffic, the north- last or all-but-one—negative-first (ABONF) algorithm (Figures 5.31, 5.32, and 5.33); and for reverse-flip traffic, the west-first or all-but-one-positive-last (ABOPL) algo- rithm (Figure 5.34). In general, the nonadaptive routing algorithms perform better for uniform traffic, and the partially adaptive routing algorithms perform better for nonuniform traffic. For matrix-transpose traffic in the meshes and hypercube, the maximum sustainable throughput of the partially adaptive algorithms is twice that of the nonadaptive algorithms. In the meshes, this throughput is a third higher than the second highest sustainable throughput, which occurs for the my algorithm and uniform traffic. The latter improvement in throughput is not due to shorter path lengths for matrix-transpose traffic than for uniform traffic: the average path length for matrix-transpose traffic is 11.34 hops and for uniform traffic is 10.61 hops. For reverse—flip traffic in the hypercube, the maximum sustainable throughput of the par- tially adaptive algorithms is four times that of the nonadaptive e-cube algorithm and one half higher than the next highest sustainable throughput, which occurs for the e-cube algorithm and uniform traffic. Again, average path length is longer for reverse-flip traffic (4.27 hops) than for uniform traffic (4.01 hops). The differences in the performances of the algorithms, for individual traffic pat- terns and across traffic patterns, are largely due to differences in the types of in- formation about traffic used in the algorithms. The my and e-cube algorithms are nonadaptive, but happen to embody global, long-term information about the uni- form traffic pattern. By routing packets first along one dimension and then the other, 67 40 I I I I I I 35 r- 533! -)<-— q west-first O— 30 '- north-last 5— .. negativefirst A— Average - latency 20 - (pace) 15 - 10 4 5 -‘ - 0 l l l 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.25. Comparison of routing algorithms for uniform traffic in a 10 x 10 mesh. 40 I I I I I I I 35 . any -x- .. west-first O— 30 - north—last 5— - negative-first Average25 ( - latency 20 - (#sec) 15 - 10 - 5 u 0 , 1 l l 1 l l 1 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.26. Comparison of routing algorithms for uniform traffic in a 16 x 16 mesh. 68 40 I w I I I I I 35 . I- 33/2 -x-— 4 west-south-first 0— 30 - north-up-last «D- - negative~first A— Average ‘ latency 20 - (#886) 15 - 10 e 5 " d 0 l l I l 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.27. Comparison of routing algorithms for uniform traffic in a 10 x 10 x 10 mesh. ‘ 40 I I I I I I I 35 - my -x— l d west-first 0- 30 " north-last 9- .. negative-first A- 25 Average latency 20 (wee) 15 10 , 4 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.28. Comparison of routing algorithms for matrix-transpose traffic in a 10 x 10 mesh. 69 40 I I I I I I I 35 - $.11 *- - west-first 0— 30 '- northolast 5— .. negativefirst A- 25 Average latency 20 (mec) 15 10 0 . I I I I I L I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.29. Comparison of routing algorithms for matrix-transpose traffic in a 16 x 16 mesh. ‘ 40 I I I I I I I I 35 *' e-cube -)(- _ ABONF @- 30 - ABOPL 9— - p-cube A- 25 - Average latency 20 . (pace) 15 ‘ 10 I I I I I I 0 2 4 6 8 10 12 14 16 18 20 Percent of maximum theoretical throughput Figure 5.30. Comparison of routing algorithms for matrix-transpose traffic in a binary 8-cube. 70 40 I I - I I I I I 35 - 3y -x— J west-first @- 30 '- north-last 5— - negative-first A— Average latency 20 - (#866) 15 d 10 a 5 .. 0 I I I I I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.31. Comparison of routing algorithms for matrix-rotate traffic in a 10 x 10 mesh. ' 40 I I f I I I I 35 - fry -><— - west-first 0— 30 ' north-last 9- m negative-first A— 25 Average latency 20 (usec) 15 10 0 I I I I I I L o 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.32. Comparison of routing algorithms for merge-south traffic in a 10 x 10 mesh. 40 35 30 Average latency 20 (wee) 15 10 71 I ~ e-cube -)(— ABONF O— - ABOPL 5— p-cube «A— I 12 Percent of maximum theoretical throughput I 16 18 20 Figure 5.33. Comparison of routing algorithms for address-rotate traffic in a binary 8-cube. 40 35 30 25 Average latency 20 (We) 15 10 I - e-cube 9(— ABONF 9- '- ABOPL '9— p-cube - I 12 Percent of maximum theoretical throughput 16 18 20 Figure 5.34. Comparison of routing algorithms for reverse-flip traffic in a binary 8-cube. 72 they spread uniform traffic as evenly as possible across the channels of a mesh or hypercube in the long term. The adaptive algorithms, on the other hand, select channels based on local, short-term information. These selections tend to benefit just the routed packet, and only for the immediate future, and tend to interfere with other packets. The selections often produce zigzag paths, which disturb the global, long- term evenness of uniform traffic. The result is increased contention and decreased performance. Despite the superior performance of the nonadaptive routing algorithms for uni- form traffic, the partially adaptive algorithms probably provide better performance in real systems. Uniform traffic has been used in many previous simulation studies, but is not generated by many applications. A traffic pattern is determined by the application and how its processes are mapped to the nodes of the direct network. For most applications, each process communicates with some processes much more than others. Nonuniform traffic presents a problem for the my and e-cube algorithms because they are nonadaptive. Just as they maintain the evenness of uniform traffic, they blindly maintain the unevenness of nonuniform traffic. The result, as the figures illustrate, is often poor performance. 5.6.4 Spreading Traffic More Evenly The overall conclusion from the simulations in the preceding subsection is that the performance of a routing algorithm is largely determined by how well it spreads traffic across the channels of a direct network. The present subsection discusses how to modify partially adaptive routing algorithms so that they spread traffic more evenly. Konstantinidou [28] describes two separate modifications of the p-cube routing algorithm that make it spread traffic more evenly across a hypercube. The first modification is to route some packets first in positive directions and then in negative directions and to route the other packets first in negative directions and then in 73 positive directions. To prevent deadlock, the number of channels in the positive directions is doubled. The positive-first packets and negative-first packets use separate channels in the positive directions but share the channels in the negative directions. The traffic of the positive-first packets is heavier for channels nearer node n — 1, and the traffic of the negative-first packets is heavier for channels nearer node 0. In total, the traffic of the two types of packets is fairly even. The second modification is to route all packets first in positive directions, then in negative directions, and then again in positive directions. Deadlock is again prevented by doubling the number of channels in the positive directions. The total traffic of the three phases is fairly even across the channels of the hypercube. The modifications by Konstantinidou can be generalized to any partially adaptive routing algorithm and any direct network. The first general technique for modifying a routing algorithm to spread traffic more evenly in a direct network is to create multiple virtual networks and use a different version of the routing algorithm (or an entirely different routing algorithm) in each virtual network. For example, a north- last routing algorithm might be used in one virtual network, and a south-last routing algorithm might be used in a second virtual network. Which routing algorithm and virtual network are used to route a particular message can be determined randomly (as Konstantinidou suggests) or according to which routing algorithm is best suited for the particular source and destination nodes. When the virtual networks are overlaid onto the physical network, it may be possible for virtual networks to share some virtual channels and still prevent deadlock (as Konstantinidou demonstrates for the p-cube routing algorithm). The overall objective is that the channel utilizations in the virtual networks complement each other such that the channel utilizations in the physical network are as even as possible. The second general technique for modifying a partially adaptive routing algorithm to spread traffic more evenly is again to create multiple virtual networks with multiple 74 routing algorithms, but to connect the virtual networks in sequence. All messages start in the same virtual network and repeatedly advance to the next virtual network in the sequence. When the last phase of the routing algorithm in one virtual network is similar to the first phase of the routing algorithm in the next virtual network, the two virtual networks may share virtual channels. Again, the overall objective is that the channel utilizations in the virtual networks complement each other such that the channel utilizations in the physical network are as even as possible. The drawback to both of these general techniques is that they require that virtual channels be added to a direct network. Adding virtual channels increases the cost of a direct network. It also creates the possibility that a routing algorithm designed from the start for a direct network with that many virtual channels might make better use of the channels and perform better. A third general technique for modifying a partially adaptive routing algorithm to spread traffic more evenly does not have this drawback of adding virtual channels to a direct network. The technique is to use different versions of the routing algorithm (or entirely different routing algorithms) in different parts of the direct network. The objective is to use the routing algorithm that spreads traffic the most evenly in each part of the direct network. For example, this technique can be applied to two- dimensional meshes and the west-first routing algorithm. The result is the east-west routing algorithm: the east half of the mesh uses a west-first routing algorithm to route packets, and the west half uses an east-first routing algorithm. Thus, as shown in Figure 5.35(a), packets destined for the other half of the mesh are routed there nonadaptively (that is, east or west without going north or south). Packets in the same half as their destination are routed adaptively primarily if they are heading away from the center of the mesh, as shown in Figure 5.35(b). This combination of nonadaptive routing toward the center of the mesh and adaptive routing away from the center helps keep adaptiveness from inadvertently congesting the center of the 75 mesh. (a) (b) Figure 5.35. The directions in which packets in the two halves of a two-dimensional mesh are routed by the east-west routing algorithm during its two phases. To find out how the east-west routing algorithm performs, it is simulated in a 10 x 10 mesh for a variety of traffic patterns. In all cases, the input selection pol- icy is distance-travelled, the output selection policy is no-turn, and the routing is minimal. The results are shown in Figures 5.36, 5.37, 5.38, and 5.39. Overall, the east-west routing algorithm performs better than the other partially adaptive routing algorithms. For uniform traffic, the nonadaptive my routing algorithm still performs the best, but the east-west algorithm is a close second. 76 40 I I I I I I 35 - my 9(— ., west-first 9- 30 - north-last 5- a negative-first A- east-west -O— - Average latency 20 - (#866) 15 q 10 . 5 ‘ q 0 , I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.36. Comparison of routing algorithms for uniform traffic in a 10 x 10 mesh. 40 I I I I I 35 - my -)(— - west-first 0- 30 - north-last D— .. negative-first A— east-west -0— - Average latency 20 (psec) 15 10 5 . 0 , l l 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.37. Comparison of routing algorithms for matrix-transpose traffic in a 10 x 10 mesh. ‘ 77 40 I I I I I I I 35 - ity -x— a west-first 0— 30 - north-last 5- 4 negative-first A— 25 east-west -0— .l Average latency 20 -+ (flux) 15 " 10 " 5 d 0 7 I I I I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.38. Comparison of routing algorithms for matrix-rotate traffic in a 10 x 10 mesh. ' 40 I I ’ I I I I I 35 - 3y *- . - west-first @- 30 ' north-last 9— , - negative-first A— Average i - latency 20 - (flux) 15 - 10 a 5 ‘ u 0 I I I I o 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 5.39. Comparison of routing algorithms for merge-south traffic in a 10 x 10 mesh. CHAPTER 6 Fully Adaptive Routing The turn model produces fully adaptive routing algorithms when applied to direct networks with enough virtual channels. Unlike other models, the turn model helps determine the minimum configuration of virtual channels required for fully adap- tive routing and produces routing algorithms that are maximally adaptive. This chapter first applies the turn model to the problem of fully adaptive routing in two- dimensional meshes, finding the minimum configuration of virtual channels required and producing a maximally adaptive routing algorithm. Simulations show that this algorithm performs better than the fully adaptive and nonadaptive algorithms pro- duced by other models for the same network. Next, the chapter describes how to easily design maximally fully adaptive routing algorithms for n-dimensional meshes in gen- eral, adding the minimum number of virtual channels. Finally, the chapter applies the simplified model to the design of a maximally fully adaptive routing algorithm for three-dimensional meshes. 6.1 Two-Dimensional Meshes Fully adaptive routing is possible in two-dimensional meshes if there are two pairs of channels in one of the two dimensions. Chapter 2 assumed that this dimension was y 78 79 and named the routing algorithm the double-y algorithm. In light of the turn model, this algorithm raises two questions. First, is the double-y algorithm the most adaptive routing algorithm for two-dimensional meshes with double y channels? Second, is fully adaptive routing possible in two-dimensional meshes with fewer channels? This section uses the turn model to answer these questions. The first question can be answered by applying the turn model to two-dimensional meshes with double y channels. In such double-y networks, there are six virtual directions: one west, one east, two north, and two south, as shown in Figure 6.2(a). The six directions permit sixteen 90-degree turns: two turns to the north and two to the south when travelling in each of the directions west and east and one turn to the west and one to the east when travelling in each of the two north and two south directions. These turns conveniently form eight cycles of four turns each, as shown in Figure 6.1. Cycles in which the head of one turn and the tail of the turn to which it is matched are in the same topological direction but different virtual directions are not considered now, because the turn model delays considering 0«degree (and ISO-degree) turns until the last step. Figure 6.1. Eight simple cycles formed by the sixteen turns in a double-y network. The next step in applying the turn model is to prohibit enough of the sixteen turns to break all of the cycles in Figure 6.1, as well as all of the cycles involving more than 80 four turns. A simple way to start is with the four cycles in the left half of Figure 6.1, which contain all sixteen turns. There are 4“ ways to prohibit one turn in each of these four cycles. Of these 256 ways, eight break all cycles and allow fully adaptive routing. The eight ways are symmetric to each other, however, and yield equivalent routing algorithms. One of the eight ways to prohibit four turns is shown in Figure 6.2(c). It is easy to verify that the four prohibited turns also break the four cycles in the right half of Figure 6.1. Note‘that the double-y algorithm (Figure 6.2(b)) prohibits l the same four turns, plus four additional turns, thereby reducing its adaptiveness. .2 +Yl +Y2 1H H H -Y1 -Y2 (a) a router in a double-y network 5"“ {—1 ____‘__. 8/16 F . . 1..-... L _4 (b) double-y algorithm (fully adaptive) l" "'1’ i— “f 15::1 #:fi 12/16 =Z= L -33: L _4 (c) mad-y algorithm (maximally fully adaptive) Figure 6.2. A router in a double-y network and the turns allowed by some routing algorithms. Because there are no wraparound channels in a mesh, the next step is to incor- 81 porate as many 0-degree turns as possible without reintroducing cycles. The four possible O-degree turns are from the first set of northward channels to the second and vice versa and from the first set of southward channels to the second and vice versa. Each of these four O-degree turns must be tested to see whether its inclusion creates any cycles. For example, if turns from the first set of northward channels to the second were allowed, they would create four potential cycles of four turns, as shown in Figure 6.3(a). At least one turn in each of these cycles is already prohibited, so these O-degree turns can be allowed safely. If turns from the second set of northward channels to the first were allowed, they would create four potential cycles of four turns, as shown in Figure 6.3(b). No turns are prohibited in two of these cycles, so these O-degree turns must remain prohibited. Continuing to test O-degree turns reveals that it is safe to allow O-degree turns only from the first sets of northward and southward channels to their respective second sets. Note that, though these turns do not cause deadlock, routing a packet along them when the packet still has to be routed farther west would prevent the packet from reaching its destination, because turns to the west from the second sets of channels are prohibited. 1"” I"? {“1 L"; I'ZJ L714 i—«é 7 i— L...-.€ LA LA (a) If the Mega turn (b) If the 0-degree turn I. to I '5 allowed. I to I, is allowed. Figure 6.3. The potential cycles created by allowing 0-degree turns when travelling north. -"--‘—-r— 82 If nonminimal routing is desired, the final step is to incorporate as many 180. degree turns as possible without reintroducing cycles. The five pairs of possible 180-degree turns are from the eastward channels to the westward channels and vice verso, from the first set of northward channels to the first set of southward channels and vice versa, from the first set of northward channels to the second set of southward channels and vice verso, from the second set of northward channels to the first set of southward channels and vice versa, and from the second set of northward channels to the second set of southward channels and vice versa. Testing whether allowing each 180—degree turn creates cycles, much as was done for O—degree turns, reveals that three particular turns must remain prohibited. These are the turns from the eastward channels to the westward channels, from the second set of southward channels to the first set of northward channels, and from the second set of northward channels to the first set of southward channels. These three turns come from three different pairs. In the other two pairs, neither of the turns creates a cycle when allowed. Both of the turns in a pair cannot be allowed, however. If both turns in a pair were allowed, they would form a cycle by themselves. Thus, in three pairs, a specific 180-degree turn can be allowed; and in the other two pairs, either, but not both, of the 180-degree turns can be allowed. Therefore, four different sets of five 180-degree turns can be allowed. That half of the 180-degree turns can be allowed is no accident. Theorem 6.1 The mamimum number of 180-degree turns that the turn model can allow in a direct network is always one half of the total number of 180-degree turns. Proof: The ISO-degree turns in a direct network always come in pairs: for each 180—degree turn, its reverse is always a different ISO-degree turn. As argued before, both of the turns in a pair cannot be allowed, without creating a cycle by themselves. Therefore, the turn model can allow at most one half of the 180-degree turns. If .a ISO-degree turn creates a cycle when allowed, some combination of the other allowed 83 turns must form the equivalent of the reverse 180-degree turn. If both of the turns in a pair create cycles when allowed, the other allowed turns must form the equivalents of both of the turns, which implies that the other allowed turns can also form a cycle. This contradiction implies that at least one of the turns in each pair can be allowed. Therefore, the turn model can allow at least one half of the l80degree turns. D The routing algorithm produced by the preceding application of the turn model is called the momimolly adoptive double-y routing algorithm or mad-y for short. It routes northeastbound packets first along the first set of northward channels and then along the second set of northward channels and the eastward channels. It routes southeast- bound packets first along the first set of southward channels and then along the second set of southward channels and the eastward channels. It routes southwestbound pack- ets first along the westward channels and the first set of southward channels and then along the second set of southward channels. It routes northwestbound packets first along the westward channels and the first set of northward channels and then along the second set of northward channels. Proof that the mad-y routing algorithm is deadlock free is based on the work of Dally and Seitz [19], who show that a routing algorithm is deadlock free if the channels in the direct network can be numbered so that the algorithm routes every packet along channels with strictly increasing (or decreasing) numbers. Theorem 6.2 The mad-y routing algorithm for double-y networks is deadlock free. Proof: Assign each channel in the m x n- mesh a two-digit number a,b in base r, where r is the greater of 2m and n. Figure 6.4 shows the particular numbers to assign to the channels leaving node (m, y). To show that the mad-y algorithm routes every packet along channels with strictly increasing numbers, it is sufficient to show that, for each channel into an arbitrary node, the algorithm can only route the packet 84 out along a channel with a higher number. Figure 6.5 shows the six possible cases. Examination shows that, in each case, the channels used to route a packet out of a node have higher numbers than the input channel. [I] m-l-x,l+y m+x,l+y m-x,0 m+l+x,0 m-l-x,n-y m+x,n-y Figure 6.4. Numbering of the channels leaving each node (2:, y) for the mad-y routing algorithm. The first question from the beginning of this section has now been answered. The double-y routing algorithm is not the most adaptive algorithm for two-dimensional meshes with double y channels. It prohibits eight of the sixteen 90-degree turns, all four O-degree turns, and all ten of the 180—degree turns, while the mad-y algorithm prohibits only four of the sixteen 90-degree turns, two of the four O-degree turns, and five of the ten 180-degree turns. In addition, prohibiting any fewer than four 90—degree turns, two 0~degree turns, or five ISO-degree turns allows packets to wait on each other in cycles, creating deadlock. Thus, the mad—y algorithm cannot be made more adaptive. The second question from the beginning of this section, whether fully adaptive routing is possible in two-dimensional meshes with fewer channels, can be answered by applying the turn model to two-dimensional meshes with double eastward and northward channels (Figure 6.6(a)). Such a network has six virtual directions and eighteen 90-degree turns (Figure 6.6(b)). The eighteen turns form eight cycles of four 85 m+XII+Y m+x,l+y l m-l-x,0 m+x,0 m+l+x,0 fir” m+x,n-y m+x,n-y m-l-x,n-1-y m-l-x,l+y m+x,l+y m-x,0 m+l+x,0 m-x,0 m+1+x,0 m-l-x,n-y m+x,n-y m-l-x,y m+x,n-l-y m+x,l+y m+1+x,0 m+1+x,0 m+x,n-y m+x,y Figure 6.5. For each input channel, the output channels allowed by the mad-y routing algorithm have higher numbers. 86 turns. It is not possible to break all of the cycles by prohibiting turns and retain full adaptiveness, however. Thus, two-dimensional meshes with double eastward and northward channels do not support fully adaptive routing, even though they have the same number of channels as double-y networks. Also, two-dimensional meshes with just double eastward channels or just double northward channels do not support fully adaptive routing, since they have even fewer of the same turns as two-dimensional meshes with double eastward and northward channels. Therefore, of the obvious regular direct networks, double-y networks have the minimum number of channels necessary to support fully adaptive routing and have the only arrangement of these channels, other than double :1: channels, that support fully adaptive routing. # 4"“: r1 $7.791 LA in ‘1 .x: :“I *— LN} +— -II->+X2 I". 1;; .1 ' l" (a) a router with double ; channels in the +X (east) and +Y (north) directions (b) 18 poSsible turns Figure 6.6. A router for a two-dimensional mesh with double eastward and northward channels and the possible turns. 87 6.2 Simulation Experiments This section reports on simulations designed to compare the double-y, mad-y, and my routing algorithms for different traffic patterns. The direct network studied is a 16 x 16 mesh with double y channels. Neighbors along the y dimension are connected by two pairs of unidirectional physical channels in order to simplify the computations. The five traffic patterns studied are uniform, center-reflection, matrix-transpose, bit- reversal, and matrix-multiply. The input selection policy is distance-travelled, the output selection policy is my, and the routing is minimal. The first simulations explore whether the my routing algorithm or its reverse, the ym routing algorithm, performs better under uniform traffic. These two nonadaptive algorithms are allowed to route packets along any of the y channels, so that they are not at a disadvantage relative to the adaptive algorithms in later simulations. Because the y dimension has twice the number of channels and twice the bandwidth of the m dimension, however, one of the nonadaptive algorithms may perform better than the other. The results of the simulations are shown in Figure 6.7. At all throughputs, espe- cially high ones, the my routing algorithm has the lower latency. The my algorithm also has a maximum sustainable throughput that is about 13% higher. These improvements are caused by the greater throttling effect of initially routing packets along the dimension with the fewer number of channels. Throttling is inherent in direct networks. Increased traffic tends to increase contention for channels, and the increased contention tends to restrict the number of new packets that can enter the network, thereby reducing traffic. When some dimensions have fewer channels than others, contention tends to increase most rapidly in the dimensions with the fewest channels, making them the bottleneck. Routing packets along these dimensions first places the bottleneck at the best point for keeping new packets from entering the 88 network and tying up resources. Similar reasoning is behind the decision to use the my output selection policy for the adaptive routing algorithms instead of a 3):: policy. The my output selection policy increases contention for the smaller number of m channels, which are the bottleneck for packets travelling through the network. The distance-travelled input selection policy causes competitions for the m channels to be won by packets that have travelled farther through the network and that, therefore, tend to be older. Thus, with the my output selection policy, younger packets tend to hold fewer precious resources, and older packets tend to finish with less interference. 40 35 '- 30 - 25 - Average latency 20 - (twee) 15 _ 10 - 5T7 -J 0 1 I I I 1 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 6.7. Comparison of nonadaptive routing algorithms for uniform traffic in a 16 x 16 mesh with double y channels. The remaining simulations explore how the performances of the my, double-y, and mad-y routing algorithms for double-y networks vary with the traffic pattern. Figures 6.8 to 6.12 show the results. For uniform traffic and high throughputs, the nonadaptive my algorithm has lower latencies than the fully adaptive algorithms. At 89 low throughputs, however, the mad-y algorithm has slightly lower latencies than the others. The maximum sustainable throughput for the my algorithm is also higher than those of the fully adaptive algorithms by around 50%. The results for center-reflection traffic are similar. For the matrix-transpose, bit-reversal, and matrix-multiply traffic patterns, the mad-y algorithm has the lowest latencies at all throughputs, the double- y algorithm has the second lowest latencies, and the my algorithm has the highest latencies. These differences increase as throughput increases. At high throughputs, the mad-y algorithm has latencies around 30% lower than those of the double-y algorithm. The maximum sustainable throughputs of the fully adaptive algorithms are around 50% higher than that of the my algorithm. Finally, regardless of the traffic pattern, the mad-y routing algorithm has lower latencies than the double-y algorithm. 40 I I I 35 " $3] *- double-y @— 30 mad-y D— 25 Average latency 20 J I L I o 5 1o 15 20 25 30 Percent of maximum theoretical throughput Figure 6.8. Comparison of routing algorithms for uniform traffic in a 16 x 16 mesh with double y channels. As in Section 5.6.3, the differences in the performances of all the algorithms for 90 40 I I I I 35 F $31 ->(— .. double-y @— 30 - . mad-y -D— _ Average25 - . latency 20 - .. (Itsec) l5 " -I 10 - 1 5T -. 0 L l I I I 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 6.9. Comparison of routing algorithms for center-reflection traffic in a 16 x 16 mesh with double y channels. 40 m I F I I I 35 - 3y -)<— q double-y @— 30 - mad-y Q— q Averagez'5 ‘ latency 20 - (pace) 15 n 10 0 7 I I I I I 0 5 10 15 20 25 Percent of maximum theoretical throughput 30 Figure 6.10. Comparison of routing algorithms for matrix-transpose traffic in a 16 x 16 mesh with double y channels. 91 40 I I I I 35 "' 3y 9(— .( double-y 0- 30 mad-y 5- q Average d latency 20 q (pace) 15 - 10 - 5 .- 0 7 I 1 I l 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 6.11. Comparison of routing algorithms for bit-reversal traffic in a 16 x 16 mesh with double y channels. 40 F I I I I 35 - my -)(— a double-y 0— 30 - mad-y D— _ Average25 ‘ latency 20 - (#986) 15 - 10 - 5 .. 0 7 I I I I I 0 5 10 15 20 25 30 Percent of maximum theoretical throughput Figure 6.12. Comparison of routing algorithms for matrix-multiply traffic in a 16 x 16 mesh with double y channels. is 92 individual traffic patterns and across traffic patterns are largely due to differences in the types of information about traffic used in the algorithms. The nonadaptive my algorithm happens to embody global, long-term information about the uniform and center-reflection traffic patterns. It spreads uniform and center-reflection traffic as evenly as possible across the channels of a mesh in the long run. The adaptive algorithms, on the other hand, select channels based on local, short-term information. These selections tend to benefit just the routed packet, and only for the immediate future, and tend to interfere with other packets. The selections often produce zigzag paths, which disturb the global, long-term evenness of uniform and center-reflection traffic. The result is increased contention and decreased performance. For the matrix-transpose, bit-reversal, and matrix-multiply traffic patterns, the situation is different. The my algorithm embodies the wrong global, long-term in- formation for these patterns and spreads the traffic very unevenly. The adaptive algorithms, especially the more adaptive mad-y algorithm, spread the traffic fairly evenly by avoiding local, short-term hot spots. They do not spread the traffic as evenly as an algorithm would that embodied the correct global and long-term infor- mation, but they come reasonably close. The maximum sustainable throughputs for the my algorithm under uniform and center-reflection traffic are 29% and 30%; for the adaptive algorithms under matrix-transpose, bit-reversal, and matrix-multiply traffic, they are 25%, 20%, and 15%. What makes the adaptive routing algorithms, particularly the mad-y algorithm, preferable overall is that uniform and center-reflection traffic patterns are rare. In an actual system, traffic is much more likely to be a changing mix of nonuniform traffic patterns over time. For most of the nonuniform traffic patterns simulated, the adaptive routing algorithms perform better than the nonadaptive algorithms. 93 6.3 n-Dimensional Meshes Designing maximally fully adaptive routing algorithms is easy for two-dimensional meshes. The analysis is simple, and a little trial and error reveals the virtual channels that must be added to the topology. For higher dimensions, it is less obvious exactly how many virtual channels need to be added and where, and the analysis of particular cases is much more difficult. This section uses the turn model and insights gained from applying it to two-dimensional meshes to discover how to easily design maximally fully adaptive routing algorithms for n—dimensional meshes. The first step is to discover the minimum number of virtual channels required and their positions. Then, it is straightforward to design a fully adaptive routing algorithm that is not maximally adaptive, much as other models might do. Analysis of the fraction of turns used by such an algorithm reveals a very low degree of adaptiveness, however. Consequently, the next step is to discover how to easily modify a fully adaptive routing algorithm to make it maximally adaptive, with the same result as rigorous application of the turn model. Analysis of the fraction of turns used by a maximally fully adaptive routing algorithm reveals a high degree of adaptiveness. To start, an n-dimensional mesh has 2n topological directions: a positive and a negative direction along each dimension. For each dimension, a packet must travel in at most one of the two directions. Thus, each packet must travel in at most n directions and belOngs to at least one of 2" classes, based on its particular combination of up to n directions. Each class can be designated by a sequence of n 0’s and 1’s, such as 11010000, where 0 or 1 in position i represents routing in the negative or positive direction of dimension i. For each class, a fully adaptive algorithm must route packets along a set of virtual channels in each of the n directions. The virtual network model would use a separate set of virtual channels for each direction in each class. In the case of two-dimensional meshes, however, Section 6.1 shows that pairs of 94 classes can share sets. For the double-y and mad-y algorithms, the 00 and 01 classes share the westward channels, and the 10 and 11 classes share the eastward channels. The sets of virtual channels in the two pairs of classes can be represented graphically as in Figure 6.13. Figure 6.13. The directions of the virtual channels in the two pairs of classes for fully adaptive routing in a two-dimensional mesh. Theorem 6.3 In an n-dimensional mesh, a pair of classes can share all but one set of virtual channels and still prevent deadlock. Proof: Assume that the two classes share sets in every dimension except i. Be- cause two classes can share a set in a dimension only if they have the same direction in that dimension, the two classes must have the same directions in all but dimension i. In dimension i, the classes must have different directions; otherwise, the two classes would be the same class. If at least one of the two 180-degree turns along dimension i are prohibited, there can be no cycles involving the n + 1 sets. A cycle would require travelling in both directions of two dimensions, and only one dimension, i, has virtual channels in two directions. Therefore, deadlock is prevented. D From the standpoint of reducing the number of virtual channels required for the fully adaptive routing of two classes, being able to share all but one set is ideal. Unfortunately, a class involved in such a pair cannot be a member of any other pair. 95 Theorem 6.4 In an n-dimensional mesh, if a class is paired with each of two other classes and shares all but one set of virtual channels with each, deadlock is possible. Proof: Assume that class one does not share sets in dimension i with class two and does not share sets in dimension j with class three. Dimensions i and j must be different; otherwise, classes two and three would be the same class. The directions of. the three classes in the two dimensions i and j must be either 00, 01, 10; 00, 01, 11; 00, 10, 11; or 01, 10, 11. Each of these four cases is symmetric to the example of deadlock shown in Figure 5.3. C] If a class is paired with each of two other classes and shares less than n — 1 sets with each, even more sets of virtual channels are involved and deadlock is still possible. Therefore, the biggest reduction in the number of virtual channels comes from pairing classes so that they share all but one set. Such a pairing leads to 2""1 pairs of classes. Since each pair requires n + 1 sets of virtual channels, each interior node of the mesh has (n + l)2"‘1 output network channels. The pairs can be formed so that the maximum number, V, of virtual channels per physical channel is V = [n + 12n-2] n Some physical channels have V virtual channels, and the others have V — 1. In contrast, the wraparound model requires the same total number of virtual channels as the turn model, but places more of the virtual channels along dimension 0. For the wraparound model, V = 2"“. Also unlike the turn model, the wraparound model does not show that the number of virtual channels used is the minimum necessary for fully adaptive routing. The virtual network model, on the other hand, requires more virtual channels than either the turn model or the wraparound model. The simplest fully adaptive routing algorithm is to route each packet along the 96 sets of virtual channels for its class. This algorithm is the one that the wraparound model produces. The problem with the algorithm is that it is not nearly as adaptive as it could be, considering the number of virtual channels that have been added. One meaningful measurement of the degree of adaptiveness of a routing algorithm in an n-dimensional mesh with virtual channels is the fraction of 90-degree turns allowed from one virtual direction to another, A. A is easiest to estimate if it is assumed that every physical channel has V virtual channels, which happens when n = 2‘, where i 2 2. In this case, V = lily-2 n In such a network, the number of virtual directions is 2nV, and the number of possible 90-degree turns from one virtual direction to another is 2 2nV(2n — 2)V = n(n _ 1) (1&1) 22n-2 The simple fully adaptive routing algorithm allows n(n — 1)+ (n — l)2 or (n +2)(n -— 1) 90-degree turns from one virtual direction to another per pair of classes. The total number of turns is, thus, (n + 2)(n - l)2”‘1. Therefore, _ (n +1)2 —1 ~ 1 — (n +1)22n-1 ~ 273-1 Thus, for all but low-dimensional meshes, the simple fully adaptive routing al- gorithm makes very poor use of the virtual channels. Fortunately, the turn model can produce more adaptive routing algorithms, as Section 6.1 demonstrated for two- dimensional meshes. The difficulty is that applying the turn model to meshes with many dimensions and virtual channels is very tedious. In the case of maximally fully adaptive routing, however, the turn model can be simplified. The insight for 97 the simplification comes from comparing the double-y and mad-y routing algorithms for two-dimensional meshes. The mad-y algorithms is basically the double-y routing algorithm with turns added from the virtual channels in one pair of classes to the virtual channels in the other pair. This insight suggests the following design method for maximally fully adaptive routing algorithms. First, design a simple fully adaptive routing algorithm as described above. Second, order the pairs of classes. Finally, add all turns from earlier pairs to later pairs. How adaptive are the algorithms produced by this special case of the turn model? First, the maximum number, P, of pairs with channels in both directions of some n—2 12:2? 1 72 Some dimensions have P pairs with channels in both directions, and the other dimen- particular dimension is sion have P - 2 pairs. A is easiest to estimate if it is assumed that every dimension has P pairs with channels in both directions, which happens when n = 2‘, where i2 2. In this case, 2n—1 II P The number 90-degree turns from one virtual direction to another added to the simple fully adaptive routing algorithm is (2(n -1)+(n _1,,,!LI’2:;2, + (n -1+ a?) (”4‘2“ ’ I) — 53—22,.) 2 2 or ' 1 2""z(2"'1(n2 + n — 1 — a) — n2 — n + 2) Therefore, n— 1 — 2n ~ bDl 98 Thus, the degree of adaptiveness is approximately independent of the number of dimensions. Table 6.1 shows that the degree of adaptiveness quickly approaches one half as the number of dimensions increases. Table 6.1. The degree of adaptiveness for maximally fully adaptive routing algorithms. ur-—-sr .—.. _.- —‘ . I. _l I I . ‘9 g I 12/16 = 0.75 104/168 z 0.62 672/1200 = 0.56 585088/1161216 z 0.50 mfiww 6.4 Three-Dimensional‘Meshes This section illustrates the ease with which the simplified model of the previous section can be applied to the design of a maximally fully adaptive routing algorithm for three- dimensional meshes. The first step is to identify the 23 classes of messages. The eight classes can be paired as shown in Figure 6.14. This pairing nearly balances the number of virtual channels in each dimension, with two virtual channels per physical channel in the m dimension and three virtual channels per physical channel in the y and 2 dimensions. The next step is to order the four pairs of classes. Any order is probably as good as another. The final step is to allow all 90-degree turns within pairs of classes, one 180-turn within each pair, and all 0-, 90-, and ISO-degree turns from earlier pairs to later pairs. That amounts to 14 0-degree turns, 104 90-degree turns, and 22 180-degree turns for the maximally fully adaptive routing algorithm. A simple fully adaptive routing algorithm, in contrast, would allow only 40 90—degree turns. 99 Figure 6.14. The directions of the virtual channels in the four pairs of classes for fully adaptive routing in a three-dimensional mesh. CHAPTER 7 Fault-Tolerant Routing The turn model produces routing algorithms that are deadlock free, maximally adap- tive, minimal or nonminimal, and livelock free. This combination is ideal for fault- tolerant routing. This chapter presents a model of fault-tolerant routing, a one-fault- tolerant routing algorithm for two-dimensional meshes, and simulation results for nonminimal and fault-tolerant routing algorithms. 7 .1 A Model of Fault-Tolerant Routing The objective of fault-tolerant routing is to maximize the ability of the good nodes in a direct network to communicate with each other in the presence of faulty nodes and channels. Ideally, all of the good nOdes would be able to communicate with each other, regardless of the number and position of the faulty nodes and channels. Because nodes in direct networks rely on neighboring nodes to forward message packets, however, such ideal fault tolerance is not possible. This section describes a set of requirements for the design of direct networks that support fault-tolerant routing. It is not, by any means, the only possible set of requirements, but it is a convenient one for an initial examination of fault tolerance. To simplify the requirements some, nodes and channels are not considered separately. 100 101 Instead, the input network channels for a node are considered to be part of the node. The first requirement is that faults be detectable. To the extent that some are not, packet routing and transmission may not function properly. Second, once any part of a node is detected as faulty, all other parts of the node are detectable only as faulty. Packets passing through a node when it fails immediately have an erroneous tail flit inserted in their stream of fiits. Then, the faulty node stops operating, performing no further routing and generating no further messages. A node that receives a packet for a faulty neighbor or for a node to which the path is blocked by faulty nodes discards the packet and possibly returns an acknowledgement of the error to the sender. Error detection and acknowledgement of packet receipt are performed from end to end. If a node sending a packet receives an acknowledgement that the destination node has received the packet successfully, the transmission is complete. If the sender receives an acknowledgement of an error or receives no acknowledgement after a long period, it may resend the packet. Doing so will probably correct transmission errors, but still may not allow the packet to be delivered around all of the faulty nodes. 7 .2 Two-Dimensional Meshes Based on the model of fault-tolerant routing given in the previous section, it is pos- sible to design a routing algorithm that is one-fault tolerant for a two—dimensional mesh that has no virtual channels. That is, the algorithm continues to route messages between all of the good nodes no matter which one node is faulty. One-fault tolerance is the best that can be achieved in a two-dimensional mesh, since it is possible for a corner node to be isolated by faults in its two neighbors. It is also a surprising re- sult, considering that two-dimensional meshes without virtual channels are commonly thought to require nonadaptive routing. The one—fault-tolerant routing algorithm is a modification of the negative-first 102 routing algorithm produced by the turn' model for two-dimensional meshes. The nonminimal version of the negative-first routing algorithm can usually route packets adaptively during both its negative and positive phases. The only time during the negative phase when routing is nonadaptive is when a packet is at the west or south edge of the mesh. Then, the packet has only one direction it can be routed: along the edge in a negative direction. The only time during the positive phase when routing is nonadaptive is when a packet has only one dimension to travel along in order to reach the destination. A few simple modifications of the negative-first routing algorithm can remove these few cases of nonadaptiveness. The modifications are basically to route a packet around faulty nodes on the west or south edge of the mesh, to route a packet farther west or south than required for minimal routing in order to make the remainder of the path to the east or north adaptive, and to leave the remainder of the path adaptive until the last hop. Figure 7.1 illustrates these modifications with a few examples of the new paths. faulty node é minimal path Figure 7.1. Examples of the paths allowed by the one-fault-tolerant routing algorithm for two-dimensional meshes. ' 103 The one-fault-tolerant routing algorithm resulting from the modification of the negative-first routing algorithm has the following four phases. 1. Route a packet as far west and south as necessary. If a faulty node not on the west or south edge blocks the path, route the packet an extra hop west or south. If a faulty node on the west edge blocks the path to the west, route the packet one hop north. If a faulty node on the south edge blocks the path to the south, route the packet one hop east. If a faulty node on the west edge blocks the path to the south, route the packet one hop east. If a faulty node on the south edge blocks the path to the west, route the packet one hop north. 2. If the packet must travel more than one hop north and no hops east, route the packet one hop west if possible. If not possible because the node to the west is faulty, route the packet one hop north and repeat this phase. 3. If the packet must travel more than one hop east and no hops north, route the packet one hop south if possible. If not possible because the node to the south is faulty, route the packet one hop east and repeat this phase. 4. Route the packet as far east and north as necessary, maintaining the ability to route the packet both east and north as long as possible. If a faulty node on the west or south edge of the mesh blocks the path to a destination on that edge, route the packet one hop perpendicular to the edge, two hops toward the destination, and one hop back to the edge. While a faulty node blocks the path back to the edge, route the packet one more hop toward the destination and one hop back to the edge. As described in the previous section, if a node ever finds it impossible to route a packet farther, it discards the packet and possibly returns an acknowledgement of the error to the sender. 104 The one-fault-tolerant algorithm can always route packets around a single faulty node and can often route packets around multiple faulty nodes. Proof for a 4 x14 mesh is given in Figure 7.2. The figure shows how to number the channels of a 4 x 4 mesh so that the one-fault-tolerant routing algorithm routes packets along them in strictly increasing order. Channels with two numbers, one being in parentheses, vary their numbering according to which nodes are faulty. If the node immediately to the southwest of the channel is good, the number outside of the parentheses is used. If the node immediately to the southwest of the channel is faulty, the number inside of the parentheses is used. Varying the numbering is a legitimate proof technique since each failure of a node creates a new direct network in which a new proof must be constructed. Fortunately, similar numberings work for all of the possible networks; only seven faulty nodes require that some channels be renumbered. Examining the numberings in detail shows that no matter which one node is faulty and no matter which good nodes are the source and destination of a packet, the packet can be routed from the source to the destination along channels with strictly increasing numbers. Therefore, the routing algorithm is both deadlock free and one-fault tolerant for a 4 x 4 mesh. The proof can easily be extended to two-dimensional meshes with other sizes. 7 .3 Simulation Experiments The simulations reported in the previous chapters have all been of minimal routing algorithms. This section describes the results of simulations of some nonminimal routing algorithms, including the one-fault-tolerant algorithm from the previous sec- tion. The direct network studied is a 10 x 10 mesh. The two traffic patterns studied are uniform and matrix-transpose. The input selection policy is distance-travelled, and the output selection policy is no-turn. 105 C(22) 3 ‘_ 0 [E ,1.3"—‘—1 2.3“'—“, 3.3 21 (5) 24 2‘7 7[ 20 6 21 3 I24 0 27 _ 9(19) - s 3 [0.2 nth—52.2 .,3.2 ‘ 18 (8) 7 ‘ 21 A 24 7 10 17 9 18 6 21 3 24 12(16) . 9 ‘ 6 _ ran (1,1"—"— 2,1 3,1 7 15 (11) 18 7 21 4 13 14 12 15 9 18 s 21 I (16) (u) (19) 1(8) (22) (5) 13 10 7 [fl 1,0“'— 2,0"'—— 3,0 14 17 20 Figure 7.2. Numbering of a 4 X 4 mesh for the one-fault-tolerant routing algorithm. The first simulations compare minimal and nonminimal versions of the west-first, north-last, and negative-first routing algorithms for uniform traffic. Figures 7.3, 7.4, and 7.5 show the results. In each case, nonminimal routing has slightly lower latencies at low throughputs and has much higher latencies at high throughputs. These results suggests that, when contention is high, shorter path lengths (min- imal routing) are more important than greater adaptiveness (nonminimal routing). The advantage of nonminimal routing, however, is that it can greatly increase fault tolerance. One way to make nonminimal routing more practical is to restrict the number of hops that do not lead closer to the destination. In the previous simula- tions of nonminimal routing, the number of misroutings was unrestricted. In the next set of simulations, the number of misroutings is varied. Figure 7.6 shows the results. As the number of misroutings allowed increases, so does the average latency at high throughputs. Again, the results suggest that nonminimal routing should be avoided. To be most practical, nonminimal routing should be used only to avoid faulty nodes. The one- 106 40 j I I I I I I 35 - minimal --><— . nonminimal -0— 30 '- .4 Average - latency 20 -4 (wee) 15 r 10 n 5 -1 0 7 I I I I I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 7.3. Comparison of the minimal and nonminimal west-first routing algorithms for uniform traffic. ' 40 I I I I I I 35 " minimal 9(— . nonminimal '0- 30 r -‘ Average - .- latency 20 - < -: (11m) 15 r - 10 - '1 5 -1 0 I I I I I I I 0 5 10 15' 20 25 30 35 40 Percent of maximum theoretical throughput Figure 7.4. Comparison of the minimal and nonminimal north-last routing algorithms for uniform traffic. 107 40 I I TI I I i I 35 -' minimal 9(— . nonminimal -O— 30 L- one-fault-tolerant -k— - 25 .. Average latency 20 q (#sec) 15 - 10 q 5 - 0 I I .' J L I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 7.5. Comparison of the minimal and nonminimal negative-first routing algo- rithms for uniform traffic. 40 I I WI. I 35 - minimal *- 1 misrouting @— 30 - 2 misroutings D— 3 misroutings A- 00 misroutings -o— Average latency 20 (psec) 15 10 I I I I I I 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 7.6. Comparison of different amounts of misrouting in the negative-first routing algorithm for uniform traffic. 108 fault-tolerant version of the negative-first routing algorithm uses nonminimal routing only to avoidnodes that have failed or that may fail and prevent packet delivery. In the remaining simulations, this algorithm is compared to the minimal and nonminimal versions of the negative-first algorithm. Figures 7.5 and 7.7 show the results for uniform and matrix-transpose traffic. At low throughputs, nonminimal routing has the lowest latencies, minimal routing has slightly higher latencies, and fault-tolerant routing has the highest latencies. At high throughputs, minimal routing has the lowest latencies, fault-tolerant routing has the next highest latencies, and nonminimal routing has the highest latencies. Thus, there is a cost in performance for fault-tolerance. Figure 7.6 suggests that the cost is largely due to the fault-tolerant algorithm allowing each packet one misrouting. 40 I I I I I I I 35 - minimal *— .. nonminimal -9-- 30 P onefault-tolerant d— . Average25 latency 20 0 5 10 15 20 25 30 35 40 Percent of maximum theoretical throughput Figure 7.7. Comparison of the minimal and nonminimalnegative-first routing algo- rithms for matrix-transpose traflic. CHAPTER 8 Summary and Conclusions For electronic computers to continue improving in performance far into the future, computer architectures must become massively parallel. A promising architecture involves interconnecting thousands of processors and other components using direct networks and wormhole routing. One. obstacle to this approach is designing the best algorithms for routing message packets through the direct networks. This dissertation presents a new model for designing routing algorithms. Unlike previous models, which are largely based on adding virtual channels, the new model is based on analyzing the directions in which packets can turn in a network. This turn model explains why most routing algorithms are deadlock free. More importantly, it systematically designs routing algorithms that are maximally adaptive for a network. For networks without virtual channels, the turn model produces routing algorithms that are partially adaptive. For networks with virtual channels, the turn model can produce routing algorithms that are fully adaptive. Adaptiveness is important in a routing algorithm because it reduces the effects of hot spots, unfair channel allocation, and faults. The turn model also designs routing algorithms that are ideal for fault tolerance, being minimal or nonminimal and livelock free, in addition to maximally adaptive. Based on a simple model for fault tolerance, the turn model designs a one-fault-tolerant routing algorithm for two-dimensional meshes, without adding any 109 110 virtual channels. This dissertation also presents the results of numerous simulations. Simulations of different input selection policies show the distance-travelled policy to perform the best, implying that it is more important for a policy to reduce contention than to provide fairness. Simulations of different output selection policies show the no-turn policy to perform the best, implying that it is more important for a policy to reduce contention than to increase adaptiveness. Simulations of different traffic patterns and routing algorithms show the nonadaptive dimension-order routing algorithm to per- form the best for uniform traffic and more adaptive routing algorithms to perform better for nonuniform traffic. The implication is that spreading traffic evenly across the channels of a network is very important for good performance by a routing al- gorithm. Three ways are suggested to make partially adaptive routing algorithms spread traffic more evenly. 8.1 Implementation Considerations Whether the improvements in performance expected from the turn model are realiz- able in real systems depends on two things: whether real traffic patterns are highly nonuniform and whether the routing algorithms can be implemented in efficiently. The latter factor is the topic of this section. Adaptive routing requires that more or larger header flits be present in a router at one time, increasing communication latencies. While nonadaptive routing decisions can be made based on the distance remaining in a single dimension, adaptive routing decisions must be based on the distance remaining in more than one dimension, if not all dimensions. For dimension ordering, the address of the destination node can be stored in a series of header flits, one or more for each dimension. Only the first header flit is required for each routing decision. When it is no longer needed, it can 111 be discarded and the next flit used. For adaptive routing, the best approach may be to place the least significant bits of each dimension in the first header flit, the next least significant bits in the second header flit, and so forth. For example, the first flit might contain bits 0 to 3, the second store bits 3 to 6, the third store bits 6 to 9, and so forth. Then, when the distances in the first flit are reduced to zero, the flit is discarded, the distances in the second flit are decremented by one, and a new first flit is generated by the router. Adaptive routing also requires more complex control circuits. More information is involved in the routing decision, and the decision rules are more complex. The extra control hardware means more expense and possibly longer decision times. How much longer decision times are depends on the specifics of the direct network and routing algorithm. Basing the decision on more information does not necessarily require more time. It is likely, however, that the complexity of the decision rules increases decision times by a few clock cycles. Finally, adaptive routing may increase the time for reassembling the packets of a message. Nonadaptive routing algorithms usually prevent the packets in a message from passing each other on their way through a network, because all of the packets follow the same path. Adaptive routing, on the other hand, makes no guarantee that packets will arrive at the destination in the order they are sent. Thus, reassembling a message may require sorting. There may have to be more packets for a message, too, since each packet must carry information about its proper order. 8.2 Future Work There are many possibilities for future work. One of the most important is identi- fying realistic tramc patterns, so that the results of future simulations can be more meaningful. Another is exploring the efficiency with which the routing algorithms 112 produced by the turn model can be implemented in hardware. There are many more input and output selection policies to be explored, too, es— pecially ones based on additional information. For example, an input selection policy that assigns packets priorities based on the distance travelled, minus the distance remaining and the packet length, might release more channels more quickly. Also, an output selection policy that waits a short period before selecting a nonpreferred out- put channel, especially a channel that leads away from the destination, may improve performance. The direct networks simulated in this dissertation had one buffer for each input channel. Another topic for research is the impact on the performance of adaptive routing algorithms by queues of buffers of different sizes. Preliminary research shows that infinitely large queues can keep at most an additional quarter of the channels busy in a k x k mesh. The turn model has been applied to n-dimensional meshes and k-ary n-cubes in this dissertation, but it can also be applied to other topologies, such as hexagonal networks, octagonal networks, cube-connected cycles, and express cubes [30]. In such topologies, the turns are not necessarily 90-degrees and the simple cycles are not necessarily formed by four turns. The turn model should again produce interesting and useful routing algorithms. So far, the research has involved only unicast communication, in which each mes- sage has only one destination. The question is whether the turn model can help design routing algorithms for multicast and broadcast communication, in which each message can have multiple destinations. The turn model may be applicable to the multicast star model of multicast communication [15]. Many of the routing algorithms designed by the turn model can be described strictly in terms of routing packets first in one group of directions and then in another. Some examples are given in Table 8.1. They suggest a second model for designing 113 routing algorithms, one based on grouping channel directions into routing phases. This direction-based model can never completely replace the turn model, however. Many of the algorithms designed by the turn model, such as six of the nine for three- dimensional meshes, cannot be described in terms of grouping channel directions into phases, and would have been nearly impossible to design without the turn model. The proofs of deadlock freedom for individual routing algorithms suggest a third model for designing routing algorithms. The proofs rely on numbering individual channels, and the model suggested is based on grouping individual channels into routing phases. Whether the two suggested models can be developed to the point that they are guaranteed to produce routing algorithms that are deadlock free, maximally adaptive, minimal or nonminimal, livelock free, and fault tolerant is an open question. If they can, they may prove to be simpler to apply than the turn model. Table 8.1. The directional phases of some routing algorithms. routing ' ° ' in —0, +0 —1 +1 west- —0 —1,+0,+1 've- —0, -1 +0, +1 -0,—1,+0 +1 -l,-2,...,—n-l —n,+1,+2, —l,-2,...,—n-1,—n +1,+2,... -1—2,... —n-1,-—n +1 +2,... BIBLIOGRAPHY BIBLIOGRAPHY [l] S. B. Borkar, R. Cohn, C. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb, “iWarp: An integrated solution to high-speed parallel computing,” in Proceedings of Supercomputing ’88, pp. 330—339, Nov. 1988. [2] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, “The directory-based cache coherence protocol for the DASH multiprocessor,” in Proc. of the 17th International Symposium on Computer Architecture, pp. 148—159, May 1990. [3] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, “APRIL: A processor ar- chitecture for multiprocessing multiprocessor,” in Proc. of the 17th International Symposium on Computer Architecture, pp. 104-114, May 1990. [4] W. C. Athas and C. L. Seitz, “Multicomputers: Message-passing concurrent computers,” IEEE Computer, pp. 9—25, Aug. 1988. [5] Intel Corporation, A Touchstone DELTA System Description, 1991. [6] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic, C. S. Steele, and W. K. Su, “The architecture and programming of the Ametek Series 2010 mul- ticomputer,” in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, vol. 1, (Pasadena, CA), pp. 33—36, Association for Computing Machinery, Jan. 1988. [7] H. Ishihata, T. Horie, S. Inano, T. Shimizu, S. Kato, and M. Ikesaka, “Third- generation message-passing computer AP1000,” in International Symposium on Supercomputing, Kyusyu, pp. 46-55, nov 1991. [8] W. J. Dally, ‘_‘The J-machine: System support for Actors,” in Actors: K nowledge- Bascd Concurrent Computing (Hewitt and Agha, eds.), MIT Press, 1989. [9] NCUBE Company, NCUBE 64 00 Processor Manual, 1990. 114 115 [10] W. J. Dally, “Performance analysis of k-ary n-cube interconnection networks,” IEEE Transactions on Computers, vol. C-39, pp. 775-785, June 1990. [11] W. J. Dally and C. L. Seitz, “The torus routing chip,” Journal of Distributed Computing, vol. 1, no. 3, pp. 187-196, 1986. [12] P. Kermani and L. Kleinrock, “Virtual cut-through: A new computer commu- nication switching technique,” Computer Networks, vol. 3, no. 4, pp. 267—286, 1979. [13] J. Y. Ngai and C. L. Seitz, “A framework for adaptive routing in multicomputer networks,” in Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, 1989. [14] L. M. Ni, “Communication issues in multicomputers,” in Proceedings of the First Workshop on Parallel Processing, (Hsinchu, Taiwan), pp. 52-64, Dec. 1990. [15] X. Lin and L. M. Ni, “Deadlock-free multicast wormhole routing in in multicom- puter networks,” in Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 116-125, May 1991. [16] W. J. Dally and H. Aoki, “Adaptive routing using virtual channels,” tech. rep., Massachusetts Institute of Technology, Laboratory for Computer Science, Sept. 1990. [17] L. M. Ni and P. K. McKinley, “A survery of routing techniques in wormhole networks,” Tech. Rep. MSU-CPS—ACS-46, Dept. of Computer Science, Michigan State University, East Lansing, Michigan, Oct. 1991. [18] H. Sullivan and T. R. Bashkow, “A large scale, homogeneous, fully distributed parallel machine,” in Proceedings of the 4th Annual International Symposium on Computer Architecture, pp. 105-124, Mar. 1977. [19] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE Transactions on Computers, vol. C-36, pp. 547- 553, May 1987. [20] W. J. Dally, “Virtual channel flow control,” in Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 60—68, May 1990. [21] C. R. Jesshope, P. R. Miller, and J. T. Yantchev, “High performance communi- cations in processor networks,” in Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 150—157, 1989. 116 [22] J. T. Yantchev and C. R. Jesshope, “Adaptive, low latency, deadlock-free packet routing for networks of processors,” in IEE Proceeding, Pt. E, vol. 136, pp. 178- 186, May 1989. [23] D. H. Linder and J. C. Harden, “An adaptive and fault tolerant wormhole routing strategy for lc-ary n-cubes,” IEEE Transactions on Computers, vol. 40, pp. 2—12, Jan. 1991. [24] W. J. Dally, “Fine-grain message passing concurrent computers,” in Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, vol. 1, (Pasadena, CA.), pp. 2—12, Association for Computing Machinery, Jan. 1988. [25] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 278- 287, May 1992. [26] C. J. Glass and L. M. Ni, “Adaptive routing in mesh-connected networks,” in Proceedings of the 12th International Conference on Distributed Computing Sys- tems, pp. 12—19, June 1992. l [27] C. J. Glass and L. M. Ni, “Maximally fully adaptive routing in 2D meshes,” in Proceedings of the 213t International Conference on Parallel Processing, Aug. 1992. [28] S. Konstantinidou, “Adaptive, minimal routing in hypercubes,” in Proceedings of the 6th MIT Conference: Advanced Research in VLSI, pp. 139-153, 1990. [29] H. D. Schwetman, “CSIM: A C-based, process-oriented simulation language,” tech. rep., 1985. [30] W. J. Dally, “Express cubes: Improving the performance of k-ary n-cube inter- connection networks,” IEEE Transactions on Computers, vol. 40, pp. 1016—1023, Sept. 1991.